Year-to-Year Predictability of Pitcher Ball-in-Play Data

The introduction of batted-ball data, first to the clubs and then to the public, certainly has caused a revolution in player evaluation. While the entirety of HITf/x and now Statcast data isn’t likely to be available to the masses anytime soon, the portions that are, including fairly complete PITCHf/x data, have changed the way fans, analysts, club personnel, and, yes, even players look at the game.

As I have often written on these pages, this data needs to be placed into context to be fully understood. There are ongoing issues with data capture, and the simple fact that not all hard or softly hit baseballs are created equal adds levels of nuance that must be understood before meaningful conclusions can be drawn. Another concern expressed by many is the uncertain predictive value of the batted-ball data, particularly with regard to pitchers. Today, let’s take a look at how this data correlates from year to year, from the pitcher’s perspective.

It’s been fairly well established over the years that a “ground ball” or “pop up” pitcher is a real thing: ball-in-play (BIP) type frequencies correlate quite well from year to year. The same applies to strikeout (K) and walk (BB) rates, both from the hitter and pitcher’s perspective. How about batted-ball authority? To examine this issue, I identified the 45 starting pitchers who qualified for the ERA title in either league in both 2014 and 2015. (Players who were traded mid-year and qualified overall but not in either league specifically were omitted.)

For each of these pitchers, the following statistics were scaled to 100 for both the 2014 and 2015 seasons, with correlation coefficients calculated thereafter.

First, the rate stats:

  • Strikeout rate
  • Walk rate
  • Pop-up rate
  • Fly-ball rate
  • Line-drive rate
  • Ground-ball rate

Then metrics concerning projected production allowed (based on BIP authority):

  • Fly ball/line drive combined
  • Ground ball authority
  • All ball-in-play authority

And, finally, runs-allowed measures/estimators:

  • Earned run average
  • Fielding independent pitching
  • “Tru” ERA (based on BIP frequency and authority)

One note on the middle line item above. Fly balls and line drives were combined when calculating and scaling the Projected Production Allowed data. This was done due to the different manner in which the data was captured in 2014 (Sportvision) and 2015 (Statcast). In 2014, balls in play were classified as fly balls or line drives based on available vertical exit-angle data. In 2015, such data was not available, so Statcast’s subjective classifications of BIP as either fly balls or liners had to be accepted.

The use of relative BIP authority data, scaled to 100, was a necessity, as the average velocities under the Sportvision and Statcast systems are drastically different. Statcast didn’t get readings on over 23% of batted balls in 2015, and most of them were weakly hit; Sportvision’s “null” group was much smaller, close to 5%. Combined with the relatively “hot” measurements of the new system, this means the average velocity of a Sportvision batted ball is in the upper 70s (mph), while Statcast’s is around 90. There is no perfect method to compare data between the two systems, but the best course is to project individual pitcher performance based on average MLB production given each pitcher’s BIP frequency and authority mix, and then scale it to 100.
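The mechanics above can be sketched in a few lines. This is a minimal illustration of the approach, not the article’s actual data or code: the stat values are made-up placeholders, and the scaling function simply indexes each pitcher to the group average (100 = average) before measuring year-to-year correlation.

```python
import numpy as np

def scale_to_100(values):
    """Index each pitcher's value to the group average (100 = average)."""
    values = np.asarray(values, dtype=float)
    return 100 * values / values.mean()

# Placeholder K% figures for a handful of pitchers in consecutive seasons.
k_2014 = [18.5, 22.0, 27.5, 16.0, 24.5, 20.0]
k_2015 = [19.0, 21.0, 28.5, 17.5, 23.0, 19.5]

scaled_2014 = scale_to_100(k_2014)
scaled_2015 = scale_to_100(k_2015)

# Pearson correlation coefficient between the two seasons' scaled values.
r = np.corrcoef(scaled_2014, scaled_2015)[0, 1]
print(round(r, 2))
```

Note that indexing to 100 is a linear transformation, so it doesn’t change the correlation itself; its purpose here is making figures comparable across the two measurement systems.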

Here, then, are the correlation coefficients for the aforementioned statistical categories for the 45 2014-15 ERA-qualifying starting pitchers.

Correlation Coefficients, 2014-15 ERA Qualifiers

Metric         Coefficient
K%                0.81
BB%               0.66
Pop%              0.53
Fly%              0.76
LD%               0.14
GB%               0.86
FLY/LD AUTH       0.37
GB AUTH           0.25
ERA               0.45
FIP               0.65
“TRU” ERA         0.72

Without getting too deep: a correlation coefficient of 1.00 is obtained when the two sets of data move in perfect lockstep. The closer to 1.00, the stronger the relationship between the two data sets. Among the above statistics, these 45 pitchers’ grounder rates correlated the most closely from 2014 to 2015, with a 0.86 correlation coefficient. You’ll notice that the frequency statistics, with the exception of line-drive rate (0.14), all correlate at 0.50 or higher.

Now look at the statistics that measure BIP authority: fly-ball/line-drive authority and overall BIP authority (which relies in part on BIP frequency mix) correlate at 0.37, while ground-ball authority correlates at 0.25. While these marks are clearly much lower than the frequency figures, 0.37 in particular still represents at least a modest degree of correlation.

Lastly, look at the three measures of run prevention. ERA, the old warhorse of these statistics, correlates the least from year to year (0.45 coefficient). The new run prevention statistic of choice, FIP, correlates quite a bit more from year to year, at 0.65, while my pet stat, “tru” ERA, which incorporates relative BIP authority into the mix, correlates even better at 0.72.

Bottom line: the lower the correlation coefficient, the more random the statistic, and the less “true talent” can be credited or blamed for player performance. Line-drive rate allowed is quite random, and largely based on luck. BIP authority management has some randomness to it, but there is a degree of true talent involved. BIP frequency management, line drives excepted, is heavily tied to true talent.

Having established these correlations, tomorrow I’ll examine a collection of free agent pitchers and what their own ball-in-play data reveal about their respective futures.

6 years ago

Since FB% is lower than GB%, does that mean that, instead of making three buckets, we’d improve the predictability with just two buckets–essentially just GB’s and not-GB’s/balls-in-air? Wouldn’t the rate for the latter just be 1 − GB%, and thus have an identical correlation coefficient (which, again, is higher than the FB correlation)?

Put another way, isn’t separating LD’s and FB’s actually just introducing noise? Of course, the noise is introduced for the sake of a distinction we’re interested in, but it still may be meaningful to look at it both ways (bucket LDs and FBs both together and separately).

6 years ago
Reply to  MH

I mean, it’s good to know that LD% doesn’t correlate well, right? So putting that into a bucket and saying that the models shouldn’t expect those to be consistent seems like useful information. I don’t know if that introduces any noise in a predictive model.

Completely reducing noise is what happens when you look simply at TTO and just disregard any results in the field. But of course, it’s good to have more information because there are other things going on when it comes to pitchers. FIP is nice and all, but not necessarily grasping the whole picture.

6 years ago
Reply to  Bronnt

That’s not my point (nor am I sure what you’re saying is necessarily a great one, if I’m understanding correctly). Reducing noise is always what predictive models want to do. The problem with TTOs isn’t that they somehow over-correct for noise, it’s that the noise that remains in TTO-only models probably can be accounted for in stronger models (which is what this is trying to get at). The noise in TTO-only models isn’t strictly random, it’s biased, and thus we should be looking for better models.

My point is that if you group LD’s and FB’s together, into a single bucket, that bucket would be more predictive than either of them separately. Although I will admit that I didn’t realize how small this sample was–I had assumed there was more data. With 45 pitchers across two years, a 10% difference between GB’s and FB’s isn’t enormous (though it may still be significant, it’s hard to know without doing the actual tests).

If we do assume the 10% drop holds, one implication of this is that LD’s and FB’s are more similar than either are to GB’s. If you just have the two buckets, LD+FB% vs GB%, they would produce the same correlation coefficient (in this case, 0.86). Past LD+FB% would actually do better at predicting future LD+FB% than past FB% at predicting future FB% (LD% alone does almost nothing to predict future LD%, so on its own, it tells us virtually nothing).

And again, that doesn’t mean we should ignore FB% and LD% separately–looking at them both separately and together are not mutually exclusive propositions. But given how low the predictivity of LD% is, a model that estimates future rates by looking at a combined LD%+FB% bucket, then separates them back out on the projection side based on a simple weight for LD% might do better than trying to predict them both separately.

That’s all assuming the gap between GB% and FB% doesn’t narrow substantially with more data though, which again, it very well might.

Jeff Zimmerman
6 years ago
Reply to  MH

As I will state in an upcoming THT article, I just look at GB%. If it is low enough, you are a flyball pitcher.

GB’s quick stabilizing point is why it is the only input to SIERA.

6 years ago
Reply to  MH

Awesome, thanks. Good to know and absolutely makes sense.

Tony Blengino to Dave Stewart to Jack Zduriencik and Back...
6 years ago
Reply to  Bronnt

I absolutely do not face these figures should be placed circumstances, what was written in the often not understood. Dingers. Data, such as acquisition and baseball bats of equivalent progressively hard or soft hit they all means not about can before the Creator. Taters. Soft simple fact with the problems still there, he continues that understand understand must. RBI’s. Especially related to pot. Ball – have expressed other concerns about the uncertainty of estimates data batted. Pot, from today’s standpoint, we look at how to link these numbers every year. Steaks and Rib-Eyes.

6 years ago
Reply to  MH

Since LDs have a huge BABIP, and FBs have a very low one, then no, you should not lump these together.

6 years ago
Reply to  joe

This is a different (unrelated) point. I’m not saying we should take 2015 FB+LD% and use that single term to predict, say, 2015 ERA. I’m saying you use 2015 FB+LD% to predict a total figure for 2016 FB+LD%, then you separate that projected FB+LD% term into the individual LD% and FB% terms based on a simple weight for LD%, since it’s essentially random. This theoretically should predict 2016 FB% and LD% rates better than using 2015 FB% and LD% separately (Or as Jeff Zimmerman said above, you can really just look at GB alone, it’s the same thing).

You could also do the same thing within-year to create a component of a DIPS-style stat (SIERA does just this, according to Jeff). Since LD% is essentially random, and GB% is more stable than FB%, you’d essentially be using an xLD% figure and xFB% figure that’s just LD+FB% times an LD-weight value (for example, if LD’s are typically 35% of total LD+FB%, you multiply actual LD+FB% by .35 for xLD%, then subtract xLD% from LD+FB% for xFB%).
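The commenter’s split could be sketched like this. The function and variable names are my own, and the 0.35 LD weight is just the illustrative figure from the comment above, not an established league value.

```python
# Sketch of the commenter's xLD%/xFB% idea: project the combined FB+LD
# bucket first, then split it with a fixed LD weight, since LD% itself
# is nearly random year to year.
LD_WEIGHT = 0.35  # assumed share of line drives within the FB+LD bucket

def split_air_balls(fb_ld_pct, ld_weight=LD_WEIGHT):
    """Split a projected combined FB+LD% into expected xLD% and xFB%."""
    x_ld = fb_ld_pct * ld_weight        # xLD% = combined rate times LD weight
    x_fb = fb_ld_pct - x_ld             # xFB% = whatever air-ball share remains
    return x_ld, x_fb

x_ld, x_fb = split_air_balls(50.0)      # e.g. projected 50% of BIP in the air
print(x_ld, x_fb)                       # 17.5 32.5
```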