The Limitations Of The 2015 StatCast Data

August 6, 2015

With two-thirds of the 2015 season in the books, countless indelible on-field moments have been imprinted upon us. An almost unprecedented influx of young talent has taken hold, a second straight trading deadline swap meet has taken place, and the races for playoff berths are heating up. Off the field, however, 2015 will be remembered, at least in part by those who read these pages, as the year of the arrival of StatCast. Granular data has been working its way into the mainstream for a few years now with the milestone arrivals of Hit Tracker and Pitch and Hit f(x), but StatCast takes it to another level, with public availability of batted-ball information, implementation of fielder-specific data on national telecasts, and from the clubs’ perspective, the introduction of detailed spin data on batted balls.

All of that said, there are some clear limitations to the usefulness of the publicly available Year I data. Before taking that information as gospel, it is vital that users place such information into the proper context, and be aware of some primary areas of concern.

With the full StatCast data set publicly unavailable, I figured that the best way to test it would be to take a representative sample of the data available on Baseball Savant. During the All Star break, I downloaded all of the granular batted-ball data available for each pitcher who qualified for the ERA title as of that date. This encompassed 98 starting pitchers, and 31,557 batted balls allowed.

To determine whether this would indeed be a representative sample of the whole, I compared the AVG and SLG of my sample to the MLB averages at the break; my sample’s .317 AVG and .494 SLG exactly matched the MLB AVG and SLG at that date, passing the test.

The first limitation of the StatCast sample quickly became apparent as I sorted the data by BIP type: popups, fly balls, line drives, ground balls, and the “null” group, representing batted balls for which no exit velocity was recorded. In 2014, only 4.9% of batted balls fit into the “null” group; as of the 2015 All Star break, a whopping 25.4% of all batted balls had no exit velocity recorded. That’s right; the average exit velocity information that has been disseminated through various print, television, radio and online media this season is based on a sample that is far from complete, especially compared to the recent historical norm.

In the table below, you will see the “null” percentage in each ballpark, in both 2014 and 2015:

NULL %	2015	2014
ATL	35.3%	2.5%
AZ	30.7%	2.2%
BAL	25.3%	7.0%
BOS	24.5%	2.2%
CIN	30.4%	3.2%
CLE	27.1%	2.1%
COL	29.4%	3.0%
CUB	22.2%	1.8%
CWS	22.4%	9.8%
DET	23.6%	5.4%
HOU	17.5%	2.1%
KC	35.2%	1.7%
LAA	21.2%	9.7%
LAD	22.8%	5.2%
MIA	26.7%	1.9%
MIL	26.7%	3.7%
MIN	21.8%	4.0%
NYM	20.9%	2.4%
NYY	21.6%	6.5%
OAK	22.8%	21.2%
PHL	23.4%	13.9%
PIT	24.4%	3.1%
SD	24.2%	2.9%
SEA	26.4%	5.6%
SF	37.6%	9.5%
STL	21.5%	3.4%
TB	23.3%	3.6%
TEX	30.3%	3.0%
TOR	21.1%	1.8%
WAS	21.8%	6.2%
ALL	25.4%	4.9%

In 2014, you’ll note that the “null” percentage was lower than the overall MLB average of 4.9% in 19 of the 30 ballparks. Technical issues in Oakland and Philadelphia, and to a lesser extent in a handful of other parks, were responsible for the percentage reaching even that modest level. In most parks, over 95% of batted balls registered the key data needed for detailed analysis.

It’s a much different story in 2015. Houston is the only ballpark with a “null” percentage under 20% as of the All Star break, and six parks yielded velocity readings on fewer than 70% of batted balls, with San Francisco getting the fewest readings, at a 37.6% “null” percentage. This tells us that there was at least a short-term systemic problem across all 30 parks. There is an awfully long way to go just to get to where we were in 2014.
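The park-level claims above can be checked directly against the table. The following is a minimal sketch that simply re-enters the published percentages; the names `NULL_PCT` and `parks_where` are illustrative and not part of any StatCast or Baseball Savant tooling.

```python
# Each entry is park: (2015 null %, 2014 null %), copied from the table above.
NULL_PCT = {
    "ATL": (35.3, 2.5), "AZ": (30.7, 2.2), "BAL": (25.3, 7.0),
    "BOS": (24.5, 2.2), "CIN": (30.4, 3.2), "CLE": (27.1, 2.1),
    "COL": (29.4, 3.0), "CUB": (22.2, 1.8), "CWS": (22.4, 9.8),
    "DET": (23.6, 5.4), "HOU": (17.5, 2.1), "KC": (35.2, 1.7),
    "LAA": (21.2, 9.7), "LAD": (22.8, 5.2), "MIA": (26.7, 1.9),
    "MIL": (26.7, 3.7), "MIN": (21.8, 4.0), "NYM": (20.9, 2.4),
    "NYY": (21.6, 6.5), "OAK": (22.8, 21.2), "PHL": (23.4, 13.9),
    "PIT": (24.4, 3.1), "SD": (24.2, 2.9), "SEA": (26.4, 5.6),
    "SF": (37.6, 9.5), "STL": (21.5, 3.4), "TB": (23.3, 3.6),
    "TEX": (30.3, 3.0), "TOR": (21.1, 1.8), "WAS": (21.8, 6.2),
}

null_2015 = {park: pct[0] for park, pct in NULL_PCT.items()}
null_2014 = {park: pct[1] for park, pct in NULL_PCT.items()}


def parks_where(table, predicate):
    """Return the sorted park codes whose null percentage satisfies predicate."""
    return sorted(park for park, pct in table.items() if predicate(pct))


# Parks with a 2015 null rate under 20%.
under_20 = parks_where(null_2015, lambda pct: pct < 20.0)
# Parks with readings on fewer than 70% of batted balls (null rate above 30%).
over_30 = parks_where(null_2015, lambda pct: pct > 30.0)
# Parks that beat the 4.9% MLB-wide null rate in 2014.
below_avg_2014 = parks_where(null_2014, lambda pct: pct < 4.9)

print(under_20)             # ['HOU']
print(over_30)              # ['ATL', 'AZ', 'CIN', 'KC', 'SF', 'TEX']
print(len(below_avg_2014))  # 19
```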

Now, some growing pains are to be expected with any new technology, and these percentages are sure to improve over time. We need to get a better understanding, however, of the types of batted balls that are being missed, as this affects the overall analytical value of the data set in the short term. Let’s take a look at the AVG and SLG on all BIP, the “null” group, and the remaining sample for which velocity readings were received:

	2015 AVG	2015 SLG	2014 AVG	2014 SLG
ALL	0.317	0.494	0.318	0.489
NULL	0.248	0.366	0.224	0.359
W/VELO	0.341	0.537	0.323	0.496

In both seasons, similarly mediocre production was generated by the “null” BIP group. In 2015, however, the dramatic expansion in the size of that group has a much larger impact on the production generated by the BIPs that did generate velocity readings. In 2014, the difference between .318 AVG-.489 SLG and .323 AVG-.496 SLG was minimal; this year, the difference between .317 AVG-.494 SLG and .341 AVG-.537 SLG is substantial. The data in the StatCast sample simply is not very representative of what has actually been going on on the field this season.
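The three rows of the table hang together arithmetically: weighting the “null” group and the with-velocity group by their shares of batted balls should roughly reproduce the overall line. Here is a minimal sketch of that check, assuming each group's share of at-bats is close to its share of batted balls (an approximation, since AVG and SLG are per at-bat). The function name `blended` is illustrative.

```python
def blended(null_share, null_rate, velo_rate):
    """Weighted average of a rate stat across the null and with-velocity groups."""
    return null_share * null_rate + (1.0 - null_share) * velo_rate


# Null shares (25.4% in 2015, 4.9% in 2014) and the group AVG/SLG figures
# are taken from the article; nothing here is independently measured.
avg_2015 = blended(0.254, 0.248, 0.341)
slg_2015 = blended(0.254, 0.366, 0.537)
avg_2014 = blended(0.049, 0.224, 0.323)
slg_2014 = blended(0.049, 0.359, 0.496)

print(f"2015: {avg_2015:.3f} AVG, {slg_2015:.3f} SLG")  # 2015: 0.317 AVG, 0.494 SLG
print(f"2014: {avg_2014:.3f} AVG, {slg_2014:.3f} SLG")  # 2014: 0.318 AVG, 0.489 SLG
```

The blended figures land on the published overall lines, which suggests the “null” share, and not any change in how the “null” group performs, is what pulls the with-velocity sample away from the whole in 2015.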

In 2014, most of the batted balls that were “missed”, excepting those in Oakland, Philadelphia and the other parks with park-specific issues, were popups and very weak ground balls. In 2015, that is again the case, but the population of the BIP being missed is much more extensive, and the velocity thresholds below which they are more likely to be missed have risen. Ground balls make up 53.4% of the 2015 “null” group, well above their 47.3% share of the entire population. Similarly, 15.2% of the batted balls in the “null” group were classified as popups, compared to 6.8% in the entire population.

The popups are a particularly big deal; 56.3% of popups generated no velocity reading through the All Star break, and therefore didn’t find their way into the granular data. This does no favors to the extreme popup generators of the world, the Jered Weavers and Marco Estradas, when doing detailed analysis upon this data set.

Ditto the weak grounder generators. Velocity readings have been generated on only 71.4% of grounders this season. How do we know it’s the weaker ones that are being missed? Well, hitters have batted .253 AVG-.277 SLG on the ones for which we have readings, and just .212 AVG-.225 SLG on the “null” group. That’s pretty strong evidence. So, the pitchers who have shown an ability to generate weak ground ball contact, the Dallas Keuchels, Garrett Richardses and Johnny Cuetos, don’t get enough credit for that talent.
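The popup and grounder percentages quoted in the last few paragraphs can be cross-checked against one another. A minimal sketch follows, using only figures from the article; small gaps are expected because every input is rounded to one decimal place. The function name `share_of_null_group` is illustrative.

```python
def share_of_null_group(missed_rate, population_share, overall_null_rate):
    """Share of the null group made up by one BIP type.

    missed_rate:       fraction of that BIP type with no velocity reading
    population_share:  that BIP type's share of all batted balls
    overall_null_rate: fraction of all batted balls with no reading
    """
    return missed_rate * population_share / overall_null_rate


OVERALL_NULL = 0.254  # 25.4% of all 2015 batted balls had no reading

# Popups: 56.3% missed, 6.8% of all batted balls.
popup_share = share_of_null_group(0.563, 0.068, OVERALL_NULL)

# Grounders: readings on 71.4%, so 28.6% missed; 47.3% of all batted balls.
grounder_share = share_of_null_group(1.0 - 0.714, 0.473, OVERALL_NULL)

print(f"popups:   {popup_share:.1%} of the null group (article: 15.2%)")
print(f"grounders: {grounder_share:.1%} of the null group (article: 53.4%)")
```

Both results come out within a fraction of a percentage point of the published shares, so the figures are internally consistent.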

Then there’s the difference between the velocities generated by the new and old equipment measuring the data. The StatCast equipment runs quite a bit “hotter” than the previous Hit f(x) system. Below, you’ll see the average batted-ball velocity in MPH in the three major BIP categories for both 2014 and 2015, as well as the overall averages:

	2015	2014
FLY	90.0	83.2
LD	92.5	87.9
GB	85.6	69.7
ALL	88.1	77.9

That’s a pretty massive difference. The largest change is in the ground ball category, and that is likely due in part to the scrubbing of bunts from the data. Still, the increases in average velocity are pretty staggering. Now this is not a problem, per se, as seasonal analysis is based on performance relative to one’s peers. It does change some previously developed notions as to what constitutes a well hit ball, the nature of the fly ball and ground ball dead zones, etc.
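To put numbers on how much “hotter” the new readings run, the year-over-year gaps can be computed from the table. This is a minimal sketch that only re-enters the published averages; the dictionary names are illustrative.

```python
# Average exit velocity in MPH by BIP type, copied from the table above.
VELO_2015 = {"FLY": 90.0, "LD": 92.5, "GB": 85.6, "ALL": 88.1}
VELO_2014 = {"FLY": 83.2, "LD": 87.9, "GB": 69.7, "ALL": 77.9}

# Year-over-year increase for each category, rounded to one decimal place.
increase = {
    bip_type: round(VELO_2015[bip_type] - VELO_2014[bip_type], 1)
    for bip_type in VELO_2015
}

for bip_type, delta in increase.items():
    print(f"{bip_type}: +{delta} MPH")
```

The ground ball gap of nearly 16 MPH is more than double the fly ball gap, which fits the point about bunts being scrubbed from the 2015 data.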

For instance, only 205 fly balls were hit at 105 MPH or harder in 2014, using the previous measuring equipment. Through the 2015 All Star break, 203 fly balls had already been hit that hard. In 2014, 424 liners were hit at 105 MPH or harder; through the 2015 break, 860 had already been hit that hard, and 147 had been hit at 110 MPH or harder. In 2014, 372 grounders were hit at 105 MPH or harder; through the 2015 break, 766 had already been hit that hard, and 142 had been hit at 110 MPH or harder.

We’ve discussed the fly ball “donut hole” in the past. In 2014, MLB hitters batted .077 AVG-.148 SLG on fly balls hit between 75 and 90 MPH. Well, the upper boundary of that donut hole has moved higher; through the 2015 break, MLB hitters batted just .045 AVG-.092 SLG on fly balls hit between 75 and 94 MPH.

There’s a similar grounder dead zone. In 2014, MLB hitters batted .116 AVG-.128 SLG on grounders hit at 70 MPH or lower. This constituted 23.2% of ground balls. In 2015, MLB hitters are batting .120 AVG-.128 SLG on grounders hit at 75 MPH or lower, constituting 23.3% of a grounder total that we have shown is understated, as many more weakly hit grounders are being missed.

There are other issues as well. When I first started working with BIP data sets, I compared exit angle data to box score classifications, and came up with the following exit angle categories, in degrees: 50+ is a popup, 20-50 is a fly ball, 5-20 is a line drive, and < 5 is a ground ball. The StatCast category labels are way out of whack with these assumptions; many, many well hit 20+ degree exit angle batted balls are being categorized as liners. Using my BIP exit angle breakdowns, only 176 line drive homers were hit in 2014; StatCast says that 411 line drive homers were hit in my limited sample of pre-All Star break activity alone. Clubs have access to the detailed exit angle data, but without it, the public faces this significant blurring of the fly ball/line drive categories, making it more difficult to do detailed research.

Don’t get me wrong: StatCast is a great thing, and we are only scratching the surface of what it can eventually become. The sample generated for this article yielded some benchmarks which will serve as the foundation for some analysis you will see here in the coming weeks. Still, when one is faced with a data set, one must put it into some sort of context, while acknowledging its limitations. In many of my previous articles here, I have warned readers to never take pure average velocity data at face value; launch angles, BIP type frequencies, pull percentages, etc., significantly affect hitter and pitcher performance, and can easily be overlooked. For this year, at least, we should additionally be aware of the StatCast data set’s unique shortcomings, which adjust the context within which analysis takes place.
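For readers who want to apply the exit angle categories described above to their own data, here is a minimal sketch. The cutoffs (50, 20 and 5 degrees) come from the article; the function name `classify_bip` is illustrative, and treating each boundary value as belonging to the higher category is an assumption, since the article does not say how ties are handled.

```python
def classify_bip(exit_angle_deg: float) -> str:
    """Classify a batted ball by vertical exit angle in degrees.

    Assumption: a ball hit at exactly 50, 20 or 5 degrees falls into
    the higher category (popup, fly ball, line drive respectively).
    """
    if exit_angle_deg >= 50:
        return "popup"
    if exit_angle_deg >= 20:
        return "fly ball"
    if exit_angle_deg >= 5:
        return "line drive"
    return "ground ball"


# A well hit ball at 25 degrees is a fly ball under these cutoffs,
# even if the public StatCast label calls it a line drive.
for angle in (62.0, 25.0, 12.0, -3.0):
    print(angle, classify_bip(angle))
```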