This is my first post at FanGraphs, and I would like to thank David Appelman for inviting me on board. I have previously written for Seamheads.com and StatSpeak.net, and I frequent “The Book” blog. If you’d like to know some more about my background, check out this article I wrote a few months ago.
Today I am going to start off by climbing up on my soapbox to address one of my pet peeves: the use of line drive (LD) rates as a predictor of Batting Average on Balls in Play (BABIP). The standard practice is to estimate BABIP as LD/Balls in Play + .12. It is claimed that LD rates are more stable than BABIP from year to year, and that when the observed BABIP differs from the predicted value by a large margin, this indicates a future regression to the mean.
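To make the rule of thumb concrete, here is a minimal sketch of the LD + .12 estimate. The function name and the sample counts are made up for illustration; only the formula itself comes from the practice described above.

```python
def babip_from_ld(line_drives: int, balls_in_play: int, offset: float = 0.12) -> float:
    """Rule-of-thumb BABIP estimate: line drive rate plus a constant."""
    if balls_in_play <= 0:
        raise ValueError("balls_in_play must be positive")
    return line_drives / balls_in_play + offset


# Hypothetical batter: 90 line drives on 450 balls in play (a 20% LD rate).
estimate = babip_from_ld(90, 450)
print(round(estimate, 3))  # 0.32
```

A batter with a 20% LD rate is "expected" to post a .320 BABIP under this rule, whatever else he does with his batted balls.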
I’m in the process of updating my park factors for 2008, along with adding in 1999, 1955 and 1953, which the folks at RetroSheet have included in their most recent release. I’ve added a couple more categories, foul flies and line drives. Now, I’ve never heard anyone mention park factors when using LD rates, but in fact they are quite large. My guess is that there could be different opinions of what counts as a line drive from one ballpark to another, or maybe it’s the air or the hitting background. I limited my LD factors to 2003-2008, when the RetroSheet data has complete information on whether a ball is a line drive, ground ball, fly ball or popup on every batted ball, including hits. In Arlington, a batter is 18% more likely to have a batted ball coded as a LD, which may have helped Milton Bradley to the 2nd highest LD rate in 2008, while in Minneapolis it’s 20% less likely. Four of the lowest six LD rates belong to Michael Bourn, Geoff Blum, Ty Wigginton and Hunter Pence, all Houston Astros, and Minute Maid Park has the second lowest LD park factor at 0.82. This is not saying that Houston batters hit fewer line drives. It’s that Houston and its opponents both have 18% fewer balls scored as liners in Houston than they do on the road.
PARK_ID PARK_NAME First Last PAw LDf
PHI12 Veterans Stadium 2003 2003 4768 1.23
ARL02 Ballpark Arlington 2003 2008 26850 1.18
TOK01 Tokyo Dome 2004 2008 283 1.13
CIN09 Great American 2003 2008 28827 1.11
DEN02 Coors Field 2003 2008 29158 1.10
STL10 Busch Stadium III 2006 2008 13967 1.09
KAN06 Kauffman Stadium 2003 2008 27530 1.09
WAS11 Nationals Park 2008 2008 4790 1.09
TOR02 Rogers Centre 2003 2008 27513 1.08
SFO03 Phone Co Park 2003 2008 29439 1.07
MON02 Stade Olympique 2003 2004 7684 1.07
STL09 Busch Stadium II 2003 2005 14280 1.06
STP01 Tropicana Field 2003 2008 27830 1.06
DET05 Comerica Park 2003 2008 28008 1.06
PHI13 Citizens Bank Park 2004 2008 24640 1.06
MIL06 Miller Park 2003 2008 29354 1.06
WAS10 RFK Stadium 2005 2007 14885 1.05
OAK01 Oakland Coliseum 2003 2008 26719 1.03
SEA03 Safeco Field 2003 2008 26683 1.01
CHI12 Comiskey Park II 2003 2008 28644 1.00
NYC16 Yankee Stadium 2003 2008 28722 1.00
MIA01 Dolphin Stadium 2003 2008 29849 1.00
CLE08 Jacobs Field 2003 2008 28136 0.99
BAL12 Camden Yards 2003 2008 29103 0.99
PIT08 P.N.C. Park 2003 2008 27652 0.98
PHO01 Bank One Ballpark 2003 2008 28810 0.98
SJU01 Hiram Bithorn 2003 2004 2598 0.98
SAN01 Jack Murphy 2003 2003 4943 0.98
LOS03 Dodger Stadium 2003 2008 29555 0.98
CHI11 Wrigley Field 2003 2008 28663 0.96
SAN02 PetCo Park 2004 2008 24432 0.95
NYC17 Shea Stadium 2003 2008 29299 0.92
BOS07 Fenway Park 2003 2008 28311 0.86
ATL02 Turner Field 2003 2008 29016 0.86
ANA01 Anaheim Stadium 2003 2008 26490 0.86
HOU03 Minute Maid Park 2003 2008 28271 0.82
MIN03 Metrodome 2003 2008 28048 0.80
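One way to read the LDf column is as a ratio of line drive rates: how often balls are scored as liners in games at the park, divided by how often they are for the same teams in games elsewhere. The sketch below assumes that simple ratio. My actual factors involve more bookkeeping than this, and the function name and counts here are invented for illustration.

```python
def ld_park_factor(home_ld: int, home_bip: int, road_ld: int, road_bip: int) -> float:
    """Ratio of the LD rate in a park to the LD rate of the same teams on the road.

    Simplified assumption: no weighting, regression, or multi-year smoothing.
    """
    if home_bip <= 0 or road_bip <= 0 or road_ld <= 0:
        raise ValueError("need positive ball-in-play counts and road line drives")
    home_rate = home_ld / home_bip
    road_rate = road_ld / road_bip
    return home_rate / road_rate


# Made-up counts for a team and its opponents, home games vs. road games.
factor = ld_park_factor(home_ld=1640, home_bip=10000, road_ld=2000, road_bip=10000)
print(round(factor, 2))  # 0.82
```

A factor of 0.82 means 18% fewer balls are scored as liners in that park than the same teams produce elsewhere.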
Point Two – are line drives really more predictive? It’s said that if a player’s BABIP is not close to his LD+.12, it’s because of luck, and this should be expected to correct itself next season. Expect the overachiever to come back to Earth.
For all the batters from 2003-2008, in non-bunt plate appearances, I added up the base hits, line drives, ground balls, fly balls and popups. I compared the predicted BABIP (LD+.12) to the observed one in each season, which showed a root mean square (RMS) error of .045. Then I compared each year’s predicted value to the next year’s observed BABIP, and the RMS was .048, slightly larger. For pitchers, the RMS was .039 in the same season and .039 in the next. If a gap between LD+.12 and BABIP really signaled regression, the prediction should fit the following season better than the current one. I don’t see the evidence of future regression.
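For anyone who wants to repeat the comparison, the RMS error is just the square root of the average squared gap between predicted and observed BABIP. This is a minimal sketch that treats every player-season equally (an assumption, since weighting by balls in play is also reasonable). The numbers are invented.

```python
import math


def rms_error(predicted: list[float], observed: list[float]) -> float:
    """Root mean square difference between paired predictions and observations."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical player-seasons: LD + .12 predictions vs. observed BABIP.
predicted = [0.320, 0.300, 0.290]
observed = [0.310, 0.330, 0.280]
rms = rms_error(predicted, observed)
print(round(rms, 3))  # 0.019
```

For the next-season test, pair each prediction with the same player's observed BABIP from the following year instead of the same year.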
Complete line drive data is only available since 2003, and for a few seasons in the 1990s. In the seasons when it was not available, a “true talent level” of BABIP can be estimated by using a rolling weighted mean of past data, commonly referred to as Marcel. I used a seasonal weight of 0.7: the most recent season is weighted at 1.00, the one before that at 0.70, two seasons back at 0.49, and so on, each previous year weighted 0.7 times the year after it. In this test, I did not use any regression to the league mean. The RMS of LD+.12 compared to the Marcel for the same season was .048 for batters, .046 for pitchers. The RMS of the Marcel compared to the observed BABIP in the NEXT season was .041 for batters, .039 for pitchers. Historical BABIP data is a better predictor than the current season’s LD rate.
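Here is a rough sketch of that weighting scheme. It assumes the weights are applied to both hits and balls in play, so a season with more balls in play counts for more. That detail, the function name and the sample history are my illustration rather than a spec.

```python
def marcel_babip(seasons: list[tuple[float, float]], decay: float = 0.7) -> float:
    """Weighted BABIP from past seasons, most recent season first.

    Each season is (hits_on_balls_in_play, balls_in_play). The most recent
    season gets weight 1.0, the one before decay, then decay**2, and so on.
    No regression to the league mean is applied.
    """
    weighted_hits = 0.0
    weighted_bip = 0.0
    for years_back, (hits, bip) in enumerate(seasons):
        weight = decay ** years_back
        weighted_hits += weight * hits
        weighted_bip += weight * bip
    if weighted_bip == 0:
        raise ValueError("no balls in play in the supplied seasons")
    return weighted_hits / weighted_bip


# Hypothetical batter: .300, .300 and .280 BABIP seasons, most recent first.
history = [(150, 500), (120, 400), (140, 500)]
talent = marcel_babip(history)
print(round(talent, 3))  # 0.295
```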
If LD data is available, so are GB, FB & PU. I tried a more complex model using .15*FB+.24*GB+.73*LD to estimate BABIP. This worked much better at reducing the RMS error, even surpassing historical BABIP. For batters, the yearly RMS came down from .048 to .036, for pitchers from .041 to .031.
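As a sketch, the three-component estimate looks like this. It assumes FB, GB and LD are taken as shares of all balls in play, with popups counted in the denominator and contributing no expected hits. The denominator choice is an assumption, and the counts are invented.

```python
def babip_from_batted_balls(fb: int, gb: int, ld: int, pu: int) -> float:
    """Estimate BABIP as .15*FB + .24*GB + .73*LD, using rates per ball in play."""
    bip = fb + gb + ld + pu
    if bip <= 0:
        raise ValueError("need at least one ball in play")
    expected_hits = 0.15 * fb + 0.24 * gb + 0.73 * ld  # popups add nothing
    return expected_hits / bip


# Hypothetical batter: 150 fly balls, 200 grounders, 90 liners, 60 popups.
estimate_fgl = babip_from_batted_balls(fb=150, gb=200, ld=90, pu=60)
print(round(estimate_fgl, 4))  # 0.2724
```

Unlike LD+.12, this gives credit for the mix of ground balls and fly balls, which is where most balls in play come from.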
Still, you can’t assume that every batter has the same rate of hits on their ground balls. Some batters hit more balls to the left side than the right, some run fast and some run slow. Instead of trying to profile each batter on each type of batted ball, I will continue to use Marcel to weight each batter’s historical BABIP in my projections.
On the other hand, DIPS theory states that a pitcher has little control over the outcome once a ball has been put into play. There is clearly an ability to be a flyball or groundball pitcher. Line drives are considered mistakes, and that may be evidenced by looking at the six-year totals, which show the lowest LD rates belonging to Mariano Rivera, Fausto Carmona and Derek Lowe, while the highest belong to guys like John Van Benschoten, Edwin Jackson and Tony Armas Jr. Using the FB-GB-LD estimator on the six-year totals drops the pitchers’ RMS all the way down to .016.
Even so, some pitchers consistently defy the estimates. Roger Clemens, Brian Bannister, Chien-Ming Wang, Carlos Zambrano, Dan Haren, Brandon Webb, Chris Young and Greg Maddux all do at least .020 better than estimated. On the other end, Zach Duke, Sidney Ponson and Glendon Rusch all underperform by at least .020. Is it the ballpark? Is it their defense? The batters they faced? Or is it their own skill or lack of it?
Here’s my plan (I won’t have the answers next week). I want to compile park factors for each type of batted ball in each ballpark: what is the normalized rate of hits for flyballs to left in Dodger Stadium? Then do a WOWY (with or without you) analysis of fielders, showing the rate at which each fielder allows more or fewer hits than expected on each groundball, flyball, line drive and popup. Finally, do the same for each batter’s rates. Then go back and look at how many times each pitcher faced each batter, with which fielders, and in which ballparks. Once those are controlled for, see how many hits, plus or minus, are left over for each pitcher.