New SIERA, Part Four (of Five): Testing
SIERA’s updated version was unveiled Monday at FanGraphs, and as part of the release, I’ve been taking readers through its ERA-estimation process. I’ve written about SIERA’s ability to predict BABIP and HR/FB, and then I broke down its formula and structure. But like any good analysis, it needed to be tested against other estimators.
So far, the results are pretty conclusive. In fact, SIERA might be the best tool yet to help us understand and better interpret pitching performance.
I did a round of SIERA testing at Baseball Prospectus in January before making my newest adjustments here. I showed in that article that SIERA and xFIP were by far the best readily available estimators, and that SIERA did slightly better at predicting ERA than xFIP did. This involved a number of testing procedures.
First, I used Root Mean Square Error and saw that from 2003 to 2010, SIERA was closer to next season’s ERA by an average of .021. FIP was .063 behind — though it was only .033 behind when predicting non-park-adjusted ERAs for pitchers on the same team. That figure suggests park effects are biasing and exaggerating the difference between FIP and xFIP.
I also found that the difference between xFIP and SIERA was persistent, with SIERA closer to next year’s ERA in six of the seven testable years. SIERA also correlated slightly better with next season’s ERA (.398 vs. .352) and was closer to next season’s ERA 54.6% of the time. Similarly, SIERA was closest to the previous season’s ERA, to the ERA two seasons later, to the ERA two seasons prior, to the ERA three seasons later, and to the ERA three seasons prior. Weighting didn’t affect the ranking of ERA estimators.
This time, I incorporated two statistics that Tom Tango suggested during the original SIERA discussion: kwERA (defined as kwERA = constant + 11*(UBB+HBP-SO)/PA) and bbFIP (defined as bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA).
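The two definitions translate directly into code. The following is a minimal sketch: the `PitcherLine` container and its field names are illustrative, the stat line is made up, and the constant is left as a parameter because its value is not given here (the 5.0 below is only a placeholder).

```python
from dataclasses import dataclass


@dataclass
class PitcherLine:
    """Counting stats for one pitcher-season (field names are illustrative)."""
    pa: int    # plate appearances (batters faced)
    so: int    # strikeouts
    ubb: int   # unintentional walks
    hbp: int   # hit batters
    ld: int    # line drives allowed
    iffb: int  # infield fly balls
    offb: int  # outfield fly balls
    gb: int    # ground balls


def kw_era(p: PitcherLine, constant: float) -> float:
    """kwERA = constant + 11*(UBB+HBP-SO)/PA."""
    return constant + 11 * (p.ubb + p.hbp - p.so) / p.pa


def bb_fip(p: PitcherLine, constant: float) -> float:
    """bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA."""
    return (constant
            + 11 * (p.ubb + p.hbp + p.ld - p.so - p.iffb) / p.pa
            + 3 * (p.offb - p.gb) / p.pa)


# Made-up stat line and placeholder constant, purely for illustration.
example = PitcherLine(pa=800, so=180, ubb=50, hbp=6, ld=110,
                      iffb=20, offb=170, gb=250)
print(round(kw_era(example, constant=5.0), 3))
print(round(bb_fip(example, constant=5.0), 3))
```

Note that bbFIP adds line drives and subtracts infield flies inside the same 11-weighted term, so it uses batted-ball information that kwERA ignores entirely.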
I weighted each estimator’s squared error by the number of innings pitched and excluded all pitchers with fewer than 40 innings pitched, either during the season in question or during the following season. This table shows the estimators’ ranking:
ERA Estimator (N = 2096) | RMSE |
SIERA (new) | 1.075 |
SIERA (old) | 1.079 |
kwERA | 1.081 |
bbFIP | 1.083 |
xFIP | 1.089 |
FIP | 1.141 |
FRA (scaled to ERA) | 1.171 |
tRA (scaled to ERA) | 1.215 |
ERA (park-adj) | 1.308 |
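For readers who want to reproduce this kind of table, here is a rough sketch of an innings-weighted RMSE with the 40-inning cutoff. It assumes the weight is the innings pitched in the estimator's season (the text does not say which season's innings were used), and the function name, row layout, and sample numbers are all made up for illustration.

```python
import math


def weighted_rmse(rows, min_ip=40.0):
    """IP-weighted RMSE of an estimator against next season's ERA.

    rows: iterable of (estimate, next_era, ip_this_year, ip_next_year).
    Pitchers below min_ip in either season are dropped.
    Assumption: each squared error is weighted by ip_this_year.
    """
    weighted_sq_error = 0.0
    total_weight = 0.0
    for estimate, next_era, ip_now, ip_next in rows:
        if ip_now < min_ip or ip_next < min_ip:
            continue  # fails the 40-inning filter in one of the two seasons
        weighted_sq_error += ip_now * (estimate - next_era) ** 2
        total_weight += ip_now
    if total_weight == 0:
        raise ValueError("no pitchers meet the innings threshold")
    return math.sqrt(weighted_sq_error / total_weight)


# Made-up rows: the third pitcher is dropped (30 IP in the first season).
rows = [
    (3.50, 4.00, 200.0, 180.0),
    (4.20, 3.80, 100.0, 120.0),
    (2.90, 5.00, 30.0, 60.0),
]
result = weighted_rmse(rows)
print(round(result, 3))
```

Running the same function once per estimator on the same pitcher-seasons would give a column comparable in spirit to the RMSE column above.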
With a standard deviation of the squared-error terms of about 2, and a literal .999 correlation between xFIP’s squared errors and SIERA’s squared errors, the standard error of the difference would be about .002. That means the differences are statistically significant.
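The .002 figure can be reproduced from the quoted numbers with the usual variance formula for a paired difference. This is a sketch under stated assumptions: both squared-error series are taken to have a standard deviation of exactly 2, the pairs are treated as unweighted, and N is the 2,096 from the table. The function name is illustrative.

```python
import math


def paired_gap_standard_error(sd_a, sd_b, corr, n):
    """Standard error of the mean difference between two paired series.

    Var(A - B) = sd_a^2 + sd_b^2 - 2 * corr * sd_a * sd_b,
    and the standard error of the mean difference is sqrt(Var / n).
    """
    var_diff = sd_a ** 2 + sd_b ** 2 - 2.0 * corr * sd_a * sd_b
    return math.sqrt(var_diff / n)


se = paired_gap_standard_error(sd_a=2.0, sd_b=2.0, corr=0.999, n=2096)
print(round(se, 3))  # 0.002

# Gap in mean squared error implied by the table's RMSEs
# for xFIP (1.089) and new SIERA (1.075).
mse_gap = 1.089 ** 2 - 1.075 ** 2
print(round(mse_gap / se, 1))
```

The near-perfect correlation is what makes the standard error so small: almost all of the noise in one estimator's squared errors is shared by the other, so it cancels in the difference. Under these assumptions the gap works out to well over ten standard errors.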
I also adjusted the weighting using the geometric average of IP in both the season in question and in the following season (in which the ERA was estimated). The ranking was the same:
ERA Estimator (N = 2096) | RMSE |
SIERA (new) | 1.040 |
SIERA (old) | 1.045 |
kwERA | 1.053 |
xFIP | 1.057 |
bbFIP | 1.057 |
FIP | 1.118 |
FRA (scaled to ERA) | 1.150 |
tRA (scaled to ERA) | 1.202 |
ERA (park-adj) | 1.289 |
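The alternative weighting can be sketched the same way, with the weight replaced by the geometric mean of the two seasons' innings. This assumes the 40-inning filter still applies (N is unchanged at 2,096); the names and the sample rows are made up.

```python
import math


def geometric_ip_weight(ip_this_year, ip_next_year):
    """Geometric mean of innings pitched in the two seasons."""
    return math.sqrt(ip_this_year * ip_next_year)


def geo_weighted_rmse(rows, min_ip=40.0):
    """RMSE with each squared error weighted by the geometric mean of IP.

    rows: iterable of (estimate, next_era, ip_this_year, ip_next_year).
    """
    weighted_sq_error = 0.0
    total_weight = 0.0
    for estimate, next_era, ip_now, ip_next in rows:
        if ip_now < min_ip or ip_next < min_ip:
            continue  # same 40-inning filter (assumed to carry over)
        w = geometric_ip_weight(ip_now, ip_next)
        weighted_sq_error += w * (estimate - next_era) ** 2
        total_weight += w
    if total_weight == 0:
        raise ValueError("no pitchers meet the innings threshold")
    return math.sqrt(weighted_sq_error / total_weight)


# Made-up rows: both pitchers end up with a weight of 100.
geo_rows = [
    (3.50, 4.00, 200.0, 50.0),
    (4.20, 3.80, 100.0, 100.0),
]
geo_result = geo_weighted_rmse(geo_rows)
print(round(geo_result, 3))
```

A geometric mean pulls the weight down sharply when either season is short, so a pitcher with 200 innings followed by 50 counts the same as one with 100 in each.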
Note that SIERA is not overfit to ERA. All of the RMSE figures against the following season come from a SIERA whose coefficients were fit by regressing against same-season ERA. Regressing against next season’s ERA instead would be cheating, of course, but it would produce a lower RMSE (0.920, in fact). That version of SIERA, fitted to next season’s ERA, is called SIERA*:
Variable | SIERA coefficient | SIERA* coefficient |
(SO/PA) | -15.518 | -15.219 |
(SO/PA)^2 | 9.146 | 12.746 |
(BB/PA) | 8.648 | -0.385 |
(BB/PA)^2 | 27.252 | 10.671 |
(netGB/PA) | -2.298 | -2.844 |
(netGB/PA)^2 | -4.920 | -2.232 |
(SO/PA)*(BB/PA) | -4.036 | 15.421 |
(SO/PA)*(netGB/PA) | 5.155 | 5.226 |
(BB/PA)*(netGB/PA) | 4.546 | 10.150 |
Constant | 5.534 | 5.952 |
Year coefficients (versus 2010 for SIERA, versus 2009 for SIERA*) | From -0.020 to +0.289 | From 0.000 to 0.426 |
% innings as SP | 0.367 | 0.246 |
Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
SIERA Coefficient | -.020 | +.093 | +.154 | +.037 | +.289 | +.226 | +.116 | +.103 | .000 |
SIERA* Coefficient | +.140 | +.181 | +.106 | +.426 | +.204 | +.133 | +.116 | .000 | n/a |
Take SIERA* with a grain of salt, though, since the metric is calibrated to model in-sample data. And because it’s directly minimizing the squared error in estimating next season’s statistic by design, this is an ERA predictor (a projection metric), rather than an ERA estimator (which recreates ERA net of luck and defense). If it’s a projection metric, it should be compared to Oliver, ZiPS, PECOTA, Marcel and the rest. If the goal is to take non-peripheral luck and defense out of the equation, then SIERA is more useful than SIERA*.
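To see how the two coefficient sets behave on the same pitcher, the tables above can be plugged into a simple sum of coefficient times term. Treat this as a sketch rather than the official formula: it assumes the terms combine as a plain linear sum, that "% innings as SP" enters as a fraction between 0 and 1, and that the squared ground-ball term needs no special sign handling. The formula article in this series is the authority on those details. The input rates are made up.

```python
# Coefficients copied from the SIERA and SIERA* columns of the table.
SIERA_COEF = {
    "so": -15.518, "so2": 9.146, "bb": 8.648, "bb2": 27.252,
    "gb": -2.298, "gb2": -4.920, "so_bb": -4.036, "so_gb": 5.155,
    "bb_gb": 4.546, "const": 5.534, "sp_share": 0.367,
}
SIERA_STAR_COEF = {
    "so": -15.219, "so2": 12.746, "bb": -0.385, "bb2": 10.671,
    "gb": -2.844, "gb2": -2.232, "so_bb": 15.421, "so_gb": 5.226,
    "bb_gb": 10.150, "const": 5.952, "sp_share": 0.246,
}


def linear_estimate(coef, so_pa, bb_pa, netgb_pa, sp_share, year_coef=0.0):
    """Sum each coefficient times its term (assumed functional form)."""
    return (coef["const"] + year_coef
            + coef["sp_share"] * sp_share
            + coef["so"] * so_pa + coef["so2"] * so_pa ** 2
            + coef["bb"] * bb_pa + coef["bb2"] * bb_pa ** 2
            + coef["gb"] * netgb_pa + coef["gb2"] * netgb_pa ** 2
            + coef["so_bb"] * so_pa * bb_pa
            + coef["so_gb"] * so_pa * netgb_pa
            + coef["bb_gb"] * bb_pa * netgb_pa)


# Made-up full-time starter in each version's baseline year (year_coef = 0).
rates = dict(so_pa=0.20, bb_pa=0.08, netgb_pa=0.05, sp_share=1.0)
siera = linear_estimate(SIERA_COEF, **rates)
siera_star = linear_estimate(SIERA_STAR_COEF, **rates)
print(round(siera, 2), round(siera_star, 2))
```

The year coefficient is passed separately so that any season's value from the year table can be substituted for the baseline 0.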
I also tested the ERA estimators against same-year ERA. Naturally, the statistics that directly factored in a pitcher’s home run rate did the best (except for tRA) — and bbFIP did better than new SIERA. That’s probably because line drives raise a pitcher’s bbFIP, while scoring bias might arbitrarily inflate that pitcher’s line-drive total when he suffers bad luck. (Note that line-drive rate is not persistent year to year when measured per batted ball. The way to give up fewer line drives is to strike out more batters. In other words, a pitcher needs to avoid batted balls in the first place.) Although kwERA did pretty well with next-season ERA testing, it did poorly with same-year estimation.
ERA Estimator (N = 3328) | RMSE |
FIP | .740 |
FRA (scaled to ERA) | .779 |
bbFIP | .852 |
SIERA (new) | .870 |
xFIP | .875 |
tRA (scaled to ERA) | .886 |
SIERA (old) | .890 |
kwERA | .913 |
Since the 2002 SIERA cannot be calculated, because Retrosheet batted-ball data are unavailable for that year, it seemed that old SIERA might be unfairly penalized. So I removed the 2002 estimators from the sample and recalculated, using the 2003 through 2009 estimators to predict 2004 through 2010 park-adjusted ERA.
ERA Estimator (N = 1838) | RMSE |
SIERA (new) | 1.075 |
SIERA (old) | 1.083 |
kwERA | 1.086 |
bbFIP | 1.088 |
xFIP | 1.092 |
FIP | 1.149 |
FRA (scaled to ERA) | 1.181 |
tRA (scaled to ERA) | 1.217 |
ERA (park-adj) | 1.313 |
The ranking remained the same, anyway.
As I did last time, I checked whether FIP did better with pitchers who stayed on the same team. To determine that, I tested FIP against un-park-adjusted ERA and compared it with how the other statistics predicted park-adjusted ERA for pitchers who stayed on the same teams. The results show an improvement for FIP, but not enough to out-predict xFIP.
ERA Estimator (N = 1353) | RMSE |
SIERA (new) | 1.066 |
SIERA (old) | 1.068 |
xFIP | 1.071 |
kwERA | 1.073 |
bbFIP | 1.078 |
FIP (un-adjusted ERA) | 1.114 |
FRA (scaled to ERA) | 1.157 |
tRA (scaled to ERA) | 1.205 |
ERA (park-adj) | 1.273 |
While Root Mean Square Errors are useful when estimating the average difference between an ERA estimator and an ERA, sometimes it’s easier to simply compare statistics head-to-head.
In the next table, I checked how often each statistic came closer than each of the others to the next season’s ERA. The new version of SIERA won its head-to-head matchups against every other estimator. Both bbFIP and kwERA beat xFIP, and kwERA even beat the old SIERA; xFIP was close behind and beat all the remaining estimators. All the differences are statistically significant, unless they’re italicized.
% of times row closer (N = 2096, so St. Dev. = 1.1%) | SIERA (new) | SIERA (old) | xFIP | FIP | bbFIP | kwERA | FRA (adj) | tRA (adj) | ERA (adj) |
SIERA (new) | — | 50.4% | 52.7% | 55.0% | 52.9% | 53.0% | 56.2% | 56.4% | 60.6% |
SIERA (old) | | — | 53.4% | 54.6% | 52.1% | 48.1% | 55.0% | 58.0% | 59.9% |
xFIP | | | — | 53.2% | 49.8% | 48.6% | 54.5% | 57.3% | 58.7% |
FIP | | | | — | 46.4% | 46.3% | 51.6% | 55.7% | 59.5% |
bbFIP | | | | | — | 49.4% | 53.9% | 58.0% | 60.1% |
kwERA | | | | | | — | 54.4% | 57.9% | 59.4% |
FRA (adj) | | | | | | | — | 53.6% | 58.2% |
tRA (adj) | | | | | | | | — | 53.6% |
ERA (adj) | | | | | | | | | — |
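Each cell in a table like this is a simple win share, and the 1.1% in the header is what a fair coin flip would produce over 2,096 comparisons. The sketch below shows both calculations; the function names are illustrative, the sample numbers are made up, and splitting ties evenly is an assumption.

```python
import math


def pct_row_closer(row_est, col_est, actual):
    """Share of pitchers for whom row_est lands closer to actual than col_est.

    Ties are split evenly between the two estimators (an assumption).
    """
    wins = 0.0
    for r, c, a in zip(row_est, col_est, actual):
        err_r, err_c = abs(r - a), abs(c - a)
        if err_r < err_c:
            wins += 1.0
        elif err_r == err_c:
            wins += 0.5
    return wins / len(actual)


def coin_flip_sd(n):
    """SD of a win share if every comparison were a 50/50 coin flip."""
    return math.sqrt(0.5 * 0.5 / n)


# Made-up estimates for four pitchers.
row_est = [3.5, 4.0, 4.5, 3.0]
col_est = [3.8, 4.4, 4.1, 3.2]
actual = [3.6, 4.1, 4.0, 3.9]
share = pct_row_closer(row_est, col_est, actual)
print(share)  # 0.5

print(round(coin_flip_sd(2096), 3))  # 0.011, the 1.1% in the table header
```

Dividing a cell's distance from 50% by that standard deviation gives a rough z-score, which is the sense in which a cell can be called statistically significant.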
What I found particularly useful about SIERA is that it doesn’t regress a pitcher’s performance as far back toward the mean as other estimators. The standard deviation for SIERA is about .75, while the standard deviation of xFIP is .68. Even though peripheral performance regresses to the mean — which makes an estimator that regresses ERA to the mean better, according to RMSE — SIERA still holds its own against xFIP.
Basically, SIERA does what ERA estimators are supposed to do. It estimates what ERA would have been, net of defense and sequencing/BABIP/HR luck, while allowing the effects of pitchers’ performances on peripherals to shine through.
From many angles, it seems that SIERA is the leading ERA estimator available. Still, its advantage over xFIP is small. One can see that SIERA and xFIP, along with bbFIP and kwERA, are in a tightly packed league of their own, and all are useful when predicting next season’s ERA and isolating pitcher performance. Looking at peripherals alone can be useful, but finding a meaningful way to incorporate all of them is valuable. Because of that, each of these statistics has its strengths.
What is particularly interesting, given their very high correlation, is how often SIERA and xFIP still differ. The frequent differences between these two estimators should make it clear why I use both when evaluating a pitching matchup.
So what ideas weren’t incorporated into the new SIERA? Tomorrow, you’ll find out.
Matt writes for FanGraphs and The Hardball Times, and models arbitration salaries for MLB Trade Rumors. Follow him on Twitter @Matt_Swa.