New SIERA, Part Four (of Five): Testing

SIERA’s updated version was unveiled Monday at FanGraphs, and as part of the release, I’ve been taking readers through its ERA-estimation process. I’ve written about SIERA’s ability to predict BABIP and HR/FB, and then I broke down its formula and structure. But like any good analysis, it needed to be tested against other estimators.

So far, the results are pretty conclusive. In fact, SIERA might be the best tool yet to help us understand and better interpret pitching performance.

I did a round of SIERA testing at Baseball Prospectus in January before making my newest adjustments here. I showed in that article that SIERA and xFIP were by far the best readily available estimators, and that SIERA did slightly better at predicting ERA than xFIP did. This involved a number of testing procedures.

First, I used Root Mean Square Error and saw that from 2003 to 2010, SIERA was closer than xFIP to next season’s ERA by an average of .021. FIP was .063 behind, though it was only .033 behind when predicting non-park-adjusted ERAs for pitchers on the same team. That figure suggests park effects are biasing and exaggerating the difference between FIP and xFIP.

I also found that the difference between xFIP and SIERA was persistent, with SIERA closer to next year’s ERA in six of the seven testable years. It also correlated slightly higher (.398 vs. .352) with next season’s ERA and was closer to next season’s ERA 54.6% of the time. Similarly, SIERA also was closest to the previous season’s ERA, to the ERA two seasons later, to the ERA two seasons prior, to the ERA three seasons later, and to the ERA three seasons prior. Weighting didn’t affect the ranking of ERA estimators.

This time, I incorporated two statistics that Tom Tango suggested during the original SIERA discussion: kwERA (defined as kwERA = constant + 11*(UBB+HBP-SO)/PA) and bbFIP (defined as bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA).
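The two definitions above translate directly into code. This is only a sketch: the function names are made up for illustration, and because the article does not give the value of the constant, it is left as a required argument.

```python
def kwera(ubb, hbp, so, pa, constant):
    """kwERA = constant + 11 * (UBB + HBP - SO) / PA, as defined in the text.

    `constant` is not stated in the article, so the caller must supply it.
    """
    return constant + 11.0 * (ubb + hbp - so) / pa


def bbfip(ubb, hbp, ld, so, iffb, offb, gb, pa, constant):
    """bbFIP = constant + 11 * (UBB + HBP + LD - SO - IFFB) / PA
                        + 3 * (OFFB - GB) / PA, as defined in the text."""
    plate_term = 11.0 * (ubb + hbp + ld - so - iffb) / pa
    batted_ball_term = 3.0 * (offb - gb) / pa
    return constant + plate_term + batted_ball_term


# Hypothetical stat line and a placeholder constant of 5.0 (not the real one).
example_kwera = kwera(ubb=50, hbp=5, so=180, pa=800, constant=5.0)
example_bbfip = bbfip(ubb=50, hbp=5, ld=100, so=180, iffb=20,
                      offb=180, gb=250, pa=800, constant=5.0)
```

Note that kwERA uses nothing but walks, hit batters and strikeouts, while bbFIP adds batted-ball types on top of the same core.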

I weighted each estimator’s squared error by the number of innings pitched and excluded all pitchers with fewer than 40 innings pitched, either during the season in question or during the following season. This table shows the estimators’ ranking:

| ERA Estimator (N = 2096) | RMSE |
| --- | --- |
| SIERA (new) | 1.075 |
| SIERA (old) | 1.079 |
| kwERA | 1.081 |
| bbFIP | 1.083 |
| xFIP | 1.089 |
| FIP | 1.141 |
| FRA (scaled to ERA) | 1.171 |
| tRA (scaled to ERA) | 1.215 |
| ERA (park-adj) | 1.308 |
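For readers who want to reproduce this kind of test, here is a minimal sketch of an innings-weighted RMSE with the 40-inning cutoff. The row layout and field names are hypothetical, and the sketch assumes the weight is the innings total from the estimator’s season, which the text does not spell out.

```python
import math


def qualifying(rows, min_ip=40.0):
    """Keep pitcher-seasons with at least `min_ip` innings in BOTH seasons."""
    return [r for r in rows if r["ip"] >= min_ip and r["next_ip"] >= min_ip]


def weighted_rmse(rows):
    """Square root of the IP-weighted mean squared error.

    Assumption: the weight is `ip`, the innings in the estimator's season.
    """
    total_weight = sum(r["ip"] for r in rows)
    weighted_sq_err = sum(r["ip"] * (r["est"] - r["next_era"]) ** 2 for r in rows)
    return math.sqrt(weighted_sq_err / total_weight)


# Made-up pitcher-seasons: estimator value, next season's ERA, IP in each season.
sample = [
    {"est": 3.5, "next_era": 4.0, "ip": 200.0, "next_ip": 180.0},
    {"est": 4.2, "next_era": 3.9, "ip": 60.0, "next_ip": 70.0},
    {"est": 5.0, "next_era": 2.0, "ip": 30.0, "next_ip": 100.0},  # dropped: under 40 IP
]
result = weighted_rmse(qualifying(sample))
```

Running the same function once per estimator on the same set of qualifying rows gives a ranking comparable in spirit to the table above.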

With a standard deviation of the squared-error terms of about 2, and a literal .999 correlation between xFIP squared error and SIERA squared error, the standard error of the difference would be about .002. That means the differences are statistically significant.
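The .002 figure can be checked with the usual formula for the standard error of a difference between two correlated means. This is one plausible reading of the arithmetic, not necessarily the exact calculation used: it assumes both estimators’ squared errors have a standard deviation of 2, ignores the innings weighting, and takes N = 2096 from the table.

```python
import math


def se_of_mean_difference(sd_a, sd_b, corr, n):
    """Standard error of (mean of A - mean of B) for paired, correlated samples."""
    var_of_diff = sd_a ** 2 + sd_b ** 2 - 2.0 * corr * sd_a * sd_b
    return math.sqrt(var_of_diff / n)


# Inputs quoted in the text; the equal SDs and unweighted N are assumptions.
se = se_of_mean_difference(sd_a=2.0, sd_b=2.0, corr=0.999, n=2096)
print(round(se, 4))  # 0.002
```

The very high correlation is what makes such small gaps detectable: almost all of the noise in one estimator’s squared errors is shared with the other’s and cancels in the difference.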

I also adjusted the weighting to use the geometric mean of IP in the season in question and in the following season (the one whose ERA was being estimated). The ranking was nearly the same, with xFIP and bbFIP now tied:

| ERA Estimator (N = 2096) | RMSE |
| --- | --- |
| SIERA (new) | 1.040 |
| SIERA (old) | 1.045 |
| kwERA | 1.053 |
| xFIP | 1.057 |
| bbFIP | 1.057 |
| FIP | 1.118 |
| FRA (scaled to ERA) | 1.150 |
| tRA (scaled to ERA) | 1.202 |
| ERA (park-adj) | 1.289 |
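The alternative weight is simple to compute. A small sketch, with a hypothetical function name, shows how it differs from weighting by a single season’s innings:

```python
import math


def geometric_ip_weight(ip, next_ip):
    """Geometric mean of innings pitched in the two seasons being compared."""
    return math.sqrt(ip * next_ip)


# A pitcher with 200 IP followed by 50 IP gets weight 100, not 200, so a
# short follow-up season (and its noisy ERA) counts for less.
weight_uneven = geometric_ip_weight(200.0, 50.0)
weight_even = geometric_ip_weight(125.0, 125.0)
```

Both example pitchers threw 250 innings in total, but the one with evenly split workloads gets the larger weight, which fits the lower RMSE values in this table.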

Note that SIERA is not overfit to next season’s ERA. All of the next-season RMSE figures above use a SIERA whose coefficients come from regressing against same-season ERA. Fitting the coefficients to next season’s ERA instead would be cheating, of course, though it would produce a lower RMSE (0.920, in fact). This SIERA version, which is fitted to next season’s ERA, is called SIERA*:

| Variable | SIERA coefficient | SIERA* coefficient |
| --- | --- | --- |
| (SO/PA) | -15.518 | -15.219 |
| (SO/PA)^2 | 9.146 | 12.746 |
| (BB/PA) | 8.648 | -0.385 |
| (BB/PA)^2 | 27.252 | 10.671 |
| (netGB/PA) | -2.298 | -2.844 |
| (netGB/PA)^2 | -4.920 | -2.232 |
| (SO/PA)*(BB/PA) | -4.036 | 15.421 |
| (SO/PA)*(netGB/PA) | 5.155 | 5.226 |
| (BB/PA)*(netGB/PA) | 4.546 | 10.150 |
| Constant | 5.534 | 5.952 |
| Year coefficients (versus 2010 for SIERA, versus 2009 for SIERA*) | From -0.020 to +0.289 | From 0.000 to 0.426 |
| % innings as SP | 0.367 | 0.246 |


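| Year | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SIERA Coefficient | -.020 | +.093 | +.154 | +.037 | +.289 | +.226 | +.116 | +.103 | .000 |
| SIERA* Coefficient | +.140 | +.181 | +.106 | +.426 | +.204 | +.133 | +.116 | .000 | n/a |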
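To see how the coefficients combine, here is a rough sketch that plugs the SIERA column into a plain sum of terms. Several things are assumptions: the function and dictionary names are invented, the starter share is taken as a fraction between 0 and 1, the year coefficient is simply added, and every coefficient is applied with exactly the sign listed in the table. Any special handling described in the formula article is not reproduced here.

```python
# Coefficients copied from the "SIERA coefficient" column of the table.
SIERA_COEF = {
    "so": -15.518, "so_sq": 9.146,
    "bb": 8.648, "bb_sq": 27.252,
    "netgb": -2.298, "netgb_sq": -4.920,
    "so_bb": -4.036, "so_netgb": 5.155, "bb_netgb": 4.546,
    "constant": 5.534, "sp_share": 0.367,
}

# Year coefficients for SIERA (2010 is the baseline year).
SIERA_YEAR = {2002: -0.020, 2003: 0.093, 2004: 0.154, 2005: 0.037, 2006: 0.289,
              2007: 0.226, 2008: 0.116, 2009: 0.103, 2010: 0.000}


def siera_sketch(so_pa, bb_pa, netgb_pa, sp_share, year):
    """Sum the table's terms as listed (an illustration, not the official formula)."""
    c = SIERA_COEF
    return (c["constant"]
            + c["so"] * so_pa + c["so_sq"] * so_pa ** 2
            + c["bb"] * bb_pa + c["bb_sq"] * bb_pa ** 2
            + c["netgb"] * netgb_pa + c["netgb_sq"] * netgb_pa ** 2
            + c["so_bb"] * so_pa * bb_pa
            + c["so_netgb"] * so_pa * netgb_pa
            + c["bb_netgb"] * bb_pa * netgb_pa
            + c["sp_share"] * sp_share
            + SIERA_YEAR[year])


# Hypothetical full-time starter in 2010: 20% SO/PA, 8% BB/PA, +5% net GB/PA.
example = siera_sketch(so_pa=0.20, bb_pa=0.08, netgb_pa=0.05, sp_share=1.0, year=2010)
```

Swapping in the SIERA* column would give the projection-style variant discussed next; the structure of the sum stays the same.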

Take SIERA* with a grain of salt, though, since the metric is calibrated to model in-sample data. And because it’s directly minimizing the square error in estimating next season’s statistic by design, this is an ERA predictor (a projection metric), rather than an ERA estimator (which recreates ERA net of luck and defense). If it’s a projection metric, it should be compared to Oliver, ZiPS, PECOTA, Marcel and the rest. If the goal is to take non-peripheral luck and defense out of the equation, then SIERA is more useful than SIERA*.

I also tested the ERA estimators against same-year ERA. Naturally, the statistics that directly factored in a pitcher’s home run rate did the best (except for tRA) — and bbFIP did better than new SIERA. That’s probably because line drives raise a pitcher’s bbFIP, while scoring bias might arbitrarily inflate that pitcher’s line-drive total when he suffers bad luck. (Note that line-drive rate is not persistent year to year when measured per batted ball. The way to give up fewer line drives is to strike out more batters. In other words, a pitcher needs to avoid batted balls in the first place.) Although kwERA did pretty well with next-season ERA testing, it did poorly with same-year estimation.

| ERA Estimator (N = 3328) | RMSE |
| --- | --- |
| FIP | .740 |
| FRA (scaled to ERA) | .779 |
| bbFIP | .852 |
| SIERA (new) | .870 |
| xFIP | .875 |
| tRA (scaled to ERA) | .886 |
| SIERA (old) | .890 |
| kwERA | .913 |

Because old SIERA can’t be calculated for 2002 (Retrosheet batted-ball data isn’t available for that year), it seemed it might be unfairly penalized in the comparison. So I removed the 2002 estimators from the sample and recalculated, using the 2003 through 2009 estimators to predict 2004 through 2010 park-adjusted ERA.

| ERA Estimator (N = 1838) | RMSE |
| --- | --- |
| SIERA (new) | 1.075 |
| SIERA (old) | 1.083 |
| kwERA | 1.086 |
| bbFIP | 1.088 |
| xFIP | 1.092 |
| FIP | 1.149 |
| FRA (scaled to ERA) | 1.181 |
| tRA (scaled to ERA) | 1.217 |
| ERA (park-adj) | 1.313 |

The ranking remained the same, anyway.

As I did last time, I checked whether FIP did better with pitchers who stayed on the same team. To determine that, I tested FIP against un-park-adjusted ERA and compared it with how the other statistics predicted park-adjusted ERA for pitchers who stayed on the same teams. The results show an improvement for FIP, but not enough to out-predict xFIP.

| ERA Estimator (N = 1353) | RMSE |
| --- | --- |
| SIERA (new) | 1.066 |
| SIERA (old) | 1.068 |
| xFIP | 1.071 |
| kwERA | 1.073 |
| bbFIP | 1.078 |
| FIP (un-adjusted ERA) | 1.114 |
| FRA (scaled to ERA) | 1.157 |
| tRA (scaled to ERA) | 1.205 |
| ERA (park-adj) | 1.273 |

While Root Mean Square Errors are useful when estimating the average difference between an ERA estimator and an ERA, sometimes it’s easier to simply compare statistics head-to-head.

In the next table, I checked how often each statistic came closer to the next season’s ERA than each of the others did. The new version of SIERA won its head-to-head matchups against every other estimator. Both bbFIP and kwERA beat xFIP, and kwERA even beat the old SIERA; xFIP was close behind and beat all the remaining estimators. All the differences are statistically significant, unless they’re italicized.

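| % of times row closer (N = 2096, so St. Dev. = 1.1%) | SIERA (new) | SIERA (old) | xFIP | FIP | bbFIP | kwERA | FRA (adj) | tRA (adj) | ERA (adj) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SIERA (new) | | 50.4% | 52.7% | 55.0% | 52.9% | 53.0% | 56.2% | 56.4% | 60.6% |
| SIERA (old) | | | 53.4% | 54.6% | 52.1% | 48.1% | 55.0% | 58.0% | 59.9% |
| xFIP | | | | 53.2% | 49.8% | 48.6% | 54.5% | 57.3% | 58.7% |
| FIP | | | | | 46.4% | 46.3% | 51.6% | 55.7% | 59.5% |
| bbFIP | | | | | | 49.4% | 53.9% | 58.0% | 60.1% |
| kwERA | | | | | | | 54.4% | 57.9% | 59.4% |
| FRA (adj) | | | | | | | | 53.6% | 58.2% |
| tRA (adj) | | | | | | | | | 53.6% |
| ERA (adj) | | | | | | | | | |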
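Each cell in a table like this is a simple win rate, and the 1.1% standard deviation in the header matches what a coin flip would produce over 2096 trials. A minimal sketch, with hypothetical function names and ties counted against the row estimator as an assumption:

```python
import math


def pct_closer(row_est, col_est, next_era):
    """Share of pitcher-seasons where the row estimator is strictly closer."""
    wins = sum(1 for r, c, y in zip(row_est, col_est, next_era)
               if abs(r - y) < abs(c - y))
    return wins / len(next_era)


def coin_flip_sd(n):
    """Standard deviation of a win rate if each matchup were a 50/50 coin flip."""
    return math.sqrt(0.5 * 0.5 / n)


print(round(100 * coin_flip_sd(2096), 1))  # 1.1

# Made-up values for four pitcher-seasons: the row estimator wins two of four.
share = pct_closer([3.0, 4.0, 5.0, 3.5], [3.5, 4.5, 4.0, 3.6], [3.1, 4.4, 4.2, 3.4])
```

By that yardstick, a cell needs to sit roughly two standard deviations (about 2.2 percentage points) away from 50% before the gap stands out from chance at the conventional level.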

What I found particularly useful about SIERA is that it doesn’t regress a pitcher’s performance as far back toward the mean as other estimators. The standard deviation for SIERA is about .75, while the standard deviation of xFIP is .68. Even though peripheral performance regresses to the mean — which makes an estimator that regresses ERA to the mean better, according to RMSE — SIERA still holds its own against xFIP.

Basically, SIERA does what ERA estimators are supposed to do. It estimates what ERA would have been, net of defense and sequencing/BABIP/HR luck, while allowing the effects of pitchers’ performances on peripherals to shine through.

From many angles, it seems that SIERA is the leading ERA estimator available. Still, its advantage over xFIP is small. Along with bbFIP and kwERA, these statistics are in a tightly packed league of their own, and all of them are useful when predicting next season’s ERA and isolating pitcher performance. Looking at peripherals alone can be useful, but finding a meaningful way to combine them is valuable, and each of these statistics does so with its own strengths.

What is particularly interesting, given their very high correlation, is how often SIERA and xFIP still differ. The frequent differences between these two estimators should make it clear why I use both when evaluating a pitching matchup.

So what ideas weren’t incorporated into the new SIERA? Tomorrow, you’ll find out.

Matt writes for FanGraphs and The Hardball Times, and models arbitration salaries for MLB Trade Rumors. Follow him on Twitter @Matt_Swa.
