New SIERA, Part Four (of Five): Testing

by Matt Swartz

July 21, 2011

SIERA’s updated version was unveiled Monday at FanGraphs, and as part of the release, I’ve been taking readers through its ERA-estimation process. I’ve written about SIERA’s ability to predict BABIP and HR/FB, and then I broke down its formula and structure. But like any good analysis, it needed to be tested against other estimators.

So far, the results are pretty conclusive. In fact, SIERA might be the best tool yet to help us understand and better interpret pitching performance.

I did a round of SIERA testing at Baseball Prospectus in January before making my newest adjustments here. I showed in that article that SIERA and xFIP were by far the best readily available estimators, and that SIERA did slightly better at predicting ERA than xFIP did. This involved a number of testing procedures.

First, I used Root Mean Square Error and saw that from 2003 to 2010, SIERA was closer to next season’s ERA by an average of .021. FIP was .063 behind — though it was only .033 behind when predicting non-park-adjusted ERAs for pitchers on the same team. That figure suggests park effects are biasing and exaggerating the difference between FIP and xFIP.

I also found that the difference between xFIP and SIERA was persistent, with SIERA closer to next year’s ERA in six of the seven testable years. It also correlated slightly better (.398 vs. .352) with next season’s ERA and was closer to next season’s ERA 54.6% of the time. Similarly, SIERA also was closest to the previous season’s ERA, to the ERA two seasons later, to the ERA two seasons prior, to the ERA three seasons later, and to the ERA three seasons prior. Weighting didn’t affect the ranking of ERA estimators.

This time, I incorporated two statistics that Tom Tango suggested during the original SIERA discussion: kwERA (defined as kwERA = constant + 11*(UBB+HBP-SO)/PA) and bbFIP (defined as bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA).
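As a rough illustration, the two definitions above translate directly into code. This is only a sketch: the function and argument names are made up for the example, and because the article does not state the value of the constant, it is left as a required argument rather than guessed.

```python
def kw_era(ubb, hbp, so, pa, constant):
    """kwERA = constant + 11*(UBB+HBP-SO)/PA, as defined in the text.

    `constant` is not given in the article, so the caller must supply it.
    """
    return constant + 11 * (ubb + hbp - so) / pa


def bb_fip(ubb, hbp, ld, so, iffb, offb, gb, pa, constant):
    """bbFIP = constant + 11*(UBB+HBP+LD-SO-IFFB)/PA + 3*(OFFB-GB)/PA."""
    return (constant
            + 11 * (ubb + hbp + ld - so - iffb) / pa
            + 3 * (offb - gb) / pa)


# Hypothetical pitcher line and a placeholder constant of 5.0,
# both invented purely to show the call.
example_kw = kw_era(ubb=50, hbp=5, so=200, pa=800, constant=5.0)
example_bb = bb_fip(ubb=50, hbp=5, ld=100, so=200, iffb=20,
                    offb=180, gb=250, pa=800, constant=5.0)
```

Note that kwERA uses no batted-ball data at all, while bbFIP adds line drives, infield flies, and the outfield-fly/ground-ball balance.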

I weighted each estimator’s squared error by the number of innings pitched and excluded all pitchers with fewer than 40 innings pitched, either during the season in question or during the following season. This table shows the estimators’ ranking:

ERA Estimator (N =2096)	RMSE
SIERA (new)	1.075
SIERA (old)	1.079
kwERA	1.081
bbFIP	1.083
xFIP	1.089
FIP	1.141
FRA (scaled to ERA)	1.171
tRA (scaled to ERA)	1.215
ERA (park-adj)	1.308
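For readers who want to reproduce this kind of table, here is a minimal sketch of an innings-weighted RMSE with the 40-inning cutoff. The row layout and function name are assumptions for the example, and so is the choice to weight by innings in the estimator’s season (the text says only “innings pitched”).

```python
import math


def weighted_rmse(rows, min_ip=40.0):
    """IP-weighted root mean square error of an estimator vs. next-season ERA.

    rows: iterable of (estimate, next_era, ip, next_ip) tuples.
    Pitchers under `min_ip` innings in either season are dropped.
    Assumption: the weight is innings pitched in the estimator's season.
    """
    weighted_sq_error = 0.0
    total_weight = 0.0
    for estimate, next_era, ip, next_ip in rows:
        if ip < min_ip or next_ip < min_ip:
            continue  # fails the innings threshold in one of the two seasons
        weighted_sq_error += ip * (estimate - next_era) ** 2
        total_weight += ip
    if total_weight == 0:
        raise ValueError("no pitcher-seasons meet the innings threshold")
    return math.sqrt(weighted_sq_error / total_weight)


# Invented rows; the third is excluded because of its 30-inning season.
sample = [
    (3.5, 4.0, 200.0, 180.0),
    (4.2, 3.2, 100.0, 60.0),
    (2.0, 9.0, 30.0, 50.0),
]
print(round(weighted_rmse(sample), 3))
```

Running the same function once per estimator over the same set of pitcher-seasons would give one RMSE column like the one above.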

With a standard deviation of the square error terms of about 2 — and a literal .999 correlation between xFIP square error and SIERA square error — a standard error would be about .002. That means the differences are statistically significant.
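One way to see where a figure like .002 can come from is the usual formula for the standard error of a difference between two correlated means. The sketch below is my reading of the arithmetic, not necessarily the exact calculation used: it treats the 2,096 pitcher-seasons as equally weighted and independent, and it works on the squared-error scale.

```python
import math


def se_of_mean_difference(sd_a, sd_b, corr, n):
    """Standard error of mean(a) - mean(b) for paired, correlated samples.

    Var(a - b) = sd_a^2 + sd_b^2 - 2 * corr * sd_a * sd_b,
    and the standard error of the mean difference divides that by n.
    """
    var_diff = sd_a ** 2 + sd_b ** 2 - 2 * corr * sd_a * sd_b
    return math.sqrt(var_diff / n)


# Figures quoted in the text: SD of squared errors about 2,
# correlation .999 between the two sets of squared errors, N = 2096.
se = se_of_mean_difference(2.0, 2.0, 0.999, 2096)
print(round(se, 4))  # roughly 0.002
```

The very high correlation is what makes the standard error so small: the two estimators miss on the same pitchers, so even a small gap between them is measured precisely.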

I also adjusted the weighting using the geometric average of IP in both the season in question and in the following season (in which the ERA was estimated). The ranking was the same:

ERA Estimator (N =2096)	RMSE
SIERA (new)	1.040
SIERA (old)	1.045
kwERA	1.053
xFIP	1.057
bbFIP	1.057
FIP	1.118
FRA (scaled to ERA)	1.150
tRA (scaled to ERA)	1.202
ERA (park-adj)	1.289

Note that SIERA is not overfit to ERA. All of the RMSE figures against the following season use SIERA coefficients that come from a regression on same-season ERA. Fitting the regression to next season’s ERA instead would be cheating, of course, but it would produce a lower RMSE (0.920, to be exact). This SIERA version, which is fitted to next season’s ERA, is called SIERA*:

Variable	SIERA coefficient	SIERA* coefficient
(SO/PA)	-15.518	-15.219
(SO/PA)^2	9.146	12.746
(BB/PA)	8.648	-0.385
(BB/PA)^2	27.252	10.671
(netGB/PA)	-2.298	-2.844
(netGB/PA)^2	-4.920	-2.232
(SO/PA)*(BB/PA)	-4.036	15.421
(SO/PA)*(netGB/PA)	5.155	5.226
(BB/PA)*(netGB/PA)	4.546	10.150
Constant	5.534	5.952
Year coefficients (versus 2010 for SIERA, versus 2009 for SIERA*)	From -0.020 to +0.289	From 0.000 to 0.426
% innings as SP	0.367	0.246

Year	2002	2003	2004	2005	2006	2007	2008	2009	2010
SIERA Coefficient	-.020	+.093	+.154	+.037	+.289	+.226	+.116	+.103	.000
SIERA* Coefficient	+.140	+.181	+.106	+.426	+.204	+.133	+.116	.000	n/a
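To make the SIERA column concrete, here is a naive sketch that simply multiplies each coefficient in the table by its variable and adds the results. It rests on several assumptions: “% innings as SP” is treated as a fraction between 0 and 1, the year coefficient is added as-is, and any special handling of individual terms described in Part Three of this series is ignored. The function and key names are invented for the example.

```python
# SIERA coefficients copied from the table above.
SIERA_COEF = {
    "so": -15.518, "so_sq": 9.146,
    "bb": 8.648, "bb_sq": 27.252,
    "gb": -2.298, "gb_sq": -4.920,
    "so_bb": -4.036, "so_gb": 5.155, "bb_gb": 4.546,
    "constant": 5.534, "sp_share": 0.367,
}

# Year coefficients for SIERA (relative to 2010), from the second table.
SIERA_YEAR = {
    2002: -0.020, 2003: 0.093, 2004: 0.154, 2005: 0.037, 2006: 0.289,
    2007: 0.226, 2008: 0.116, 2009: 0.103, 2010: 0.000,
}


def siera_naive(so_pa, bb_pa, netgb_pa, sp_share, year):
    """Literal term-by-term evaluation of the coefficient table.

    Assumption: sp_share is the fraction (0 to 1) of innings thrown
    as a starter. This is a sketch, not the official formula.
    """
    c = SIERA_COEF
    return (c["constant"]
            + c["so"] * so_pa + c["so_sq"] * so_pa ** 2
            + c["bb"] * bb_pa + c["bb_sq"] * bb_pa ** 2
            + c["gb"] * netgb_pa + c["gb_sq"] * netgb_pa ** 2
            + c["so_bb"] * so_pa * bb_pa
            + c["so_gb"] * so_pa * netgb_pa
            + c["bb_gb"] * bb_pa * netgb_pa
            + c["sp_share"] * sp_share
            + SIERA_YEAR[year])


# Hypothetical full-time starter: 20% strikeouts, 8% walks,
# net ground balls on 5% of plate appearances, 2010 season.
print(round(siera_naive(0.20, 0.08, 0.05, 1.0, 2010), 2))
```

Swapping in the SIERA* column and its year coefficients would give the projection-style version discussed in the next paragraph.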

Take SIERA* with a grain of salt, though, since the metric is calibrated to model in-sample data. And because it’s directly minimizing the square error in estimating next season’s statistic by design, this is an ERA predictor (a projection metric), rather than an ERA estimator (which recreates ERA net of luck and defense). If it’s a projection metric, it should be compared to Oliver, ZiPS, PECOTA, Marcel and the rest. If the goal is to take non-peripheral luck and defense out of the equation, then SIERA is more useful than SIERA*.

I also tested the ERA estimators against same-year ERA. Naturally, the statistics that directly factored in a pitcher’s home run rate did the best (except for tRA) — and bbFIP did better than new SIERA. That’s probably because line drives raise a pitcher’s bbFIP, while scoring bias might arbitrarily inflate that pitcher’s line-drive total when he suffers bad luck. (Note that line-drive rate is not persistent year to year when measured per batted ball. The way to give up fewer line drives is to strike out more batters. In other words, a pitcher needs to avoid batted balls in the first place.) Although kwERA did pretty well with next-season ERA testing, it did poorly with same-year estimation.

ERA Estimator (N =3328)	RMSE
FIP	.740
FRA (scaled to ERA)	.779
bbFIP	.852
SIERA (new)	.870
xFIP	.875
tRA (scaled to ERA)	.886
SIERA (old)	.890
kwERA	.913

Old SIERA can’t be calculated for 2002 because Retrosheet batted-ball data aren’t available for that year, so it seemed like old SIERA might be unfairly penalized in the comparison. I therefore removed the 2002 estimators from the sample and recalculated, using the 2003 through 2009 estimators to predict 2004 through 2010 park-adjusted ERA.

ERA Estimator (N =1838)	RMSE
SIERA (new)	1.075
SIERA (old)	1.083
kwERA	1.086
bbFIP	1.088
xFIP	1.092
FIP	1.149
FRA (scaled to ERA)	1.181
tRA (scaled to ERA)	1.217
ERA (park-adj)	1.313

The ranking remained the same, anyway.

As I did last time, I checked whether FIP did better with pitchers who stayed on the same team. To determine that, I tested against un-park-adjusted ERA and compared how other statistics predicted park-adjusted ERA for pitchers who stayed on the same teams. The results show an improvement for FIP, but not enough to out-predict xFIP.

ERA Estimator (N =1353)	RMSE
SIERA (new)	1.066
SIERA (old)	1.068
xFIP	1.071
kwERA	1.073
bbFIP	1.078
FIP (un-adjusted ERA)	1.114
FRA (scaled to ERA)	1.157
tRA (scaled to ERA)	1.205
ERA (park-adj)	1.273

While Root Mean Square Errors are useful when estimating the average difference between an ERA estimator and an ERA, sometimes it’s easier to simply compare statistics head-to-head.

In the next table, I checked how often each statistic came closer than each of the others to the next season’s ERA. The new version of SIERA won its head-to-head matchups against every other estimator. Both bbFIP and kwERA beat xFIP, and kwERA even beat the old SIERA; xFIP was close behind and beat all the remaining estimators. All the differences are statistically significant, unless they’re italicized.

% of times row closer (N=2096, so St. Dev. = 1.1%)	SIERA (new)	SIERA (old)	xFIP	FIP	bbFIP	kwERA	FRA (adj)	tRA (adj)	ERA (adj)
SIERA (new)	—	50.4%	52.7%	55.0%	52.9%	53.0%	56.2%	56.4%	60.6%
SIERA (old)		—	53.4%	54.6%	52.1%	48.1%	55.0%	58.0%	59.9%
xFIP			—	53.2%	49.8%	48.6%	54.5%	57.3%	58.7%
FIP				—	46.4%	46.3%	51.6%	55.7%	59.5%
bbFIP					—	49.4%	53.9%	58.0%	60.1%
kwERA						—	54.4%	57.9%	59.4%
FRA (adj)							—	53.6%	58.2%
tRA (adj)								—	53.6%
ERA (adj)									—
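Each cell in the table is a simple win rate, and the 1.1% figure in the header matches the standard deviation of a fair coin flipped 2,096 times. The sketch below shows both calculations. The function names are made up, and counting ties as a loss for the row estimator is my assumption.

```python
import math


def pct_closer(row_est, col_est, actual):
    """Share of pitcher-seasons where row_est is strictly closer to actual.

    Assumption: ties count against the row estimator.
    """
    wins = sum(
        1
        for r, c, y in zip(row_est, col_est, actual)
        if abs(r - y) < abs(c - y)
    )
    return wins / len(actual)


def coin_flip_sd(n):
    """SD of a win percentage if the two estimators were truly even."""
    return math.sqrt(0.5 * 0.5 / n)


print(round(100 * coin_flip_sd(2096), 1))  # about 1.1 (percent)

# Tiny invented sample: the row estimator is closer in one of three cases.
share = pct_closer([3.0, 4.0, 5.0], [3.5, 3.9, 4.0], [3.1, 3.8, 4.2])
```

Under that coin-flip benchmark, a cell near 52% or above (or 48% or below) is roughly two standard deviations from even.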

What I found particularly useful about SIERA is that it doesn’t regress a pitcher’s performance as far back toward the mean as other estimators. The standard deviation for SIERA is about .75, while the standard deviation of xFIP is .68. Even though peripheral performance regresses to the mean — which makes an estimator that regresses ERA to the mean better, according to RMSE — SIERA still holds its own against xFIP.

Basically, SIERA does what ERA estimators are supposed to do. It estimates what a pitcher’s ERA would have been, net of defense and sequencing/BABIP/HR luck, while allowing the effects of his performance on peripherals to shine through.

From many angles, it seems that SIERA is the leading ERA estimator available. Still, its advantage over xFIP is small. One can see that SIERA and xFIP, along with bbFIP and kwERA, are in a tightly packed league of their own, and all are useful when predicting next season’s ERA and isolating pitcher performance. Looking at peripherals alone can be useful, but finding a meaningful way to incorporate all of these statistics is valuable. Because of that, each of these statistics has its strengths.

What is particularly interesting, given their very high correlation, is how often SIERA and xFIP still differ. The frequent differences between these two estimators should make it clear why I use both when evaluating a pitching matchup.

So what ideas weren’t incorporated into the new SIERA? Tomorrow, you’ll find out.
