Two years ago, the Baltimore Orioles gave a middle finger to the concept of regression to the mean. For six months, they won game after game by a single run, relying on a bullpen that posted the highest WPA in history to make the postseason despite the skepticism of sabermetric writers everywhere, including here on FanGraphs. The story of their season was essentially told in two numbers: 93-69 record, +7 run differential. To O’s fans, it was a fantastic season, but to writers like those found here, it was essentially a fluke.
The 2012 Orioles were a decent team that managed to distribute their runs in about the most effective manner possible, but there’s just no evidence to suggest that this is a repeatable skill over significant periods of time. And sure enough, after going 29-9 in one run contests in 2012, the 2013 Orioles went 20-31 in games decided by a lone run. For one year, the Orioles defied the odds, but as we’d expect, they couldn’t get that to carry over into the next season, and they won eight fewer games despite playing basically at the same level as the previous year.
But now, it’s 2014, and the Orioles are doing it again, though not quite to the same degree. Their 24-17 record in one run games isn’t quite so crazy, but they are outperforming what context-neutral models would suggest based on their overall performance to date. As Jeff noted, the Orioles are #2 in Clutch performance this year, winning five more games than their underlying statistics would have suggested. And once again, their bullpen leads the Majors in WPA, though it’s not quite the historical performance of two years ago.
And while Orioles fans may have been able to accept random variation as the explanation for 2012, the fact that they’re doing it again just two years later leads to suspicion that perhaps the Orioles, or maybe just Buck Showalter, have figured out how to game the system. Here are a few comments I received yesterday, both in my chat here and on Twitter:
Comment From baltic fox
Re: bullpen talent. Maybe most guys can’t predict bullpen talent in advance, but Buck Showalter apparently can. Having a lot of contacts in baseball and lots of experience at evaluating players isn’t something that can be measured with metrics.
Comment From King Flops
Is there any amount of differentiation from predictive models that would make you question the models over simply reciting, “luck/coinflips/randomness” like a SABR doll whose string was just pulled
Std deviation is obviously going to occur, but when you’re to the point where the O’s have outperformed their projections 3 years in a row (significantly), don’t you maybe have to think there is a potential flaw? I love game stats, but maybe there are still some things about the game that numbers don’t solve. I understand that is Sacrilege here, but still. Perhaps a fantastic back end of the bullpen and #1 defense instill confidence in SPs who then outperform their projections? Who knows. But being so narrow minded to definitively find std deviation…?
If you are lucky for a long period of time doesn’t that mean it’s no longer luck. Doesn’t that mean there is talent or skill involved?
The sentiment expressed in those last two comments is not at all uncommon, and is actually a perfectly rational statement. At some point, when reality diverges from a model’s expectations for long enough, it is entirely correct to question whether the model works. So, let’s actually dive into the data, and look at whether the Orioles are actually evidence that run estimators are missing something.
While we only have the BaseRuns Standings page here on FanGraphs for 2014, David Appelman was kind enough to send me the data for all years back to 2002. To see how unusual this Orioles run is, I looked at every rolling three year window for every team in baseball from 2004 through 2014. This gives us a total of 330 data points, and should allow us to answer some questions about the relationship between a team’s three year BaseRuns numbers and their corresponding Win-Loss record.
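The 330 figure follows from the window arithmetic. Here is a minimal sketch, assuming 30 teams and assuming each window is labeled by its final season (so the first window covers 2002-2004 and the last covers 2012-2014). The function name `rolling_windows` is made up for illustration and is not part of any FanGraphs tooling.

```python
# Sketch only: counts rolling three-year windows under the assumptions above.
N_TEAMS = 30                 # assumed number of MLB teams in every season
FIRST_SEASON = 2002          # earliest season with BaseRuns data
LAST_SEASON = 2014           # most recent season in the sample
WINDOW = 3                   # seasons per rolling window


def rolling_windows(first, last, size):
    """Return (start, end) season pairs for every full window of `size` seasons."""
    return [(end - size + 1, end) for end in range(first + size - 1, last + 1)]


windows = rolling_windows(FIRST_SEASON, LAST_SEASON, WINDOW)
total_points = N_TEAMS * len(windows)

print(len(windows), "windows per team, from", windows[0], "to", windows[-1])
print(total_points, "team-windows in total")
```

Eleven windows per team times 30 teams gives the 330 data points. Because consecutive windows share two seasons, those points are not independent of each other.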
With those 330 data points, I looked at the cumulative difference in winning percentage between a team’s actual record and their BaseRuns expected record. Between 2012, 2013, and 2014, the Orioles have beaten their BaseRuns expected record by a combined .129 winning percentage, or .043 per season. Over 162 games, that’s a difference of seven wins, and again, this is over three seasons. Of the 330 three-year windows we’re looking at, that ranks as the ninth largest difference, confirming what we already knew; what the Orioles are doing is unusual.
As you’ll note from that table, there’s a lot of overlap, which makes sense, because one outlier season can be counted in multiple rolling time frames. For instance, the 2008 Angels — who beat their expected record by 108 points, easily the most of any team in the sample — are included in three of the top four data points in that table. The 2010 Astros were the ninth-highest single season overachiever, but that three-year window makes the list because it also includes the 2008 Astros in the rolling total, and the 2008 Astros had the fifth highest single season gap between actual record and expected record.
But this is just what Orioles fans are suggesting; there are examples of teams who have beaten BaseRuns not just once, but followed it up by doing it over several years. The Angels, in fact, beat their BaseRuns expected record five years in a row, averaging 60 points of winning percentage per season over those years. That’s 10 wins per year over what BaseRuns expected, for five consecutive seasons. Clearly, we have to acknowledge that it is possible to consistently win more games than the model suggests, at least over a five year stretch. It’s happened, and not all that long ago.
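The conversions from points of winning percentage to wins quoted above are simple arithmetic. As a rough sketch, assuming a full 162-game schedule and using a helper name (`gap_to_wins`) invented for this example:

```python
# Sketch only: converts a gap in winning percentage into wins over a season.
GAMES_PER_SEASON = 162


def gap_to_wins(win_pct_gap, games=GAMES_PER_SEASON):
    """Wins implied by a difference between actual and expected winning percentage."""
    return win_pct_gap * games


# Orioles, 2012-2014: .129 combined over three seasons.
orioles_per_season = 0.129 / 3
print("Orioles per season:", round(orioles_per_season, 3),
      "->", round(gap_to_wins(orioles_per_season), 1), "wins")

# Angels' five-year run: 60 points of winning percentage per season.
print("Angels run:", round(gap_to_wins(0.060), 1), "wins per season")

# 2008 Angels: 108 points, the largest single-season gap in the sample.
print("2008 Angels:", round(gap_to_wins(0.108), 1), "wins")
```

The same multiplication is where the seven-win and ten-win figures come from.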
But here’s the thing; the existence of an outlier does not prove that a model is broken. In fact, the existence of the right amount of outliers is actually evidence that the model works really well. The question isn’t whether we can find outliers in the data; the question is whether there are more outliers than we’d expect given a normal distribution.
The normal distribution essentially states that, in a sample of data with a given mean, the results will be distributed around that mean in a way that isn’t biased one direction or the other. Most of the results will be closer towards the mean, with fewer and fewer examples as we get further away from that average, and roughly an equal proportion on both sides. The normal distribution is often called a bell curve, because, well, it takes the shape of a bell.
Statistically, the normal distribution has a rule that suggests that 68% of the results will fall within one standard deviation of the mean, 95% will fall within two standard deviations, and 99.7% will fall within three standard deviations. In order to see whether the presence of teams like the Angels, Astros, and Orioles proves that BaseRuns is missing something, we can measure just how many standard deviations from the mean they have been over these three year windows, and what the overall distribution of all 330 data points is.
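For anyone who wants to see where those percentages come from, they can be computed directly from the normal distribution: the share of results within k standard deviations of the mean is erf(k / sqrt(2)). A small sketch using only Python's standard library (the function name `share_within` is just for illustration):

```python
import math


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.2f}%")
```

These are the theoretical targets that an observed distribution gets compared against.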
Let’s start with the chart of all 330 data points.
It’s not perfectly distributed, but that looks very close to a normal distribution. If you prefer to see it in a curve rather than a histogram, here’s the same data, just presented with the drawn line.
That’s a bell curve, with a very slight skew to the left, as more teams have underachieved than overachieved over the sample we’re looking at. If we had 3,300 data points instead of 330, it’s likely that skew would go away.
But, while the chart certainly makes it look like the data is distributed normally, do the numbers actually match up to the 68-95-99.7 rule? Well, here’s a table so you can see for yourself.
|1 SD||2 SD||3 SD|
Remember the rule targets are 68.2%, 95.5%, and 99.7%. Yeah, I’d say it’s safe to call this a normal distribution. In other words, there are exactly as many outliers as we’d expect to find given this many data points. The existence of the Angels, Astros, and Orioles isn’t evidence that BaseRuns is broken; it’s evidence that the model follows the normal distribution, and the fact that there aren’t more examples like them suggests that the model works pretty darn well.
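If you want to run this kind of check on your own data, the calculation is straightforward. The sketch below is not the actual BaseRuns data: it uses 330 synthetic values drawn from a normal distribution with an arbitrary spread of 30 points of winning percentage, purely to show the mechanics. The function name `coverage` and the seed are assumptions made for the example.

```python
import random
import statistics


def coverage(values, k):
    """Share of values lying within k standard deviations of the sample mean."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mu) <= k * sd)
    return inside / len(values)


# Synthetic stand-in for the 330 actual-minus-expected gaps (NOT real data).
rng = random.Random(0)
synthetic = [rng.gauss(0.0, 0.030) for _ in range(330)]

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * coverage(synthetic, k):.1f}%")
```

With only 330 points, even truly normal data will miss the rule's targets by a point or two, so small departures on their own are not evidence against normality.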
For the record, the Orioles’ 2012-2014 record is 2.02 standard deviations from the mean. In other words, we’d expect to find a three year rolling window this far from the mean about once in every 20 or so data points, and so we shouldn’t be too shocked that the Yankees have beaten their BaseRuns expected record by nearly an identical amount over the last three years. What the Orioles have done isn’t actually all that crazy, and it doesn’t come anywhere near the level of suggesting that they have figured out how to exploit a flaw in the model that can be sustained over the long term.
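The once-in-about-20 figure can be checked against the normal distribution. This is a rough sketch that assumes the gaps really are normally distributed and treats each window as an independent draw, which overlapping windows are not. The two-sided tail counts windows at least 2.02 standard deviations from the mean in either direction; the one-sided tail counts only the overachievers. Function names are illustrative.

```python
import math


def two_sided_tail(z):
    """Probability of landing at least z standard deviations from the mean, either side."""
    return math.erfc(z / math.sqrt(2))


def one_sided_tail(z):
    """Probability of landing at least z standard deviations above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


z = 2.02          # Orioles, 2012-2014
n_points = 330    # three-year windows in the sample

p_two = two_sided_tail(z)
p_one = one_sided_tail(z)

print(f"either direction: about 1 in {1 / p_two:.0f}, "
      f"or {n_points * p_two:.1f} expected among {n_points} windows")
print(f"overachievers only: about 1 in {1 / p_one:.0f}, "
      f"or {n_points * p_one:.1f} expected among {n_points} windows")
```

Either way, a sample of 330 windows should contain several teams this far from their expected record.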
Now, again the presence of the normal distribution does not mean that BaseRuns is a perfect model. I’m not attempting to assert that the model is beyond reproach. What I will say, however, is that if we’re going to identify flaws in the model, we cannot use the existence of the 2012-2014 Orioles as evidence. It isn’t evidence of that.
And really, the evidence also pushes back against this being a Buck Showalter effect. After all, these aren’t Showalter’s first three years as an MLB manager. From 2003 to 2006, he managed the Texas Rangers; in three of those four years, the Rangers lost more games than expected, and his overall average winning percentage in Texas was 16 points lower than the BaseRuns model. In his first two years in Baltimore, the team outperformed, but just slightly so, averaging 12 points of winning percentage more than the expectation. Of the nine seasons managed by Buck Showalter in the years for which I have BaseRuns data, his average bump in winning percentage amounts to 1 point of winning percentage per year.
Really, if there’s a manager that could have staked a claim to figuring out the way to beat BaseRuns, it was Mike Scioscia. From 2002 to 2011, Scioscia’s Angels beat their BaseRuns expected records by an average of 41 points per year, winning more games than expected in eight of those ten years. The 2007-2009 Angels were 3.7 standard deviations from the mean in terms of performance over expected record, nearly twice as far from the mean as the 2012-2014 Orioles. After a full decade of beating expectations, certainly Scioscia should be the one guy we should expect to do it, right?
Well, the 2012 Angels won fewer games than BaseRuns expected, and so did the 2013 Angels, and so have the 2014 Angels. After beating the model for five straight years, Scioscia is now on a three year losing streak. This doesn’t erase what he’s done previously, but if we’re to explain how Scioscia figured out how to beat BaseRuns, we have to also explain why he forgot how to do it a couple of years ago, and hasn’t remembered since. And those Angels are the most extreme outlier. If they couldn’t keep doing this, there’s no reason to think anyone else can either. And, for the record, the Astros and Twins — the two other franchises who regularly beat their expected record during our sample — are also on similar losing streaks since the end of their runs.
I understand why Orioles fans are frustrated with being told that their team isn’t as good as their quality record for the second time in three years. We don’t like accepting randomness as an answer, and when someone tells us that regression is coming and then it doesn’t come as soon as they said it should, confirmation bias kicks in, allowing us to believe that the prediction was wrong all along. It is difficult for human beings to observe a repeated event over any real length of time and not find a cause for the results.
But this is why we should be skeptical of our abilities as observers rather than of models that actually work really well in a great majority of the cases. In a competition where the spread in talent level is not that large, randomness is going to play a significant role in the outcome. In Major League Baseball, a team can control, to a large degree, how many and what types of baserunners they create and allow, but there’s just not any evidence that converting those baserunners into runs at a higher than expected level is a real skill, or that distributing the runs a team scores or allows in an advantageous way is something that teams can control. As simple as it might sound, the best way to evaluate a team’s performance is by simply counting up the value of the individual plays and mostly ignoring the order in which they occur.
Sometimes, the ball will bounce your way more often than others, and the small spread in talent among teams means that context-specific performance can skew the standings by as many as 17 wins, though +/- 10 wins is more normal for an outlier within a given season. When we see a team that wins 10 more games than BaseRuns suggests, we shouldn’t conclude that BaseRuns is stupid and wrong; we should conclude that yep, that’s baseball.
And it’s part of what makes the game great. If there wasn’t any variance, and every team won exactly as many games as expected, the sport would be rather boring. We should celebrate the variance that exists, allowing surprising teams to rise up and reward their fans with exciting and unexpected wins. We just shouldn’t allow those exciting rare wins to make us think that the rules apply to everyone except for our favorite team. Embrace randomness, but embrace it for what it is, and don’t try to turn it into something it isn’t.
Dave is the Managing Editor of FanGraphs.