WAR: Imperfect but Useful Even in Small Samples

This morning, Jon Heyman noted an odd thing on Twitter:

He was quoting Baseball-Reference’s WAR calculation, and the two are indeed tied at +1.7 WAR on B-R. Here, we have Bryce Harper (+1.5 WAR) ahead of Starling Marte (+1.2 WAR), but the point still basically stands; WAR thinks Harper (1.200 OPS) and Marte (.835 OPS) have both been pretty great this year, with just a small (or no) difference between them. WAR believes that Marte has mostly made up for what Harper has done with the bat by using his legs, in baserunning (+3 run advantage) and defense (+3 run advantage), as well as getting a slight bump from 12 extra plate appearances.

There’s no question that Harper has been a better offensive player, but there are questions about the defensive valuations, because defensive metrics aren’t as refined at this point as offensive metrics are. It is much easier to prove that Harper has been +10 runs better with the bat this year than it is to prove that Marte has been +3 runs better defensively by UZR, or +7 runs better defensively by DRS. There are more sources for error in the defensive metrics, and Heyman’s tweet led to a discussion on Twitter about the usefulness of including small sample defensive metrics in WAR.

I’ve written before about the strong correlation between team WAR and team winning percentage, and others have followed up with similar analysis more recently. However, all those articles have focused on full season or multi-season data samples, and since the question was raised and I hadn’t yet seen it answered, I became curious about whether WAR would actually correlate better at this point in the year if we just assumed every player in baseball was an average defender.

Essentially, if we just removed defensive metrics from the equation, and evaluated teams solely on their hitting and pitching, how would our WAR calculation compare to team winning percentage? And how does WAR correlate to team winning percentage based on just April 2013 data, when we’re dealing with much smaller sample sizes?

To answer that question, I turned to the numbers. To convert team WAR into an expected winning percentage, I just added position player and pitcher WAR to get total team WAR, divided by games played, multiplied WAR per game by 162 to get a full season total, and then added +47.7 wins — the replacement level assumption — to extrapolated team WAR. I then divided that total by 162 games to get a team’s expected winning percentage based solely on their WAR total, and compared that winning percentage to their actual current winning percentage. Here’s the table showing that comparison.

Team            Winning%   WARWin%
Red Sox         0.720      0.706
Rangers         0.640      0.630
Braves          0.625      0.578
Yankees         0.625      0.561
Diamondbacks    0.600      0.574
Orioles         0.600      0.550
Pirates         0.600      0.490
Rockies         0.600      0.610
Royals          0.591      0.608
Cardinals       0.583      0.494
Tigers          0.565      0.660
Athletics       0.538      0.552
Reds            0.538      0.591
Twins           0.524      0.461
Brewers         0.522      0.438
Giants          0.520      0.570
Nationals       0.520      0.486
Dodgers         0.500      0.515
Rays            0.480      0.530
Phillies        0.462      0.437
Mets            0.435      0.477
White Sox       0.417      0.428
Indians         0.409      0.517
Mariners        0.407      0.413
Angels          0.375      0.386
Cubs            0.375      0.432
Padres          0.375      0.328
Blue Jays       0.346      0.391
Astros          0.280      0.350
Marlins         0.240      0.242

Correlation: 0.880
R squared: 0.775
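
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the conversion described above. The function name and the sample inputs are made up for illustration; only the 162-game season and the 47.7-win replacement level come from the method itself.

```python
REPLACEMENT_WINS = 47.7  # replacement-level wins over a full season, per the article
SEASON_GAMES = 162


def war_win_pct(position_war, pitcher_war, games_played):
    """Expected winning percentage implied by a team's WAR to date."""
    team_war = position_war + pitcher_war
    # Prorate WAR per game out to a full 162-game season.
    full_season_war = team_war / games_played * SEASON_GAMES
    # Add the wins a replacement-level team is assumed to collect.
    expected_wins = full_season_war + REPLACEMENT_WINS
    return expected_wins / SEASON_GAMES


# Hypothetical team: 3.0 position player WAR and 2.0 pitcher WAR through 25 games.
print(round(war_win_pct(3.0, 2.0, 25), 3))
```

Algebraically this reduces to WAR per game plus 47.7/162, so a team with exactly zero WAR projects to roughly a .294 winning percentage no matter how many games it has played.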

The correlation between actual team winning percentage and expected team winning percentage based on WAR is .88, which is almost exactly what Glenn DuPaul found when testing the correlation between full season WAR and team winning percentage last summer. It’s higher than what I got when I compared WAR to team winning percentage back in 2009, before we added things like baserunning to improve the formula. With about 15% of the season completed, current team WAR explains about 78% of the variance in current team winning percentage.
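
The correlation and r squared can be recomputed from the two columns in the table. The sketch below is one way to do it in plain Python; the variable and function names are my own, the pairs are copied from the table (Red Sox through Marlins), and because the table values are rounded to three decimals the result should land close to, but not exactly on, the reported figures.

```python
from math import sqrt

# (actual winning %, WAR-based winning %) for all 30 teams, in table order.
PAIRS = [
    (0.720, 0.706), (0.640, 0.630), (0.625, 0.578), (0.625, 0.561),
    (0.600, 0.574), (0.600, 0.550), (0.600, 0.490), (0.600, 0.610),
    (0.591, 0.608), (0.583, 0.494), (0.565, 0.660), (0.538, 0.552),
    (0.538, 0.591), (0.524, 0.461), (0.522, 0.438), (0.520, 0.570),
    (0.520, 0.486), (0.500, 0.515), (0.480, 0.530), (0.462, 0.437),
    (0.435, 0.477), (0.417, 0.428), (0.409, 0.517), (0.407, 0.413),
    (0.375, 0.386), (0.375, 0.432), (0.375, 0.328), (0.346, 0.391),
    (0.280, 0.350), (0.240, 0.242),
]


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson_r(PAIRS)
# r squared is the share of variance in winning % that the WAR-based estimate explains.
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```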

Considering that WAR doesn’t include any kind of situational context, and we know the sequencing of hits and runs can have a major impact on a team’s win-loss record, that’s still a very robust correlation. That correlation suggests that WAR is doing a lot of things right in terms of measuring the results that lead to wins and losses.

It is almost certainly doing some things wrong as well, and it is theoretically possible that what WAR is getting right is hitting and pitching, and the defensive component is weakening what would be an even stronger correlation if the fielding metrics weren’t included. So, let’s check that out. Here’s a table of team winning percentage compared to a WAR-based winning percentage that assumes every player in baseball has played average defense this season. This is WAR with UZR removed, essentially.

Team            Winning%   NoFldWARWin%
Red Sox         0.720      0.666
Rangers         0.640      0.608
Braves          0.625      0.545
Yankees         0.625      0.559
Diamondbacks    0.600      0.524
Orioles         0.600      0.510
Pirates         0.600      0.476
Rockies         0.600      0.605
Royals          0.591      0.557
Cardinals       0.583      0.515
Tigers          0.565      0.709
Athletics       0.538      0.608
Reds            0.538      0.562
Twins           0.524      0.570
Brewers         0.522      0.431
Giants          0.520      0.512
Nationals       0.520      0.475
Dodgers         0.500      0.508
Rays            0.480      0.485
Phillies        0.462      0.463
Mets            0.435      0.510
White Sox       0.417      0.445
Indians         0.409      0.482
Mariners        0.407      0.419
Angels          0.375      0.398
Cubs            0.375      0.440
Padres          0.375      0.363
Blue Jays       0.346      0.428
Astros          0.280      0.366
Marlins         0.240      0.297

Correlation: 0.824
R squared: 0.679

Get rid of those crappy small sample useless defensive metrics that are full of errors and bias and you end up with a lower correlation to team wins and losses. The r squared drops to .68, so this version explains just 68% of the variance in team winning percentage. A month into the 2013 season, WAR explains less about team performance without UZR than it does with it.

The original source of Heyman’s tweet, though, wasn’t UZR. He was quoting B-R’s WAR calculation, which uses Defensive Runs Saved as its fielding metric. UZR isn’t nearly as bullish on Starling Marte’s defensive performance as DRS, so maybe it’s BIS’ fielding metric that’s the problem here? To check, I swapped out UZR for DRS in our WAR calculation and re-ran the numbers. One more table.

Team            Winning%   DRSWARWin%
Red Sox         0.720      0.687
Rangers         0.640      0.663
Braves          0.625      0.567
Yankees         0.625      0.572
Diamondbacks    0.600      0.588
Orioles         0.600      0.544
Pirates         0.600      0.562
Rockies         0.600      0.631
Royals          0.591      0.553
Cardinals       0.583      0.489
Tigers          0.565      0.653
Athletics       0.538      0.496
Reds            0.538      0.607
Twins           0.524      0.488
Brewers         0.522      0.500
Giants          0.520      0.495
Nationals       0.520      0.488
Dodgers         0.500      0.556
Rays            0.480      0.519
Phillies        0.462      0.459
Mets            0.435      0.491
White Sox       0.417      0.410
Indians         0.409      0.482
Mariners        0.407      0.395
Angels          0.375      0.328
Cubs            0.375      0.418
Padres          0.375      0.309
Blue Jays       0.346      0.393
Astros          0.280      0.375
Marlins         0.240      0.224

Correlation: 0.898
R squared: 0.806

Well, that’s not it. WAR with DRS comes up with basically the same correlation to team winning percentage as WAR with UZR, and both do better than WAR without any defensive component.
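
To put the three versions side by side, the following Python sketch recomputes each correlation from the three tables. The names are my own, the rows are copied from the tables in team order (Red Sox through Marlins), and since the inputs are rounded the results should only approximate the reported .880, .824 and .898.

```python
from math import sqrt

# Each row: (actual win %, WAR with UZR, WAR with no fielding, WAR with DRS).
ROWS = [
    (0.720, 0.706, 0.666, 0.687), (0.640, 0.630, 0.608, 0.663),
    (0.625, 0.578, 0.545, 0.567), (0.625, 0.561, 0.559, 0.572),
    (0.600, 0.574, 0.524, 0.588), (0.600, 0.550, 0.510, 0.544),
    (0.600, 0.490, 0.476, 0.562), (0.600, 0.610, 0.605, 0.631),
    (0.591, 0.608, 0.557, 0.553), (0.583, 0.494, 0.515, 0.489),
    (0.565, 0.660, 0.709, 0.653), (0.538, 0.552, 0.608, 0.496),
    (0.538, 0.591, 0.562, 0.607), (0.524, 0.461, 0.570, 0.488),
    (0.522, 0.438, 0.431, 0.500), (0.520, 0.570, 0.512, 0.495),
    (0.520, 0.486, 0.475, 0.488), (0.500, 0.515, 0.508, 0.556),
    (0.480, 0.530, 0.485, 0.519), (0.462, 0.437, 0.463, 0.459),
    (0.435, 0.477, 0.510, 0.491), (0.417, 0.428, 0.445, 0.410),
    (0.409, 0.517, 0.482, 0.482), (0.407, 0.413, 0.419, 0.395),
    (0.375, 0.386, 0.398, 0.328), (0.375, 0.432, 0.440, 0.418),
    (0.375, 0.328, 0.363, 0.309), (0.346, 0.391, 0.428, 0.393),
    (0.280, 0.350, 0.366, 0.375), (0.240, 0.242, 0.297, 0.224),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


actual = [row[0] for row in ROWS]
results = {}
for label, col in (("UZR", 1), ("No fielding", 2), ("DRS", 3)):
    r = pearson_r(actual, [row[col] for row in ROWS])
    results[label] = r
    print(f"{label:12s} r = {r:.3f}  r squared = {r * r:.3f}")
```

Both versions that include a fielding metric should come out ahead of the no-fielding version, which is the whole point of the comparison.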

Now, maximizing correlation to team winning percentage should not be the goal of WAR. If it were, we’d just make the inputs runs scored and runs allowed, and the correlation would be something like .99. It wouldn’t be a better metric simply because it was more highly correlated with winning percentage. This test is essentially a sanity check to make sure that WAR is actually measuring things that impact team wins and losses. The inputs of WAR were chosen to try to identify context-neutral individual player performance, and it’s a good sign that things chosen for those reasons end up correlating well to team wins and losses. It tells us that WAR is working pretty well, even in small samples. Even with imperfect inputs. Even with defensive inputs that are best used in the largest sample you can possibly get.

WAR is not perfect, nor is it precise. It is best used in whole numbers, with any fractional difference being seen as a marginal gap at best, especially if that difference is based mostly on the defensive components. I wouldn’t say that Starling Marte has been Bryce Harper’s equal so far, because I doubt that’s true. You shouldn’t take two dozen games’ worth of WAR at face value.

But you shouldn’t take two dozen games’ worth of anything at face value. The major league leader in ERA is currently Jake Westbrook, at 0.98. If you took ERA at face value, you’d have to argue that Jake Westbrook has been the best pitcher in baseball, and is on pace to have the best pitching season in the history of the game. No one actually believes that, and no one is arguing that, because everyone knows that in a month’s worth of games, you’re going to see some funky results. Funky results in 24 games do not invalidate a metric.

Just like every statistic under the sun, WAR is better when used in large samples. But, despite the beatings it takes on a regular basis, WAR actually does its job pretty well — not perfectly, because it is not a perfect model — even just based on April data alone.

And, getting back to the original point, I’ll note that WAR is very good at spotlighting players like Starling Marte, who deserve recognition but probably aren’t getting it due to the continuing focus on the triple crown statistics in the mainstream media. Today, because of WAR, a lot of people learned that Starling Marte is having a pretty great April. I’ll call that a win, even if the calculation might be off by a few runs here or there.





Dave is the Managing Editor of FanGraphs.

Eric
10 years ago

Out of curiosity, would you expect to get a better correlation by using offensive WAR and pitcher WAR based on RA9 instead of FIP? All defensive performance would be bundled into the pitcher stat so you wouldn’t have to worry about the imperfections of UZR and DRS. Obviously this wouldn’t be super useful for evaluating individual performance, but it would be interesting to see on the team level.

Eric
10 years ago
Reply to  Dave Cameron

Interesting. Thanks!