WAR: Imperfect but Useful Even in Small Samples

by Dave Cameron

April 29, 2013

This morning, Jon Heyman noted an odd thing on Twitter:

i am not a hater of WAR stat, but if someone can explain to me how starling marte & bryce harper are both 1.7, please do

— Jon Heyman (@JonHeymanCBS) April 29, 2013

He was quoting Baseball-Reference’s WAR calculation, and the two are indeed tied at +1.7 WAR on B-R. Here, we have Bryce Harper (+1.5 WAR) ahead of Starling Marte (+1.2 WAR), but the point still basically stands; WAR thinks Harper (1.200 OPS) and Marte (.835 OPS) have both been pretty great this year, with just a small (or no) difference between them. WAR believes that Marte has made up most of what Harper has done with the bat by using his legs, in baserunning (+3 run advantage) and defense (+3 run advantage), as well as getting a slight bump from 12 extra plate appearances.

There’s no question that Harper has been a better offensive player, but there are questions about the defensive valuations, because defensive metrics aren’t as refined at this point as offensive metrics are. It is much easier to prove that Harper has been +10 runs better with the bat this year than it is to prove that Marte has been +3 runs better defensively by UZR, or +7 runs better defensively by DRS. There are more sources for error in the defensive metrics, and Heyman’s tweet led to a discussion on Twitter about the usefulness of including small sample defensive metrics in WAR.

I’ve written before about the strong correlation between team WAR and team winning percentage, and others have followed up with similar analysis more recently. However, all those articles have focused on full season or multi-season data samples, and since the question was raised and I hadn’t yet seen it answered, I became curious about whether WAR would actually correlate better at this point in the year if we just assumed every player in baseball was an average defender.

Essentially, if we just removed defensive metrics from the equation, and evaluated teams solely on their hitting and pitching, how would our WAR calculation compare to team winning percentage? And how does WAR correlate to team winning percentage based on just April 2013 data, when we’re dealing with much smaller sample sizes?

To answer that question, I turned to the numbers. To convert team WAR into an expected winning percentage, I just added position player and pitcher WAR to get total team WAR, divided by games played, multiplied WAR per game by 162 to get a full season total, and then added +47.7 wins — the replacement level assumption — to extrapolated team WAR. I then divided that total by 162 games to get a team’s expected winning percentage based solely on their WAR total, and compared that winning percentage to their actual current winning percentage. Here’s the table showing that comparison.

Team	Winning%	WARWin%	Correlation	R squared
Red Sox	0.720	0.706	0.880	0.775
Rangers	0.640	0.630
Braves	0.625	0.578
Yankees	0.625	0.561
Diamondbacks	0.600	0.574
Orioles	0.600	0.550
Pirates	0.600	0.490
Rockies	0.600	0.610
Royals	0.591	0.608
Cardinals	0.583	0.494
Tigers	0.565	0.660
Athletics	0.538	0.552
Reds	0.538	0.591
Twins	0.524	0.461
Brewers	0.522	0.438
Giants	0.520	0.570
Nationals	0.520	0.486
Dodgers	0.500	0.515
Rays	0.480	0.530
Phillies	0.462	0.437
Mets	0.435	0.477
White Sox	0.417	0.428
Indians	0.409	0.517
Mariners	0.407	0.413
Angels	0.375	0.386
Cubs	0.375	0.432
Padres	0.375	0.328
Blue Jays	0.346	0.391
Astros	0.280	0.350
Marlins	0.240	0.242

The correlation between actual team winning percentage and expected team winning percentage based on WAR is .88, which is almost exactly what Glenn DuPaul found when testing the correlation between full season WAR and team winning percentage last summer. It’s higher than what I got when I compared WAR to team winning percentage back in 2009, before we added things like baserunning to improve the formula. With about 15% of the season completed, current team WAR explains 78% of current team winning percentage.
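For readers who want to check the arithmetic, here is a minimal Python sketch of the conversion and the correlation described above. The function names and the sample team are illustrative assumptions, not the code actually used for this post; the only inputs taken from the article are the +47.7 replacement level figure, the 162-game season, and the two columns of the table.

```python
from math import sqrt

REPLACEMENT_WINS = 47.7  # replacement-level wins per 162 games, as stated above
SEASON_GAMES = 162


def war_win_pct(position_war, pitcher_war, games_played):
    """Expected winning percentage implied by a team's WAR to date."""
    team_war = position_war + pitcher_war
    war_per_162 = team_war / games_played * SEASON_GAMES  # extrapolate to a full season
    return (war_per_162 + REPLACEMENT_WINS) / SEASON_GAMES


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical team (not from the table): 6.0 position player WAR and
# 4.0 pitcher WAR through 25 games.
example = war_win_pct(6.0, 4.0, 25)

# Winning% and WARWin% columns from the table above, in table order.
actual = [0.720, 0.640, 0.625, 0.625, 0.600, 0.600, 0.600, 0.600, 0.591, 0.583,
          0.565, 0.538, 0.538, 0.524, 0.522, 0.520, 0.520, 0.500, 0.480, 0.462,
          0.435, 0.417, 0.409, 0.407, 0.375, 0.375, 0.375, 0.346, 0.280, 0.240]
war_based = [0.706, 0.630, 0.578, 0.561, 0.574, 0.550, 0.490, 0.610, 0.608, 0.494,
             0.660, 0.552, 0.591, 0.461, 0.438, 0.570, 0.486, 0.515, 0.530, 0.437,
             0.477, 0.428, 0.517, 0.413, 0.386, 0.432, 0.328, 0.391, 0.350, 0.242]

r = pearson_r(actual, war_based)
print(f"example team: {example:.3f}")
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Because the table values are rounded to three decimals, the result should land close to, but not necessarily exactly on, the figures reported in the table.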

Considering that WAR doesn’t include any kind of situational context, and we know the sequencing of hits and runs can have a major impact on a team’s win-loss record, that’s still a very robust correlation. That correlation suggests that WAR is doing a lot of things right in terms of measuring the results that lead to wins and losses.

It is almost certainly doing some things wrong as well, and it is theoretically possible that what WAR is getting right is hitting and pitching, and the defensive component is weakening what would be an even stronger correlation if the fielding metrics weren’t included. So, let’s check that out. Here’s a table of team winning percentage compared to a WAR-based winning percentage that assumes every player in baseball has played average defense this season. This is WAR with UZR removed, essentially.

Team	Winning%	NoFldWARWin%	Correlation	R squared
Red Sox	0.720	0.666	0.824	0.679
Rangers	0.640	0.608
Braves	0.625	0.545
Yankees	0.625	0.559
Diamondbacks	0.600	0.524
Orioles	0.600	0.510
Pirates	0.600	0.476
Rockies	0.600	0.605
Royals	0.591	0.557
Cardinals	0.583	0.515
Tigers	0.565	0.709
Athletics	0.538	0.608
Reds	0.538	0.562
Twins	0.524	0.570
Brewers	0.522	0.431
Giants	0.520	0.512
Nationals	0.520	0.475
Dodgers	0.500	0.508
Rays	0.480	0.485
Phillies	0.462	0.463
Mets	0.435	0.510
White Sox	0.417	0.445
Indians	0.409	0.482
Mariners	0.407	0.419
Angels	0.375	0.398
Cubs	0.375	0.440
Padres	0.375	0.363
Blue Jays	0.346	0.428
Astros	0.280	0.366
Marlins	0.240	0.297

Get rid of those crappy small sample useless defensive metrics that are full of errors and bias and you end up with a lower correlation to team wins and losses. The r squared drops to .68, meaning WAR now explains just 68% of the variation in team winning percentage. A month into the 2013 season, WAR explains less about team performance without UZR than it does with it.

The original source of Heyman’s tweet, though, wasn’t UZR. He was quoting B-R’s WAR calculation, which uses Defensive Runs Saved as its fielding metric. UZR isn’t nearly as bullish on Starling Marte’s defensive performance as DRS, so maybe it’s BIS’ fielding metric that’s the problem here? To check, I swapped out UZR for DRS in our WAR calculation and re-ran the numbers again. One more table.

Team	Winning%	DRSWARWin%	Correlation	R squared
Red Sox	0.720	0.687	0.898	0.806
Rangers	0.640	0.663
Braves	0.625	0.567
Yankees	0.625	0.572
Diamondbacks	0.600	0.588
Orioles	0.600	0.544
Pirates	0.600	0.562
Rockies	0.600	0.631
Royals	0.591	0.553
Cardinals	0.583	0.489
Tigers	0.565	0.653
Athletics	0.538	0.496
Reds	0.538	0.607
Twins	0.524	0.488
Brewers	0.522	0.500
Giants	0.520	0.495
Nationals	0.520	0.488
Dodgers	0.500	0.556
Rays	0.480	0.519
Phillies	0.462	0.459
Mets	0.435	0.491
White Sox	0.417	0.410
Indians	0.409	0.482
Mariners	0.407	0.395
Angels	0.375	0.328
Cubs	0.375	0.418
Padres	0.375	0.309
Blue Jays	0.346	0.393
Astros	0.280	0.375
Marlins	0.240	0.224

Well, that’s not it. WAR with DRS comes up with basically the same correlation to team winning percentage as WAR with UZR, and both do better than WAR without any defensive component.
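As a quick side-by-side, the short Python sketch below squares the three correlations reported in the tables to recover the share of winning percentage each version explains. The labels are mine, and the inputs are the rounded correlations from the tables, so the last digit can differ slightly from the published R squared values.

```python
# Correlations as reported in the three tables above (rounded to three decimals).
reported_r = {
    "WAR with UZR": 0.880,
    "WAR without fielding": 0.824,
    "WAR with DRS": 0.898,
}

# R squared is the square of the correlation coefficient.
r_squared = {label: round(r * r, 3) for label, r in reported_r.items()}

# Sort from most to least explanatory.
ranked = sorted(r_squared, key=r_squared.get, reverse=True)

for label in ranked:
    print(f"{label}: r = {reported_r[label]:.3f}, r squared = {r_squared[label]:.3f}")
```

Both versions that include a fielding metric come out ahead of the version that assumes average defense for everyone.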

Now, maximizing correlation to team winning percentage should not be the goal of WAR. If it were, we’d just make the inputs runs scored and runs allowed, and the correlation would be something like .99. It wouldn’t be a better metric simply because it was more highly correlated with winning percentage. This test is essentially a sanity check to make sure that WAR is actually measuring things that impact team wins and losses. The inputs of WAR were chosen to try and identify context-neutral individual player performance, and it’s a good sign that things chosen for those reasons end up correlating well to team wins and losses. It tells us that WAR is working pretty well, even in small samples. Even with imperfect inputs. Even with defensive inputs that are best used in the largest sample you can possibly get.

WAR is not perfect, nor is it precise. It is best used in whole numbers, with any fractional difference being seen as a marginal gap at best, especially if that difference is based mostly on the defensive components. I wouldn’t say that Starling Marte has been Bryce Harper’s equal so far, because I doubt that’s true. You shouldn’t take two dozen games worth of WAR at face value.

But you shouldn’t take two dozen games worth of anything at face value. The major league leader in ERA is currently Jake Westbrook, at 0.98. If you took ERA at face value, you’d have to argue that Jake Westbrook has been the best pitcher in baseball, and is on pace to have the best pitching season in the history of the game. No one actually believes that, and no one is arguing that, because everyone knows that in a month’s worth of games, you’re going to see some funky results. Funky results in 24 games do not invalidate a metric.

Just like every statistic under the sun, WAR is better when used in large samples. But, despite the beatings it takes on a regular basis, WAR actually does its job pretty well — not perfectly, because it is not a perfect model — even just based on April data alone.

And, getting back to the original point, I’ll note that WAR is very good at spotlighting players like Starling Marte, who deserve recognition but probably aren’t getting it due to the continuing focus on the triple crown statistics in the mainstream media. Today, because of WAR, a lot of people learned that Starling Marte is having a pretty great April. I’ll call that a win, even if the calculation might be off by a few runs here or there.

73 Comments

Eric

12 years ago

Out of curiosity, would you expect to get a better correlation by using offensive WAR and pitcher WAR based on RA9 instead of FIP? All defensive performance would be bundled into the pitcher stat so you wouldn’t have to worry about the imperfections of UZR and DRS. Obviously this wouldn’t be super useful for evaluating individual performance, but it would be interesting to see on the team level.

Dave Cameron

Reply to Eric

Yes, if you use RA9 instead of FIP and some fielding metric to account for defense, you’ll come up with a higher correlation. But you’ll also then be including context into the metric, because runs allowed include sequencing. If you did that, you’d have a metric that included sequencing for pitching but not for hitting, unless you also swapped out batting runs for RE24. In other words, you wouldn’t be measuring pitchers and hitters the same way anymore.

What inputs you use all depend on what questions you’re asking. There are times when you want to include sequencing and times you don’t.

Reply to Dave Cameron

Interesting. Thanks!
