Putting WAR in Context: A Response to Bill James

Nine years ago next month, we introduced a new stat to the pages of FanGraphs. We called it Win Values, and on the player pages and leaderboards, it went by the acronym WAR. We wouldn’t actually start calling it that, or use the words for which the acronym stood (Wins Above Replacement) for a little while, since we thought Win Values sounded cooler. And as the people who bring you WPA/LI and RE24, we’re clearly the experts on statistical naming coolness.

Over the last nine years, WAR has become something of a flagship metric, not just for us, but for the analytical community at large. Baseball-Reference introduced their own version, while Baseball Prospectus modernized their version of WARP — their version adds the word player to the name, thus the P — to provide something that scaled a bit more like what was presented here and at B-R. Because WAR is a framework for combining a number of different metrics into a single-value stat, there are also quite a few other versions of WAR out there, each with their own calculations.

But while everyone uses different inputs, and therefore arrives at slightly different results, almost all of the regularly updated WAR metrics are built on some version of linear weights, which assigns an average run value to each event in which a player is involved, regardless of what actually happened on the play. If you hit a single, you get credit for hitting a single. It’s worth some fraction of a run, regardless of whether you hit it with two outs and the bases empty in the first inning of an eventual blowout, or whether it was a walk-off two-run single to give your team the lead. In most versions of WAR, the value of a player’s contribution is calculated independent of the situation in which it occurred.
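To make the context-neutral idea concrete, here is a minimal Python sketch of a linear-weights tally. The run values are rough, illustrative placeholders rather than the weights used by any particular WAR implementation, and the event fields (`inning`, `outs`, `runners`) are hypothetical names included only to show that the calculation never reads them.

```python
# Illustrative average run values per event. These are approximate
# placeholders for demonstration, NOT the weights of any real WAR model.
RUN_VALUES = {
    "single": 0.47,
    "double": 0.77,
    "triple": 1.04,
    "home_run": 1.40,
    "walk": 0.31,
    "out": -0.27,
}


def linear_weights_runs(events):
    """Sum the average run value of each event, ignoring the game situation."""
    return sum(RUN_VALUES[event["type"]] for event in events)


# Two singles in very different situations (hypothetical context fields).
blowout_single = {"type": "single", "inning": 1, "outs": 2, "runners": 0}
walkoff_single = {"type": "single", "inning": 9, "outs": 2, "runners": 2}

# Both receive identical credit because the context fields are never used.
print(linear_weights_runs([blowout_single]))
print(linear_weights_runs([walkoff_single]))
```

Under this approach the only thing that matters is the type of event, which is exactly the property the rest of this discussion turns on.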

Bill James is not a fan of that decision.

“We come, then, to the present moment, at which some of my friends and colleagues wish to argue that Aaron Judge is basically even with Jose Altuve, and might reasonably have been the Most Valuable Player. It’s nonsense. Aaron Judge was nowhere near as valuable as Jose Altuve. Why? Because he didn’t do nearly as much to win games for his team as Altuve did. It is NOT close. The belief that it is close is fueled by bad statistical analysis, not as bad as the 1974 statistical analysis, I grant, but flawed nonetheless. It is based essentially on a misleading statistic, which is WAR. Baseball-Reference WAR shows the little guy at 8.3, and the big guy at 8.1. But in reality, they are nowhere near that close. I am not saying that WAR is a bad statistic or a useless statistic, but it is not a perfect statistic, and in this particular case it is just dead wrong. It is dead wrong because the creators of that statistic have severed the connection between performance statistics and wins, thus undermining their analysis.”

James strongly believes that the metric falls apart by building up from runs, rather than working backwards from wins, since the context-neutral nature of the metric means that what WAR estimates a group of players are worth won’t add up to how many wins their team actually won. In his mind, the decision to make WAR context-neutral isn’t a point on which reasonable people can disagree; it’s just a mistake.

James, of course, is a pioneer in this field, and FanGraphs probably doesn’t exist today if not for the work he did in the 1970s and 1980s, laying the groundwork for most everything that has come since. So when a person with his resume suggests that WAR isn’t just imperfect — which it absolutely is, no matter which version you use — but instead wrongly constructed, I think it’s worth responding to. So let me take a few minutes to talk about the relationship between context and value, and how that informs people’s preferences for building out WAR in this way.

My primary guiding principle on the usefulness of a metric is twofold:

1. What question is it answering, and is the answer to that question interesting? If yes, proceed. If no, ignore.

2. Does the metric answer that question accurately?

There are plenty of statistics that keep an accurate count of things that happened. In many cases, however, these numbers answer only trivia questions. Who got the most hits on Tuesdays in 2017? There’s an accurate and measurable answer to that question, but I have no idea what it is, because it doesn’t matter in any tangible way.

WAR, on the other hand, attempts to address a question that a lot of people seem interested in answering. If the WAR leaderboards were posed as a question, they might be written as something like this:

“What did each player do, as an individual, to help his team try to win games?”

Wins are a team accomplishment, an amalgamation of performances from a large number of different players. The way WAR has generally been constructed means that it attempts to isolate the player’s contributions towards winning games. And when it comes down to assigning value to individuals for the events in which they’re involved, the general consensus in the sabermetric community has been that we want to reward (or penalize) hitters for what they can control. And the context of the situations in which they play is just not something players can create.

To be clear, this decision wasn’t made solely when WAR models started getting calculated online for people to track. Pretty much every single baseball statistic that is used on a daily basis, regardless of how analytically inclined the user is, is designed to be context-neutral.

Batting average, on-base percentage, slugging percentage, home runs, stolen bases, walks, and strikeouts: all of it is counted without regard to the number of runners on base, the inning, or the score. Outside of runs and RBIs, pretty much every mainstream measure of individual hitter evaluation has been designed to ignore the context of the situation in which it occurred. And runs and RBIs only consider baserunner situation, not number of outs, the score, or the inning in which they occurred. On the pitching side of things, ERA includes baserunner performance, but not inning or score.

There are context-dependent metrics, of course, and we’ve tried to do our part to promote Win Probability and Leverage Index as tools to be included in the discussion of value. We have a number of stats that measure different levels of context-specific performance, ranging from RE24 (baserunner/out context included, inning/score excluded) up to WPA, which includes everything. We keep a statistic called Clutch to track the difference in win values between a player’s context-neutral performance and his context-specific performance.
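As a rough illustration of what "baserunner/out context included" means for RE24, the sketch below credits a play with the change in run expectancy plus any runs that scored. The run-expectancy numbers are made up for the example; a real table would be derived from historical play-by-play data, and the state encoding is a hypothetical convention.

```python
# Hypothetical run-expectancy table: expected runs for the rest of the
# inning, keyed by (runners, outs). "1--" means a runner on first only,
# "-23" means runners on second and third. Values are invented.
RUN_EXPECTANCY = {
    ("---", 2): 0.10,
    ("1--", 2): 0.22,
    ("-23", 2): 0.58,
}


def re24(start_state, end_state, runs_scored):
    """Run value of a play: change in run expectancy plus runs scored."""
    return RUN_EXPECTANCY[end_state] - RUN_EXPECTANCY[start_state] + runs_scored


# A two-out single with the bases empty.
empty_single = re24(("---", 2), ("1--", 2), runs_scored=0)

# A two-out single that scores runners from second and third.
two_run_single = re24(("-23", 2), ("1--", 2), runs_scored=2)

print(round(empty_single, 2), round(two_run_single, 2))
```

The same event type earns very different credit depending on who was on base, yet the inning and score never enter the calculation. That is the step on the context scale between linear weights and WPA.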

So why don’t we build WAR off of one of these numbers instead of a linear-weights-based method? Would WAR be better if it included the context of the events, rather than just an estimate of the player’s contribution to the result based on historical averages?

I think the answer is that it depends on how you’re using WAR. In the case of MVP voting, I do think there is a case to be made for looking at the circumstances under which a player performed, and I did use context-dependent metrics when I was an MVP voter. WAR is an imperfect tool, and it’s particularly imperfect for things like the MVP award, which is why even those of us who host sites that promote WAR fairly extensively suggest not relying solely on its results when filling out a ballot.

But if one wanted to build a complete and thorough version of WAR that tied back perfectly to a team’s win total, the reality is that it would not be particularly useful for answering many other questions. That is because assigning an individual player the true contextual value of his performance requires far more adjustments than the simple fix proposed in James’s article.

For instance, one of the main reasons we use BaseRuns instead of linear weights as the underpinnings of our expected won-loss record calculations is that run scoring isn’t linear. As you stack more and more good hitters together, the increase in run production will go up faster than a simple addition of run values would suggest. So, if you wanted to do a true valuation of a player’s performance, you’d have to account for his teammates’ performances, as well, and how the combination of the two translated into runs scored.
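The non-linearity can be seen with a small sketch. The formula below is one commonly cited simple version of BaseRuns, used here as an assumption; it is not necessarily the variant behind any site's expected-record calculations, and every stat line is invented for illustration.

```python
def base_runs(ab, h, tb, hr, bb):
    """One commonly cited simple BaseRuns variant (assumed for illustration)."""
    a = h + bb - hr                                       # baserunners
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02   # advancement
    c = ab - h                                            # outs
    d = hr                                                # guaranteed runs
    return a * b / (b + c) + d


def add_lines(x, y):
    """Combine two stat lines field by field."""
    return {key: x[key] + y[key] for key in x}


# Invented stat lines: the rest of a weak lineup, the rest of a strong
# lineup, and one good hitter who could join either.
weak = {"ab": 5000, "h": 1200, "tb": 1800, "hr": 100, "bb": 400}
strong = {"ab": 5000, "h": 1400, "tb": 2300, "hr": 200, "bb": 600}
player = {"ab": 600, "h": 180, "tb": 320, "hr": 35, "bb": 80}

# Runs the same hitter adds to each lineup.
marginal_weak = base_runs(**add_lines(weak, player)) - base_runs(**weak)
marginal_strong = base_runs(**add_lines(strong, player)) - base_runs(**strong)

print(round(marginal_weak, 1), round(marginal_strong, 1))
```

With these made-up inputs, the identical stat line is worth a few more runs in the stronger lineup, which is the teammate effect a fully contextual valuation would have to assign to someone.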

But if you do that, you’re explicitly giving a player win-value credit for having better teammates. Jose Altuve is great on his own, but what do we do with a metric that says he’s even better because he has Carlos Correa hitting behind him? That metric is no longer measuring each player’s performance but the specific value he created in that specific lineup, based on what other players did in the aggregate.

You can drill down even further if you want to take the context argument to its logical conclusion. If we want to reduce Aaron Judge’s WAR by the amount of wins he cost his team by performing poorly with men on base, do we want to also adjust his WAR (and everyone else’s) by how his teammates did after he reached base? If we’re primarily interested in tying performance back to team wins, so that a bases-loaded hit is worth more than a bases-empty hit, the same logic would suggest that a base hit in front of a home run is worth more than a base hit in front of a double play. In either case, the player wasn’t responsible for what came before or after his individual contribution, but the results were wildly different, and a hit in front of a double play doesn’t contribute to a win any more than a ground out would have.

And that’s just baserunner/out context. Score and inning context is even thornier ground, because no one really wants to conclude that a home run with your team down 10-0 is worthless. If an individual player metric cannot reward an individual player for his own performance because his teammates were so bad that the game was already out of hand, then it is no longer an individual performance metric.

We’ve written a lot of articles over the years about the pros and cons of context-neutral and context-dependent metrics, and offer a range of stats that attempt to measure value in nearly every phase of the context scale.

But WAR became popular as a metric largely because it attempts to isolate just a player’s individual contribution to a team’s wins and losses, and once you start adding in some context, there’s no real reason to stop until you’re at WPA, which tells you that Melky Cabrera’s 98 wRC+ resulted in a better offensive season than Jose Ramirez’s 148 wRC+. And even then, WPA ignores the after-he-hit events, so even that isn’t really telling you the full story about the value of Melky’s offensive inputs as they relate to wins.

Once you adjust for the full context of a player’s input into wins and losses, you’re left with a version of WAR that is so far removed from his own contributions that I don’t know what question it would answer. And if you include some context but not the rest, then you have to justify going only part way. For instance, James’s Pythagorean expected record was used for decades as a shorthand for “team luck,” because it stripped out the sequencing that turned runs into wins. But it didn’t do anything to strip out the sequencing that turned individual events into runs, so it only measured part of a team’s sequencing value. Why? Because any single metric can’t measure all things at all times.
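For reference, the Pythagorean expectation mentioned above can be sketched in a few lines. The exponent of 2 is the classic form, and the team totals below are hypothetical.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def pythagorean_luck(wins, losses, runs_scored, runs_allowed):
    """Actual wins minus Pythagorean expected wins (the "team luck" shorthand)."""
    games = wins + losses
    expected_wins = games * pythagorean_win_pct(runs_scored, runs_allowed)
    return wins - expected_wins


# Hypothetical team: 95-67 with 800 runs scored and 700 allowed.
print(round(pythagorean_luck(95, 67, 800, 700), 1))
```

Note that the inputs are runs already scored and allowed, so any sequencing that turned individual hits and walks into those runs is baked in. Only the runs-to-wins step is stripped out.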

Like every other metric in existence, WAR measures some things well and other things poorly. It is useful to answer some questions, but not all questions.

For the MVP voting, perhaps WAR is less useful than James would like it to be. On that point, I agree, and I used other metrics when filling out my MVP ballot when I was assigned to be a voter. But for many cases, the questions people are attempting to answer by using WAR are better answered by a context-neutral metric. It might not answer those questions perfectly, but it at least aligns with the questions about which fans are curious.

Is there a place for a context-dependent version of WAR? Perhaps. But then again, we’re already accused of undermining the model’s credibility by having multiple popular methods of calculation. And as James himself found when developing Win Shares, tying individual player performance to team wins isn’t quite as easy as one might hope.

I have no problem admitting that WAR as a model contains a number of flaws, or that our specific implementation of the framework is also flawed. There are a lot of areas for improvement. Forcing it to account precisely for the exact number of wins with which each team finished, though, would probably make it less useful overall as a measure of individual performance.





Dave is the Managing Editor of FanGraphs.

177 Comments
aschrag83member
6 years ago

Is it impolite to acknowledge that if he hadn’t been a pioneer in the field — which is important; popularizers are important! — nobody would care at all what James has to say these days? On merit, his ideas aren’t really interesting anymore. The world passed him by a while ago.

Dominikk85member
6 years ago
Reply to  aschrag83

Actually, Tom Tango (who definitely isn’t a dinosaur) has sided with Bill on this one. It isn’t about right or wrong; it is about what you want to measure: skill or result.

Bill is not senile. He understands the basic sabermetric stuff very well; his argument is that clutch affects real wins on the field, so why not factor it in?

Now, there is a problem with that: clutch isn’t really repeatable year to year and could be considered random, and it is also affected by team context (albeit you could easily normalize for that).

However, we also use other random factors. Many still use ERA with pitchers, and even FIP includes HR/FB luck.

And many consider wRC+ objective, but it is affected big time by luck (BABIP, HR/FB…). Even WAR is still based on those luck-influenced outcomes; Chris Taylor wouldn’t have had 5 WAR without outperforming his xwOBA by 30 points.

Bill’s argument is very valid: if we use other context-dependent stats (facing Billy Hamilton or McCutchen in center, or a head- or tailwind…), why not go full results-based?

I don’t agree with Bill here, however, but I also don’t think that wRC+ or OBP are objective stats; in fact, I would prefer to go full xStats to judge players.

I actually submitted an article on this in the community section last night:).

aschrag83member
6 years ago
Reply to  Dominikk85

Whether his argument is right isn’t really my point, though.

2wins87
6 years ago
Reply to  aschrag83

It kind of seems like it is though when you say that his ideas no longer have any merit.

aschrag83member
6 years ago
Reply to  2wins87

When’s the last time he wrote something original that changed how you thought about baseball? Performance vs value certainly isn’t a new thought — we’ve been kicking around these ideas for years.

I dunno, the answer to my original “is it impolite” question is, apparently, still yes. So I’ll be over here in my corner, quietly mystified about why we should really care what he thinks about anything anymore.

aguinness
6 years ago
Reply to  aschrag83

James has been working for the Red Sox since 2003, so my guess is most of his original writings are proprietary.

But the answer to your original question is partly in the question, that he is a pioneer in the field. He has probably forgotten more about baseball and statistical analysis than most people will ever know, plus he has a ton of credibility as a pioneer and current employee of a baseball team. Bill James has more than enough credibility that we should care about what he thinks, as opposed to some random cynical internet poster who goes by a mythological Greek creature.

RonnieDobbs
6 years ago
Reply to  aschrag83

I don’t think you have a point – just a progressive narrative.

Knoblaublah
6 years ago
Reply to  aschrag83

To dismiss someone’s argument by saying he doesn’t merit an opinion is the cheapest way of attacking that person’s point without addressing it.

jfree
6 years ago
Reply to  Dominikk85

Bill’s argument is correct but not imo because of ‘clutch’ or other batting-context stuff. It is important because position players have two very different functions – batting/baserunning (scoring) and fielding/defense (preventing scoring). As an aside – the former is also mostly individual and the latter is very much ‘team’ as well.

The context of the game dramatically changes the relative importance of one function vs the other. Sometimes offense is important, sometimes defense is important – and context may also change that relative weighting by position during the game too.

The main problem with WAR imo is that it assumes constancy in the relative weighting of defense v offense for each player throughout the game – and for each position relative to other positions throughout the game. That is a simple error – not a ‘disagreement’.

That particular context is a)repeatable and b)skills-based and c)is the best way of resolving the disconnect between runs and wins.

JoeGarrison
6 years ago
Reply to  jfree

The main problem with WAR is that it gives too much weight to the position listed on the front of the baseball card. If Rfield was measured correctly, then a position adjustment would not be needed.

Rotoholicmember
6 years ago
Reply to  Dominikk85

I feel like most people are only interested in PROJECTED stats. Projected WAR, projected wRC+, projected ERA, etc. Projected stats are, in my opinion, the best measure of talent. Or at least the most useful and interesting measures of talent.

Even retroactively to determine what actually happened (ie: MVP awards) rather than projecting future performance, I think you should look at what you would project them for based on the data from that time period as if the season was played out all over again in the same amount of games played, at the same age. And I think WAR is the best measure of that, by far. Much better than win shares or WPA. Incorporating luck into the equation just is not interesting at all, to me, and it should be removed. Obviously WAR itself includes BABIP-related luck, HR/FB% luck, strand rate luck, and lots and lots of forms of luck. But projected WAR does not. The issue here is that although the most recent season will be most heavily weighted in a WAR projection, it also takes into account prior seasons. So, maybe we make a ZiPS or Steamer or whatever model that uses ONLY one season of data as if the season in question was to be replayed.

2wins87
6 years ago
Reply to  aschrag83

He has a point though, which does seem to be missed by many, which is the difference between talent and value. Yes they have a strong positive correlation, especially over a very long sample, but they need not be that closely related over the course of one season.

WAR (especially the fangraphs version) is much closer to a measure of talent than it is of value, but lots of stat-cognizant fans talk about it as if it’s measuring value. You could certainly argue that it’s silly to even have an MVP award, but as long as we do it should be voted on based on value not talent.

My initial go-to if I had an MVP vote would be a version of WAR which replaces the batting and baserunning wins with WPA, while leaving the defense, positional, etc. components the same. It provides a lot of context while not doing something silly like comparing a 1B and SS as apples-to-apples. Using this metric we wouldn’t even be talking about Altuve (6 wins) and Judge (4.6 wins), but Trout and Betts at about 7 wins each.

thetoddfather
6 years ago
Reply to  2wins87

I think his underlying point is valid, and Dave expressed that he does, too.

People are missing that point because of the harsh, dismissive, and confrontational tone that Bill used. He doesn’t actually say “there is a difference between talent and value”; he uses phrases like “dead wrong”, “bad statistical analysis”, and “nonsense”…it’s the tone that makes him sound like a senile old man that has lost touch with reality, whether it’s true or not.

scooter262
6 years ago
Reply to  thetoddfather

I would love to see a live/recorded debate between Dave and Bill on this topic. Just from reading Mr. James’ comments in the article, he does come across as a bit surly in a get-off-my-lawn kind of way (maybe just my impression). I have seen Bill on MLB TV in discussions with Brian Kenny; maybe he could moderate a debate between Mr. James and Mr. Cameron.

RonnieDobbs
6 years ago
Reply to  scooter262

It wouldn’t be a debate – it would be a slaughter. James has lived this life for decades. I doubt Dave knows anything about this that Bill does not. Bill certainly has a much deeper understanding of all the issues involved. I don’t think Dave has a particularly deep understanding of the big picture. No particular offense intended, Dave. Dave is an author about sabermetrics – he is not really a baseball guy from what I gather from his articles. Bill James is both. We should all pay attention when Bill James speaks.

tramps like us
6 years ago
Reply to  thetoddfather

Bill’s pretty much always written that way. Part of his charm, he bruises egos and doesn’t care.

Cool Lester Smoothmember
6 years ago
Reply to  2wins87

WPA doesn’t tell us anything about value, though, because it argues that a run in the first is worth less than a run in the ninth of a 1-0 game.

Its only valid application is as a fun fact.

mikejuntmember
6 years ago

That’s actually true in the context of the game, however: a run when you have few remaining chances to score, or your opponent does, is in fact more valuable than a run when all your outs remain.

It is only more valuable due to sequencing, not difficulty (minus details about relievers, platoon advantages, etc), but that is still factually true.

Cool Lester Smoothmember
6 years ago
Reply to  mikejunt

And that’s why it’s such a great fun fact!

I love “Biggest WPA swings” articles…but it doesn’t tell us much of anything about how much a player helped their team win.

robodew
6 years ago

I love WPA. It definitely tells you a lot of information in the narrative sense, but that doesn’t necessarily mean it’s a good predictive or valuation metric. It’s important to remember what question you’re trying to answer when you read a stat rather than just trying to find the “best” or “most bottom line” stat.

robodew
6 years ago
Reply to  mikejunt

Sure, but event value is not equal to player value credited for the event if the player isn’t totally in control of all factors associated with the event.

Winning the lottery is more valuable than saving for a 401K, but if I’m evaluating someone’s financial aptitude I’ll credit them more for the 401K.

2wins87
6 years ago

I don’t know what your definition of value is, but it takes all of a team’s actual wins and losses and metes them out to each of the players according to what they actually did in a game. The fact that it accounts for what happened in the past but not what will happen in the future is a feature to me, not a bug. It aligns with how we experience the game but does so in a methodical fashion.

If we had two players with identical stats but we knew that one of them had hit two walk-off homers during the season we would reasonably argue that the player with the walk-offs had actually provided more value over the course of the season. Now we might not remember that the other player had two leadoff homers in 1-0 games, but it’s not like WAR fixes this problem, because it just assumes that all of the home runs hit by each of the players were done in some average on base environment that never actually exists in a game. Over the course of a season most regulars will have close to the same amount of high-leverage opportunities, so I’m not that concerned about the swings that come from big plays.

There are other flaws with WPA, like the fact that it assigns all of the credit and blame for what happens on defense to the pitchers while it should be split with the fielders (I would expect someone to come up with a version that attempts this soon given the recent advances in ball & player tracking). So yeah it’s not a perfect measure of value, but if it’s not measuring value at all then I don’t know what is.

mikejuntmember
6 years ago
Reply to  2wins87

The argument is that because of that, and Judge being Not-Great in those situations, it wasn’t a contest, or at least that’s what James said; that Altuve was probably twice as valuable as Judge in actual wins and therefore should have easily won the award.

Jimmy Dugan
6 years ago
Reply to  2wins87

“I don’t know what your definition of value is”

Exactly!

Why do we pretend there is a right answer here?

Player A hits 2 home runs in a loss

Player B goes 1-4 with one home run in a 1-0 win

Player C goes 1-1 with one home run in a pinch hit appearance in a 1-0 win

Player D hits a single with men on second and third and two outs in the bottom of the ninth inning of a game that was 0-1 before the plate appearance started.

Player E pitches a complete game shutout and allows 14 baserunners

Player F pitches a no-hitter in a 9 inning 0-1 loss with 0 Ks and 5 walks.

Player G pitches a no-hitter in a 9 inning 0-1 loss with 27 Ks, his team commits 2 errors.

Player H pitches a no-hitter in a 9 inning win with 27Ks, HE commits 2 errors

etc…
Which player was most “valuable”?

(and these are only some of the variables contained within one game situations, let alone an entire season)

JoeGarrison
6 years ago
Reply to  Jimmy Dugan

WAR suggests the player with the most value will be the middle infielder who hits like a corner outfielder.

cheekfullofchew
6 years ago
Reply to  JoeGarrison

lol, exactly.

Cool Lester Smoothmember
6 years ago
Reply to  2wins87

And what if the other player hit two homers in one-run games…but hit those homers in the third or seventh?

WPA, like any statistic which includes LI, is a great storytelling tool, but useless as a measure of value added.

sadtrombonemember
6 years ago

Actually, WPA is pretty interesting if you are trying to figure out the likelihood of someone winning the game. That’s what it is good for–as a game-level statistic.

It is NOT useful as an individual player statistic, for all of the reasons mentioned here and everywhere else. And on this, I agree with CLS wholeheartedly.

Terencemember
6 years ago
Reply to  2wins87

If your stat for who won games on the field has Altuve significantly behind Trout, you’re doing it wrong. Trout was on the field for 57 wins this year. Altuve was on the field for 94 wins this year.

Captain Tenneal
6 years ago
Reply to  Terence

Base coaches for MVP

Terencemember
6 years ago

I mean, the Angels were 57-58 with Mike Trout on the field. Astros were 94-59 when Altuve played. I can believe that Mike Trout was more valuable than Altuve this season, but not through the method he’s discussing.

I’m not advocating Carlos Beltran for MVP here, Altuve hit .441/.529/.661 in late and close situations. He hit .379/.443/.599 in his 450 PA that occurred when the game was within two runs for a team that won 101 games. If your stats are context dependent instead of context neutral, I’m trying to figure out how Altuve takes a significant step backwards.

rosen380
6 years ago
Reply to  Terence

That’s fine I guess. Here is your 2017 leaderboard:

#1 Francisco Lindor [#11]
#2T Yasiel Puig [#65]
#2T Edwin Encarnacion [#76]
#4 Carlos Santana [#61]
#5 Alex Bregman [#41]
#6T Corey Seager [#12]
#6T Enrique Hernandez [unranked, not enough PA]
#6T Jose Altuve [#2]
#9T Chris Taylor [#22]
#9T Jose Ramirez [#8]

fWAR rank in brackets — man fWAR sucks, not even close to the truth!

2001 Bret Boone, greatest season 1913 onward.
1998 Chad Curtis with his 90 OPS+ and 2.4 rWAR is tied for 4th.

Jimmy Dugan
6 years ago
Reply to  Terence

This is reductive. “If the data analysis comes to a different conclusion than I expected, the analysis must be wrong”.

Rotoholicmember
6 years ago
Reply to  aschrag83

I also find it more than a bit disingenuous to say he was just being polite all these years and didn’t want to pull up the ladder behind him. Maybe that made sense 10 years ago, but most of the sabermetric world has climbed past him on that ladder a long time ago.

tz
6 years ago
Reply to  aschrag83

I think James is getting stuck on nomenclature about the “wins” part, which is understandable. WAR is a metric used to estimate a player’s contributions relative to other players with different skill sets across different positions. It is scaled to wins, but is not designed to explain the attribution of a team’s actual wins.

RonnieDobbs
6 years ago
Reply to  aschrag83

Way to hold yourself in higher esteem than the person that founded this community’s set of values. Very progressive stuff!

I think sabermetrics are useful – thanks Bill James – but misuse of WAR is a problem – thanks Bill James. It is not very progressive to disregard a highly informed, experienced opinion… or is it because it is not blind support of a new idea?