Putting WAR in Context: A Response to Bill James
by Dave Cameron
November 20, 2017

Nine years ago next month, we introduced a new stat to the pages of FanGraphs. We called it Win Values, and on the player pages and leaderboards, it went by the acronym WAR. We wouldn’t actually start calling it that, or use the words for which the acronym stood (Wins Above Replacement) for a little while, since we thought Win Values sounded cooler. And as the people who bring you WPA/LI and RE24, we’re clearly the experts on statistical naming coolness.

Over the last nine years, WAR has become something of a flagship metric, not just for us, but for the analytical community at large. Baseball-Reference introduced their own version, while Baseball Prospectus modernized their version of WARP — their version adds the word player to the name, thus the P — to provide something that scaled a bit more like what was presented here and at B-R. Because WAR is a framework for combining a number of different metrics into a single-value stat, there are also quite a few other versions of WAR out there, each with their own calculations.

But while everyone uses different inputs — and therefore arrives at slightly different results — almost all of the regularly updated WAR metrics are built on some version of linear weights, which assigns an average run value to each event in which a player is involved, regardless of what actually happened on the play.

If you hit a single, you get credit for hitting a single. It’s worth some fraction of a run, regardless of whether you hit it with two outs and the bases empty in the first inning of an eventual blowout, or whether it was a walk-off two-run single to give your team the lead. In most versions of WAR, the value of a player’s contribution is calculated independent of the situation in which it occurred.

Bill James is not a fan of that decision.
“We come, then, to the present moment, at which some of my friends and colleagues wish to argue that Aaron Judge is basically even with Jose Altuve, and might reasonably have been the Most Valuable Player. It’s nonsense. Aaron Judge was nowhere near as valuable as Jose Altuve. Why? Because he didn’t do nearly as much to win games for his team as Altuve did. It is NOT close. The belief that it is close is fueled by bad statistical analysis—not as bad as the 1974 statistical analysis, I grant, but flawed nonetheless. It is based essentially on a misleading statistic, which is WAR. Baseball-Reference WAR shows the little guy at 8.3, and the big guy at 8.1. But in reality, they are nowhere near that close. I am not saying that WAR is a bad statistic or a useless statistic, but it is not a perfect statistic, and in this particular case it is just dead wrong. It is dead wrong because the creators of that statistic have severed the connection between performance statistics and wins, thus undermining their analysis.”

James strongly believes that the metric falls apart by building up from runs, rather than working backwards from wins, since the context-neutral nature of the metric means that what WAR estimates a group of players are worth won’t add up to how many games their team actually won. In his mind, the decision to make WAR context-neutral isn’t a point on which reasonable people can disagree; it’s just a mistake.

James, of course, is a pioneer in this field, and FanGraphs probably doesn’t exist today if not for the work he did in the 1970s and 1980s, laying the groundwork for most everything that has come since. So when a person with his resume suggests that WAR isn’t just imperfect — which it absolutely is, no matter which version you use — but instead wrongly constructed, I think it’s worth responding to. So let me take a few minutes to talk about the relationship between context and value, and how that informs people’s preferences for building out WAR in this way.
My primary guiding principle on the usefulness of a metric is twofold:

1. What question is it answering, and is the answer to that question interesting? If yes, proceed. If no, ignore.
2. Does the metric answer that question accurately?

There are plenty of statistics that keep an accurate count of things that happened. In many cases, however, these numbers answer only trivia questions. Who got the most hits on Tuesdays in 2017? There’s an accurate and measurable answer to that question, but I have no idea what it is, because it doesn’t matter in any tangible way.

WAR, on the other hand, attempts to address a question that a lot of people seem interested in answering. If the WAR leaderboards were posed as a question, they might be written as something like this: “What did each player do, as an individual, to help his team try to win games?”

Wins are a team accomplishment, an amalgamation of performances from a large number of different players. The way WAR has generally been constructed means that it attempts to isolate the player’s contributions towards winning games. And when it comes down to assigning value to individuals for the events in which they’re involved, the general consensus in the sabermetric community has been that we want to reward (or penalize) hitters for what they can control. And the context of the situations in which they play is just not something players can create.

To be clear, this decision wasn’t made solely when WAR models started getting calculated online for people to track. Pretty much every single baseball statistic that is used on a daily basis, regardless of how analytically inclined the user is, is designed to be context-neutral. Batting average, on-base percentage, slugging percentage, home runs, stolen bases, walks, and strikeouts: all of it is counted without regard to the number of runners on base, the inning, or the score.
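To make the context-neutral idea concrete, here is a minimal Python sketch of a linear-weights style valuation. The run values are rounded, illustrative numbers rather than the weights any site actually publishes, and the event fields (`inning`, `outs`, `runners_on`) are hypothetical names included only to show that the calculation ignores them.

```python
# Illustrative average run values per event (assumed, rounded numbers;
# not the actual weights used by FanGraphs or anyone else).
RUN_VALUES = {"1B": 0.47, "2B": 0.77, "3B": 1.04, "HR": 1.40, "BB": 0.31, "OUT": -0.27}


def context_neutral_runs(events):
    """Sum the average run value of each event type.

    Only the event type is read; inning, outs, runners and score are ignored.
    """
    return sum(RUN_VALUES[event["type"]] for event in events)


# Two singles in very different situations (hypothetical field names).
blowout_single = {"type": "1B", "inning": 1, "outs": 2, "runners_on": 0}
walkoff_single = {"type": "1B", "inning": 9, "outs": 2, "runners_on": 2}

print(context_neutral_runs([blowout_single]))
print(context_neutral_runs([walkoff_single]))
```

Both calls print the same value, because the only thing the function looks at is that a single was hit.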
Outside of runs and RBIs, pretty much every mainstream measure of individual hitter evaluation has been designed to ignore the context of the situation in which it occurred. And runs and RBIs only consider baserunner situation, not number of outs, the score, or the inning in which they occurred. On the pitching side of things, ERA includes baserunner performance, but not inning or score.

There are context-dependent metrics, of course, and we’ve tried to do our part to promote Win Probability and Leverage Index as tools to be included in the discussion of value. We have a number of stats that measure different levels of context-specific performance, ranging from RE24 (baserunner/out context included, inning/score excluded) up to WPA, which includes everything. We keep a statistic called Clutch to track the difference in win values between a player’s context-neutral performance and his context-specific performance.

So why don’t we build WAR off of one of these numbers instead of a linear-weights-based method? Would WAR be better if it included the context of the events, rather than just an estimate of the player’s contribution to the result based on historical averages?

I think the answer is that it depends on how you’re using WAR. In the case of MVP voting, I do think there is a case to be made for looking at the circumstances under which a player performed, and I did use context-dependent metrics when I was an MVP voter. WAR is an imperfect tool, and it’s particularly imperfect for things like the MVP award, which is why even those of us who host sites that promote WAR fairly extensively suggest not relying solely on its results when filling out a ballot.

But if one wanted to build a complete and thorough version of WAR that tied back perfectly to a team’s win total, the reality is that it would not be particularly useful for answering many other questions.
Because assigning an individual player the true contextual value of his performance requires far more adjustments than the simple proposed fix in James’s article. For instance, one of the main reasons we use BaseRuns instead of linear weights as the underpinnings of our expected won-loss record calculations is that run scoring isn’t linear. As you stack more and more good hitters together, the increase in run production will go up faster than a simple addition of run values would suggest. So, if you wanted to do a true valuation of a player’s performance, you’d have to account for his teammates’ performances, as well, and how the combination of the two translated into runs scored.

But if you do that, you’re explicitly giving a player win-value credit for having better teammates. Jose Altuve is great on his own, but what do we do with a metric that says he’s even better because he has Carlos Correa hitting behind him? That metric is no longer measuring each player’s performance but the specific value he created in that specific lineup, based on what other players did in the aggregate.

You can drill down even further if you want to take the context argument to its logical conclusion. If we want to reduce Aaron Judge’s WAR by the number of wins he cost his team by performing poorly with men on base, do we want to also adjust his WAR (and everyone else’s) by how his teammates did after he reached base? If we’re primarily interested in tying performance back to team wins, so that a bases-loaded hit is worth more than a bases-empty hit, the same logic would suggest that a base hit in front of a home run is worth more than a base hit in front of a double play. In either case, the player wasn’t responsible for what came before or after his individual contribution, but the results were wildly different, and a hit in front of a double play doesn’t contribute to a win any more than a ground out would have.

And that’s just baserunner/out context.
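The non-linearity of run scoring described above can be illustrated with a small sketch. The formula below is one simple, widely circulated form of BaseRuns, not necessarily the version FanGraphs uses, and both team stat lines are invented for illustration. It estimates how many runs one extra single adds to a weak lineup versus a strong one.

```python
def base_runs(ab, h, tb, hr, bb):
    """Simple BaseRuns estimate: baserunners times a scoring rate, plus home runs."""
    a = h + bb - hr                                       # baserunners, excluding home runs
    b = (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * bb) * 1.02   # advancement factor
    c = ab - h                                            # outs
    d = hr                                                # home runs always score
    return a * b / (b + c) + d


def marginal_single(team):
    """Extra runs from adding one single (one AB, one hit, one total base) to a team line."""
    plus_one = dict(team, ab=team["ab"] + 1, h=team["h"] + 1, tb=team["tb"] + 1)
    return base_runs(**plus_one) - base_runs(**team)


# Invented season totals (assumptions, not real teams).
weak = {"ab": 5500, "h": 1350, "tb": 2050, "hr": 130, "bb": 450}
strong = {"ab": 5500, "h": 1550, "tb": 2600, "hr": 230, "bb": 600}

print(round(marginal_single(weak), 3), round(marginal_single(strong), 3))
```

Under these assumptions the identical single adds more runs to the stronger lineup, which is exactly the teammate-dependent credit that a context-neutral metric is trying to leave out.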
Score and inning context is even thornier ground, because no one really wants to conclude that a home run with your team down 10-0 is worthless. If an individual player metric cannot reward an individual player for his own performance because his teammates were so bad that the game was already out of hand, then it is no longer an individual performance metric.

We’ve written a lot of articles over the years about the pros and cons of context-neutral and context-dependent metrics, and offer a range of stats that attempt to measure value in nearly every phase of the context scale. But WAR became popular as a metric largely because it attempts to isolate just a player’s individual contribution to a team’s wins and losses, and once you start adding in some context, there’s no real reason to stop until you’re at WPA, which tells you that Melky Cabrera’s 98 wRC+ resulted in a better offensive season than Jose Ramirez’s 148 wRC+. And even then, WPA ignores the after-he-hit events, so even that isn’t really telling you the full story about the value of Melky’s offensive inputs as they relate to wins.

Once you adjust for the full context of a player’s input into wins and losses, you’re left with a version of WAR that is so far removed from his own contributions that I don’t know what question it would answer. And if you just include some context but not others, then you have to justify only going part way.

For instance, James’s Pythagorean expected record was used for decades as a shorthand for “team luck,” because it stripped out the sequencing that turned runs into wins. But it didn’t do anything to strip out the sequencing that turned individual events into runs, so it only measured part of a team’s sequencing value. Why? Because no single metric can measure all things at all times. Like every other metric in existence, WAR measures some things well and other things poorly. It is useful to answer some questions, but not all questions.
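For reference, the Pythagorean expected record mentioned above can be sketched in a few lines. This uses the classic exponent of 2 (later refinements use other exponents), and the team totals here are made up.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from runs scored and allowed, ignoring how runs were spread across games."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


# Hypothetical team: 800 runs scored, 700 allowed, 95 actual wins.
expected = pythagorean_wins(800, 700)
luck = 95 - expected  # the gap often read as "team luck"
print(round(expected, 1), round(luck, 1))
```

Note that the inputs are already runs, so whatever sequencing turned singles and walks into those runs is still baked in. That is the partial adjustment the paragraph above describes.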
For MVP voting, perhaps WAR is less useful than James would like it to be. On that point, I agree, and I used other metrics when filling out my MVP ballot when I was assigned to be a voter. But in many cases, the questions people are attempting to answer by using WAR are better answered by a context-neutral metric. It might not answer those questions perfectly, but it at least aligns with the questions about which fans are curious.

Is there a place for a context-dependent version of WAR? Perhaps. But then again, we’re already accused of undermining the model’s credibility by having multiple popular methods of calculation. And as James himself found when developing Win Shares, tying individual player performance to team wins isn’t quite as easy as one might hope.

I have no problem admitting that WAR as a model contains a number of flaws, or that our specific implementation of the framework is also flawed. There are a lot of areas for improvement. Forcing it to account precisely for the exact number of wins with which each team finished, though, would probably make it less useful overall as a measure of individual performance.