Statcast and the Future of WAR
Over the weekend, I had the good fortune of attending the Sloan Sports Analytics Conference and participating on the baseball panel with Mike Petriello, Harry Pavlidis, Patrick Young, and Brian Kenny, which was a lot of fun. While the baseball panel was my only actual obligation at the conference, Petriello was doing double duty, having just presented — along with Greg Cain, one of the lead engineers at MLBAM — the latest update to Statcast and introduced two new public metrics for 2017, Catch Probability and Hit Probability. These are the kinds of numbers people have been hoping for, and they represent one of the first steps in moving from collecting interesting single data points to providing more valuable calculations based on the combination of factors the system is measuring.
To help promote the new metrics, Jeff Passan wrote a piece on Statcast over at Yahoo, focusing mostly on what Statcast could do in the future.
Sometime soon, there is going to be a new version of Wins Above Replacement available, and its goal, aside from encapsulating a player’s value into one tidy number, is simple: Don’t be scary. The plan does not involve dumbing down the metric that serves as the flashpoint between those who yearn for a catch-all and those who lament it. On the contrary, as with almost everything it does, Major League Baseball Advanced Media wants to make it so smart people can’t help but like it.
…
That’s part of the excitement: Defensive WAR has been more guesswork than exact science. Statcast exists for exactitude. Even better, Statcast takes only 10 to 12 seconds to give a play’s precise details, meaning before the next pitch anyone who cares to will be able to contextualize just how good – or at least rare – a catch really was. BAM’s data warehouse then can be queried to provide context, and highlight clips of similar or better catches can be compared and contrasted on demand.
This is, undoubtedly, an exciting future, and the idea of a Statcast-based WAR system is very intriguing. The current versions of WAR still struggle to separate run-prevention credit (and thus value) between the pitcher and the fielder. Statcast’s tools seem likely to bridge that gap, and with hit probability and catch probability — though it should be noted, the latter is outfield-only right now, as infield calculations are more complicated — we are now closer than ever to being able to build metrics that directly measure the quality of contact a pitcher allowed, and to adjust both the pitcher’s and the fielder’s contributions to the play made (or not made) based on that important variable.
So, yeah, Statcast is going to improve WAR calculations in a significant way, and should allow us to move past the FIP/ERA divide in the not-too-distant future. But perhaps more interesting is Passan’s mention that the guys at MLBAM are dreaming of their own WAR metric, and what that might look like down the line. The potential of a Statcast-based WAR model brings up a fascinating question: how granular should WAR get?
As Tom Tango has said on a number of occasions, WAR is a framework, giving the basic building blocks of adding hitting, baserunning, pitching, and fielding together in a systematic way. But the actual values that go into those components can be determined in a number of different ways, depending on the type of calculation that is being attempted, and more importantly, the question being asked.
Different numbers answer different questions, and have varying uses for determining what happened in the past or what might happen in the future. Oftentimes, metrics are grouped into “descriptive” and “predictive” buckets, depending on whether they are trying to account for what did happen or what we may expect going forward. WAR, generally, is a descriptive metric; it is trying to measure the value a player produced in a given season, not tell you what his value will be next season.
Which makes the idea of a Statcast-based WAR model pretty interesting, because a lot of the presumed value of Statcast’s data is to allow us to say “okay, that result happened, but based on the more granular data, we’d have expected this other thing to happen.” Hit probability, for instance, is going to let us say that a particular batted ball might have been caught by a diving outfielder’s spectacular effort, but that 85% of the time, that ball lands, and the hitter got robbed by something out of his control. Right now, every publicly available version of WAR simply records the play as a negative-value event for the hitter, despite the fact that he just got screwed by a great defensive effort.
Certainly, knowing that the ball normally lands for a hit is valuable information in evaluating that player’s performance. But is there any value produced for a team in hitting a ball that probably should have, but didn’t, land for a hit? Do we want to credit a hitter for just what was under his control, or do we want to credit him with what happened on the field during his at-bat?
Or, let’s think about it from the opposite perspective. During his presentation on Saturday, Petriello showed this Andre Ethier home run from last year’s NLCS.
According to the new Statcast Hit Probability calculation, a ball hit at that exit velocity and launch angle is an out 95% of the time. Because of a favorable wind and the fact that the ball was hit in one of the few stadiums where a 353-foot fly ball to left center would clear the fence, Ethier actually got his first home run off a left-hander in three years.
In terms of value, what do you do with that play? Ethier’s home run put a run on the board for the Dodgers, so — ignoring the fact that we don’t have postseason WAR right now — our calculation would give him the same credit for that play as if he had launched a 500-foot bomb onto Waveland Avenue. And our pitching WAR would correspondingly crush Jon Lester, who gave up the home run, even though Lester did his job and induced contact that is almost always an out.
The easy answer is to say a home run is a home run, and we don’t care about what should have happened, only what did happen, and what did happen is that Ethier rounded the bases. In general, WAR is an attempt to isolate player performance from the influence of his teammates, but it is not designed to strip luck out of the equation. And if that’s what we want WAR to do, then perhaps a Statcast-based version wouldn’t be so dramatically different from what is already out there, beyond separating pitching and defense in a more accurate way, anyway.
But it’s actually more complicated than that easy answer would suggest, because right now, there isn’t a publicly available version of WAR that really is calculating “what really happened”. The versions published here, at Baseball Reference, and at Baseball Prospectus all use context-neutral run values at the event level, so while Ethier’s home run really added one run to the Dodgers’ ledger, he’d get 1.4 runs worth of credit in WAR for hitting that home run, because we don’t think it’s his fault that there weren’t any runners on base when he hit it, and in general, the average home run produces about 1.4 runs worth of value for an offense.
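To make that concrete, here’s a minimal sketch of context-neutral crediting, using approximate linear-weights values rather than the exact constants any published WAR implementation uses:

```python
# Illustrative, approximate linear-weights run values; not the exact
# constants used by FanGraphs, Baseball Reference, or anyone else.
LINEAR_WEIGHTS = {
    "out": -0.27,
    "single": 0.45,
    "double": 0.75,
    "triple": 1.05,
    "home_run": 1.40,
}

def context_neutral_credit(event):
    """Credit the batter with the average run value of the event,
    ignoring the base/out/inning/score situation entirely."""
    return LINEAR_WEIGHTS[event]

# A home run is a home run: Ethier gets roughly +1.4 runs...
print(context_neutral_credit("home_run"))  # 1.4

# ...even though a ball with that exit velocity and launch angle is an
# out 95% of the time. An expectation-based credit (assuming, purely for
# illustration, the other 5% average out to a single) looks very different:
expected = 0.95 * LINEAR_WEIGHTS["out"] + 0.05 * LINEAR_WEIGHTS["single"]
print(round(expected, 2))  # about -0.23
```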
If we took the “measure what really happened” argument to its logical conclusion, then there’s a good argument to be made that something like RE24 — which gets run values from the base/out state, not the overall average — should be the foundation for the hitting component of WAR. And once you go to base/out context included, you can continue down that path to including inning and score, and argue for WPA instead, since if we’re giving a hitter credit for the situations he hits in, a walk-off grand slam does more to help a team win than a solo homer down by 10.
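And here’s a minimal sketch of the RE24 alternative, with rough, illustrative run-expectancy numbers rather than a real era-specific table:

```python
# Rough, illustrative run-expectancy values for a few base/out states;
# a real RE24 table covers all 24 states and is re-derived per run environment.
# Keys: (runner_on_first, runner_on_second, runner_on_third, outs)
RUN_EXPECTANCY = {
    (False, False, False, 0): 0.48,
    (False, False, False, 1): 0.26,
    (True,  False, False, 0): 0.86,
    (False, False, True,  1): 0.94,
}

def re24_credit(start_state, end_state, runs_scored):
    """RE24 credit for a play: runs that actually scored plus the
    change in run expectancy from the start state to the end state."""
    return runs_scored + RUN_EXPECTANCY[end_state] - RUN_EXPECTANCY[start_state]

# A solo homer with the bases empty and nobody out leaves the state
# unchanged, so the credit is exactly the one run that scored: 1.0,
# not the context-neutral 1.4.
print(re24_credit((False, False, False, 0), (False, False, False, 0), 1))
```

WPA then applies the same delta logic to win expectancy instead of run expectancy, which is how inning and score get folded in.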
The reality is that, with almost every component in WAR, you have to decide how much situational context you want to include, and the more context you include, the more credit you give to a player for something he had nothing to do with. And that brings us back to the Ethier home run. He didn’t really have control over the wind carrying his weak fly ball into the seats. So there is some logical consistency in saying that if we’re not including base/out/inning/score context because those things are out of the player’s control, perhaps we don’t want to measure a player’s contribution based on luck-based outcomes.
So an entirely Statcast-based WAR, one that measured value solely from the granular data and probabilities we think are within the realm of the player’s control, could be fascinating. I don’t know how popular a model that gave Ethier negative value for hitting a home run in a playoff game would be, but it would be a really interesting departure from every other version of WAR out there. And it would be the only model that stripped luck out of the picture, giving us perhaps the best view of a player’s actual contribution to an outcome.
But it would also be a radical departure from what people have said they generally want WAR to be. In responding to the question about how much value to give to a player who hits a triple and gets stranded versus a triple where the next batter drives him in, 95% of our readers said they wanted those two triples to be given the same value, and they didn’t want the run value to be based on the future sequencing after the event occurs. Given that Tom Tango, who now works at MLBAM, ran that series, I would imagine he’s going to be influenced by those ideas when helping craft MLB’s version of WAR, whenever that is.
This tension between what happened, how it affected the scoreboard, and how much credit to give to players for things they don’t control is a difficult thing to resolve. And now that we’re getting even more granular metrics about a player’s contribution to outcomes, these questions are going to continue to be relevant. Knowing the guys at MLBAM on a personal level, I’m pretty comfortable with the fact that they’ll handle these questions thoughtfully, and if (or when?) they do produce an MLB.com WAR model, it will be with all of these questions answered as well as they feel they can.
But while Statcast holds a lot of promise for improving the pitching and defensive sides of the components, getting ever-more granular hitting data might force us to again ask what we want WAR to be, and what the goal of the model is. There is no obvious right answer here, and that’s one of the reasons there will always be multiple ways of calculating WAR.
Dave is the Managing Editor of FanGraphs.
It would be really nice if they’d tell us how they’re actually calculating these statcast-based “probability” stats. Early last year, back when I had more time on my hands, I had a go at this and wrote a blog post that covers most of the concerns with what they’re doing:
http://45percentmental.blogspot.com/2016/05/on-development-of-expected-ball-in-play.html
It’s not clear to me what the choice of kernel or bandwidth is on these stats – that’s kind of important. The descriptions make it seem as if it’s constant-bandwidth, which is…not the best choice, given the uneven distribution of batted balls in trajectory-space.
(I’d love to go calculate my own version of these stats with a full year of data, now, but unfortunately calculating the bandwidths for the kernel regression on my desktop would take something like two weeks – O(n^2) problems are unforgiving!)
From the article on “Hit Probability”:
“What is the percentage based on?
It’s based on the Major League average for the combination of exit velocity and launch angle over the two seasons of Statcast™, and it includes a smoothing process to include larger samples. For example, for the individual pairing of “100 mph and 30 degrees,” we looked at all balls within 4 mph and 4 degrees, with proportionately greater weight to those balls closer to the 100/30 pairing.”
Yes, I saw that.
So, this tells us that the bandwidths at that particular point are 8mph/8degrees, but doesn’t tell us if they’re constant over the whole trajectory-space, or how they came to those bandwidths. Worst-case scenario is they just “guessed” at a pair of constant bandwidths, which is a pretty terrible way to do it – bandwidths should always be chosen through some systematic data-driven algorithm, and variable bandwidth is quite clearly the way to go here.
“Proportionally bigger” could be interpreted in two ways – either the kernel is triangular (i.e. there’s a linear increase in weight up to the point itself), or else it falls off like 1/d where d is the distance. The former is sort of an odd choice of kernel, but should work reasonably well. The latter would be *extremely* strange. I’m pretty sure they mean the former.
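To make the triangular-kernel reading concrete, here’s a minimal sketch, assuming a product kernel over the two dimensions and the 4 mph / 4 degree half-widths from MLB’s description; everything else here (names, the exact form) is my guess, not their published method:

```python
import numpy as np

def hit_probability(ev_query, la_query, ev, la, is_hit,
                    ev_bw=4.0, la_bw=4.0):
    """Nadaraya-Watson estimate of hit probability at one
    (exit velocity, launch angle) point, with a triangular kernel:
    weight falls off linearly to zero at the bandwidth edge.

    ev, la, is_hit: arrays of observed batted balls (is_hit in {0, 1}).
    ev_bw, la_bw: the 4 mph / 4 degree half-widths from MLB's quote;
    whether they hold constant across trajectory-space is exactly
    the open question here.
    """
    w_ev = np.clip(1.0 - np.abs(ev - ev_query) / ev_bw, 0.0, None)
    w_la = np.clip(1.0 - np.abs(la - la_query) / la_bw, 0.0, None)
    w = w_ev * w_la  # product kernel over the two dimensions
    if w.sum() == 0:
        return float("nan")  # no data within the bandwidth window
    return float((w * is_hit).sum() / w.sum())

# Toy usage with fabricated data:
rng = np.random.default_rng(0)
ev = rng.uniform(60, 115, 1000)             # exit velocities, mph
la = rng.uniform(-30, 50, 1000)             # launch angles, degrees
is_hit = (ev > 95) & (la > 10) & (la < 35)  # crude stand-in for outcomes
print(hit_probability(100.0, 30.0, ev, la, is_hit.astype(float)))
```

A variable-bandwidth version would widen ev_bw and la_bw in sparse regions of trajectory-space (say, out to the distance of the k-th nearest neighbor), which is why the constant-bandwidth question matters.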
I believe you’re over-thinking this one. It’s likely as simple as assigning a 0 for a hit and a 1 for a catch for every play in the database, then setting bins, likely in 0.1-second increments for hang time and 1-ft increments for distance traveled, and voilà: divide the sum by the total for each bin. Code it, or make two pivot tables in Excel, and determine the probabilities in 10 minutes.

Of course, we don’t have distance traveled (due to OF starting point) or hang time as part of the public data set, but just like you did, we can sub in angle and velocity and do a similar analysis. Personally, I think the angle/velocity catch probability is a better measure than this one using hang time and distance anyway, since hang time is determined by angle/velocity (might as well have the insight two variables offer instead of their combination in hang time), and OF distance really adds more problems than it fixes (who is responsible for OF starting position, and how does that impact the play? I.e., is he so good at positioning that he never has any long runs, or is he only well positioned because it’s a team-determined thing, and that OF, on his own, would not position himself that way?).
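A minimal sketch of what I’m describing, in pandas rather than pivot tables, with assumed column names and toy data:

```python
import pandas as pd

# One row per tracked fly ball, with columns (names assumed):
#   hang_time (s), distance (ft from fielder's start), caught (1) or hit (0)
df = pd.DataFrame({
    "hang_time": [4.1, 4.2, 4.15, 5.0, 5.1],
    "distance":  [55, 57, 56, 30, 31],
    "caught":    [0, 1, 0, 1, 1],
})

# Tile the space: 0.1 s hang-time bins, 1 ft distance bins, as described.
df["t_bin"] = (df["hang_time"] / 0.1).round().astype(int)
df["d_bin"] = df["distance"].round().astype(int)

# Catch probability per tile = catches / opportunities in that tile.
catch_prob = df.groupby(["t_bin", "d_bin"])["caught"].mean()
print(catch_prob)
```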
Naively tiling the space with bins is a really, really bad way of doing it for a number of reasons – you get nonsensical discontinuities at the boundaries between the bins, for one, and if you use a uniform bin size you’re almost guaranteed to horrendously overfit in regions where the data are sparse.
It is as if you are speaking a different language.
That’s a pretty common phenomenon when reading stuff about technical fields one isn’t familiar with.
You asked what they did, and I’d wager it’s only slightly more complicated than I outlined for you. You can disagree with that methodology for any number of reasons, and you’ve posed some problems you think there are with that approach. But it’s very likely what they did to calculate the probabilities (which is what you asked, per “It would be really nice if they’d tell us how they’re actually calculating these statcast-based ‘probability’ stats”). As the hit article, and the earlier reply, noted, the only thing they did beyond what I mentioned was to make slightly larger bins and use a weighted average to smooth them, so as to address some of your aforementioned concerns. Maybe I’m wrong, and some very similar approach is not what they’ve done, but I personally doubt it.
What you’re describing is a naive approach to kernel regression without implementation details. The concerns that Oblarg is posting about are exactly those implementation details.
https://baseballsavant.mlb.com/statcast_catch_probability
The math is shown. It’s pretty much exactly as I described: tiles based on increments of 0.1 seconds and 5 feet. (Balls Landed) / (Batted Ball Events) = Probability
Note this is not a regression equation of any kind, despite all the mini-Nate Silvers we’ve got commenting just dying for it to be.
You may not like the approach and have a lot of issues with how you would go about calculating a regression equation, but I’m not describing creating a regression equation, which would have the issues you noted. And a regression equation is not what the creators have done here.
Instead, we’re talking about a table: a literal table based 100% on actual data, not a function of distance and hang time used to generate a probability.
Bottom line, I clearly mistook Oblarg’s post to be an honest question about how they did the math, when it was really just a faux-question aimed at giving him the opportunity to talk about how he has tried, and enjoys, building a statistical model based on kernel regression to output the probability, in lieu of just going to the appropriate location on a table of historical rates. That’s a fine and interesting discussion, but my mistake for not recognizing the comment’s true intent earlier.
Can I get an ELI5 on this?
The right way to do these calculations was presented at The Hardball Times a while back: use exit speed plus launch and spray angle in 3-D to get wOBA, with cross-validation used to get the optimal bandwidths. The result is known as the wOBA cube, as seen at
http://triplesevenproductions.com/daily-fantasy-sports/sabermetrics-dark-full-terrors/
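For the curious, here’s a minimal sketch of how that cross-validation step could look in 2-D: a grid search over bandwidth pairs scored on held-out squared error. All names and values here are illustrative, not the Hardball Times implementation:

```python
import numpy as np
from itertools import product

def nw_predict(x_train, y_train, x_test, bw):
    """Nadaraya-Watson prediction with a Gaussian product kernel.
    x_*: (n, d) arrays; bw: (d,) bandwidth vector."""
    # Squared scaled distances between every test and train point.
    d2 = ((x_test[:, None, :] - x_train[None, :, :]) / bw) ** 2
    w = np.exp(-0.5 * d2.sum(axis=2))
    return (w @ y_train) / w.sum(axis=1)

def cv_bandwidth(x, y, candidates, n_folds=5, seed=0):
    """Pick the bandwidth pair minimizing mean squared error
    across simple random folds."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, len(x))
    best, best_err = None, np.inf
    for bw in candidates:
        errs = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            pred = nw_predict(x[tr], y[tr], x[te], np.asarray(bw))
            errs.append(np.mean((pred - y[te]) ** 2))
        err = np.mean(errs)
        if err < best_err:
            best, best_err = bw, err
    return best

# Toy usage: (exit velocity, launch angle) -> wOBA value on contact.
rng = np.random.default_rng(1)
x = np.column_stack([rng.uniform(60, 115, 500), rng.uniform(-30, 50, 500)])
y = rng.uniform(0, 2, 500)  # fabricated wOBA-on-contact values
grid = list(product([2.0, 4.0, 8.0], [2.0, 4.0, 8.0]))
print(cv_bandwidth(x, y, grid))
```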
Yep, this is essentially what I did (though my analysis was restricted to 2D).