Let’s Check In on the Odds on Apple TV+’s Friday Night Baseball Broadcasts
Last year, I looked into the odds displayed on Apple TV+’s Friday Night Baseball broadcasts. I found that they weren’t very good. Then I stopped paying attention; there are a ton of baseball games on every week, most of which Apple does not broadcast, and even when I did watch an Apple game, I basically ignored the odds. I knew they were silly, after all – why distract myself by looking at them?
Of course, I didn’t really expect that to keep going indefinitely. Apple is a massive company. They have more than 150,000 full-time employees, and a ridiculous proportion of that group knows how to code. At its core, this is a data problem. There are companies I’d trust over Apple to solve a data problem, but not a lot of them, you know? Sure, they outsourced their predictions to nVenue, a sports analytics company, but they’re Apple. Surely they’d find a way to make this all work. I noted their relative inaccuracy in my head as a temporary curiosity and moved on.
Last month, I started compiling data for an update. It’s all well and good to assume things have changed, but at some point, you have to go verify it. I decided to wait for the back half of the season because of the way I designed my test, which I’ll now explain. Similar to last time, I started by watching a bunch of games. This time, I got data from the 12 Apple TV+ games played between August 18 and September 22. I watched the entirety of those broadcasts and noted the last probability, if any, displayed before every pitch.
Those probabilities covered a range of possible outcomes – double plays, home runs, strikeouts, walks, hits, RBI, even the odds of a groundout on one occasion. I noted each of them, as well as the ultimate result of the plate appearance. That gave me 2,368 observations to test.
As with the last time I tested these odds, I used a straightforward control group. A year ago, nVenue stated that they use more than 100 variables to construct their predictions; their website currently boasts, “Our machine learning platform considers hundreds of variables including both batter and pitcher data to generate outcomes that update with each pitch.” I used just one: the count. For every game, I simply gathered the major league average result distribution by count through the previous day’s games and used that to make dummy predictions.
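If you’re curious what that looks like mechanically, here’s a minimal sketch of the count-based lookup. The column names and data layout are placeholders I made up for illustration, not the actual structure of my spreadsheets:

```python
# A rough sketch of the "naive method": for each count, take the league-wide
# distribution of plate appearance outcomes through the previous day and use
# it as the prediction. The pa_log columns (game_date, count, outcome) are
# hypothetical; dates are assumed to be ISO-formatted strings.
import pandas as pd

def naive_odds(pa_log: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Return P(outcome | count) using only plate appearances before `as_of`."""
    prior = pa_log[pa_log["game_date"] < as_of]
    counts = prior.groupby(["count", "outcome"]).size()
    probs = counts / counts.groupby(level="count").transform("sum")
    return probs.rename("prob").reset_index()

# e.g. the chance of reaching base from an 0-2 count entering the Aug. 18 games:
# odds = naive_odds(pa_log, as_of="2023-08-18")
# odds[(odds["count"] == "0-2") & (odds["outcome"] == "reach")]
```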
This is not a system I’d recommend to anyone. It’s silly and naive; in fact, I called it “naive method” in all the spreadsheets I used for this project. I didn’t intend for it to be a real predictor, because it ignores a lot of things that are clearly important. Juan Soto is in the sample of games I watched, and he got a walk rate projection. Aaron Judge is in it too, and his odds of hitting a home run were displayed. If you tried to use league averages to project how likely Aaron Judge is to hit a home run, you’d be laughed out of any stadium in America.
My “model” couldn’t come up with odds for everything the broadcast did. With only count taken into consideration, I couldn’t come up with odds for double plays or RBI, and I also didn’t use granular enough statistics to handle groundouts. Rather than come up with some silly way of predicting those, I simply ignored them. I also excluded plate appearances that ended without a conclusion, such as an inning-ending caught stealing. Finally, I used a bit of common sense. The odds of someone recording an RBI with the bases loaded and two outs are the same as the odds of reaching safely, so I treated those interchangeably. That left me with more than 2,000 data points, as I mentioned, but you’d need a fancier naive model than mine to provide a sanity check for every last probability displayed on the broadcast.
Anyway, those are the particulars of how I came up with a silly foil for the odds on the broadcast. From there, I had two data sets to test against each other. First, I calculated a Brier score (lower is better) for each set of predictions. They were nearly identical; nVenue’s odds came in at 0.198 and the naive odds checked in at 0.20. Those numbers don’t mean much on their own, though; I don’t know what a good or bad Brier score would be for baseball predictions, or whether a fancy model should be expected to outperform a simple one when averaged out in this manner.
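For reference, a Brier score is just the average squared gap between a stated probability and what actually happened, so the whole calculation fits in a few lines. This is a generic sketch; the input format is my own choice rather than anything from my actual sheets:

```python
# Brier score: mean of (probability - outcome)^2, where outcome is 1 if the
# predicted event happened and 0 if it didn't. Lower is better; a coin flip
# called at 50% every time scores 0.25.
def brier_score(preds):
    """preds: iterable of (probability, outcome) pairs with outcome in {0, 1}."""
    pairs = list(preds)
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# e.g. brier_score([(0.22, 1), (0.31, 0), (0.45, 0)]) -> roughly 0.302
```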
Next, I used the same test I used last year, because I really like it. It’s quite simple: I let the models bet against each other. First, one model makes the odds for every single prediction we have. Next, the other gets to bet against it. Finally, I reverse the process. It works like this: Let’s say the nVenue odds give Giancarlo Stanton a 22% chance of reaching base against Brayan Bello after reaching an 0-2 count. The naive model says that runners reach 20.3% of the time in that situation (through the games of August 17, the last day before the game in question). So the naive model bets “don’t reach,” and either makes 0.22 units (if Stanton makes an out) or loses 0.78 units (if Stanton reaches).
Similarly, the broadcast’s model gets to gamble against the naive one. So it bets “reach,” since its prediction is higher than the naive odds, and either makes 0.797 units if Stanton reaches base or loses 0.203 units if he makes an out. I repeated this process for every single prediction shared by the two models. In that particular case, Stanton reached base, so I recorded +0.797 for the nVenue model’s “gambling” gains and -0.78 for the naive model.
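Here’s the payoff logic in code form, since the symmetry is easier to see that way. The odds are treated as fair (no vig), so a bookmaker whose probabilities are right breaks even on average; the function and variable names are mine, not anything published by nVenue:

```python
# Head-to-head "betting" test: the bettor wagers on the event when its own
# probability is higher than the bookmaker's line, and against it otherwise.
# Payoffs are sized at the bookmaker's fair odds, so a correct book breaks even.
def bet_profit(book_prob: float, bettor_prob: float, happened: bool) -> float:
    if bettor_prob > book_prob:  # bet on the event
        return (1 - book_prob) if happened else -book_prob
    return book_prob if not happened else -(1 - book_prob)  # bet against it

# Stanton example: the naive model (20.3%) betting against nVenue's 22% line
# bet_profit(book_prob=0.22, bettor_prob=0.203, happened=True)   # -> -0.78
# nVenue (22%) betting against the naive 20.3% line
# bet_profit(book_prob=0.203, bettor_prob=0.22, happened=True)   # -> 0.797
```

Summing that profit over every shared prediction, with each model taking a turn as the bookmaker, gives the two totals reported below.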
The first time I looked at these on-screen odds, they got absolutely trounced by my simple model. As best as I could tell, that was the case largely because they often made strange decisions about how to adjust the odds within an at-bat. They might increase the odds of a player reaching base after a strike or decrease them after a ball. That led to some strange probabilities and also the model chopping itself up, “betting” on good outcomes in bad spots and bad ones in good spots.
An example: Last Friday, Mookie Betts fouled off a few pitches in an 0-2 count. After the last foul ball, the broadcast listed his odds of reaching base as 25%. Then he took a ball in the dirt to reach a 1-2 count… and his odds of reaching base dropped to 21%. That’s an extraordinary claim; league-wide, on-base percentage goes up by roughly 30 points in the transition from 0-2 to 1-2. Toss that prediction, and you could imagine nVenue’s predictions performing much better.
To their credit, they mostly have tossed those backwards predictions this season. When I tested these odds last year, I looked at a sample of 2,075 predictions. Of those, 216 were “wrong-way,” meaning the probability moved in the opposite direction from what the change in count implied. Let me be very clear: That’s a ridiculous amount. I’m willing to believe that there are some rare circumstances where a counter-intuitive change in probability might make sense, but 10%? C’mon, man.
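For what it’s worth, flagging these doesn’t take much. Here’s roughly the kind of check involved, sketched for reach-base odds only; markets like strikeout probability flip the expected direction and would need their own rule:

```python
# A "wrong-way" update on reach-base odds: the number fell after a ball or
# rose after a strike. Count-preserving pitches (e.g. a foul with two strikes)
# carry no strong expectation, so they aren't flagged.
def is_wrong_way(prev_prob: float, new_prob: float, pitch_result: str) -> bool:
    if pitch_result == "ball":
        return new_prob < prev_prob
    if pitch_result == "strike":
        return new_prob > prev_prob
    return False

# Betts example from below: is_wrong_way(0.25, 0.21, "ball") -> True
```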
This year, the wrong-way odds are much more muted. There were 28 in the 2,368 observations I tested, down an order of magnitude from last year. As you might expect, making obvious errors on 10% of your predictions makes it very hard to beat even a simple model. This year’s contest was on much more even footing.
In fact, the two models did almost identically. When nVenue’s predictions got to bet against my dumb ones, they made 39.62 units of “profit” across the sample. When my dumb ones got to bet against their predictions, they made 28.87 units. In other words, it’s close to a statistical dead heat.
I have a few ideas about what has changed about the on-screen odds in the last year. The first is snipping those wrong-way errors, which is easy enough. Just write a simple check into the code that displays the odds. If you’d do something “wrong-way,” simply repeat the last number instead. In other words, if you want to lower Betts’ odds of reaching base when he takes a ball, instead leave them the same. Bam, plenty of juice is eliminated right there. They clearly didn’t do this exactly, since there are still occasional blips, but I’m sure there’s some type of quality control in there now.
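To be concrete, the guard I’m imagining looks something like this. It’s purely a sketch of the idea; I have no insight into nVenue’s actual pipeline:

```python
# Display-time guard: if the model's new reach-base number moves the "wrong
# way" for the pitch that just happened, show the previous number instead.
# This is a guess at the concept, not anything I know about nVenue's code.
def display_prob(prev_prob: float, model_prob: float, pitch_result: str) -> float:
    wrong_way = (pitch_result == "ball" and model_prob < prev_prob) or \
                (pitch_result == "strike" and model_prob > prev_prob)
    return prev_prob if wrong_way else model_prob

# Betts example: display_prob(0.25, 0.21, "ball") -> 0.25 (hold the line)
```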
Second, last year’s odds were wild. Take the simplest forecast: The odds of reaching base at the start of a plate appearance. League-wide OBP was .312 last year, and it’s .320 this year. In the 2023 data I looked at, 54.4% of the 0-0 “reach base” predictions that the broadcast displayed were between 31% and 33%. In other words, they’re weighing league average heavily when making their predictions, which seems quite reasonable to me if you just want usable odds.
Last year, their process was less “make usable odds” and more “take big swings.” Only 23% of their 0-0 “reach base” predictions were between 30% and 32% (centered around the league average OBP), while 37% of their predictions were either below 26% or above 36%; we’re talking huge outliers here. For comparison, less than 10% of qualified hitters have OBPs that far away from league average. In this year’s predictions, only 3.6% of 0-0 reach base predictions are five percentage points or more away from league average. Those big swings were some of the most damaging predictions that last year’s iteration of the on-screen odds made. Only 16.7% of the players that last year’s on-screen odds gave a 38% chance or higher of reaching base actually did so. Meanwhile, 44.4% of the players that last year’s odds gave a 25% or lower chance of reaching base did so. Kevin Kiermaier somehow appeared on both lists. I’m not sure what was going on there, but if you’re wondering why things were so bad, that’s where I’d start.
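If you want to replicate the bucketing behind those percentages, it’s a couple of lines: count how many 0-0 reach-base predictions land within a given distance of league OBP. The function below is illustrative; the thresholds in the comments match the bands quoted above:

```python
# Share of 0-0 reach-base predictions within +/- `width` of the league OBP.
# `preds` is a list of displayed probabilities as decimals (e.g. 0.31 for 31%);
# the difference is rounded to dodge floating-point noise at the band edges.
def bucket_share(preds, league_obp, width):
    return sum(1 for p in preds if abs(round(p - league_obp, 3)) <= width) / len(preds)

# e.g. share within a point of this year's .320 league OBP (the 31%-33% band):
# bucket_share(preds_2023, 0.320, 0.01)
# share five or more points away (predictions display as whole percentages):
# 1 - bucket_share(preds_2023, 0.320, 0.04)
```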
In other words, the new odds displayed on the Apple TV+ broadcasts have solved the main problem that the first-generation odds had. Those original odds were often outlandish – I find it hard to believe that Ji Man Choi had a 45.4% chance of reaching base against Dylan Cease, or that Nick Senzel only stood a 10% chance of getting on against Génesis Cabrera. There just aren’t outliers like that in the 2023 version; the highest number for 0-0 reach base probability I saw was 38%, achieved three times by three very good hitters (Betts, Julio Rodríguez, and Yandy Díaz). The lowest projection was 28% (five times).
If you sand off the insanely rough edges and get rid of the obvious missteps, you’re left with a down-the-middle prediction model. It’s quite similar to my naive model, and as best as I can tell, that’s on purpose. The big swings weren’t working – their model got drubbed by one that didn’t even know which players were on the field. They regrouped and went with something simpler, and the results have definitely improved. Whether you’re measuring by Brier score or performance against my naive model, this year’s iteration of the odds is better than before. In fact, I can say with a fair degree of confidence that they’re about as good as I can do when I don’t pay attention to the players on the field, which is a huge step up from last year, when I was quite confident they were worse than that.
As always, nothing about my study is statistically conclusive. If you had more time and wherewithal than I do, you could test every single Apple TV+ game from this year rather than a subset of 12 games. You could build a fancier model to test it against as well, one that considers player identity or at least handedness. It’s at least possible that my dumb model is just getting lucky over and over again, though I don’t think it’s especially likely. I really do think that putting odds on the screen is cool, particularly when they point out something unexpected. I think RBI odds, which the broadcast has leaned into this year, are a great example, and I’d love to see some testing on how they perform against a well-constructed baseline.
But if you were worried that Apple was coming to take over baseball analysis, I think it’s safe to say that you’ve dodged a bullet. Two years in, the odds they display on their broadcasts are about as good as the model I constructed to test against them that’s purposefully not trying very hard. The odds have gotten better, but they’re not overwhelmingly interesting, and there’s no sign that Apple is taking them in-house to apply their algorithmic expertise to them. I’m a big fan of the aesthetics of the broadcast, and I think its crew does a great job of getting players and managers to answer reasonably interesting questions during the game. I’m just done waiting for them to tell me what the odds of a given event happening are; I think that ship has sailed.
As is customary when I do something like this, you should feel free to peruse the data if that’s up your alley. I can’t guarantee every probability is recorded perfectly; the odds don’t always tick in a timely manner, and watching 12 baseball games while trying to keep your eye on some tiny white numbers in one corner of the screen is hardly a foolproof recording method. I think I’ve done a good job with them, though, and you can find all of my calculations and data here. Finally, here’s a game-by-game breakdown of the Brier score and gambling method results for each game I tracked:
Date | Game | Apple Brier | Naïve Brier | Apple Betting | Naïve Betting | Wrong-Way Errors | Pitches Tracked |
---|---|---|---|---|---|---|---|
8/18 | Yankees-Red Sox | 0.19 | 0.19 | -1.27 | 3.33 | 0 | 189 |
8/18 | Reds-Blue Jays | 0.18 | 0.18 | 5.43 | -1.22 | 1 | 195 |
8/25 | Red Sox-Dodgers | 0.21 | 0.21 | -5.97 | 11.99 | 1 | 182 |
8/25 | Mariners-Royals | 0.18 | 0.17 | -9.12 | 15.53 | 5 | 205 |
9/1 | Mets-Mariners | 0.24 | 0.24 | -3.36 | 6.70 | 3 | 194 |
9/1 | Guardians-Rays | 0.22 | 0.22 | 8.20 | -0.71 | 2 | 155 |
9/8 | Reds-Cardinals | 0.19 | 0.19 | -6.24 | 13.65 | 0 | 186 |
9/8 | Astros-Padres | 0.18 | 0.18 | 17.66 | -9.62 | 2 | 258 |
9/15 | Orioles-Rays | 0.18 | 0.19 | 14.00 | -9.69 | 2 | 115 |
9/15 | Cardinals-Phillies | 0.22 | 0.22 | 13.79 | -5.26 | 7 | 291 |
9/22 | Phillies-Mets | 0.19 | 0.19 | 3.84 | 1.76 | 2 | 201 |
9/22 | Dodgers-Giants | 0.20 | 0.20 | 2.68 | 2.41 | 3 | 197 |
Total | — | 0.198 | 0.20 | 39.62 | 28.87 | 28 | 2368 |
Ben is a writer at FanGraphs. He can be found on Twitter @_Ben_Clemens.