How Good Are Those Probabilities on the Apple TV+ Broadcasts?


As you’re probably aware, Apple TV+ has stepped onto the baseball broadcasting scene this year, airing two games every Friday. They’re stylistically different from your average baseball broadcast, even at a glance. The colors look different, more muted to my eyes than the average broadcast. The score bugs are sleek, the fonts understated. The announcers are mostly new faces. And most interestingly, to me at least, the broadcast displays probabilities on nearly every pitch.

As a big old math nerd, I love probabilities. They appeal to something that feels almost elemental. Every time I watch a baseball game, I wonder how likely the next hitter up is to get a hit – or to reach base, or strike out, or drive in a run. It’s not so much that I want to know the future – probabilities can’t tell you that – but I would like to know whether the outcome I’m hoping for is an uphill battle or a near-certainty, and how the ongoing struggle of pitcher against hitter changes that.

The Apple TV+ broadcasts get those probability numbers from nVenue, a tech startup that got its start in an NBC tech accelerator. According to an interview with CEO Kelly Pracht in SportTechie, the machine learning algorithm at the heart of nVenue’s product considers 120 inputs from the field of play in making each prediction.

Machine learning, if you weren’t aware, is a fancy way of saying “regressions.” It’s more than that, of course, but at its core, machine learning takes sample data and “learns” how to make predictions from that data. Those predictions can then be applied to new, out-of-sample events. Variations in initial conditions produce different predictions, which is why you can think of it as an advanced form of regression analysis; at its most basic, changes in some set of independent variables are used to predict a response variable (or variables).
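To make that concrete, here’s a toy sketch of what “regression-style” prediction looks like in code. This is purely illustrative – it is not nVenue’s model, and the two features and handful of training rows are invented for the example:

```python
# Toy "machine learning as regression": predict the chance a batter reaches
# base from two made-up inputs (balls and strikes in the count).
from sklearn.linear_model import LogisticRegression

X_train = [[0, 0], [1, 0], [3, 1], [2, 0], [0, 2], [1, 2]]  # [balls, strikes]
y_train = [1, 1, 1, 1, 0, 0]                                 # 1 = reached base

model = LogisticRegression().fit(X_train, y_train)
print(model.predict_proba([[2, 0]])[0][1])  # estimated P(reach base) in a 2-0 count
```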

If that all sounds impenetrably math-y to you, well, that’s one of the downsides of the approach. It’s an opaque process, which makes sense: you try using the interaction between more than 100 variables to predict the odds of a player getting on base and see if it’s anything other than confusing. Add to that the fact that companies aren’t exactly handing out the secret sauce driving their predictions, and there’s really only one way to judge algorithm accuracy: by looking at the results.

And for me, the nVenue results have been confusing. Take this example, from that interview with Pracht I referenced earlier:

Pracht’s favorite example of how the algorithms have worked came during the first at-bat of the first game from last year’s World Series. She recalled that Braves designated hitter Jorge Soler opened with a 2% chance of homering against Astros starter Framber Valdez, which increased to 3% after ball one. The second pitch was also a ball, at which point Soler’s homer potential grew to 19%. Soler then slugged a 2-0 fastball into the leftfield seats.

I think about baseball odds professionally. I was following right along on 0-0 and 1-0. But a 19% chance of a home run sounds inconceivable to me.

An individual probabilistic prediction, though, is no way to judge any system. “That doesn’t sound right to me, and I watch baseball” is a weak argument. Even if the face validity of these predictions is low, the predictions could still be good.

Rather than rely on a handful of examples, I did something slightly more involved. First, I recruited a little help from my friends. Ben Lindbergh organized some Effectively Wild listeners to help me chart the probabilities shown on every pitch of each Apple TV+ game of the year. Then I set out to test those probabilities.

For every pitch where the broadcast recorded a probability, we noted the last probability that was shown before the pitch was thrown – both the type and the likelihood. In some cases, the broadcast showed multiple statistics before the pitch was thrown, but we recorded only the last reading in each case. Then I noted whether the result had happened or not – as all the predictions are binary, a simple 1 or 0 sufficed. In cases where the play ended without a satisfactory answer to the probability in question – when a runner was thrown out to end a half-inning mid-plate appearance, for example – I threw out the predictions.

This gave me what we in the data industry refer to as “a big honking sample.” For reasons that will become apparent, I skipped the two April 8 games, but that still gave me 12 games, and thousands of pitches. As an example, let’s take Marcus Semien’s at-bat in the top of the first inning last Friday night. When he came to the plate, the broadcast listed him with a 22% chance of reaching base. After he took a strike, that number increased to 31%, then 32% after he got down 0-2. He then took a ball (30%), another ball (18%), and a foul (20%) before striking out on a foul tip. I recorded each of those counts and probabilities, as well as a 0 for the outcome – he didn’t reach base.
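For anyone curious about the shape of the charted data, here’s roughly how that Semien plate appearance looks once recorded – the field names are my own, not anything official:

```python
# Each row holds the last probability shown before a pitch, plus a 0/1 for
# whether the predicted event ultimately happened in that plate appearance.
semien_pa = [
    {"count": "0-0", "stat": "reach_base", "prob": 0.22, "outcome": 0},
    {"count": "0-1", "stat": "reach_base", "prob": 0.31, "outcome": 0},
    {"count": "0-2", "stat": "reach_base", "prob": 0.32, "outcome": 0},
    {"count": "1-2", "stat": "reach_base", "prob": 0.30, "outcome": 0},
    {"count": "2-2", "stat": "reach_base", "prob": 0.18, "outcome": 0},
    {"count": "2-2", "stat": "reach_base", "prob": 0.20, "outcome": 0},
]
```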

In the 12 games I charted, that gave me 2,575 observations in 10 categories: strikeout, walk, reach base, hit, out, in-play out, GIDP, RBI, extra-base hit, and home run. Some were used sparingly; extra-base hits only showed up for a few games, and in-play out was only used for one pitch.

With all the data painstakingly recorded, all that remained was to test it. To do this, I needed some control predictions. After all, I can tell you the Brier score – a measure of probabilistic prediction accuracy – of the sample set as a whole is 0.196, where 0 is the best possible score and 1 is the worst. That’s meaningless without context, though; if you can’t compare one set of predictions to another, that Brier score is just a number in space.
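For the record, the Brier score is just the mean squared difference between each predicted probability and the 0/1 outcome. A minimal sketch:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    0 is a perfect score; 1 is the worst possible."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# The six reach-base probabilities from the Semien at-bat above, none of which panned out:
print(brier_score([0.22, 0.31, 0.32, 0.30, 0.18, 0.20], [0, 0, 0, 0, 0, 0]))  # ~0.068
```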

The nVenue model uses upwards of 120 inputs. I decided to use exactly one: the count. For each game, I took a snapshot of the major league average rate of each result after the relevant count through the prior day. For the games of April 15, for example, I calculated the odds of each outcome using all major league plate appearances through April 14. That particular snapshot looks like this (a sketch of how you might build such a lookup follows the table):

Odds of Outcome By Count, Through 4/14/22

| Count | K | Reach | Walk | Hit | Out | XBH | HR |
|-------|-------|-------|-------|-------|-------|------|------|
| 0-0 | 23.0% | 31.3% | 9.2% | 20.7% | 68.7% | 7.4% | 2.8% |
| 0-1 | 31.0% | 25.9% | 5.3% | 19.1% | 74.1% | 5.9% | 2.2% |
| 0-2 | 47.1% | 18.5% | 2.7% | 14.6% | 81.5% | 4.2% | 1.1% |
| 1-0 | 18.9% | 38.1% | 16.6% | 20.3% | 61.9% | 8.4% | 3.3% |
| 1-1 | 27.2% | 30.9% | 10.9% | 18.8% | 69.1% | 6.5% | 2.5% |
| 1-2 | 44.2% | 21.9% | 6.7% | 14.1% | 78.1% | 4.4% | 1.5% |
| 2-0 | 15.5% | 48.4% | 30.7% | 16.8% | 51.6% | 7.3% | 3.0% |
| 2-1 | 23.4% | 39.6% | 20.3% | 18.4% | 60.4% | 7.1% | 2.7% |
| 2-2 | 39.0% | 27.7% | 13.3% | 13.7% | 72.3% | 4.9% | 1.8% |
| 3-0 | 7.0% | 72.9% | 63.2% | 9.7% | 27.1% | 4.3% | 2.1% |
| 3-1 | 13.1% | 61.1% | 47.3% | 13.4% | 38.9% | 6.3% | 2.6% |
| 3-2 | 27.9% | 46.1% | 32.6% | 13.1% | 53.9% | 5.0% | 1.8% |
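Building that lookup is straightforward, by the way. Here’s a minimal sketch, assuming one row per count that a plate appearance passed through, with 0/1 flags for how that plate appearance eventually ended (the field names are hypothetical):

```python
from collections import defaultdict

OUTCOMES = ("K", "reach", "walk", "hit", "out", "xbh", "hr")

def odds_by_count(rows):
    """rows: dicts like {"count": "1-0", "K": 0, "reach": 1, "walk": 0,
    "hit": 1, "out": 0, "xbh": 0, "hr": 0}, one per count a PA reached.
    Returns {count: {outcome: average rate after that count}}."""
    sums = defaultdict(lambda: defaultdict(float))
    n = defaultdict(int)
    for row in rows:
        n[row["count"]] += 1
        for outcome in OUTCOMES:
            sums[row["count"]][outcome] += row[outcome]
    return {c: {o: sums[c][o] / n[c] for o in OUTCOMES} for c in n}
```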

I used this to make my own predictions for reaching base, getting a hit, walking, striking out, making an out, hitting a home run, and hitting an extra-base hit. Double plays and RBIs don’t work with my one-factor model – count isn’t sufficient, as you need base/out state as well – so I simply didn’t test the accuracy of those predictions. The same goes for that single in-play out prediction; I tossed it out. That’s also why I didn’t use data from the games of April 8; there wasn’t enough major league data to work with after only one day of games.

That left me with a set of 2,075 pitches where I had a naive prediction (how all major leaguers have done after that count in 2022) to test against the nVenue prediction. Let me be quite frank about my “model”: it’s clearly terrible. It doesn’t take enough into account. Do you think the chances of Mike Trout getting a hit against Joe Bullpenshuttle are the same as those of a backup catcher getting a hit against Gerrit Cole? They’re obviously not! Do you think that Maikel Franco is as likely to walk with the bases loaded as Juan Soto is with first base open? Again, no. A purely count-based prediction is obviously flawed, but I think it’s a good baseline.

How did the two models do? On the pitches where they both made predictions, the naive model slightly outperformed nVenue’s predictions, putting up a Brier score of 0.218 as compared to 0.226 for nVenue. The naive count-based prediction put up a lower (superior) Brier score in eight contests and finished tied in another. The on-screen odds did better only three times.

Brier scores penalize overconfidence, and I wanted a metric that was confidence-neutral, so I devised another test. I let each model gamble against the other. For every pitch where they both made predictions, I made each model make a “bet” based on the probability the other model gave for the target outcome.

That’s a mouthful, so let’s take an example. Let’s say, for the sake of argument, that the count-based model gave a batter a 30% chance of reaching base. If the nVenue probability was higher than that, I had nVenue bet “reach base.” If it was lower, it bet “don’t reach base.” The payout is set by the other model’s odds: if nVenue bets on reaching base, it wins 0.7 (that is, 1-0.3) if the player reaches base and loses 0.3 if he doesn’t. It works the same way on every pitch and in both directions; betting on an outcome pays one minus the posted odds if it happens and loses the odds if it doesn’t, while betting against it pays the odds if it doesn’t happen and loses one minus the odds if it does.
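Here’s a minimal sketch of that payout rule as I’ve described it – this is just my test harness, not anything from nVenue, and the function name is my own:

```python
def bet_payout(bettor_prob, line_prob, outcome):
    """One pitch of the head-to-head 'gambling' test.
    line_prob:   the probability the other model posted, which sets the odds.
    bettor_prob: this model's own probability, which decides which side to take.
    outcome:     1 if the event happened, 0 if it didn't."""
    if bettor_prob > line_prob:            # bet that the event happens
        return 1 - line_prob if outcome else -line_prob
    else:                                  # bet that it doesn't happen
        return line_prob if not outcome else -(1 - line_prob)

# Example: a model that says 19% betting against a 3% line, on a pitch where the event happens.
print(bet_payout(0.19, 0.03, 1))  # the bettor takes "it happens" and wins 0.97
```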

You could rinse my model by gambling against it this way. All you have to do is bet on good outcomes when the batter/pitcher matchup favors the hitter (a platoon advantage, say, or a great hitter against a bad pitcher) and vice versa when the opposite is true. It should be incredibly easy to beat this count-only prediction.

In practice, the count-only prediction drubbed nVenue’s displayed probabilities. Over those 2,075 pitches, the nVenue predictions did slightly worse than breakeven when gambling against the count-only lines, losing 11 units across the entire set. When the count-only predictions got to gamble against nVenue’s posted numbers, on the other hand, they racked up 134.8 units.

You might wonder why the two numbers don’t add up to zero, but that’s normal in this style of test. There’s no constraint that makes them sum to zero; the two “gambling” runs use different odds, since each set of predictions uses its counterpart to set the line. More important than the exact numbers is that the posted probabilities were beaten by my extremely simple one-factor estimation, and beaten both consistently and by a wide margin. In their description of the odds, nVenue touted “15,000 ways to bet on baseball,” but I’m skeptical that any of them involved betting against a count-only model and losing; perhaps the gambling advice part of the model isn’t ready for prime time.

Why is that the case? One reason is that the posted odds make some obvious missteps. Take that Semien at-bat I mentioned above. Twice, he took a strike only to see his chances of reaching base increase. Twice, he took a ball, only to see his chances of reaching base decrease. Those moves shouldn’t happen. That’s not true for every possible outcome – hits, RBIs, and home runs aren’t so cut and dried, since your odds of getting a hit (hits divided by total plate appearances) decline as a walk becomes more likely, and vice versa – but for strikeouts, walks, odds of reaching base, and odds of making an out, the odds simply shouldn’t tick the “wrong” way.

That’s not to say there’s no interesting information in these probabilities; given how he’s hitting, it isn’t necessarily wrong to peg Semien as being less likely to reach base than an average hitter before the at-bat started. But despite correctly shading Semien’s odds of reaching base down in that 0-0 count, the six probabilities nVenue displayed over the course of his at-bat came out roughly even against the count-only model, thanks to the unintuitive probability moves as he took strikes early in the count.

In general, it seems to me that the on-screen odds suffer from over-fitting. Even if you’re not a statistical sort, you can implicitly understand this from watching a few baseball broadcasts over the years. If the screen displays that a hitter is 5-for-7 on Fridays against opposite-handed pitching in the sixth inning or later, you rightly say, “yeah, that doesn’t sound like it matters.” A poorly calibrated model might not be so skeptical, though. It doesn’t have to be anything that specific – but overfitting is a risk whenever you train a model on past data, and my guess from the outside looking in is that that’s the culprit here.

(Following Kelly Pracht’s appearance on Effectively Wild yesterday, during which Ben Lindbergh mentioned the results of my analysis, we reached out to nVenue for further comment. They responded by saying, “We all know that in sports, player averages can’t paint the whole story. nVenue believes in going beyond the average to generate predictions for each and every individual match-up and situation. Our team has run millions of regression data points outside of the twelve baseball games aired during Friday Night Baseball that have been included in this study. Our studies validate that our data is more relevant and accurate than an average. We love talking data, especially around baseball. We look forward to reviewing any studies as we prepare to release our own in the future.”)

Of note, the on-screen odds seem to have tightened up over time. Half of the naive prediction’s “gambling” gains came in the first four games. In those four games, the on-screen odds made what I’ll call “wrong-way errors” – probability moves that go against the count – 101 times. There were only 115 such strange odds moves across the subsequent eight games, a far lower per-game rate. The count-only predictions still had a superior Brier score and superior gambling results in those last eight games, but it was far closer. You can see the game-by-game comparison of the two “models” – nVenue’s and my count-based one – here:

Game-by-Game Model Comparison

| Date | Game | Apple Brier | Naïve Brier | Apple Betting | Naïve Betting | Wrong-Way Errors | Pitches Tracked |
|------|------|-------------|-------------|---------------|---------------|------------------|-----------------|
| 4/15 | Reds-Dodgers | 0.216 | 0.184 | -9.2 | 21.4 | 23 | 135 |
| 4/15 | White Sox-Rays | 0.176 | 0.143 | -8.2 | 25.0 | 25 | 177 |
| 4/22 | Rangers-A’s | 0.165 | 0.142 | -0.2 | 19.4 | 21 | 159 |
| 4/22 | Cards-Reds | 0.247 | 0.236 | 4.8 | 10.6 | 32 | 162 |
| 4/29 | Giants-Nats | 0.210 | 0.206 | 0.0 | 9.2 | 18 | 193 |
| 4/29 | Yankees-Royals | 0.319 | 0.319 | -3.7 | 2.9 | 17 | 233 |
| 5/6 | Red Sox-White Sox | 0.249 | 0.254 | 6.6 | 2.2 | 16 | 170 |
| 5/6 | Rays-Mariners | 0.252 | 0.235 | -14.1 | 26.6 | 6 | 197 |
| 5/13 | Padres-Braves | 0.258 | 0.255 | 1.0 | 4.6 | 15 | 184 |
| 5/13 | Cubs-Dbacks | 0.244 | 0.245 | 7.2 | -0.7 | 13 | 129 |
| 5/20 | Cards-Pirates | 0.164 | 0.155 | -5.3 | 13.1 | 15 | 171 |
| 5/20 | Rangers-Astros | 0.177 | 0.178 | 10.1 | 0.6 | 15 | 165 |
| Total | | 0.226 | 0.216 | -11.0 | 134.8 | 216 | 2,075 |

These tests aren’t conclusive in any rigorous, scientific sense. There’s some serial correlation between our observations, since the system takes multiple readings of the same plate appearance. But even if we limit the test to 0-0 counts, where the dummy predictions reflect the league-average rate of every outcome with no further information, the count-only predictions have a lower Brier score and positive gambling returns when compared to nVenue’s predictions.

I should reiterate: my count-only “model” is terrible. It’s so bad! Do not use it to predict things. You could vastly improve on it by adding more variables. Maybe not over 100 – the only model with that many variables I’ve seen underperformed my one-factor model in the tests I just described – but I’m not claiming any special expertise in predicting baseball here. I am, instead, claiming that the predictions shown on these broadcasts every Friday would lose money if they gambled against my objectively bad predictions. They’re flawed, perhaps past the point of usability.

That doesn’t mean there’s nothing cool in them. In that same Rangers-Astros game, Kole Calhoun stepped to the plate in a 1-1 tie in the top of the fourth. The broadcast displayed an 8% chance that he would hit a home run, more than triple the rate at which the league hits home runs on a per-PA basis. He promptly bopped the first pitch out. A power hitter at the plate, homer-prone Cristian Javier on the mound, Minute Maid Park as the venue: the odds really should have been higher than average, and I think insights like that are undeniably interesting.

At present, though, these odds are worse than not seeing odds on screen, at least as far as I’m concerned. I wish more people thought about baseball probabilistically, but having clearly inaccurate odds – Marcus Semien isn’t more likely to reach base on 0-2 than he is on 0-0, no matter what the screen says – could result in people trusting odds less, not more. Perhaps there are some more cool insights to be mined from this complex model, but for now, I think that showing these odds during broadcasts is doing viewers a disservice.

A huge thanks to Ben Lindbergh, Megan Schink, Zander Stroud, Kevin Arrow, and Lucy Bloom for their help charting these games. You can find all the data used in this article here.





Ben is a writer at FanGraphs. He can be found on Twitter @_Ben_Clemens.
