Fun With Playoff Odds Modeling

Author’s note: “Five Things I Liked (Or Didn’t Like) This Week” is taking a short break, but will return next Friday for the end of the regular season.
Earlier this week, I did the sabermetric equivalent of eating my vegetables by testing the accuracy of our playoff odds projections. I found that our odds do a pretty good job of beating season-to-date odds (particularly late) and pure randomness (particularly early, everything does pretty well late). It’s good to intermittently check in on the accuracy of our predictions. It’s also helpful to build a baseline as a benchmark to measure future changes or updates against.
Those are a bunch of solid, workmanlike reasons to write a measured, lengthy article. But boring! Who likes veggies? I want to beat the odds, and I want to flex a little mathematical muscle while doing it. So I goofed around with a computer program and tried to find ways to slice up and recombine our existing numbers into improved odds. It didn’t break the game wide open or anything, but I’m going to talk about my attempts anyway, because it’s September 19, there aren’t many playoff races going on, and you can only write so many articles about whether the Mets will collapse or whether Cal Raleigh will hit 60 dingers.
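One piece of housekeeping before the goofing around starts: the accuracy measure I lean on throughout is the Brier score, which is just the mean squared error between the probabilities you published and the 1-or-0 playoff outcomes that actually happened. Here’s a minimal sketch in Python, with made-up numbers for illustration:

```python
import numpy as np

def brier_score(predicted_odds, made_playoffs):
    """Mean squared error between predicted playoff probabilities (0-1)
    and eventual outcomes (1 if the team made the playoffs, else 0)."""
    predicted_odds = np.asarray(predicted_odds, dtype=float)
    made_playoffs = np.asarray(made_playoffs, dtype=float)
    return np.mean((predicted_odds - made_playoffs) ** 2)

# Made-up example: three teams' April odds and what eventually happened.
print(brier_score([0.95, 0.40, 0.05], [1, 0, 0]))  # ~0.055; lower is better
```

Lower is better, and for scale, a model that shrugs and says 50% for every team would score 0.25.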
What if you just penalized extreme values?
I first tried to correct for the fact that early-season projection-based odds (which I’m calling FanGraphs mode for the rest of the article) seem to be too confident and thus prone to large misses. I did so by applying a mean reversion factor that pulled every team’s values toward the league-wide average playoff chances (i.e. the share of teams that made the playoffs that year). That target varies with the playoff format; we have 16-team, 12-team, and 10-team playoff fields in the data, and I adjusted each appropriately. I set the mean reversion factor so that it was strong early in the year and decayed to zero by the end of the season.
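In code, the reversion step looks something like the sketch below. This is a hedged approximation rather than my exact implementation; the 0.3 starting strength and the linear decay schedule are stand-ins for illustration.

```python
def revert_toward_league_mean(team_odds, playoff_spots, n_teams, frac_of_season_played):
    """Pull each team's playoff probability toward the league-wide average
    (playoff spots divided by teams), with a strength that is high early in
    the year and decays to zero by season's end. The 0.3 starting strength
    and the linear decay are illustrative assumptions, not fitted values."""
    league_mean = playoff_spots / n_teams                      # e.g. 12 / 30 = 0.4 today
    strength = max(0.0, 0.3 * (1.0 - frac_of_season_played))   # strong in April, zero in October
    return [(1 - strength) * p + strength * league_mean for p in team_odds]

# Early April, roughly 10% of the season played: a 95% team gets nudged toward 40%.
print(revert_toward_league_mean([0.95, 0.10], 12, 30, 0.10))   # ~[0.80, 0.18]
```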
This did nothing, basically. More specifically, applying a variable reversion factor by month didn’t improve the Brier score of either FanGraphs mode or season-to-date mode odds. I also tried Winsorizing the odds, or in other words applying a cap and floor and clipping any value above the cap or below the floor back to those levels. This did nothing either. I first tried a fixed cap and floor, then had the computer set Brier-minimizing caps and floors for each month. Neither worked; in fact, when I asked the computer for the “optimal” Winsorization bands, it returned, “Don’t give me a cap or floor at all” for every month except April, where it settled on 2%/98% bands, or in other words basically nothing.
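For completeness, the Winsorizing mechanic is about as simple as it sounds. Here’s a sketch, with the 2%/98% April bands as defaults only because those were the lone non-trivial values the search spat out.

```python
import numpy as np

def winsorize_odds(odds, floor=0.02, cap=0.98):
    """Clip playoff probabilities to [floor, cap]. The 'optimal' bands the
    search returned were basically no bands at all, but this is the mechanic."""
    return np.clip(np.asarray(odds, dtype=float), floor, cap)

print(winsorize_odds([0.004, 0.55, 0.995]))  # -> [0.02 0.55 0.98]
```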
Quite frankly, I didn’t expect this to work. Squeezing everything in doesn’t improve accuracy; it makes some predictions better but others worse. You need to do some kind of targeted squeezing to make a difference, but the Winsor test showed that the optimal cap and floor was, more or less, no cap and floor. Oh well.
What if you blended them?
Fine, squeezing the values doesn’t do anything. What about setting a month-by-month blend of FanGraphs mode odds and season-to-date mode odds, and using that blend to take the best of each model when it’s at its most useful? To do this, I told the computer to calculate the Brier-minimizing weight for each month by looking at the data and finding the mix of odds that produced the lowest Brier score in that sample. I also told it not to cheat – or, more specifically, I told it that it wasn’t allowed to look at future data when setting each year’s blend. When evaluating 2019, for example, it set the weights based on the 2014-2018 seasons.
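The weight-fitting itself is nothing fancy: grid-search the FanGraphs weight that minimizes the Brier score, and only ever fit on seasons that came before the one being scored. Here’s a sketch of the idea, with a hypothetical samples_by_year layout standing in for my actual data handling:

```python
import numpy as np

def best_blend_weight(fg_odds, std_odds, outcomes, grid=np.linspace(0, 1, 101)):
    """Grid-search the FanGraphs weight w that minimizes the Brier score of
    w * FanGraphs + (1 - w) * season-to-date over the supplied sample."""
    fg, std, y = (np.asarray(a, dtype=float) for a in (fg_odds, std_odds, outcomes))
    briers = [np.mean((w * fg + (1 - w) * std - y) ** 2) for w in grid]
    return grid[int(np.argmin(briers))]

def walk_forward_weights(samples_by_year):
    """samples_by_year[year] -> (fg, std, outcomes) arrays for one month of one
    season (hypothetical layout). Each year's weight is fit only on the years
    before it, so nobody gets to peek at the future."""
    years = sorted(samples_by_year)
    weights = {}
    for i, year in enumerate(years[1:], start=1):
        fg = np.concatenate([samples_by_year[y][0] for y in years[:i]])
        std = np.concatenate([samples_by_year[y][1] for y in years[:i]])
        out = np.concatenate([samples_by_year[y][2] for y in years[:i]])
        weights[year] = best_blend_weight(fg, std, out)
    return weights
```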
This worked pretty well, as it turns out. The Brier-minimizing weights start with slightly more FanGraphs than season-to-date, then ramp up as the year wears on. I tried both a regular version (calculate each month and use that value without adjustment) and a smoothed version that forces a smoother change in blend as the year goes on. Both of them did quite well, particularly in March/April and May. They reduced the mean squared error by 5% relative to FanGraphs odds in March/April and by 3% in May. By the end of the year, though, things get weird: The un-smoothed version uses 100% FanGraphs odds. Here are those weights, which worked out to a 2% decrease in mean squared error overall:
Month | FanGraphs Weight (Un-Smoothed) | FanGraphs Weight (Smoothed) |
---|---|---|
March/April | 0.539 | 0.633 |
May | 0.619 | 0.664 |
June | 0.734 | 0.700 |
July | 0.700 | 0.681 |
August | 1.000 | 0.780 |
September/October | 0.996 | 0.764 |
If you want a handy rule of thumb that will do ever so slightly better than the raw FanGraphs odds, start with roughly a 60% FanGraphs, 40% season-to-date blend, then increase the FanGraphs share toward 100% as the year wears on. The gains are pretty small, because it’s just inherently difficult to have small Brier scores with so much uncertainty left in the season, but small gains are the name of the game in this article.
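If you’d rather skip the optimizer entirely, that rule of thumb translates to something like the snippet below. The exact ramp is my paraphrase of the paragraph above, starting at 60% and ending at 100%, not the fitted weights from the table.

```python
# FanGraphs weight by month, paraphrasing the rule of thumb rather than the fitted table.
RULE_OF_THUMB_FG_WEIGHT = {
    "Mar/Apr": 0.60, "May": 0.65, "Jun": 0.70,
    "Jul": 0.75, "Aug": 0.85, "Sep/Oct": 1.00,
}

def blended_odds(fg, std, month):
    """Blend FanGraphs mode and season-to-date mode odds using the rough schedule above."""
    w = RULE_OF_THUMB_FG_WEIGHT[month]
    return w * fg + (1 - w) * std

print(blended_odds(0.70, 0.50, "Mar/Apr"))  # 0.62
```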
What if you used Bayesian inference?
We love Bayes here at FanGraphs. Bayesian inference means adjusting your prior expectation based on evidence, and adjusting it by different amounts depending on what the evidence says. Broadly speaking, it asks how likely it is that we’d see the observed result (say, a team playing .500 baseball for a month) given our prior expectation (let’s say we had them as a .560 team), and adjusts our new expectation using that evidence.
This one isn’t quite so simple as, “Use two-thirds FanGraphs odds early.” The rule varies its weights based on how many games have been played — the White Sox were 1-0 this year, but that doesn’t mean we were way off in our estimation of their skill — and on how big the disagreement is. If our odds said a team should play .530 ball and they’ve played to a .525 winning percentage through two months, my confidence in our projections increases. If they’ve played to a .510 winning percentage, I’m slightly more skeptical. If they’ve played to a .630 winning percentage, uh, maybe we got something wrong. I also had to take into account the rest of the league’s play, because if your team starts right as expected but other teams don’t, that might change our opinion of you as well. And I added some terms to ensure that the league’s odds still summed to the number of available playoff spots after all of that, because that, too, is important.
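The real adjustment involves more bookkeeping than I want to print, but the core update is in the spirit of a beta-binomial model: treat the preseason projection as a prior worth some number of pseudo-games, then fold in the actual record. The 300 pseudo-games below is an assumption for illustration, not a value my program landed on.

```python
def updated_win_pct(prior_win_pct, prior_games, wins, losses):
    """Beta-binomial style update: treat the projection as if it were worth
    `prior_games` of evidence, then fold in the observed record. Illustrative
    only; the real adjustment also looks at the rest of the league and
    renormalizes the playoff odds afterward."""
    pseudo_wins = prior_win_pct * prior_games
    return (pseudo_wins + wins) / (prior_games + wins + losses)

# A projected .560 team plays .500 ball for a month (13-13): the prior barely budges.
print(round(updated_win_pct(0.560, 300, 13, 13), 3))   # 0.555
```

The fewer pseudo-games you grant the prior, the more the observed record drags the estimate around, and that’s exactly the knob the out-of-sample fitting gets to turn.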
In any case, I had the computer do it, and again I told the computer not to cheat. (Side note: It’s a little more complicated than that in practice, but ensuring that your experimental design doesn’t cheat is half the difficulty of getting good answers out of models.) The weights and values used to make each Bayesian prediction were selected using only backward-looking data, and I measured the results out of sample. In other words, my evaluation used 2014-2017 data to produce adjusted odds for the 2018 season, then measured those odds’ accuracy. Then it used 2014-2018 data to produce adjusted odds for the 2019 season, and so on. The simplest way to think about this is that we put a lot of weight on our existing model, but when it’s proven very wrong early, we defer slightly to what the real world is telling us.
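The evaluation loop itself is the standard expanding-window setup. Here’s a sketch; fit, predict, and the seasons layout are hypothetical stand-ins for however you store your odds and outcomes.

```python
import numpy as np

def expanding_window_eval(seasons, fit, predict):
    """Walk-forward evaluation: fit on all strictly earlier seasons, then score
    the next one, so no season's parameters ever see that season's results.
    `seasons[year]` is a hypothetical data bundle containing, among other
    things, a `made_playoffs` array of 0/1 outcomes; `fit` and `predict`
    stand in for whatever model is under test."""
    years = sorted(seasons)
    scores = {}
    for i in range(1, len(years)):
        params = fit({y: seasons[y] for y in years[:i]})   # e.g. train on 2014-2017 ...
        test = seasons[years[i]]                           # ... to score 2018
        preds = np.asarray(predict(params, test), dtype=float)
        outcomes = np.asarray(test["made_playoffs"], dtype=float)
        scores[years[i]] = np.mean((preds - outcomes) ** 2)
    return scores
```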
The Bayesian method did incredibly well at the beginning of the year. It’s better than any of our models, better than any blend of our models even. That’s because it can “choose” how much to listen to the FanGraphs odds based on how closely they’re hewing to what’s going on on the field. That’s less useful later in the season, of course; as we’ve seen, FanGraphs odds do better than odds based on season-to-date play by the time half the year is in the books. In March and April, though, our assumptions are more likely to be wrong, so a bit of Bayesian reasoning helps.
When I told the computer to break things up month by month and come up with Bayesian weights, it did OK, reducing mean squared error by roughly one percent for the season as a whole. That’s disappointing, and it’s needlessly so, because the Bayesian method does worse than the FanGraphs-only method by the time August rolls around. It’s trying its hardest, but it’s a simple rule that doesn’t know anything about how each model performs over the course of the season. All it cares about is the divergence between projected winning percentage and season-to-date record, and as we know, that matters quite a bit less by year’s end.
To fix that, I hacked things up. Is this good science? I’m not sure. But by telling my computer program that it should use a Bayesian method for the first half of the season, then the FanGraphs method after the All-Star break, I got our best answers yet, shaving a further two percent worth of mean squared error off of my previous best. Again, interpreting a Brier score on its own is difficult, but I got my hybrid Bayesian model down below 0.11. FanGraphs mode checked in at 0.118 without modification, and that was the best individual mode.
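The “hack” really is just a date check, something like the function below, with the halfway-point cutoff as my shorthand for the All-Star break.

```python
def hybrid_odds(fg_odds, bayes_odds, frac_of_season_played):
    """First half of the season: use the Bayesian-adjusted odds. After the
    (approximate) All-Star break cutoff: trust FanGraphs mode outright."""
    return bayes_odds if frac_of_season_played < 0.5 else fg_odds
```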
Just for fun, I took that modified Bayesian model and asked it to compute playoff odds as of April 30, 2025. I picked a date in the past because it wouldn’t be all that interesting to check today (the model just uses FanGraphs mode this late in the year), and April is the month that sees the largest changes. Here’s how this version differs from our FanGraphs odds on that date:
Team | FG Odds | Bayes Odds | Difference |
---|---|---|---|
Dodgers | 98.2% | 87.1% | -11.0% |
Mets | 86.7% | 86.9% | 0.3% |
Yankees | 82.4% | 84.7% | 2.3% |
Tigers | 78.3% | 82.5% | 4.2% |
Mariners | 74.8% | 75.3% | 0.5% |
Cubs | 67.3% | 69.2% | 1.9% |
Astros | 58.6% | 61.7% | 3.1% |
Padres | 50.2% | 58.8% | 8.6% |
Phillies | 67.8% | 58.3% | -9.5% |
Red Sox | 59.0% | 56.0% | -3.0% |
Braves | 68.8% | 50.7% | -18.1% |
Giants | 46.9% | 50.1% | 3.1% |
Diamondbacks | 54.3% | 49.6% | -4.8% |
Rangers | 49.4% | 43.4% | -6.0% |
Royals | 37.6% | 38.4% | 0.7% |
Brewers | 28.4% | 38.3% | 10.0% |
Twins | 36.7% | 35.6% | -1.0% |
Guardians | 34.2% | 33.2% | -1.0% |
Rays | 25.3% | 32.0% | 6.7% |
Reds | 12.0% | 23.6% | 11.6% |
Blue Jays | 27.1% | 21.0% | -6.1% |
Cardinals | 13.5% | 19.2% | 5.7% |
Athletics | 18.9% | 18.7% | -0.2% |
Orioles | 15.5% | 13.3% | -2.2% |
Angels | 2.3% | 3.9% | 1.6% |
Pirates | 4.7% | 3.8% | -0.9% |
Nationals | 0.9% | 2.4% | 1.4% |
Marlins | 0.3% | 1.2% | 0.8% |
White Sox | 0.0% | 0.8% | 0.8% |
Rockies | 0.0% | 0.0% | 0.0% |
It’s not perfect, but the biggest adjustments – Braves down, Reds up, Dodgers down, Brewers up – do a pretty good job of capturing the kind of information I’d want a model to pick up in April. Sure, the Dodgers started well, but so did everyone in their division, and the playoff picture looked slightly more competitive than expected. The Braves started poorly. The entire NL Central looked good.
The Bayesian version isn’t without its misses, because the season-to-date method isn’t without its misses. Here’s the same snapshot on May 31:
Team | FG Odds | Bayes Odds | Difference |
---|---|---|---|
Tigers | 94.6% | 95.9% | 1.2% |
Yankees | 97.2% | 94.8% | -2.5% |
Dodgers | 98.4% | 94.0% | -4.4% |
Cubs | 83.5% | 85.5% | 2.0% |
Mets | 84.2% | 84.7% | 0.5% |
Phillies | 90.5% | 72.8% | -17.8% |
Twins | 66.9% | 68.0% | 1.1% |
Astros | 67.0% | 63.7% | -3.3% |
Mariners | 69.1% | 61.8% | -7.3% |
Rays | 36.5% | 55.5% | 19.0% |
Cardinals | 42.7% | 54.7% | 12.0% |
Giants | 43.2% | 53.7% | 10.5% |
Padres | 43.1% | 41.5% | -1.6% |
Braves | 56.3% | 40.1% | -16.2% |
Royals | 42.2% | 37.0% | -5.2% |
Guardians | 44.5% | 35.7% | -8.8% |
Blue Jays | 40.5% | 32.1% | -8.4% |
Rangers | 21.5% | 27.5% | 6.0% |
Brewers | 21.9% | 25.7% | 3.8% |
Red Sox | 15.3% | 22.0% | 6.6% |
Reds | 5.5% | 20.0% | 14.5% |
Diamondbacks | 27.7% | 19.5% | -8.2% |
Nationals | 2.6% | 5.2% | 2.6% |
Angels | 1.8% | 4.4% | 2.5% |
Orioles | 1.8% | 0.8% | -1.0% |
Athletics | 0.8% | 0.7% | -0.1% |
Marlins | 0.1% | 0.5% | 0.4% |
Pirates | 0.3% | 0.4% | 0.1% |
White Sox | 0.0% | 0.3% | 0.3% |
Rockies | 0.0% | 0.0% | 0.0% |
A little too high on the Rays and Cardinals in retrospect, and too low on the Phillies and Blue Jays. But the general moves the Bayes model is making – shading teams whose performance and divisional position don’t match our preseason expectations – make a lot of sense to me.
Why show you all this? It’s to set down a performance marker. This is how good I can get the existing odds to be, using all the statistical methods I’ve picked up over the years, plus a few new ones I learned from my trusty AI assistant while researching this article. Now that I have a good testing regime set up and an idea of how far I can optimize a given set of odds using statistical techniques, I have a target for any future playoff odds calculations to be tested against. All I need is a time series of their predictions, and they can slot right into the existing testing framework.
Could that be other sites’ odds? Sure, I suppose, though I don’t have a strong desire to play internet baseball referee. More likely, it’ll be future versions of our own odds, like the various new ones we’ve added to the playoff odds page in the last year, or a fully operational version of our depth-aware methodology. Regardless of what the future holds, though, I’m confident that my current blending of our existing methodologies is about as good as I can do – and I’m also confident that the Bayes-aware version, while complex, does a good job of picking up some of the early-season slack without giving up the late-season excellence of projection-based modeling.
Ben is a writer at FanGraphs. He can be found on Bluesky @benclemens.