Checking in on Pythagoras


This June 25, the Dodgers and Tigers both played their 81st game of the season. Both teams finished the day 50-31, sharing the best winning percentage in baseball at .617. The Tigers got there with a slightly better run differential, though; their Pythagorean winning percentage was a cool .608, while the Dodgers checked in at .595. Pythagorean record is the winning percentage implied by runs scored and allowed, and it’s broadly regarded as a more stable measure of talent than simple wins and losses. Since that day, though, the Tigers have gone 35-40 (.467, with a .483 Pythag), while the Dodgers have gone 38-37 (.507, with a .556 Pythag).
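
For reference, here’s the formula itself, as a minimal sketch. The textbook version uses an exponent of 2, but the per-game figures quoted later in this piece (.985 for a 10-1 win, .996 for a 20-1 win) line up with an exponent of 1.83, the one Baseball-Reference uses, so that’s the default below:

```python
def pythag_win_pct(runs_scored: float, runs_allowed: float, exponent: float = 1.83) -> float:
    """Winning percentage implied by runs scored and allowed."""
    return runs_scored**exponent / (runs_scored**exponent + runs_allowed**exponent)
```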

I’m bringing this up – last data project for a while, incidentally; I just had a bunch of things in my queue and couldn’t resist tackling them all – because “how good is that team, anyway?” has been a hot topic this year, given the various surprising teams who have, at times, taken up the mantle of “hottest in baseball.” Versions of this question – “This team is doing well/poorly now, what does that mean for next month?” – have been both interesting and top of mind in 2025. The Tigers and Brewers played so well for so long that they each crashed the best-team-in-baseball debate. The Mets did their hot-and-cold thing. The Dodgers have endured multiple fallow stretches. Sometimes, teams felt like they were getting very lucky or unlucky relative to their run differential. But what does any of that even mean?

I highlighted the midpoint of the season because it fits into my experimental method. I was interested in answering this specific question: If we stop at the halfway point of each season and consider a team’s actual record and Pythagorean expectation, which does a better job of predicting its record in the second half? I took every game from 2010 through 2024 and used that to construct each team’s record and Pythagorean record at the halfway point. I treated each of those as an estimate of second-half record. Then I measured three things, all related: 1) the correlation between first-half record of the selected type (actual or Pythagorean) and actual record in the second half, 2) the root mean squared error of each option, and 3) the Brier score of using first-half metrics to predict second-half record.

If you’re well-versed statistically or just read my last investigation, you know that Brier scores are the accepted best metric for questions like this, where you make a projection and compare it to the actual outcome. If we’re trying to work out how good a team is and wondering whether actual record or Pythagorean record is a better representation of future play, measuring which one makes a better prediction of future records feels like exactly the way to go. The option with the lower Brier score is the one with less error in its estimates.
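
In code, the two error measures look something like this (a sketch, not the exact script; the Brier calculation here treats the first-half number as a win probability for each second-half game, which is why a .500 coin flip grades out at exactly .2500 in the tables below):

```python
import numpy as np

def rmse(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Root mean squared error of predicted vs. actual second-half winning percentages."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def brier(win_prob: float, game_outcomes: np.ndarray) -> float:
    """Mean squared error of a constant win probability against game results (1 = win, 0 = loss)."""
    return float(np.mean((win_prob - game_outcomes) ** 2))
```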

To that end, I first split every team-season from 2010 through 2024, excluding 2020, into two 81-game halves. To test a given metric, I noted each team’s performance in that metric (actual record, Pythagorean record, several contenders you’ll meet later) in the first half and its actual record in the second half. I threw in a coin flip version that predicts each team will have a .500 record in the second half, just for fun. Here are the takeaways of that investigation. In each table that follows, the best performance in each metric is marked with an asterisk:

First-Half Prediction of Second-Half Record, 2010-2024 (Excluding 2020)

Predictor       Correlation   RMSE      Brier Score
Actual Record   0.5466        0.0832    0.2483
Pythagorean     0.5595*       0.0823*   0.2482*
Coin Flip       n/a           0.0928    0.2500

That’s a nice first result, and one that matches the existing literature. In The Book, Tom Tango and his co-authors found effects of similar magnitude; they found RMSEs of roughly similar size and reported that Pythagorean expectation did slightly better than actual record when it came to predicting future record. They used full seasons rather than half-seasons, and different years, but the similarity of results is still gratifying to me. At various points, Tango has also measured this effect via correlation coefficient and found similarly sized numbers, with coefficients in the mid-.50s.

But overall this is not a very satisfying answer. Yes, Pythagorean expectation is a better predictor of future record than actual record. No, it’s not that much better. To use Brier skill score language, actual record is hardly better than just picking .500 for every team’s record, decreasing mean squared error by only 0.67%. That’s tiny! Using a team’s Pythagorean record to predict the future isn’t much better, though; it’s only a 0.73% reduction in mean squared error relative to pure randomness.
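
Spelled out, the skill score is just the fractional reduction in error relative to the coin-flip baseline; plugging in the rounded table values lands in the same neighborhood (the 0.67% and 0.73% figures come from unrounded inputs):

```python
def brier_skill_score(brier: float, reference: float = 0.25) -> float:
    """Fractional reduction in mean squared error vs. a .500 coin flip."""
    return 1 - brier / reference

print(brier_skill_score(0.2483))  # ~0.0068: actual record barely beats the coin flip
print(brier_skill_score(0.2482))  # ~0.0072: Pythagorean record is barely better still
```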

The drawbacks of using actual record as a predictor are fairly obvious. A team that has gone 19-1 in one-run games and been outscored overall probably isn’t as good as a team with the same record but a 10-10 mark in one-run games. What are the drawbacks of using Pythagorean expectation, then, that make it only slightly better than simply caring about wins and losses? The easiest one to pinpoint is the formula’s insistence that every run is worth the same.

The Royals won a game 20-1 last week. They wouldn’t have been more or less likely to win that game if they’d stopped at 10-1. Those last 10 runs still feed their run differential, though, and thus their Pythagorean record. And, oh yeah, they scored all of their final 10 runs against a position player pitching, after both teams had stopped contesting the game. If we really want to see how good teams are at winning games, we probably need to come up with some adjustment for games like that.

I went with what I’d consider the simplest option. Since I already had every game’s score, I just took the Pythagorean expectation for each game independently. Win a game 10-1? Pythag assigns that game a .985 winning percentage. Win it 20-1? Pythag assigns it a .996 winning percentage. That’s exactly what we want – those last 10 runs do very little to change our estimation of a team’s talent. Then I combined each game’s “expected winning percentage” to get a team’s first-half game-by-game Pythagorean expectation. I calculated two versions of this metric, one that uses the exact formula with no modifications and one that adds a single run to each team’s score in each game to avoid treating every shutout the same. (The Pythagorean formula gives you a 100% winning percentage in any game where you allow no runs.)
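
Here’s a sketch of that per-game calculation, with hypothetical scores; at an exponent of 1.83, a 10-1 win grades out at .985 and a 20-1 win at .996, matching the figures above:

```python
def game_pythag(runs_scored: int, runs_allowed: int, adjust: bool = False, exponent: float = 1.83) -> float:
    """Pythagorean expectation for a single game. With adjust=True, a run is added to
    each side so that shutouts aren't all scored as a 1.000 (or .000) expectation."""
    bump = 1 if adjust else 0
    rs, ra = runs_scored + bump, runs_allowed + bump
    return rs**exponent / (rs**exponent + ra**exponent)

# A team's game-by-game Pythagorean expectation: average the per-game numbers.
first_half = [(10, 1), (20, 1), (2, 3)]  # hypothetical (runs scored, runs allowed) results
expectation = sum(game_pythag(rs, ra, adjust=True) for rs, ra in first_half) / len(first_half)
```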

Whether you modify game-by-game Pythagorean record for shutouts or not, it beats our other estimators of future talent:

First-Half Prediction of Second-Half Record, 2010-2024 (Excluding 2020)

Predictor                       Correlation   RMSE      Brier Score
Actual Record                   0.5466        0.0832    0.2483
Pythagorean                     0.5595*       0.0823    0.2482
Game-by-Game Pythag             0.5499        0.0778    0.2474
Game-by-Game Pythag (Adjusted)  0.5560        0.0771*   0.2473*

But again, it beats them by so little! Skill score says that my two methods each reduce mean squared error by about one percent compared to just guessing that every team’s record will be .500. Did you expect more? I expected more. Shouldn’t looking at a team’s Pythagorean record do a much better job of predicting its future than looking at its actual record? Shouldn’t my fancy, hand-calculated version with special accounting for blowouts do even better? Those skill scores are so tiny. The error terms are still so high. I decided to try one more method: splitting the difference. I took the average of a team’s actual record and Pythagorean record at the midway point and used that as my estimate of future record (a quick sketch of the blend follows the table). It did better than either alone, but still worse than my modified game-by-game method:

First-Half Prediction of Second-Half Record, 2010-2024 (Excluding 2020)

Predictor                       Correlation   RMSE      Brier Score
Actual Record                   0.5466        0.0832    0.2483
Pythagorean                     0.5595        0.0823    0.2482
Game-by-Game Pythag             0.5499        0.0778    0.2474
Game-by-Game Pythag (Adjusted)  0.5560        0.0771*   0.2473*
50/50 Blend                     0.5650*       0.0810    0.2480
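
The blend itself is nothing fancier than a midpoint. Using the Tigers’ June 25 numbers from the top of the piece as a quick example:

```python
def blend(actual_win_pct: float, pythag_win_pct: float) -> float:
    """50/50 blend of first-half actual and Pythagorean winning percentages."""
    return 0.5 * (actual_win_pct + pythag_win_pct)

print(blend(0.617, 0.608))  # 0.6125
```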

I’m satisfied that there’s no way to use either actual record or the runs scored in each game to beat randomness by all that much. I’m also satisfied that you shouldn’t use either actual record or Pythagorean record alone; blending them improves their performance. All of the options I tried beat a naive expectation of every team being equally skilled, but none beat it by all that much. I wasn’t quite stumped, though. I have one other source of high-quality team data: projections. Instead of stopping 81 games into each season and using those games to come up with some estimate of future winning percentage, I stopped 81 games into each season and simply looked up each team’s rest-of-season projected winning percentage. Yet another improvement:

First-Half Prediction of Second-Half Record, 2014-2024 (Excluding 2020)

Predictor                       Correlation   RMSE      Brier Score
Actual Record                   0.5633        0.0831    0.2483
Pythagorean                     0.5761        0.0823    0.2482
Game-by-Game Pythag (Adjusted)  0.5785        0.0756    0.2471
50/50 Blend                     0.5820        0.0809    0.2480
Projections                     0.6098*       0.0737*   0.2469*

(Note that the numbers are slightly different because we only have projections starting in 2014.)

For the record, this is using projections in FanGraphs mode, which takes ZiPS and Steamer for player talent and Depth Charts for playing time. Use season-to-date mode instead, and the projections perform worse than game-by-game Pythagorean expectation.

The takeaway from all of this, at least for me? It’s really hard to make projections of future record, what with so much randomness baked into baseball. Using actual record or Pythagorean record to come up with estimates beats guessing randomly. Averaging those two does even better. Going game-by-game and handling blowouts differently is better still. Even that method can’t beat computer-driven projections. And yet that projection-based method, the best in our study, still only reduces mean squared error by 1.3% relative to pure chance.

None of this means that you shouldn’t watch a current season and try to guess the future, of course. That’s why we all like baseball so much. But next time you hear that a team is unsustainably over its head because its record and Pythagorean expectation don’t match, or that a team “can’t keep getting this unlucky,” remember that none of these methods are all that much better than random chance. Can a team keep playing over its head? Sure, and we’re not even that great at measuring where its head is, to continue the analogy. Can a team keep getting this lucky or this unlucky? Obviously! Baseball is a sport governed by randomness at the game level.





Ben is a writer at FanGraphs. He can be found on Bluesky @benclemens.
