Here’s How I’m Planning on Evaluating Free Agency Predictions

Every year, FanGraphs (in this case, I am FanGraph) releases contract predictions for our top 50 free agents. We also run a contract crowdsourcing project for those players, and I have to say, the crowd is spectacularly good at this. Last year, for example, I looked through all of the various predictions across the internet and awarded the crowd the title of best overall prognosticator.
But honestly, the winner of that award was hard to determine because I didn’t have a great way to evaluate the various predictions. Why so difficult? Because not every deal ended up being for the length we all predicted. As an example, I predicted 12 years and $48 million per year for Juan Soto, while the crowd predicted 13 years and $45 million per year. Soto signed a deal that was for 15 years and $51 million per year. Who came closer to the mark? It’s not immediately clear. I did better on the AAV, but the crowd did better on the number of years. There’s no obvious way to weigh the two against each other. Even worse, the two are inversely correlated; more years generally means a lower AAV. The two predictions seem pretty similar to me, but I had to grade AAVs and total guarantees separately, and that just felt clunky and confusing.
After some time bouncing ideas off my friends and colleagues, and plenty of time in the FanGraphs Idea Generation Lab (not real, but man, it should be), I think I have a solution. It’s simple, really. Evaluating contract predictions would be much easier if the predictions and the actual contract were for the same length, so I made them all the same length.
You might have some objections to that. “Hey, that makes no sense,” or “that’s not how math works,” something along those lines. But hear me out: Even if we can’t go back and retroactively change contracts or predictions to make the years match up, we can try to figure out what a deal would have looked like if it were longer.
I wanted a simple rule that wouldn’t require me to exercise any judgment at all. Working through case-by-case decisions is for making predictions, not evaluating them. I wanted something with few moving parts, and I definitely didn’t want to need to lean on projections or complicated aging curves in my analysis. My goal was to make it so that anyone could perform this analysis with nothing more than a list of our predictions and a list of the actual contracts. That was limiting, but also freeing. If you can’t use much, you rarely struggle to figure out what to use.
After a bit of experimentation, I settled on a rule. To compare two contracts of different lengths, I extended the term of the shorter contract to match the term of the longer one. To determine the salary for those additional years, I valued each added year at two-thirds of the previous year’s salary, compounding downward from the deal’s AAV. Take Soto, for example. Since I predicted 12 years for his deal, I had to add three years to match the one he signed. I valued those years at $32 million, $21.3 million, and $14.2 million. The crowdsourced prediction was for 13 years, so I had to add two years, at $30 million and $20 million respectively.
With both predictions now for 15 years, I could compare them directly. Take my predicted contract plus the three added years, and I had Soto down for $643.5 million over 15 years, $121.5 million short of the deal he actually signed. The crowdsourced prediction came out to $635 million over 15 years, $130 million below his actual deal. My prediction was slightly better – but the two were exceedingly close to each other.
That’s the outcome I wanted. Those two predictions really do sound similar. It’s not clear whether 12/48 or 13/45 is the bigger offer; the latter is only $9 million more in total salary despite tacking on a whole extra year. On the other hand, that extra guaranteed year really does matter. I think that calling the two roughly equivalent is the right way to go, and I like that my prediction came out slightly higher after the adjustment; as I mentioned above, a 13th year that adds only $9 million to the guarantee says a lot about how I’m thinking about this.
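If you want to see the arithmetic in one place, here’s a minimal Python sketch of the rule. The function name `extend_contract` and the `decay` parameter are my labels for illustration, not anything official; all the function does is compound the final-year salary by two-thirds for each year added.

```python
def extend_contract(years: int, aav: float, target_years: int, decay: float = 2 / 3) -> float:
    """Total value (in $M) of a contract stretched to target_years.

    Each added year is worth `decay` times the previous year's salary,
    so the added years compound downward from the original AAV.
    """
    total = years * aav
    salary = aav
    for _ in range(target_years - years):
        salary *= decay
        total += salary
    return total

# Soto: both predictions stretched to the actual 15-year term
actual = 15 * 51                      # $765M
mine = extend_contract(12, 48, 15)    # 576 + 32 + 21.3 + 14.2 ≈ $643.5M
crowd = extend_contract(13, 45, 15)   # 585 + 30 + 20 = $635M
print(actual - mine, actual - crowd)  # misses of ≈$121.5M and $130M
```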
So what happens when a prediction is longer than the contract a player actually signs? You can just do the same thing in reverse. This is great for pillow contracts in particular. Take Gleyber Torres, who last offseason signed a one-year, $15 million deal. I had him down for five years at $18 million per, while the crowdsourced median was three years at $18 million a pop. Those were both bad predictions, and mine was clearly worse, but how bad? Well, extending that pillow deal with our two-thirds formula works out to $39 million over five years (a miss of $51 million) or $31.7 million over three years (a miss of $22 million). Those equivalent contracts seem reasonable to me; if you’d offer someone $15 million for one year, you’d probably offer them roughly $30 million for three. That’s not true in every case, obviously, but for our purposes and using limited, objective inputs, I think it’s close enough.
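The reverse case uses the same sketch; you just stretch the actual contract to the prediction’s length instead (reusing the hypothetical `extend_contract` from above):

```python
# Torres: stretch the actual one-year, $15M deal to each prediction's length
actual_5yr = extend_contract(1, 15, 5)  # 15 + 10 + 6.7 + 4.4 + 3.0 ≈ $39M
actual_3yr = extend_contract(1, 15, 3)  # 15 + 10 + 6.7 ≈ $31.7M
print(5 * 18 - actual_5yr)              # my miss: ≈ $51M
print(3 * 18 - actual_3yr)              # crowd miss: ≈ $22M
```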
I experimented with a few different ratios for this, and I don’t have any conclusive evidence that the two-thirds formula is the perfect solution. Halving the salary each year was my first guess, in fact, but I changed my mind after looking at some long deals like Soto’s. I’m definitely not convinced that this is set in stone, but it did look closest in various spot checks of past contracts. I’d love some feedback here, truthfully; let me know if you think this method has merit, because while I think it does, it’s hard to feel confident in that view when there’s no way to verify it perfectly against historical data.
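For what it’s worth, the `decay` parameter in the sketch above makes that kind of experimentation easy. Here’s how the two ratios treat my Soto prediction; halving shaves a lot more off long extensions:

```python
# My 12-year Soto prediction stretched to 15 years under each ratio
print(extend_contract(12, 48, 15, decay=0.5))    # ≈ $618M: added years of 24, 12, 6
print(extend_contract(12, 48, 15, decay=2 / 3))  # ≈ $643.5M: added years of 32, 21.3, 14.2
```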
Somehow, though, this still isn’t enough to conclusively say that one set of predictions is better than another. For one thing, contract length matters. It’s all well and good to equate two projections for the sake of evaluation, but if I predict a three-year deal and the player gets a seven-year pact, I was wrong. Those years tell us something about how teams view that player. A full evaluation of predictions still has to look at that, even after doing this math to transform the dollar side of the equation into a single metric.
But that’s only half of the problem, because there are two ways to evaluate the results my method spits out. First, you could evaluate a set of predictions by their average miss, with under-predictions and over-predictions offsetting. Second, you could use the absolute value of the misses, so that those two do not offset; over-predict one contract by $10 million and under-predict another by $10 million, and your average miss is still $10 million. There are arguments for both; the average miss version does a good job of explaining which set got the overall market right, while the absolute miss version does a good job of sussing out who got the closest on individual players. I think the second is more important, but the first is clearly useful as well.
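Here’s one sketch of those two scorecards, again just for illustration; it assumes each prediction is a (predicted years, predicted AAV, actual years, actual AAV) tuple and reuses `extend_contract` from above:

```python
from statistics import mean

def misses(contracts: list[tuple[int, float, int, float]]) -> list[float]:
    """Signed miss (prediction minus actual, in $M) for each contract,
    after stretching the shorter side to the longer term."""
    out = []
    for pred_years, pred_aav, act_years, act_aav in contracts:
        term = max(pred_years, act_years)
        pred_total = extend_contract(pred_years, pred_aav, term)
        act_total = extend_contract(act_years, act_aav, term)
        out.append(pred_total - act_total)
    return out

errs = misses([(12, 48, 15, 51), (5, 18, 1, 15)])  # Soto, Torres
print(mean(errs))                                  # average miss: over- and under-predictions offset
print(mean(abs(e) for e in errs))                  # average absolute miss: they don't
```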
To give you an idea of how this new evaluation method works, I’m going to compare the entire set of predictions from the 2024-25 offseason, both mine and the crowdsourced estimates. Here’s how our average predictions did, both using my old AAV-and-total-separately method and the new blended one (negative numbers mean the predictions came in below the actual contracts):
| Group | Ben – AAV | Ben – Total | Ben – New | Crowd – AAV | Crowd – Total | Crowd – New |
|---|---|---|---|---|---|---|
| SP | -$1.04M | -$7.77M | -$6.58M | -$1.1M | -$7.34M | -$5.95M |
| RP | -$0.17M | -$3.44M | -$1.36M | -$2M | -$8.72M | -$6.49M |
| Hitter | -$0.54M | $3.46M | $2.21M | -$0.7M | -$2.21M | -$4.48M |
| Overall | -$0.69M | -$2.77M | -$2.45M | -$1.13M | -$5.71M | -$5.64M |
It’s gratifying to see that my new numbers generally resemble the old “total guarantee” metric. That’s good; if they were way off, it would suggest that my normalize-and-compare method was distorting the picture of total guaranteed money.
On the other hand, the absolute value prediction errors look pretty different with the new methodology:
| Group | Ben – AAV | Ben – Total | Ben – New | Crowd – AAV | Crowd – Total | Crowd – New |
|---|---|---|---|---|---|---|
| SP | $4.14M | $19.33M | $14.19M | $4.05M | $18.97M | $13.85M |
| RP | $1.80M | $7.22M | $4.48M | $2.45M | $9.16M | $6.94M |
| Hitter | $3.15M | $35.55M | $21.44M | $2.72M | $32.03M | $21.91M |
| Overall | $3.32M | $22.95M | $14.34M | $3.25M | $21.87M | $14.81M |
This is basically the whole point of the formula: by putting contracts on an equal footing in terms of length, this method strips out the false “errors” that arise simply because a five-year deal is inherently for more money than a three-year deal. In the first table, those errors largely offset. In absolute value terms, they don’t, which is why we see more of an improvement here.
I’m fairly confident that this new formula comes closer to my goal of evaluating which predictions were best. I’d be doing this even if I couldn’t publish it, in fact; every year, I do an extensive post-mortem evaluation to try to improve my system for the next year. I used to hate pillow contracts and qualifying offers for that reason; they messed everything up so much by virtue of their length that it was hard to interpret the results. By controlling for that at least somewhat, I think I’ll be able to refine my methods even further; the more accurately you can measure your shortcomings, the easier it is to address them.
Why publish this now, instead of in February when I’m doing that post-mortem? So that it doesn’t look like I have my thumb on the scales. I truly have no idea whether this method will “help” my predictions. How could I? Much of this year’s free agent class remains unsigned. No one likes a judge with a preconceived agenda to help one contestant or another. Proposing a new methodology, one that is self-professedly a little hand-wavy, and simultaneously evaluating myself with it just doesn’t sound right. I’d like to think it wouldn’t affect my judgment, but why take the chance?
Anyway, this is how I’m planning on evaluating predictions this year. I’d sincerely love to hear what you think. I want a simple, foolproof method for this project. If you have a better one than mine, I haven’t liked this idea long enough to be tied to it. Let’s figure something out together – though again, I really do like this one at the moment.
Ben is a writer at FanGraphs. He can be found on Bluesky @benclemens.