A Few Thoughts On Evaluations

Over the last few weeks, I’ve been kicking around a few thoughts in my head, and so today, I’m going to try and turn them into a cohesive post. I can’t guarantee I’m going to succeed, given that I’m writing these sentences before I write the rest of them, but after pondering these in my head for a while, it’s probably time to put them down on virtual paper and get some feedback on the ideas presented.

These thoughts were primarily spurred by the fact that, with three weeks left in the season, the battle for the second AL Wild Card is being contested by the Twins and Rangers. Entering the year, our Playoff Odds gave the Twins a 4.6% chance of making the postseason (based on a 74-88 projected record), and the Rangers a 3.5% chance (with a 73-89 projected finish), and now it seems quite likely that one of the two is going to end up playing the Yankees in the Wild Card play-in game. The American League as a whole has been pretty weird this year, or, if you take a different perspective, our preseason projections have performed poorly in forecasting AL team records this season.

Rangers fans — or the segment of their fanbase that uses Twitter, anyway — have been particularly loud in their objections to our evaluation of their team, and understandably, 140 games of their team winning at a .525 clip has reinforced their belief that our methodology of team evaluation is incorrect. Or that I personally have a bias against their team. Or some combination of the two.

From my perspective, though, the disconnect is mostly just a philosophical choice, and is the same choice that drives a lot of the disagreement around many of our less popular evaluations. Primarily, we evaluate players and teams by their inputs, not their outputs.

One of the primary goals of analytical research has been to attempt to isolate individual performance, as many of the more traditionally accepted measures of player value were based on taking a metric that involved contributions from many players and ascribing them to just one person. This has always been the core problem with pitcher Win-Loss records or RBIs, and has been one of the main reasons we’ve advocated for moving away from ERA as the primary way to evaluate a pitcher as well. These numbers are the result of a lot of variables working together, and we’ve generally preferred to move away from those kinds of metrics and towards things that focus more on measuring just the contributions of one player.

It’s a generalization that isn’t true in every case, but I tend to think that one of the main differences between how we often see things around here — versus how the mainstream baseball fan sees things — is that we tend to try to construct the whole picture from individual pieces, while traditionally, evaluations have more often been done by taking the whole and attempting to deconstruct it into individual pieces from there. We start with inputs and try to build up; I think most mainstream fans start with the output and try to tear down from there.

I do think starting from the building blocks of what have historically been shown to be sustainable skills is probably a better way to evaluate player and team performance, because metrics that combine a number of inputs are difficult to unravel after the fact. This is why, despite all the focus on analytics in baseball over the last 20 years, separating out the credit that should go to a pitcher or fielder on any single defensive play remains difficult. When you take a bunch of inputs, put them in a blender, and then try to identify all of those individual components by simply looking at the final product, it’s not so easy to tell what happened to get to that final product.

But there’s also a cost to starting from the inputs; we don’t know all of the variables, and by building models that are based solely on the ones we think we do have a decent understanding of, we leave things out. Including things that have pretty large impacts on the output metrics, especially at the team level. The most notable thing that gets excluded from using input measures? The order in which events occur.

For example, let’s take Joe Kelly. By ERA, Kelly was absolutely miserable in the first half, but has really turned his season around since the All-Star break. But take a look at how different that picture looks if you strip out the effects of sequencing and just look at how opposing batters have done against Kelly in the first and second halves:

Joe Kelly’s Half-Seasons
Split      AVG    OBP    SLG    wOBA   ERA
1st Half   0.268  0.338  0.418  0.329  5.67
2nd Half   0.271  0.338  0.427  0.334  3.45

Based simply on the number and type of baserunners Kelly has allowed, his first-half and second-half performances don’t really look any different, but the order in which those events occurred has pushed his results in opposite directions, driving a 2.22 run difference in ERA between his first half and second half. If you’re using outputs to look at Joe Kelly’s season, you’ll see a huge change, but because sequencing has not been shown to be something that players or teams have sustainable control over, it doesn’t end up in the kinds of metrics we favor around here.
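To make the sequencing point concrete, here’s a toy sketch of my own construction (not anything FanGraphs publishes): the exact same collection of plate-appearance outcomes can produce very different run totals depending only on the order in which they occur. The base-advancement rules are deliberately crude assumptions — every single advances each runner exactly one base.

```python
# Toy run-scoring model: identical events, different order, different runs.
# Assumption (mine): a single advances every runner exactly one base,
# and three outs clear the bases and start a new inning.

def runs_scored(events):
    """Score a sequence of 'S' (single) and 'O' (out) events."""
    bases = [False, False, False]  # first, second, third
    runs = 0
    outs = 0
    for ev in events:
        if ev == "S":
            if bases[2]:
                runs += 1  # runner on third scores
            bases = [True, bases[0], bases[1]]  # everyone moves up one base
        else:
            outs += 1
            if outs == 3:
                outs = 0
                bases = [False, False, False]  # inning over, clear the bases

    return runs

# The same six singles and six outs, just ordered differently:
clustered = list("SSSSSSOOOOOO")   # hits bunched together
scattered = list("SOSOSOSOSOSO")   # hits spread out

print(runs_scored(clustered))  # 3 runs
print(runs_scored(scattered))  # 0 runs
```

Same inputs, a three-run gap in outputs — which is essentially what the ERA column above is picking up and the slash-line columns are not.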

In reality, the FIP/ERA argument is the same as the BaseRuns/Pythag/actual-wins argument, and it also drives the difference in our perception of the quality of the Twins’ and Rangers’ rosters. By the inputs that make up a team’s win-loss record, neither the Twins nor the Rangers are actually having that great a season.

Sequencing Effects
Team Winning% Pythag% BaseRuns%
Twins 0.521 0.495 0.442
Rangers 0.528 0.480 0.479

When looking at the inputs without regard for the order in which they’ve occurred, the Rangers have played like a slightly below .500 team, while the Twins have played like one of the worst teams in baseball. Both are in the Wild Card race because they’ve maximized the value of their positive plays by stringing them together in optimal fashion, with the Twins scoring far more runs than we’d expect from a team with a .248/.304/.400 batting line, and the Rangers simply scoring their runs at the right time, turning a -27 run differential into roughly the same record as the Astros (+105) and Giants (+78).
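For readers unfamiliar with the Pythag% column, it comes from the Pythagorean expectation: an estimate of winning percentage from runs scored and allowed, ignoring sequencing within games. Here’s a minimal sketch; the 1.83 exponent is the commonly used refinement of Bill James’s original 2, and the run totals are hypothetical numbers I chose only to reproduce the -27 differential mentioned above, not the Rangers’ actual figures.

```python
# Pythagorean expectation: winning percentage estimated from run totals.
# Exponent 1.83 is the common refinement of the original exponent of 2.

def pythag_pct(runs_scored, runs_allowed, exponent=1.83):
    rs_term = runs_scored ** exponent
    ra_term = runs_allowed ** exponent
    return rs_term / (rs_term + ra_term)

rs, ra = 650, 677  # hypothetical totals giving a -27 run differential
print(round(pythag_pct(rs, ra), 3))  # ≈ 0.481, a below-.500 team
```

A negative run differential always produces a sub-.500 Pythag record, which is why a team can sit well above .500 in actual wins while these input-based measures see it as mediocre.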

At the input level, I think our ability to forecast player and team performance is pretty decent. It’s not perfect, certainly, but when we just look at individual performance, we do alright. When it comes to projecting the order in which those performances are going to occur, however? We’re completely useless. Forecasts don’t even try to take sequencing into account, because we have almost no insight into predicting when events will occur. We can get to some reasonable predictions by saying that, over a large number of trials, we should expect this type of event to happen roughly this many times, but the order in which those events occur is a complete mystery.
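That "counts over many trials" framing can be made precise with a simple binomial sketch (my illustration, using an arbitrary .500 talent level, not any FanGraphs model): we can say how likely a given win total is, while saying nothing at all about which games get won.

```python
# Exact binomial tail probability: how often does a true-talent .500 team
# win at least 90 of 162 games? The .500 talent level is an arbitrary
# assumption for illustration.
from math import comb

def prob_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p90 = prob_at_least(162, 90, 0.5)
print(round(p90, 3))  # roughly a 9% chance of 90+ wins
```

The distribution of season win totals is knowable in advance; the sequence of wins and losses that produces any one of them is not.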

But in baseball, the order of events is a big variable in output metrics, so there are going to be plenty of instances every year where simply taking the inputs we have a decent handle on and incorporating them together isn’t going to match the outcomes of what actually happens during a season. So, while outcome metrics have the drawback of being difficult to disentangle, input metrics have the drawback of being incomplete. It is, to some degree, a pick-your-poison situation, as both types of measures have strengths and weaknesses.

And the downside of mostly ignoring output metrics is that, when sequencing drives dramatically divergent results, we’re not going to see things the same way as those who are just looking at the outputs. And it’s basically a certainty that our models, based on transforming individual forecasts into a projected team output, are missing things. There’s no way we’ve solved for every important variable at this point, and in the future, the models will account for things that we aren’t accounting for now, and we’ll look back at our current tools and wonder why we missed things for so long.

But I think the preponderance of evidence suggests that evaluating players and teams by their inputs is generally more accurate than simply going with the outputs. As far as anyone can tell right now, sequencing really is mostly just randomness, and while it can have a big impact on the results, it isn’t really something that can be predicted ahead of time.

So, Rangers and Twins fans, I promise that we don’t hate your teams, and aren’t out to try and rain on your parades. We just don’t put nearly as much stock in output metrics as most people do, and your two teams happen to be this year’s examples of the power of the sequencing variable that we don’t really even attempt to take into account. Last year, it was the Orioles, and next year, it will be someone else. The order of events has a big impact for a few teams every year, and so every year, input-driven analysis is going to look silly.

But it’s the cost we’ve chosen to pay in order to attempt to isolate individual performances a bit more accurately. And I think it’s a price worth paying overall. This doesn’t mean our analysis is perfect, or that we never get anything wrong, but when the difference in results is simply driven by the order in which events occurred, I’m okay with that, and I don’t think an odd distribution of events means that the projection that assumed a normal distribution was incorrect.

At this point, all we can really do is get as close as possible to projecting the quantity and quality of the events that will occur. We’re nowhere close to being able to predict the order in which those events will occur, and I’m not sure we ever will be. We’ll get closer, but sequencing is likely always going to be a variable with enough randomness to make projecting the order of events all but impossible.

Dave is the Managing Editor of FanGraphs.

6 years ago

Might I suggest that these evaluations provide a range of wins totals instead of a single value.

6 years ago
Reply to  Mikey

I’m guessing the value of wins will just be normally distributed around the expected win value – so it would show the Rangers or Twins having a 3% chance of getting 90+ wins but also a 3% chance of getting 60 wins

Pale Hose
6 years ago
Reply to  Dave

While this is true, we still never have discussions on kurtosis or heaviness of tails. The emphasis on mean is too simplistic.

6 years ago
Reply to  Mikey

There is something more to this. There are rosters that may project for 80 wins but have a low ceiling even if lots of things work out, while another team may project for 76 wins but could reach 90 if some things break right. Perhaps floor and ceiling need to be better represented when doing future projections.

Moreover, I think our projections miss more things than just sequencing. I think we don’t fully grasp the value of defense as a part of run prevention. Likewise, I think we also underrate relief pitching and other things like speed, youth and depth. While some of these are accounted for in some way at the player level, there is probably some team effect that gets missed. And that’s before we even get to more esoteric factors like manager effect, a team’s ability to add salary midseason and the quality/quantity of prospect currency in the organization. All these things play a role in team outcomes. And if we’re limiting analysis to only individual player data, we’re leaving a bunch of relevant things out or underrepresented.