We May Never Find Out How Good Umpires Can Be

Major League Baseball will look significantly different in 2023 due to several new rules, but there’s another change that won’t attract as much attention as a pitch clock or all that steamy base-on-base action. Ten veteran umpires have retired and 10 new ones will be taking their place. I’d like to explore the effect these new umpires might have, but first, let’s look at the state of umpiring right now.

The short version is pretty simple: Since the beginning of the pitch tracking era in 2008, umpires have improved their accuracy in calling balls and strikes every single year. Accuracy has gone from 81.3% to 92.4%. If an improvement of 11.1 percentage points in 15 years doesn’t sound particularly big, consider it this way: incorrect calls have been cut by nearly 60%.
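As a quick check on that arithmetic, here is a minimal Python sketch that uses only the two accuracy figures quoted above. The variable names are illustrative.

```python
# Accuracy figures quoted in the article (percent of ball-strike calls correct).
accuracy_2008 = 81.3
accuracy_2022 = 92.4

# The error rate is the complement of accuracy.
error_2008 = 100 - accuracy_2008
error_2022 = 100 - accuracy_2022

# Absolute gain in percentage points versus relative cut in missed calls.
point_gain = accuracy_2022 - accuracy_2008
relative_error_cut = (error_2008 - error_2022) / error_2008 * 100

print(f"Accuracy gain: {point_gain:.1f} percentage points")
print(f"Reduction in incorrect calls: {relative_error_cut:.1f}%")
```

The same 11.1-point gain looks much larger when expressed against the 18.7% of calls that were wrong in 2008, which is where the "nearly 60%" comes from.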

I’m far from the first to note that umpiring has been improving ever since the tools to measure it became public. In Game 2 of the World Series, Pat Hoberg became famous (at least among Effectively Wild listeners) for calling the first perfect game ever recorded. However, there’s room for a much more nuanced understanding of the improvement if we break the numbers down using Baseball Savant’s attack zones.

Most pitches are pretty clearly either balls or strikes. The real action takes place in the shadows.

Umpires have pretty much always been perfect on waste pitches. In fact, in both 2018 and 2019, they got every single one right. They were perfect again in 2022, although the data doesn’t show it: a misclassification on a June 12 strike to Josh Bell spoiled their clean sheet.

The heart and chase zones are now largely mistake-free as well. Although performance has started to plateau, it’s still improving slightly every year. In the last three seasons, heart accuracy has gone from 98.9% to 99.2% to 99.3%, and chase accuracy has gone from 99.3% to 99.4% to 99.5%. In 2022, just 338 pitches over the heart of the plate were called balls. Depending on your fun fact of choice, that’s roughly one pitch every seven games, or 3.5 times a year for each umpire.
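The two fun facts can be sanity-checked with a few lines of Python. This is a rough sketch: the 338 and 3.5 figures come from the paragraph above, while the 2,430-game total is an assumption (a full regular season with the postseason ignored), and the umpire count is only what the 3.5 figure implies, not a number from the source.

```python
heart_misses_2022 = 338      # from the article
misses_per_umpire = 3.5      # from the article

# Assumption: 30 teams x 162 games, two teams per game, postseason ignored.
regular_season_games = 30 * 162 // 2

games_per_miss = regular_season_games / heart_misses_2022
implied_umpires = heart_misses_2022 / misses_per_umpire

print(f"One missed heart-zone pitch every {games_per_miss:.1f} games")
print(f"Implied number of umpires calling balls and strikes: {implied_umpires:.0f}")
```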

Accuracy in the shadow zone, however, is still improving at roughly one percentage point a year, and it shows no signs of slowing down. In 2008, umpires were right 66.2% of the time in the shadow zone; in 2022, they were right 80.9% of the time, which translates to roughly 11 misses a game. I don’t imagine they’ll approach 100% accuracy as they have in the other attack zones, but I also don’t think that we have any idea what the upper limit is.

Thirty-eight percent of takes are on pitches in the shadow zone. In 2008, 67% of missed calls happened there. Now it’s 96%. In 2022, umpires missed a call in the heart, chase, or waste zone once every 2.74 games. All of this is to say that today’s umpires are pretty much never wrong on anything but close pitches, and they’re still getting better on those close pitches.

Now let’s take a closer look at our incoming and outgoing umpires. Since all 10 of the new umps have called big league games before, here’s a side-by-side comparison of the two cohorts. The data comes courtesy of Umpire Scorecards, which analyzes Statcast data to assess individual umpires. AAx is short for Accuracy Above Expected, which weights an umpire’s calls based on league tendencies. For example, an umpire who misses a pitch right down the middle would be docked more than an umpire who misses a pitch right on the corner.

(One quick note: Umpire Scorecards measures performance against the average umpire, not the rulebook strike zone. As creator Ethan Singer relayed on Effectively Wild, “Each pitch is assigned a likelihood that it’s called correct … based on location, speed, vertical movement, handedness, and some other sort of extraneous factors like count.” In any given year, it tends to rank umpires as 1.5-2% more accurate, but as the average umpire’s strike zone keeps getting closer to the rulebook strike zone, Umpire Scorecards’ ratings have presumably followed suit.)

Outgoing and Incoming Umpires – Career Averages
Retiring Umpires   Age    Accuracy (%)   AAx      New Umpires          Age    Accuracy (%)   AAx
Ted Barrett        57     91.4           -0.94    Erich Bacchus        32     93.5           0.25
Marty Foster       59     91.4           -0.56    Adam Beck            34     94.4           1.18
Greg Gibson        54     92.0           -0.20    Nestor Ceja          35     93.7           0.37
Tom Hallion        66     91.0           -0.98    Shane Livensparger   38     94.1           0.87
Sam Holbrook       57     91.8           -0.32    Nick Mahrley         40     93.4           0.37
Jerry Meals        61     91.7           -0.64    Brennan Miller       31     93.5           0.12
Paul Nauert        59     91.0           -0.98    Malachi Moore        32     92.1           -1.16
Jim Reynolds       54     92.5           0.49     Edwin Moscoso        33     94.6           1.12
Tim Timmons        55     91.8           -0.37    Alex Tosi            34     94.5           1.68
Bill Welke         55     92.1           0.00     Junior Valentine     35     94.3           1.05
Average            57.7   91.67          -0.45    Average              34.4   93.81          0.59
SOURCE: Umpire Scorecards
Note: Timmons, Holbrook, and Nauert didn’t call balls and strikes in 2022.
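For anyone who wants to check the Average row or measure the gap between the cohorts, here is a small Python sketch that recomputes the means from the table above. The data is copied from the table; the function and variable names are illustrative.

```python
from statistics import mean

# (name, age, accuracy %, AAx), copied from the table above.
retiring = [
    ("Ted Barrett", 57, 91.4, -0.94), ("Marty Foster", 59, 91.4, -0.56),
    ("Greg Gibson", 54, 92.0, -0.20), ("Tom Hallion", 66, 91.0, -0.98),
    ("Sam Holbrook", 57, 91.8, -0.32), ("Jerry Meals", 61, 91.7, -0.64),
    ("Paul Nauert", 59, 91.0, -0.98), ("Jim Reynolds", 54, 92.5, 0.49),
    ("Tim Timmons", 55, 91.8, -0.37), ("Bill Welke", 55, 92.1, 0.00),
]
incoming = [
    ("Erich Bacchus", 32, 93.5, 0.25), ("Adam Beck", 34, 94.4, 1.18),
    ("Nestor Ceja", 35, 93.7, 0.37), ("Shane Livensparger", 38, 94.1, 0.87),
    ("Nick Mahrley", 40, 93.4, 0.37), ("Brennan Miller", 31, 93.5, 0.12),
    ("Malachi Moore", 32, 92.1, -1.16), ("Edwin Moscoso", 33, 94.6, 1.12),
    ("Alex Tosi", 34, 94.5, 1.68), ("Junior Valentine", 35, 94.3, 1.05),
]

def cohort_means(rows):
    """Return (mean age, mean accuracy, mean AAx) for one cohort."""
    return (
        mean(r[1] for r in rows),
        mean(r[2] for r in rows),
        mean(r[3] for r in rows),
    )

old_age, old_acc, old_aax = cohort_means(retiring)
new_age, new_acc, new_aax = cohort_means(incoming)

print(f"Retiring: age {old_age:.1f}, accuracy {old_acc:.3f}, AAx {old_aax:.3f}")
print(f"Incoming: age {new_age:.1f}, accuracy {new_acc:.3f}, AAx {new_aax:.3f}")
print(f"Accuracy gap: {new_acc - old_acc:.2f} points, AAx gap: {new_aax - old_aax:.3f}")
```

The means line up with the Average row. The incoming AAx mean works out to 0.585, which the table shows as 0.59.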

That is extremely stark. The younger umpires have outperformed the older ones by an overwhelming margin. While I didn’t expect the outcome to be quite so definitive, the direction of the results isn’t surprising. Again, I’m not the first to note that younger umpires tend to outperform older ones, but I do think there’s a way to draw some more nuanced conclusions from the data.

The correlation coefficient between birth year and AAx is .63. There’s an even stronger correlation between birth year and accuracy, but those numbers are skewed by the league’s overall improvement.
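To make the statistic concrete, here is a hedged Python sketch of a Pearson correlation, run only on the 20 umpires from the table above. It is not a reproduction of the .63 figure, which covers the full pool of umpires, and it uses age rather than birth year, so the sign flips: a negative value here means younger umpires grade better.

```python
from math import sqrt

# Ages and AAx for the 20 umpires in the table (retiring first, then incoming).
ages = [57, 59, 54, 66, 57, 61, 59, 54, 55, 55,
        32, 34, 35, 38, 40, 31, 32, 33, 34, 35]
aax = [-0.94, -0.56, -0.20, -0.98, -0.32, -0.64, -0.98, 0.49, -0.37, 0.00,
       0.25, 1.18, 0.37, 0.87, 0.37, 0.12, -1.16, 1.12, 1.68, 1.05]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(ages, aax)
print(f"Correlation between age and AAx (20-umpire subset): {r:.2f}")
```

Because this subset consists of two hand-picked cohorts at opposite ends of the age range, it will overstate the relationship relative to the league as a whole.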

There are plenty more factors that make it very hard to get a handle on the true relationship between age and accuracy. The small pool of umpires and the paucity of data don’t help. Umpire Scorecards’ data only goes back to 2015, but some of the umpires it’s assessing called their first big league games during the Carter administration. They’re bound to score worse not just because their eyesight could be starting to go, but because they had to adjust to the current system in the middle of their careers, whereas younger umpires couldn’t advance to the big leagues unless the system rated them highly. That’s an awfully strong motivator.

Umpire Scorecards has data on 124 umpires. Sixty-two made it to the big leagues before PITCHf/x in 2008, and 62 made it after. For the older umpires, the correlation between age and AAx is .48. For the younger umpires, it’s .18. The younger umps still grade better, but the relationship isn’t nearly so strong. Here’s the same graph from before, but with the two groups split into different colors. It’s easy to see how much stronger the directional relationship is in the older group.

I tried to calculate aging curves based on accuracy, but the league’s overall improvement renders them more or less useless. Umpire Scorecards has data for 597 back-to-back individual seasons, and the umpires improved in 398 of them. An aging curve that doesn’t account for the rising league average will indicate that umpires never stop getting better. An aging curve that does account for it will indicate that they never stop getting worse, because year-over-year individual improvements are usually smaller than the difference between the younger new umpires and the older ones they’re replacing. However, AAx accounts for the time period, which makes it our best shot for an aging curve.
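The tension in that paragraph can be illustrated with a paired-season comparison, sometimes called the delta method in baseball aging studies. The Python sketch below is not the calculation used for this article, and every number in it is HYPOTHETICAL, chosen only to show how the same seasons can look like steady improvement or steady decline depending on the baseline.

```python
from collections import defaultdict

# HYPOTHETICAL league-average accuracy by year (not real data).
league_avg = {2020: 91.0, 2021: 91.6, 2022: 92.2}

# HYPOTHETICAL umpire seasons: (umpire, year, age, accuracy %).
seasons = [
    ("Ump A", 2020, 35, 92.0), ("Ump A", 2021, 36, 92.5), ("Ump A", 2022, 37, 93.0),
    ("Ump B", 2020, 55, 90.5), ("Ump B", 2021, 56, 90.9), ("Ump B", 2022, 57, 91.2),
]

def paired_deltas(rows, baseline=None):
    """Return (age in first season, change) for each back-to-back season pair.

    If baseline is given, accuracy is measured relative to that year's
    league average before the change is taken.
    """
    by_ump = defaultdict(dict)
    for ump, year, age, acc in rows:
        value = acc - baseline[year] if baseline is not None else acc
        by_ump[ump][year] = (age, value)

    deltas = []
    for years in by_ump.values():
        for year in sorted(years):
            if year + 1 in years:
                age, first = years[year]
                _, second = years[year + 1]
                deltas.append((age, second - first))
    return deltas

raw = paired_deltas(seasons)
adjusted = paired_deltas(seasons, league_avg)

print("Raw changes:     ", [(age, round(d, 2)) for age, d in raw])
print("Adjusted changes:", [(age, round(d, 2)) for age, d in adjusted])

# Share of back-to-back seasons with improvement, from the article's counts.
share_improved = 398 / 597
print(f"Improved in {share_improved:.1%} of back-to-back seasons")
```

In this toy data every raw change is positive and every league-adjusted change is negative, which is the same pattern the paragraph describes for the real numbers.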

Now there are all sorts of things wrong with this graph. All the factors I listed in the previous paragraphs are distorting it in their own way. However, I don’t think that the overall shape, peaking in the mid-40s — coincidentally or not, around the age when visual acuity starts to decline — is particularly farfetched, though we could definitely benefit from more years of data. We don’t yet know what umpires who came of age in the pitch tracking era will look like in their 50s and 60s, but they’re starting out from a much higher place.

The league first started assessing umpires using the QuesTec pitch tracking system in 2003. If we trust our aging curve, then the first umpires who came up in the pitch tracking era only recently hit their prime. In 2022, umpires who debuted before 2003 made 33% of ball-strike calls. That number is about to drop precipitously, but it’ll take another 10-15 years before it hits zero. When it does, we will likely reach peak accuracy, and we’ll have a better handle on its relationship with age.

The problem is, we’re probably not going to make it that long. The robots are most assuredly coming. The automated ball-strike system debuted in the Atlantic League in 2019. Four years later, it’s in every Triple-A ballpark. Teams will be splitting their 2023 seasons between a challenge system and full ABS, and it seems safe to assume that some form of automation will be coming to the big leagues soon. I’m not necessarily opposed to the change, but I will be sad about this particular unintended consequence. We’ll never get to find out what the upper bound of umpire performance truly looks like.

The last point I want to make is that while watching umpires become the best versions of themselves is rewarding on a spiritual level, that doesn’t necessarily mean it’s good for the game on the whole. As Noah Davis and Michael Lopez noted at FiveThirtyEight all the way back in 2015, umpires have improved much more at calling strikes than balls, and the resultant rise in called strikes is one of the reasons for a decline in offense. With the luxury of hindsight, I wonder whether the league should’ve seen this coming.

Umpires were much better at identifying balls than strikes. With so much room for improvement on pitches in the strike zone, it only makes sense that improved accuracy would mean extra strikes. I bring this up because even though the gap between pitches in and out of the zone has narrowed considerably, it’s still there. If the league were to implement a robo-zone tomorrow without changing the strike zone at all, the offensive environment would instantly get even tougher. I’m not trying to scare anybody away from our techno-future, but maybe we should all watch Terminator one last time before we flip the switch.

Davy Andrews is a Brooklyn-based musician and a contributing writer for FanGraphs. He can be found on Twitter @davyandrewsdavy.

God, I hope robot umpires are never a thing. The most I’d accept is a challenge system, where a batter or pitcher can challenge a call maybe once or twice per game. Keep the game of baseball human, please.

Reply to  mariodegenzgz

I think they’re talking about a challenge thing where you get 3 challenges, and if you’re successful you keep it.

Reply to  mariodegenzgz

Eh, the human aspect of umpiring is overrated. I watched baseball in the old days, when every umpire had their own strike zone (e.g. Eric Gregg had a strike zone as wide as himself and no batter realistically could hit many of his strikes) and pitchers had their own clout in terms of what kind of zone they got (e.g. Maddux got more strikes called because he was Greg Maddux and his reputation dictated that his pitches were strikes, even if the same pitch would have been a ball from some other pitcher).

It was only when this non-human technology became part of the game that umpires and the strike zone became more uniform, which this article has demonstrated: younger umps are more likely to call the “actual” strike zone than older ones.

And to me it seems the more technology came into play, the better the quality of product.

Reply to  mariodegenzgz

The funny thing is, with the implementation in the minors most fans have no clue the robo ump is making the calls. This whole concept of thinking a human is required to make a call on strikes is pretty absurd when you look at how egregiously bad even some of the best umps can be in a game. We’re talking game-changing misses on about 10% of pitches in a single game, on average.

Reply to  lacslyer

My big issue is the blown call that leads to runs. That’s not calculated in Umpire Scorecards. I don’t know that it’s tracked anywhere. In one game I twice saw a blown 3rd strike followed by a HR. The blown 3rd strikes weren’t considered impactful though they led directly to runs. I don’t think those would happen with robo umps. And those are the calls that are the worst for any sense of fairness.

Reply to  mariodegenzgz

I want the robots once they work out a good standard for determining top and bottom of the zone. But I never want a challenge system, it will slow things down and put managers in very difficult positions.