Let’s Have Fun With Non-Neutral Game Contexts

January 23, 2020

As I was browsing through Baseball Twitter a few days ago (terrible habit, I suggest you avoid it), I came across an interesting question:

Which player would you rather have at the plate in a bottom of the 9th bases loaded 2 outs down by 1 situation?
Player A
136 wRC+ 7% BB rate .191 ISO
Player B
161 wRC+ 14.7% BB rate .267 ISO
Player C
174 wRC+ 18.7% BB rate .343 ISO

— Max Greenfield (@GreenfieldMax18) January 21, 2020

My brain loves puzzles and answering questions, so I decided to vote. The obvious choice is Player C. He’s the best hitter, and I want my best hitter in the most important situations. The point of the exercise has to be to dunk on fans who think Player A is clutch, right?

Well, essentially yes. The players in this question are 2019 DJ LeMahieu, 2014 Giancarlo Stanton, and 2017 Aaron Judge. Yankees fans were really into LeMahieu in 2019, to the point of advocating for him as MVP, and while he was certainly a good hitter, he’s not Aaron Judge.

But don’t stop the analysis there, because something important is missing. Picking the best hitter is easy — as long as you define best correctly. For example, ISO shouldn’t enter into your decision at all. With the bases loaded and two outs, there’s not much difference between a single and a home run, and there’s definitely no difference between a double and a home run. Power stats aren’t relevant here.

Why not simply use wRC+? It’s a wonderful stat — but it’s not the right statistic for this calculation. To work out the value of each action (out, walk, hit) in a wRC+ or wOBA framework, you take the change in run value from the state before the play to after the play for each possible base/out state, then weight those changes by how frequently each state occurs.

That’s a mouthful, so let’s walk through an example. We want to figure out the value of a single. In this hypothetical example, our batter comes to the plate to face one of two situations — bases empty with no one out, or a man on third with two outs. Let’s further say, for the sake of argument, that each of them occurs equally often.

By consulting this handy chart, we can define the run values. Going from bases empty and none out to a man on first and none out moves run expectancy from 0.5439 runs to 0.9345 runs. Therefore, the added value of the single (relative to before the plate appearance) is (0.9345-0.5439), or 0.3906 runs. In the other case, with a runner on third and two outs, we need to add the run that scores with a single. Before the play, run expectancy was 0.3907. Afterwards, a man on first with two outs has a run expectancy of 0.2422, to which we add 1 for the scoring runner. That makes the value of that hit (1.2422-0.3907), or 0.8515.

From there, we can simply average the two to get the run value of an average single in this pocket universe — 0.6211. We don’t have to stop there, either. We could vary the frequency of each outcome, like so:

Single Value In Different Worlds

Variation	Bases Empty Frequency	Runner On Frequency	Run Value of a Single
1	50%	50%	0.621
2	75%	25%	0.506
3	25%	75%	0.736

Or we could add a third outcome, say bases empty and two outs:

Single Value, Take Two

State	Frequency	Run Value of a Single
None On, None Out	33.3%	0.3906
Man on Third, Two Outs	33.3%	0.8515
None On, Two Out	33.3%	0.1275
Total	100.0%	0.4565
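The arithmetic behind both tables can be sketched in a few lines of Python. This is only an illustration of the weighting idea, not how wOBA weights are actually built; the function names `single_value` and `weighted_value` are made up for this example, and the run expectancies are the ones quoted above.

```python
def single_value(re_before, re_after, runs_scored=0):
    """Run value of a single: change in run expectancy plus any runs that score."""
    return re_after + runs_scored - re_before


def weighted_value(values, weights):
    """Frequency-weighted average of per-state run values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Run expectancies quoted in the text.
empty_none_out = single_value(0.5439, 0.9345)                # about 0.3906
third_two_out = single_value(0.3907, 0.2422, runs_scored=1)  # about 0.8515
# The text gives only the final value for this state, so it is used as-is.
empty_two_out = 0.1275

# Two-state pocket universe at the three frequency mixes from the first table.
for w_empty, w_runner in [(0.50, 0.50), (0.75, 0.25), (0.25, 0.75)]:
    value = weighted_value([empty_none_out, third_two_out], [w_empty, w_runner])
    print(f"{w_empty:.0%} empty / {w_runner:.0%} runner on: {value:.3f}")

# Three equally likely states, as in the second table.
three_state = weighted_value(
    [empty_none_out, third_two_out, empty_two_out], [1, 1, 1]
)
print(f"three states: {three_state:.4f}")
```

Changing the weights is all it takes to move the value of a single around, which is the whole point: the number depends on how often each situation comes up.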

There are a few other calculations (to convert raw runs into wOBA, account for league average, and apply park factors), but that’s the basis for wRC+. There’s no way of knowing what situation you’ll face when you come to the plate, and so the value of each hit has to reflect its possible value across all situations, weighted by frequency.

That works perfectly well when we’re evaluating hitters in the abstract, because you can’t predict who will be up in a big spot or which inning the home run will occur in. You simply want the hitters who do the most on-average-good things the most times. But in a particular situation, such as the one posed in our initial question, that assumption stops making sense.

There’s a further complication, as well. All of the above calculations were done in units of runs. But baseball teams aren’t, in the end, judged by runs. They’re judged by wins. Here’s another hypothetical situation. It’s the bottom of the eighth, and the game is tied. There are runners on first and third with two outs.

One batter hits a home run a third of the time and strikes out the other two thirds of the time. His expected run value here is one. A third of the time he gets three runs, and two thirds of the time he gets zero (ignore the fact that the inning continues after the home run for the moment). Another batter is a real weirdo: he’s guaranteed to score the run from third with a single, but he always gets thrown out trying to stretch his hit into a double, ending the inning. His expected run value is also one, because one run scores every time.

From a runs perspective, these two hitters are equally valuable. Add in the game state, however, and things change. Here are the odds of winning a game from the three states our batters can leave the game in, via the delightful WPA Inquirer:

Win Probability

State	Win Probability
End 8, Up 1	82.6%
Bottom 8, Up 3	96.4%
End 8, Tied	50%

By combining the latter two states, we reach a pretty obvious answer. The team would vastly prefer the batter who always singles. In his appearances, the team wins 82.6% of the time. When the feast or famine hitter is playing, on the other hand, the probability of a win falls to 65.5%. The first run is far more valuable than the second and third runs, which makes runs the wrong unit to consider if we want to decide who should bat in this particular situation.
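Here is a minimal sketch of that comparison, assuming only the win probabilities from the table above. The outcome lists and the helper names (`expected_runs`, `expected_win_prob`) are invented for illustration.

```python
# Win probabilities from the table above.
WIN_PROB = {
    "end 8, up 1": 0.826,
    "bottom 8, up 3": 0.964,
    "end 8, tied": 0.500,
}

# Each outcome is (probability, runs scored, resulting game state).
feast_or_famine = [
    (1 / 3, 3, "bottom 8, up 3"),  # three-run homer
    (2 / 3, 0, "end 8, tied"),     # strikeout ends the inning
]
always_singles = [
    (1.0, 1, "end 8, up 1"),       # run scores, batter is thrown out at second
]


def expected_runs(outcomes):
    """Probability-weighted runs scored on the play."""
    return sum(p * runs for p, runs, _ in outcomes)


def expected_win_prob(outcomes):
    """Probability-weighted win probability after the play."""
    return sum(p * WIN_PROB[state] for p, _, state in outcomes)


for name, outcomes in [("feast or famine", feast_or_famine),
                       ("always singles", always_singles)]:
    print(f"{name}: {expected_runs(outcomes):.2f} runs, "
          f"{expected_win_prob(outcomes):.1%} win probability")
```

Both hitters come out at one expected run, but the win probabilities split apart, which is the gap between thinking in runs and thinking in wins.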

Back to the question at hand: we’re looking for a batter to come to the plate with two outs and the bases loaded in the bottom of the ninth inning. wRC+ isn’t going to do the trick, because of both its valuation in runs and its context neutrality. When a single is as good as a homer, we have to go more granular.

Take a look at each batter’s production on a rate basis (with sacrifice bunts and intentional walks removed):

Rate Stats for Three Yankees

Player	BB+HBP	1B	2B	3B	HR	Out
DJ LeMahieu	7.34%	20.80%	5.05%	0.31%	3.98%	62.54%
Giancarlo Stanton	11.89%	14.01%	5.05%	0.16%	6.03%	62.87%
Aaron Judge	18.14%	11.24%	3.60%	0.45%	7.80%	58.77%

Next, let’s work out the team’s odds of winning after each possible result. The only tricky one is a single. For that, I’ve assumed that the winning run scores 75% of the time, stops at third 20% of the time, and is thrown out at the plate 5% of the time. You can change those assumptions if you’d like, but I’m happy approximating:

Win Probability After:

Event	BB+HBP	1B	2B	3B	HR	Out
Win Probability	66.40%	90.78%	100%	100%	100%	0%

With the odds of winning after each state and the chances we reach each state, we can calculate the team’s win probability with each batter at the plate:

The Best Yankee, In Context

Player	Team Win%
DJ LeMahieu	33.1%
Giancarlo Stanton	31.9%
Aaron Judge	34.1%

It’s close! It’s much closer than expected, actually. The player with the best wRC+ ended up being the right choice, but not by much. The player with the worst wRC+ was a narrow second. And I actually voted for Stanton!
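If you want to plug in your own assumptions, the calculation can be sketched like this. The rates and the post-event win probabilities come from the tables above. Two things are my own inference rather than anything stated in the text: that a runner stopping at third leaves the same tied, bases-loaded state as a walk (66.4%), and that a runner thrown out at the plate sends a tied game to extras at 50%. Together they happen to reproduce the 90.78% figure for a single.

```python
# Per-plate-appearance outcome rates from the rate-stats table.
RATES = {
    "DJ LeMahieu": {"BB+HBP": 0.0734, "1B": 0.2080, "2B": 0.0505,
                    "3B": 0.0031, "HR": 0.0398, "Out": 0.6254},
    "Giancarlo Stanton": {"BB+HBP": 0.1189, "1B": 0.1401, "2B": 0.0505,
                          "3B": 0.0016, "HR": 0.0603, "Out": 0.6287},
    "Aaron Judge": {"BB+HBP": 0.1814, "1B": 0.1124, "2B": 0.0360,
                    "3B": 0.0045, "HR": 0.0780, "Out": 0.5877},
}

TIED_BASES_LOADED = 0.664  # win probability after a walk or hit by pitch
EXTRA_INNINGS = 0.50       # assumption: a tied game heading to extras is a coin flip

# Single: winning run scores 75%, stops at third 20%, is thrown out at home 5%.
single_wp = 0.75 * 1.0 + 0.20 * TIED_BASES_LOADED + 0.05 * EXTRA_INNINGS

WIN_PROB_AFTER = {"BB+HBP": TIED_BASES_LOADED, "1B": single_wp,
                  "2B": 1.0, "3B": 1.0, "HR": 1.0, "Out": 0.0}


def team_win_prob(rates):
    """Sum of P(event) * P(win | event) over every possible result."""
    return sum(rates[event] * WIN_PROB_AFTER[event] for event in rates)


results = {player: team_win_prob(rates) for player, rates in RATES.items()}
for player, wp in results.items():
    print(f"{player}: {wp:.1%}")
```

Nudging the 75/20/5 split on singles is the easiest way to see how sensitive the ranking is, since LeMahieu's case rests almost entirely on his singles rate.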

There’s a point to this analysis. Stats are great, and yet when you use them out of context, they don’t necessarily say what you’re trying to say. If I wanted to decide who to have at the plate in this situation, this grid of data about each player clearly wouldn’t be useful:

Some Data About Yankees

Player	Cats or Dogs?	Favorite Color	Greatest Fear
DJ LeMahieu	Dogs	Blue	Spiders
Giancarlo Stanton	Cats	Orange	Snakes
Aaron Judge	Dogs	Red	Loneliness

But the truth is, that’s not all that different from this table:

Some Data About Yankees

Player	Stolen Bases	BB%	ISO
DJ LeMahieu	5	7%	0.191
Giancarlo Stanton	13	14.7%	0.267
Aaron Judge	9	18.7%	0.343

The second table looks more like baseball stats, to be sure, but it doesn’t give much more relevant information. Even adding wRC+ doesn’t give us enough data, because it’s measuring something valuable but unsuited to the situation.

Before we finish, this choice and its non-intuitive wrinkles made me think of another fun hypothetical. Imagine an NL team with two 100 wRC+ pinch-hitters. The two batters get to their line in very different ways, though:

Two Hypothetical Pinch Hitters

Batter	BB+HBP	1B	2B	3B	HR	Out
Johnny Singles	5.0%	22.0%	5.0%	0.3%	1.5%	66.2%
Joe Dinger	12.0%	12.0%	3.0%	0.0%	5.0%	68.0%

Our hypothetical team gets one chance to pinch hit every day, and it’s always in one of two situations: bottom of the sixth, game tied, man on third with two outs, or bottom of the sixth, game tied, bases empty with two outs. Before these batters come to the plate, the team’s odds of winning are 56.1% in the first case and 51.8% in the second case. After they bat, here’s how much those odds change:

Change in Win Probability

Player	Situation 1 WP% Change	Situation 2 WP% Change
Johnny Singles	0.46%	-0.23%
Joe Dinger	-0.47%	0.22%

If you have each situation an equal number of times, you’d prefer the first hitter, who would add a whopping 0.2 wins over a 162-game season relative to a completely neutral hitter in those situations. The second batter is worth 0.2 wins less than that true neutral hitter. If we can mix and match, however, sending each hitter up only in the situation that suits him, we can add 0.5 wins to the team’s total relative to an average hitter.

That doesn’t sound like much — and I mean, it’s really not. But on the other hand, half a win over 162 PA isn’t nothing. And it’s created out of whole cloth; the players don’t look any different at a high level, and it doesn’t sound like having two 100 wRC+ hitters is better than having one.
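For anyone who wants to check the season totals, here is a rough sketch. It assumes 162 pinch-hit chances split evenly between the two situations, as described above, and uses the rounded percentages from the table, so the results land near the quoted figures rather than on them exactly. The `season_wins` helper is a name I made up.

```python
CHANCES = 162  # one pinch-hit opportunity per game

# Change in win probability per plate appearance: (situation 1, situation 2).
WP_CHANGE = {
    "Johnny Singles": (0.0046, -0.0023),
    "Joe Dinger": (-0.0047, 0.0022),
}


def season_wins(change_sit1, change_sit2, share_sit1=0.5):
    """Wins added over a season, given how often situation 1 comes up."""
    per_chance = share_sit1 * change_sit1 + (1 - share_sit1) * change_sit2
    return CHANCES * per_chance


singles_only = season_wins(*WP_CHANGE["Johnny Singles"])
dinger_only = season_wins(*WP_CHANGE["Joe Dinger"])
# Mix and match: Singles bats in situation 1, Dinger in situation 2.
platoon = season_wins(WP_CHANGE["Johnny Singles"][0], WP_CHANGE["Joe Dinger"][1])

print(f"Singles every time: {singles_only:+.2f} wins")
print(f"Dinger every time:  {dinger_only:+.2f} wins")
print(f"Mix and match:      {platoon:+.2f} wins")
```

The `share_sit1` parameter is there in case the two situations don't come up equally often, which would shift which hitter you'd rather carry if you could only keep one.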

Teams probably don’t need to focus on doing things like this, on diversifying skillsets at the cost of production. It’s enough value that they shouldn’t ignore it if it falls in their laps, and managers should absolutely consider the game state and batter tendencies when sending out a pinch-hitter, but it doesn’t merit dramatic roster overhauls.

In fact, this analysis shows why wRC+ is a good metric overall; these hitters are as different as possible, and the situation I created for them is polarized to their skill sets, and yet the difference amounts to very little over a full year. That doesn’t mean there’s no value to considering context, however. If nothing else, it turns a perfunctory online survey into a 1600-word article, and hopefully an enjoyable 10 minutes of reading.
