# One Last Refresher (On Strikeouts and Walks)

This is the last of a set of articles I’ve written over the past few weeks. Each one tries to determine what’s real and what’s noise when it comes to the outcome of a plate appearance. For the batted ball articles, the conclusions generally tracked. Variations in home run rate are largely due to the batter. Pitchers and batters both show skill in groundball rate. And line drives and popups are somewhere in between — batters exhibit a little more persistence in variation than pitchers, though neither does so strongly.

Strikeouts and walks are a different beast. It’s pretty clear that pitchers and batters can be good or bad at them. No one looks at Chris Davis or Tyler O’Neill and thinks “eh, that’s pretty unlucky to have all those strikeouts, I bet they’re average at it overall.” Likewise, Josh Hader isn’t just preternaturally lucky — he’s good at striking batters out.

So rather than attempt to prove that pitchers can be good or bad at striking out batters and vice versa, I’m interested in whether one side has the upper hand. I’m adapting a method laid out by Tom Tango here, but I’ll also repeat the same methodology I used in the previous pieces in this series.

First, let’s take a look at how pitchers did from year one to year two. As I did before, I divided every pitcher and batter into quartiles based on their 2018 strikeout rates, then I used those quartiles to group 2019 plate appearances. Then I weighted each player by the fewest plate appearances they had against any one opposing quartile (the least-n weight that comes up again below). Here you can see that pitchers’ strikeout rates in year one do a reasonable job of predicting year two:

Year | Quartile 1 | Quartile 2 | Quartile 3 | Quartile 4 |
---|---|---|---|---|
2018 | 15.4% | 19.3% | 23.2% | 30.1% |
2019 | 18.4% | 20.0% | 24.0% | 27.7% |

Of course, the same could be said for batters. Those who struck out a lot in 2018 still did so in 2019:

Year | Quartile 1 | Quartile 2 | Quartile 3 | Quartile 4 |
---|---|---|---|---|
2018 | 15.1% | 21.5% | 26.1% | 35.7% |
2019 | 16.7% | 21.6% | 25.0% | 31.7% |

Don’t focus on the rates themselves; they can be skewed because I’m equal-weighting each group rather than taking the true overall strikeout rate. Focus instead on the pattern between years. In both cases, most of the year-one spread between the first and fourth quartiles remains in year two: roughly two thirds for pitchers and three quarters for batters. Before we get into Tango’s method, here’s a grid of how the 2018 quartiles did when facing each other in 2019. The batter quartiles go down the left side, and the pitcher quartiles across the top:

Batter quartile | Pitcher Q1 | Pitcher Q2 | Pitcher Q3 | Pitcher Q4 |
---|---|---|---|---|
1 | 13.7% | 14.5% | 17.4% | 20.3% |
2 | 18.2% | 19.7% | 23.0% | 27.0% |
3 | 20.5% | 23.1% | 27.6% | 31.3% |
4 | 27.1% | 29.6% | 33.8% | 38.8% |

Now let’s use Tango’s method to regress these rates. First, we look at how much the spread between the first and fourth quartiles shrinks from year one to year two. For the pitcher quartiles, it shrinks by 37%: the spread declines, but it’s still quite large.
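As a rough check on that 37% figure, the shrinkage can be recomputed from the quartile tables above. This is a minimal Python sketch, not the code behind the article; the function name `spread_shrinkage` is made up for illustration, and it uses the rounded table values, so the results are approximate.

```python
def spread_shrinkage(year_one, year_two):
    """Share of the Q1-to-Q4 gap from year one that disappears in year two."""
    gap_one = year_one[-1] - year_one[0]
    gap_two = year_two[-1] - year_two[0]
    return 1 - gap_two / gap_one


# Strikeout rates by 2018 quartile, in percent (rounded values from the tables)
pitcher_2018 = [15.4, 19.3, 23.2, 30.1]
pitcher_2019 = [18.4, 20.0, 24.0, 27.7]
batter_2018 = [15.1, 21.5, 26.1, 35.7]
batter_2019 = [16.7, 21.6, 25.0, 31.7]

pitcher_regression = spread_shrinkage(pitcher_2018, pitcher_2019)
batter_regression = spread_shrinkage(batter_2018, batter_2019)
print(f"pitchers: {pitcher_regression:.0%}, batters: {batter_regression:.0%}")
```

With the rounded inputs this lands at about 37% for pitchers and 27% for batters.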

From there, we need the average least-n weight in our dataset, which comes out to 34. Chuck both numbers into the formula, which is [(Regression%)/(1-Regression%)]*(Average Weight), and we get an answer of 20. What does this mean? Well, let’s say you have a pitcher and want to work out how much to regress their skill to the mean. The math is pretty easy: take that pitcher’s least-n weight and divide it by that weight plus 20. Jeff Hoffman’s least-n was 30, so if you want to decide how much of his skill to retain, you can simply take 30 and divide by 50 (30+20). Want to predict his strikeout rate in 2020 using 2019 data? Retain 60% of his difference from the mean.

On the batter side, we can do the same thing. Batters regress to the mean a bit less, and they did so over a smaller average sample. Their ballast weight (the number you use to regress the data to the mean) is 8.5. Take Travis Shaw, with a least-n of 30. Want to project him in 2020? Retain 78% of his variation from the mean.
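To make the arithmetic concrete, here is a small Python sketch of the ballast formula and the retention step, using the numbers quoted above. The function names are invented for illustration, and the league-average and observed strikeout rates in the last example are hypothetical, since the article doesn’t give them.

```python
def ballast(regression, average_weight):
    """Ballast weight: [regression / (1 - regression)] * average weight."""
    return regression / (1 - regression) * average_weight


def retained_share(least_n, ballast_weight):
    """Fraction of a player's gap from the mean to keep."""
    return least_n / (least_n + ballast_weight)


def project(observed_rate, mean_rate, least_n, ballast_weight):
    """Shrink an observed rate toward the mean by the retained share."""
    keep = retained_share(least_n, ballast_weight)
    return mean_rate + keep * (observed_rate - mean_rate)


pitcher_ballast = ballast(0.37, 34)    # comes out just under 20
hoffman_keep = retained_share(30, 20)  # pitcher example: 30 / (30 + 20)
shaw_keep = retained_share(30, 8.5)    # batter example: 30 / (30 + 8.5)

# Hypothetical pitcher: 30% strikeout rate against an assumed 23% mean
example_projection = project(30.0, 23.0, least_n=30, ballast_weight=20)
print(round(pitcher_ballast, 1), hoffman_keep, round(shaw_keep, 2), example_projection)
```

The same least-n of 30 keeps 60% of a pitcher’s gap but about 78% of a batter’s, which is the whole difference between a ballast of 20 and one of 8.5.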

What this means is that you can believe a batter’s strikeout numbers, for a given observation, more than you’d believe a pitcher’s. That’s borne out by the grid of batter-pitcher matchups. On average, moving up one batter quartile added 5.3 percentage points to the 2019 strikeout rate, while moving up one pitcher quartile added 3.2 percentage points.
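Those per-tier averages can be reproduced from the matchup grid above. The sketch below is one way to do it in Python, assuming the averages are simple unweighted means of every adjacent-quartile step; `average_step` is a made-up helper name.

```python
# 2019 strikeout rates in percent: rows are batter quartiles, columns are pitcher quartiles
grid = [
    [13.7, 14.5, 17.4, 20.3],
    [18.2, 19.7, 23.0, 27.0],
    [20.5, 23.1, 27.6, 31.3],
    [27.1, 29.6, 33.8, 38.8],
]


def average_step(rows):
    """Mean change between neighboring quartiles, moving left to right in each row."""
    steps = [b - a for row in rows for a, b in zip(row, row[1:])]
    return sum(steps) / len(steps)


pitcher_step = average_step(grid)             # across a row: pitcher quartile changes
batter_step = average_step(list(zip(*grid)))  # transposed: batter quartile changes
print(f"batter step: {batter_step:.1f}, pitcher step: {pitcher_step:.1f}")
```

Under that assumption the batter step rounds to 5.3 points and the pitcher step to 3.2, matching the figures in the text.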

Let’s do the same for walks. First, we’ll look at how pitchers varied in 2018 and 2019:

Year | Quartile 1 | Quartile 2 | Quartile 3 | Quartile 4 |
---|---|---|---|---|
2018 | 4.9% | 7.3% | 9.3% | 13.4% |
2019 | 6.3% | 7.7% | 8.9% | 10.4% |

Okay, there’s some spread there. Let’s do the same for batters:

Year | Quartile 1 | Quartile 2 | Quartile 3 | Quartile 4 |
---|---|---|---|---|
2018 | 4.5% | 6.9% | 9.5% | 12.5% |
2019 | 6.5% | 8.0% | 10.0% | 11.8% |

Neat! Batters have more explanatory power again. Let’s get it in grid form:

Batter quartile | Pitcher Q1 | Pitcher Q2 | Pitcher Q3 | Pitcher Q4 |
---|---|---|---|---|
1 | 4.0% | 4.6% | 5.2% | 7.1% |
2 | 5.0% | 6.4% | 7.2% | 8.7% |
3 | 6.3% | 7.8% | 8.9% | 10.8% |
4 | 8.8% | 10.5% | 12.1% | 13.2% |

We’ve got the same story as before: batters and pitchers both have input. I’ll save you the grunt work of doing the Tango-style math here; you should use ballast numbers of 37 for pitchers and 14 for batters. In other words, you can make something of a batter’s walk rate sooner than you can a pitcher’s. That makes some sense to me; pitchers mostly sit in a narrow band of walk rates long-term, while batters really don’t. Mike Trout has a 19% walk rate over the last three years while Dee Gordon is at 3.1%. The biggest pitcher gap is between Tyler Chatwood’s 14.5% and Mike Leake’s 4.1%. Batters just vary more.
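Here is a quick Python sketch of what those two walk ballasts imply at different sample sizes. The least-n values in the loop are arbitrary illustrations rather than real players, and the helper name `retained_share` is an assumption of this sketch.

```python
def retained_share(least_n, ballast_weight):
    """Fraction of a player's gap from the mean to keep."""
    return least_n / (least_n + ballast_weight)


# Walk-rate ballast weights quoted in the article
WALK_BALLAST = {"pitcher": 37, "batter": 14}

for least_n in (10, 30, 100):  # illustrative weights only
    shares = {role: retained_share(least_n, weight) for role, weight in WALK_BALLAST.items()}
    summary = ", ".join(f"{role} keeps {share:.0%}" for role, share in shares.items())
    print(f"least-n {least_n}: {summary}")
```

One handy way to read a ballast number: it is the least-n at which you keep exactly half of a player’s gap from the mean, so a batter gets there at 14 while a pitcher needs 37.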

If you’re still with me after all this time, thank you. It was a bit of a slog to write, but hopefully not to read. Today’s conclusion isn’t particularly surprising: you should care about batter *and* pitcher strikeout and walk rates, because both can tell you something. Maybe you can learn about batters’ rates a little faster, but you would be well served to use both.

In no time we’ll be back to our regularly scheduled articles talking about new pitches and new performance levels, or perhaps even teams’ competitive aspirations. But I’m glad I had a chance to go over some of the basics before games start up again and make sure that what we expect still roughly holds.

Ben is a contributor to FanGraphs. A lifelong Cardinals fan, he got his start writing for Viva El Birdos. He can be found on Twitter @_Ben_Clemens.

I would rewrite (Regression%)/(1-Regression%)*(Average Weight) with additional brackets to make clearer the order of operations, because it’s ambiguous – in some journals multiplication has preference over division, so this formula would be interpreted incorrectly. [Regression%/(1-Regression%)]*Average Weight is clearer.

Wow, I didn’t realize that. I’ll update it, and yeah, you were interpreting it correctly.

It’s why all those viral Facebook posts like “what’s 3+4/6*8? 80% of users get it wrong!” are idiotic: it’s all convention, and there’s more than one such convention. Always best just to make things unambiguous.

There is a correct order of operations, but not even programming languages always get it right.

There is certainly a correct order in PEMDAS in that multiplication and division go before addition and subtraction, but there is not a “correct” order except based on convention for dealing with division followed by multiplication, which can vary from journal to journal, especially depending on your discipline. Some will give division and multiplication equal precedence and then go from left to right, while some will go with multiplication over division. In math we generally just frown on something like 3/3*2 as ambiguous and instead write the division 3/3 as 3 over 3 to avoid all ambiguity (easier done in papers, where you can just use LaTeX). I am an applied mathematician, and this has unfortunately come up before in certain physics journals.

Edit: I should say, it’s all convention. Mathematical notation always is.

A lot of saber equations are formatted badly. I was writing something to process and calculate FIP last night, and the equations in FG guts were both inaccurate until I went through and added more parentheses to make the order of operations work correctly. (3 * BB + HBP) does not produce the right result, and / leagueIP + FIPconstant has similar issues.

(3*(BB + HBP)) and / leagueip) + FIPconstant are what was needed to get accurate results from my code.
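For anyone coding this up, here is a minimal Python sketch of why the grouping matters. It assumes the standard FIP formula, ((13*HR) + (3*(BB+HBP)) - (2*K))/IP + constant; the season line and the constant of 3.2 are hypothetical, and the second function is a deliberately wrong version for comparison.

```python
def fip(hr, bb, hbp, k, ip, fip_constant):
    """FIP with every grouping spelled out."""
    return ((13 * hr) + (3 * (bb + hbp)) - (2 * k)) / ip + fip_constant


def fip_misgrouped(hr, bb, hbp, k, ip, fip_constant):
    """Same terms with the parentheses dropped (intentionally wrong):
    HBP is no longer multiplied by 3, and only the strikeout term
    gets divided by innings pitched."""
    return 13 * hr + 3 * bb + hbp - 2 * k / ip + fip_constant


# Hypothetical season line, for illustration only
line = dict(hr=20, bb=50, hbp=5, k=200, ip=180.0, fip_constant=3.2)
print(round(fip(**line), 2), round(fip_misgrouped(**line), 2))
```

Python follows the usual convention (multiplication and division bind tighter than addition, and evaluate left to right), so the misgrouped version runs without complaint and simply returns a nonsense number in the hundreds.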

How about just using LaTeX to format the equations? It is the de facto standard in most scientific and math writing. I think WordPress supports LaTeX too.

That would be the ideal solution.

There’s like half a dozen LaTeX plug-ins for WordPress (just like every other kind of plug-in, though compounded by the overlap between programming and the typical user of LaTeX). The trick is picking one that works best for the FG writers while offering the least maintenance headaches / vulnerability surface for Sean and other site admins.