Pitcher zStats Entering the Homestretch, Part 1 (Validation)

One of the strange things about projecting baseball players is that even results themselves are small samples. Full seasons result in specific numbers that have minimal predictive value, such as BABIP for pitchers. The predictive value isn’t literally zero — individual seasons form much of the basis of projections, whether math-y ones like ZiPS or simply our personal opinions on how good a player is — but we have to develop tools that improve our ability to explain some of these stats. It’s not enough to know that the number of home runs allowed by a pitcher is volatile; we need to know how and why pitchers allow homers beyond a general sense of pitching poorly or being Jordan Lyles.
Data like that which Statcast provides gives us the ability to get at what’s more elemental, such as exit velocities and launch angles, things that are in themselves more predictive than their end products (the number of homers). Statcast has its own implementation of this kind of exercise in its various “x” stats. ZiPS uses slightly different models with a similar purpose, which I’ve dubbed zStats. (I’m going to make you guess what the z stands for!) The differences in the models can be significant. For example, when talking about grounders, balls hit directly toward the second base bag became singles 48.7% of the time from 2012 to ’19, with 51.0% outs and 0.2% doubles. But grounders hit 16 degrees to the “left” of the bag became hits only 10.6% of the time over the same stretch, and for grounders hit 16 degrees toward the second base side, it was 9.8%. ZiPS also uses data like sprint speed when calculating hitter BABIP, because how fast a player is has an effect on BABIP and extra-base hits.
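To make the grounder example concrete, here is a minimal sketch of how empirical hit rates by spray direction could be tabulated. It is not the ZiPS or Statcast model. The function name `hit_rate_by_angle`, the bucket width, the angle convention (0 means straight at the second base bag), and the sample rows are all hypothetical and made up for illustration.

```python
from collections import defaultdict


def hit_rate_by_angle(batted_balls, bucket_width=4.0):
    """Return {bucket_start: share of grounders in that bucket that became hits}.

    batted_balls: iterable of (spray_angle_degrees, was_hit) pairs.
    Assumed convention: 0 degrees is straight at the second base bag.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket start -> [hits, total]
    for angle, was_hit in batted_balls:
        # Floor the angle to the start of its bucket (works for negatives too).
        bucket = (angle // bucket_width) * bucket_width
        if was_hit:
            counts[bucket][0] += 1
        counts[bucket][1] += 1
    return {b: hits / total for b, (hits, total) in sorted(counts.items())}


# Made-up rows purely for illustration, not real batted-ball data.
sample = [(0.5, True), (1.0, False), (-16.0, False), (-15.0, False), (-14.5, True)]
rates = hit_rate_by_angle(sample)
for bucket, rate in rates.items():
    print(f"{bucket:+.0f} degrees: {rate:.3f}")
```

A real model would condition on more than direction (exit velocity, launch angle, sprint speed), but the bucketed table shows why two grounders with the same result line can carry very different expectations.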
ZiPS doesn’t discard actual stats; the models all improve from knowing the actual numbers in addition to the zStats. You can read more on how zStats relate to actual stats here. For those curious about the r-squared values between zStats and real stats for the offensive components, they’re 0.59 for zBABIP, 0.86 for strikeouts, 0.83 for walks, and 0.78 for homers. Those relationships are what make these stats useful for predicting the future. If you can explain 78% of the variance in home run rate between hitters with no information about how many homers they actually hit, you’ve answered a lot of the riddle. All of these numbers correlate better than the actual numbers with future numbers, though a model that uses both zStats and actual ones, as the full model of ZiPS does, is superior to either by itself.
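For readers who want to see what an r-squared figure like those above measures, the following is a minimal sketch that computes the squared Pearson correlation between an estimated rate and an actual rate. The helper name `r_squared` and the five pairs of home run rates are hypothetical, made up for illustration, and are not ZiPS output.

```python
from statistics import mean


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when either series is constant")
    return (sxy * sxy) / (sxx * syy)


# Made-up home run rates for five hitters, purely illustrative.
z_hr = [0.020, 0.035, 0.050, 0.041, 0.028]       # hypothetical zStat estimates
actual_hr = [0.022, 0.031, 0.055, 0.038, 0.030]  # hypothetical actual rates
print(round(r_squared(z_hr, actual_hr), 2))
```

A value of 0.78 computed this way would mean the estimate accounts for 78% of the variance in the actual rate across players, which is the sense in which the figures above are quoted.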