Randomness, Stabilization, & Regression

“Stabilization” plate appearance levels of different statistics have been popular around these parts in recent years, thanks to the great work of “Pizza Cutter,” a.k.a. Russell Carleton.  Each stat is given a PA cutoff, which is supposed to be a guideline for the minimum number of PAs a player needs before you can start to take their results in that stat seriously.  Today I’ll be looking at the issue of stabilization from a few different angles.  At the heart of the issue are mathy concepts like separating out a player’s “true skill level” from variation due to randomness.  I’ll do my best to keep the math as easily digestible as I can.

The first thing I want to discuss is randomness.  I’m sure I’m mainly preaching to the choir here, but let me briefly illustrate something: if you have a fair 6-sided die, you know that a given side has a 1/6 chance of coming up; that’s the “true” probability (about 0.167).  But the fewer times you roll it, the greater the chance that your desired number is going to come up a lot more or less than 16.7% of the time.  Feel free to play around with these interactive charts to get a better feel for how that works:

Instructions: enter a whole number between 1 and 500 under Number of Trials, and a decimal between 0 and 1 (or a percentage between 0% and 100%) under “True” Success Rate.

These charts represent the binomial distribution.  The top chart indicates the probability (on the vertical axis) of getting the particular result shown on the horizontal (x) axis. The lower chart shows the cumulative distribution function, meaning each point indicates the probability that the result will be equal to or less than the value on the x-axis.  Look at the points, in particular — the lines are just there to show the overall trends.
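For anyone who would rather poke at the numbers than the charts, here is a minimal Python sketch of the same two binomial calculations. The function names and the 30% threshold are made up for illustration; only the die example comes from the text above.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

# Fair die: how likely is it that a chosen face comes up at least 30%
# of the time, even though its "true" rate is 1/6?
p = 1 / 6
for n in (10, 100, 500):
    k = 3 * n // 10                      # 30% of the rolls
    at_least = 1 - binomial_cdf(k - 1, n, p)
    print(f"{n} rolls: P(at least {k} hits) = {at_least:.6f}")
```

The printed probabilities shrink as the number of rolls grows, which is the "fewer rolls, more flukiness" point in numeric form.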

The binomial distribution indicates how likely it is that the results you’re seeing are fluky; i.e., it helps to separate out the randomness in your observations.  Now, baseball is a lot more complicated than rolling dice.  First of all, you can only guess what a player’s “true” success rates are.  Complicating things even more are the effects that opposition, parks, and weather can have on their observed success rates.  That’s not even the whole of it — a player’s “true” rates can change over their careers, and probably even over the course of a season (depending on their physical — and arguably, psychological — health).  Still, the principle behind the binomial distribution plays an important role in baseball stats.

The Pizza Cutter Method, Version 1

To recap Russell’s original method: he splits each player’s PAs over several seasons into 2 groups — evens and odds (according to their chronological order, I believe).  He then figures out each player’s stats in each group for every PA level in question, and figures out the correlations between the stats in the two groups for all the players.  He then tries to find the PA level at which the correlations between the odd and even groups reach 0.7.  Now, using that 0.7 correlation figure has been the topic of some debate, which I’ll address in a little bit.  First, I want to show you the results of my research on 2012 Retrosheet data, using something close to Russell’s original methodology:
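To make the odd/even procedure concrete, here is a rough Python sketch run on simulated players rather than Retrosheet data. Everything about the simulation (300 players, a skill spread of 0.027, the seed) is an assumption for illustration, and so is the choice to treat the PA level as the total that gets split into two halves. It is not Russell’s actual code.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_correlation(players, n_pa):
    """Correlate even-numbered vs. odd-numbered PA rates across players.

    players: list of per-player outcome lists (1 = reached base, 0 = out),
             in chronological order. Players with fewer than n_pa are skipped.
    """
    evens, odds = [], []
    for outcomes in players:
        if len(outcomes) < n_pa:
            continue
        sample = outcomes[:n_pa]
        even_half, odd_half = sample[0::2], sample[1::2]
        evens.append(sum(even_half) / len(even_half))
        odds.append(sum(odd_half) / len(odd_half))
    return pearson(evens, odds)

# Simulated league (hypothetical parameters, fixed seed for repeatability)
rng = random.Random(2012)
players = []
for _ in range(300):
    skill = rng.gauss(0.330, 0.027)      # each player's "true" OBP*
    players.append([1 if rng.random() < skill else 0 for _ in range(600)])

for n_pa in (100, 300, 600):
    print(n_pa, round(split_half_correlation(players, n_pa), 2))
```

With real data you would build each player’s outcome list from the play-by-play log instead of simulating it.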

Pizza Cutter

The asterisk next to OBP is there because I was calculating times on base divided by plate appearances, for the sake of keeping my Retrosheet spreadsheet a little simpler… the proper definition of OBP would actually subtract sacrifice hits (successful bunts) from PA in the denominator.  So, expect a little lower average and probably a tiny bit higher variance for OBP* than for actual OBP.

Anyway, what we see here is that OBP doesn’t stabilize over the course of a season, according to Russell’s definition, but we might extrapolate out the regression line you see there and say that a 0.7 correlation might be reached somewhere in the area of 800-900 PA.  However, as I mentioned, some — including Tom Tango — disagree with using 0.7 correlation as the standard for this purpose, thinking 0.5 makes more sense (the reason for that could use a little more setup, I think).  According to the alternative view, OBP* stabilizes around 600 PA.  Or at least it did in 2012… I don’t think it’s anywhere near a given that it will be that consistent between years.

A Nerdy Debate

Why was a 0.7 correlation between the split halves Russell’s target?  Well, if you square a correlation, you get a figure that’s supposed to represent the proportion of the variance in factor y that can be accounted for by knowing factor x.  0.7 squared equals 0.49, implying that the factor on the x-axis explains 49% of what’s happening with the factor on the y-axis.  So, the idea was that the point at which the correlation between the halves passes 0.7 is the point at which we’re seeing half signal, half noise out of the sample.  However, Tango disagrees, and says a 0.5 correlation is what achieves that goal in this type of situation.  Kincaid, of 3-dbaseball.net, agrees with Tango, and explains in the comments,

“r=.7 between two separate observed samples implies that half of the variance in one observed sample is explained by the other observed sample.  But the other observed sample is not a pure measure of skill; it also has random variance.  So you can’t extrapolate that as half of the variance is explained by skill.”

Correlation squared, or R squared, also known as the coefficient of determination,  is supposed to compare the results of a mathematical model to the observed results this model is supposed to predict or explain.  The model is supposed to be a representation of the “true” (e.g. skill-based) rates, having regressed out as much of the randomness as it could.  So, you see why comparing a de-lucked sample (from the model) to an observed sample is a bit different from comparing two observed samples — the model has a big leg up in matching up with the observed sample.

Kincaid continues:

“You actually get half of the variance explained by skill when the correlation between samples is .5.  That is the point when half of the variance in each sample is explained by skill, and half by random variation.  Then, when you try to explain the variance of one sample (which is half skill/half random) with the other sample, you get half the variance of one sample explaining half the variance of the other sample, which means the overall variance of one will explain 1/4 of the other.”
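Kincaid’s arithmetic can be checked with a couple of lines of Python. This sketch assumes the textbook setup where both samples are the same size, share the same underlying skill, and have independent noise; under those assumptions the expected correlation between the two samples equals the skill share of each sample’s variance. The function name is made up for illustration.

```python
def expected_sample_correlation(true_var, error_var):
    """Expected correlation between two samples with shared skill, independent noise.

    Under the stated assumptions this is also the share of each sample's
    variance that comes from skill.
    """
    return true_var / (true_var + error_var)

# Half skill, half noise in each sample:
r_equal = expected_sample_correlation(1.0, 1.0)
print(r_equal, r_equal ** 2)   # correlation, and variance of one sample explained by the other

# A 0.7 correlation between samples needs a 70/30 skill/noise split:
print(expected_sample_correlation(7.0, 3.0))
```

So a 50/50 split shows up as r = 0.5 (and r squared = 0.25, the “1/4” in the quote), while r = 0.7 corresponds to a sample that is already about 70% skill.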

Another Method — Regression Towards the Mean

The aforementioned Kincaid wrote a great (and very mathy) article on some methods of separating out the luck and skill of players.  I’m going to apply some of that knowledge to my same 2012 batter OBP* sample.

First of all, remember that what we see in the stats is part skill, part luck.  In other words:

Total Variance = “True” Variance + Error Variance

That “error” is from sources like insufficient sample size and poor measurements.  We want to try to cut out the distraction the errors cause to try to get to the “true” variance.

The first thing we need to do is find the total variance of the population.  Not surprisingly, that depends a lot on how many PAs we’re analyzing per player. You may have noticed that underneath my binomial distribution charts, the variance of the distribution was being calculated.  The formula for that was:

Variance = p*(1-p) / n

…where p = the mean probability of the stat in question (“success rate”… e.g., the league average of the stat, over a given number of PA), and n = the number of events being considered (e.g., PAs per player).

That formula should give us a good idea of how much of the variance we’re seeing comes from the PA level we’re analyzing.  It doesn’t address the variance due to the sample size of players we’re using, though (maybe somebody who knows more about stats than me can tell/remind me how that works).

So, for each PA level under investigation, we calculate the random variance as predicted by the second formula. Here, you’ll see both the total (observed) variance and the random variance:

Pizza Cutter 2

The p in the random variance formula changes noticeably in this sample, by the way; up until about 60 PA, the mean OBP* is only around .300, but that number climbs steadily to around .340 by 500 PA.  Part of that rise may have something to do with batters getting more comfortable with more PAs, but I’m sure the main cause is the exclusion of pitchers, bench players, and other fill-ins from the higher-PA samples.

Now we can subtract the random variance from the total variance, and it gives us:

Pizza Cutter 3

It looks like once we’ve taken batters with fewer than 100 PA out of the equation, things get a lot more predictable.

Now, to find the PA level at which the true variance equals the random variance (in other words, where the signal equals the noise — the same as Pizza Cutter’s goal) we bring back the second equation, this time inserting the true variance and solving for n at each PA level:

Pizza Cutter 4

Ignoring the flukiness of the first 100 PA, it looks like the “stable” PA level plateaus around the low 300s.  Starting a little before 500 PA, there may be some small sample size and outlier issues, with fewer players able to meet that threshold.
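The three steps (random variance, subtraction, solving for n) can be sketched in a few lines of Python. The input numbers below are hypothetical placeholders, not values read off the charts, and the function names are made up for illustration.

```python
def random_variance(p, n):
    """Binomial variance of a rate: p * (1 - p) / n."""
    return p * (1 - p) / n

def true_variance(total_var, p, n):
    """Observed variance minus the variance expected from randomness alone."""
    return total_var - random_variance(p, n)

def stabilization_pa(p, true_var):
    """PA level where random variance equals true variance.

    Solves p * (1 - p) / n = true_var for n.
    """
    return p * (1 - p) / true_var

# Hypothetical inputs: league mean OBP*, PA per player, observed variance
p = 0.330
n = 400
total_var = 0.00128

tv = true_variance(total_var, p, n)
print("random variance:", random_variance(p, n))
print("true variance:  ", tv)
print("stable PA level:", stabilization_pa(p, tv))
```

If the true variance comes out at or below zero for a given PA level, the observed spread is no wider than randomness alone would predict, and the division is not meaningful.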

How does all this relate to regression to the mean?  Well, let’s say that the mean OBP* is 0.330.  Let’s also say the “stable” PA cutoff for OBP* is about 305.  Going by the method Kincaid outlines, 305 is going to be our PA denominator, so we find the numerator that’s going to average out to a .330 OBP* by multiplying 0.330 * 305 = 100.65.  Now, pick a player, any player, and add 100.65 to however many times he reached base, then divide that quantity by the quantity of his PA + 305.  What you’ve just done is to regress that player’s numbers towards the mean, providing an estimate of his true talent based on that threshold.  The regression is exactly 50% for a player with 305 PA, more than that with fewer PA, and less with more.  It will therefore estimate a batter with fewer PAs to be a lot closer to the mean than a batter with the same stats but a lot more PAs (unless he’s already close to the mean, of course).
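Here is that recipe as a small Python sketch, using the 0.330 mean and 305 PA cutoff from the paragraph above as defaults. The function names and the two example batters are made up for illustration.

```python
def regress_to_mean(times_on_base, pa, league_mean=0.330, stable_pa=305):
    """Add stable_pa league-average PAs to the player's line, then take the rate."""
    return (times_on_base + league_mean * stable_pa) / (pa + stable_pa)

def regression_fraction(pa, stable_pa=305):
    """How far the estimate is pulled towards the mean (0.5 when pa == stable_pa)."""
    return stable_pa / (pa + stable_pa)

# Two hypothetical batters, both with a .400 OBP* so far
for on_base, pa in ((20, 50), (240, 600)):
    est = regress_to_mean(on_base, pa)
    frac = regression_fraction(pa)
    print(f"{pa} PA: estimate {est:.3f}, regressed {frac:.0%} towards the mean")
```

The 50 PA batter ends up much closer to .330 than the 600 PA batter, even though their observed rates are identical.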

Pizza Cutter, Redux

Russell improved his methodology in a more recent analysis, employing a much bigger sample, and using a fancier technique than his original split-sample method — a formula known as KR-21 (Kuder-Richardson #21).  I eventually managed to track down that formula in a statistics textbook:

“KR 21” = (K/(K-1)) * (1 – (M*(K-M))/(K * s^2))

K = # of items on the test

M = Mean Score on the test

s = standard deviation of the test scores based on the mean of the test scores

(page 147 of the linked textbook)

I used that to come up with the following:

Pizza Cutter 5

Notice that the numbers don’t match up that well with the original split-sample analysis.  However, see the PA level where a 0.5 reliability coefficient is reached?  It’s right around the low 300s, the same as Kincaid’s method.
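For reference, the KR-21 formula quoted above translates to Python fairly directly. This is a sketch under assumptions: each PA is treated as a test “item,” each player’s score is his times on base over exactly K PAs, and the population variance is used for s^2 (the textbook may intend the sample variance). The scores in the example are invented.

```python
from statistics import mean, pvariance

def kr21(k, mean_score, score_variance):
    """Kuder-Richardson #21: (K/(K-1)) * (1 - M*(K-M) / (K * s^2))."""
    return (k / (k - 1)) * (1 - (mean_score * (k - mean_score)) / (k * score_variance))

def kr21_from_scores(k, scores):
    """KR-21 from raw scores, e.g. times on base per player in k PAs.

    Assumes population variance; swap in statistics.variance if the
    sample variance is wanted instead.
    """
    return kr21(k, mean(scores), pvariance(scores))

# Hypothetical times-on-base totals for five players over 300 PA each
scores = [80, 90, 100, 110, 120]
print(kr21_from_scores(300, scores))
```

Note that the result goes negative whenever the observed variance of scores is smaller than what the formula expects from chance alone, which can happen at low PA levels.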

Now, the textbook I linked has, on the same page, an equation that you’re supposed to run a split-sample correlation through in order to come up with a proper reliability estimate:  2r/(1+r).  That’s called the Spearman-Brown formula.  So I converted all the points in the split sample using that, and here’s what I got:

Pizza Cutter 6

What do you know — 0.5 is crossed right around 300 PA.
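The Spearman-Brown conversion is a one-liner, sketched here in Python along with its inverse. The function names are made up for illustration; the formula itself is the 2r/(1+r) given above.

```python
def spearman_brown(r):
    """Full-sample reliability from a split-half correlation: 2r / (1 + r)."""
    return 2 * r / (1 + r)

def half_correlation_for(reliability):
    """Inverse: the split-half correlation that yields a given reliability."""
    return reliability / (2 - reliability)

for r in (0.5, 0.7):
    print(f"split-half r = {r}: reliability = {spearman_brown(r):.3f}")

print("split-half r needed for 0.5 reliability:", half_correlation_for(0.5))
```

In other words, the raw odd/even correlation understates the reliability of the full sample, because each half has only half the PAs.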


So, I think the evidence is on the side of Tango, Kincaid, et al. in the 0.7 vs. 0.5 debate.  I’m not a statistician or a Ph.D., though, so maybe I’m missing something.  Still, there’s nothing really magical about the 50-50 boundary of chance vs. skill.  You might prefer a level that has a higher skill requirement, in which case, 0.7 is fine.  Let me hear your thoughts on all this, though.

Textbook cited:

Interpreting Assessment Data: Statistical Techniques You Can Use

By Edwin P. Christmann, John L. Badgett

Steve is a robot created for the purpose of writing about baseball statistics. One day, he may become self-aware, and...attempt to make money or something?

Sandy Kazmir
9 years ago

“Part of that rise may have something to with batters getting more comfortable with more PAs, but I’m sure the main cause is the exclusion of pitchers, bench players, and other fill-ins from the higher-PA samples.”

So you’re using different populations throughout the study?