We’re Going Streaking

by Seth Samuels

January 20, 2011

We are happy to present a two-part guest post by Seth Samuels, who takes an in-depth look at a topic that is often a source of disagreement. Part two will run tomorrow.

Last summer, I was catching up with Fangraphs founder (and my elementary school classmate) David Appelman when he mentioned an interest in being able to identify streakiness in baseball players. Baseball announcers and writers are often criticized for psychoanalyzing a player’s current hot or cold streak, even though those streaks may often be a function of small sample sizes. A full season, however, is a much larger sample than five games. So it certainly seems reasonable that some players might tend to be streakier than others over the course of a full year.

As both a Mets fan and an occasional fantasy player, I’ve repeatedly seen my teams — both real and imaginary — bolstered by surges and short-circuited by cold spells. So being able to identify which players are most or least likely to go on streaks would be a useful tool.

I’m certainly not the first person to look at this question. Most notably, Jim Albert and Jay Bennett discussed the subject a bit in chapter five of their excellent book Curve Ball, using randomization to determine that Todd Zeile showed signs of having been a legitimately streaky hitter. They term this trait “streaky ability,” to distinguish it from “observed streakiness,” which they use to describe the small-sample mistakes referred to above. In Curve Ball, Albert and Bennett randomly simulate Zeile’s 1999 season and look at the fluctuations in his moving batting average over eight-game stretches in those simulated seasons. They then compare those fluctuations to the fluctuations from his actual 1999 season, and find that, in the first half of 1999, Zeile was streakier than simple randomness would suggest.

They argue that this indicates that Zeile has a great deal of “streaky ability.” However, Albert and Bennett are quick to point out that their study suffers from selection bias — they specifically chose Zeile because he had a reputation for streakiness. In 2008, Albert returned to the subject in a paper, “Streaky Hitting in Baseball,” in the Journal of Quantitative Analysis in Sports. In this paper, Albert looks at the streakiness of all players in 2005, using batting average, home run rate, and strikeout rate. However, he finds no relationship between streakiness in one category and streakiness in another. So what happens if we look for Albert and Bennett’s “streaky ability” using a more sophisticated metric and over a longer time period?

We can easily adapt the Albert and Bennett approach to all players, with a slight update to the methodology. In particular, I’ve chosen to use wOBA, rather than batting average, because, as most readers of this site are no doubt aware, it does a far better job of capturing a player’s actual value (and fluctuations thereof). For those who are not familiar with wOBA, it is a catch-all stat developed by Tom Tango, which measures a player’s overall contribution to run-scoring by placing extra weight on more valuable hit types, and which is scaled to look like on-base percentage. So a .400 wOBA is just as excellent as a .400 on-base percentage, and a .300 wOBA is just as poor as a .300 on-base percentage.
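For readers who want to see the shape of the calculation, here is a minimal sketch of wOBA in Python. The event weights below are illustrative values in the neighborhood of commonly published ones; they are an assumption, not necessarily the weights used for the numbers in this post, and the names WEIGHTS and woba are made up for this example.

```python
# Illustrative linear weights per plate-appearance outcome.
# ASSUMPTION: these are not taken from this post; real wOBA weights
# are recalibrated by season and include events omitted here.
WEIGHTS = {
    "out": 0.0,
    "walk": 0.72,
    "single": 0.90,
    "double": 1.24,
    "triple": 1.56,
    "home_run": 1.95,
}


def woba(counts):
    """Return weighted on-base average for a dict of outcome -> count."""
    plate_appearances = sum(counts.values())
    if plate_appearances == 0:
        return 0.0
    total = sum(WEIGHTS[event] * n for event, n in counts.items())
    return total / plate_appearances


# A hypothetical ten-plate-appearance stretch.
print(round(woba({"out": 6, "walk": 1, "single": 2, "home_run": 1}), 3))
```

The point is only that each outcome carries a fixed value and wOBA is the average of those values per plate appearance, which is what makes it easy to recompute over any window of dates.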

Being a Mets fan, I’ll use David Wright’s seemingly consistent 2007 and seemingly streaky 2010 to demonstrate the process. As was recently discussed by Bill Petti, Wright seems to have evolved from a very consistent hitter early in his career to a very volatile one more recently, although there is some evidence that this may have been a fluke. I’ll note here that, though Bill and I have somewhat similar approaches, we arrived at them independently — a nice example of synchronicity. My analysis is a bit more technically involved and is ultimately applied to a larger sample, but I’d encourage you to read Bill’s work too.

Using data from Retrosheet, I started by calculating Wright’s moving wOBA for every seven-day period during the 2007 season. This is plotted below in blue, with the red line representing Wright’s full-season wOBA:

For each point on the x axis, I then take the absolute value of the distance between the moving wOBA and the full-season wOBA. Next, I take the average of all of these distances for the whole season, weighted by the number of plate appearances in each seven-day window. The weighting serves two purposes: first, it helps to avoid placing too much emphasis on small samples — if a player comes to the plate thirty times in one window and only five in another, we care more about his performance in the first. Second, it means that if a player is injured and misses playing time, he will not be punished for his .000 wOBA during that time. Using dates instead of games or plate appearances also helps to account for injuries, as a player who goes on a hot streak and then misses a month with injury should not be viewed as continuing the same hot streak when he returns.
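The steps just described can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the exact code behind the numbers in this post: each plate appearance is represented as a (date, value) pair, where the value is the wOBA weight of its outcome, and a window is taken to be the seven calendar days ending on each day from the first game to the last. The function name raw_streakiness and that window convention are assumptions of this sketch.

```python
from datetime import date, timedelta


def raw_streakiness(pas, window_days=7):
    """PA-weighted mean absolute gap between moving and full-season wOBA.

    pas: list of (date, woba_value) tuples, one per plate appearance.
    ASSUMPTION: windows are the `window_days` calendar days ending on
    each day of the season; the post does not spell out this detail.
    """
    values = [v for _, v in pas]
    season_woba = sum(values) / len(values)
    first = min(d for d, _ in pas)
    last = max(d for d, _ in pas)

    weighted_gap = 0.0
    total_weight = 0
    end = first
    while end <= last:
        start = end - timedelta(days=window_days - 1)
        window = [v for d, v in pas if start <= d <= end]
        if window:  # windows with no plate appearances get zero weight
            window_woba = sum(window) / len(window)
            weighted_gap += len(window) * abs(window_woba - season_woba)
            total_weight += len(window)
        end += timedelta(days=1)
    return weighted_gap / total_weight


# Hypothetical toy season: two hits on one day, two outs weeks later.
toy = [
    (date(2010, 4, 1), 1.0),
    (date(2010, 4, 1), 1.0),
    (date(2010, 4, 20), 0.0),
    (date(2010, 4, 20), 0.0),
]
print(raw_streakiness(toy))
```

Because empty windows carry no weight, a stretch on the disabled list contributes nothing to the statistic, which matches the second purpose of the weighting.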

The resulting calculation is our raw streakiness statistic. In essence, this boils down to a weighted average area of the blue region in the plot below:

Wright’s raw streakiness in 2007 was .072. For streakier players, we would expect more frequent and extreme divergence from the full-season wOBA, and therefore a higher resulting streakiness statistic. As noted earlier, David Wright appears to have gotten streakier in recent years. Sure enough, in 2010 Wright’s raw streakiness came in at .101. Wright’s 2010 performance is plotted below:

As we can see, Wright’s performance peaked around the same place in both 2007 and 2010, maxing out at .656 in 2007 and .659 in 2010. However, because his full-season performance was so different in the two years, the peak represents a .292 deviation in 2010, compared with only .238 in 2007. Moreover, in 2007, Wright’s seven-day wOBA never dropped below .227. In 2010, it got as low as .113. Clearly, Wright was much streakier in 2010 than in 2007.

But just how streaky is Wright’s performance really? How does it compare to the rest of the league, for example? This question is more complicated than it may seem at first. The problem is that players with a greater range of results will tend to have greater variation in the value of their performance in general, and will therefore exhibit greater fluctuations. Wright will have his share of outs, singles, doubles, triples, and homers. Luis Castillo will rarely do anything other than single or make an out. So we need to make sure that we are not accusing Wright of streakiness just for being a better hitter than Luis Castillo. Therefore, in order to compare Wright’s streakiness to the rest of the league, we first need to compare it to random chance.

The trick is to borrow (and modify) an idea from Albert and Bennett. Let’s go back to David Wright’s 2010 season. Using the Retrosheet data mentioned earlier, we can randomly simulate Wright’s 2010 season many times over, and see how the universe of simulated David Wrights compares to the real one. While Albert and Bennett used a random simulation method in which the simulated season totals can differ from the real ones, I prefer something called permutation inference. In our simulations, Wright’s overall 2010 performance does not change. He still has exactly 661 plate appearances (excluding intentional walks), 60 walks, 98 singles, 36 doubles, 3 triples, and 29 home runs. The only thing that will change in our simulations is the order in which those things occurred.

We also assume that the dates remain constant. This will allow us to calculate simulated values for Wright’s seven-day wOBA, and compare his simulated streakiness to his actual result. So, for example, here are Wright’s first ten plate appearances of 2010, along with his first ten plate appearances in five different simulations:

Each of those results in the simulations corresponds to an actual plate appearance from Wright’s real season. So, for example, in a given simulation, Wright’s first plate appearance may be replaced by his 247th, his second may be replaced by his tenth, and so on, with his actual first and second plate appearances showing up later. There are 661! possible permutations of Wright’s 661 plate appearances. That’s well over 10 x 10^100 and far more than we could possibly enumerate. Fortunately, by randomly reordering Wright’s 2010 plate appearances a large number of times, we can closely approximate the actual distribution of possible streakiness scores that Wright could have posted. This allows us to figure out how streaky Wright was in 2010 compared to random chance.
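Here is a small Python sketch of one such reordering, along with a quick way to see how large 661! is. It assumes each plate appearance is stored as a (date, outcome) pair; the names permute_season and log10_factorial are made up for this illustration and are not from any particular library.

```python
import math
import random
from datetime import date


def permute_season(pas, rng):
    """Shuffle outcomes across plate appearances while keeping dates fixed.

    pas: list of (date, outcome) tuples in chronological order.
    Season totals are unchanged; only the order of outcomes differs.
    """
    dates = [d for d, _ in pas]
    outcomes = [o for _, o in pas]
    rng.shuffle(outcomes)  # in-place random permutation
    return list(zip(dates, outcomes))


def log10_factorial(n):
    """Base-10 logarithm of n!, computed via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(10)


# Hypothetical four-plate-appearance "season".
season = [
    (date(2010, 4, 5), "single"),
    (date(2010, 4, 5), "out"),
    (date(2010, 4, 7), "home_run"),
    (date(2010, 4, 7), "out"),
]
rng = random.Random(2010)  # seeded so the example is repeatable
print(permute_season(season, rng))
print(log10_factorial(661) > 101)  # 661! dwarfs 10 x 10^100
```

Repeating permute_season many times and scoring each result gives the distribution of streakiness values that the actual season is compared against.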

Let’s try simulating Wright’s 2010 season 10,000 times. It’s often easy to forget how streaky even pure randomness can be. So, just to give a sense of that, here’s the least streaky simulation of Wright’s 2010 season, in which he posts a raw streakiness of .053:

One might still consider that a pretty streaky ballplayer. That’s a particularly nasty cold streak in July, worse than any the real Wright endured in 2007. By contrast, here is the most extreme case, in which simulated David Wright’s raw streakiness is .126:

That’s a pretty volatile player, to say the least. Ultimately, however, we’re concerned about the distribution of simulations, not the extremes. In all, out of 10,000 simulations, Wright’s actual streakiness was more extreme than 9,312 of them, suggesting that he was streaky, even in comparison to the full range of possibilities. Here is the full distribution of possible streakiness values for Wright in 2010, with the red line representing his true result:

Our methodology finds that Wright’s 2010 season was streakier than 93.1% of possible seasons, given his performance. So we can assign Wright a true streakiness value of .931. Looking back at Wright’s consistent-looking 2007, we see that his true streakiness was .142. So, even after accounting for the possibility that Wright’s streakiness was largely a function of randomness, we find that he did indeed go from being very consistent in 2007 to extremely streaky in 2010.
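The last step is simple enough to show directly. The sketch below, with the hypothetical name true_streakiness, takes an actual raw streakiness score and a list of simulated scores and returns the share of simulations that came in lower. Counting only strictly lower simulations is an assumption of this sketch; ties could reasonably be handled differently.

```python
def true_streakiness(actual_raw, simulated_raw):
    """Fraction of simulated seasons less streaky than the actual one.

    ASSUMPTION: ties are not counted as "less streaky".
    """
    below = sum(1 for s in simulated_raw if s < actual_raw)
    return below / len(simulated_raw)


# Stand-in data shaped like the 2010 result: 9,312 of 10,000 simulated
# scores fall below the actual .101. The simulated values are placeholders.
sims = [0.080] * 9312 + [0.110] * 688
print(true_streakiness(0.101, sims))
```

A score near 1 means the real ordering of plate appearances was streakier than nearly every reshuffling of the same season, and a score near 0 means it was unusually steady.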

So, now that we have a framework for assessing streakiness, we can apply that framework across the game, and see what it tells us about the volatility of all players. The use of permutation inference also means that, should we choose, we can apply this same method with other statistics — contact rate, slugging percentage, and ERA (for pitchers), just to name a few. Tomorrow, I’ll apply this approach to the entire league, and see how David Wright’s streakiness compares with that of other players.

The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at “www.retrosheet.org”.
