The Enigma: My Journey Through Statistical Artifacts in Pursuit of Hot Streaks


A warning up top: This article is about seeking and not finding, about the unique ways that data can mislead you. The hero doesn’t win in the end – unless the hero is stochastic randomness and I’m the villain, but I don’t like that telling of the tale. It all started with an innocuous question: Can we tell which types of hitters are streaky?

I approached this question in an article about Michael Harris II’s rampage through July and August. I took a cursory look at it and set it aside for future investigation after not finding any obvious effects right away. To delve more deeply, I had to come up with a definition of streakiness to test, and so I set about doing so.

My chosen method was to look at 20-game stretches to determine hot and cold streaks, then look at performance in the following 20 games to see which types of players were more prone to “stay hot” or “stay cold.” I started throwing out definitions and samples: 2021-2024, minimum 400 plate appearances on the season as a whole, overlapping sampling (so check games 1-20 vs. 21-40, 2-21 vs. 22-41, and so on), wOBA as my relevant offensive statistic, 50 points of wOBA deviation against seasonal average to convey hot or cold, 40-PA minimum per 20-game set to avoid weird pinch-hitting anomalies, throw out games with no plate appearances to skip defensive replacements — the list goes on and on.
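To make the setup concrete, here is a minimal Python sketch of the windowing scheme. The function name and simplifications are mine, not the actual program; PA-weighting, the 40-PA floor, and the zero-PA game filter are omitted for brevity.

```python
def classify_windows(woba_by_game, season_woba, window=20, threshold=0.050):
    """Label each overlapping `window`-game stretch Hot, Cold, or Neutral
    by how far its average wOBA deviates from the player's seasonal mark.
    Simplified: treats every game equally rather than weighting by PA."""
    labels = []
    for start in range(len(woba_by_game) - window + 1):
        stretch = woba_by_game[start:start + window]
        avg = sum(stretch) / window
        if avg >= season_woba + threshold:
            labels.append("Hot")
        elif avg <= season_woba - threshold:
            labels.append("Cold")
        else:
            labels.append("Neutral")
    return labels
```

With overlapping sampling, a 25-game season yields six windows (games 1-20 through 6-25), each labeled independently.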

I grabbed a bunch of secondary markers that I could use to guess which skills might make players more or less prone to streaky performance: swing rate, contact rate, ISO, overall offensive performance, strikeout rate, walk rate, chase rate… if you’re going to throw spaghetti at the wall, make sure to grab a lot of spaghetti. I wasn’t sure which statistic would be most promising for my investigation, so I erred on the side of too many.

First, though, I had to see how streaky the league was as a whole. My computer program sliced the data up as per my specifications – 127,216 pairs of 20-game before and after sets – and looked for how often the league as a whole was hot and cold. My results were eye-opening:

Do Hot Streaks Persist? Take One
State Probability
P(Hot) 18.2%
P(Cold) 17.3%
P(Neutral) 64.5%
P(Hot|Hot) 8.9%
P(Cold|Cold) 9.6%
P(Hot|Cold) 25.9%
P(Cold|Hot) 24.3%
Note: Not the final conclusion of this article. Don’t use this table to show that baseball players are anti-streaky.

If you’re not used to the conventions there, let me walk through them quickly. P(Hot) is the probability of a random 20-game stretch being a “hot streak,” defined as 50 or more points of wOBA above a hitter’s seasonal average. P(Cold) is the probability of being cold, naturally, measured as 50 or more points of wOBA below a hitter’s seasonal average. They’re both around 20%, which checks out generally; they’re similar and of reasonable magnitude. P(Hot|Hot), or probability of hot given hot, is the probability of being hot for the next 20 games contingent on having been hot the previous 20 games. It’s half as high as P(Hot), though. That’s startling.
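The bookkeeping behind a table like this can be sketched in a few lines. This is a toy version, assuming we already have one Hot/Cold/Neutral label per overlapping window; pairing each window with the one starting `window` games later guarantees the before and after stretches share no games.

```python
from collections import Counter

def transition_probs(labels, window=20):
    """Estimate P(state) and P(next state | current state) from a sequence
    of overlapping-window labels. labels[i] covers games i..i+window-1, so
    labels[i + window] is the first window that starts after it ends."""
    base = Counter(labels)
    pairs = Counter(zip(labels, labels[window:]))
    p_state = {s: n / len(labels) for s, n in base.items()}
    # Total transitions out of each starting state, for the denominators.
    from_totals = Counter(a for a, _ in pairs.elements())
    p_next = {(a, b): n / from_totals[a] for (a, b), n in pairs.items()}
    return p_state, p_next
```

Feeding in 127,216 window pairs, grouped this way, is all it takes to produce the table above.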

In plain English, that would mean that hitters who are currently hot are meaningfully less likely to be hot in the future than a random hitter. The same is true for P(Cold|Cold); if you’re cold, the data seem to imply, you’re unlikely to remain cold. P(Hot|Cold), the odds that a player currently cold will break out with a huge 20-game set, came in at 26%, much higher than the probability of a random player peeling off a hot streak. Did I just find some kind of heretofore unknown effect?

I started going through the literature. The Book used a different methodology to conclude that there was evidence for marginal but real hot streak persistence. Rob Arthur and Greg Matthews found evidence of pitcher hot streaks, as measured by fastball velocity. All the way back in 1993, S. Christian Albright was finding limited evidence of streakiness in an article published in the Journal of the American Statistical Association. Brett Green and Jeffrey Zwiebel researched a variety of hot-hand hypotheses for the Sloan Analytics Conference in 2016. Both FanGraphs and Baseball Prospectus have myriad articles about the topic. Google “Russell Carleton hot streak” and you can read for days. But no one shared my strongly-mean-reverting conclusion.

If you’ve done a lot of statistical research, you know what I concluded: I done screwed up. I wasn’t sure where exactly, but I knew that I wasn’t going to just accept this result at face value. Maybe the league has changed, or maybe batter performance works differently in the days of high-tech pitching machines, extensive advanced scouting, and tailored plans of attack against every hitter. It was at least plausible that new advancements led to a change in behavior – but it certainly wasn’t likely. I started examining my methods looking for the problem.

The first thing I did was something that you should always do if you’re designing studies like this: I tested my parameters. I didn’t think this was the likely cause, but it’s good practice to check. Twenty-game windows might be weird, so I tried a variety of other lengths. Fifty points of wOBA is arbitrary, so I tried some other increments, as well as some percentage-based definitions of hot and cold. I tried non-overlapping samples, so 1-20 vs. 21-40 and then 21-40 vs. 41-60, instead of sampling different parts of the same stretch repeatedly. I lowered the plate appearance minimum. I raised the plate appearance minimum. I didn’t expect any of this to change the takeaway, and none of it did, but it’s always good to cover your bases.

I spotted the problem with the study fairly quickly, in fact, and I’m curious if you did too. Recall, if you will, that I looked at hitters who put up 20-game stretches much better (or worse) than their overall performance for a year, then looked at the subsequent 20 games. There’s a problem here: the games we’re looking at count as part of a hitter’s overall line. If you had a stretch of a .400 wOBA for 20 games and still had a wOBA of less than .350 overall, that means your performance in the non-streak games had to be meaningfully worse than .350. Thus, our sample of players on hot streaks is disproportionately full of players who were worse the rest of the year, which helped their streaks register as “hot.”

Another way of thinking about it is to imagine the population of players who put up a great wOBA for 20 games, and then a great wOBA for the next 20 games. That’s 40 games out of their season with a very high wOBA; they’d have to be pretty bad the rest of the way to end up with a low seasonal wOBA. That player is less likely to have their first 20 games counted as a hot streak – by playing well in the next 20 games, they’re raising their seasonal wOBA by quite a bit. The same is true in reverse for cold streaks.
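You can watch this bias appear from nothing in a simulation. In this sketch (every number invented for illustration), each simulated player has identical, unchanging talent and games are independent, so there is no true streakiness at all. Yet because the season average includes the streak games themselves, hot windows look strongly anti-persistent.

```python
import random

random.seed(1)
WINDOW, GAMES, THRESH, N = 20, 100, 0.050, 5000

hot_first = hot_second = hot_both = 0
for _ in range(N):
    # Fixed talent, independent games: no real streaks by construction.
    games = [random.gauss(0.320, 0.150) for _ in range(GAMES)]
    season_avg = sum(games) / GAMES  # contaminated: includes the windows below
    w1 = sum(games[:WINDOW]) / WINDOW
    w2 = sum(games[WINDOW:2 * WINDOW]) / WINDOW
    hot_first += w1 >= season_avg + THRESH
    hot_second += w2 >= season_avg + THRESH
    hot_both += (w1 >= season_avg + THRESH) and (w2 >= season_avg + THRESH)

p_hot = hot_second / N
p_hot_given_hot = hot_both / hot_first
# P(Hot|Hot) comes out well below P(Hot): pure sampling artifact.
```

A hot first window mechanically drags the season average up, which raises the bar the second window has to clear, and vice versa. That is the whole "effect."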

This cross-contamination issue made the data look like it had strong mean reversion, but it was just a sampling problem all along. To fix it, I had to tell the computer that it couldn’t look to the future. I had a lot of data, so I made use of it. I only analyzed streaks where the player had at least 400 PA in the previous calendar year, and used their wOBA over that period as their expected wOBA to determine whether they were hot, cold or neutral. In this way, I was now asking the right question, framed the right way: Based on what we know about a player today, if they get hot in the next 20 games, what should we expect for the 20 games after that?
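A minimal sketch of the corrected labeling, with hypothetical inputs standing in for the real prior-year data:

```python
def label_vs_prior_year(stretch_woba, prior_pa, prior_woba,
                        min_pa=400, threshold=0.050):
    """Classify a 20-game stretch against the player's prior-calendar-year
    wOBA, a baseline built only from PAST data, so the label can't peek at
    games still to come. Players short of the PA minimum are excluded."""
    if prior_pa < min_pa:
        return None
    if stretch_woba >= prior_woba + threshold:
        return "Hot"
    if stretch_woba <= prior_woba - threshold:
        return "Cold"
    return "Neutral"
```

The only change from the broken version is the baseline, but it is the change that matters: nothing on the right-hand side of the comparison depends on games that haven’t happened yet.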

The answer? Basically the same thing that everyone else found. There’s evidence of moderate persistence of both hot and cold streaks:

Do Hot Streaks Persist? Take Two
State Probability
P(Hot) 21.7%
P(Cold) 20.3%
P(Neutral) 58.0%
P(Hot|Hot) 24.8%
P(Cold|Cold) 24.3%
P(Hot|Cold) 14.5%
P(Cold|Hot) 13.4%

This is what I expected to see all along. When a player is currently on a hot streak, their odds of being on a hot streak over the next 20 games are higher than for a random baseline. Likewise, players currently on cold streaks are more likely to be cold in the next 20 games than a random player. As you’d expect, both P(Hot|Cold) and P(Cold|Hot) are low, roughly 14% for each. In other words, when a hitter is seeing the ball well, they’re slightly more likely to be hot in the near future than they would be otherwise.

The effect isn’t enormous. I don’t think it’s enough to change our projections or anything; it decays quickly and isn’t a huge reading anyway. When I measured it in expected points of wOBA, it came out to around three points of increased expected value for the “hot” hitters relative to “neutral” ones. That’s a textbook definition of a real but minimal effect.

As before, I did a ton of parameter checking to make sure that I didn’t cherry-pick a result with my exact selection of constraints. I tried more and fewer games, tried non-overlapping samples, defined hot and cold streaks differently, and so on. The conclusions were robust to all of these changes. In other words, streak stickiness appears to be real but small.

A side note on my data journey here: I used an AI coding assistant to help me slice up the data. It’s dextrous like a surgeon; I tell it the cuts I want to make and the experimental method, and it turns that into a computer program with enviable speed and accuracy. The problem is that it doesn’t know if the method is good or whether the results make sense. It was happy to tell me I’d found a novel property of baseball instead of the truth that I had designed my definitions poorly. It explained the sigmoid function to me flawlessly, calculated Cronbach’s Alpha in milliseconds, and didn’t realize that double-counting and looking ahead to the future in my methodology would result in skewed results. AI can be powerful, but be careful out there!

That problem-solving and parameter-setting took me a few days of work on the side while I worked on my regular stream of articles. From there, I thought I had it made. I’d just plug in my new definition of streakiness, tag hitters in a variety of ways, and see which of these tags were most correlated with higher or lower streakiness. But here, we run into a problem: nothing worked.

More specifically, I fed power, plate discipline, and overall offensive result numbers into a variety of multivariate regressions, with hot streak stickiness as the variable I was trying to predict. The results were just a long string of “not significant” outcomes, more or less.

A few things came close to showing meaningful effects. Batters with high walk rates were slightly more likely to have their hot streaks persist, though not at a statistically significant level. Batters who hit for a lot of power were less likely to have cold streaks persist, again short of statistical significance but close to it. My interpretation is that high-power batters have plenty of cold streaks simply because home runs are central to their overall output and happen rarely, but that those streaks carry less signal because they stem largely from the random distribution of high-value hits. High swing rates were associated with more streakiness in general, which makes sense to me. Each of these effects had a p-value between 0.05 and 0.10; they’re marginally significant, but with small magnitudes of effect.
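For a flavor of the kind of significance check involved (not the actual regressions I ran, just a toy comparison), here is a two-proportion z-test asking whether streak persistence differs between two groups of hitters, say high-walk-rate versus low-walk-rate. All counts below are hypothetical.

```python
from statistics import NormalDist

def two_proportion_z(hot_a, n_a, hot_b, n_b):
    """Two-sided z-test: do groups A and B have different rates of staying
    hot? Returns the z statistic and its p-value under a pooled-variance
    normal approximation."""
    p_a, p_b = hot_a / n_a, hot_b / n_b
    pooled = (hot_a + hot_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical: 60 of 200 high-walk hot streaks persisted vs. 50 of 200.
z, p = two_proportion_z(60, 200, 50, 200)
```

A five-percentage-point gap on samples this size lands nowhere near significance, which is roughly the story the real markers told: directionally interesting, statistically underwhelming.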

I went a little deeper by having the computer try to guess whether a player’s next 20 games would be hot or not based on all the data it had available at the time, both seasonal aggregates and how a player was performing recently. Maybe hot players with particular combinations of skills are more likely to stay hot. Unfortunately, though, the computer looked at the seasonal aggregates and discarded them. The only thing it found predictive was whether the player was currently hot, the same conclusion we had right up at the start.

So that’s where we stand at the end of this, without much clue about what makes a player streaky or un-streaky. The hot hand appears to be real, just like most previous studies have concluded. The effects are small, again in keeping with prior findings, though I think my results are marginally different than the previous findings, perhaps reflecting the changing major league environment. Finally, if you’re looking for a player who is more or less likely to exhibit hot-and-cold hitting in the future, I don’t have any obvious markers to point out to you. I even asked whether hitters who were streakier than normal in year one continued to be streakier than normal in year two. In aggregate, they weren’t.

None of this means that a particular hitter can’t break the broad tendencies. At the population level, though, I’m satisfied that I can’t identify the hitters most likely to run hotter and colder than average in advance. Everyone gets hot, and hitters who are hot tend (very slightly) to keep hitting. I just can’t take that broad finding and apply it specifically to one type of hitter relative to another. Sometimes number crunching is about the journey rather than the destination. I certainly think that this study was.





Ben is a writer at FanGraphs. He can be found on Bluesky @benclemens.

Comments
pdunes:
Isn’t this just calibrating the true talent level of a player? If hot streaks or cold streaks are defined as luck/unluckiness, sustained streaks are typically indicative of an underlying fundamental change, which is that the true talent is different than originally assumed.

shandykoufax (replying to pdunes):
I don’t necessarily think that streakiness is defined as luck. There is evidence here (and in other sports) that getting hot is not just positive sequencing, but that athletes can be “locked in” and perform at a level above their true talent level.
While there are certainly 40-game stretches in the Hot|Hot category that are meaningful, sticky improvements, I think the magnitude signals that a significant portion are simply hot hitters staying hot (6-7% of major league hitters meaningfully leveling their game up or down every 40 games does not seem likely to me). I am likely missing other arguments for or against your hypothesis, so feel free to tell me I’m wrong.
Obviously, if we accept the existence of non-luck-based hot streaks, then we have to bake the ability to go on those streaks into talent level, which gets tangled up quickly.