How Best To Predict the Season’s Second Half

June 26, 2018

By actual games played, we’re coming up on the halfway point of the season. By now, statisticians might continue to insist that certain data samples remain small, but they generally don’t feel small. We’ve had three months of baseball, everyday baseball, and three more months of everyday baseball remain. This is the middle of the summer routine, the routine where baseball stretches as far as the eye can see in every direction. It’s a comfortable time of year; games are becoming increasingly important, but there’s still ample time to correct mistakes.

We’re always trying to take stabs at the future. We know that baseball will play out however it likes, but we still attempt to guess what’s going to happen. This kind of thinking is what informs a lot of trade-deadline activity. So I’ve been thinking lately about second-half predictions. How can we know what’s going to happen in the standings? First things first: In a sense, we can’t. Baseball has an endless capacity to make the analysts look stupid. But it’s not like we don’t know anything. So I’m returning to a project similar to something I’ve done before.

The idea is to best predict second-half team winning percentage. And here I mean the actual second half, and not the so-called second half that comes after the All-Star break. Second-half winning percentage, then, is the dependent variable. I’m trying out three different independent variables: first-half winning percentage, first-half Pythagorean winning percentage, and second-half projected winning percentage. When I’ve done this before, I’ve used preseason team projections instead, but this time I’ve made use of the Wayback Machine, searching for our Playoff Odds page and getting as close to the beginning of July as possible. I don’t recall ever really testing these updated projections. They’re the ones based on Steamer, ZiPS, and our manually adjusted depth charts. I was able to go back to 2014, which gives us four seasons and a total of 120 team-seasons. Let’s play with some plots!
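For anyone who wants the definition behind one of those inputs: Pythagorean winning percentage estimates a team's record from runs scored and runs allowed. Here is a minimal Python sketch. The function name and the default exponent of 2 (the classic form) are assumptions for illustration; the article does not say which exponent its numbers use, and the run totals in the example are made up.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Expected winning percentage from runs scored and runs allowed.

    The exponent is an assumption: 2 is the classic value, and other
    values are in common use.
    """
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    if runs_scored == 0 and runs_allowed == 0:
        raise ValueError("need at least one run scored or allowed")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical half-season run totals, not taken from the article.
example = pythagorean_win_pct(400, 350)
print(f"Pythagorean winning percentage: {example:.3f}")
```

A team that scores exactly as many runs as it allows comes out at .500, whatever the exponent.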

First, here’s the relationship between first-half winning percentage and second-half winning percentage. The halves aren’t exactly even at 81 games per team on each side, but it’s close enough:

I don’t know what kind of relationship you were expecting. It’s obvious there is one — baseball isn’t totally random — but this isn’t one of those things that’s tightly and perfectly linear. The slope of the line, if you’re curious, is 0.55, which means for every point you go up on the x-axis, you go up about half a point on the y-axis. This is all best understood in context, so here’s the relationship between first-half Pythagorean winning percentage and second-half winning percentage:

Very similar deal. Very similar appearance. And that isn’t surprising, since winning percentage tends not to deviate from Pythagorean winning percentage by all that much. The slope of this line is 0.54. So there’s only one thing left: projected second-half winning percentage and actual second-half winning percentage.

The relationship is tighter. It’s still very far from perfect, but you see the higher R², and the slope of this line is 1.04, which is almost double the previous slopes. Based on this evidence, the projections contain more meaningful information, and I can drill that point home with a multivariate regression technique. If you’re a normal person, even just reading those words together was unpleasant, but this isn’t so complicated to understand. I was able to combine all three variables into a best fit. It yields an R² of 0.39. This is, again, in an effort to predict second-half winning percentage. The formula settles on the following weights:

First-half winning percentage: 0.14
First-half Pythagorean winning percentage: 0.12
Second-half projected winning percentage: 0.75
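For the curious, a fit like this can be sketched in a few lines of Python. This is not the code behind the numbers above. It is a minimal ordinary-least-squares example using NumPy, run on synthetic data generated purely for illustration. The helper name `fit_linear`, the noise levels, and the random seed are all assumptions.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept.

    Returns (intercept, coefficients, r_squared).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # allow a single predictor
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta[0], beta[1:], 1.0 - ss_res / ss_tot


# Synthetic stand-in for 120 team-seasons (NOT the article's data).
rng = np.random.default_rng(0)
n = 120
talent = rng.normal(0.500, 0.040, n)               # unobserved true level
first_half = talent + rng.normal(0, 0.050, n)      # noisy half-season record
pythag = talent + rng.normal(0, 0.045, n)          # noisy run-based record
projection = talent + rng.normal(0, 0.015, n)      # assumed tighter estimate
second_half = talent + rng.normal(0, 0.050, n)     # what we try to predict

X = np.column_stack([first_half, pythag, projection])
intercept, weights, r2 = fit_linear(X, second_half)
print("intercept:", round(float(intercept), 3))
print("weights (first half, Pythagorean, projection):", np.round(weights, 2))
print("R^2:", round(float(r2), 2))
```

Passing a single column to `fit_linear` gives the slope and R² for one of the single-variable plots instead.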

Compared to first-half winning percentage, the second-half projection is more than five times as important (0.75 against 0.14). Compared to first-half Pythagorean winning percentage, the second-half projection is more than six times as important (0.75 against 0.12). You might say this shouldn’t be surprising. And you’re right: this shouldn’t be surprising. Updated projections include the most up-to-date roster outlooks, and they also update based on early results. But this does put the significance of actual first halves in perspective. And remember, a strongly negative first half would point to midseason subtractions, and a strongly positive first half would point to midseason additions. That’s something the projections wouldn’t know anything about. Still, they perform the best.
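The comparison between inputs is just a ratio of the fitted weights. A short Python check, using only the three weights listed above (the dictionary keys are names chosen here for illustration):

```python
# Weights as reported in the article.
weights = {
    "first_half_win_pct": 0.14,
    "first_half_pythag_win_pct": 0.12,
    "second_half_projection": 0.75,
}

proj = weights["second_half_projection"]

# How many times larger the projection weight is than each first-half weight.
ratios = {
    name: proj / w
    for name, w in weights.items()
    if name != "second_half_projection"
}

# Each weight as a share of the total (the weights sum to about 1.01).
total = sum(weights.values())
shares = {name: w / total for name, w in weights.items()}

for name, ratio in ratios.items():
    print(f"projection vs {name}: {ratio:.2f}x")
for name, share in shares.items():
    print(f"{name}: {share:.0%} of total weight")
```

Put another way, the projection carries roughly three-quarters of the total weight, and the two first-half measures split the rest.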

For more information, I looked at the 20 teams with the biggest positive differences between first-half winning percentage and projected second-half winning percentage. The relevant numbers:

First-half winning percentage: .596 (average)
Second-half projected winning percentage: .511
Second-half actual winning percentage: .517

And on the other end, I looked at the 20 teams with the biggest negative differences between first-half winning percentage and projected second-half winning percentage. The relevant numbers:

First-half winning percentage: .417 (average)
Second-half projected winning percentage: .497
Second-half actual winning percentage: .483
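The two groupings above come from sorting teams by the gap between first-half winning percentage and projected second-half winning percentage, then averaging within the top and bottom 20. A minimal Python sketch of that procedure follows. The function name `gap_group_averages` is hypothetical and the data is synthetic, so the printed averages will not match the ones in the article.

```python
import numpy as np


def gap_group_averages(first_half, projected, actual, k=20):
    """Average each measure over the k teams with the biggest positive
    and biggest negative gap (first-half minus projected second-half)."""
    first_half = np.asarray(first_half, dtype=float)
    projected = np.asarray(projected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(first_half - projected)  # ascending by gap
    groups = {
        "biggest_positive_gap": order[-k:],
        "biggest_negative_gap": order[:k],
    }
    return {
        label: {
            "first_half": float(first_half[idx].mean()),
            "projected": float(projected[idx].mean()),
            "actual": float(actual[idx].mean()),
        }
        for label, idx in groups.items()
    }


# Synthetic stand-in for 120 team-seasons (NOT the article's data).
rng = np.random.default_rng(1)
talent = rng.normal(0.500, 0.040, 120)
first_half = talent + rng.normal(0, 0.050, 120)
projected = talent + rng.normal(0, 0.015, 120)
actual = talent + rng.normal(0, 0.050, 120)

for label, avgs in gap_group_averages(first_half, projected, actual).items():
    print(label, {name: round(value, 3) for name, value in avgs.items()})
```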

In the first case, you see a slight degree of overachieving relative to the projections, on average. In the second case, you see the opposite of that. That might just be noise, but there’s also the trade consideration, as mentioned earlier. In the first case, those are teams likely to try to get better, and in the second case, those are teams likely to try to get worse (in the short term). Projections don’t know anything about upcoming trades. That’s information that’s contained instead within the actual results. But, well, you can see where I’m going with this. If you want to know what’s going to happen in the second half, you need to understand it’s impossible to predict. But if you want to give yourself even a chance, you should place by far the most weight on the updated team projections. I know it’s tempting to believe in a team’s actual to-date results, but they don’t mean quite as much as you’d presume. The projections are able to keep a cooler head.

As far as this year is concerned, I don’t know how much this matters for, say, the Mariners — they look like a textbook overachiever, but there’s a huge separation between them and the closest wild-card competition. Those teams don’t look so strong, either. Maybe this is more relevant in the NL, where the Braves, Brewers, and Diamondbacks lead the Nationals, Cubs, and Dodgers, respectively. The rest of the way, the Nationals, Cubs, and Dodgers have by far the strongest projections. That doesn’t mean the first half doesn’t matter at all. That doesn’t mean there aren’t various reasons to think better, or to think worse. Just remember the projections don’t come out of nowhere. They will always tell you exactly what they believe.

14 Comments


Eltneg

7 years ago

Sometimes I feel bad because Jeff writes a really nice article and my main reaction is “ooh, look at that shiny outlier data point.” This is one of those times. I forgot how good Cleveland was in the second half last year.

channelclemente

Reply to Eltneg

Especially since, despite the correlations, nothing explains more than 20-35% of the variance. If you were betting on the other side of that coin flip, I’d double down.
