How Can We Better Study Team Depth?

We know that Billy Beane has joked before that his stuff doesn’t work in the playoffs. And we know that, at least this time around, his Athletics team is built on depth and getting value out of the back end of the roster. These things seem to go hand in hand: your sixth starter and sixth infielder may mean a lot during the season, yet they may not even make your postseason roster. But can we study this more rigorously in an effort to estimate the true value of depth?

My first idea was to look at ‘slots’ on a team. Do wins at the top of your roster mean more than wins at the bottom? In other words, would WAR from your top five players be better correlated with team wins than WAR from your 21st through 25th men? Theoretically, it shouldn’t: runs are runs and wins are wins. But, with the help of Steve Staude, we can test that assertion.

Instead of running a separate one-to-one correlation for each group, we used multivariable regression. The inputs were team WAR (back to 1974) from three groups: slots one through five, the middle of the roster, and the last five. The equation that came out of the beep-boop-bop machine weighted the top five at 1.13 and the bottom five at .827! Success! Wins at the top of the team are worth over a third more than wins at the bottom!

Not quite. First, this weighted equation’s r-squared was .738, and if you just take all team WAR, the r-squared for the relationship of WAR to wins is .736. Weighting the top wins did not really improve the relationship of WAR to wins, in other words.
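For anyone who wants to poke at this comparison, here is a minimal Python sketch of the two fits: one regression on three slot sums and one on total WAR. The rosters and win totals are randomly generated placeholders, not the 1974-onward data, and the helper names (`slot_sums`, `solve`, `ols_r2`) and the 48-win replacement baseline are assumptions for illustration.

```python
import random

def slot_sums(player_wars, k=5):
    """Split one team's player WAR values into top-k, middle, and bottom-k sums."""
    ranked = sorted(player_wars, reverse=True)
    return [sum(ranked[:k]), sum(ranked[k:-k]), sum(ranked[-k:])]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols_r2(rows, y):
    """Least-squares fit of y on the columns of rows plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

# Invented data: 200 team-seasons of 25 players each (NOT real WAR values).
rng = random.Random(0)
teams, wins = [], []
for _ in range(200):
    roster = [rng.gauss(1.3, 1.8) for _ in range(25)]
    teams.append(slot_sums(roster))
    # Assumed relationship: roughly 48 replacement-level wins + team WAR + noise.
    wins.append(48 + sum(roster) + rng.gauss(0, 4))

slot_weights, r2_slots = ols_r2(teams, wins)
_, r2_total = ols_r2([[sum(t)] for t in teams], wins)
print("slot weights (top, middle, bottom):", [round(w, 3) for w in slot_weights[1:]])
print("r-squared, slots vs. total:", round(r2_slots, 3), round(r2_total, 3))
```

One design point worth noting: total WAR is just the three slot sums with equal weights, so the slot model can never fit worse in-sample than the total model. A tiny bump in r-squared from freeing up the weights is therefore not, by itself, evidence that the weights really differ.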

There’s another issue with the approach. Matt Swartz brought up the interesting problem that this method of attack is really only testing the model we’ve built. The model of WAR suggests that runs are runs and wins are wins, no matter where you find them. If we really found something here, it might say more about WAR than it would about the true value of depth.

And, really, one aspect of the original question was whether wins at the top of the roster mean more in the postseason than in the regular season. But postseason samples are ridiculously small. Could we, perhaps, bucket teams into “deep,” “normal,” and “shallow” teams by the skew of their respective WAR contributions, and then see how the teams in each of those buckets fare in the postseason? That’s a ton of work… and we’re still possibly testing the model instead of getting at the heart of the problem.
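The bucketing idea could start from something like the Python sketch below, which computes the skewness of a roster's WAR values and assigns a label. The function names and the cutoffs (`deep_below`, `shallow_above`) are hypothetical placeholders, not tested thresholds, and the two example rosters are invented.

```python
def war_skewness(player_wars):
    """Population skewness of a roster's WAR values (third moment / variance^1.5)."""
    n = len(player_wars)
    mean = sum(player_wars) / n
    m2 = sum((w - mean) ** 2 for w in player_wars) / n
    m3 = sum((w - mean) ** 3 for w in player_wars) / n
    if m2 == 0:
        return 0.0  # every player identical: no skew to measure
    return m3 / m2 ** 1.5

def depth_bucket(player_wars, deep_below=0.5, shallow_above=1.5):
    """Label a roster by skew; the cutoffs are arbitrary assumptions."""
    skew = war_skewness(player_wars)
    if skew < deep_below:
        return "deep"
    if skew > shallow_above:
        return "shallow"
    return "normal"

# Invented rosters: three stars plus filler vs. an evenly spread roster.
stars_and_scrubs = [8.0, 7.0, 6.0] + [0.2] * 22
evenly_spread = [1.0 + 0.1 * i for i in range(25)]

print(depth_bucket(stars_and_scrubs), depth_bucket(evenly_spread))
```

A long right tail (a few stars far above everyone else) pushes skewness up, which is why high skew maps to "shallow" here. Real cutoffs would have to come from the actual distribution of team skews.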

Moving away from WAR opens possibilities, but also other problems. Would a team with three guys with .350 wOBAs, three with .330 and three with .300 fare better than a team with nine .330 wOBA players? Here the problem is that the best players would get more plate appearances at the top of the lineup, and are therefore more valuable once these rates turn into counting stats. And you also lose the defensive side of the ball when you focus on wOBA, or wRC+. Not to mention pitching.
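The plate appearance effect in that example can be put in rough numbers. The Python sketch below weights each hitter's wOBA by an assumed plate appearance total for his lineup slot; the figure of 750 PA for the leadoff spot and 18 fewer for each slot after it is a rule-of-thumb assumption, not measured data, and `lineup_woba` is a hypothetical helper.

```python
# Assumed season plate appearances by lineup slot: 750 at leadoff, 18 fewer per slot.
PA_BY_SLOT = [750 - 18 * i for i in range(9)]

def lineup_woba(wobas):
    """PA-weighted team wOBA, assuming the best hitters bat highest in the order."""
    ordered = sorted(wobas, reverse=True)
    weighted = sum(w * pa for w, pa in zip(ordered, PA_BY_SLOT))
    return weighted / sum(PA_BY_SLOT)

mixed = [0.350] * 3 + [0.330] * 3 + [0.300] * 3
flat = [0.330] * 9

print("mixed, simple average:", round(sum(mixed) / 9, 4))
print("mixed, PA-weighted:   ", round(lineup_woba(mixed), 4))
print("flat,  PA-weighted:   ", round(lineup_woba(flat), 4))
```

Under these assumptions the mixed lineup's simple average of about .327 rises to about .328 once plate appearances are counted, which closes some of the gap to the .330 team without erasing it. Note that the two hypothetical teams do not have the same simple average to begin with, so the comparison is not quite apples to apples.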

Lastly, the team’s place on the win curve seems important to this discussion. Would overpaying for a backup middle infielder make more sense for a contender looking to shore up their depth? And would that player get fewer opportunities on a deep roster, and therefore have a lower WAR, and therefore make the team look less deep than it actually was?

Depth is a difficult concept to nail down in the numbers. But when you see a team like the Athletics sign Nick Punto when they already had Alberto Callaspo and Eric Sogard, or when you see the Nationals react to 2013 by acquiring one of the better fourth outfielders in baseball, you know that the back end of the roster deserves attention. Is there another way to construct our inquiry in order to better get at the true value of team depth?

With a phone full of pictures of pitchers' fingers, strange beers, and his two toddler sons, Eno Sarris can be found at the ballpark or a brewery most days. Read him here, writing about the A's or Giants at The Athletic, or about beer at October. Follow him on Twitter @enosarris if you can handle the sandwiches and inanity.


The r-squared may not have changed, but what about the standard error of the variables? Perhaps there is less variability in its predictive value if you weight the WAR ‘slots’ rather than just using WAR totals.

That would certainly be useful, if there is a difference.