# Modeling Salary Arbitration: Stat Components

This post is part of an ongoing arbitration research project and is coauthored by Alex Chamberlain and Sean Dolinar.

Feb. 25: 2015 MLB Arbitration Visualized

* * *

A couple of weeks ago, we introduced a couple of regressions that modeled arbitration results using a basic formula predicated on wins above replacement (WAR). Ultimately, the models estimated that an arbitration-eligible pitcher could expect his salary to increase by 14 percent, and his raise in salary to increase by 56 percent, for each additional WAR. A hitter could expect increases of 13 percent and 46 percent, respectively.

The models, however, were incomplete: they did not incorporate any other stats aside from WAR. This was by design, as we wanted to introduce simple one-variable equations for the sake of demonstration. WAR is, conveniently, a comprehensive variable that attempts to summarize a player’s worth in one easily digestible number. But what about the effects of a player’s age or arbitration year?

Moreover, the r-squared statistic (a quick-and-easy check of a model’s goodness of fit) for each specification was not especially strong, clocking in anywhere between .30 and .56. This is partly a result of specifying only one explanatory variable, so including more variables, which we have done in this post, should improve the goodness of fit of the models, assuming the variables are relevant.
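To make the goodness-of-fit point concrete, here is a minimal sketch of how r-squared changes when a relevant second variable is added to a one-variable OLS fit. The data are synthetic and generated only for illustration; the helper `r_squared` and the variable names are our own assumptions, not the code behind the models in this post.

```python
import numpy as np

def r_squared(X, y):
    """Fit OLS with an intercept and return the in-sample R-squared."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic (made-up) players: a composite stat and a glory stat.
rng = np.random.default_rng(0)
n = 200
war = rng.normal(2.0, 1.5, n)
saves = rng.poisson(5, n).astype(float)
raise_amount = 250_000 * war + 50_000 * saves + rng.normal(0, 300_000, n)

r2_one = r_squared(war, raise_amount)                             # WAR only
r2_two = r_squared(np.column_stack([war, saves]), raise_amount)   # WAR + saves

print(f"one variable:  {r2_one:.3f}")
print(f"two variables: {r2_two:.3f}")
```

In-sample r-squared never decreases when a variable is added to a nested OLS model, which is why the relevance caveat matters: a higher number alone does not prove the new variable belongs.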

With that said, we have new-and-improved models to share with you: one composed of composite statistics and another composed of traditional statistics. They are all vanilla, linear ordinary least squares (OLS) regression models, and it is important to remember that the value estimated for each stat is meaningful only in the context of its own model.

For each player, we specify…

• a composite statistic, such as wins above replacement (WAR) for batters and RA9-WAR for pitchers, to measure overall performance (RA9-WAR uses runs allowed per nine innings rather than FIP);
• a service statistic, such as plate appearances (PA) and innings pitched (IP), to measure playing time;
• a “glory” statistic, such as home runs (HR) and saves (SV), to account for baseball’s affinity for traditional statistics and social constructs;
• arbitration year (for pitchers*), indicating a player’s total service time;
• and his age (for hitters*), to measure as best we can the number of years for which he has inhabited the earth.

We identify these particular stats not only to cover as much analytical ground as possible but also to minimize the use of stats that correlate highly with one another (multicollinearity). We want to isolate different aspects of player performance or value as best we can.
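One simple way to screen for that kind of overlap is to look at pairwise correlations before choosing variables. The sketch below uses synthetic numbers, and the 0.8 cutoff is an arbitrary assumption for illustration rather than a rule used in this research.

```python
import numpy as np

# Synthetic (made-up) pitcher stats: strikeouts are built to track innings,
# while saves are generated independently of both.
rng = np.random.default_rng(1)
n = 300
ip = rng.uniform(20, 220, n)
strikeouts = 0.9 * ip + rng.normal(0, 15, n)
saves = rng.poisson(3, n).astype(float)

names = ["IP", "SO", "SV"]
stats = np.column_stack([ip, strikeouts, saves])
corr = np.corrcoef(stats, rowvar=False)   # 3x3 correlation matrix

# Flag any pair whose absolute correlation exceeds the (assumed) cutoff.
threshold = 0.8
flagged = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(flagged)
```

Here only the IP and SO pair is flagged, which is the sort of result that would argue for keeping one of the two rather than both.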

We originally used a model that specified a lagged dependent variable, i.e. a player’s salary as a function of his prior-year salary. The incorporation of the lagged salary variable introduces bias into a regression. We circumvent the issue by proceeding with only models that specify a player’s raise in salary in relation to his previous year’s performance. Ultimately, we avoid the bias and more accurately estimate the influence of the stats.
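Putting the pieces together, the composite-statistic models can be written roughly as follows. The coefficient symbols are our own notation, and the exact variable list is an assumption based on the stats named in this post (for example, arbitration year is shown as a single numeric term, which may not match the actual specification).

$$
\begin{aligned}
\text{Raise}_{\text{pitcher}} &= \beta_0 + \beta_1\,\text{RA9-WAR} + \beta_2\,\text{IP} + \beta_3\,\text{SV} + \beta_4\,\text{ArbYear} + \varepsilon \\
\text{Raise}_{\text{hitter}} &= \gamma_0 + \gamma_1\,\text{WAR} + \gamma_2\,\text{PA} + \gamma_3\,\text{HR} + \gamma_4\,\text{Age} + \varepsilon
\end{aligned}
$$

In both cases the dependent variable is the raise in salary, not the salary itself, and every stat on the right-hand side comes from the player's previous season.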

*See NOTES section below for logic behind using arb year for pitchers and age for hitters.

### Pitchers

Let’s start with the constant term. It’s negative, which occurs because of our inclusion of innings pitched (IP) in the equation: the model accounts for a certain minimum playing time (about 23 IP, worth about \$11,000 each) a pitcher would presumably accrue before going through the arbitration process.

Each additional RA9-WAR a pitcher accumulates, regardless of how many innings he throws, is worth about \$250,000. On top of this, every save is worth about \$50,000 — an unfortunate side effect for most young relievers, especially those on teams that employ veterans with, ah, experience in the closer’s role. This also provides a rational reason why managers still adhere to the traditional closer role: pitchers expect to be given a chance to earn their payday.
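As a rough worked example, the rounded figures quoted above can be plugged into a small function. This is a sketch under stated assumptions: the constant is back-calculated from "about 23 IP at about \$11,000 each," and the arbitration-year effect is not quoted in this post, so it is left as a caller-supplied placeholder that defaults to zero.

```python
# Rounded values quoted in the text; not the exact fitted coefficients.
CONSTANT = -253_000      # assumption: offsets about 23 IP at 11,000 each
PER_IP = 11_000
PER_RA9_WAR = 250_000
PER_SAVE = 50_000

def pitcher_raise(ip, ra9_war, saves, arb_year_effect=0.0):
    """Approximate arbitration raise for a pitcher, in dollars.

    arb_year_effect is a hypothetical placeholder for the arbitration-year
    term, whose value is not given in the post.
    """
    return (CONSTANT
            + PER_IP * ip
            + PER_RA9_WAR * ra9_war
            + PER_SAVE * saves
            + arb_year_effect)

# A hypothetical closer: 60 IP, 1.0 RA9-WAR, 30 saves.
print(pitcher_raise(60, 1.0, 30))
```

For this made-up closer, the 30 saves account for \$1.5 million of the estimate, far more than the innings or the RA9-WAR, which illustrates why the closer's role matters so much at the arbitration table.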

### Batters

This time, the constant term is positive and rather large. This occurs because we include age in the model, making the de facto baseline 24 years old. We think age, which causes a decrease in salary of about \$72,000 for each year, captures some effects the model may otherwise ignore. A hitter who is older during his arbitration years — a journeyman, if you will — generally has a lower talent ceiling than a younger player, so an older player is being penalized not for his age but for his inherent skill.

Meanwhile, each win above replacement (WAR) is worth roughly \$200,000. Every home run is worth almost an additional \$50,000. It’s merely a coincidence that both glory stats (saves and home runs) carry roughly the same value.
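Because the constant and the playing-time coefficient for hitters are not quoted here, the safest use of these numbers is to compare two hitters with equal playing time. The sketch below does only that, using the rounded values from the text; the function name and the example inputs are hypothetical.

```python
# Rounded values quoted in the text; not the exact fitted coefficients.
PER_WAR = 200_000
PER_HR = 50_000
PER_YEAR_OF_AGE = -72_000

def hitter_raise_gap(d_war, d_hr, d_age):
    """Estimated difference in raise between two hitters with equal
    playing time, given their differences in WAR, home runs and age."""
    return PER_WAR * d_war + PER_HR * d_hr + PER_YEAR_OF_AGE * d_age

# Hypothetical comparison: one more WAR, five more HR, two years older.
print(hitter_raise_gap(1.0, 5, 2))
```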

Arbitration, at its heart, seems to be a very simple process. The Collective Bargaining Agreement outlines fairly simple criteria for a player’s compensation via arbitration, starting on page 20.

Thus, perhaps arbitration models ought to be as simple as the arbitration process itself. We specified an alternative (and, to some, perhaps more intuitive) model focusing strictly on traditional statistics such as runs batted in (RBI) and batting average for hitters and wins and strikeouts for pitchers. Before the dawn of sabermetrics, teams and players’ agents had to negotiate a salary somehow. The stats on the backs of baseball cards likely prevailed.

It’s worth noting, however, that the equations depicted below are likely less reliable than those for the models that use composite statistics: many of the traditional statistics correlate with one another to some degree, rendering the coefficient estimates less efficient (noisier). Still, the models account for almost as much of the variance as do the models using composite statistics. This intersection of description and inefficiency indicates that, if arbitration panels once relied (and, perhaps, still rely) on traditional statistics, the process may be a bit flawed.
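One common way to quantify how much correlated predictors inflate the noise in coefficient estimates is the variance inflation factor (VIF). The sketch below computes it from scratch on synthetic data; it illustrates the general idea and is not a diagnostic taken from this research.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    Each column is regressed on the others (with an intercept);
    VIF = 1 / (1 - R^2) of that auxiliary regression.
    """
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])
        coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coefs
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic (made-up) hitters: RBI is built to track HR, AVG is independent.
rng = np.random.default_rng(2)
n = 300
hr = rng.poisson(15, n).astype(float)
rbi = 3.0 * hr + rng.normal(0, 5, n)
avg = rng.normal(0.260, 0.025, n)

vifs = vif(np.column_stack([hr, rbi, avg]))
print(dict(zip(["HR", "RBI", "AVG"], np.round(vifs, 2))))
```

A VIF near 1 means a stat shares little information with the others; larger values mean the model has a harder time separating that stat's effect from its neighbors'.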

We don’t want to burden you with too much information, but for a visual depiction of how explanatory variables correlate with one another, please click here for pitchers and here for batters.

### Pitchers

Note that a save is still worth about \$50,000. Every win is worth about \$60,000, and strikeouts are worth about \$3,300 apiece.

Innings pitched (IP) and earned run average (ERA) are a little more technical: the values for IP and the interaction term, IPxERA, indicate that each additional IP increases the raise as long as the pitcher notches an ERA better than 7.19. So, basically, as long as a pitcher is not miserably bad, he will be compensated for his service. The IPxERA, which is simply IP multiplied by ERA, effectively rewards pitchers who have a low ERA and pitch a lot of innings.
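The 7.19 break-even point falls out of the two coefficients directly. Using our own notation (the symbols are assumptions, not labels from the model output), the part of the raise that depends on innings is

$$
\beta_{IP}\,\text{IP} + \beta_{IP \times ERA}\,(\text{IP} \times \text{ERA}) = \text{IP}\,\bigl(\beta_{IP} + \beta_{IP \times ERA}\,\text{ERA}\bigr),
$$

so with $\beta_{IP} > 0$ and $\beta_{IP \times ERA} < 0$, an extra inning adds to the raise only when

$$
\beta_{IP} + \beta_{IP \times ERA}\,\text{ERA} > 0
\quad\Longleftrightarrow\quad
\text{ERA} < -\frac{\beta_{IP}}{\beta_{IP \times ERA}} \approx 7.19 .
$$

The lower the ERA, the larger the per-inning payoff, which is how the interaction term rewards pitchers who are both effective and durable.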

### Batters

A home run (HR) is worth a little less this time around: about \$46,000, or about \$4,000 less than in the composite statistics model. This can likely be attributed to the high degree of correlation between HR and RBI, a problem we mentioned earlier. The model reduces the value of HRs to avoid double-counting the effect of home runs. If you consider that each HR produces at least one RBI, then each HR is worth no less than \$56,000.
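The arithmetic behind that floor can be sketched in a few lines. The per-RBI value below is not quoted directly in this post; it is an assumption back-calculated from the two figures above (\$56,000 minus \$46,000).

```python
HR_VALUE = 46_000    # quoted: value of a home run in the traditional model
RBI_VALUE = 10_000   # assumption: implied by 56,000 - 46,000

def home_run_value(runs_driven_in=1):
    """Combined value of a home run and the RBI it produces."""
    if not 1 <= runs_driven_in <= 4:
        raise ValueError("a home run drives in between 1 and 4 runs")
    return HR_VALUE + RBI_VALUE * runs_driven_in

print(home_run_value(1))   # solo shot
print(home_run_value(4))   # grand slam
```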

We introduce another interaction term in order to penalize a player for hitless plate appearances (PA). With that said, the coefficients for PA and the interaction term, PAxAVG (PA multiplied by AVG), indicate that each additional PA will increase the arbitration raise as long as the batter hits better than .178. So, again, as long as a hitter is not miserably bad, he will be compensated for his service.
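The break-even average is simply the ratio of the two coefficients. The sketch below shows the calculation with placeholder coefficients chosen only so that the ratio reproduces the .178 quoted above; the actual fitted values are not given in this post, and the function names are our own.

```python
def break_even(main_coef, interaction_coef):
    """Rate at which the marginal value of one more PA is zero."""
    if interaction_coef == 0:
        raise ValueError("interaction coefficient must be non-zero")
    return -main_coef / interaction_coef

def marginal_value(main_coef, interaction_coef, rate):
    """Change in raise from one additional PA at a given batting average."""
    return main_coef + interaction_coef * rate

# Hypothetical coefficients (assumptions for illustration only).
PA_COEF = -1_780.0        # penalty per plate appearance
PA_X_AVG_COEF = 10_000.0  # reward per plate appearance, scaled by AVG

print(break_even(PA_COEF, PA_X_AVG_COEF))
print(marginal_value(PA_COEF, PA_X_AVG_COEF, 0.250))  # positive: above break-even
print(marginal_value(PA_COEF, PA_X_AVG_COEF, 0.150))  # negative: below break-even
```

The same ratio applies to any main effect paired with an interaction term, so the function works equally well for other break-even points of this kind.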

## Salary Results

Finally, after spending so much time trying to explain the raise players get through arbitration, we added the raise to their previous year’s salary:

Salary = Arbitration Raise + Previous Yr Salary

Below is a plot of the predicted salaries from the equation above versus the actual salaries. The model predicts the salaries rather well. However, we must caution that this model was built for inference and understanding of the arbitration process and not for making predictions for the upcoming offseason.
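For readers who want to reproduce the comparison numerically, here is a minimal sketch of the salary equation and a simple error summary. The salaries and raises are made-up numbers for illustration, not values from the arbitration dataset, and mean absolute error is our choice of summary rather than a metric reported in this post.

```python
def predicted_salary(previous_salary, predicted_raise):
    """Salary = Arbitration Raise + Previous Yr Salary."""
    return previous_salary + predicted_raise

# Hypothetical players (all figures in dollars, invented for illustration).
previous = [500_000, 2_000_000, 4_000_000]
raises = [1_000_000, 1_500_000, 2_000_000]    # model-predicted raises
actual = [1_400_000, 3_600_000, 6_300_000]    # actual arbitration salaries

predicted = [predicted_salary(p, r) for p, r in zip(previous, raises)]

# Mean absolute error between predicted and actual salaries.
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(predicted)
print(round(mae))
```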

NOTES: You may have noticed that arbitration year is included in the models for pitchers but not hitters, and age for hitters but not pitchers. We think age is statistically significant for hitters and not pitchers because pitchers occasionally defy age and round into form when they’re older — Corey Kluber and Matt Shoemaker are salient examples of talented, non-prospect, arbitration-eligible (or pre-arb) pitchers — whereas hitters abide by well-documented aging curves. Moreover, prospect hitters have a smaller window of time to live up to expectations or else become organizational filler or a long-term Quad-A guy. On the other hand, arbitration year probably captures omitted variable effects for pitchers, thereby making an assumption about how a pitcher with X amount of service time will typically perform.

We are open-source proponents, so please feel free to use our data and code, which can be accessed here via GitHub.

