Modeling Salary Arbitration: Introduction

April 24, 2015

This post is part of an ongoing arbitration research project and is coauthored by Alex Chamberlain and Sean Dolinar.

Feb. 25: 2015 MLB Arbitration Visualized

* * *

Sean and I share a mutual passion for knowledge and understanding how things work. Said mutual passion is magnified when regarding baseball-related matters. With that said, the mysterious arbitration process intrigues us. We joined forces to try to crack the code, so to speak, and we would like to share the fruits of our labor with you.

Players with anywhere from three to six years of service time are eligible for salary increases based on performance. Teams and players typically reach settlements outside of arbitration, but if they can’t agree on a salary figure, both sides enter the formal arbitration process, as described here by FOX Sports.

Therein reside the questions intrinsic to the process: How do teams and players decide what is an appropriate dollar-value raise in salary? How does an arbitration panel decide in favor of one side or the other?

We are, by no means, trailblazers in this realm of study. Matt Swartz seems to have not only figured out the financial underpinnings of arbitration but also created a model that predicts salaries of arbitration-eligible players with considerable accuracy. His research is licensed exclusively to MLB Trade Rumors (MLBTR), which is unfortunate for those interested in what’s happening inside the black box. Until now!

Deconstructing arbitration isn’t a particularly difficult task. Ultimately, the fundamental equations (plural, for batters and pitchers) that explain arbitration can be composed of mostly traditional statistics. In other words, the secret to arbitration is not a sabermetrically complex formula; rather, it can be found on the backs of baseball cards: plate appearances, saves, innings pitched, etc. While we will ultimately use these stats to describe arbitration, for the purpose of this introductory post, we will use WAR, which summarizes all of these stats to some degree.

We borrowed MLBTR’s comprehensive arbitration data dating back to 2011 and paired players’ salaries, whether settled by arbitration or outside it, with their prior-year statistics. We adjusted salaries for economic inflation using the Consumer Price Index (CPI) but not league inflation, which has increased more quickly. Ideally, this adjustment makes the comparison of arbitration salaries over time more accurate.
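To make the inflation adjustment concrete, here is a minimal sketch in Python. The index values in `CPI_BY_YEAR` are hypothetical placeholders rather than official CPI figures, and the function name and base year are assumptions made for illustration; the idea is simply to restate each salary in a common year's dollars.

```python
# Hypothetical CPI index values by year (placeholders, NOT official figures).
CPI_BY_YEAR = {2011: 100.0, 2012: 102.0, 2013: 103.5, 2014: 105.0, 2015: 105.2}


def adjust_for_inflation(salary, year, base_year=2015, cpi=CPI_BY_YEAR):
    """Express a salary earned in `year` in `base_year` dollars."""
    return salary * cpi[base_year] / cpi[year]


if __name__ == "__main__":
    # A $2,000,000 salary from 2011, restated in 2015 dollars.
    print(round(adjust_for_inflation(2_000_000, 2011)))
```

Swapping in a league-specific salary index instead of the CPI would use the same ratio, just with a different lookup table.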

Using this data, we specified a number of mathematical models to not only explain the inner workings of the arbitration process but also predict players’ salaries settled via arbitration. The fundamental model presented in this post exclusively concerns wins above replacement (WAR), which attempts to capture player value in one number. We specify the model in a couple of slightly different ways.

We would like to note here that we specify an exponential model because it fits the data better than a strictly linear model would. That’s not to say the data can’t be modeled linearly, but we reduce the margin of error (that is, we lower the root-mean-squared error, or RMSE) by using a log-linear formulation.
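For readers who want to try this kind of comparison themselves, the sketch below fits both a linear and a log-linear specification and reports each one's RMSE in dollars. The data are synthetic (generated inside the script, not drawn from the MLBTR set), and the helper names are assumptions for illustration, so the numbers it prints say nothing about actual arbitration salaries.

```python
import numpy as np


def rmse(actual, predicted):
    """Root-mean-squared error between two arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def fit_linear(war, salary):
    """Fit salary = a + b * WAR by least squares; return predictions in dollars."""
    slope, intercept = np.polyfit(war, salary, 1)
    return intercept + slope * war


def fit_log_linear(war, salary):
    """Fit ln(salary) = a + b * WAR by least squares; return predictions in dollars."""
    slope, intercept = np.polyfit(war, np.log(salary), 1)
    return np.exp(intercept + slope * war)


# Synthetic data only: salaries grow roughly 14% per WAR with multiplicative noise.
rng = np.random.default_rng(0)
war = rng.uniform(0, 20, size=200)
salary = 1_300_000 * np.exp(0.13 * war + rng.normal(0, 0.3, size=200))

rmse_linear = rmse(salary, fit_linear(war, salary))
rmse_log_linear = rmse(salary, fit_log_linear(war, salary))
print(f"linear RMSE:     {rmse_linear:,.0f}")
print(f"log-linear RMSE: {rmse_log_linear:,.0f}")
```

Both errors are measured in dollars after converting the log-linear predictions back with `np.exp`, so the two figures are directly comparable. Which specification wins depends on the data being fit.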

Salary vs. Career WAR

We modeled a regression that captures the exponential relationship between a pitcher’s career WAR and his salary:

The equation, when simplified, yields an expected salary increase of 14% per WAR. The non-linear relationship between salary and career WAR indicates that a player with 10 WAR will benefit more from a one-WAR increase than a player with zero WAR (roughly $627K versus $186K). We modeled the same equation for batters, too, and the magic number is 13%.
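The arithmetic behind a constant percentage but a growing dollar amount can be sketched as follows. This assumes the generic log-linear form ln(salary) = a + beta * WAR; the coefficient and the salary at zero WAR below are hypothetical values chosen for illustration, not the fitted values from our model.

```python
import math


def pct_change_per_war(beta):
    """Percentage salary change per additional WAR implied by ln(salary) = a + beta * WAR."""
    return math.exp(beta) - 1.0


def dollar_gain_from_next_war(base_salary, beta, war):
    """Dollar increase from moving from `war` to `war + 1` under the log-linear model."""
    current = base_salary * math.exp(beta * war)
    return current * pct_change_per_war(beta)


# Hypothetical coefficient chosen so that each WAR adds 14%.
beta = math.log(1.14)
base_salary = 1_000_000  # hypothetical salary at zero career WAR

for war in (0, 10):
    gain = dollar_gain_from_next_war(base_salary, beta, war)
    print(f"{war:>2} WAR -> next WAR is worth about ${gain:,.0f}")
```

Because each WAR multiplies salary by the same factor, the dollar value of the next WAR scales with the salary already earned, which is why the 10-WAR player gains more in absolute terms.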

Raise in Salary vs. Prior-Season WAR

We also modeled the relationship between a player’s raise in salary and the WAR he accumulated in the previous season. (It’s important to note that we removed zero and negative changes in salary since we were taking the natural log of the salary change. The removed observations represented less than 5% of the total data set.)
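As a rough sketch of that filtering step, the snippet below drops any row whose raise is zero or negative before taking the natural log. The record layout, the function name, and the sample rows are all made up for illustration and do not reflect the actual data set.

```python
import math


def log_raises(records):
    """Return (war, ln(raise)) pairs, skipping raises that are zero or negative.

    `records` is an iterable of (prior_season_war, previous_salary, new_salary).
    """
    pairs = []
    dropped = 0
    for war, old_salary, new_salary in records:
        raise_amount = new_salary - old_salary
        if raise_amount <= 0:
            dropped += 1  # ln() is undefined here, so the row is excluded
            continue
        pairs.append((war, math.log(raise_amount)))
    return pairs, dropped


# Made-up rows for illustration only.
sample = [
    (2.1, 1_000_000, 2_500_000),
    (0.3, 1_500_000, 1_500_000),   # no raise: dropped
    (-0.4, 2_000_000, 1_800_000),  # pay cut: dropped
    (4.0, 3_000_000, 6_000_000),
]
pairs, dropped = log_raises(sample)
print(len(pairs), "rows kept;", dropped, "rows dropped")
```

Counting the dropped rows alongside the kept ones makes it easy to report what share of the data the filter removes.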

A pitcher can expect a 56% increase in his raise in salary for each additional WAR he accumulates. In other words, he can expect his next raise in salary to be 56% larger than his most recent raise. Likewise, a batter can expect a 46% larger raise for each additional WAR accumulated.

As you might notice, these models have a lot of noise in them. The best model only describes 56% of the variance of the salaries. Using only one statistic, even a comprehensive one in WAR, leaves out a lot of details. Some of these are arbitration year, position, and statistics that are actually used to compare players during the arbitration process. For example, a player will generally have an increase in salary as he moves through the arbitration process regardless of WAR. Accounting for these other variables is important to produce a more accurate model.

These equations will eventually become more elaborate, especially when we present models that predict a player’s expected salary as settled via arbitration. Ultimately, we want the specifications from this post to simply illustrate how a player’s performance relates to his arbitration-settled salary; they ought to be used for inference (that is, a generalization about a population) rather than for prediction.
