This post is part of an ongoing arbitration research project and is coauthored by Alex Chamberlain and Sean Dolinar.
April 24: Modeling Salary Arbitration: Introduction
Feb. 25: 2015 MLB Arbitration Visualized
* * *
A couple of weeks ago, we introduced a pair of regressions that modeled arbitration results using basic formulas predicated on wins above replacement (WAR). Ultimately, the models estimated that, for each additional WAR, an arbitration-eligible pitcher could expect his salary to increase by 14 percent and his raise in salary to increase by 56 percent. A hitter could expect increases of 13 percent and 46 percent, respectively.
The models, however, were incomplete: they did not incorporate any stats other than WAR. This was by design, as we wanted to introduce simple one-variable equations for the sake of demonstration. WAR is, conveniently, a comprehensive variable that attempts to summarize a player’s worth in one easily digestible number. But what about the effects of a player’s age or arbitration year?
Moreover, the r-squared statistic, a quick-and-easy measure of how well a model fits the data, was not especially strong for any of the specifications, ranging from .30 to .56. This is partly a result of specifying only one explanatory variable, so including more variables, which we have done in this post, should improve the goodness of fit of the models, assuming the variables are relevant.
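To make the "assuming the variables are relevant" caveat concrete, here is a minimal sketch of r-squared alongside adjusted r-squared, which penalizes each added variable. The function names, the sample size of 100 and the .57 figure are our own illustrative assumptions, not values from the models.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Discount r-squared for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical comparison: one variable at .56 versus five variables at .57,
# both fit on an assumed sample of 100 players.
one_var = adjusted_r_squared(0.56, n_obs=100, n_predictors=1)
five_var = adjusted_r_squared(0.57, n_obs=100, n_predictors=5)
print(round(one_var, 3), round(five_var, 3))
```

In this made-up case the raw r-squared rises while the adjusted figure falls slightly, which is what we would expect when extra variables add little real information.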
With that said, we have new-and-improved models to share with you: one composed of composite statistics and another composed of traditional statistics. They are all vanilla, linear ordinary least squares (OLS) regression models, and it is important to remember that the coefficient estimated for each stat can only be interpreted in the context of that specific model.
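For readers who have not fit one before, the following is a rough sketch of what a multi-variable OLS fit involves, solved through the normal equations. Everything in it is an assumption for illustration: the function name `fit_ols`, the choice of predictors (WAR, PA, HR, age), the six synthetic players and the coefficients used to generate their salaries. None of it is our actual data or output.

```python
def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one list of predictor values per player; an intercept is added here.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    # (Fails with a division by zero if predictors are perfectly collinear.)
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic players: (WAR, PA, HR, age). Purely hypothetical values.
players = [
    (2.1, 550, 18, 27), (4.5, 640, 31, 26), (0.8, 410, 9, 29),
    (3.2, 600, 12, 28), (1.5, 480, 22, 30), (5.6, 680, 25, 25),
]
# Made-up "true" coefficients used only to generate noiseless salaries ($M).
true_coefs = [0.5, 0.9, 0.002, 0.05, -0.01]
salaries = [true_coefs[0] + sum(c * v for c, v in zip(true_coefs[1:], p))
            for p in players]

coefs = fit_ols(players, salaries)
print([round(c, 4) for c in coefs])
```

Because the synthetic salaries contain no noise, the fit should hand back the generating coefficients up to rounding. With real arbitration data the estimates would come with error, and each one would only mean something alongside the other variables in that same model.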
Non-Traditional Statistics
For each player, we specify…
- a composite statistic, such as wins above replacement (WAR) for batters and RA9-WAR for pitchers, to measure overall performance (RA9-WAR uses runs allowed per nine innings rather than FIP);
- a service statistic, such as plate appearances (PA) and innings pitched (IP), to measure playing time;
- a “glory” statistic, such as home runs (HR) and saves (SV), to account for baseball’s affinity for traditional statistics and social constructs;
- arbitration year (for pitchers*), indicating a player’s total service time;
- and his age (for hitters*), to measure as best we can the number of years for which he has inhabited the earth.
We chose these particular stats not only to cover as much analytical ground as possible but also to minimize the use of stats that are highly correlated with one another (multicollinearity). We want to isolate different aspects of player performance or value as best we can.
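One simple way to screen for multicollinearity is to compute pairwise correlations among candidate variables and flag the pairs that move together too closely. The sketch below assumes a 0.8 cutoff, which is only a rule of thumb, and uses invented HR, RBI and age columns. RBI is included purely as a hypothetical example of a stat that tracks HR.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flag_collinear(columns, threshold=0.8):
    """Return (stat_a, stat_b, r) for pairs with |r| above the threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 2)))
    return flagged


# Invented values for six hypothetical hitters.
columns = {
    "HR": [18, 31, 9, 12, 22, 25],
    "RBI": [70, 105, 40, 55, 80, 90],
    "age": [27, 26, 29, 28, 30, 25],
}
flagged = flag_collinear(columns)
print(flagged)
```

In this toy sample only the HR and RBI pair is flagged, so a model would keep one of the two rather than both. Pairwise correlation is a blunt screen and will miss a variable that is a combination of several others, so we would treat it as a first pass only.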