Linear Weights + BaseRuns = Good

In my last article, I explained how wOBA’s current implementation changes the value of walks, singles, home runs, etc., annually due to changing league characteristics.  Does this mean that the value of an event is the same for every team in the league each season?  Or in every park in the league?  No way.  If you’re talking about a weak offense in a high-offense era, then the overall constants for a weak offensive era are probably more applicable to that team.  However, it’s not really the point of standard wOBA to guess the run-producing contribution of a particular player to a particular team; I think it’s probably more accurate to say it’s about his probable productivity on a typical team (although park effects aren’t taken into account, so not exactly… that would be more true of wRC+).

Anyway, Tom Tango realized this limitation, and produced a table that shows how the values change depending on a team’s runs scored.  He accomplished this system of “Custom Linear Weights” (“a necessary offshoot” of linear weights, he says) by making use of David Smyth’s BaseRuns formula, which is, in simplest terms, Runs Scored = base runners * (% of base runners that score) + home runs.  Home run hitters are not considered base runners, in this equation, by the way.  Makes perfect sense, right?

Tango realized that BaseRuns had a better handle on the team run-scoring process than his basic linear weights system (and all the other run estimators), so he translated the results of BaseRuns in various run environments into linear weights.  Specifically, the BaseRuns formula told him how many runs the team should score, and the linear weight value of each hit came from how many additional runs BaseRuns expected the team to score if it had one more of that type of hit (the marginal value of each hit type).  Here are just the basics of his results, in graphical form:
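To make the "one more hit" idea concrete, here is a minimal Python sketch of that marginal-value calculation. It uses the coefficients of the simple BaseRuns formula quoted later in this article. Both team lines are made-up numbers for illustration, and `base_runs` and `marginal_values` are hypothetical helper names. This is not Tango's actual code or table, just a sketch of the same principle.

```python
# Hypothetical team season lines (made-up numbers, for illustration only).
AVERAGE_TEAM = dict(singles=950, doubles=280, triples=30, hr=160, bb=520, ab=5500)
STRONG_TEAM = dict(singles=1100, doubles=340, triples=35, hr=220, bb=650, ab=5700)


def base_runs(singles, doubles, triples, hr, bb, ab):
    """Simple BaseRuns: base runners * score rate + home runs."""
    hits = singles + doubles + triples + hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    runners = hits + bb - hr  # home run hitters are not counted as base runners
    advancement = (1.4 * total_bases - 0.6 * hits - 3 * hr + 0.1 * bb) * 1.02
    outs = ab - hits
    return runners * advancement / (advancement + outs) + hr


def marginal_values(team):
    """Extra runs BaseRuns expects from one more event of each type."""
    baseline = base_runs(**team)
    values = {}
    for event in ("bb", "singles", "doubles", "triples", "hr"):
        bumped = dict(team)
        bumped[event] += 1
        if event != "bb":
            bumped["ab"] += 1  # a hit adds an at-bat too, so outs stay unchanged
        values[event] = base_runs(**bumped) - baseline
    return values


for name, team in (("average", AVERAGE_TEAM), ("strong", STRONG_TEAM)):
    rounded = {event: round(value, 3) for event, value in marginal_values(team).items()}
    print(name, round(base_runs(**team)), rounded)
```

Because the extra event is added without changing the number of outs, these values don't have the value of the out subtracted. On these made-up lines, a single comes out worth more for the stronger offense than for the average one.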

[Figure: LWTS by runs]

You may have noticed the numbers don’t line up with those in the chart at the end of my last article; that’s because this is without the value of the out subtracted. You’ll also see more or less what I was talking about in the previous article, which is that the run value of each event rises with greater scoring — particularly the non-home run events.

Speaking of outs…

What’s an out really worth?  In the version of linear weights (LWTS) that deals with outs, the default value of an out is about -0.29 runs.  But that’s based around a typical team.  And it’s a pretty abstract concept — it’s all about the cost of “what could have been”.  That cost is entirely dependent on the team; as the great mathematician and/or funk singer Billy Preston often said, “nothin’ from nothin’ leaves nothin,” so if a hypothetically horrible team (or player) can only be expected to make outs, then nothing is really lost when they do in fact make an out.  Hitting into a double play or making an out on the base paths… that has to be considered a little differently, since that’s effectively taking away part of the value of somebody’s walk, single, etc.

Anyway, here’s the chart for when the value of not making an out at the plate is included:

[Figure: LWTS by runs and outs]

You may have noticed that the slopes are a bit steeper now, as the value of an out goes from -0.06 at 1 run per game to -0.594 at 10 runs per game.

I’m a little more inclined to look at these shifts in the values of the hits from the angle of synergies between on-base percentage (or more accurately, not making outs), extra base tendencies, and base running rather than from the angle of runs per game.  I think this way gets the cause-and-effect order correct, plus it seems a little too close to being circular otherwise (try googling “recursion”… those pranksters).

Consider how many base runners per inning you can expect given the proportion of batters who make it on base and aren’t put out afterwards (I call it “non-out rate,” but just consider it on-base percentage with no screw-ups involved):

[Figure: Runners v OBP]

Assuming no outs on the bases, as OBP approaches 1.000, runners per inning approaches infinity.  I cut the graph short of infinity, because I figure I’ve already put you loyal readers from my FanGraphs Community Research articles (1, 2, and especially 3) through enough endless scrolling…

When you think about it, the closer to 1.000 a team’s OBP becomes — at least past a certain point — the closer to 1 the value of any on-base event should be.  If the bases are always loaded, then even a walk is always going to drive in a run, and you’re always going to be driven in by somebody behind you.  Sure, a home run with the bases loaded is going to drive in 4 runs, but then you have to consider that the hitters behind the home run hitter would have driven in the runs anyway.  Basically, it’s a communist utopia of hitters, where all varieties of hits, and even walks, have equal worth.  But it only applies at practically impossible team OBP levels (long-term, anyway), where everybody is performing a lot better than they’re realistically capable of, somehow.

Anyway, if you were wondering, the line follows the formula: Runners per inning = 3/(1-OBP) – 3.  So that you can see why this makes sense:

Reached Base | Plate Appearances | OBP   | 3/(1-OBP) - 3
0            | 3                 | 0.000 | 0
1            | 4                 | 0.250 | 1
2            | 5                 | 0.400 | 2
3            | 6                 | 0.500 | 3
4            | 7                 | 0.571 | 4

… and so on.
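If you'd rather check the pattern than take my word for it, here is a small Python sketch that regenerates the rows above. The function name `runners_per_inning` is just an illustrative choice, and the sketch keeps the same assumption of no outs on the bases. The logic: an inning needs 3 outs, each plate appearance is an out with probability 1 - OBP, so the expected number of plate appearances is 3/(1-OBP), and everything beyond the 3 outs is a base runner.

```python
def runners_per_inning(obp):
    """Expected base runners per inning, assuming no outs on the bases."""
    if not 0 <= obp < 1:
        raise ValueError("OBP must be at least 0 and below 1")
    expected_plate_appearances = 3 / (1 - obp)  # 3 outs at (1 - OBP) outs per PA
    return expected_plate_appearances - 3       # everyone who wasn't an out


# Rebuild the table: N runners reaching base means N + 3 plate appearances.
for reached_base in range(5):
    plate_appearances = reached_base + 3
    obp = reached_base / plate_appearances
    print(reached_base, plate_appearances, f"{obp:.3f}", f"{runners_per_inning(obp):.3f}")
```

The function rejects an OBP of exactly 1.000, since that is where the line heads off to infinity.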

You don’t have to be a rocket scientist to figure out that if you have a lot of plate appearances in a single inning, you’re going to score a lot of runs.  Each plate appearance over 6 in an inning is automatically worth a run (a maximum of 3 on base and a maximum of 3 outs).

Now, what if runners get thrown out trying to stretch a single into a double, or trying to steal?  You could then have a 1.000 OBP in an inning, yet only have 3 plate appearances total… so that screws that idea up.  Then there are double and even triple plays to deal with (OK, at least you know there won’t be a 1.000 OBP for the whole inning in those cases).  These events complicate things considerably…  so not to distract you from that little problem or anything, but… HEY, WHAT’S THAT BEHIND YOU?!  Seriously, though, I’ll attempt to address that issue in a future article, but it’s probably not going to be pretty.

Moving on, here’s a relationship that Tom Tango pointed out:

[Figure: Runs per OB v OBP]

Runs per Time on Base (R/OB), defined as R/(H + BB + HBP), has a fairly strong positive correlation to OBP.  Recall the BaseRuns principle that runs = (base runners except HR) * (score rate) + HR; R/OB is the scoring rate, so the implication here is that having a higher OBP actually raises both parts of that equation.  With a little bit of algebra, you can therefore say that total runs has a strong relationship to OB^2 / PA.  Specifically, 1.1 * OB^2 / PA turns out to be a pretty decent estimator of runs, with a 0.929 correlation to runs and a Mean Absolute Error of 30 runs over a season (1960-2012).

If you modify the above graph to instead compare OBP to (R – HR) / (OB – HR) — BaseRuns style, considering HR separately — the R^2 shoots up to 0.5181.  The conversion now leads to the formula:

R = 0.955 * ((OB – HR) * OB/PA + HR)

… which turns out to be a better estimator of runs, with a 0.962 correlation to runs and a Mean Absolute Error of 22 runs over a season.  (For the uninitiated in stats: an MAE of 22 means the formula’s estimate of a team’s season run total misses by 22 runs on average, and the 0.962 correlation is extremely strong, since 1 is as high as a correlation can be.)
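For anyone who wants to play with the two estimators, here is a rough Python sketch of both, plus the Mean Absolute Error calculation. The function names are my own illustrative choices, and the team line and the numbers fed to the MAE helper are made up to show the arithmetic, not real seasons.

```python
def runs_from_ob(ob, pa):
    """Estimate runs as 1.1 * OB^2 / PA, where OB = H + BB + HBP."""
    return 1.1 * ob ** 2 / pa


def runs_from_ob_and_hr(ob, hr, pa):
    """BaseRuns-style estimate: 0.955 * ((OB - HR) * OB/PA + HR)."""
    return 0.955 * ((ob - hr) * ob / pa + hr)


def mean_absolute_error(actual, predicted):
    """Average size of the misses, ignoring their direction."""
    misses = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(misses) / len(misses)


# Hypothetical team line (made-up numbers): 2000 times on base, 160 HR, 6200 PA.
print(round(runs_from_ob(2000, 6200), 1))
print(round(runs_from_ob_and_hr(2000, 160, 6200), 1))

# Made-up actual vs. estimated run totals, chosen only to illustrate MAE.
print(mean_absolute_error([700, 650], [722, 628]))
```

In the second formula, OB/PA is doing the job of the score rate, and home runs are pulled out of the base runner count because they score regardless.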

Now’s as good a time as any to get into more details regarding David Smyth’s BaseRuns.  Paraphrasing the simple version of the formula:

Runs = (OB-HR) * (runner advancement estimate) / (runner advancement estimate + Outs) + HR

The middle section of the formula represents the score rate.  This means what I essentially showed earlier is that OBP (or OB/PA) works quite well as a stand-in for the score rate component of BaseRuns.  Interesting, no?  BaseRuns is basically as good as it gets amongst run estimator formulas, but it’s really only a little bit better than that formula, with a 0.973 correlation to runs, and an MAE of 18.2.  Of course, the accuracy of my formula there could be partly due to the tendency of high-OBP teams to also have more power.

The score rate component of BaseRuns is probably the only component that can stand to be improved upon.  In the simple version of the formula, it’s:

((1.4*TB – .6*H – 3*HR + .1*W)*1.02) / ((1.4*TB – .6*H – 3*HR + .1*W)*1.02 + (AB – H))

(AB-H) is the aforementioned Outs component, by the way.  Anyway, as you can probably imagine, scoring rate isn’t entirely solved by this equation.  There’s a more complex version that deals with steals, caught stealing, double plays, hit by pitches, and intentional walks, but it only makes for a slight improvement.  The truth of the scoring rate is more complicated than that.
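Here is the simple version written out as a Python sketch, so you can see the score rate as its own piece. The team line is made up for illustration, the function names are hypothetical, and I'm treating base runners as H + W - HR on the assumption that the simple version leaves hit by pitches to the more complex formula. The "OBP proxy" at the end is a crude (H + W) / (AB + W) that ignores hit by pitches and sacrifice flies, included only to compare it against the score rate.

```python
def advancement(tb, h, hr, w):
    """Runner advancement estimate from the simple BaseRuns formula."""
    return (1.4 * tb - 0.6 * h - 3 * hr + 0.1 * w) * 1.02


def score_rate(tb, h, hr, w, ab):
    """Estimated share of base runners who come around to score."""
    b = advancement(tb, h, hr, w)
    outs = ab - h
    return b / (b + outs)


def simple_base_runs(tb, h, hr, w, ab):
    """Runs = (base runners excluding HR) * score rate + HR."""
    runners = h + w - hr  # assumption: no HBP in the simple version
    return runners * score_rate(tb, h, hr, w, ab) + hr


# Hypothetical team line (made-up numbers, for illustration only).
team = dict(tb=2240, h=1420, hr=160, w=520, ab=5500)

rate = score_rate(**team)
obp_proxy = (team["h"] + team["w"]) / (team["ab"] + team["w"])  # crude OBP
print(f"score rate {rate:.3f}, crude OBP {obp_proxy:.3f}")
print(f"BaseRuns estimate {simple_base_runs(**team):.1f}")
```

On this one made-up line the two rates land close together, which is only an illustration of why OBP can stand in for the score rate, not evidence for it.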

So, the BaseRuns formula holds up to extreme run environments better than the other run estimators you probably know and love, which is why Tango used it to derive runs-per-game-based linear weights.  However, increased runs per game don’t cause higher linear weights; a better, more synergistic offense is the root cause of both.  Therefore, what we want is to start over and take these synergies more into account next time.  When we do that, and we see how the whole can be greater than (or less than) the sum of its parts, we can transcend linear weights, BaseRuns, and the combination thereof.  Next time, I’ll show you a bit of how that can be done.





Steve is a robot created for the purpose of writing about baseball statistics. One day, he may become self-aware, and...attempt to make money or something?

Dan
11 years ago

Why would FG let you post these? Now, whenever they analyze a trade or acquisition we masses will shout “But what about in team x’s particular run environment?”

Great article, again.

philosofool
11 years ago
Reply to  Dan

There is actually a response to the “What about particular run environment x?” It goes like this: while the run value of the player’s performance will be lower/higher than expected in the current average environment, (1) half of all games played are on average in an average environment, (2) valuable skills are valuable in every environment and (3) the relative differences in those values are typically small. Then you say “the burden of proof is on you to show that these differences undermine the present analysis.”

Dan
11 years ago
Reply to  philosofool

By run environment, I mean a certain team’s run scoring, not park factors.

Also, it was a joke.

philosofool
11 years ago
Reply to  Dan

That was a joke?

I was a total bonehead to mention park factors, but the response, otherwise, is still valid.

Baltar
11 years ago
Reply to  Dan

I laughed.