Why Our Pitcher WAR Uses FIP, Part Two

This post builds on the one I wrote a few hours ago, so I’d encourage you to read that if you haven’t yet. If you really don’t want to follow the link, this is the paragraph where we’re picking up from:

In the end, we had to choose between two different methods – assuming that the pitcher had no responsibility for the outcome of a ball in play, or attempting to approximate the amount of time that the result was due to the pitcher or the fielder. Ideally, we’d be able to do the latter – which is how Sean approaches it – but I just don’t think we currently have the tools available to make an accurate enough judgment on how to apportion that responsibility.

Clearly, some hits on balls in play are the “fault” of the pitcher. He throws a fastball down the middle in a 3-1 count and the hitter whacks it for a double in the gap – that’s on him, certainly. However, most hits are not of that variety. Instead, they’re ground balls in between two defenders or fly balls that fall near a chasing outfielder before he can get to them. In those instances, we don’t really know how much responsibility for the hit should go to the pitcher or the fielder. Would Elvis Andrus have gotten to that grounder up the middle that Yuniesky Betancourt didn’t get close to? Maybe, maybe not. Did Carl Crawford run down a shallow popup that Juan Rivera would have had to pick up on the third bounce? Perhaps. We don’t have the luxury of having a control group for each ball in play. All we know is whether the guy who happened to be the defender on duty at the time was able to make the play or not.

So, what do we do with a specific pitcher’s results on balls in play? This was the thing that I wrestled with the most while we were designing WAR for pitchers a few years ago. I can see an argument for doing it in one of two ways, though I think both have problems.

1. FIP-based WAR, which is what we ended up using, essentially admits that we don’t have enough information about dividing responsibility for the results of balls in play, and so it ignores them.

2. RA-based WAR, which is what Sean ended up using, attempts to adjust for defensive contribution by taking a team’s overall Total Zone rating and assigning an expected defensive debit or credit to each pitcher based on how the team performed on the season as a whole.
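
To make "ignores them" in the first option concrete, here is a minimal sketch of the commonly published FIP formula. The 13/3/2 weights are the standard ones; the 3.10 constant and the stat line are placeholder assumptions for illustration, not numbers from this post or from any real pitcher.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching, per the commonly published formula.

    The constant is a league-specific term that puts FIP on the ERA scale;
    3.10 is a placeholder assumption here, not an official value.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


# Hypothetical stat line, invented for illustration only.
line = {"hr": 15, "bb": 50, "hbp": 5, "k": 200, "ip": 200.0}
example_fip = fip(**line)
print(round(example_fip, 2))
```

Notice that hits on balls in play never appear among the inputs, which is exactly the information FIP chooses to set aside.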

I get why Sean did it the way he did it, and I understand why there are people who prefer that path. It appeals to our inherent sense of runs allowed being a record of what actually happened, and presents the possibility of achieving the ultimate goal – a pitcher’s total contribution to run prevention with the effects of his teammates factored out. The problem, though, is a pretty big one, and the one that caused me to lean away from RA-based WAR for our purposes here. It assumes that the distribution of defensive performance was even for each pitcher on every team, which is quite obviously not going to be true. Simply put, it is not a record of what actually happened – it is an assumption of what might have happened if all defenders on a team were evenly skilled and were perfectly consistent on a day-to-day basis.
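
The even-distribution assumption can be sketched roughly as follows. This is not Sean's actual implementation; spreading a team's season defensive runs in proportion to each pitcher's balls in play is an assumption made here purely to illustrate the idea, and the pitcher names and numbers are hypothetical.

```python
def even_defensive_credit(team_def_runs, balls_in_play):
    """Split a team's season defensive runs across pitchers in proportion
    to the balls in play each one allowed (the 'even support' assumption)."""
    total = sum(balls_in_play.values())
    if total == 0:
        raise ValueError("no balls in play to allocate against")
    return {p: team_def_runs * bip / total for p, bip in balls_in_play.items()}


# Hypothetical team: defense rated +30 runs over the season.
team_def_runs = 30.0
bip = {"Pitcher A": 600, "Pitcher B": 500, "Pitcher C": 400}

credit = even_defensive_credit(team_def_runs, bip)
for pitcher, runs in credit.items():
    # Every pitcher ends up with the same runs-per-ball-in-play rate.
    print(pitcher, round(runs, 1), round(runs / bip[pitcher], 3))
```

Under this kind of allocation every pitcher gets an identical rate of defensive help per ball in play, no matter how the fielders actually played on the days he pitched.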

We can simply look at the distribution of run support for a pitcher on any given team to see that the assumption of even performance is not going to be true. If we use the Yankees rotation as an example, we see that the Yankees averaged 5.30 runs per game this year. Their distribution by starting pitcher is below:

CC Sabathia – 5.89 R/G
A.J. Burnett – 4.29 R/G
Phil Hughes – 6.75 R/G
Andy Pettitte – 6.00 R/G
Javier Vazquez – 4.12 R/G

No pitcher is actually within half a run of the team average. Burnett and Vazquez are over a run per game lower than the overall total, while Pettitte is nearly three quarters of a run per game higher and Hughes is a run and a half per game higher. If you built a metric that worked off the assumption that the Yankees offense scored the same number of runs per game when Vazquez was on the mound as when Hughes was on the mound, you’d probably draw some pretty inaccurate conclusions. There is no reason to think that defensive performance is any more consistent on a day-to-day basis. If anything, there are reasons to believe that it would vary even more than offense.
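
If you want to check the arithmetic yourself, a few lines will do it. The figures are the run support numbers listed above; the variable and function names are just illustrative choices.

```python
TEAM_AVG = 5.30  # Yankees runs per game, as quoted above

run_support = {
    "CC Sabathia": 5.89,
    "A.J. Burnett": 4.29,
    "Phil Hughes": 6.75,
    "Andy Pettitte": 6.00,
    "Javier Vazquez": 4.12,
}


def deviations(support, team_avg):
    """Runs per game above (+) or below (-) the team average for each starter."""
    return {name: round(rpg - team_avg, 2) for name, rpg in support.items()}


diffs = deviations(run_support, TEAM_AVG)
for name, diff in sorted(diffs.items(), key=lambda kv: kv[1]):
    print(f"{name:<16}{diff:+.2f}")

# Smallest gap between any starter and the team average.
closest = min(abs(d) for d in diffs.values())
print(f"closest to team average: {closest:.2f} R/G")
```

The smallest gap belongs to Sabathia, and even he sits more than half a run away from the team average.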

In general, a team will run out a similar line-up on a day-to-day basis, and each guy will get about the same number of plate appearances per day, as required by having batters take turns in order. That boundary does not hold with defenders, however. There is no rule that says each player on the field get an equal number of opportunities each day. In fact, given that pitchers have different tendencies in terms of groundball and flyball rates, it’s nearly guaranteed that the defensive opportunities will not be equal between pitchers.

Using aggregate stats from a team’s entire season simply won’t give you the kind of detail needed to accurately determine the quality of defense that was played behind a given pitcher in a season. Doing pitcher WAR that way produces an end result that is not an accounting of what actually happened.

Since neither method gets us to that goal of accurate accounting, we’re left with a choice of two paths, both with structural problems that can’t be avoided based on the data we currently have access to. Personally, I prefer FIP-based WAR because it is easier to adjust for what we know is not included – BABIP and sequencing, essentially – than it is to take a defense-adjusted, RA-based WAR and make adjustments for places where the assumption of defensive distribution equality does not hold.

Let’s use Francisco Liriano and Cliff Lee as examples. Liriano’s RA results don’t match his FIP in large part because he has a .340 batting average on balls in play. Since our version of WAR doesn’t hold that against him, he comes out looking really good. An RA-based version not only holds his actual BABIP against him (by starting with runs allowed), but it then penalizes him further because the Twins have an above-average defense, and the assumption is that he got proportionate help from the guys behind him.

What is more likely – that Liriano gave up contacted balls that should have resulted in a .350 to .360 BABIP, and the good-glove Twins helped bring that down to .340, or that the guys behind him didn’t make as many plays for him as they did when Carl Pavano or Brian Duensing was pitching? Considering that he posted a basically league average 19.1% line drive rate, I’m more inclined to believe that the latter is closer to the truth. We don’t know exactly what kind of defensive support Liriano got this year, but based on what we know about a pitcher’s control over BABIP, I think we’re better off assuming that there were some issues behind him that hurt him than we are assuming that the Twins defense supported him as well as they supported his fellow pitchers.

Lee’s case shows the other side of the coin that FIP ignores – when those hits occur. While he has a normal .302 batting average on balls in play, it is not evenly distributed across the base-out states. His BABIP is just .257 with the bases empty, but jumps to .350 with men on base, and is .333 with runners in scoring position. Because of that split in when his balls are being turned into outs, he has a LOB% of just 67.9%, well below average and far below what pitchers of his quality have posted this year.

For Lee, it hasn’t been a problem of too many balls finding holes, but simply of those balls finding holes at the wrong times. It’s possible that those hits were a result of poor location, but given that he ran a 10.00 K/BB ratio with men in scoring position, it doesn’t seem like Lee suddenly lost his command when men got on base this year. Maybe he did – I don’t know. But should we assume that a pitcher’s BABIP with RISP is under his control? Since that’s really the driving force of Lee’s ERA this year, we have to acknowledge that it is certainly possible that his defenders let him down in those critical situations.

It is also possible that he let himself down. We just don’t really know who is at fault – pitcher or defenders. FIP blames BABIP entirely on defense. That’s definitely wrong. Defense-adjusted RA assumes that each pitcher got the same support from their teammates. That is also definitely wrong.

So, we’re left with two imperfect options. Which should you prefer? I can’t answer that for you. They both have strengths and weaknesses, and both are valid attempts to answer the question that we’re really trying to get at. I prefer the FIP-based implementation because it’s easier to make mental adjustments from that number, knowing that BABIP and sequencing are not included, than it is to try and back out of a metric that is already attempting to account for defensive support and find out where it might have missed the mark, but that’s a personal preference more than it’s a right or wrong thing.

WAR is not perfect, and it’s less perfect for pitchers than it is for hitters. Separating out defense from pitching is hard, and we don’t have it all figured out yet. We don’t encourage you to use any version of WAR as the be-all, end-all of analysis. We think it’s a pretty nifty tool, especially if you understand its limitations, and it does a good job in most instances. However, it’s not perfect. Our version isn’t perfect, and Sean’s version isn’t perfect. We’re both trying, and we’re trying from different angles. Rather than focusing on why the differences make both “wrong,” maybe we should admit that it’s nice to have both perspectives?

Dave is the Managing Editor of FanGraphs.

132 Comments
AndyS
13 years ago

So I buy the argument for using DIPS, but specifically, you don’t address the biggest question mark – why FIP? Why not xFIP? or tERA? or tERA* or tERAr or SIERA? Or maybe a weighted combination of the bunch weighted on accuracy?

Mike K
13 years ago
Reply to  AndyS

From the Glossary:

“If and when a new metric like tRA is proven to be significantly more effective in valuing pitchers (and I’m hopeful that it will be, given more data exploration on the topic), we won’t be standing here as guardians of the infallibility of FIP.”

I don’t even think SIERA was published yet.

AndyS
13 years ago
Reply to  Mike K

SIERA was published, and xFIP has been shown to be more predictive of a pitcher’s ability than FIP.

suicide squeeze
13 years ago
Reply to  Mike K

I think the key word there is “valuing.” You can expect someone with 15% HR/FB to be better than that in the future, but he still gave up those home runs this year, and that should be accounted for.

Hark
13 years ago
Reply to  Mike K

@Andy S

WAR isn’t a predictive value, it’s a descriptive one. xFIP may be more predictive, but FIP is closer to what he actually did. And that’s what WAR measures–how valuable an individual player was to his team, not how valuable he’s going to be. We shouldn’t use predictive stats for that.

Rich
13 years ago
Reply to  Mike K

If you want it to be descriptive, it shouldn’t be using FIP, it should be using ERA or some such.

FIP’s idea is to be predictive, and it does a poor job at it.

Fire_Jim_Hendry
13 years ago
Reply to  Mike K

@ Rich

You are right that FIP is at the low end of the descriptive-predictive scale. However, while ERA is descriptive, it is still a combined pitching/defense measurement, and as such is not a good choice for pitching WAR. If you adjust for defense, you get something closer to an adjusted RA value WAR. It seems then that RA-adjusted WAR, with additional batted-ball adjustments, is the best fit.

Rich
13 years ago
Reply to  Mike K

“You are right that FIP is at the low end of the descriptive-predictive scale. However, while ERA is descriptive, it still is measuring a combined pitching/defense measurement, ”

So is FIP, despite the insistence otherwise. FIP’s denominator is IP, which is heavily influenced by a team’s ability to turn batted balls into outs. Replace IP with PA, and we might be getting somewhere (which is something SIERA has done).

FIP is no more defense-independent than ERA.

Nadingo
13 years ago
Reply to  Mike K

Your point about PA over IP is valid, but it’s a huge (and incorrect) leap to get from there to claiming that FIP is no more defense-independent than ERA. ERA explicitly makes judgments about when runs are “earned” based on a hugely flawed metric — whether a defensive play was an “error” or not. And that’s just the tip of the iceberg.

Kevin S.
13 years ago
Reply to  Mike K

That’s just silly – ERA equates defense-independent outcomes with defense-dependent outcomes. FIP doesn’t. If you want to argue that FIP isn’t completely defense independent because of the IP issue, that’s fine, but it’s absolutely more defense-independent than ERA. Try not to get carried away next time.

Rich
13 years ago
Reply to  Mike K

A stat cannot be “more independent”.

It either is, or it isn’t. FIP isn’t. FIP is dependent on defense in a different way than ERA, but it’s not any less dependent.

CircleChange11
13 years ago
Reply to  Mike K

more independent, closer to perfect, exactly like except, … a whole bunch of weird phrases that we commonly use.

Why don’t many point out how much FIP relies on hitter skill? Walks and strikeouts seem to be two things that hitters have a tremendous amount of influence/control over … which is why we see the same guys K and BB at about the same rate annually.

If pitchers controlled strikeouts and walks, there’d be a bunch of one and none of the other.

That’s not really an issue with Fielding Independent Pitching, which is what FIP is … but the interpretation of it as being “things pitchers control”.

Like I said, if pitchers controlled HR, BB, and K … the stat line for every IP would read “0 HR, 0 BB, 3 K”. They don’t.

It’s another case of the stat not being the problem, but the interpretation/usage of it.

I don’t, however, think that one can just say “WAR” anymore; I think one has to designate fWAR or rWAR, at least for pitchers. They seemingly measure two different things, but like most stats … most often the same talented group of guys will be in the top 15. But, that’s a pretty low standard to set. With that standard WHIP and ERA would be high-level stats.