Archive for Research

Are Popups a “Skill”?

In light of yesterday’s article on Infield Fly Balls and xFIP, there were some questions and debate about whether popups are something under a pitcher’s control, or what you might call a skill. After reading the comments, I began to wonder whether my research had been thorough enough to essentially rule them out as a skill (which is what I more or less did).

If you look at popups per ball in play on a year-to-year basis, you get a correlation of about .52, which would strongly suggest that there is some “skill” in inducing popups. However, there is a very strong positive correlation between popups and outfield fly balls (.64), and a very strong negative correlation between popups and groundballs (-.72).

In other words, as outfield fly balls increase, so do popups. As groundballs decrease, popups increase. For comparison’s sake, line drives have absolutely no correlation with popups.

Since the correlations are so high, you can basically come up with an expected popup rate based on a player’s groundball percentage. To me, the relationship actually looks non-linear.
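In code, fitting that expected popup rate might look something like this — a sketch only, assuming a per-season pitchers table with gb_pct and pu_pct columns (groundballs and popups as fractions of balls in play); the file and column names are mine:

    import numpy as np
    import pandas as pd

    # Hypothetical per-season pitcher data; gb_pct and pu_pct are
    # groundballs and popups as a share of balls in play.
    pitchers = pd.read_csv("batted_balls_2006_2009.csv")

    # A second-degree polynomial is one simple way to capture the bend.
    coefs = np.polyfit(pitchers["gb_pct"], pitchers["pu_pct"], deg=2)
    pitchers["xpu_pct"] = np.polyval(coefs, pitchers["gb_pct"])

    # The residual is how far a pitcher diverges from his expected
    # popup rate -- the quantity whose year-to-year consistency is
    # tested below.
    pitchers["pu_resid"] = pitchers["pu_pct"] - pitchers["xpu_pct"]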

So it seems that each player has a dynamic expected popup rate based on his groundball percentage. Now the real question is: do players’ popup rates diverge from their expected popup rates consistently on a year-to-year basis?

If you look at the above chart, you’ll see that there’s not much consistency from year to year. The correlation is about .18, which pretty much agrees with Mitchel Lichtman’s finding of .14 as quoted in David Gassko’s Batted Ball DIPS article. For comparison’s sake, BABIP has a year-to-year correlation of .15.

So what does this mean for popups as a “skill”? I’d say they are sort of a skill, one closely tied to groundball percentage, but from the findings above, that’s about as far as I’d go. While there may be certain pitchers who prove to be exceptions (just as there are exceptions with BABIP), popups in general do not seem to be much of an independent skill.

All batted ball data is from Baseball Info Solutions from 2006-2009 for pitchers with at least 100 innings pitched.


Infield Fly Balls and xFIP

Today I saw a couple of gripes around the Internets about xFIP and how infield fly balls are not taken into account. On FanGraphs, overall fly-ball percentage is used to calculate a pitcher’s “normalized” home run rate.

This got me thinking about David Gassko’s Batted Ball DIPS article from five years ago where he writes the following about infield fly balls:

Infield flies per ball in play actually have a slight negative correlation with outfield flies per ball in play. Inducing infield flies is a skill, and while it correlates somewhat weakly year-to-year (Lichtman found an “r” of .140), a small subset of pitchers exhibits clear control over the percentage of their fly balls that are infield pop ups. I would encourage studies looking into who those pitchers are—one thing I have noticed is that extreme ground ball pitchers allow fewer than expected infield fly balls.

What I believe is actually going on here is that fly-ball pitchers in general have higher infield fly-ball rates as measured by Baseball Info Solutions. The repeatability of infield fly balls is basically just a side effect of a pitcher’s total fly-ball rate. Looking at all pitchers from 2006-2009, here’s what you get when you bucket FB% in increments of 5%:

FB% Bucket     IFFB%    HR/FB%   HR/OFFB%
< 25%          7.1%     11.1%     11.9%
25% - 29%      7.8%     10.9%     11.7%
30% - 34%      8.9%     10.2%     11.2%
35% - 39%      9.7%     10.2%     11.3%
40% - 44%     10.5%     10.0%     11.2%
45% - 49%     11.6%      9.8%     11.0%
>= 50%        12.2%     10.0%     11.4%
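For anyone who wants to replicate the bucketing, here’s a sketch. The file and column names are my own: fb_pct, iffb_pct, hr_fb, and hr_offb stand for total fly-ball rate, infield flies per ball in play, HR per fly ball, and HR per outfield fly ball.

    import pandas as pd

    df = pd.read_csv("pitchers_2006_2009.csv")  # hypothetical extract

    # Left-closed 5% buckets matching the table above.
    bins = [0, .25, .30, .35, .40, .45, .50, 1.0]
    labels = ["< 25%", "25% - 29%", "30% - 34%", "35% - 39%",
              "40% - 44%", "45% - 49%", ">= 50%"]
    df["fb_bucket"] = pd.cut(df["fb_pct"], bins=bins, labels=labels,
                             right=False)

    print(df.groupby("fb_bucket")[["iffb_pct", "hr_fb", "hr_offb"]].mean())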

So, while it’s pretty clear that overall FB% is impacting IFFB%, I’m not sure things are quite so obvious with home runs. It seems to me that home runs per total fly ball plateau at about 10% starting in the 30%-plus range. And for home runs per outfield fly ball, things look pretty similar, except everything is about one percentage point higher because of the removed IFFBs.

So getting back to xFIP, does it really matter whether or not you exclude popups? The answer is: not really. You’re going to get almost the same results, because HR/OFFB on average behaves more or less the same way as HR/FB. In fact, the correlation between xFIP calculated with OFFB and xFIP calculated with total FBs is .996. The two, in practice, are virtually identical.
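If you want to see how small the difference is, here are the two variants side by side as a sketch. I’m reusing the standard FIP structure with an expected-HR term; the league rates and the 3.2 constant here are illustrative, not the exact values used on the site.

    # xFIP with total fly balls: expected HR = FB * league HR/FB.
    def xfip_fb(bb, hbp, ibb, k, ip, fb, lg_hr_fb=0.105, const=3.2):
        xhr = fb * lg_hr_fb
        return (13 * xhr + 3 * (bb + hbp - ibb) - 2 * k) / ip + const

    # xFIP with outfield flies only: popups removed before normalizing.
    def xfip_offb(bb, hbp, ibb, k, ip, fb, iffb, lg_hr_offb=0.113,
                  const=3.2):
        xhr = (fb - iffb) * lg_hr_offb
        return (13 * xhr + 3 * (bb + hbp - ibb) - 2 * k) / ip + const

Run both over the same set of pitchers and correlate the results, and you get the .996 mentioned above.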

However, when you bucket the data like this, one thing does seem clear: when an extreme groundball pitcher induces a fly ball, there’s a slightly greater chance it will end up a home run. I think it would be particularly interesting to look at the run values of different batted-ball types for different buckets of fly-ball pitchers, but I’ll have to leave that for another time.


Strasburg and PitchFx Pitch Types

As I was poking around at Stephen Strasburg’s most recent start in our pitchf/x pages, I noticed that MLBAM was classifying one of his pitches as a two-seam fastball, which I recall was not the case in his first start a week ago. So I went back to check his first start and, lo and behold, a number of his four-seamers had been reclassified as two-seamers (and a couple as curveballs and changeups).

This correction seems to agree with this note from J-Doug over at Beyond the Box Score:

*Note: Several commenters and analysts (such as Tim Kurkjian) have noted that Strasburg throws both a four-seamer and a two-seamer (or what Strasburg calls a ‘one-seamer’). This makes sense considering the break on his fastballs. However, MLBAM doesn’t yet have enough data (I assume) to separately classify these two pitches, so they both came through as four-seamers. I’m going to rely on MLBAM’s estimation for now, since that’s where the data came from, but feel free to read everything that is labeled “four-seamer” as just plain “fastball.”

And it also seems to match up pretty well with Nick Steiner’s own pitch classifications.

I don’t have anything in particular to note about the pitches that changed classification, but it is important to note that pitchf/x data is retroactively updated as the pitch-classification algorithms are adjusted for each individual pitcher.


Some More on O-Swing%

Yesterday, Joe Pawlikowski noted that there seems to be an increase in the overall O-Swing% so far this season, which led to some questions about whether the strike zone was being measured consistently from season to season.

Over the course of the nine years Baseball Info Solutions has plotted pitches, there have been some not-so-small changes in the average O-Swing%. Over the past three years the numbers have been stable, but this season O-Swing% seems to be up about 3%. This can make raw O-Swing% a difficult stat to compare from year to year, because the baselines can be somewhat different.

However, when looking at a player’s O-Swing% above average, there is a very strong correlation from year to year, and this continues to be the case for the 2010 data. In other words, a typical player’s “plate discipline” does not change much from season to season. Here are the recent year-to-year correlations for players with more than 50 plate appearances, with BB% and Pitches/PA shown for comparison:

Seasons          O-Swing% AA    BB%    Pitches/PA
2009 – 2010          .74        .56       .66
2008 – 2009          .74        .64       .68
2008 – 2010          .68        .53       .57
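Here’s roughly how the baseline adjustment works, as a sketch. One row per player-season; the column names are mine, and in practice you’d probably want to weight the league average by pitches seen.

    import pandas as pd

    df = pd.read_csv("plate_discipline.csv")  # player, season, o_swing

    # Subtract each season's league average so the baselines line up.
    df["osaa"] = df["o_swing"] - \
        df.groupby("season")["o_swing"].transform("mean")

    # Year-to-year correlation for players appearing in both seasons.
    pair = df[df.season == 2009].merge(df[df.season == 2010],
                                       on="player",
                                       suffixes=("_09", "_10"))
    print(pair["osaa_09"].corr(pair["osaa_10"]))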

As you can see, even compared to something as seemingly stable as Pitches/PA, O-Swing% is definitely more stable from year to year once you adjust for the baseline.

So the lesson here is that the league-average O-Swing% is important to take into consideration. We’ll be adding O-Swing% Above Average (OSAA for short) to our repertoire of stats starting tomorrow, which will make life somewhat easier when comparing a player’s O-Swing% from season to season.


FIP for Hitters? Defense Independent Offense

While writing on the “three true outcomes” (walk, strikeout, and home run) leaders and trailers from 2007-2009, I was reminded of a toy idea I’d had earlier: to create something like FIP (Fielding Independent Pitching), using the same basic components, but for hitters. I finally got around to doing it recently, and the results were interesting. I’m not saying this is any more than a junk stat. But it might be interesting, who knows?

* You want real sabermetric research? Read Matthew Carruth, Dave Allen, or one of the many other intelligent researchers and writers here and elsewhere. Trying to waste time at work? You came to the right place. Tom Tango may have created wOBA and FIP, but this is a stat that gives me joy.

The basic formula for FIP is ((HR*13+(BB+HBP-IBB)*3-K*2)/IP) + 3.2, where “3.2” is a season/league-specific factor that puts the league FIP on the same scale as the league ERA. To make it suitable for hitters, I made a couple of minor modifications: 1) I scaled it to RA rather than ERA; the RA scale factor for 2009 MLB was 3.52. 2) For IP I used outs made by the hitter (divided by 3 to get on the IP scale): AB-H+SF+SH+GIDP (I left out CS because I want to deal only with the pitcher/hitter matchup). Ladies and gentlemen, I present the formula for Defense Independent Offense, or DIO:

((HR*13+(BB+HBP-IBB)*3-K*2)/(Outs/3)) + 3.52.
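Or, as a function — a minimal sketch, with 3.52 being the 2009 RA scale factor from above:

    def dio(hr, bb, hbp, ibb, k, ab, h, sf, sh, gidp, ra_scale=3.52):
        """Defense Independent Offense for a single hitter's line."""
        outs = ab - h + sf + sh + gidp   # CS deliberately excluded
        return (13 * hr + 3 * (bb + hbp - ibb) - 2 * k) / (outs / 3) \
            + ra_scale

    # A made-up slugger line, just to show the scale of the output:
    print(dio(hr=40, bb=90, hbp=5, ibb=15, k=120,
              ab=550, h=160, sf=5, sh=0, gidp=15))   # ~7.3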

Who (among qualifying hitters) led the league in DIO for 2009? Remember that for hitters, a higher number will be better.

1. Albert Pujols, 9.18
2. Prince Fielder, 8.66
3. Adrian Gonzalez, 8.55
4. Alex Rodriguez, 8.32
5. Carlos Pena, 8.31
6. Adam Dunn, 8.11

So far, so good: those are some great hitters. Here are the trailers:

150. Yuniesky Betancourt, 4.26
151. Michael Bourn, 4.12
152. Randy Winn, 4.03
153. Cristian Guzman, 3.92
154. Emilio Bonifacio, 3.73

Some of these names — Betancourt, Winn, Bonifacio — aren’t surprising. But what about Michael Bourn, for example? Didn’t he have a decent season at the plate in 2009? Hold on to that thought.

Just as a player’s wOBA can be compared with the league wOBA to give the player’s runs created above average (wRAA), we can compare a player’s DIO with the league’s runs per game (4.61 in 2009) to produce DRAA: (DIO – lgR/G) * (Outs/27).* Here are the 2009 leaders in DRAA, with their wRAA figures for the sake of comparison.

* One can also calculate absolute runs created (wRC) with DIO * (Outs/27).
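In code, the two conversions look like this (again a sketch, extending the dio function above):

    def draa(dio_val, outs, lg_rg=4.61):
        # Runs above average: the DIO gap times games' worth of outs.
        return (dio_val - lg_rg) * (outs / 27)

    def dio_rc(dio_val, outs):
        # Absolute runs created, per the footnote above.
        return dio_val * (outs / 27)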

1. Albert Pujols 69.9 DRAA, 69.7 wRAA
2. Prince Fielder 65.6 DRAA, 54.9 wRAA
3. Adrian Gonzalez 62.2 DRAA, 41.5 wRAA
4. Mark Teixeira 53.5 DRAA, 42.9 wRAA
5. Adam Dunn 53.2 DRAA, 35.9 wRAA

The Pujols figures are almost dead-on, and given the crudeness of DIO, Fielder and Teixeira aren’t that far off, but Gonzalez and Dunn seem to be quite overrated by DIO-runs. The general “in the neighborhood-ness” isn’t that surprising, given that FIP (and thus DIO) is based on linear weights of the relevant events, and wOBA is just linear weights expressed as a rate stat. But what about the discrepancies? Does this perhaps mean we should be rethinking wOBA/wRAA in favor of my awesome new offensive metric, or at least use it more prominently, just as FIP is generally favored (around here) over ERA?

In a word: no. Going back to the origins of DIPS theory: pitchers generally have little control over balls in play, and thus DIPS, FIP, tRA, etc. are attempts to remove the defense-dependent elements from pitcher evaluation. However, while BABIP generally has less year-to-year correlation for hitters than, say, walk rate, it correlates far better for hitters than it does for pitchers. That is why traditional linear weights (like wRAA) are preferable for hitters. DIO systematically underrates hitters like Michael Bourn not only because it ignores steals, but because it assumes that a player’s contributions on balls in play are league-average, whereas Bourn’s contributions in those areas are well above average. DIO also badly underrates hitters like Joe Mauer (40.5 DRAA vs. 54.9 wRAA in 2009) and Ichiro Suzuki (-2.2 DRAA vs. 22.6 wRAA), while overrating (still very good) hitters like Adrian Gonzalez and Adam Dunn.

DIO does have its interesting aspects. It highlights how many good hitters get most of their value from hitting home runs and walking, for example. There is also much to be said for using a rate stat baselined against outs rather than PA (I wouldn’t go so far as to make the mistake of generating a DIO-based Offensive Winning Percentage, although it was tempting). For me, it was worth it just to walk through the exercise and see how well the stat did in ranking hitters. Most of all, it was a good reminder of the difference in BABIP as a skill for hitters relative to pitchers. Without reminders like these, I’d be left on my own, like a rainbow in the dark.


Fan Projections: Not All Fans Agree

As you might have guessed, not all baseball fans agree when it comes to evaluating players, and the Fan Projections are a great example of just how many different opinions there are about various players.

If you look at each player’s projected wOBA and the spread of the individual fan projections, you get a standard deviation of about .017 on average. What this means is that, assuming the fan projections follow a normal distribution (which may not be the case), about 68% of the individual projections fall within +/- .017 of the fan-average wOBA. Over 600 plate appearances, that works out to about +/- 8.5 runs above average (wRAA).
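For those wondering where the 8.5 comes from: divide the wOBA gap by the wOBA-to-runs scale (roughly 1.2 in this era, though the exact value is season-specific) and multiply by plate appearances.

    woba_scale = 1.2          # approximate; season-specific in practice
    sd_woba = 0.017
    pa = 600
    print(sd_woba / woba_scale * pa)   # ~8.5 runs (wRAA)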

It’s particularly interesting to see just which players fans happen to be more or less in agreement about, so here are the top 10 regular players with at least 50 votes where the fans agree the most:

                  wOBA      Std   Votes
Adam LaRoche      .350     .009      65
Brian Roberts     .358     .010      65
Aaron Hill        .348     .011      77
Juan Pierre       .316     .011     107
Bobby Abreu       .365     .012      89
David Wright      .396     .012     153
Matt Holliday     .396     .013      99
Todd Helton       .388     .013      51
Felipe Lopez      .333     .013      71
Ryan Zimmerman    .376     .013      81

With these players, people seem to have a pretty good idea of what to expect. Keep in mind that every additional .001 of wOBA ends up as an extra .5 runs above average over 600 plate appearances.

Here are the top 10 players people disagree on the most:

                  wOBA      Std   Votes
Shane Victorino   .350     .024      65
David Ortiz       .362     .024     105
Ryan Howard       .388     .023     107
Pablo Sandoval    .384     .022     113
Alex Gordon       .364     .021      49
Alex Rodriguez    .418     .020     263
Jimmy Rollins     .342     .020      92
Justin Upton      .393     .019     102
Adrian Beltre     .345     .019     151
Curtis Granderson .376     .019     137

With all of these guys – just on batting alone, not even looking at defense or playing time – you’re looking at at least a +/- 1 win difference within one standard deviation, assuming a minimum of 600 plate appearances.

As the ballots keep coming in, we’ll continue to look at the Fan Projections in various ways. There really is a wealth of data in these projections that goes beyond just what goes into the single projection line on the player pages and hopefully we’ll all be able to learn a lot from them.


Were the Yankee Sac Bunts in the 8th Inning Correct?

The answer to that question is complicated. There is no easy yes or no answer, and that is not so much because there are so many variables whose values we don’t know; it has to do with game theory. Oh, in case you didn’t watch the 6th game of the ALCS or you forgot: in the 8th inning, with the Yankees up 3-2, Swisher bunted with a runner on first (and no outs, of course), and when he reached on an error, Melky bunted with runners on first and second.

Many people, including those who are sabermetrically inclined, typically decry the sacrifice bunt – why give away outs? The conventional (and lazy) sabermetric wisdom used to be that sac bunt attempts were almost always incorrect – at least ever since The Hidden Game of Baseball told us so, and legions of sabermetric fans and even sabermetricians looked at run expectancy (RE) and win expectancy (WE) tables and noticed that the game state after an out and a base-runner advance was worse than the state before – hence, the sac bunt is wrong.

The problem, of course, is that this is a ridiculously simplistic way to answer the question, on two fronts. One: the WE or RE before and after a “successful” sac bunt, using a standard table, is based on an average batter in an average lineup against an average pitcher and defense in an average stadium on an average spring day. At least some analysts recognized that in different contexts those numbers would have to be revised. However, most of them also noted that the gap between the “before” and “after” states was so large (in favor of the “before” state – which assumes hitting away most of the time) that it would take an enormously bad hitter – like a pitcher – to make it correct to bunt. They would basically be right.

Now, there is a more important and pertinent reason why looking at RE and WE charts and comparing the “before” and “after” numbers does not help you answer whether a sac bunt (by a non-pitcher) is correct in any given situation. A sac bunt attempt obviously does not lead to an out and a base-runner advance 100% of the time (or even close to it); in fact, the average result of a sac bunt attempt is not even equivalent to an out and a base-runner advance. The average result also varies a lot with the speed and bunting skill of the batter, and with whether (and by how much) the defense is anticipating the bunt (among other things).
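To make that concrete, here’s an illustrative sketch. The run-expectancy numbers are stand-ins of roughly the right magnitude (not entries from a real table), and the outcome mix for a bunt attempt is invented purely for illustration:

    # Stand-in run-expectancy values, keyed by (bases, outs).
    RE = {
        ("1B", 0):  0.88,   # runner on first, no outs
        ("2B", 1):  0.69,   # runner on second, one out
        ("12B", 0): 1.51,   # first and second, no outs (bunt hit or error)
        ("1B", 1):  0.52,   # runner on first, one out (failed attempt)
    }

    # The naive comparison: a "successful" sac looks like a clear loss.
    print(RE[("2B", 1)] - RE[("1B", 0)])       # about -0.19 runs

    # But an attempt is a distribution over outcomes. Invented mix:
    attempt = {
        ("2B", 1):  0.70,   # clean sacrifice
        ("12B", 0): 0.15,   # batter reaches via hit or error
        ("1B", 1):  0.15,   # out recorded, runner fails to advance
    }
    ev = sum(p * RE[state] for state, p in attempt.items())
    print(ev - RE[("1B", 0)])                  # about -0.09 runs

Even with made-up numbers, the gap shrinks once the full distribution of outcomes is counted, and it would shrink further with a fast, skilled bunter or a defense playing back.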

Read the rest of this entry »


Turning Two

Double plays are called the pitcher’s best friend for good reason. I think we’re all familiar with the huge swing in win expectancy that takes place when a pitcher wiggles out of a one-out, bases-loaded jam with an inning-ending double play. And turning two is a skill for infielders; some are clearly better at it than others. It takes talent for a shortstop to field the ball quickly and cleanly and transfer it to the second baseman, who must then pivot and throw quickly and accurately to first base… you get the idea. Your routine 6-4-3 double play is probably a lot harder than it looks.

One of the components of UZR for infielders is DPR, or double play runs. It is simply (and I’m quoting word for word from the site glossary) “the number of runs above or below average a fielder is, based on the number double plays versus the number forces at second they get, as compared to an average fielder at that position, given the speed and location of the ball and the handedness of the batter.”

I am definitely the wrong person to get into the nitty-gritty details of such things, but I can sort through leaderboards with the best of them. I wanted to look at some of the leaders and laggards among the keystone combos. One note before we jump in (and someone correct me if I’m mistaken): it appears that a typical shortstop or second baseman maxes out at about plus or minus three runs in starting or pivoting double plays. In other words, the difference between a very good middle infielder and a very bad one is really only about ten double plays a year. So the ability to turn a double play can be pretty overrated; range is much, much more important.
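The back-of-the-envelope behind that ten-DP figure, with an assumed marginal value (my number, not the glossary’s): if turning a double play instead of recording the single force out is worth roughly 0.6 runs, then:

    spread_runs = 3 - (-3)    # best-to-worst DPR spread, per the text
    runs_per_dp = 0.6         # assumed marginal value of a DP over a force
    print(spread_runs / runs_per_dp)   # = 10 extra double plays a year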

Your 2009 Top DP Combo thus far:

Jack Wilson and Freddy Sanchez. Hey, we talked about these two yesterday. Wilson has been worth 1.7 DP runs, Sanchez 1.4, for a total of 3.1 runs saved in turning the double play. Compare this to…

Your 2009 Worst DP Combo:

Hanley Ramirez and Dan Uggla: These two are a pitcher’s worst enemy. Hanley Ramirez has been a -1.4, Dan Uggla an ugly -2.2. That’s -3.6 runs for those of you scoring at home.

Getting back to Jack Wilson for a moment: FanGraphs has UZR data dating back to 2002, and Wilson is by far the leader in double play runs over that span with +15.6. Michael Young has been the worst at -7.8, and he wasn’t even moved to shortstop full-time until 2004.

A word about Dan Uggla — the man is in some sort of DP slump, as in the three seasons prior (2006-2008) he led all second basemen with +6 runs. In fact, his ability to turn the DP is what salvaged his defensive value: DPs aside, Uggla was a -6.7 UZR over those seasons. Brian Roberts was the worst second baseman at -6.1 DPR. Roberts was worth 7.7 UZR before factoring in DPR, so his lack of ability to turn two offset the other defensive value he added. He’s the anti-Uggla.

Finally, the best keystone combo from 2006 until now has been Yuniesky Betancourt and Jose Lopez of the Mariners, who have combined for +9.8 runs, or a full win. The fact that it took the top DP combo three and a half seasons to total a whole win drives home the point that while the ability to turn two is important, it is not nearly as important as we might have thought. That Yuniesky has been so brilliant at DPs and yet so bad at everything else is also a reminder that range is waaaay more important.


When Samples Become Reliable

One of the most difficult tasks a responsible baseball analyst must take on is avoiding the use of small samples of data to make definitive claims about a player. If Victor Martinez goes 4-for-10, it does not automatically make him a .400 hitter. We have enough information about Martinez from previous seasons to know that his actual abilities fall well short of that mark. Not everything, however, should merit a house call from the small-sample-size police, because some stats stabilize more quickly than others. Additionally, a lot of small-sample-size criticisms stem from the actual usage of the information, not the information itself. If Pat Burrell struggled mightily after the All-Star break last season and started this season with similarly poor numbers, we can infer that his skills may be eroding. Isolating these two stretches can prove inaccurate, but taking them together offers some valuable information.

The question asked most often with regard to small sample sizes is essentially: when are the samples not small anymore? As in, at what juncture does the data become meaningful? Martinez at 4-for-10 is meaningless. Martinez at 66-for-165, like he is right now, tells us much, much more, but still is not enough playing time. What are the plate-appearance benchmarks at which certain statistics become reliable? Before giving the actual numbers, let me point out that the results are from this article by a friend of mine, Pizza Cutter, over at Statistically Speaking. Warning: that article is very research-heavy, so you must put on your 3D Nerd Goggles before journeying into the land of reliability and validity. Also, Cutter mentioned that he would be able to answer any methodological questions here, so ask away. Half of my statistics background is from school or independent study and the other half is from Pizza Cutter, so do not be shy.

Cutter basically searched for the point at which split-half reliability tests produced a correlation of 0.70 or higher. A split-half reliability test involves finding the correlation between partitions of one dataset: for instance, taking all of Burrell’s even-numbered plate appearances, separating them from the odd ones, and then correlating the two halves. When the halves are very similar, the data is more reliable. Though a 1.0 correlation indicates a perfect relationship, 0.70 is the usual benchmark in statistical studies, especially in baseball, where DIPS theory was derived from correlations of lesser strength. Without further delay, here are the results of his article as far as when certain statistics stabilize for individual hitters:

 50 PA: Swing %
100 PA: Contact Rate
150 PA: Strikeout Rate, Line Drive Rate, Pitches/PA
200 PA: Walk Rate, Groundball Rate, GB/FB
250 PA: Flyball Rate
300 PA: Home Run Rate, HR/FB
500 PA: OBP, SLG, OPS, 1B Rate, Popup Rate
550 PA: ISO

Cutter went to 650 PA as his max, meaning that the exclusion of statistics like BA, BABIP, WPA, and context-neutral WPA indicates they did not stabilize by that point. So, here you go; I hope this assuages certain small-sample misconceptions and provides some insight into when we can discuss a certain metric from a skills standpoint. There are certain red flags with an analysis like this, primarily that playing time is not assigned randomly, and by using 650 PA a chance exists that selection bias may shine through, in that the players given this many plate appearances are the more consistent players. Cutter avoids the brunt of this by comparing players to themselves. Even so, these benchmarks are tremendous estimates at the very least.
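For the methodologically curious, here is a bare-bones sketch of a split-half test for a single stat, given a plate-appearance-level log. The file and column names are mine, and the Spearman-Brown step at the end — which projects the half-sample correlation back to the full sample — is a standard move I’m assuming, not a quote of Cutter’s exact procedure.

    import pandas as pd

    pa = pd.read_csv("pa_log.csv")   # player, pa_number, swung (0/1)

    # Even vs. odd plate appearances for each player.
    pa["half"] = pa["pa_number"] % 2
    rates = pa.pivot_table(index="player", columns="half",
                           values="swung", aggfunc="mean")

    r_half = rates[0].corr(rates[1])       # split-half correlation
    r_full = 2 * r_half / (1 + r_half)     # Spearman-Brown correction
    print(r_half, r_full)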


Projection vs Projection

It’s almost opening day, and it seems like everyone is talking about projections.

When considering a projection, there are really two questions to be answered: what is the player’s “True Talent Level” right now, and how will he perform next year? Between now and the end of next year, his talent level very well might change, as he’s a year older and might recover from or succumb to injuries. Even then, there’s still the random variance of a single-season performance. In this article I’d like to explore how some of the major projection systems fare when predicting different subgroups of players.

I tested the following projections: PECOTA (2006-2009), ZiPS (2006-2009), CHONE (2007-2009), and my own Oliver (2006-2009).

By wOBA

The first test was to group the yearly projections to the nearest .010 of wOBA and then see how each group of players actually performed. There were 468 players who had projections from all four systems and at least 350 plate appearances in the major leagues in the following season. As 2009 is yet to be played, and CHONE is not available for 2006, these projection-to-next-season comparisons are for the 2007 and 2008 seasons. All four systems were tested on the same 468 players. The observed results were unadjusted major league stats, so that the test would not be influenced by which park factors or MLE formulas I chose to normalize stats.

To read the results: CHONE projected a group of players to have a wOBA between .375 and .385, averaging .380; 25 of them had 350 or more PAs in MLB in the following season, and those 25 players had an average wOBA of .363, so at that level CHONE was .017 high. Oliver was .008 high on 21 projections, PECOTA .027 high on 26, and ZiPS .014 high on 26. The last line of the table shows the root mean square error (weighted by number of players): Oliver had the lowest mean error at .006, followed by CHONE at .011, with PECOTA and ZiPS at .012 each.
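Here’s a sketch of the test itself. The column names are mine: proj holds one system’s projected wOBA, and actual holds the following season’s unadjusted wOBA for the 350+ PA players.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("projections_vs_actual.csv")  # hypothetical extract

    # Group projections to the nearest .010 of wOBA.
    df["bucket"] = (df["proj"] / 0.010).round() * 0.010

    g = df.groupby("bucket").agg(n=("actual", "size"),
                                 proj=("proj", "mean"),
                                 actual=("actual", "mean"))
    g["error"] = g["proj"] - g["actual"]

    # Root mean square error, weighted by the players in each bucket.
    rmse = np.sqrt((g["n"] * g["error"] ** 2).sum() / g["n"].sum())
    print(g)
    print(rmse)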

Read the rest of this entry »