A Primer on a New and Improved KATOH
by Chris Mitchell
November 25, 2015

My role here at FanGraphs is to write about minor league players. Nearly all of my articles focus on the output from my KATOH projection system, which produces long-term forecasts for players who are still in the minor league phase of their careers. Today, I’m unveiling some updates to my model that will be reflected in my analysis from this point forward.

I’ve been meaning to work these updates into KATOH for quite some time, but haven’t had the chance to finish up until now. Some pieces of this took a bit longer than expected, and day job stuff along with this year’s onslaught of prospect debuts pushed things to the backseat a bit. But I’m all caught up now and ready to unveil my new and improved KATOH. Here we go!

Rather than just putting out a straight leaderboard, I thought I’d use this as an opportunity to explain some of the inner workings of KATOH. I wanted to say something more insightful than “These are the best prospects because math.” That’s why this piece runs 2,000+ words without reference to a single baseball player. If you’re just interested in the output rather than the nitty-gritty, check back after Thanksgiving for KATOH’s top 100 list. I just wanted to get all of this background stuff down in one place, rather than cluttering future pieces with extra information.

*****

Obligatory Technical Details

The general framework of my model is largely the same as it’s always been. As I did in the past, I deployed a series of probit regressions to see what factors are most predictive of major league performance. For each player, I generated probabilities that he would achieve certain benchmarks through his age-28 season: play in the major leagues, earn at least 1 WAR, earn at least 2 WAR, etc. These percentages gave me a probabilistic outlook for each player, and enabled me to calculate an “expected value” for his WAR through age 28.
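
For readers who want to see the mechanics, here is a minimal sketch of how a set of threshold probabilities could be rolled up into an expected WAR. It is an illustration under stated assumptions, not KATOH’s actual method: the thresholds, the probabilities, the per-bucket WAR values and the function names are all made up for the example.

```python
from statistics import NormalDist


def probit_probability(linear_score: float) -> float:
    """Probit link: map a regression's linear score to a probability
    using the standard normal CDF."""
    return NormalDist().cdf(linear_score)


def expected_war(exceed_probs, bucket_values):
    """Combine threshold probabilities into an expected WAR.

    exceed_probs[i]  -- P(player clears threshold i), thresholds ascending
    bucket_values[i] -- assumed typical WAR for a player who clears
                        threshold i but not threshold i + 1 (hypothetical)
    """
    if len(exceed_probs) != len(bucket_values):
        raise ValueError("need one bucket value per threshold")

    total = 0.0
    for i, value in enumerate(bucket_values):
        next_prob = exceed_probs[i + 1] if i + 1 < len(exceed_probs) else 0.0
        # Separately fitted models are not guaranteed to be monotone,
        # so clip any negative bucket probability to zero.
        bucket_prob = max(exceed_probs[i] - next_prob, 0.0)
        total += bucket_prob * value
    return total


# Hypothetical thresholds: reach MLB, >= 1 WAR, >= 2 WAR, >= 4 WAR.
example_probs = [0.60, 0.30, 0.20, 0.10]
example_values = [0.2, 1.5, 3.0, 6.0]
print(round(expected_war(example_probs, example_values), 2))  # 1.11
```

The idea is simply that each “at least X WAR” probability, minus the next one up, gives the chance of landing in that bucket, and the expected value is the probability-weighted sum across buckets.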
People often ask me why I choose to use a probit regression instead of the more well-known logistic (logit) model. With my original iterations of KATOH, it was simply a matter of familiarity. I come from an economics background, and probit regressions are used fairly often in that field. But this time, I tested out my data using models built with both distributions, and found that probits generally gave me lower Akaike information criterion (AIC) values. To be frank, there’s very little difference between the two distributions in practice. If you really want to nerd out on the differences between them, I’d recommend this Stack Exchange thread.

To guard against overfitting, I backtested my models by testing out of sample. In other words, I trained my prospective models using data through 2000, and then tested them on prospects from after 2000. This ensured that I was building a model that’s optimized to project future prospects, rather than one that’s optimized to project past prospects.

*****

Notable Additions

Component Stats Regressed to League Average Based on Sample Size

In the previous iterations of KATOH, I simply took players’ raw stats, centered them to league average and fed them into the regression machine. This worked fine in most cases, but resulted in a few blind spots. In particular, it put a disproportionate amount of weight on extreme performances that likely resulted from small sample noise, such as high BABIPs for hitters.

This time around, I regressed each individual statistic to league average depending on how quickly it stabilizes. Compared to its predecessor, this KATOH is less apt to “believe” in a performance if it comes over a relatively small number of plate appearances, especially if that performance is driven by things like BABIP luck.

Hitter’s Defensive Position

This particular update was a very long time coming. All of my previous KATOH models only took into account hitters’ offensive stats, and ignored the defensive side of things.
To some extent, a hitter’s offensive stats acted as a proxy for his defense, particularly his stolen base numbers. But that proxy was far from perfect, and as a result, there was a lot of noise in the data. Getting my hands on any sort of minor league defensive data in a readable format (especially for the years predating the FanGraphs database) wasn’t as easy as one would think. But I managed to track down what I needed, and after some hardcore data cleaning, it’s finally part of my model.

With all else being equal, the defensive positions generally rank as follows, from most favorable to least favorable:

Catcher
Shortstop
Outfield
Third Base
Second Base
First Base

You probably noticed that I lumped all outfielders into one position, rather than distinguishing between left field, center field and right field. This is because my minor league fielding data did not consistently distinguish between the three outfield positions until a few years ago. Since I didn’t have a lot of data to tell me how much credit a player should receive for playing center field rather than left or right, I did not attempt to break them up in my model.

I should also note that while KATOH takes into account a player’s defensive position, it doesn’t consider his ability at said position. So the best defensive shortstop in the minors (whoever that might be) would get credit for being a shortstop, but not for being an excellent defensive shortstop. KATOH treats him as it would any other shortstop at the same level. I realize that this is a flaw, and it’s something that’s on my radar. But as of right now, I haven’t had the necessary combination of time and data to work this in. So just keep that in mind for prospects who fall at either end of the spectrum (good or bad) at their respective positions.

Pre-Professional Background

Out of all of my updates, adding in defensive positions had the largest impact on the forecasts. However, I’d also argue this addition was the least surprising one.
Even if we weren’t quite sure how to quantify it, we all knew that playing a premium position gave players an edge. The most surprising finding (to me, at least) was that a player’s pre-professional background can also help us predict his future.

Essentially, given the same minor league stat line, a hitter drafted out of college is much more likely to succeed in the big leagues than a high school draftee; and a high school draftee is slightly more likely to succeed than a player from the Dominican Republic or Venezuela. I think this is due, at least in part, to the amount of professional experience these hitters have under their belts. A 21-year-old Dominican-born hitter might have four or five years of professional experience. A 21-year-old college hitter, on the other hand, is completely new to the minor league lifestyle and might be facing competition far better than he was on the college circuit a few months ago. This was one of the things I looked at in my piece for this year’s Hardball Times Annual, so you can read more about this phenomenon once you get your copy.

College pitchers also have an edge over their high school counterparts, but it’s actually the Caribbean-born arms that come out on top. My hypothesis is that “Caribbean-born” acts as a proxy for velocity here. It’s my guess that Caribbean-born minor league pitchers probably tend to throw harder than their American-born counterparts. Many of them wouldn’t have been signed otherwise.

Player Height

Another interesting addition is player height, which immediately becomes the closest thing to scouting that’s incorporated into KATOH. Given the same stat line, taller players are generally more likely to succeed in the majors than their shorter counterparts. Baseball players’ height data aren’t perfect, of course, as teams and/or players certainly fudge these numbers from time to time. Furthermore, players’ bodies are always changing, even in the height department.
We’re sometimes dealing with kids as young as 16 or 17, remember, and height isn’t tracked from one year to the next the same way that strikeouts are. Nonetheless, these likely flawed data add a small amount of predictive value; and bigger tends to be better for both hitters and pitchers.

*****

How KATOH Thinks

Forecasting minor leaguers involves a series of trade-offs. It’s obviously not a good thing if a hitter strikes out often, but it’s less of a concern if he also hits for power. Similarly, a hitter with weak offensive skills is more acceptable if he’s a plus defender or happens to be very young for his level. The same applies for pitchers: more strikeouts are good, but a strikeout pitcher with a home run or walk problem is less likely to succeed in the majors.

To help you internalize how these trade-offs work, I put together the graphics below. I generated them by fiddling around with inputs for current well-regarded prospects and taking down the changes in their forecasted WAR through age 28. They estimate how a change in a single metric influences a player’s KATOH forecast across each of the four levels of full-season ball. The data are most reliable at those levels due to sample size and proximity to the major leagues.

For hitters…

[Graphic: estimated change in forecasted WAR through age 28 from a change in a single metric, by full-season level, for hitters]

And for pitchers…

[Graphic: estimated change in forecasted WAR through age 28 from a change in a single metric, by full-season level, for pitchers]

These aren’t hard and fast rules, as KATOH is more complex than what you see above. I prioritized accuracy over interpretability when building my models, which obviously makes interpretation a bit complicated. For example, a change in strikeout rate doesn’t always move a player’s projection by the same amount, but affects each player a bit differently. I transformed many of KATOH’s variables to something other than linear, so it’s more complicated than “X widgets of walk rate leads to Y widgets of WAR.” Still, even if it’s an oversimplification, the above graphic can hopefully give you a general sense of how KATOH thinks.
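
One part of the “same change, different effect” behavior comes from the probit link itself, before any variable transformations enter the picture. The sketch below illustrates only that piece: the coefficient, the baseline scores and the function name are hypothetical, and it is not a reproduction of KATOH’s actual specification.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF, i.e. the probit link


def probability_shift(baseline_score, coefficient, delta_x):
    """Change in predicted probability when one input moves by delta_x,
    for a player whose linear score starts at baseline_score."""
    before = PHI(baseline_score)
    after = PHI(baseline_score + coefficient * delta_x)
    return after - before


# Hypothetical numbers: a strikeout-rate coefficient of -4.0 and a
# five-percentage-point increase in strikeout rate.
K_COEF = -4.0
K_BUMP = 0.05

coin_flip = probability_shift(0.0, K_COEF, K_BUMP)   # starts near 50%
long_shot = probability_shift(-2.0, K_COEF, K_BUMP)  # starts near 2%
print(f"{coin_flip:+.3f} vs {long_shot:+.3f}")
```

The identical strikeout bump costs the coin-flip prospect far more probability than it costs the long shot, because the normal CDF is steepest in the middle and flat in the tails.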
Here are some takeaways in the form of easily digestible one-liners:

Strikeout rate matters a lot for hitters, especially when they’re in the low minors.
Walk rate matters very little for hitters, especially when they’re in the low minors. For hitters in Rookie ball, I found no evidence that walk rate is predictive at all.
BABIP matters more for hitters in the low minors than for hitters in the high minors.
Age matters more for hitters in the low minors than for hitters in the high minors.
Height matters more for hitters in the low minors than for hitters in the high minors.
Strikeout rate matters more than walk rate for pitchers, but only slightly.
Pre-professional background matters more for pitchers in the low minors than for pitchers in the high minors.

*****

There are still improvements to be made to KATOH. That will always be the case. But rest assured that I will continue plugging away to make it better. I’m fully aware that I’m unveiling an imperfect model here; but if I held off on publishing projections until I had a system I was completely happy with, I’d never write another article for FanGraphs. And while imperfect, this is easily the best KATOH model to date.

Even so, it’s important to remember that KATOH is just one tool for evaluating minor league players. Taken by itself, KATOH’s output is relatively useless, especially if used as a substitute for scouting-based analysis. KATOH has (almost) no idea how hard a pitcher throws, how good a hitter’s bat speed is, or what a player’s makeup is like. So it’s liable to miss big on players whose tools don’t line up with their performances. However, when paired with more scouting-based analyses, KATOH can be useful in identifying talented players who might be overlooked by the industry consensus or highly touted prospects who might be over-hyped.

FanGraphs lead prospect analyst Dan Farnsworth has started rolling out his team prospect lists this offseason.
Be sure to check out his pieces on the Diamondbacks and Braves if you haven’t already. From here on out, I’ll be publishing companion pieces where I go over each organization’s prospects according to my KATOH forecasts. I’ll catch up on the two organizations I missed along the way. Furthermore, I plan to continue writing up prospects as they’re called up and traded using insights from KATOH. Hopefully you’ll read some of those articles. And if you do, hopefully this primer gives you a better understanding of where the projections come from.

*****

Resources

Many thanks to Jeff Zimmerman for helping me with my numerous, and often unreasonable, data requests. In addition to data from FanGraphs, I also incorporated data from Baseball-Reference and from The Baseball Cube’s data store.