Let’s Build a Searchable Baseball-Text Archive

Baseball provides its audience with great data. We’re blessed with a large number of observations and very high-quality record keeping. Even before MLB announced a radar and video tracking system designed to capture seven terabytes of data per game, we had box scores going back a century and 40 years of play-by-play logs. Thanks to organizations like Retrosheet, SABR, and Baseball-Reference, baseball’s data is also very accessible.

Unsurprisingly, baseball fans are excited about the full implementation of Statcast because the system brings with it the promise of new and exciting data. Exit velocity! Time to the plate! Route tracking! While I share some of that excitement, Statcast has my attention because of something else it promises to offer: completeness.

If Statcast delivers on its promise, we will have a record of everything that happens on the field. That might not sound particularly groundbreaking, but imagine you wanted to know how many times a player got into and then out of a rundown last year. How would you study that? Unless you have the time to watch a lot of video or could program a computer to parse video on your behalf, you’re probably out of luck. Statcast could theoretically change that.

As you may recall, I spent the 2016 season tracking catcher back picks to first base. I leaned on crowdsourcing because it gave me the best chance to gather a complete data set. Proto-Statcast tracked some back-pick attempts, but not all of them. Baseball-Reference had records of some, as well, but neither source had the full complement of data. I was interested in a part of the game that no one was tracking and had to undertake a lot of effort to create useful data. In particular, I was able to do a reasonably thorough job because I attacked the problem contemporaneously. It required a lot of effort and coordination. If I had asked people on October 1 to remind me of back-pick attempts, I would have had very limited responses. But I asked at the beginning of the year and reminded people often, increasing the odds that the information would be tracked and recorded.

Statcast’s potential ability to track everything means that we might not have to decide what data we want to track ahead of time. In a world with a perfect Statcast system, I could have sat down on October 1 and called up every back pick. Or every rundown. Or every time a center fielder ran 20 feet or more to make a catch. This is exciting because it makes it easier to study aspects of the game that are more esoteric. There isn’t money in tracking back-pick attempts or run-down escapes, but the beauty of Statcast is that it simply collects everything and lets us figure out where to go next.

Certainly, the Statcast system is nowhere near ready to provide that kind of data availability to the general public, but the potential is there and it got me thinking about other baseball-related data collection. In particular, now that we live in a world of unlimited data storage and automated data collection, what should we be tracking and archiving that we aren’t already? I have three suggestions, two of which are merely a matter of effort and one of which would require some convincing.

My first proposal would be to make transcriptions of television and radio game broadcasts available to the public. While the video and audio recordings are archived, searching through those recordings isn’t an efficient way to locate particular details. With searchable text files available, researchers who want to identify a particular event or find instances of some rare occurrence could make use of the conversation between announcers.

Imagine you wanted to review all the cases in which the third-base coach made a poor decision to send or hold a runner. You could isolate outs at the plate in any reasonable database and then work backward through video archives to review the plays, but you wouldn’t have an easy way to find incorrect holds. Maybe with Statcast you could compare the location of the ball to the location of the runner, but then you lack an ability to determine what the base coach was doing at the time. Currently, your only real option is watching thousands of video clips.

But if you had transcripts of every broadcast, you could search for instances in which the announcers discussed the third-base coach and match that to cases in which there was a runner approaching/passing third base. It would still require some review of the plays, but you could zero in on the relevant cases in a much more direct way.
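As a rough illustration of that search-and-match step, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Segment` and `Play` record layouts, the `runner_near_third` flag, the search pattern, and the 30-second window are all hypothetical, since no such transcript archive exists today.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped chunk of a (hypothetical) broadcast transcript."""
    game_id: str
    seconds: float  # offset from the start of the broadcast
    text: str


@dataclass
class Play:
    """One play from a (hypothetical) play-by-play log."""
    game_id: str
    seconds: float  # offset from the start of the broadcast
    runner_near_third: bool
    description: str


# Matches "third-base coach" or "third base coach", in any letter case.
COACH_PATTERN = re.compile(r"third[- ]base coach", re.IGNORECASE)


def candidate_plays(segments, plays, window=30.0):
    """Pair plays that had a runner near third with announcer mentions of
    the third-base coach made within `window` seconds after the play."""
    hits = [s for s in segments if COACH_PATTERN.search(s.text)]
    matches = []
    for play in plays:
        if not play.runner_near_third:
            continue
        nearby = [
            s for s in hits
            if s.game_id == play.game_id
            and 0 <= s.seconds - play.seconds <= window
        ]
        if nearby:
            matches.append((play, nearby))
    return matches
```

The output is only a shortlist of plays worth pulling up on video. A human would still need to judge whether the send or hold was actually a mistake.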

Having game transcripts available in this manner would significantly improve our ability to retroactively find information that doesn’t fit neatly into a stat sheet or even a system like Statcast. One regional sports network (RSN) with which I spoke said they don’t currently transcribe their broadcasts. That said, it wouldn’t necessarily be a foreign concept for a television network. For example, CNN releases transcripts just hours after broadcasts air. There would certainly be less need for urgency when it comes to baseball games. Furthermore, this wouldn’t undercut their business model because no one would treat transcriptions as a substitute for watching or listening to the game, but merely as a complementary tool.

Similarly, every book, article, television/radio show, game story, and tweet would ideally be archived in a single place. While Google does a decent job helping one to track down articles written on particular topics, the ability to drill down beyond the search-engine-optimized text is limited. A really useful article might have appeared on a now-defunct blog in 2007, but the odds of finding that with ease are limited. There is simply too much else on the internet through which Google must cut in order to find that useful bit.

Additionally, our current system provides us very little in terms of comparative power. Say you wanted to evaluate how the coverage of Alex Rodriguez changed over time. You could certainly find individual articles from different points in time, but the tool I’m proposing would allow you to search for every time he was mentioned during a given period. A researcher could then read a truly random selection of articles or utilize an automated text analysis program to conduct his or her research. Imagine being able to search every game story written about the Giants or every column about steroid use.
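To make the idea concrete, here is a minimal Python sketch of that kind of query against a hypothetical archive. The `Article` record, its fields, and the helper names are assumptions for illustration only. A real archive would need proper indexing and name disambiguation rather than a plain substring match.

```python
import random
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    """One archived piece of writing (hypothetical record layout)."""
    published: date
    source: str
    text: str


def mentions(articles, name, start, end):
    """Return every article published in [start, end] that mentions `name`
    (case-insensitive substring match)."""
    needle = name.lower()
    return [
        a for a in articles
        if start <= a.published <= end and needle in a.text.lower()
    ]


def mentions_by_year(found):
    """Count matching articles per publication year."""
    return Counter(a.published.year for a in found)


def random_sample(found, k, seed=0):
    """Draw up to `k` articles uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(found, min(k, len(found)))
```

A researcher could call `mentions` for a player and a date range, chart the result of `mentions_by_year` to see how the volume of coverage changed, and then read the articles returned by `random_sample` instead of whichever ones a search engine happens to surface first.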

Creating this system retroactively might be challenging, but going forward you could imagine a system where every baseball site, book publisher, etc., agreed to have their articles automatically captured and retained. Unlike with broadcast transcription, this phase of the project wouldn’t require that we create much new content, merely a new infrastructure for retaining the work that exists anyway. If I know exactly what I’m looking for today, I can find any article I might need. But as more and more content is pushed onto the web, it gets harder and harder to find things of which one isn’t already aware.

Also, to avoid problems with subscription sites and those dependent on ad revenue, there could be a cooling off period before the articles became available for public use. After all, no site is making money on a four-year-old article about the Mariners’ second-tier LOOGY. The same could be true for book publishers, who could require an individual purchase of the text in order to see beyond the search results.

I imagine that no one would object to these ideas in principle. Rather, it would just be a matter of convincing the relevant parties that the initial effort would be worth it in the long run. After all, this is all information that’s available to the public in some format already; we would just be creating a process to retain it in a way that might be useful in a manner other than its original intent.

My third suggestion is a little more complicated. I would also propose that clubs undertake a record retention and declassification process similar to one that occurs across governments. With most present communication occurring digitally and with the desire among clubs to leverage every bit of information they have, teams are already gathering and storing proverbial mountains of internal records. What if MLB standardized the record retention process to ensure that all of that information exists in perpetuity? And additionally, what if we could convince teams to release that information publicly after a given period of time?

Imagine what you could learn from seeing the internal evaluations of every player and all of the different trade scenarios that were discussed at a given deadline. How about the research teams are doing into any number of topics? This information is hidden from view because baseball is a competitive endeavor and teams want to use information asymmetry to their advantage. These records are private to protect teams from each other, but at some point the statute of limitations runs out. At some point, nothing in those files is actually useful to an opponent.

For example, seeing the Tigers’ scouting grades for the 1995 draft would be totally useless to any other team today, but it would be very useful in helping those of us on the outside evaluate the game. Trade negotiations from 2003 also wouldn’t be of much relevance to winning in 2017, but they would give analysts and researchers insight into how the industry viewed particular players at the time. Wouldn’t it be fascinating to see which teams were onto framing before the public research hit in 2011? Surely every team has seen research that goes well beyond the initial findings by now, so there’s nothing left to gain from seeing the first drafts.

There would obviously need to be firm guidelines in place to ensure that anything that was released was beyond its useful life as a trade secret. We would also want to ensure that only baseball-relevant data was released, avoiding any sort of personal details about players’ health or personal life.

This is a big request and it’s one that doesn’t come with any sort of grand justification. In government, we demand open records laws and declassification because the public has a right to inspect the operation of its government. We can make no such claim in baseball, but that doesn’t mean we shouldn’t ask for something that would enrich the experience of being a fan.

Statcast promises us a massive amount of new data, but there’s plenty of wonderful old data that’s either difficult to access or otherwise remains wholly out of view. Research and analysis go beyond the construction of mathematical algorithms and fancy graphics. There’s a lot to learn from studying what is said and written about the game. Two decades ago something like this would have been logistically impossible, but we now have the tools to capture, retain, and make available a lot of information that will be useful to fans in the future. We can continue to conduct studies in a piecemeal fashion by interviewing old scouts, watching hours of old games, or digging through hundreds of newspapers, or we can take steps to build a system that will give future iterations of ourselves the tools they need to explore.

Neil Weinberg is the Site Educator at FanGraphs and can be found writing enthusiastically about the Detroit Tigers at New English D. Follow and interact with him on Twitter @NeilWeinberg44.


Sabermetricians parsing the commentary of Hawk Harrelson. That’s some ironic #@!& right there.


Chopper two hopper.

Serbian to Vietnamese to French is back

Maybe this is where I can be of service.