Predicting Secondary Market Prices for Playoff Tickets, Part 2

This is a follow-up to my previous post, “Predicting Secondary Market Prices for ALDS/NLDS Tickets”. Now with a complete set of price data from 2011 to 2015, I’ve amended and refined my previous model for ALDS/NLDS ticket prices. I’ve also been able to build additional models to predict ALCS/NLCS and World Series ticket prices in the future.

Before I go further, I’d like to thank Chris from TiqIQ. Chris was nice enough to give me TiqIQ’s complete set of price data from 2011 to 2015 for each year’s playoff teams. Needless to say, without his help, this study could not be completed.

The new set of data is superior to the previous data I collected from TiqIQ’s blog for the following reasons:

  • It takes into account all the transaction values, instead of only the transactions at the time the TiqIQ blog posts were written;
  • It only includes playoff games that were actually played, instead of all possible playoff games (which would include prices for games that might never be played); and
  • We have values for each individual game, instead of only an average value for the whole regular season and the whole ALDS/NLDS.

As before, the statistic being predicted is the average price of the tickets for each playoff series. Because the final game of each series (Game 5 in the ALDS/NLDS and Game 7 in the ALCS/NLCS and World Series) is guaranteed to be an elimination game for both teams, it commands a premium compared to the other games, so I excluded that game’s data when calculating the average value.
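To make the target variable concrete, here is a minimal sketch of that calculation, assuming a hypothetical DataFrame of per-game average prices; the column names are my own placeholders, not TiqIQ’s.

```python
import pandas as pd

# One row per playoff home game, with columns 'series' ('LDS', 'LCS' or 'WS'),
# 'game_num', and 'avg_price'. The winner-take-all game commands a premium,
# so it is excluded: Game 5 for the LDS, Game 7 for the LCS and World Series.
FINAL_GAME = {"LDS": 5, "LCS": 7, "WS": 7}

def series_average_price(games: pd.DataFrame) -> pd.Series:
    """Average ticket price per series, excluding the possible final game."""
    keep = games["game_num"] != games["series"].map(FINAL_GAME)
    return games.loc[keep].groupby("series")["avg_price"].mean()
```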

Recall that in the previous post, we used the average price of all regular season games as a predictor variable. Now that we have data for every individual regular season game, we are no longer restricted to an average across the whole season. After running some calculations, I found that the average price of regular season games later in the season has a higher correlation with playoff ticket prices. Intuitively, this makes sense: towards the end of the season, we have a good idea of who the playoff contenders are. Fans of those contending teams will be excited about the prospect of their team reaching the playoffs, and will accordingly pay more for tickets than they would earlier in the season. This inflated price is more indicative of the even higher cost of playoff tickets. I therefore decided to use the average price of the last 20 regular season home games as the predictor variable.

Another useful piece of information is the regular season price of games between the teams involved in the playoff series. In predicting the price of the 2015 NLDS between the Cardinals and Cubs, for example, it seems reasonable to expect the regular season Cardinals-Cubs price to be correlated with the playoff price. This factor helps account for inflated prices between traditional rivals (such as the Cardinals and Cubs) and for teams with large fan bases that usually command high ticket prices (such as the Yankees and Red Sox).
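As a rough illustration, here is a minimal sketch of how these two regular-season predictors could be assembled, assuming a hypothetical DataFrame of a team’s home games; the column and variable names are my own placeholders.

```python
import pandas as pd

def regular_season_features(home_games: pd.DataFrame, playoff_opponent: str) -> dict:
    """Build the two regular season predictors for one team.

    Assumes `home_games` has one row per regular season home game,
    with columns 'date', 'opponent', and 'avg_price'.
    """
    games = home_games.sort_values("date")
    vs_opp = games.loc[games["opponent"] == playoff_opponent, "avg_price"]
    return {
        # Average price of the last 20 regular season home games
        "last20_home_avg": games.tail(20)["avg_price"].mean(),
        # Average price of home games against the playoff opponent
        "opponent_matchup_avg": vs_opp.mean() if not vs_opp.empty else float("nan"),
    }
```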

Looking back at the code I wrote for my previous article, I realized that I had made my model unnecessarily complicated, and that a simpler model with fewer predictor variables could have performed just as well. I kept this in mind when designing my models this time around.

I also used the leave-one-out cross-validation (LOOCV) technique to validate my models. This is the same technique as in the previous article, although I did not explicitly write about it there. For each data point, I created a separate model trained on all the other data points, and then used that model to predict the excluded point. For example, when predicting the price of 2015 Royals World Series tickets, I trained a model on all the World Series price data except the 2015 Royals, and then used that model to predict the 2015 Royals price. Also, while model performance is typically measured with mean squared error (MSE), I quote the average absolute percentage error in these articles because it is a more intuitive measure.
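Here is a minimal sketch of that validation loop (not my exact code), assuming the predictors have already been assembled into a matrix X and the series average prices into a vector y; the use of scikit-learn here is purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loocv_percentage_error(X: np.ndarray, y: np.ndarray) -> float:
    """Average absolute percentage error under leave-one-out cross-validation."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        # Fit on every data point except one, then predict the held-out point.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])[0]
        errors.append(abs(pred - y[test_idx][0]) / y[test_idx][0])
    return float(np.mean(errors))
```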

For predicting ALDS/NLDS ticket prices, my model used these factors:

  • Average price of the last 20 regular season home games;
  • Average price of the regular season home games involving the LDS opponent;
  • Years since the team last made it to the playoffs; and
  • Years since the team last won a World Series championship (or years the team has been in existence in its current city, if it has never won a championship).

This model has an R-squared coefficient of 0.605, and it predicts prices with an average error of 37.6%. Recall that the model from the previous article had an R-squared coefficient of 0.723 and an average error of 30.9%.
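For the curious, here is a minimal sketch of fitting this four-factor linear model, assuming the features have been assembled into a DataFrame with one row per LDS team; the column names are illustrative placeholders.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

LDS_FEATURES = [
    "last20_home_avg",       # avg price, last 20 regular season home games
    "opponent_matchup_avg",  # avg price of home games vs. the LDS opponent
    "years_since_playoffs",  # years since the last playoff appearance
    "years_since_title",     # years since the last title (or years in current city)
]

def fit_lds_model(lds: pd.DataFrame) -> tuple[LinearRegression, float]:
    """Fit the four-factor model; return it along with its R-squared on the full fit."""
    X, y = lds[LDS_FEATURES], lds["lds_avg_price"]
    model = LinearRegression().fit(X, y)
    return model, model.score(X, y)
```

The quoted average error would come from running a model like this through the leave-one-out loop sketched earlier.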

We may wonder why this new model, built with more data, performs worse, and what went wrong. The short answer is that nothing went wrong. Using more data does not guarantee a better model, and in the grand scheme of things, 40 data points is still a relatively small dataset. We are fairly confident that the new data is less noisy than the old data, for the reasons mentioned above. The fact that the old data fit a model better was likely due more to luck than anything else.

The model for predicting ALCS/NLCS prices used these factors:

  • Average price of the last 20 regular season home games;
  • Average price of the home games in the ALDS/NLDS;
  • Years since the team last made it to the playoffs; and
  • Years since the team last won a World Series championship (or years the team has been in existence in its current city, if it has never won a championship).

Note that, this time around, we are able to take advantage of the actual LDS prices as a predictive factor. Also, unlike in the previous case, the regular-season price of games involving the LCS opponent did not help in the prediction, so that information was not used.

This model’s R-squared coefficient was 0.879, and the average error was 19.3%. This is quite a bit better on both counts than the ALDS/NLDS model, although it’s not exactly an apples-to-apples comparison.

[Figure: 2015-fangraphs-lds-ticket-prices]

The model for predicting World Series prices used these factors:

  • Average price of the last 20 regular season home games;
  • Average price of the home games in the ALDS/NLDS;
  • Average price of the home games in the ALCS/NLCS;
  • Years since the team last made it to the playoffs; and
  • Years since the team last won a World Series championship (or years the team has been in existence in its current city, if it has never won a championship).

It was not possible to use the regular season price of games against the World Series opponent, because not all World Series matchups had a corresponding matchup during the regular season (interleague play covers only a subset of AL-NL pairings).

This model had an R-squared coefficient of 0.932 and an average error of 20.3%. These results are similar to those of the ALCS/NLCS model, although again, it’s not an apples-to-apples comparison.
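As a usage sketch, predicting a future team’s World Series price from a model fit on these five factors would look something like this; the helper and column names mirror the earlier sketches and are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

WS_FEATURES = [
    "last20_home_avg",       # avg price, last 20 regular season home games
    "lds_home_avg",          # avg price of home games in the ALDS/NLDS
    "lcs_home_avg",          # avg price of home games in the ALCS/NLCS
    "years_since_playoffs",  # years since the last playoff appearance
    "years_since_title",     # years since the last title (or years in current city)
]

def predict_ws_price(model: LinearRegression, team: dict) -> float:
    """Predict the World Series average price for one team's feature values."""
    return float(model.predict(pd.DataFrame([team])[WS_FEATURES])[0])
```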

[Figure: 2015-fangraphs-ws-ticket-prices]

This model seems to do a pretty good job of predicting most data points, with the exception of the 2014 Royals. This is probably because the 2014 Royals had not been to the playoffs since 1985 (a 29-year gap), while all the other teams had been to the playoffs relatively recently by comparison, so the “years since last playoff appearance” factor had an exaggerated effect on the 2014 Royals. Perhaps if there were additional data points with a large playoff-appearance gap, the model would be better trained in this regard.

We could leave out the “years since last playoff appearance” factor when fitting the model. Doing so improves the average error to 18.6%, but also lowers the R-squared coefficient to 0.824. Looking at the updated chart below, the 2014 Royals data point is predicted more accurately, but at the expense of accuracy on other data points. As we discussed in the last article, it’s impossible to develop a model that predicts every data point perfectly.

[Figure: 2015-fangraphs-ws_v2-ticket-prices]

Finally, I should mention that all the models I created were linear models with no interaction effects. With at most 40 data points at our disposal, I thought we were unlikely to do better with a more complicated model. Of course, I could be wrong, so if I have time in the future I may revisit this problem.

[Note: Post was edited to include a paragraph about cross-validation technique]





Roger works as a software engineer by day, writes for The Hardball Times and FanGraphs by night, and has also worked for a Major League club.
