We finally had some time to start on our project of predicting next month’s tax lien volume. Predicting the next value in a time series is often difficult, but we figured we would give it a shot and see what happened.
The data for this analysis was extracted from our internal database of all federal tax liens filed over more than 20 years. We cleaned up the data, ran some SQL queries, and generated our data set of total federal liens filed monthly between January 1990 and May 2014. This gave us 293 months of data to feed into our models. We also added some attributes we thought might be predictive of lien volume, such as working days in a month, unemployment rate, and three moving averages. Some of these attributes had to be lagged so that the models only see information available at prediction time. For example, we had to use last month’s unemployment rate to predict this month’s lien volume. Why? Because we don’t know the unemployment rate until it is released by the government a few days into the following month. Below is the metadata for the table used in our analysis. TotalLiens is a ‘label’, meaning that it is the value we are trying to predict.
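To make the lagging concrete, here is a minimal Python sketch of how such a table could be assembled. It is not our actual SQL or RapidMiner setup: the function names, the column names other than TotalLiens, and the 3, 6, and 12 month windows are all assumptions made for illustration, and the working-day count ignores holidays.

```python
import calendar
from datetime import date


def working_days(year, month):
    """Count Monday-Friday days in a month (holidays are ignored here)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() < 5
    )


def moving_average(values, end, window):
    """Mean of the `window` values strictly before index `end`.

    Returns None when there is not enough history, so the current
    month's value never leaks into its own predictor.
    """
    if end < window:
        return None
    return sum(values[end - window:end]) / window


def build_rows(months, liens, unemployment, windows=(3, 6, 12)):
    """Build one row per month. The window sizes are an assumption."""
    rows = []
    for i, (year, month) in enumerate(months):
        row = {
            "year": year,
            "month": month,
            "working_days": working_days(year, month),
            # Last month's rate: this month's is not published yet.
            "unemployment_lag1": unemployment[i - 1] if i > 0 else None,
            "TotalLiens": liens[i],  # the label we want to predict
        }
        for window in windows:
            row[f"ma_{window}"] = moving_average(liens, i, window)
        rows.append(row)
    return rows
```

For instance, `working_days(2014, 5)` returns 22. The first few rows carry None for the lagged and averaged columns and would need to be dropped or imputed before modeling.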
Using RapidMiner, Excel, and a little R, we whipped up two models. We won’t bore you with all the details, but we split the data into training and testing sets, trained our models with cross-validation on the training set, reapplied the models to the training set, and finally applied them to the testing set to see how they performed out of sample. This is not perfect data mining protocol, but it gives a decent sense of how we can use different types of models on this data. Here is our process in RapidMiner to train and test the two models simultaneously.
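For readers without RapidMiner, the same workflow can be roughed out in plain Python. This is only a sketch under several assumptions: we hold out the most recent months as the testing set (the split method is our choice for this example), the folds are interleaved, and a one-variable least-squares line stands in for the real models. All function names are hypothetical.

```python
def chronological_split(rows, test_fraction=0.2):
    """Hold out the most recent rows as the testing set."""
    cut = len(rows) - round(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_mae(xs, ys, k=5):
    """Average held-out absolute error over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        errors = [abs(y - (slope * x + intercept)) for x, y in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k
```

The cross-validated error is computed on the training rows only; the held-out testing rows are touched once, at the very end, to get the out-of-sample number.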
The two models tested were a classic Linear Regression and a Neural Net. Below are the out-of-sample results for the two approaches compared to actual federal lien volume.
For both models, the out-of-sample prediction was off by about 1,600 liens per month, or roughly 14%. That’s not terrible; however, these results are nothing to write home about. The models may improve significantly if we retrain them every month to predict the following month. That would be interesting to test.
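Error figures like these are typically a mean absolute error and a mean absolute percentage error, and the monthly retraining idea is usually called walk-forward validation. Here is a rough Python sketch of both, assuming those metric definitions; the function names and the placeholder forecaster in the usage note are ours, not part of the RapidMiner process.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the miss, in liens."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average miss as a percentage of the actual value."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def walk_forward(series, start, forecast):
    """Re-run the forecaster each month on all history up to that month.

    `forecast` is any function that takes the history so far and
    returns a prediction for the next month.
    """
    return [forecast(series[:i]) for i in range(start, len(series))]
```

To try it, pass any forecaster, for example `lambda history: sum(history[-3:]) / 3` as a naive three-month average, then compare the predictions against `series[start:]` with the two error functions.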
Next time we will explore this problem in a slightly different way. While time series prediction can be difficult, we can always reframe this as a classification problem. In other words, instead of predicting the number of liens next month, maybe we should try to predict whether the number of liens will go UP or DOWN. That might prove more useful. If we know there will be more liens filed next month and liens are correlated with revenue, then we can expect revenue to increase. In the real world it’s more complex, but this gives us a sense of what is possible with predictive analytics.
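Turning the monthly counts into UP/DOWN labels is a small step. A minimal Python sketch, assuming a month with no change is labeled DOWN (a choice made only for this example; the function name is hypothetical):

```python
def direction_labels(liens):
    """Label each month after the first as UP or DOWN versus the prior month."""
    labels = []
    for previous, current in zip(liens, liens[1:]):
        # A flat month counts as DOWN here; that tie rule is an assumption.
        labels.append("UP" if current > previous else "DOWN")
    return labels
```

The result has one fewer entry than the input, because the first month has nothing to compare against.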