Adventures in Predictive Analytics

In my last blog post, Eliminating the Guessing Game of Yesterday, I shared my recent obsession with business intelligence and predictive analytics. ExtraktData’s technologist and I have been experimenting with our data a lot since then. Our main goal is to develop a model that can predict the number of federal tax lien filings for the coming month. That means in any given month we can make an educated guess about how many liens the IRS will file the following month. For us and many of our clients, sales revenue is highly correlated with lien volume. Predicting lien volume helps predict sales revenue, and that’s quite valuable to our planning process. The prediction models are still in the works, so I have no results to report yet, only some insights into the building process.

The Data Set

The data set for these models consists of federal tax liens filed against businesses between 1990 and 2012. We removed federal holidays and weekends to get an accurate count of the working days IRS employees had each month to file liens. We purposely held back the 2013 data so we could compare what the model predicts with what actually happened in 2013. As we develop our prediction model, we will be testing several hypotheses:


  1. Each holiday has a greater impact than just the loss of a workday because people often take extra time off during those time periods.
  2. December and the summer months are generally vacation months.  Therefore, there may be seasonality based on vacation time.
  3. A relationship may exist between the dates on which businesses must file their annual or quarterly taxes and the volume of liens filed.
  4. The number of workdays in a month affects lien volume. Not everything at the IRS is automated (the IRS is not staffed by robots), so fewer workdays should mean fewer liens filed, and vice versa.
  5. Because of the IRS Fresh Start initiative (implemented at the beginning of 2012), looking only at historical data for liens above $10K is more predictive than looking at all the data.
  6. The above hypotheses combined with other factors (discussed in previous blogs), such as economic cycles and unemployment rate, will allow us to predict future lien volume.
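
To make the working-day count and the 2013 holdout a little more concrete, here is a minimal Python sketch. It is an illustration rather than our actual modeling workflow, and it rests on assumptions of my own: the function names are hypothetical, the holiday list covers only the standard federal holidays in effect from 1990 to 2013 (with the usual Saturday-to-Friday and Sunday-to-Monday observance), one-off office closures are ignored, and "summer" is read as June through August.

```python
import calendar
from datetime import date, timedelta

# Assumption: "vacation months" means June-August plus December.
VACATION_MONTHS = {6, 7, 8, 12}


def nth_weekday(year, month, weekday, n):
    """Return the n-th given weekday (Mon=0) of a month; n=-1 means the last one."""
    days = [d for d in calendar.Calendar().itermonthdates(year, month)
            if d.month == month and d.weekday() == weekday]
    return days[n - 1] if n > 0 else days[-1]


def observed(d):
    """Move a fixed-date holiday off the weekend: Sat -> Fri, Sun -> Mon."""
    if d.weekday() == 5:
        return d - timedelta(days=1)
    if d.weekday() == 6:
        return d + timedelta(days=1)
    return d


def federal_holidays(year):
    """Observed dates of the federal holidays belonging to `year` (1990-2013 rules)."""
    return {
        observed(date(year, 1, 1)),     # New Year's Day
        nth_weekday(year, 1, 0, 3),     # Martin Luther King Jr. Day
        nth_weekday(year, 2, 0, 3),     # Washington's Birthday
        nth_weekday(year, 5, 0, -1),    # Memorial Day
        observed(date(year, 7, 4)),     # Independence Day
        nth_weekday(year, 9, 0, 1),     # Labor Day
        nth_weekday(year, 10, 0, 2),    # Columbus Day
        observed(date(year, 11, 11)),   # Veterans Day
        nth_weekday(year, 11, 3, 4),    # Thanksgiving
        observed(date(year, 12, 25)),   # Christmas Day
    }


def working_days(year, month):
    """Count weekdays in the month that are not observed federal holidays."""
    # Next year's New Year's Day can be observed on December 31 of this year.
    holidays = federal_holidays(year) | federal_holidays(year + 1)
    last_day = calendar.monthrange(year, month)[1]
    month_days = (date(year, month, day) for day in range(1, last_day + 1))
    return sum(1 for d in month_days if d.weekday() < 5 and d not in holidays)


def feature_row(year, month):
    """Calendar features for one month (hypotheses 2 and 4)."""
    return {
        "workdays": working_days(year, month),
        "vacation_month": month in VACATION_MONTHS,
    }


# One row per month, then hold 2013 back for testing the model.
rows = {(y, m): feature_row(y, m) for y in range(1990, 2014) for m in range(1, 13)}
train = {key: row for key, row in rows.items() if key[0] <= 2012}
holdout = {key: row for key, row in rows.items() if key[0] == 2013}
```

Under these rules, July 2013 comes out at 22 working days (23 weekdays minus Independence Day). Lien counts would be joined to these rows by year and month, with a model fitted on `train` and its predictions compared against the actual 2013 counts in `holdout`.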

Our experiments are very much a work in progress, but one we are thoroughly enjoying. What’s the point in having all this data if you don’t explore the possibilities? So keep following our geeky adventures and let’s see what we find. We’re not promising miracles, but we will share the results all the same.

  1. Dylan C

    This sounds like a very interesting and exciting data project. What will you guys be using to perform the prediction analysis? R? I would love to see how this turns out.

  2. admin

    Hi Dylan. We are using RapidMiner to perform the prediction analysis. Thanks, we are looking forward to seeing how it turns out as well. I will be posting an update later next week.
