In my last post, “3 tips for getting started with predictive analytics,” I shared some practical advice for new predictive analytics practitioners. Today I would like to share more thoughts from my past experiences with predictive modeling and focus on ideas that you can apply to your first, or next, predictive analytics project.
When it comes to predictive analytics, I’ve learned a lot of things the hard way, that is, by trial and error. I think back to my early experimentation with predictive models and cringe at some of the rookie mistakes I made. There were also many days when I worked late trying to figure out why a piece of code wasn’t working, endlessly searching the internet for clues. Sometimes I got so frustrated that I had to call it quits for the day and go home to relax (this realization typically dawned on me at 9 p.m., after someone shut the lights off in the office). The next day, fueled by caffeine and stubbornness, I’d be right back at my desk trying to get that model to run. With relentless online research and a refusal to give up, I was eventually able to get a model to work and produce real, actionable results. The feeling of satisfaction was incredible.
That first successful model catapulted me down a path towards a lifelong passion for learning about analytics of every kind (predictive, text, social network, outlier detection, geographical and many more). I still make plenty of mistakes and run into plenty of problems—but part of the fun of predictive modeling is learning from your mistakes and solving those problems.
3 predictive analytics tips that I learned the hard way so you don’t have to
Today’s tips will focus on practical ideas that should apply to most predictive analytics projects.
1. Build a proof of concept model
Let’s start off with proof of concept models, since most projects will require buy-in from other stakeholders. A proof of concept model is also a great way for you (and your management) to assess the viability of a particular modeling project.
The first thing you’ll want to do is test a proof of concept model on a small dataset to see whether the outcome you care about is indeed predictable. This “pilot” project will show management that your model has the potential to work on larger-scale data.
Let’s walk through the steps of building a proof of concept model using a practical example. Assume you have a business problem identified: You want to know which customers in your customer database would buy your top-of-the-line product.
Step 1: Identify relevant historical data for training your model
To build a proof of concept, you will need some historical data on which to train the model. Generally, you will want to split this dataset into two parts: a training dataset and a test dataset.
Let’s look at how you could pull the needed historical data together to build a training dataset for the aforementioned purchase prediction example. One approach is to use your customer master table and include a data field that simply states “yes” if they bought the top-of-the-line product and “no” if they have not bought the top-of-the-line product. This “yes” or “no” data field is, in fact, the outcome that you would like to predict in your data.
Since there are two possible outcomes in this example, your best bet is a classification model. Many types of classification models exist; see Part I of this blog series for an example of a decision tree model.
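To make the training/test split from Step 1 concrete, here is a minimal Python sketch. The customer rows, the `bought_top_product` field name and the 70/30 split are all assumptions made up for illustration; your own data and split ratio will differ.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up customer master rows with a "yes"/"no" outcome field.
customers = [
    {"customer_id": i, "bought_top_product": "yes" if i % 4 == 0 else "no"}
    for i in range(1, 21)
]

train_rows, test_rows = split_train_test(customers)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Shuffling before splitting helps keep both parts representative of the whole table, rather than, say, putting only your oldest customers in the training set.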
Step 2: Identify the independent variables on which to base predictions
Next, choose independent variables to test as potential predictors. Coming up with candidate variables may take some creative thinking. For example, you might try summarizing data from the purchase history table, or using location information from the customer master (e.g., zip code, city, state). You could also try descriptive information, like customer industry or firm size.
Below is an example table of how a dataset for predictive modeling might be formatted. Note that this example only shows a few fields. In reality you may test many different fields as independent variables. Most predictive modeling programs require a flat table to do their magic, so ACL™ Analytics is handy for pulling multiple datasets together and preparing them for predictive modeling. You could even write an ACL script that does all the imports, joins and roll ups you need to prepare your data.
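The join and roll-up preparation described above could look something like this in Python. This is only a sketch: the table names, field names and values are made up for illustration, and the same preparation could equally be scripted in ACL Analytics.

```python
from collections import defaultdict

# Hypothetical, made-up tables; the field names are assumptions.
customer_master = [
    {"customer_id": 1, "state": "WA", "industry": "Retail"},
    {"customer_id": 2, "state": "OR", "industry": "Finance"},
    {"customer_id": 3, "state": "WA", "industry": "Retail"},
]
purchase_history = [
    {"customer_id": 1, "product": "basic", "amount": 100.0},
    {"customer_id": 1, "product": "top_of_line", "amount": 900.0},
    {"customer_id": 2, "product": "basic", "amount": 150.0},
]
TOP_PRODUCT = "top_of_line"


def build_flat_table(master, purchases):
    """Roll up purchase history per customer and join it to the customer master."""
    purchase_count = defaultdict(int)
    total_spend = defaultdict(float)
    bought_top = set()
    for purchase in purchases:
        customer_id = purchase["customer_id"]
        if purchase["product"] == TOP_PRODUCT:
            # This purchase defines the outcome, so keep it out of the predictors.
            bought_top.add(customer_id)
        else:
            purchase_count[customer_id] += 1
            total_spend[customer_id] += purchase["amount"]

    flat_table = []
    for customer in master:
        customer_id = customer["customer_id"]
        flat_table.append({
            **customer,
            "purchase_count": purchase_count[customer_id],
            "total_spend": total_spend[customer_id],
            "bought_top_product": "yes" if customer_id in bought_top else "no",
        })
    return flat_table


flat_table = build_flat_table(customer_master, purchase_history)
for row in flat_table:
    print(row)
```

Each customer ends up as exactly one row, with the outcome and the candidate predictors side by side, which is the flat shape most modeling tools expect.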
Step 3: Train the model, and test predictions against your test dataset
Once you’ve selected some relevant independent variables, the next step is to fit (train) a model on these variables using your training dataset. Then test the model’s predictions against the test dataset, whose outcomes are already known because it is historical data. It’s important to assess how many predictions the model gets right on this held-out test dataset. Comparing the model’s predictions with the actual historical outcomes, side by side, reveals how effective your variables are at predicting actual outcomes.
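Here is a rough Python sketch of the train-then-test loop in Step 3. As a stand-in for a real classification model, it uses a single-split rule (essentially a one-level decision tree) on one hypothetical variable, `total_spend`. The data, field names and the choice of model are assumptions for illustration only.

```python
def fit_stump(rows, feature, outcome):
    """Find the threshold on `feature` that classifies the training rows best.

    The resulting rule predicts "yes" when the feature value is >= threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({r[feature] for r in rows}):
        correct = sum(
            ("yes" if r[feature] >= threshold else "no") == r[outcome]
            for r in rows
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict(threshold, row, feature):
    """Apply the fitted rule to a single row."""
    return "yes" if row[feature] >= threshold else "no"


# Made-up training and test rows with known historical outcomes.
train = [
    {"total_spend": 120, "bought": "no"},
    {"total_spend": 300, "bought": "no"},
    {"total_spend": 450, "bought": "no"},
    {"total_spend": 800, "bought": "yes"},
    {"total_spend": 950, "bought": "yes"},
    {"total_spend": 1500, "bought": "yes"},
]
test = [
    {"total_spend": 200, "bought": "no"},
    {"total_spend": 850, "bought": "no"},
    {"total_spend": 900, "bought": "yes"},
    {"total_spend": 1200, "bought": "yes"},
]

threshold = fit_stump(train, "total_spend", "bought")

# Side-by-side comparison of actual vs. predicted outcomes on the test set.
comparison = [(r["bought"], predict(threshold, r, "total_spend")) for r in test]
accuracy = sum(actual == predicted for actual, predicted in comparison) / len(test)

for actual, predicted in comparison:
    print("actual:", actual, "predicted:", predicted)
print("accuracy on test set:", accuracy)
```

Note that the model is fitted only on `train` and scored only on `test`. Scoring it on the same rows it was trained on would overstate how well it predicts new customers.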
A final thing to consider with proof of concept models is that sometimes they show that a particular model simply doesn’t work. You may find that your data does not support predicting the outcome you are interested in. It is also important to note that models are never going to be 100% accurate. Expect your proof of concept to make some wrong predictions. However, if the model is correct often enough to provide actionable information to your organization, then you can use the results to construct a business case for using the model to predict outcomes (with x% accuracy).
2. Don’t be intimidated by jargon
A lot of fancy terminology is thrown around when people talk about data science. When starting out, this can be overwhelming and confusing. I know firsthand from using different data mining tools that each tool has its own jargon for the same things. The best thing you can do is learn some basics, so that when you encounter new jargon you will at least have an idea conceptually of what the main parts of a model are.
Dependent Variable vs. Independent Variable
When someone says “Dependent Variable” (DV), the first thing that comes to mind is a scientist in a lab somewhere doing complicated things. Instead of thinking “Dependent Variable,” think “the outcome I want to predict.” You’re sure to encounter other names for DVs, but the concept is generally the same across many tools and algorithms.
Do the same for “Independent Variable” (IV), but instead think of this as “data fields that might have some useful information for prediction.”
For example, a dependent variable might be: had a car crash (“yes” or “no”). Some independent variables for this might be: driving speed, blood alcohol content and weather conditions.
An easy way to think of dependent and independent variables is to consider them as outcomes (DV) and predictors (IV). Generalizing concepts in predictive modeling like this opened me up to a much wider world of algorithms and tools to use, because I could figure out the basics on my own and then fill in the blanks on how to use a particular model by reviewing literature on it.
In the example table above, the blue column is the dependent variable (the value you want to predict), and the white columns are the independent variables (predictors). The greyed-out column at the end is a field that is not useful as a dependent or independent model variable—and is therefore obscured. You may recognize some of the fields from the decision tree flow chart in part one of this blog post series.
This is made-up data, but if you look online you can find many free datasets to hone your predictive modeling skills on.
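If it helps to see the outcome/predictor idea in code, here is a small Python sketch based on the car crash example. The record, its values and the `record_id` field are made up for illustration.

```python
OUTCOME_FIELD = "had_crash"       # dependent variable: the outcome to predict
IGNORED_FIELDS = {"record_id"}    # fields with no predictive value


def split_outcome_and_predictors(row):
    """Separate one record into its outcome (DV) and its predictors (IVs)."""
    outcome = row[OUTCOME_FIELD]
    predictors = {
        field: value
        for field, value in row.items()
        if field != OUTCOME_FIELD and field not in IGNORED_FIELDS
    }
    return outcome, predictors


# A single made-up record.
record = {
    "record_id": 1001,
    "driving_speed": 85,
    "blood_alcohol_content": 0.09,
    "weather": "rain",
    "had_crash": "yes",
}

outcome, predictors = split_outcome_and_predictors(record)
print("outcome (DV):", outcome)
print("predictors (IVs):", predictors)
```

Whatever names a particular tool uses, most of them ask for these same two pieces: one outcome column and a set of predictor columns.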
3. Find mentors and keep learning
My last piece of advice to offer in this post is one that I have taken to heart. In the beginning, I made it a point to acknowledge that I knew nothing about predictive modeling. I did my best to introduce myself to experts in predictive modeling and ask them how to solve the problems I was encountering. I watched webinars and online videos and read books on predictive modeling. I forced myself to go up and talk to speakers at conferences after their sessions ended, and struck up conversations with real data scientists (even though they were way smarter than me!).
Asking questions and forcing myself to grow eventually landed me a job in IT—where the whole point of my role was to research and apply predictive analytics. From internal IT experts, I learned how to use data manipulation tools beyond spreadsheets—because better quality data leads to better models. I’ve met some fascinating people over the years who have been great mentors and resources for learning (one of those people connected me with ACL!). From ACL experts, I learned how to write scripts that pull together numerous data sources, all while cleaning the data and preparing it for analysis. Mentors are also great resources for discussing ideas for projects, as they can provide outside perspective and fresh ideas.
In closing, I encourage you to keep seeking knowledge and growing. There is always something new to learn in the field of analytics.
In future posts, I’ll take a look at more advanced topics, including data cleaning, handling unbalanced datasets, models that “cheat,” cross-validation and avoiding the dreaded overfit model. If you have topic suggestions you would like me to consider, please email me at Jeremy_Brown@acl.com.