Three tips for getting started with predictive analytics

Jeremy Brown, MBA, VCA, ACDA-I, ACDA-A

Manager, Controls Advisory, Grant Thornton

Predictive analytics can be an intimidating subject for a new user to learn, but with some time, dedication and research, predictive modeling can be very useful across a broad spectrum of disciplines and industries.

I taught myself predictive modeling over several years, and the best way for me to learn was by doing. I got started by applying predictive analytics to actual business use cases. There were a lot of bumps along my road and I made plenty of mistakes, but I learned from them and would like to share as much as I can to help you avoid them.

Why predictive analytics?

As a governance, risk and compliance (GRC) or audit professional, you may ask yourself: “Why bother with predictive analytics?” To answer that question for you, I’d like to illustrate with a practical example.

Assume for a moment that you have 1,000 transactions you need to review for potential fraud. You know you only have a business day or so to review them, and you also know you have experienced fraud in the past for this particular type of transaction. That time limit would allow you to review maybe 100 of the 1,000 transactions.

Normally you would probably approach this by taking a random sample from the population. You might also perform a few analytic tests to pick out the riskier 100 transactions to try to get better results. This is where predictive modeling comes into play.

If you have past examples of fraud to look at, there is a good chance you could fit a predictive model to the data. Your past threshold, outlier and percentage-based tests are a good starting point for variables to try in the model. You might also consider other data for the transactions.

Assume that you then fit a model to the identified past fraud alongside some known “non-fraud” transactions. You could then use the model to score your 1,000 new transactions. Instead of selecting a random sample, or a sample based on a single test, you can select the records that the model predicts are fraud risks. This may greatly improve your chances of catching fraud, as the sample is both data-driven and risk-based. Your model would pick what it thinks are the “riskiest” transactions based on past data, rather than relying on random chance and single analytic tests.
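As a rough sketch of that selection step, assume the fitted model has already produced a risk score between 0 and 1 for each of the 1,000 transactions. The scores below are random placeholders rather than real model output, and the function name and transaction ids are hypothetical.

```python
import random


def select_for_review(scores, budget):
    """Return the ids of the `budget` transactions with the highest risk scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]


# Placeholder scores standing in for a fitted model's output (assumption).
random.seed(0)
scores = {f"T{i:04d}": random.random() for i in range(1000)}

# Review capacity from the example: roughly 100 of the 1,000 transactions.
to_review = select_for_review(scores, budget=100)
```

The point of ranking by score is that the review budget is spent on the records the model considers riskiest, instead of on whichever records a random draw happens to return.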

Don’t think that your previously built analytic tests are no longer valuable. Speaking from experience, the best results can be obtained by combining those simple analytic tests with an advanced predictive model. The model and the tests will both miss some fraud, but if you combine the outputs of the two you can learn even more about your data.
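One simple way to combine the two outputs, sketched here with hypothetical transaction ids and a made-up helper name, is to compare which records each method flagged. Records flagged by both are natural first candidates for review, while records flagged by only one method show where the model and the tests disagree.

```python
def combine_flags(model_flags, rule_flags):
    """Group flagged transaction ids by which method raised them."""
    model_flags, rule_flags = set(model_flags), set(rule_flags)
    return {
        "both": model_flags & rule_flags,        # flagged by model and tests
        "model_only": model_flags - rule_flags,  # the tests missed these
        "rule_only": rule_flags - model_flags,   # the model missed these
    }


# Illustrative ids only; real inputs would come from your model and tests.
result = combine_flags({"T1", "T2", "T3"}, {"T3", "T4"})
```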

Today I would like to focus on practical advice that is useful across many different modeling programs. Here are three tips for new predictive modeling practitioners:

1. Understand the value of explainability

When you start your first predictive model project, you may be tempted to try the fanciest, newest machine learning algorithm you can find; try to resist this temptation. You will need to get buy-in from management and the end users who will ultimately be using the model predictions in their day-to-day work. I have found that the single most important factor in getting users to buy into a predictive modeling project is explainability. Share how the model works in everyday terms, and choose a simpler model to work with so you can get acquainted much faster with whichever program you use to build it.

A good algorithm to start with is the decision tree algorithm, because it produces a visualization that shows exactly how it makes its predictions. Decision trees are also very friendly to new users and can generally handle many different types of data. Most predictive modeling utilities have a decision tree model option. The model works a lot like a flowchart in practice, by reading data fields in a dataset and making decisions based on them progressively. The top of the flowchart can be thought of as the “trunk” of the decision tree. You start at the “trunk” of the tree and, based on data values, you follow the tree down its branches to reach a conclusion/prediction. Take a look at the figure below to see a simple decision tree and how it makes predictions based on data. Many decision tree implementations will look similar to this graphic in form.


Figure 1: Internet service company customer example churn/renewal decision tree

Example decision tree algorithm

In this sample decision tree, a made-up internet service provider has created a model to predict if their current customers are going to renew or cancel their service (also known as churn). The top box of the flowchart is the starting point for making a prediction for a customer and the white-filled boxes contain the decision criteria for each branch. Let’s go down the left side of the decision tree to illustrate how it works.

First (starting in the “Tenure” box), if a customer's tenure is less than two years, then you follow the tree to the next box. The second box (“Monthly Usage”) considers the customer’s usage of their internet service. If they use 2,000 or more megabytes of data each month, then the model predicts that customer will renew their contract. This is indicated by following the “>= 2000” branch to the blue conclusion box at the bottom containing the “Renew” prediction (this box has a green border, indicating a positive outcome). Likewise, if the customer uses fewer than 2,000 megabytes of data per month, then the model predicts they will not renew their service subscription (“Churn”).
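Because a decision tree reads like a flowchart, the left side walked through above can be written as a pair of nested checks. This is only a sketch of the branch described in the text: the function and parameter names are hypothetical, and the right side of the tree (tenure of two years or more) is not covered here, so the function returns None for it rather than guessing.

```python
def predict_left_branch(tenure_years, monthly_usage_mb):
    """Follow the left side of the example churn/renewal tree."""
    if tenure_years < 2:                 # "Tenure" box
        if monthly_usage_mb >= 2000:     # "Monthly Usage" box, ">= 2000" branch
            return "Renew"
        return "Churn"
    # The right side of the tree is not described in the text.
    return None


new_heavy_user = predict_left_branch(tenure_years=1, monthly_usage_mb=2500)
new_light_user = predict_left_branch(tenure_years=1, monthly_usage_mb=800)
```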

2. You need past data to predict a future outcome

A common misconception is that predictive models are magic. In reality, there is no voodoo going on behind the scenes. Conceptually, a model learns from past examples of outcomes (or values) and then uses what it learned to predict against unseen data. Whether you are doing classification (e.g., will the cat jump on the counter or not?) or numerical prediction (e.g., how high will the cat jump to get a snack?), the fact remains that you need past data to train a model to make predictions.
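To make "learning from past examples" concrete, here is a minimal sketch of a one-split decision tree (sometimes called a decision stump) written in plain Python. It is not how any particular modeling program is implemented, and the amounts and fraud labels are invented for illustration. The function simply tries each observed value as a cutoff and keeps the one that classifies the most past records correctly.

```python
def fit_stump(values, labels):
    """Pick the cutoff on one numeric field that best separates past outcomes.

    The resulting rule predicts True when a value is >= the cutoff.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(values)):
        correct = sum((v >= threshold) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Invented past data: transaction amounts and whether each was fraud.
past_amounts = [120, 80, 95, 4000, 5200, 150, 6100, 70]
past_fraud = [False, False, False, True, True, False, True, False]

threshold = fit_stump(past_amounts, past_fraud)


def predict(amount):
    """Score an unseen transaction using the learned cutoff."""
    return amount >= threshold
```

Without the labeled past records there would be nothing to search over, which is the practical meaning of this tip.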

The idea sounds simple, but getting the data you need to build a model can be the most challenging part of a project. You can also combine a predictive model with other data analysis techniques to get better results. For example, you might build a model based on past data that predicts against new data to determine whether it thinks a transaction is risky or not. You could augment this process with more traditional outlier-detection reports that look for things like “is this transaction X percent over the average for its category?”
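The "X percent over the average for its category" report can be sketched in a few lines. The field names, categories and amounts below are made up for illustration, and the category average here includes the transaction being tested, which is one of several reasonable choices.

```python
from collections import defaultdict


def flag_over_average(transactions, pct_over):
    """Return ids of transactions more than pct_over percent above their category average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in transactions:
        totals[t["category"]] += t["amount"]
        counts[t["category"]] += 1

    flagged = []
    for t in transactions:
        average = totals[t["category"]] / counts[t["category"]]
        if t["amount"] > average * (1 + pct_over / 100):
            flagged.append(t["id"])
    return flagged


# Invented sample data.
transactions = [
    {"id": "T1", "category": "travel", "amount": 100},
    {"id": "T2", "category": "travel", "amount": 120},
    {"id": "T3", "category": "travel", "amount": 110},
    {"id": "T4", "category": "travel", "amount": 500},
    {"id": "T5", "category": "supplies", "amount": 20},
    {"id": "T6", "category": "supplies", "amount": 25},
    {"id": "T7", "category": "supplies", "amount": 30},
]

flagged = flag_over_average(transactions, pct_over=50)
```

In this sample only T4 is flagged: the travel average is 207.50, so the 50 percent cutoff is 311.25, and no supplies transaction exceeds its own cutoff of 37.50.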

3. Don’t give up

Predictive modeling can be frustrating, especially when syntax isn’t working or results aren’t compelling. Trust me when I say the satisfaction of solving those problems is often well worth the frustration.

There are also many online resources for finding solutions to coding problems in common modeling languages like R or Python (reminder: you can now integrate your ACL scripts and datasets with R and Python directly, connecting you with a new world of possibility for statistical modeling, predictive analytics, regression models and more!).

Read part 2 of this post, where I share how to build a proof of concept model and how to decode data science jargon.
