The marketing people have spoken: we love data science. Thanks to the outpouring of joy that greeted a recent squib on 14 key machine learning terms for marketers, I thought I’d share a few more cheats gleaned from my recent report called “Understand Data Science Basics for Digital Marketing” (Gartner clients enjoy here).
Today we’ll touch on the most important concept of them all, which comes in 8 syllables: generalisability.
Fitting a Model to Data
No model is perfect, but some are more useful.
A “perfect” model could be built that simply memorized all the records and labels in the data set; however, it could only be applied to a new record if that record exactly matched a previous example. Such a model is too closely tailored to its training data to generalize to out-of-sample data. On the other hand, a model that mimicked a coin flip would be equally right (or wrong) against any new data but is by definition no better than chance. Obviously.
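To make the memorizer concrete, here is a minimal sketch in Python. The records and labels are invented for illustration (a hypothetical churn data set); the point is only that a lookup table is perfect on data it has seen and useless on anything else.

```python
# A "perfect" memorizer: a lookup table of training records and labels.
# 100% accurate on records it has seen, no prediction on anything new.
training_data = {
    ("age=34", "visits=12"): "churn",
    ("age=51", "visits=2"): "stay",
}

def memorizer(record):
    # Return the memorized label, or refuse to predict on unseen input.
    return training_data.get(record, "no prediction")

print(memorizer(("age=34", "visits=12")))  # seen before: "churn"
print(memorizer(("age=34", "visits=11")))  # one field differs: "no prediction"
```

Change a single field and the model has nothing to say, which is exactly the failure to generalize described above.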
The analyst must continually balance between the two extremes of memorizing the training data (known as overfitting) and being no better than chance (known as underfitting). This balance is also known as the bias-variance tradeoff, and it is a key concept in data science. Bias is systematic error, or underfitting, whereas variance describes the tendency of a model to adapt too closely to “noise” in the training data, or overfitting.
This next figure shows how model complexity – that is, increasingly memorizing the training data – causes training error to decline but generalization error to increase.
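The same pattern can be reproduced numerically. In this hypothetical sketch (assuming NumPy is available), polynomials of increasing degree are fit to noisy samples of a smooth curve: training error keeps falling as complexity grows, while error on held-out points eventually stops improving.

```python
import numpy as np

# Noisy samples of a smooth curve; even rows train, odd rows test.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

def errors(degree):
    """Fit a polynomial of the given degree to the training half and
    return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse = lambda xs, ys: float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    return mse(x_tr, y_tr), mse(x_te, y_te)

for degree in (1, 3, 9):
    train_err, test_err = errors(degree)
    print(f"degree {degree}: train {train_err:.3f}  test {test_err:.3f}")
```

Training error is mathematically guaranteed to shrink as degree increases (a bigger polynomial family contains the smaller ones); the test error is the number that tells you when complexity has stopped paying off.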
The purpose of the test phase in the train-test process is to ensure the model finds the sweet spot between overfitting and underfitting. Often, data scientists train and test a model simultaneously. A common technique called k-fold cross-validation splits a data set into some number of groups, or folds (often ten), then trains the model on all but one fold and tests it on the held-out fold, rotating until every fold has served as the test set. The goal of evaluation is to minimize error and maximize generalisability, which are in tension.
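The fold-rotation step can be sketched in a few lines of plain Python. The function name and the toy records are assumptions for illustration; real projects would lean on a library, but the mechanics are just this:

```python
import random

def k_fold_splits(records, k=10, seed=0):
    """Shuffle record indices, then yield (train, test) index lists,
    holding out each of the k folds in turn as the test set."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

records = list(range(20))  # stand-in for a real data set
for train, test in k_fold_splits(records, k=5):
    assert not set(train) & set(test)  # no record in both halves
    print(len(train), "train /", len(test), "test")
```

Every record gets tested exactly once, and never on a model that saw it during training, which is what makes the resulting error estimate an honest proxy for out-of-sample performance.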
There are a number of methods used to evaluate models, and the method chosen depends on the business goal. The marketer often must determine whether the problem calls for maximizing reach or minimizing inaccuracy. For example, a model predicting high-value customers likely to churn should not miss any candidates (maximum reach), but pays a low penalty for other mistakes. On the other hand, a model predicting who should receive a 50%-off coupon pays a high price for each mistake (half the sale price) if it recommends too many coupons.
Data scientists codify these concepts into a confusion matrix. It contrasts the predictions made by a model with underlying reality, using terms familiar from medical research such as “false positive” and “true positive.” These ideas are combined into important principles:
- Precision – How often is the model right when it makes a positive prediction?
- Recall – How many of the actual positives does the model catch?
- Area Under the Curve (AUC) – How well does the model rank true positives above false positives across all decision thresholds?
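The confusion matrix and the first two principles reduce to simple counting. This is a minimal sketch with invented labels (a hypothetical churn model); "churn" plays the role of the positive class:

```python
def confusion_counts(actual, predicted, positive="churn"):
    """Count true/false positives and negatives against the positive class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual    = ["churn", "churn", "stay", "stay",  "churn"]
predicted = ["churn", "stay",  "stay", "churn", "churn"]
tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)  # right when it said "churn": 2 of 3
recall    = tp / (tp + fn)  # actual churners it caught: 2 of 3
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

The churn-retention model above would be tuned for recall (miss no one); the coupon model would be tuned for precision (no wasted discounts).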
And that’s about enough for (Brexit) Friday. Go forth and model. peace, mk