Have I Got an Algorithm for You
There is an “accuracy” epidemic afoot, a swelling chorus of vendors in ad tech, mar tech, mad tech and everything in between claiming that their algorithm is incredibly, almost unnervingly “accurate” — 90%, 95% … it’s only a matter of time before one of them claims to have built a model that warps space and time and is 110% “accurate,” thus silencing the competition for all time.
It’s a competitive play, of course. A vendor who says their tool provides 90% “accuracy” is — one presumes — superior to a tool that provides an anemic 89% “accuracy” and so should win the bake-off before it begins.
But as my liberal use of “quotes” here attests, I’m about to argue all is not what it seems. “Accuracy” does not mean what you think it means and in fact by itself doesn’t tell you what you actually want to know.
Let’s play a game. Imagine I’m trying to sell you a tool that claims to predict whether or not a person will click on your ad. An ad-click-predictor. I have built a model that takes in whatever information is available about the average ad clicker — browser cookies, settings, location, time of day, whatever — and zings out a prediction: WILL CLICK or WON’T CLICK.
Useful, right? If it’s a good model, it could really improve my advertising performance.
Now, imagine I tell you that my model is 99.95% accurate.
Now, imagine I am not lying. My model actually IS 99.95% accurate.
Well, my friends, I have built such a model. Yes, me. It didn’t take me long. Here is the model:
FOR EVERY PERSON, PREDICT PERSON WILL NOT CLICK ON THE AD
As you may know, click rates on ads are abysmal; here I’m assuming 0.05% (1 in 2,000) will do so. My almost 100% accurate model simply assumes NOBODY will click and will sadly be right 1,999 out of 2,000 times. Its “accuracy” is very high.
And it is a completely worthless model.
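The whole trick can be sketched in a few lines of Python. This is a toy illustration, not anyone's real model; the 2,000-person population with exactly one clicker encodes the 0.05% click rate assumed above.

```python
def predict_click(person):
    """The entire 'model': predict that nobody ever clicks."""
    return False

def accuracy(population):
    # Accuracy = fraction of predictions that match reality.
    correct = sum(1 for actually_clicked in population
                  if predict_click(None) == actually_clicked)
    return correct / len(population)

# 2,000 people, exactly one of whom clicks (the assumed 1-in-2,000 rate).
population = [True] + [False] * 1999
print(f"accuracy: {accuracy(population):.2%}")  # 1999/2000 = 99.95%
```

Because the positive class is so rare, the do-nothing model is wrong only once per 2,000 people — hence the gaudy accuracy number.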
Keeping It Real
Let’s make this more real. Cross-device matching is as hot as Derek Zoolander right now. Marketers have realized their customers have 5.2 devices each, on average — laptops, phones, tablets, Rokus — and that it would be ideal to be able to map these devices somehow to the same person.
There are a lot of approaches to cross-device matching (we’ll get into it some other time) but one popular method involves a lot of fancy modeling that comes up with the “probability” of a match. Similar to my ad-click-predictor above, these models will take in whatever is known about various devices and browsers and do some magic and make a prediction: DEVICE X and BROWSER Y belong to the same PERSON Z.
These vendors all promote their “accuracy.” Let’s say it’s 70% or 75%. Now, with our newfound skepticism, we will twitch an eyebrow and know what that means. If the model is presented with DEVICE X and BROWSER Y or PHONE ZZ and TABLET JJ, when it says “THEY both belong to PERSON MM” it will be right 70% or 75% of the time.
Which tells you something about the reliability of the model itself but does not really tell you whether it’s useful for you. Why? Because it is missing another important metric, the one you want to know:
How good is the model at finding matches?
- “Accuracy” tells you: If the model says “these devices match,” how often is it right?
- What you want to know: How many of the total “device matches” in the real world does the model find?
In other words, “accuracy” doesn’t tell you how good the model is at finding matches. It merely tells you how right it is when it finds one. You know nothing about how many possible matches it overlooked. Since you care about finding all the possible matches, I’d argue this second metric is more important to you.
In fact, machine learners are way ahead of us here. They have terms for these things. What we’ve been calling “accuracy” is known as “precision.” And the second term, the one we care about, is known (somewhat oddly) as “recall.”
Their definitions are based on a 2×2 table known (slightly less oddly) as the “confusion matrix.” It compares the model’s predictions (true / false — or “device match” / “device not match”) against the real-world truth (true / false — or “devices really match” / “devices really do not match”).
The cells are labeled using terms that may remind you of medical tests:
- True Positive: model says match, is right
- False Positive: model says match, is wrong
- True Negative: model says no match, is right
- False Negative: model says no match, is wrong
Using these four options — there are no others, amigos — we’re at the heart of data science. This simple matrix is used in various forms all the time to determine how good a model is. Getting back to our “accuracy” dilemma, the important distinctions are:
- “Precision” = True Positives / (True Positives + False Positives)
- “Recall” = True Positives / (True Positives + False Negatives)
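These two formulas are simple enough to compute directly. The counts below are invented for illustration — note that precision and recall divide the same True Positives by different totals:

```python
# Hypothetical confusion-matrix counts for a device-match model.
tp, fp, tn, fn = 75, 25, 880, 20

precision = tp / (tp + fp)  # when it says "match," how often is it right?
recall    = tp / (tp + fn)  # of the real matches, how many did it find?

print(f"precision: {precision:.0%}")  # 75/100 = 75%
print(f"recall:    {recall:.1%}")     # 75/95 ≈ 78.9%
```

A vendor quoting only the 75% could be sitting on any recall at all — the False Negatives never appear in the precision formula.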
Seems like a subtle distinction, but it’s critical. And in fact, there is a tradeoff between these measures. Models that are very often “right” tend to be conservative and so miss a lot of matches (falsely calling a match “no match” — a False Negative) because they are afraid of getting it wrong.
And vice versa: models that catch a lot of the real matches (few False Negatives) tend to be more profligate and find a lot of matches where none exist (many False Positives). So in practice, precision and recall are in tension: when one goes up, the other tends to come down.
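You can see the tension with a toy scored model: the same predictions, evaluated at a strict versus a lenient cutoff. The probabilities and labels here are invented for illustration.

```python
# (model's match probability, is it actually a match?)
scored = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False),
]

def precision_recall(threshold):
    """Call everything at or above the threshold a 'match'."""
    tp = sum(1 for p, real in scored if p >= threshold and real)
    fp = sum(1 for p, real in scored if p >= threshold and not real)
    fn = sum(1 for p, real in scored if p < threshold and real)
    return tp / (tp + fp), tp / (tp + fn)

for t in (0.9, 0.5):
    p, r = precision_recall(t)
    print(f"threshold {t}: precision {p:.0%}, recall {r:.0%}")
# threshold 0.9: precision 100%, recall 40%  (conservative: misses matches)
# threshold 0.5: precision 67%,  recall 80%  (profligate: false matches)
```

The strict cutoff is “right” whenever it speaks but finds only 2 of the 5 real matches; the lenient cutoff finds 4 of 5 but is wrong a third of the time.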
To get back to our device match example, if a vendor tells you its “accuracy” — or “precision” — is 75%, you should ask what its “recall” is. They will hem and haw and tell you they’ll get back to you. Make sure they do. That’s what you really want to know.
(Thanks to Ari Buchalter, MediaMath’s President of Tech., for giving me the idea for this post.)