Have I Got an Algorithm for You

There is an “accuracy” epidemic afoot, a swelling chorus of vendors in ad tech, mar tech, mad tech and everything in between claiming that their algorithm is incredibly, almost unnervingly “accurate” — 90%, 95% … it’s only a matter of time before one of them claims to have built a model that warps space and time and is 110% “accurate,” thus silencing the competition for all time.

It’s a competitive play, of course. A vendor whose tool provides 90% “accuracy” is — one presumes — superior to one that provides an anemic 89% “accuracy,” and so should win the bake-off before it begins.

But as my liberal use of “quotes” here attests, I’m about to argue that all is not what it seems. “Accuracy” does not mean what you think it means, and by itself it doesn’t tell you what you actually want to know.

Let’s play a game. Imagine I’m trying to sell you a tool that claims to predict whether or not a person will click on your ad. An ad-click-predictor. I have built a model that takes in whatever information is available about the average ad clicker — browser cookies, settings, location, time of day, whatever — and zings out a prediction: WILL CLICK or WON’T CLICK.

Useful, right? If it’s a good model, it could really improve your advertising performance.

Now, imagine I tell you that my model is 99.95% accurate.

Now, imagine I am not lying. My model actually IS 99.95% accurate.

Well, my friends, I have built such a model. Yes, me. It didn’t take me long. Here is the model:

FOR EVERY PERSON, PREDICT PERSON WILL NOT CLICK ON THE AD

As you may know, click rates on ads are abysmal; here I’m assuming 0.05% (1 in 2,000) will do so. My almost 100% accurate model simply assumes NOBODY will click and will sadly be right 1,999 out of 2,000 times. Its “accuracy” is very high.

And it is a completely worthless model.
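
To see the trap in miniature, here is a quick back-of-the-envelope sketch in Python. The 0.05% click rate is simulated and the numbers are purely illustrative, not real campaign data.

    import random

    def predict(impression):
        """The entire "model": predict that nobody ever clicks."""
        return False  # WON'T CLICK

    # Simulate a million impressions with an assumed 0.05% click rate.
    random.seed(42)
    impressions = [{"clicked": random.random() < 0.0005} for _ in range(1_000_000)]

    correct = sum(predict(imp) == imp["clicked"] for imp in impressions)
    print(f"Accuracy: {correct / len(impressions):.2%}")  # roughly 99.95%

    # ...and yet it never identifies a single clicker.
    clickers_found = sum(predict(imp) and imp["clicked"] for imp in impressions)
    print(f"Clickers found: {clickers_found}")  # 0

High “accuracy,” zero value: it never surfaces the one thing you built it to find.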

Keeping It Real

Let’s make this more real. Cross-device matching is as hot as Derek Zoolander right now. Marketers have realized their customers have 5.2 devices each, on average — laptops, phones, tablets, Rokus — and that it would be ideal to be able to map these devices somehow to the same person.

There are a lot of approaches to cross-device matching (we’ll get into them some other time), but one popular method involves a lot of fancy modeling that comes up with the “probability” of a match. Similar to my ad-click-predictor above, these models will take in whatever is known about various devices and browsers and do some magic and make a prediction: DEVICE X and BROWSER Y belong to the same PERSON Z.

These vendors all promote their “accuracy.” Let’s say it’s 70% or 75%. Now, with our newfound skepticism, we will twitch an eyebrow and know what that means. If the model is presented with DEVICE X and BROWSER Y, or PHONE ZZ and TABLET JJ, when it says “THEY both belong to PERSON MM,” it will be right 70% or 75% of the time.

Which tells you something about the reliability of the model itself but does not really tell you whether it’s useful for you. Why? Because it is missing another important metric, the one you want to know:

How good is the model at finding matches?

  • “Accuracy” tells you: If the model says “these devices match,” how often is it right?
  • What you want to know: How many of the total “device matches” in the real world does the model find?

In other words, “accuracy” doesn’t tell you how good the model is at finding matches. It merely tells you how right it is when it finds one. You know nothing about how many possible matches it overlooked. Since you care about finding all the possible matches, I’d argue this second metric is more important to you.

In fact, machine learners are way ahead of us here. They have terms for these things. What we’ve been calling “accuracy” is known as “precision.” And the second term, the one we care about, is known (somewhat oddly) as “recall.”

Their definitions are based on a 2×2 table known (slightly less oddly) as the “confusion matrix.” Basically, it compares the model’s predictions (true / false — or “device match” / “device not match”) to the real-world truth (true / false — or “device really matches” / “device really does not match”).

The cells are labeled using terms that may remind you of medical tests:

  • True Positive: model says match, is right
  • False Positive: model says match, is wrong
  • True Negative: model says no match, is right
  • False Negative: model says no match, is wrong

Using these four options — there are no others, amigos — we’re at the heart of data science. This simple matrix is used in various forms all the time to determine how good a model is. Getting back to our “accuracy” dilemma, the important distinctions are:

  • “Precision” = True Positives / (True Positives + False Positives)
  • “Recall” = True Positives / (True Positives + False Negatives)
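
To make the two formulas concrete, here is a small Python sketch with made-up counts for a hypothetical device-match model (none of these numbers come from a real vendor):

    #                          really matches   really doesn't match
    #   model says "match"      true_positive     false_positive
    #   model says "no match"   false_negative    true_negative

    true_positive = 750     # model says match, is right
    false_positive = 250    # model says match, is wrong
    false_negative = 2_250  # model says no match, is wrong
    true_negative = 96_750  # model says no match, is right

    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)

    print(f"Precision: {precision:.0%}")  # 75% -- the number the vendor quotes
    print(f"Recall:    {recall:.0%}")     # 25% -- the number you should ask about

In this made-up example the vendor can truthfully claim 75% “accuracy” while quietly missing three out of every four real matches.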

Seems like a subtle distinction, but it’s critical. And in fact, there is a tradeoff between these measures. Models that are very often “right” tend to be more conservative and so miss a lot of matches (falsely calling a real match “no match” — a False Negative) because they are afraid of getting it wrong.

And vice versa: models that catch a lot of the real matches (few False Negatives) tend to be more profligate and find a lot of matches where they don’t exist (lots of False Positives). So in practice, precision and recall are in tension (one’s up, the other’s down).
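
You can see the tension in a toy simulation: the same scoring model, with the “call it a match” threshold moved up and down. The scores and labels below are synthetic, purely to show the shape of the tradeoff.

    import random

    random.seed(7)
    # (score, is_real_match): real matches tend to score higher, but imperfectly.
    pairs = ([(random.betavariate(5, 2), True) for _ in range(1_000)]
             + [(random.betavariate(2, 5), False) for _ in range(9_000)])

    for threshold in (0.9, 0.7, 0.5, 0.3):
        tp = sum(score >= threshold and real for score, real in pairs)
        fp = sum(score >= threshold and not real for score, real in pairs)
        fn = sum(score < threshold and real for score, real in pairs)
        precision = tp / (tp + fp) if (tp + fp) else float("nan")
        recall = tp / (tp + fn)
        print(f"threshold {threshold:.1f}: precision {precision:.0%}, recall {recall:.0%}")

Raise the threshold and precision climbs while recall collapses; lower it and the reverse happens. A vendor quoting only one of the two numbers has, in effect, picked a spot on that curve without telling you which one.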

To get back to our device match example, if a vendor tells you its “accuracy” — or “precision” — is 75%, you should ask what its “recall” is. They will hem and haw and tell you they’ll get back to you. Make sure they do. That’s what you really want to know.

(Thanks to Ari Buchalter, MediaMath’s President of Tech., for giving me the idea for this post.)
