Cross-Device Identity: A Data Scientist Speaks

By Martin Kihn | October 13, 2016 | 5 Comments

Determining a person’s digital “identity” over time and across devices is the mission of the moment. Ad tech and mar tech platforms, data providers and aspirant start-ups all confuse prospects and customers alike with multisyllabic descriptions of their deterministic and probabilistic approaches to truth.

In fact, identity matching is harder than it looks. And the payoff may not be as great as it seems. To get the perspective of an expert practitioner, we spoke to Claudia Perlich, Ph.D., Chief Scientist at ad tech provider Dstillery.

One of the most respected names in the field, Perlich has more awards, competition badges and machine learning patents than anyone else on Crain’s annual 40 Under 40 list. Before joining Dstillery (formerly Media6Degrees), she worked at IBM’s Watson Research Center.

IS CROSS-DEVICE IDENTITY MATCHING REALLY USEFUL?

I have a certain amount of skepticism about it. It is certainly the new shiny object that people feel like they have to have. We hear a lot of stories from vendors about these staged campaigns — you know, you’ll serve an ad to a person on their mobile device and another one later on their laptop and then an email three days later. It’s all in a sequence and each message builds on the other. It fits in very well with most marketers’ conceptual understanding of how they should talk to consumers. Conceptually, that kind of sequencing makes sense. But I’m not so sure that it actually works in practice.

Claudia Perlich, Chief Scientist

WHY WOULD IT NOT WORK?

I have not seen any convincing study that this kind of sequencing has any real impact. That is, compared to business as usual, just serving the ads based on targeting rules. Sequencing and coordinating is alluring, but I’m not so sure that I’ve seen it make a big difference in execution. It works in theory, but that’s not enough. I know at IBM we were hunting for the so-called “business cycle” for years and didn’t find it. Maybe the elusive “customer journey” is something like that.

WHERE DO YOU THINK IDENTITY MATCHING WORKS BEST?

There are examples. I think it can help for longer consideration purchases that are harder to predict based on just what I see on a single device. For example, say you can determine [from location data] that a person is in a car dealership. Obviously, it’s a very strong signal that they’re in the market for a car. It’s a waste of time to show them an ad for a car then, but what if you could learn from web behavior and other device activity that people who buy new BMWs are also more likely to sign up for new gym memberships? Suddenly, this piece of location data could inform a better ad buy strategy on the browser.

CAN CROSS-DEVICE HELP YOU WITH MEASUREMENT?

Identity matching is much harder to use for measurement. The data has to be more accurate and more complete. The rule is, you can either have clean data or complete data — but not both. A lot of the geo data you get in our world just is not reliable enough. Sources go dark, technical systems go down for a while. It’s not so bad in a predictive model but for measurement, incorrect data can really influence the results.

ANOTHER CASE FOR IDENTITY IS FREQUENCY CAPPING. WHAT DO YOU THINK?

As a consumer who’s been saturated with certain ads, I can commiserate. But I don’t think it’s as big a problem as people think. There is so much waste already. The irony is, everyone is worried about viewability and underexposure. The latest IAB estimates I’ve seen suggest that maybe 30% of ads are viewable. And they’re also worried about overexposure. Which is it? The irony is that the metrics in the system are just terrible. Video is particularly bad. And whenever the industry comes up with a straw man metric, like video completion, it immediately gets gamed. Immediately, all these bots appear to watch videos. So suddenly everyone in programmatic is optimizing toward bot traffic.

We lament the fate of publishers for a moment, forced to create pages that optimize for “viewability” and end up so cluttered and ugly that readers are thrust into the arms of ad blockers. Then I get to the question everybody wants to ask a ninja like Perlich: How does identity matching work? I mean, really work? I had spent some time studying a patent recently and was told that a large number of attributes could go into a probabilistic model, things like location, content viewed, typing speed, system clock, screen resolution, etc.

It turns out that some of those attributes are useful not for cross-device but for determining that a browser seen today is the one you saw yesterday, despite cookie deletion. In other words, for single-device identity. But our topic was cross-device, so I asked her …

HOW DOES IDENTITY MATCHING WORK IN REAL LIFE?

I’m not sure what other people do. They’re sometimes reluctant to talk about their secret sauce. But I think it is primarily all based on I.P. address. That is the only common denominator that you have. There are some complications. The game ends up being that you are trying to detect which [I.P.-identified] networks are likely to be a home. You don’t want work, you don’t want a local coffee shop. You also don’t want the device of a friend who happens to be visiting you. So you look for patterns where a device is accessing a network that is rarely accessed and has a certain pattern, certain times of day, day after day. You’re looking for patterns that indicate “home.”
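
To make the "home" heuristic concrete, here is a minimal sketch in Python. It is not Dstillery's method. The function name, the input shape of (device, I.P., timestamp) sightings, the 7 p.m. to 7 a.m. "night" window and every threshold are assumptions chosen for illustration, and timestamps are assumed to be in the device's local time.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def likely_home_networks(observations, min_days=5, max_devices=6, min_night_share=0.5):
    """Guess which I.P. addresses look like 'home' for each device.

    observations: iterable of (device_id, ip, datetime) sightings.
    All thresholds are illustrative assumptions, not published values.
    """
    days = defaultdict(set)           # (device, ip) -> distinct dates seen
    night = defaultdict(int)          # (device, ip) -> sightings at night
    total = defaultdict(int)          # (device, ip) -> all sightings
    devices_on_ip = defaultdict(set)  # ip -> distinct devices seen there

    for device, ip, ts in observations:
        key = (device, ip)
        days[key].add(ts.date())
        total[key] += 1
        if ts.hour >= 19 or ts.hour < 7:  # assumed "at home" hours
            night[key] += 1
        devices_on_ip[ip].add(device)

    homes = {}
    for (device, ip), seen_days in days.items():
        if len(devices_on_ip[ip]) > max_devices:
            continue  # busy network: more like an office or coffee shop
        if len(seen_days) < min_days:
            continue  # too few days: more like a visiting friend
        if night[(device, ip)] / total[(device, ip)] < min_night_share:
            continue  # mostly daytime use: not a home pattern
        homes.setdefault(device, []).append(ip)
    return homes


# Toy data: a phone and laptop at home every evening for a week, a one-time
# visitor, and an office network shared by ten devices during the day.
first_evening = datetime(2016, 10, 3, 21, 0)
observations = []
for day in range(7):
    evening = first_evening + timedelta(days=day)
    workday = evening.replace(hour=10)
    observations.append(("phone", "198.51.100.7", evening))
    observations.append(("laptop", "198.51.100.7", evening))
    observations.append(("phone", "203.0.113.9", workday))
    for n in range(9):
        observations.append((f"coworker{n}", "203.0.113.9", workday))
observations.append(("visitor", "198.51.100.7", first_evening))

homes = likely_home_networks(observations)
```

In this toy data only the phone and the laptop end up with a home network. The office address is dropped for having too many devices, and the visitor for showing up on a single day.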

ARE I.P. ADDRESSES ACCURATE? DON’T THEY CHANGE?

Sometimes, they do. If somebody doesn’t want to get identified, they won’t. There are dynamic I.P.’s and so on. But there’s a lot of structure to the tuples [constant data elements]. Even if the back end is rotating, there is a lot of consistency in the front end data. So it’s useful.

IS THERE A MODELING APPROACH YOU USE?

Believe it or not, it’s kind of like the Internet. It’s a network of device/I.P., I.P./device. Look at Google, the old PageRank method they used. You are estimating the weights of the connections. You send some random traffic through [the model] and look for areas of high connectivity with extreme sparseness.
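
One way to picture the idea is a random walk with restart over a device/I.P. graph, a close cousin of PageRank. The sketch below is an assumption about what such a model could look like, not a description of Dstillery's system. The function names, the restart probability and the use of sighting counts as edge weights are all hypothetical.

```python
from collections import defaultdict


def build_graph(sightings):
    """Build a weighted device/I.P. graph from (device_id, ip, count) rows."""
    graph = defaultdict(dict)
    for device, ip, count in sightings:
        d, i = ("device", device), ("ip", ip)
        graph[d][i] = graph[d].get(i, 0) + count
        graph[i][d] = graph[i].get(d, 0) + count
    return graph


def walk_scores(graph, start, restart=0.15, iterations=50):
    """Random walk with restart: how often 'traffic' from `start` visits each node."""
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            total = sum(graph[node].values())
            for neighbor, weight in graph[node].items():
                # Spread the walker's mass in proportion to edge weight.
                nxt[neighbor] += (1 - restart) * mass * weight / total
        nxt[start] += restart  # a share of the walkers jumps back to the start
        scores = nxt
    return scores


def top_candidates(graph, device, k=3):
    """Rank other devices by how strongly the walk connects them to `device`."""
    start = ("device", device)
    scores = walk_scores(graph, start)
    others = [(node[1], score) for node, score in scores.items()
              if node[0] == "device" and node != start]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: two devices share a quiet home network; one of them also
# appears briefly on a cafe network used by twenty strangers.
sightings = [("phone-A", "home-ip", 30), ("laptop-A", "home-ip", 25),
             ("phone-A", "cafe-ip", 2)]
sightings += [(f"stranger-{n}", "cafe-ip", 3) for n in range(20)]

graph = build_graph(sightings)
candidates = top_candidates(graph, "phone-A")
```

The sparse, heavily weighted home network funnels almost all of the walk toward the laptop, while the crowded cafe network spreads its small share thinly across many strangers. That is the "high connectivity with extreme sparseness" pattern in miniature.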

WHAT IS THE ROLE OF DETERMINISTIC DATA?

Like Drawbridge [a cross-device data provider], we have some sources of deterministic matches that we use. It’s at a very small scale. Maybe one or two percent of our devices will have a deterministic match. It comes from companies who have a log-in and an app. We use this data to tune our model and tell us where our weights are on or off. This [deterministic] data doesn’t tend to come from the marketing clients. They either don’t have it or don’t know how to get it to us.
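
A small sketch of how a deterministic sample could be used to check a probabilistic model: score the candidate pairs, then measure precision and recall against the known log-in matches and pick the cut-off that balances them. Everything here is an assumption for illustration, including the function names, the F1 criterion, and the simplification that a pair of labeled devices missing from the deterministic set is treated as a wrong match.

```python
def evaluate(scored_pairs, truth_pairs, threshold):
    """Precision and recall of predicted pairs against deterministic matches.

    scored_pairs: dict mapping frozenset({device_a, device_b}) -> model score.
    truth_pairs: set of frozensets known to match from log-in data.
    Only pairs whose devices both appear in the deterministic sample are judged.
    """
    labeled = {device for pair in truth_pairs for device in pair}
    predicted = {pair for pair, score in scored_pairs.items()
                 if score >= threshold and pair <= labeled}
    true_positives = len(predicted & truth_pairs)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth_pairs) if truth_pairs else 0.0
    return precision, recall


def best_threshold(scored_pairs, truth_pairs, thresholds):
    """Pick the threshold with the highest F1 on the deterministic sample."""
    def f1(threshold):
        p, r = evaluate(scored_pairs, truth_pairs, threshold)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(thresholds, key=f1)


def pair(a, b):
    return frozenset((a, b))


# Toy data: two known matches, four scored candidate pairs.
truth = {pair("a", "b"), pair("c", "d")}
scored = {pair("a", "b"): 0.9, pair("a", "c"): 0.6,
          pair("c", "d"): 0.4, pair("a", "x"): 0.95}  # "x" has no log-in data

chosen = best_threshold(scored, truth, [0.3, 0.5, 0.8])
```

With only one or two percent of devices covered, a sample like this cannot carry the matching on its own, but it can show where the weights are "on or off."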

WHAT ABOUT LOCATION DATA? IS THAT USEFUL FOR DEVICE MATCHING?

It depends where you’re getting it from. Unless you are Verizon, you’re getting it from some app. If it’s your own app, that’s one thing. In the vast majority of cases, in ad tech, I’d say maybe eighty percent of the location information available is useless because it is unreliable: the device is not actually where it claims to be. There are a lot of reasons for this. You’ll see clusters of locations, which are centroids calculated based on I.P. addresses that replaced the latitude/longitude. There are a lot of rounding errors. You’ll see grid structures put on the data so it’s an approximate location. Most bid requests on the exchanges have some kind of location data but for the most part I do not trust it.
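
Two of those failure modes, centroids and rounding, leave fingerprints that are easy to check for. The sketch below flags coordinates that many different devices report exactly (a likely centroid) and coordinates with very few decimal places (a likely rounded or gridded value). The function names and both thresholds are hypothetical, and this is only one plausible screening approach, not a description of any vendor's filter.

```python
from collections import defaultdict


def decimal_places(value, max_places=6):
    """Smallest number of decimal places that reproduces `value` exactly."""
    for places in range(max_places + 1):
        if round(value, places) == value:
            return places
    return max_places


def flag_suspect_locations(records, max_devices_per_point=50, min_places=3):
    """Flag (lat, lon) points that look like centroids or rounded values.

    records: iterable of (device_id, lat, lon).
    Thresholds are illustrative assumptions, not industry standards.
    """
    devices_at = defaultdict(set)
    for device, lat, lon in records:
        devices_at[(lat, lon)].add(device)

    flags = {}
    for (lat, lon), devices in devices_at.items():
        reasons = []
        if len(devices) > max_devices_per_point:
            # Many devices at the exact same point suggests a centroid.
            reasons.append("shared_point")
        if max(decimal_places(lat), decimal_places(lon)) < min_places:
            # Two decimal places is roughly a kilometer: too coarse to trust.
            reasons.append("coarse_precision")
        if reasons:
            flags[(lat, lon)] = reasons
    return flags


# Toy data: sixty devices on one identical point, one rounded reading,
# and one reading with full precision.
records = [(f"device-{n}", 37.751, -97.822) for n in range(60)]
records.append(("device-x", 40.7, -74.0))
records.append(("device-y", 40.712776, -74.005974))

flags = flag_suspect_locations(records)
```

A check like this only catches the obvious cases. It says nothing about a precise-looking coordinate that is simply wrong.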

ANY PARTING WORDS FOR THE PEOPLE?

I think it’s very important to realize that what we’re talking about here — cross-device identity — is not at all the same thing as identifying a person. All you are doing is attaching different devices to one another. You don’t have to know who that person is at all. I talk a lot to the folks at Data & Society and they’re very concerned about privacy. It’s possible to build an agnostic identity for [marketing and] advertising that is very different from a personal identity.

Please leave a comment below or contact Martin Kihn @martykihn. Claudia Perlich can be reached at claudia@dstillery.com

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.

5 Comments

  • Great read, thank you Martin and Claudia.

    Any feelings on the necessity of getting 100% clean and accurate data, vs. making a decision based on 95%?

    I ask because we handle small online business attribution. Often these customers are never going to get to statistical significance, or have data that is completely clean as indicated in your article.

    However, they need to make decisions, and my personal opinion is that “really good but not perfect” data is a lot better than just flying by the seat of your pants.

    • Martin Kihn says:

      Thanks Scott – you’re absolutely right; as they say “done is better than perfect” … there is no perfect data in ad tech (or marketing tech for that matter) outside a lab. It’s the art of the possible.

  • Tim Storsved says:

    Hi Marty, and thank you Claudia,

    Thanks for the info – really interesting.
    Do you think there would be any value in combining IP with time series data, like frequency and time of day, not only to identify but perhaps also to format targeted advertising for a device or platform?