How Cross-Device Identity Matching Works (part 2)
By Martin Kihn | September 20, 2016
Cross-device identity matching is the way marketers try to map devices and browsers to the same consumer to improve personalization and measurement. In our super-popular Part 1, we got halfway across the bridge by describing one common way this is done. Our case study was a particular patent issued to the mobile demand side platform Adelphic.
The post inspired many thoughtful squibs and expansions, including some builds from our friends at Drawbridge, who are well-known for their cross-device identity matching solution. Drawbridge pointed out to me a distinction that might seem obvious but isn’t.
Namely, there are two problems the marketer must solve to perform a successful cross-device identity match:
1. Identify a singular device
2. Match that device to a person
Of course, you probably can’t do (2) without successfully doing (1). And either or both can be attacked using deterministic or probabilistic methods. It’s possible to use a deterministic (one-to-one) method for (1) and a probabilistic method for (2) or vice versa. Which opens up a wide gate through which parties can march their dogs and ponies.
I have heard vendors say “we use only deterministic methods,” when they were referring only to step (1). A device-matching vendor who doesn’t do any probabilistic modeling at all does not have a model, it has a lookup table — one it likely acquired from someone else who does probabilistic modeling.
There is nothing wrong with probabilities, friends; they are probably inevitable.
All of the major stand-alone third-party matching vendors — Tapad, Drawbridge, Oracle’s Crosswise, Adbrain — use both deterministic and probabilistic methods. Which combined or hybrid approach is exactly what Adelphic describes in its patent.
PROBLEM #1: WHAT DEVICE IS THIS?
Problem (1) may seem simple to solve but is not. Imagine you are a publisher or marketer and a device communicates with your site. You know this device exists — after all, it’s talking to you. But what is it? Would you recognize it if it appeared again? Does it have an identity?
Apple used to communicate a unique device ID with each server call, but it stopped doing this three years ago (citing privacy concerns). In its place, both Apple and Android created a unique, consumer-controlled ID available to apps selling ads. These are called IDFA and AdID, respectively. So apps can choose to share this ID if they want, but only if (1) the consumer downloaded their app, (2) is using it, (3) has not manually opted out of ad tracking (which they can do with both IDFA and AdID), and (4) the app really sells ads.
So it’s not available to mobile browsers, people the app doesn’t want to share with, opted-out consumers, and non-ad-sellers. In other words, it’s not a cure-all. And it is tied to a device, not a person. The IDFA is different on my iPad Mini and on my iPhone 6.
IDFA and AdID are often called “deterministic” IDs, because if you know them, you know the device. What is the long-suffering marketer to do if she doesn’t have either IDFA or AdID? Give up? Well, if the consumer is in a Chrome browser, the browser can be cookied, of course, but what if he isn’t, or it can’t?
IDENTIFYING A DEVICE WITHOUT AN ID
Here we venture into the territory that used to be called “fingerprinting” and was associated with firms like BlueCava. As we’ve said, this is a form of probabilistic device identification, meaning it identifies unique devices within a range of probability. Setting aside the right name, let’s describe how it works, again using Adelphic’s patent as our guide.
The patent refers to something it calls a “signature.” This is a combination of attributes that collectively may be used to identify a unique device. These attributes are pieces of information that are shared in the course of routine communications with the app publisher or mobile website owner.
We described some of these attributes last time. They include:
- “system-type” data such as OS version, local time, phone model
- “usage-type” data such as headers, user query parameters, referrer, plug-in data, location, URLs viewed
How is this “signature” created? The patent describes it as a kind of list that contains some or all of the above-mentioned system and usage attributes, encoded in such a way that they can quickly be compared to similar signatures sitting in the master ID database. The attributes can be encoded as numbers, categories, or even distributions.
Of course, it is not simply a list of all attributes. Much of any data set is noise. It is a selection of those attributes most likely to mean something to the system (the selection process is described below).
So we can think of the “signature” as a streamlined version of the attributes that includes the good ones and not the noise. The patent puts it this way:
“The entity identity is generated by applying to the feature data [i.e. attributes] one or more rules … identifying which of the feature data to use to generate the entity identity …”
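To make the idea less abstract, here is a minimal Python sketch of a signature builder. The patent does not publish a format, so everything specific here is an assumption for illustration: the feature names, the `SELECTED_FEATURES` rule, and the `build_signature` helper are all hypothetical.

```python
# Hypothetical selection rule: the attributes judged useful enough to keep.
# Anything not listed here (e.g. the referrer) is treated as noise and dropped.
SELECTED_FEATURES = ("os_version", "phone_model", "utc_offset_hours", "language")


def build_signature(raw_attributes: dict) -> tuple:
    """Return a fixed-order tuple of (feature, value) pairs.

    Strings are trimmed and lower-cased so that cosmetic differences do not
    produce different signatures. Missing features are recorded as None.
    """
    signature = []
    for name in SELECTED_FEATURES:
        value = raw_attributes.get(name)
        if isinstance(value, str):
            value = value.strip().lower()
        signature.append((name, value))
    return tuple(signature)


# Two requests from what is presumably the same device, with different noise.
first = build_signature(
    {"os_version": " iOS 9.3 ", "phone_model": "iPhone 6", "referrer": "a.example"}
)
second = build_signature(
    {"os_version": "ios 9.3", "phone_model": "iphone 6", "referrer": "b.example"}
)
```

Because the noisy `referrer` field is ignored and the strings are normalized, `first` and `second` come out identical, which is the point of keeping "the good ones and not the noise."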
LINKING A DEVICE TO A PERSON
All this talk of “entities” brings us to a rather subtle point about cross-device matching. In the Adelphic patent, explicitly, an entity can be a device, a person or a household. So in the same way a probabilistic signature for a device can be created from a weighted subset of attributes, a “signature” can be created for a device — or for people.
Why? It’s simple, really. The attributes we get are the only ones we’ve got. We can use them in a step (1) way to identify an unknown device without a deterministic ID in a sea of devices … OR we can use them in a step (2) way to try to link devices we have already identified as belonging to the same person (or household).
All of these IDs and attribute signatures are going into the master ID database. Even if a deterministic ID is available (like AdID), the database will include a signature based on probabilistic attributes. Why? Think about the app world. What if the person of interest pings you from the same device but a different app, one that doesn’t sell ads or otherwise lacks access to the AdID?
What if they use their browser to hit up your Bernese mountain dog tea cozy shop rather than your amazing BMD app? You’re going to be glad you maintain both deterministic and probabilistic device identifiers.
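A rough Python sketch of what "maintaining both" could look like follows. The `DeviceRecord` and `MasterIdDatabase` names are hypothetical, and the signature lookup is an exact match purely to keep the example short; a real system would score near-matches instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceRecord:
    device_key: str                 # internal key, hypothetical
    ad_id: Optional[str] = None     # deterministic ID, when the app shares one
    signature: tuple = ()           # probabilistic signature, always stored


class MasterIdDatabase:
    """Toy store indexed by both the deterministic ID and the signature."""

    def __init__(self) -> None:
        self._by_ad_id = {}
        self._by_signature = {}

    def add(self, record: DeviceRecord) -> None:
        if record.ad_id is not None:
            self._by_ad_id[record.ad_id] = record
        self._by_signature[record.signature] = record

    def find(self, ad_id: Optional[str] = None, signature: tuple = ()):
        # Prefer the deterministic ID; fall back to the signature without it.
        if ad_id is not None and ad_id in self._by_ad_id:
            return self._by_ad_id[ad_id]
        return self._by_signature.get(signature)


db = MasterIdDatabase()
sig = (("os_version", "android 6.0"), ("phone_model", "nexus 5x"))
db.add(DeviceRecord(device_key="dev-1", ad_id="adid-123", signature=sig))

from_app = db.find(ad_id="adid-123")   # the ad-selling app shares the AdID
from_browser = db.find(signature=sig)  # the browser visit has no AdID
```

Both lookups land on the same record, even though the second one never saw the AdID.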
Now, you have been very patient. Some of you have abandoned the ship and are well within sight of the cabana. To the rest, I say: It is time to discuss how to match a device to a person.
First, the system will do (1). It has the device. Next, step (2). Who is it? The system will try to find out if there is any personal deterministic data available. Data that can be linked to a person include phone number, email, customer ID. Usually, personal deterministic data is known only if the person has a relationship with the app or site owner or has provided it in the session.
The matcher takes the device ID and the deterministic person ID and sees if there is a match within its master ID database. If so, it will look up to see what it knows — e.g., that this person has been flagged by Target as a super-shopper to get massive deals now! or whatever.
If not, the matcher will try to see if it can match the device ID to someone it has in its master ID database some other way. And you all see this one a-coming … yes, it’s …
One of the more enjoyable sections of the Adelphic patent is its almost rapturous encomium to a concept called Record Linkage. This is not a school of thought encountered often in the digital marketing literature. I mention it here because it turns out to be a rather well-developed method to do exactly what we are trying to do: take two different sets of attributes and figure out if they actually belong to the same person.
The patent points to “A Theory for Record Linkage,” a seminal paper published in 1969. It started a line of development that’s cropping up here. Record Linkage (RL) encompasses both what we’ve called deterministic matching and probabilistic matching.
RL is described like this:
“[It] takes into account a wider range of potential identifiers, computing weights for each identifier based on its estimated ability to correctly identify a match or a non-match, and using these weights to calculate the probability that two given records refer to the same entity.”
In other words, probabilistic matching as described here has two steps:
1. Take all the available attributes and figure out which ones deserve more weight, depending on how well they identify people; and
2. Go through the master ID database full of signatures and figure out whether the particular device matches any of them
That’s a lot of “figuring out.” We can be more explicit. Step (1) here is a classic machine learning problem and can be done either on labeled or unlabeled data (i.e., records that we have already matched to people or ones that we haven’t). The preferred method mentioned is to take the master ID database and look at devices that have already been matched using deterministic methods (e.g., by email or phone no.).
The system can then look at all the various attribute data also captured with those devices and run machine learning algorithms to estimate the weights for different attributes. (If you’re interested, the specific algorithm mentioned is EM, or Expectation Maximization.)
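For the curious, here is a small Python sketch of the weighting idea. Note that it is not EM: it takes the simpler labeled-data shortcut, counting how often each attribute agrees among pairs already known to match versus pairs known not to, and turns those rates into the agreement and disagreement weights used in classic record linkage. The sample pairs, feature names and the `estimate_weights` helper are all made up for illustration.

```python
import math


def estimate_weights(pairs, features, smoothing=1.0):
    """Estimate per-feature weights from labeled record pairs.

    pairs: list of (agreement, is_match), where agreement maps each feature
    to True/False (did the two records agree on it?).
    Returns {feature: (agree_weight, disagree_weight)} in log2 units.
    """
    matches = [a for a, is_match in pairs if is_match]
    non_matches = [a for a, is_match in pairs if not is_match]
    weights = {}
    for f in features:
        # m: chance the feature agrees given a true match (smoothed).
        m = (sum(a[f] for a in matches) + smoothing) / (len(matches) + 2 * smoothing)
        # u: chance the feature agrees by coincidence on a non-match (smoothed).
        u = (sum(a[f] for a in non_matches) + smoothing) / (len(non_matches) + 2 * smoothing)
        weights[f] = (math.log2(m / u), math.log2((1 - m) / (1 - u)))
    return weights


# Made-up training pairs: IP usually agrees only for true matches, while
# OS version agrees almost everywhere and so tells us very little.
labeled_pairs = [
    ({"ip": True, "os_version": True}, True),
    ({"ip": True, "os_version": True}, True),
    ({"ip": True, "os_version": True}, True),
    ({"ip": False, "os_version": True}, True),
    ({"ip": False, "os_version": True}, False),
    ({"ip": False, "os_version": True}, False),
    ({"ip": False, "os_version": True}, False),
    ({"ip": False, "os_version": False}, False),
]
weights = estimate_weights(labeled_pairs, ["ip", "os_version"])
```

In this toy data, agreement on IP earns a much larger weight than agreement on OS version, which is the behavior you would want: common attributes that agree by coincidence should count for less.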
The output of step (1) is a “rank score function,” or a formula that can take the attributes on the unmatched device and bump them up against the device attributes in the master ID database (that is, for already-known devices) and calculate a score. This score is a number from 0 to 1, with 0 meaning definitely-no-match and 1 meaning oh-yes-match-baby. A higher number means a more probable match. This is step (2).
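The post does not give the actual formula, so the following Python sketch is only one plausible shape for such a function: a weighted average of per-feature similarities, which lands in the 0-to-1 range by construction. The weights and feature names are assumptions, and each distance is assumed to be pre-scaled to 0 (identical) through 1 (as different as possible).

```python
def rank_score(distances: dict, weights: dict) -> float:
    """Weighted average of (1 - distance) across features.

    distances: feature -> distance in [0, 1]
    weights:   feature -> non-negative importance (hypothetical values)
    """
    total = sum(weights.values())
    return sum(w * (1.0 - distances[f]) for f, w in weights.items()) / total


# Hypothetical weights from an earlier training run.
example_weights = {"location": 0.5, "time_of_day": 0.2, "ip": 0.3}

perfect = rank_score({"location": 0.0, "time_of_day": 0.0, "ip": 0.0}, example_weights)
nothing = rank_score({"location": 1.0, "time_of_day": 1.0, "ip": 1.0}, example_weights)
partial = rank_score({"location": 0.0, "time_of_day": 1.0, "ip": 0.0}, example_weights)
```

Here `perfect` is 1, `nothing` is 0, and `partial` sits at about 0.8, because the only mismatch is on the lowest-weighted feature.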
The process is described:
“… the system computes the distance of each feature against a subset of candidate matching records in the database. A matching rule takes the distances as input and makes a decision if the features are mapped to an existing entity identity in the database.”
Some of you may be wondering what this “distance” is, exactly. It is a calculation that varies depending on the data type. Numerical data can simply be subtracted and normalized. Strings can be compared to see how many characters match (e.g., OS versions). Other types of data like location require special sub-functions to handle. Obviously, features that match perfectly have no “distance” at all.
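As a sketch under my own assumptions (the patent does not specify these functions), the three kinds of distance could look like this in Python. Each returns a value from 0 to 1. The scale factors and the 50 km cap are arbitrary illustration values, and the string comparison uses the standard library's `difflib.SequenceMatcher` as a stand-in for "how many characters match."

```python
import math
from difflib import SequenceMatcher


def numeric_distance(a: float, b: float, scale: float) -> float:
    """Absolute difference, normalized by an assumed scale and capped at 1."""
    return min(abs(a - b) / scale, 1.0)


def string_distance(a: str, b: str) -> float:
    """1 minus the share of matching characters (e.g. for OS versions)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def location_distance(p: tuple, q: tuple, max_km: float = 50.0) -> float:
    """Great-circle (haversine) distance between (lat, lon) points, capped at 1."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    km = 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius
    return min(km / max_km, 1.0)


same_version = string_distance("9.3.2", "9.3.2")
close_version = string_distance("9.3.2", "9.3.5")
new_york, los_angeles = (40.7128, -74.0060), (34.0522, -118.2437)
far_apart = location_distance(new_york, los_angeles)
```

Identical values come back as 0, near-identical strings come back small, and two points on opposite coasts hit the cap of 1.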
A SIMPLE EXAMPLE
I’ll leave you today with an example. Let’s say the RL process has been run in the past and the output of the model was a scoring function. And let’s say it determined that the attributes that are the best predictors of two devices belonging to the same person are:
- Location (lat/long)
- Time of Day
- IP address
So a device shows up. It has an AdID but there is no match in the system. It passes its attributes and the system turns them into a “signature” and pings them up against the master ID database signatures, calculating a distance and determining a score against each. If there is one that achieves a high enough score to be considered a match, the AdID and new feature information are added to the existing match record.
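Here is that walk-through as a toy Python sketch. Everything specific is assumed for illustration: the weights, the 0.8 threshold, the distance functions, the record layout, and the sample data (the IP addresses come from ranges reserved for documentation).

```python
import math

WEIGHTS = {"location": 0.5, "hour": 0.2, "ip": 0.3}  # hypothetical model output
MATCH_THRESHOLD = 0.8                                # hypothetical cutoff


def location_distance(p, q, max_km=25.0):
    """Haversine distance between (lat, lon) points, scaled to 0..1."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return min(2 * 6371.0 * math.asin(math.sqrt(h)) / max_km, 1.0)


def hour_distance(a, b):
    """Circular distance between hours of the day (23 and 0 are neighbors)."""
    diff = abs(a - b) % 24
    return min(diff, 24 - diff) / 12.0


def ip_distance(a, b):
    """1 minus the share of leading IPv4 octets the two addresses have in common."""
    shared = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        shared += 1
    return 1.0 - shared / 4.0


def score(sig_a, sig_b):
    d = {
        "location": location_distance(sig_a["location"], sig_b["location"]),
        "hour": hour_distance(sig_a["hour"], sig_b["hour"]),
        "ip": ip_distance(sig_a["ip"], sig_b["ip"]),
    }
    return sum(WEIGHTS[f] * (1.0 - d[f]) for f in WEIGHTS) / sum(WEIGHTS.values())


def match_device(ad_id, signature, master_db):
    """Score against every record; on a match, attach the AdID and signature."""
    best_id, best_score = None, 0.0
    for person_id, record in master_db.items():
        s = max(score(signature, known) for known in record["signatures"])
        if s > best_score:
            best_id, best_score = person_id, s
    if best_id is not None and best_score >= MATCH_THRESHOLD:
        master_db[best_id]["ad_ids"].add(ad_id)
        master_db[best_id]["signatures"].append(signature)
        return best_id, best_score
    return None, best_score


master_db = {
    "person-1": {"ad_ids": {"adid-aaa"}, "signatures": [
        {"location": (40.7128, -74.0060), "hour": 21, "ip": "203.0.113.7"}]},
    "person-2": {"ad_ids": {"adid-bbb"}, "signatures": [
        {"location": (34.0522, -118.2437), "hour": 9, "ip": "198.51.100.20"}]},
}
new_signature = {"location": (40.7130, -74.0055), "hour": 22, "ip": "203.0.113.7"}
matched, matched_score = match_device("adid-new", new_signature, master_db)
```

The new device is a few dozen meters from person-1's known location, shows up an hour later in the evening, and shares the same IP, so it clears the threshold and its AdID is attached to person-1's record. A device with nothing in common with either record would come back unmatched.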
And there you have it: a probabilistic match.
Then the fun begins.