Oracle Data Cloud Blog

100% deterministic, 80% wrong: How deterministic data lies to you

This week’s guest blog post is contributed by Audrey Thompson, Senior Director Data Science, Oracle Data Cloud.

In the digital advertising ecosystem, there is a bit of an obsession with deterministic identity data. Advertisers keep demanding that platforms answer the question, "how much of your identity data is deterministic?" and platforms keep feeling the need to "out-deterministic" their competitors when pitching their solutions. Two assumptions underlie this line of questioning: (1) deterministic data is 100% accurate and (2) more deterministic data means better identity.

But wait … is that true? Is more deterministic data inherently better? Is deterministic data undeniably correct?

Spoiler alert: The answer to both questions is no.

Data observed together, at the same time, without the need to impute or infer is classified as deterministic. This can happen if someone completes a form with name, address, birthdate, phone number, and email. It can also happen if someone logs in to a website or app with an email address, creating a connection between a cookie and an email, or between a mobile advertising ID and an email. In the first case, someone wrote the information all at once. In the second case, the information is collected by the same piece of code at the same time.

Since the data was observed together, it is assumed that deterministic data is always 100% correct. But notice that I didn't say someone filled out the form with their own name, address, etc., or that they logged in using their own email. Here are several examples of how deterministic data can lie to you:

  • I use a friend's email address to log in to a website. Now, my cookie is linked to my friend’s email.
  • I borrow a friend's subscription login for Hulu, Netflix, HBO GO, or MLB.TV. Now, my device is attached to my friend’s email.
  • I'm forced to enter an email address on a website, but I enter a fake email like no@no.com. Now, my cookie is attached to a fake email, which may be attached to millions of other cookies where other people did the same thing.
  • I ship something to my parents’ house. My name and email are now attached to my parents’ home address, even though I don't live with them.
  • Without my knowledge, a website creates a value for my email address. It turns out the site does the same thing for millions of other cookies. Now, we’re all connected to each other and connected to the wrong email.

All of these scenarios are considered deterministic, and all create identity connections that are 100% wrong and useless. (Gasp! But it’s deterministic—it must be right!)
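One simple sanity check that follows from the fake-email scenario is to count how many distinct cookies each email is linked to. The sketch below is only an illustration: the observations, the function names, and the `max_cookies` threshold are all hypothetical, and it is not a description of how any particular identity provider cleans its data.

```python
from collections import defaultdict

# Hypothetical login observations: (cookie_id, email) pairs seen together.
observations = [
    ("cookie_a", "me@example.com"),
    ("cookie_b", "no@no.com"),
    ("cookie_c", "no@no.com"),
    ("cookie_d", "no@no.com"),
    ("cookie_e", "friend@example.com"),
]

def cookies_per_email(pairs):
    """Group the distinct cookies observed with each email."""
    grouped = defaultdict(set)
    for cookie, email in pairs:
        grouped[email].add(cookie)
    return grouped

def flag_suspect_emails(pairs, max_cookies=2):
    """Return emails linked to more distinct cookies than max_cookies.

    max_cookies is an illustrative threshold, not a recommended value.
    """
    grouped = cookies_per_email(pairs)
    return sorted(email for email, cookies in grouped.items()
                  if len(cookies) > max_cookies)

print(flag_suspect_emails(observations))  # ['no@no.com']
```

A count like this only catches emails shared by many cookies. It would not catch a borrowed login or a package shipped to a parent's address, because those links look perfectly ordinary in isolation.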

To make matters worse, some identity providers simply join deterministic data together to produce a reconciled identity. If we did that with the data above, my cookie and my mobile device would be attached not only to me, but also to my friend, my parents, and any number of other people who creatively made up the email no@no.com. With this method, companies claiming to be 100% deterministic end up producing new links between identifiers that were never observed together, yet the result is still shopped around as an entirely deterministic solution. And the resulting quality isn’t good: Oracle Data Cloud evaluated a deterministic graph and found that 80% of the deterministic pairs were incorrectly connected and, in one case, a single email was deterministically linked to 2.3 million cookies.
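To see how a plain join manufactures links that were never observed, here is a minimal sketch that merges observed pairs into connected components with a union-find structure. The identifiers and the `join_pairs` helper are made up for illustration, and real identity graphs are built in ways this toy example does not attempt to reproduce.

```python
def find(parent, x):
    """Return the root of x's group, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def join_pairs(pairs):
    """Merge every observed pair into groups (connected components)."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())

# Hypothetical deterministic observations, each seen together at one time.
observed = [
    ("my_cookie", "me@example.com"),         # correct
    ("my_phone", "me@example.com"),          # correct
    ("my_cookie", "friend@example.com"),     # I used a friend's login
    ("friend_phone", "friend@example.com"),  # the friend's own device
    ("my_phone", "no@no.com"),               # I typed a fake email
    ("stranger_cookie", "no@no.com"),        # a stranger typed the same one
]

groups = join_pairs(observed)
print(len(groups), len(groups[0]))  # 1 7
```

All four devices and all three emails collapse into a single "person". The stranger's cookie and my friend's phone now sit in the same group even though no observation ever contained both.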

An advertiser is wise to be skeptical of a graph built on 100% deterministic data without a method to evaluate the correctness of that data and remove inaccuracies. While it's convenient to use deterministic vs. probabilistic as the ultimate indicator of quality, that distinction has no actual bearing on the correctness of the data. In the end, more deterministic data does not mean better identity. Marketers should demand that the data be evaluated for correctness, no matter how it is sourced.

In a world of people-based marketing, poor identity can sabotage even the most perfectly executed campaign. Save your marketing dollars, get identity right, and recognize that using deterministic data alone is not good enough.

At Oracle Data Cloud, we don’t take deterministic data at face value. We go to great lengths to identify and scrub deterministic data anomalies like the ones outlined above. Stay tuned for part 2 of this post where we discuss the concept of probabilistic data, along with how Oracle Data Cloud evaluates all identity data with data science to ensure an accurate and defensible view of identity is used throughout our products and services.

About Audrey Thompson

Audrey leads the Identity Data Science team at Oracle Data Cloud. Her team is responsible for construction of the Oracle Identity Graph by starting with data at scale, evaluating it for quality, grounding it in reality, and reconciling universally—all while respecting privacy. The result is a sense of a person and all their devices for use by marketers to reach the right person, with the right ad, on the right device, at the right time.

She has worked in the digital marketing world for 6 years with experience constructing data science products for audience, measurement, optimization, and identity.

Comments (2)
  • Mateusz Tuesday, September 4, 2018
    Hello Audrey,
    thanks for the article! The subject is really interesting, but it's not easy to derive valuable insights from it, because some important facts are lacking. Let me list them.

    1. You don't mention which deterministic graph you evaluated, where the data comes from, what the time range is, how many IDs/connections it has, etc.

    2. (Related to 1.) You don't mention how you obtained the "true ground truth" information which enabled you to verify the deterministic information contained in the graph you evaluated.

    3. You mention deterministic pairs. As I understand, a user is a set of device identifiers connected to a particular e-mail address, and all these identifiers are connected to each other, i.e. they form a clique.

    It's obvious to anyone working with graphs that the number of pairs grows quadratically with the size of a user. You mentioned that the graph contained a user with 2.3M identifiers. That's actually 2 644 998 850 000 pairs. To get the same number of connections with users of size, say, 15, you would need 25 190 465 238 such users (vs 1 user of size 2.3M). Computing things that way, it's no surprise that most of the connections are wrong, but I don't believe deterministic graph providers don't cleanse their data, as it's obvious that no web data is clean right away. No one would buy a deterministic graph with a user that has 2.3M identifiers. On the other hand, it could easily turn out that if you removed, e.g., the largest 0.1% of users from your graph, the proportion would change so that 98% of the deterministic connections would actually be correct.

    Taking the above into account, the claim that most of the deterministic data is wrong is pretty bold.
  • George C Monday, September 17, 2018
    Great article! You always expect some data to be fake, but 80% sounds astonishing.

    Can't wait for part 2!
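
The pair counts quoted in the first comment can be checked with a few lines of Python. This is just the arithmetic, assuming (as the commenter does) that every identifier attached to one email is connected to every other, so that a user with n identifiers contributes n(n-1)/2 pairs. The function name is hypothetical.

```python
def pair_count(n):
    """Number of unordered pairs in a clique of n identifiers: n(n-1)/2."""
    return n * (n - 1) // 2

big_user_pairs = pair_count(2_300_000)  # one user with 2.3M identifiers
small_user_pairs = pair_count(15)       # one user with 15 identifiers

print(big_user_pairs)                      # 2644998850000
print(small_user_pairs)                    # 105
print(big_user_pairs // small_user_pairs)  # 25190465238
```

The figures match the comment: a single oversized user can contribute as many pairs as roughly 25 billion users of size 15, which is why a pair-level error rate is so sensitive to a handful of outliers.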