This week’s guest blog post is contributed by Jonathan Seidner, Senior Director, Engineering, Oracle Data Cloud.
In part one of this post, we defined “truth set” in the context of evaluating device maps, and discussed the importance of avoiding duplicated pairs and how a small truth set can be very misleading.
In this second half, we take a look at the problems that can be caused by biased and “dirty” truth sets.
A truth set can be biased in two ways: (1) The truth set is not representative of the general population. (2) The truth set being used for evaluation is not completely independent of the truth set used to generate the device map being evaluated. Let’s examine each of these situations.
If the truth set does not represent the general population, one might very well conclude that the performance is better (or worse) than it really is.
In other words, the device map's performance with regard to the general population might be very different from the performance measured using the truth set.
As seen in the following chart, for example, the overall device map does a good job of identifying correct device matches (above the dashed line) and non-matches (below the dashed line), but because the truth set is not a good representation of the general population, the device map’s performance within this limited range appears quite poor.
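The effect is easy to demonstrate with a small simulation. The sketch below is purely illustrative: it assumes hypothetical match scores, with true matches scoring above a threshold (the dashed line in the chart) and non-matches below it, plus a small ambiguous band near the threshold. A truth set drawn only from that ambiguous band reports far worse accuracy than the device map achieves overall.

```python
import random

random.seed(7)

THRESHOLD = 0.5  # the "dashed line": scores above it are predicted matches

# Hypothetical population: clear matches, clear non-matches, and a small
# ambiguous band near the threshold where the map is essentially guessing.
population = [(random.uniform(0.6, 1.0), True) for _ in range(5000)]
population += [(random.uniform(0.0, 0.4), False) for _ in range(5000)]
population += [(random.uniform(0.4, 0.6), random.random() < 0.5)
               for _ in range(1000)]

def accuracy(pairs):
    """Fraction of pairs where the thresholded score agrees with the truth."""
    correct = sum((score >= THRESHOLD) == is_match for score, is_match in pairs)
    return correct / len(pairs)

overall = accuracy(population)

# A biased truth set sampled only from the narrow range near the threshold.
biased = [p for p in population if 0.4 <= p[0] <= 0.6]
biased_acc = accuracy(biased)

print(f"overall accuracy: {overall:.2f}, biased-sample accuracy: {biased_acc:.2f}")
```

The map performs well on the population as a whole, yet the biased truth set makes it look barely better than a coin flip, which is exactly the distortion the chart depicts.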
Similarly, it is essential to verify that the truth set used for an evaluation has not been used by the device map provider to train its models.
While this can be challenging in certain situations, it is crucial for obtaining unbiased measurements. Otherwise, some degree of overfitting will occur, rendering the measurements worthless.
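One practical safeguard, when both datasets are available, is to verify that no device pair in the evaluation truth set also appears in the provider's training data. A minimal sketch, assuming each truth set is simply a collection of device-ID pairs (all names here are illustrative):

```python
def normalize(pair):
    """Treat (X, Y) and (Y, X) as the same pair."""
    return tuple(sorted(pair))

def overlap(eval_pairs, training_pairs):
    """Return pairs that leak from training into the evaluation truth set."""
    eval_set = {normalize(p) for p in eval_pairs}
    train_set = {normalize(p) for p in training_pairs}
    return eval_set & train_set

training = [("dev_a", "dev_b"), ("dev_c", "dev_d")]
evaluation = [("dev_b", "dev_a"), ("dev_e", "dev_f")]

leaked = overlap(evaluation, training)
print(leaked)  # the ("dev_a", "dev_b") pair appears in both sets
```

Any leaked pairs should be removed from the evaluation set (or the evaluation redesigned) before drawing conclusions.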
No truth set is 100% correct. When using a truth set for evaluation, it is essential to clean the dataset as thoroughly as possible.
Some examples of “dirty” truth sets are those which contain users with an inordinate number of devices or those which contain shared devices, i.e., in which a specific device is associated with more than one user.
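Both of these conditions can be filtered mechanically. The following is a hedged sketch, assuming the truth set is a list of (user_id, device_id) records; the device-count threshold is illustrative and should be tuned to the dataset:

```python
from collections import Counter

MAX_DEVICES_PER_USER = 10  # illustrative cutoff for "inordinate" device counts

def clean(records):
    """Drop users with too many devices and devices shared by multiple users."""
    devices_per_user = Counter(user for user, _ in records)
    users_per_device = Counter(device for _, device in records)
    return [
        (user, device)
        for user, device in records
        if devices_per_user[user] <= MAX_DEVICES_PER_USER  # likely bots/test users
        and users_per_device[device] == 1                  # shared devices are ambiguous
    ]

records = [("u1", "d1"), ("u1", "d2"), ("u2", "d2"), ("u3", "d3")]
print(clean(records))  # d2 is shared by u1 and u2, so both of its records are dropped
```

Note that a shared device is removed for every user associated with it, since the pair's "truth" is ambiguous in both directions.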
Another logical problem one can encounter arises when the truth set is compiled from multiple sources that do not share a consistent, universal identifier for the individual with whom the devices are associated.
In such a case, if source A indicates that the user ID for device X is 0120 and source B indicates that the user ID for device Y is 9887, it does not necessarily mean that devices X and Y do not belong to the same person.
In fact, it could very well be the case that they do belong to the same person, but that each data source associated its device with a different identifier. This occurs because the user ID "space" differs between the two truth set sources.
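A safe way to handle this is to tag every user ID with its source and only derive non-match labels from pairs within the same ID space. A minimal sketch, reusing the hypothetical IDs above:

```python
def can_label_nonmatch(rec1, rec2):
    """A pair can be labeled a non-match only if both user IDs come from the
    same source and differ; cross-source ID differences are inconclusive."""
    (src1, uid1), (src2, uid2) = rec1, rec2
    if src1 != src2:
        return False  # different ID spaces: differing IDs prove nothing
    return uid1 != uid2

print(can_label_nonmatch(("A", "0120"), ("B", "9887")))  # False: inconclusive
print(can_label_nonmatch(("A", "0120"), ("A", "0456")))  # True: valid non-match
```

Cross-source pairs can still contribute non-match labels, but only after an explicit identity-resolution step maps both sources into a common identifier space.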
These and other anomalies need to be resolved for the truth set to be considered “clean” and safely used for evaluating a device map.
When it comes to evaluating probabilistic device maps using a truth set, there are a number of potential pitfalls that can render the results misleading. It is important to be aware of all the relevant technical issues and to make sure that each one is addressed before reaching conclusions about the performance of a particular device map.
If you are considering evaluating a third-party device map, we would be happy to help you ensure that your measurements are accurate and unbiased. Our expertise in this field is unmatched; we have helped many companies correctly perform this kind of analysis.
Try the Oracle Crosswise complete device map for one month, updated weekly, in your own systems. Sign up to get a free month of Oracle Crosswise.
Stay up to date with all the latest in data-driven news by following @OracleDataCloud on Twitter and Facebook! Need data-related answers for your next marketing campaign or client partner? Contact The Data Hotline today. (What's The Data Hotline?)
Image: ESB Basic/Shutterstock