We are starting to see a significant change in the way we analyze data as a result of the growth of interest in big data and the newer concept of the Internet of Things. Ever since databases were first created, everyone has been obsessed, quite rightly so, with ensuring queries return the correct answer - i.e. precise, accurate answers. This key requirement is derived from the need to run operational, transactional applications. If we check our bank balance online we want the figure we see to be accurate right down to the last cent, and for good reason. Yet increasingly, in both our online and offline experiences, we deal with numbers that are not 100% accurate, and somehow we manage to make good use of these approximate answers. Here are a couple of examples of where we are already using approximations: route planning on our smartphones and crowd-counting figures in newspapers.
If I need to get to San Francisco airport from our offices in Redwood Shores to catch a flight then I could use any one of a large number of route planners. My personal favorite at the moment is Waze, purely and simply because it is relatively new and I like the icons it uses. Punch in my starting point and destination and Waze will offer me three different routes with three different journey times. I can see other users along my route, and their data is used to update the details about my route: traffic flow levels and the time taken between specific points.
What this means is that the time shown for each route is an estimate based on the information available at that time. Additional information along each route is provided using colour codes.
This information feeds into the calculation of the Estimated Time of Arrival (ETA). Now the critical point here is that the ETA is, as the first letter suggests, an "estimate". What is interesting is that Waze does not tell me how accurate, or how confident, it is about each estimate it has calculated. The question is: do I care? If I select the first route and take 101 Northbound, does it really matter if it takes me 14 mins or 18 mins? I can guarantee you that I build in some slippage time because I frequently travel this route and I sort of know what the traffic flow is like during the day - I can see 101 from the seventh floor of the 400 Building, so a quick visual check is relatively easy.
What would I do if Waze returned a time of 6 hours for the 101 Northbound route? First I would look for alternatives, and if they were also showing excessively long journey times then I might do some web searches to see if there was an incident affecting the traffic. I might then check the Caltrain times and decide to go by train to the airport instead.
At the end of the day, Waze provides me with a good approximation of how long it will take me to get to SFO from Redwood Shores - it is most definitely not giving a down-to-the-second precise figure for travel time. However, even with an approximation I can plan out the rest of my day up to the point when I need to leave for the airport. If, when I am about to leave, the travel times are looking a little too long, then I will look for alternative ways of getting to my destination.
This approach of using approximations applies to a whole series of different activities within our daily lives. The Caltrain station shows the expected time of the next train and keeps updating it right up until the train actually arrives at the station. When news feeds report the number of people at large-scale public events, even using video/CCTV surveillance, they always provide rounded numbers: "over 5,000 people attended today's rally in…". They don't report that 5,732 people attended an event. The surveillance images and videos will pick up everyone at the event; the issue is the time it takes to identify and mark each individual person. In the days before big data processing, the public authorities would simply overlay a grid of squares, each equating to a particular number of people, and count how many complete squares could be applied to a specific picture. There is an interesting article on "crowd counting" here: http://www.popularmechanics.com/science/a7121/the-curious-science-of-counting-a-crowd/
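The grid method described above is simple enough to sketch in a few lines. All the numbers below are purely illustrative; they are not taken from any real event.

```python
# A rough sketch of the pre-big-data grid method described above:
# overlay a grid on the photograph, assume each complete square holds
# roughly a fixed number of people, and multiply. The density per
# square is an assumption the counter has to supply.
def crowd_estimate(complete_squares, people_per_square):
    return complete_squares * people_per_square

print(crowd_estimate(57, 100))  # prints 5700 -> reported as "over 5,000"
```

Note how the answer is naturally rounded for reporting - nobody would claim the grid method yields an exact headcount.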
Okay, but what has all this got to do with analytics?
Approximations are actually everywhere, yet we rarely use them in business. Analytics is seen as dealing purely with precise, 100% accurate figures. In many cases you genuinely do not have an option: the Securities and Exchange Commission does not allow companies to post approximations of their accounts, even if you could provide "confidence levels" and "error rates" for the data feeding into key reports such as the P&L, cash flow and balance sheet. So are there real-world business scenarios where approximate answers are good enough? In my opinion there are quite a few use cases where approximations can speed up the decision-making process:
1) Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of exploratory analytics, getting an approximate answer within a second is far better than waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait more than a couple of seconds for the answer to their query, their level of analytical thinking can be seriously impaired.
2) Market testing: the most common use case for market testing is around serving ads on websites. This is where two variants of a specific ad (each with slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed. Using an approximation of the number of click-throughs is perfectly acceptable. This is similar to the crowd-counting problem, where it is not really necessary to report exactly how many people joined a rally or turned up to an event.
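To see why approximate counts are good enough here, consider a small sketch. The click and impression counts are made up, and I apply a worst-case ±3% error (the kind of error rate discussed later in this post) to every count before comparing conversion rates:

```python
# Hypothetical click and impression counts for two ad variants.
ads = {"A": (1200, 48000), "B": (1450, 47500)}  # variant -> (clicks, impressions)

def rate_bounds(clicks, impressions, err=0.03):
    """Worst-case bounds on the conversion rate if both counts are
    approximate to within +/- err. Pessimistically push clicks and
    impressions in opposite directions."""
    low = (clicks * (1 - err)) / (impressions * (1 + err))
    high = (clicks * (1 + err)) / (impressions * (1 - err))
    return low, high

bounds = {name: rate_bounds(c, n) for name, (c, n) in ads.items()}
# Even in the worst case, variant B's lowest possible rate beats A's highest:
print(bounds["A"][1] < bounds["B"][0])  # prints True
```

In other words, as long as the gap between the two variants is wider than the error band, the approximate counts identify the same winner as exact counts would.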
3) Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where the results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision about whether or not to continue with a specific line of analysis. Of course, you need to incorporate the likelihood of edge cases within your thinking, because there is a danger that edge values will get lost within the general hashing process. This comes back to understanding your data, in much the same way you can evaluate the alternative routes offered up by Waze.
There are many other examples of using approximations as part of the analytical process. What is important to realize is that approximations can be very useful within that process. However, many people will raise issues with using approximate results, such as: "surely an approximate answer is just plain wrong?"
I have heard this mentioned quite a few times, even by Oracle Ace Directors, following the release of approx_count_distinct in Database 12c. The implication is that an approximate answer is factually incorrect. My assumption is that most people (including Ace Directors) believe the database is not processing the complete data set, and this seems to be a very common misconception. At a recent conference someone asked whether optimizer statistics needed to be up to date for a table before using approx_count_distinct. For the record: the answer is NO!
Approximate processing does not use sampling. When computing an approximation of the number of distinct values within a data set, the database does process every value in the specified column. The mathematical approach we have used for approx_count_distinct is called HyperLogLog (HLL). I will let you Google this term if you want more information, but here is a quick summary:
The HyperLogLog algorithm uses a randomization technique. The randomization is achieved by applying a hash function to every value in the column. The algorithm observes the maximal number of trailing zeros among all the hash values. It is easy to see that in random data a sequence of n zero bits will occur once in every 2^n elements on average. Therefore, if the maximal number of trailing zeros we have seen is n, we estimate the NDV to be 2^n.
This is not a great estimator on its own, though: at best it can give us a power-of-two estimate of the NDV. To reduce the large variance, stochastic averaging is used. After we apply the hash function to each value, we use part of the output to split values into one of many buckets. Supposing we want 1024 buckets, we can take the lower 10 bits of the hash value as a bucket number, and use the remainder of the hash value to count trailing zeros.
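The two steps above - trailing-zero counting plus stochastic averaging across buckets - can be sketched in a few lines of Python. This is an illustration of the published HyperLogLog technique, not Oracle's internal implementation; the hash function and the bias-correction constant are my own choices for the sketch.

```python
import hashlib

def hll_estimate(values, b=10):
    """Sketch of a HyperLogLog NDV estimator. b=10 gives 1024 buckets,
    matching the example in the text above."""
    m = 1 << b
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (md5 is just a convenient stand-in)
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        bucket = h & (m - 1)   # lower b bits choose one of the m buckets
        w = h >> b             # remaining bits supply the trailing-zero pattern
        # rank = number of trailing zeros + 1 (position of the lowest 1-bit)
        rank = 1
        while w & 1 == 0 and rank < 64 - b:
            w >>= 1
            rank += 1
        registers[bucket] = max(registers[bucket], rank)
    # combine the buckets with a bias-corrected harmonic mean
    alpha = 0.7213 / (1 + 1.079 / m)
    return int(alpha * m * m / sum(2.0 ** -r for r in registers))

print(hll_estimate(range(100000)))  # typically within a few % of 100,000
```

Notice that duplicates cost nothing: hashing the same value twice updates the same register to the same rank, which is exactly why counting distinct values this way is so cheap on memory.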
A natural question that arises is how accurate is an approximate answer?
At this point we need to introduce some additional terminology that you need to get comfortable using: confidence interval and error rate. It is these terms, I expect, that are creating the impression in many people's minds that an approximate answer is an incorrect answer. In reality, what we are saying is that the answer lies within a range of values. There is a specific formula for computing the error rate when using HLL, which depends on the number of values being counted and the number of hash buckets used. In broad terms, during our testing the accuracy of the estimated number of distinct values versus the actual number of distinct values has been in excess of 97% with 95% confidence. In other words, we are 95% confident that the approximate result is within +3% and -3% of the exact answer.
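For the standard HyperLogLog algorithm the error rate has a simple closed form: the relative standard error is roughly 1.04/√m, where m is the number of hash buckets. With the 1024 buckets mentioned earlier this works out to about 3%, which lines up with the figures above. Treat this as a back-of-envelope check from the published HLL analysis, not as Oracle's internal formula.

```python
import math

def hll_relative_error(buckets):
    # Standard error of the HyperLogLog estimator: ~1.04 / sqrt(m)
    return 1.04 / math.sqrt(buckets)

print(round(hll_relative_error(1024), 4))  # prints 0.0325, i.e. about +/-3%
```

Doubling the accuracy therefore costs four times the buckets - the usual square-root trade-off between memory and precision.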
If we counted the number of distinct sessions within a weblog for a given period of time and approx_count_distinct returned a value of 36,383, then we can conclude that we are 95% confident that the actual value lies between 35,291 and 37,474. Business users who have adopted a six-sigma, data-driven approach and methodology are used to these two concepts. There is an excellent description of the two important concepts on the iSixSigma site: Margin of Error and Confidence Levels Made Simple.
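The weblog example is easy to reproduce: just apply the ±3% bound to the returned estimate.

```python
def error_bounds(estimate, err=0.03):
    """Interval implied by a +/-3% error rate at 95% confidence,
    as described above. Numbers match the weblog example."""
    return int(estimate * (1 - err)), int(estimate * (1 + err))

print(error_bounds(36383))  # prints (35291, 37474)
```

Remember that the confidence is probabilistic: roughly one such interval in twenty will still miss the true value.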
Currently for approx_count_distinct, we do not allow you to specify an error bound and we do not expose the error estimate. The reason lies in the algorithm approx_count_distinct uses to compute its result: the HyperLogLog methodology gives a probabilistic error, which means there is no hard guarantee that the error is under a certain number. The best we can say is that with x% probability (i.e. the confidence level) the estimate is within [actual − ε, actual + ε].
In reality we all use approximations in our everyday lives: from satnavs to surveys in newspapers and magazines. It’s a relatively small step to accept the same approach in relation to a range of business problems. In the title for this post I posed the question: Is an approximate answer just plain wrong?
The answer is obviously NO! Approximate answers are just different. They definitely have a valuable place within data warehouse and big data projects, and as part of a wider business intelligence context. They are a faster, more efficient way to analyze big data, and it makes sense to embrace approximate analytics within your projects. So get using approx_count_distinct - you know it makes sense!