Data lakes have been a key aspect of the big data landscape for years now. They provide somewhere to capture and manage all kinds of new data that potentially enable new and exciting use cases. You can read more about what a data lake is or listen to a webcast about building a data lake in the cloud.
But maybe the key word in that first paragraph is “potentially”. Because to realize that value you need to understand the new data you have, explore it and query it interactively so you can form and test hypotheses.
Interactive data lake query at scale is not easy. In this article, we're going to look at some of the problems you need to overcome to make full, productive use of all your data. Those problems are also why Oracle acquired SparklineData, to help address interactive data lake query at scale. More on that at the end of this article.
Hadoop has been the default platform for data lakes for a while, but it was originally designed for batch rather than interactive work. The development of Apache Spark™ offered a new approach to interactive queries, because Spark is a modern distributed compute platform that is one to two orders of magnitude faster than Hadoop with MapReduce. Replace HDFS with object storage (Oracle Object Storage, Amazon S3, or Microsoft Azure Blob Storage) and you've got the foundation for a modern data lake that can potentially deliver interactive query at scale.
OK, I said "potentially" again. Because even though you've now got a modern data lake, there are some other issues that make interactive query of multi-dimensional data at scale very hard:

- Query performance, especially with many concurrent users
- The tradeoffs of extracts and pre-aggregation
- Scaling out to terabyte-sized datasets
- Elasticity, so you can scale back down as well as up
- Supporting multiple tools and roles on the same data
Let’s look at each one of these in turn.
Interactive queries need fast response times. Users need “think speed analysis” as they navigate worksheets and dashboards. However, performance gets worse when many users try to access datasets in the data lake at the same time. Further, joins between fact and dimension tables can cause additional performance bottlenecks. Many tools have resorted to building an in-memory layer but this approach alone is insufficient. Which leads to the second problem.
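To see why those joins hurt, here is a minimal toy sketch (with hypothetical data and names, not SparklineData's implementation) of what every dashboard query against a star schema has to do before it can filter or group by a business attribute:

```python
# Toy star schema (hypothetical data): a fact table of sales and a
# dimension table of stores. Grouping sales by region requires joining
# every fact row to its dimension row first.
fact_sales = [  # fact table: one row per sale
    {"store_id": 1, "amount": 120.0},
    {"store_id": 2, "amount": 75.0},
    {"store_id": 1, "amount": 60.0},
]
dim_stores = {  # dimension table keyed by store_id
    1: {"region": "West"},
    2: {"region": "East"},
}

def sales_by_region(fact, dim):
    """Join each fact row to its dimension row, then aggregate by region."""
    totals = {}
    for row in fact:
        region = dim[row["store_id"]]["region"]  # the join lookup
        totals[region] = totals.get(region, 0.0) + row["amount"]
    return totals

print(sales_by_region(fact_sales, dim_stores))  # {'West': 180.0, 'East': 75.0}
```

In memory with three rows this is trivial, but at data lake scale the fact table has billions of rows spread across many machines, so the same lookup becomes a distributed shuffle. That is why many concurrent users running join-heavy dashboard queries degrade response times so quickly.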
Another way to address performance is to extract data from the lake and pre-aggregate it. OLAP cubes, extracts and materialized pre-aggregated tables have been used for a while to facilitate the analysis of multi-dimensional data. But there’s a tradeoff here. This kind of pre-aggregation supports dashboards or reporting, but it is not what you want for more ad-hoc querying. Key information behind the higher-level summaries is not available. It’s like zooming into a digital photograph and getting a pixelated view that obscures the details. What you want is access to all the original data so you can zoom in and look around at whatever you need. Take a look at this more detailed explanation about pre-aggregating data for deep analysis.
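The "pixelated photograph" problem is easy to demonstrate with a toy sketch (hypothetical data, not any particular BI tool's extract format): once raw events are rolled up to a daily summary, any question below that grain can only be answered by going back to the raw data.

```python
# Toy illustration (hypothetical data): pre-aggregating raw events to a
# daily summary speeds up dashboards, but discards the detail underneath.
raw_events = [
    {"day": "2019-03-01", "product": "A", "units": 10},
    {"day": "2019-03-01", "product": "B", "units": 2},
    {"day": "2019-03-02", "product": "A", "units": 1},
    {"day": "2019-03-02", "product": "B", "units": 11},
]

# Pre-aggregated summary: total units per day. Product detail is gone.
daily_totals = {}
for e in raw_events:
    daily_totals[e["day"]] = daily_totals.get(e["day"], 0) + e["units"]
print(daily_totals)  # {'2019-03-01': 12, '2019-03-02': 12}

# Both days look identical in the summary, yet different products drove
# them. Answering "which product drove each day?" needs the raw events:
top_product = {}
for e in raw_events:
    day = e["day"]
    if day not in top_product or e["units"] > top_product[day][1]:
        top_product[day] = (e["product"], e["units"])
print(top_product)  # {'2019-03-01': ('A', 10), '2019-03-02': ('B', 11)}
```

If only the summary table was extracted from the lake, the second question simply cannot be answered, which is exactly the tradeoff the paragraph above describes.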
Data lakes can grow quite large, and sooner or later you're going to need to do analysis on terabytes, rather than gigabytes, of data at a time. Scaling out to this magnitude is a stress test that plenty of tools fail, because they lack the natively distributed compute architecture that a framework like Spark brings to operating at this scale.
Scaling out successfully is part of the problem. But you also need to scale back down again. In other words, you need an elastic environment, because your workload is going to vary over time in response to anything from the sudden availability of a new data set to the need to analyze a recently-completed campaign or the requirement to support a monthly dashboard update. Elasticity is partly a function of having a modern data lake where compute and storage can scale independently. But elasticity also requires that tools using the data lake have the kind of distributed architecture needed to address scale out.
Finally, getting the most out of your data is not a job for one person or even one role. You need input from data scientists as well as business analysts, and they will each bring their requirements for different tools. You want all the tools to be able to operate on the same data and not have to do unique preparations for each different tool.
Oracle acquired SparklineData last week, and we're excited because Sparkline SNAP has some innovative solutions to these problems.
We’re looking forward to integrating Sparkline SNAP into Oracle’s own data lake and analytics solutions and making it available to our customers as soon as possible.
So when would you want to use this technology? There are lots of use cases, but here are three to think about:
1. Data from machines and click streams is event/time-series data that can quickly grow in size and complexity. Providing ad-hoc, interactive query performance to BI tools connecting live to multi-terabyte data of this kind is impossible with current data lake infrastructures. Sparkline SNAP is designed to operate on and analyze such large data sets in place on the data lake, without the need to move and summarize the data for performance.
2. Perhaps all the data you want to work with isn’t currently in a data lake at all. If you have ERP data in multiple different applications and data stores, doing an integrated analysis is a nigh-on impossible task. But if you move it all into object storage and make it accessible to Sparkline SNAP, you can do ad hoc queries as you need, whether the original data came from a single source or from 60 different ones.
3. Finally, maybe you’re already struggling with all the extracts and pre-aggregation needed to support your current in-memory BI tool. With Sparkline SNAP you can dispense with all that and work on live data at any level of granularity. So not only can you save the time and effort of preparing the data, you can do a better analysis anyway. There’s more information in this article on pre-aggregating data for deep analysis.
If you’d like to get started with a data lake, then check out this guided trial. In just a few hours you’ll have a functioning data lake, populated with data and incorporating visualizations and machine learning.