Over the last 18 months or so, while Oracle Big Data Discovery was in the early stages of design and development, I got to ask lots of customers and prospects some fundamental questions about why they struggle to get analytic value from Hadoop. After a while, common patterns started to emerge, and those patterns ultimately became the basis of the design for Big Data Discovery. So here’s what we learned.
1. Data in Hadoop is not typically ‘ready’ for analytics. The beauty of Hadoop is that you just put raw files into it and worry about how to unpack them later on. This is what people mean when they say Hadoop is "schema on read". This is both good and bad. On the one hand it’s easy to capture data, but on the other, it takes more effort to evaluate and understand that data later on. There is usually a ton of manual intervention required before the data is ready to be analyzed. Data in Hadoop typically flows in from new and emerging sources like social media, web logs and mobile devices. It is unstructured and raw, not clean, nicely organized and well governed like it is in the data warehouse.
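To make the "schema on read" idea concrete, here is a minimal sketch in Python. The raw lines are stored exactly as they arrived, and a structure (field names and types) is imposed only at the moment of analysis. The log format and field names here are hypothetical examples, not from any particular product.

```python
# A minimal sketch of "schema on read": raw lines land in storage as-is,
# and a schema is applied only when the data is read for analysis.
# The log format and field names below are hypothetical.

raw_lines = [
    "2015-03-01T10:15:00 GET /products 200",
    "2015-03-01T10:15:02 POST /cart 500",
]

def parse_log_line(line):
    """Impose a schema at read time: split a raw line into named, typed fields."""
    timestamp, method, path, status = line.split()
    return {
        "timestamp": timestamp,
        "method": method,
        "path": path,
        "status": int(status),  # types are assigned on read, not on write
    }

records = [parse_log_line(line) for line in raw_lines]
print(records[1]["status"])  # → 500
```

The flexibility is that nothing stops you from writing a second parser over the same raw lines tomorrow; the cost is that every analysis must first do this parsing and cleanup work, which is exactly the manual intervention described above.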
2. Existing BI and data discovery tools fall short. We can't blame them, because they were never designed for Hadoop. How can tools that speak ‘structured’ query language (SQL) be expected to talk to unstructured data in Hadoop? For example, how do they extract value from the text in a blog post, or from the notes a physician makes after evaluating a patient? BI tools don't help us find interesting data sets to start working with in the first place. They don't provide profiling capabilities to help us understand the shape, quality and overall potential of data before we start working with it. And when we need to change and enrich the data, we have to bring in IT resources and ETL tools. Sure, BI tools are great at helping us visualize and interact with data, but only when the data is ready… and (as we outlined above) data in Hadoop usually isn’t.
3. Emerging tools are point solutions. As a result of the above challenges, we have seen a ton of excitement and investment in new Hadoop-native tooling from various startups, too numerous to mention here. We are tracking tools for cataloging and governing the data lake, profiling tools that help users understand new data sets in Hadoop, data wrangling tools that let end users change data directly in Hadoop, and a ton of analytic and data visualization products that expose new insights and patterns. An exciting space for sure, but the problem is that in addition to being new (and possibly gone next month), these tools each cover only one or two aspects of the big data discovery lifecycle. No single product lets us find data in Hadoop and turn it into actionable insight with any kind of agility. Organizations can't be expected to buy a whole collection of immature, non-integrated tools, then ask their analysts to learn them all and ask IT to figure out how to integrate them together.