Big data technologies have it rough. MapReduce may have been the favorite child for a few years—but Apache Spark has been rising rapidly. This is how it is with big data. The technology changes rapidly, new projects usurp old ones—and that’s what makes it so exciting.
Today, we’re going to talk about the trends that drive the big data and cloud convergence, and what’s significant about it.
But first, we’re going to look at some of the industry trends that are making big data truly viable, important and central to many organizations.
Some history: SQL relational data has been fundamental to computing within the business context for the last 30 years. These transactional systems have powered enterprise data management and e-commerce, and really were the initial engines for the internet era.
This remains a key part of the infrastructure for data management today. But compute technology has expanded. And as more devices and more Internet services have become available, new types of data have been generated – the multi-source, multi-structured data from machine sensors, connected devices, clickstreams, system logs, etc.
You know, it’s the data we all love that doesn’t fit well into the relational paradigm and scales out in incredible ways, going from terabytes of data to petabytes. It’s the data CEOs point to and ask, “Why aren’t you doing anything with this?” while you’re thinking, “Easier said than done.”
It’s this new type of data, and these new sources of data that drove the popularity of NoSQL databases and Hadoop over the past 10 years.
To get value for the business, these traditional data sources and these new sources of data need to be brought together for more kinds of insights, so shiny technologies like machine learning and artificial intelligence can use them together.
So here’s the next generation of big data technology: it should be possible to manage both traditional and new data sets together on a single cloud platform.
This allows you to use, out of the box, the data storage, the object store that's native to the cloud infrastructure, and the compute capabilities native to that same infrastructure.
No more setting up and managing Hadoop clusters, no more provisioning hardware.
This is big.
It’s a paradigm shift in how you think about data management because now, the cloud is the data platform. It also allows any user to work with any kind of data quickly, securely and efficiently, in a way that fits your immediate business needs.
So, what do you need to make this happen?
You have all these data sets that are being generated in data sources across the business landscape, across the Internet landscape. The first thing you need to do is integrate them and bring them into your system.
The second thing you need to do, at a high level, is manage them. You need to have a place to store them.
And third, you absolutely need analytics. You need high-powered analytics that allow you to understand the data, visualize the data, make sense of the data, and then build proactive models based on machine learning that allow you to get ahead of the business requirements and interact with data sets as events are happening in real time.
Next, I’m going to drill down into each of these areas to show what’s needed in each one. First, I’m going to talk about big data integration.
Data integration has always been important, whether it was with traditional databases or with data warehouses. In the same way, today it’s still important with big data. But it’s more complicated than ever, with more data sources, types, problems and frameworks.
You’ve always had data integration, but now you have to make it work with big data.
You need to be able to bring data in and transform it, work with streaming and non-relational data sets, and do all of this in a way that guarantees the overall quality of the data in the system.

Data quality is one problem you don't want to experience as you work with large data sets. When you bring data in, you want assurance that what you're working with is meaningful, so that when you apply machine learning algorithms, for example, you have confidence in the answers you're getting because you have confidence in the data. That's why you need powerful data integration.
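To make the idea of an ingestion-time quality gate concrete, here is a minimal Python sketch. The record fields, the required keys, and the acceptable reading range are all illustrative assumptions, not any particular product's API:

```python
from dataclasses import dataclass

@dataclass
class QualityReport:
    total: int
    rejected: int

    @property
    def reject_rate(self) -> float:
        return self.rejected / self.total if self.total else 0.0

def validate(record: dict) -> bool:
    """Reject records with missing keys or out-of-range values."""
    required = {"sensor_id", "timestamp", "reading"}
    if not required <= record.keys():
        return False
    # Hypothetical sanity bound for a temperature-style sensor reading.
    return -100.0 <= record["reading"] <= 200.0

def ingest(records):
    """Split incoming records into clean rows plus a quality report."""
    clean, rejected = [], 0
    for r in records:
        if validate(r):
            clean.append(r)
        else:
            rejected += 1
    return clean, QualityReport(total=len(records), rejected=rejected)

raw = [
    {"sensor_id": "a1", "timestamp": 1, "reading": 21.5},
    {"sensor_id": "a2", "timestamp": 2},                   # missing reading
    {"sensor_id": "a3", "timestamp": 3, "reading": 999.0}, # out of range
]
clean, report = ingest(raw)
```

Gating at ingestion like this is what lets you trust a downstream model: the reject rate tells you how healthy a source is before any algorithm touches the data.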
After the hard work of data integration is done, you need to be able to manage it. You need to be able to put it somewhere and keep it secure, but make it available to those authorized to use it.
The new paradigm data lake is really built on the cloud object store. You can store any kind of data in the object store. You can store it in any form you want, and you can bring whatever processing requirements and process engines you need on demand to those data sets.
This is a key evolution in the big data architecture as we know it today. I’ll explain how it’s significant.
If you’re familiar with big data platforms that have been deployed in the past three to five years, you know people often had to go out, provision hardware, commit to capacity up front and deploy a Hadoop platform, and all along they were constrained by the capabilities of their Hadoop platform vendor.
But cloud infrastructure allows you to deal with your compute requirements, spin up resources and spin them down automatically.
You don’t have to handle upgrades. You don’t have to worry about capacity planning.
If your central data lake technology is based in the object store, you can push out to alternative storage systems, like relational databases or NoSQL stores as needed.
After the data’s stored and available in the data lake, you can process it with various open source technologies.
But that’s not the most exciting part.
Hadoop became popular because of its storage capability and its compute capability with MapReduce. But in Hadoop, storage and compute are inextricably tied together when it comes to scaling up and down. If you need more compute capability, you have to pay for more bulk storage too, and vice versa.
Today’s modern data lake architecture, which is only possible in the cloud, has Apache Spark as its compute framework and object storage as its bulk storage. This is big, because the two can scale elastically and, most importantly, independently of each other. That means freedom from having to scale both whether you truly needed to or not.
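A rough, back-of-the-envelope cost model makes the difference concrete. All of the prices below are made-up round numbers purely for illustration:

```python
# Illustrative cost model contrasting coupled Hadoop-style scaling with
# decoupled cloud scaling. Every price here is an invented round number.

NODE_COST = 500.0     # monthly cost of one combined compute+storage node
COMPUTE_COST = 300.0  # monthly cost of one compute-only instance
STORAGE_COST = 20.0   # monthly cost per TB of object storage

def coupled_cost(compute_units: int, storage_tb: int, tb_per_node: int = 10) -> float:
    """Hadoop-style: nodes bundle compute and storage, so you buy enough
    nodes to satisfy whichever requirement is larger."""
    nodes_for_storage = -(-storage_tb // tb_per_node)  # ceiling division
    nodes = max(compute_units, nodes_for_storage)
    return nodes * NODE_COST

def decoupled_cost(compute_units: int, storage_tb: int) -> float:
    """Cloud-style: compute instances and object storage scale independently."""
    return compute_units * COMPUTE_COST + storage_tb * STORAGE_COST

# A storage-heavy workload: modest compute needs, lots of data.
print(coupled_cost(4, 200))    # 20 nodes just to hold the data -> 10000.0
print(decoupled_cost(4, 200))  # 4 instances + 200 TB of objects -> 5200.0
```

The exact numbers don't matter; the shape of the problem does. As soon as your compute and storage needs grow at different rates, a coupled architecture forces you to over-buy one of them.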
As another benefit, object storage is cheaper and more flexible than HDFS, which relies on block storage. In fact, block storage can often be two to five times more expensive than object storage.
With object storage in the cloud, you can bring the compute to the data when you need it. And when you’re done with the workloads and processing that required that particular compute cluster, you can spin it down, which helps you control costs.
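A quick comparison shows why spinning clusters down matters. The hourly rate here is an invented number purely for illustration:

```python
# Back-of-the-envelope comparison of an always-on cluster versus compute
# that is spun up only while jobs run. The rate below is a made-up figure.

HOURLY_RATE = 2.0  # illustrative cost per cluster-hour

def always_on_cost(hours_in_month: int = 730) -> float:
    """Cluster runs around the clock whether or not jobs are active."""
    return hours_in_month * HOURLY_RATE

def ephemeral_cost(job_hours: float) -> float:
    """Pay only for the hours the cluster is actually up."""
    return job_hours * HOURLY_RATE

# A nightly 3-hour batch job: about 90 cluster-hours a month instead of 730.
print(always_on_cost())       # 1460.0
print(ephemeral_cost(30 * 3)) # 180.0
```

With the data living in the object store rather than on the cluster, nothing is lost when the cluster goes away, which is what makes this pattern safe.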
Elasticity is a native feature of the cloud, and it shifts the way you think about provisioning and the need to plan for capacity. It removes many of the constraints and shackles that have been in place for existing big data systems today.
Essentially, a data lake built in the cloud is more cost effective, faster, and more flexible.
Having more data gives you the potential to understand your customers better and tackle problems you’re trying to solve. But you still have to discover which questions can be answered.
Existing analytics tools are being enhanced to help you understand the new kinds of data sets you’re collecting. Visualization falls into that category—it enables you to explore the format of your data, transform it, tweak it, and better prepare it.
And machine learning is a buzzword right now, sure, but it’s such a big buzzword because of what it can accomplish. You can take your big data and train models based on that data, and gain better results because you have so much data to feed it.
But machine learning can also be used to improve the analytic tools themselves, so you can uncover new things about your data that you haven’t been able to uncover before, which is truly exciting. You can use machine learning to examine your data and automatically suggest useful visualizations and ways to think about and explore your data.
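As a toy illustration of the idea, here is a small Python heuristic that inspects a column of values and suggests a chart type. The type-to-chart mapping is a simplistic assumption for the sake of the sketch, not how any real product decides:

```python
# A toy sketch of ML-assisted analytics: inspect a column's values and
# suggest a visualization. The rules here are deliberately simplistic.

def suggest_chart(values):
    """Suggest a chart type based on the kind of data in a column."""
    sample = [v for v in values if v is not None]
    if not sample:
        return "table"
    if all(isinstance(v, (int, float)) for v in sample):
        return "histogram"   # continuous values: show the distribution
    if len(set(sample)) <= 10:
        return "bar chart"   # low-cardinality categories: compare counts
    return "table"           # high-cardinality text: just list it

print(suggest_chart([3.2, 4.1, 2.8]))       # histogram
print(suggest_chart(["yes", "no", "yes"]))  # bar chart
```

A real system would learn these mappings from how analysts actually visualize similar columns, rather than hard-coding them, but the interaction pattern is the same: the tool proposes, the user explores.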
And, similar to recommendation engines on e-commerce sites on the internet that suggest other items you might be interested in, machine learning can enable discovery in the patterns of usage of the data itself, so you can have recommendations in real time about issues that a business user might want to know about.
For example, when a deal closes, the system might automatically send a sales executive updated information on the probability of hitting a sales target, having learned that this is the kind of information he or she wants to know about.
If you’ve made it this far into the article, congratulations! But even if you don’t remember anything else, I would like you to remember three things: integrate your data from every source, manage it in a cloud data lake built on object storage, and analyze it with tools powered by machine learning.
If you're interested in making your cloud strategy for big data more effective, download your free Forrester white paper today, "Going Big Data? You Need a Cloud Strategy." Or, try building a data lake for free with an Oracle trial.