In a recent TDWI webinar, we explored building a successful data lake in the cloud.
But what is a data lake?
Here's a simple definition: A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.
Watch the video to learn more.
Data lakes are becoming increasingly important as people, especially in business and technology, want to perform broad data exploration and discovery. Bringing all of the data, or at least most of it, together in a single place is useful for that.
Depending on your platform, the data lake can make that much easier. It can handle many data structures, such as unstructured and multi-structured data, and it can help you get value out of your data.
The key difference between a data lake and a data warehouse is that the data lake tends to ingest data very quickly and prepare it later on the fly as people access it. With a data warehouse, on the other hand, you prepare the data very carefully upfront before you ever let it in the data warehouse.
That’s because the two have different goals. With a data lake, you want to get your data in as quickly as possible so that operational use cases, especially operational reporting, analytics, and business monitoring, have the newest data. Companies that run their processes multiple times during a single business day can then see the latest things happening in their operations.
In addition, with the data lake you’re usually ingesting data in its original form without altering it. Why? One reason is that many forms of advanced analytics depend on detailed source data. This includes analytics based on any kind of mining.
So as you can see, many of these analytic forms need the detailed source data, which is very different from what reporting requires. That’s why the data lake tends to be a treasure trove of data for analytics, at least for advanced forms of analytics.
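The ingest-first, prepare-later pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `RAW_LAKE`, `ingest`, `read_orders`, and the order records are hypothetical names and data, and a real lake would sit on a platform such as Hadoop or object storage rather than an in-memory list.

```python
import json

# Hypothetical stand-in for lake storage: records are kept exactly as they arrived.
RAW_LAKE = []

def ingest(raw_line):
    """Land a record in the lake untouched; no parsing or cleanup at ingest."""
    RAW_LAKE.append(raw_line)

def read_orders(min_amount):
    """Prepare data at read time: parse, cast, and filter on the fly."""
    prepared = []
    for line in RAW_LAKE:
        try:
            record = json.loads(line)
            amount = float(record["amount"])
        except (ValueError, KeyError, TypeError):
            # Malformed records stay in the lake; this reader simply skips them.
            continue
        if amount >= min_amount:
            prepared.append({"order_id": record.get("order_id"), "amount": amount})
    return prepared

# Ingestion accepts anything, including a record that is not valid JSON.
ingest('{"order_id": "A1", "amount": "19.99"}')
ingest('{"order_id": "A2", "amount": 250}')
ingest('not json at all')

large_orders = read_orders(min_amount=100)
print(large_orders)
```

A warehouse-style load would instead run the parsing and validation inside `ingest` and reject the bad record up front. Here the detailed source data is preserved, and each reader decides how much preparation it needs.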
Now to be clear, Hadoop hasn’t replaced anything. It’s mixed in with relational databases and in today’s modern warehouses there’s Hadoop thrown into the mix. We’re seeing it there to help data warehouses scale better.
But there are also different ways for users to design their warehouses. Many people design the warehouse primarily as a data store for different forms of reporting, whether it's traditional reports or newer approaches to reporting like dashboards, scorecards, and so forth. In those cases your warehouse may or may not be the best environment for the detailed source data that a lot of analytics needs. That's why Hadoop is brought in: to deal with large volumes of detailed source data. This is just one way to use the data lake that extends the data warehouse.
Using the data lake to extend the data warehouse is something we see in omnichannel marketing, sometimes called multichannel marketing. The way to think about the data ecosystem in marketing is that every channel can be its own database, and every touchpoint can be as well. And then also a lot of marketers buy data from third parties. For example, you might want to buy data that has additional demographic and consumer preference information about your customers and prospects, and that helps you fill out that complete view of each customer, which in turn helps you create more personalized and targeted marketing campaigns.
That’s a complex data ecosystem, and it’s getting bigger in volume and greater in complexity all the time. The lake is brought in quite often to capture data that's coming in from multiple channels and touchpoints. And some of those actually are streaming data. If you're working for a company that has offered a smartphone app to its customers, you may be getting that data in real time or close to it as those customers use that app. A lot of times you don't really need full real time. It could be an hour or two old. But it allows the marketing department to do very granular monitoring of the business and create specials, incentives, discounts, and micro-campaigns.
Digital Supply Chain Data Lake
The digital supply chain is an equally diverse data environment and the data lake can help with that, especially when the data lake is on Hadoop. Hadoop is largely a file-based system because it was originally designed for very large and highly numerous log files that come from web servers. In the supply chain you also get a lot of file-based data. Think about file-based and document-based data from EDI systems, XML, and of course today JSON's coming on very strong in the digital supply chain. That's very diverse information.
Plus you have internal information. If you're a manufacturer, you probably have data from the shop floor, from shipping and billing, that's highly relevant to the supply chain. The lake can help you bring that data together and manage it in a file-based kind of way.
The Internet of Things Data Lake
The Internet of Things is creating new data sources almost daily in some companies. And of course, as those sources diversify they create more data, because there are more sensors on more machinery all the time. As an example, every rail or truck freight vehicle carries a long list of sensors so that you can track the vehicle through space and time, as well as how it’s operated. Is it operated safely? Is it operated in an optimal way relative to fuel consumption? Huge amounts of information are coming from these places, and the data lake is very popular because it provides a repository for all of that data.
Now, those are examples of fairly targeted uses of the data lake in certain departments or IT programs, but a different approach is for centralized IT to provide a single large data lake that is multitenant. It can be used by lots of different departments, business units, and technology programs.
We see this more and more. As people get used to the lake, they figure out how to optimize it for diverse uses in operations, analytics, and even compliance.
And the single data lake can also support multiple technical functions. Quite often the lake takes over certain pieces of the warehouse architecture. For example, you may have put loving care and design into your warehouse architecture for, say, dimensional modeling, and yet not put much attention into data landing and staging.
Come on, admit it. Landing and staging is always a weak and somewhat immature area. If you bring in a data lake, possibly on Hadoop, possibly on object storage with Apache Spark, one of its first roles is as the new landing and staging area. It then becomes an area for collecting huge amounts of data for advanced analytics.
The data lake can be used many ways, and it also has many platforms that can be under it. Hadoop is the most common but not the only platform.
Hadoop is appealing. It has proved to have linear scalability, and it scales at low cost compared to, say, a relational database. But Hadoop is not just cheap storage. It's also a powerful processing platform, and it can be very useful if you're trying to do algorithmic analytics.
Relational Database Management System
The relational database management system can also be a platform for the data lake, because some people have massive amounts of data that they want to put into the lake that is structured and also relational. So if your data is inherently relational, a DBMS approach for the data lake would make perfect sense. Also, if you have use cases where you want to do relational functionality, like SQL, complex table joins, that kind of thing, then the RDBMS makes perfect sense.
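As a small illustration of the relational functionality mentioned above, here is a sketch of a table join with aggregation, using Python's built-in `sqlite3` module as a stand-in for whatever RDBMS hosts the lake. The `customers` and `orders` tables and their contents are assumptions made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the lake's relational platform.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'East'), (2, 'West');
INSERT INTO orders VALUES (10, 1, 40.0), (11, 1, 60.0), (12, 2, 25.0);
""")

# Join orders to customers and total the order amounts per region.
revenue_by_region = conn.execute("""
SELECT c.region, SUM(o.amount)
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY c.region
ORDER BY c.region
""").fetchall()

print(revenue_by_region)
```

When the data is already structured this way, the join and aggregation are a single declarative query, which is the kind of workload where an RDBMS-based lake fits naturally.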
But the trend is toward cloud-based systems, and especially cloud-based storage. The great benefit of clouds is elastic scalability. They can marshal server resources and other resources as workloads scale up. And compared to a lot of on-premises systems, cloud can be low-cost. Part of that is because there’s no system integration.
If you want to do something on-premises, you or somebody else has to do a multi-month system integration, whereas for a lot of systems a cloud provider already has the pieces integrated. You basically just buy a license and can start using the system within hours instead of months. In addition, the object store approach to cloud, which we mentioned in a previous post on data lake best practices, has many benefits.
And of course, you can have a hybrid mix of platforms with a data lake. If you're familiar with what we call the logical data warehouse, you can have something similar with the lake: the logical data lake. This is where data is physically distributed across multiple platforms. That brings some challenges. If you want to run far-reaching analytic queries, and a lot of you do, you need tools that are good at federated queries, data virtualization, and the like. But that technology is available at the tool level, and many people are using it.
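To make the idea of a query that spans platforms concrete, here is a toy sketch. It is not how any particular federation or data virtualization tool works. It assumes two hypothetical stores: an in-memory SQLite table of customers and a list of JSON lines standing in for clickstream files on a file-based platform. The function `clicks_by_segment` reads from both and combines the results in one place.

```python
import json
import sqlite3

# Platform 1 (assumed): a relational store holding customer master data.
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, segment TEXT)")
rdbms.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", "retail"), ("c2", "wholesale")],
)

# Platform 2 (assumed): JSON lines as they might sit in files on Hadoop or object storage.
clickstream_lines = [
    '{"customer_id": "c1", "page": "/home"}',
    '{"customer_id": "c1", "page": "/cart"}',
    '{"customer_id": "c2", "page": "/home"}',
]

def clicks_by_segment():
    """Answer one question from two physically separate stores."""
    # Pull the lookup from the relational side as {customer_id: segment}.
    segments = dict(rdbms.execute("SELECT customer_id, segment FROM customers"))
    counts = {}
    # Scan the file-based side and join each event to its segment in memory.
    for line in clickstream_lines:
        event = json.loads(line)
        segment = segments.get(event["customer_id"], "unknown")
        counts[segment] = counts.get(segment, 0) + 1
    return counts

print(clicks_by_segment())
```

Dedicated tools do this kind of cross-platform work at scale, with query planning and pushdown that this sketch leaves out.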
So if you’ve been wondering what a data lake is, I hope you now have your answer. The data lake is your answer to organizing all of those large volumes of diverse data from diverse sources. If you have more questions, you can catch the data lake webcast we produced with TDWI. And if you’re ready to start playing around with a data lake, we can offer you a free trial right here.