Or, What Not to Do When You’re Building a Data Lake
So, you’re thinking about a data lake for your organization. You might even have the greenlight to start the planning stages.
We’re big believers in the power of the data lake. It can often be a significantly cheaper way to store your data, but that’s not the most attractive part. No, it’s the fact that it can hold your structured and unstructured data, internal and external data, and enable teams across the business to discover new insights.
But here’s the thing – the hype has run away (a little) with the data lake. A data lake is not something you can implement with a snap of your fingers. The rewards are enormous, but it still takes work and strategy, and that’s why we want to help you avoid some mistakes with these seven data lake best practices. Let's create an easier path to data lake nirvana.
We’ve gathered insights from experts Larry Fumagalli and David Bayard of Oracle’s Cloud Platform Team for best practices and what not to do.
We'll get into them in detail in a minute. But before you ever start, make sure you think about whether your data lake is going to be located in the cloud or on premise. Do you have to create your data lake on premise because of regulatory or business requirements? Or can you locate your data lake in the cloud and take advantage of the new data lake architecture, which we’ll describe in more detail below. Perhaps you can talk the exec team into trying cloud if you have your own private cloud.
There are pros and cons to each of these methods, but that’s a topic for an entirely different article. Today, we’ll focus on data lake best practices overall.
Over and over, we’ve found that customers who start with an actual business problem for their data lake are more effective. They are more likely to have results to point to, and more likely to have information that will please the higher-ups. They also tend to get the job done more quickly and easily, because they remain focused.
This may seem like a basic piece of information, but we include it here because there still exists a tendency for IT to turn their data lake into a science project; they want to play with it and experiment and build a dream data repository.
And they tend to assume that once that dream is a reality, it will solve all use cases and business teams will simply come to them with their data questions and issues. But the actual reality is that this rarely happens, and it’s better if you start with a business problem in mind, stay focused, and solve it.
Unfortunately, not having the right people for a data lake is a problem we see all too often. Hype is in the air for the data lake and you may want to implement one. But you need to have a game plan in place if you don’t already have anyone on the team with the right knowledge. Have you thought about how you’re going to acquire in-house experience if you don’t have it already?
If your team doesn’t have this, unfortunately, the road to success is going to be much longer.
So have a plan for either hiring the people you need, or giving the people you already have comprehensive training.
This one is related to point #2, above. A misperception we sometimes see is the belief that a data lake is just a cheaper way to run a database. So even if an organization doesn’t actually have anyone with the knowledge or the skills, they’ll still implement one.
Then they’ll try to treat it like a database, and get frustrated when it doesn’t behave like one. Then, 15 months later, they decide their data lake hasn’t done what they intended it to do, and they’re disappointed.
Let me be clear. This isn’t the fault of the data lake. It’s a case of misaligned expectations.
A data lake isn’t a magic solution. There definitely are projects that can be done more quickly and easily in a database. So take a look at the size of your staff, whether you can hire others, and whether your company really needs a data lake. Be sure you know what it can and can’t do before you commit.
When customers are interested in implementing a data lake, one of the first things we often do is tell them Oracle has a free trial they can experiment with. It’s good for a month, and you can test it out without spending any of your own money.
If you’re doing Hadoop in the cloud, design around object storage and not just HDFS. Object storage in the cloud with Spark is more flexible and lower cost.
Object storage does add slightly more complexity, in the sense that many Hadoop tools default to or prefer HDFS (block storage). You may find you need to spend extra effort adjusting tool configurations to work directly against object storage, or temporarily copy subsets of object store data to block storage. As the big data ecosystem evolves, however, this historical bias toward HDFS is fading.
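As a concrete illustration, pointing Spark at cloud object storage usually comes down to a handful of configuration properties. The sketch below uses the Hadoop S3A connector; the endpoint, keys, and bucket name are placeholders you would replace with your own vendor’s values (Oracle’s object storage, for instance, also exposes an S3-compatible endpoint):

```properties
# spark-defaults.conf -- illustrative only; endpoint and credentials are placeholders
spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint            https://objectstorage.example-region.example.com
spark.hadoop.fs.s3a.access.key          YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key          YOUR_SECRET_KEY
spark.hadoop.fs.s3a.path.style.access   true
```

With settings like these in place, jobs can read straight from the object store (for example, `spark.read.parquet("s3a://my-bucket/sales/")`) instead of staging data onto HDFS first.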
However, there are still many advantages to designing around object storage.
Object storage gives you a common way to store and share data across multiple big data clusters. It also allows you to optimize each of those clusters for what you truly want to do. For example, you can give one cluster more compute than the others because it works with larger sets of data.
Object storage is cheaper than the block storage that HDFS has to rely upon. And most advantageously, it can scale both elastically and independently, so your compute and storage are no longer tied together, resulting in lower costs for you. And since you’re in the cloud, you can spin clusters up and down as you need them.
This means a data lake that’s easier to manage, cheaper, and has better performance. So look at your options, and don’t just default to HDFS and MapReduce.
It’s not just about data storage. It’s about data management too. You need your data to be actively and securely managed.
In the past, you could simply put your data in Oracle Database and you could count on it being very secure and safe from break-ins.
But the data lake is a new beast. Some companies have suffered break-ins because they misconfigured permissions and left their data too readable. There are technologies coming out all the time to help with security and governance. Take a good look at your vendors: how are they addressing data masking and encryption?
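To make the permissions problem concrete, here is a minimal sketch of an object-store ACL audit. It assumes ACL grants arrive as plain dicts (the shape your cloud SDK’s get-bucket-ACL call might return); the grantee URIs follow the S3 ACL convention and the example bucket is hypothetical:

```python
# Grantee URIs that mean "everyone" in S3-style ACLs.
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def publicly_readable(acl_grants):
    """Return True if any grant exposes READ (or broader) access to everyone."""
    for grant in acl_grants:
        grantee_uri = grant.get("Grantee", {}).get("URI", "")
        if grantee_uri in PUBLIC_GRANTEES and \
                grant.get("Permission") in ("READ", "FULL_CONTROL"):
            return True
    return False

# Example: a bucket accidentally opened to the world.
grants = [
    {"Grantee": {"Type": "Group",
                 "URI": "http://acs.amazonaws.com/groups/global/AllUsers"},
     "Permission": "READ"},
]
print(publicly_readable(grants))  # True -- this bucket should be locked down
```

A periodic sweep like this over every bucket in the lake is a cheap way to catch the “too readable” misconfiguration before someone else does.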
At the bare-bones level, here’s what you’ll need to keep your data safe, and out of the hands of the wrong people:
Most data lakes are going to involve a little buying and a little building. It’s rare that you’ll find a perfect vendor solution that meets every one of your needs. But it’s also too expensive (not to mention complicated) to custom-build every feature you’re going to need.
We find that customers often don’t fully understand how many resources it takes to get a data lake going, because it’s all new to them. There just isn’t the experience for them to know everything they’re going to need.
Don’t let yourself get sucked into the trap of building everything yourself. Everything you buy will have a cost. But everything you build has a time cost and an efficiency cost.
For example, when it comes to data integration for data lakes, Oracle has a very popular product called GoldenGate with many happy customers. It lets you bring data in faster, freeing your team for more important work.
Make sure you think carefully about what your team will build or buy.
Data lakes aren’t just a place for data science work, and they’re not a magic place for all of your data. You still need a full data management lifecycle. You still have to load your data into staging, perform data quality checks, clean and enrich it, steward it, and run reports on it.
Your numbers are only good if your data quality is good, and putting your data into a data lake doesn’t negate having to go through the data management lifecycle.
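The data quality step of that lifecycle can start very simply. Below is a minimal sketch of a staging-time quality gate, assuming rows arrive as dicts; the column names and rules are hypothetical examples, not a prescribed schema:

```python
# Columns that must be present and non-empty in every staged row (example schema).
REQUIRED = ("customer_id", "order_total")

def quality_check(rows):
    """Split staged rows into (clean, rejected) using simple validation rules."""
    clean, rejected = [], []
    for row in rows:
        missing = [col for col in REQUIRED if row.get(col) in (None, "")]
        total = row.get("order_total")
        bad_total = not isinstance(total, (int, float)) or total < 0
        if missing or bad_total:
            rejected.append(row)          # route to a quarantine area for stewards
        else:
            clean.append(row)             # safe to enrich and load downstream
    return clean, rejected

staged = [
    {"customer_id": "C1", "order_total": 42.50},
    {"customer_id": "",   "order_total": 10.00},  # missing ID -> rejected
    {"customer_id": "C3", "order_total": -5},     # negative total -> rejected
]
clean, rejected = quality_check(staged)
print(len(clean), len(rejected))  # 1 2
```

Rejected rows go to a quarantine area for stewardship rather than silently disappearing, so your reports reflect only data that passed the gate.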
Our biggest tip for building an information management pipeline is: start with something you’re already familiar with. This can be a data source you already know, or something you know well at a smaller scale.
In the data lake, build the full data management lifecycle with that data source before you start with unstructured sources, sensor data, streaming data, etc. That way, you know your foundation is really solid and if something goes wrong, you won’t be questioning the foundation itself. You’ll be more confident looking for the source of the problem elsewhere.
As we mentioned before, the rewards of building a data lake are enormous. If you still have any hesitations, we recommend trying out a data lake with our free trial or reaching out directly to us. We’ll be happy to answer any questions you might have.