
What Is A Data Lab?

Peter Jeffcock
Big Data Product Marketing

The data lab is a more recent term in the big data and data science world. But it’s an important one, because it can be a fast route to uncovering value in new big data as well as business data in an existing data warehouse. Let’s take a look at why you should consider a data lab, and some of the key requirements for success.

There are lots of ways that big data differs from (or maybe “expands upon” is better) the data that sits in a data warehouse and is used to run your business. But perhaps the key difference is that you just don’t know what questions it is capable of answering. Think about monthly sales figures. You can query those to find out who sold what, who bought what, and so on. Put more simply, you know what questions that data is capable of answering, and you can ask those questions using typical visualization or reporting tools.

But with new data sources things are different. You have location data, web log files, data from sensors, weather data, traffic flow and more. There’s value hidden in all that data, but you’re not sure what it is. That’s where the data lab comes in.

What Is A Data Lab?

The data lab is a separate environment built to allow your analysts and data scientists to figure out the value hidden in your data. The data lab helps you find the right questions to ask and, of course, put the answers to work for your business.


But why a separate environment for the data lab? It’s all about resources. Consider the following scenario.

It’s late at night on the last day of the quarter. In one part of the building, finance is busy closing the books, initiating the scripts and applications that will generate the reports for executives the next morning. It’s a critical time. Elsewhere in the building, somebody has lost track of the date as they’ve been working on a particularly vexing problem for days. But perhaps the end is in sight, because a particularly resource-intensive machine learning algorithm has been showing some promise and it’s time to try it out on the whole data set.

If there’s one thing you need in a production environment, it’s predictability. You want workloads to run and finish on time. But when you’re experimenting and trying to figure things out, predictability is not on your list. In that example above, somebody could unintentionally do significant damage to the business with their experimentation. That’s just one reason why you need to move experimentation away from your production environment.

Who Is Involved with A Data Lab?

I’ll identify four key roles that you need to consider: the three below, plus one more that I’ll come back to.

Data Scientist
  • The data scientist has the key role, with responsibility to create and train models that can be used to make predictions or identify data for further investigation.
Data Engineer
  • The data engineer needs to bring in, transform, and format data so that it’s usable for machine learning. They also have a role in ensuring the accuracy and relevancy of the data (basically, can you rely upon it?) and may also have responsibility for ensuring regulatory or other forms of compliance.
Business Analyst
  • If the data scientist understands the data and the algorithms, very often it’s the business analyst who understands the business and its customers. The analyst helps guide which problems get tackled, as well as interprets actual results and their importance to the business.

A successful data lab project will have these three roles, and others, working together as a team. Demand for those functions will vary over time and across projects. Very often you’ll find people who can combine two or, very occasionally, all three functions. For example, surveys have shown that many data scientists spend as much as 80% of their time on data engineering work, readying data for use (look up the term “data janitor,” which is sometimes used pejoratively).
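To make that data-readying work concrete, here is a minimal sketch in Python of the kind of clean-up a data engineer might do before data reaches a model. Everything in it is an assumption for illustration: the sensor export, its column names, and the Fahrenheit-to-Celsius step are hypothetical, not taken from any particular product or data set.

```python
import csv
import io

# Hypothetical raw sensor export; the column names are assumptions for illustration.
RAW = """sensor_id,timestamp,temp_f
a1,2019-09-01T00:00:00,71.6
a1,2019-09-01T01:00:00,
b2,2019-09-01T00:00:00,not_a_number
b2,2019-09-01T01:00:00,68.0
"""


def prepare_rows(raw_text):
    """Parse raw CSV text, drop unusable rows, and convert Fahrenheit to Celsius."""
    clean, rejected = [], 0
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            temp_f = float(row["temp_f"])
        except (TypeError, ValueError):
            rejected += 1  # missing or malformed reading
            continue
        clean.append({
            "sensor_id": row["sensor_id"],
            "timestamp": row["timestamp"],
            "temp_c": round((temp_f - 32) * 5 / 9, 2),
        })
    return clean, rejected


clean_rows, rejected_count = prepare_rows(RAW)
print(len(clean_rows), "usable rows,", rejected_count, "rejected")
```

Counting the rejected rows, rather than silently dropping them, is one simple way to keep an eye on whether the data can be relied upon.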

I left off one role, that of developer: somebody who is perhaps more involved with putting the results of the lab to work than with the core work of the lab itself.

Monetizing the Data Lab

The lab is not the end result. Rather, it’s a way to generate new insights that can be put to productive use. It’s important to figure out upfront how you’re going to turn insight into value. And if you’re starting a data lab project for the first time, you want that value to be visible quickly to maintain or gain organizational support for the work. In broad terms, here are three ways to go about monetizing your data lab.

Build Actionable Reports
  • Sometimes what you find is best communicated by some kind of report, anything from a simple email to a full written document. For example, a simple write-up with the details would be enough for an investigation team to look into what appears to be a fraudulent billing practice.
Modify Existing Applications and Processes
  • Perhaps what you find enables you to modify something you are already doing. If, for example, you found evidence of more widespread fraud, you could address it by modifying an existing applications process or by flagging suspicious applications for further review. Or imagine that you found a pattern that pointed towards a higher likelihood of a sale. That would enable you to modify the recommendation process in a web application to point customers to things that are more likely to appeal to them.
Create New Custom Applications
  • Finally, you might spot something in the data that is quite new. Perhaps you can now predict high value customers before their spending ramps up. Maybe you’d like to create a new service or application that will engage them in a different way.
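
As one illustration of the second option, flagging suspicious applications for further review could be as small a change as a threshold check added to an existing process. The sketch below is hypothetical: the `Application` record, the `fraud_score` field, and the 0.8 cutoff are all assumptions for illustration, standing in for whatever model and business rules come out of the lab.

```python
from dataclasses import dataclass


@dataclass
class Application:
    app_id: str
    fraud_score: float  # hypothetical model output between 0 and 1


REVIEW_THRESHOLD = 0.8  # assumed cutoff; in practice a business decision


def route(applications, threshold=REVIEW_THRESHOLD):
    """Split applications into normal processing and manual review."""
    normal, review = [], []
    for app in applications:
        if app.fraud_score >= threshold:
            review.append(app.app_id)  # flag for further review
        else:
            normal.append(app.app_id)
    return normal, review


batch = [
    Application("A-1", 0.12),
    Application("A-2", 0.91),
    Application("A-3", 0.80),
]
normal_ids, review_ids = route(batch)
print("normal:", normal_ids, "review:", review_ids)
```

Here the existing process stays as it is, and only the applications at or above the cutoff are diverted to a reviewer.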

The Cloud Is the Best Place For a Data Lab

You can build a data lab anywhere, but the cloud is best suited to some of the particular challenges that come up in a data lab environment.

Experiments Don’t Always Produce Results
  • By its very nature, an experiment can’t guarantee results before it has been run. This risk can make businesses reluctant to incur the time and cost of establishing an on-premises lab. Building a lab in the cloud is quicker, enabling that first win in less time. And the perceived risk is lower because the upfront investment is smaller.
Workloads Can Vary Significantly Over Time
  • The nature of experiments is that you can’t predict what’s going to happen next. You have some ideas, but ultimately what you do tomorrow is at least partly driven by what you find today. This translates into computing demands that vary over time, something easily managed in the cloud, but hard to do on premises.
Different Kinds of Workloads Can Perform Optimally in Different Environments
  • Workloads don’t just vary by time. Some may be computationally intensive, others might require massive storage, while many machine learning algorithms can benefit from GPUs. It can be hard to plan for all of this on premises, but in the cloud you can spin things up as you need them.
Experimental Work Often Reaches an End Point
  • While some lab environments can keep running indefinitely, many experiments come to an end. If you’re truly finished with your lab, it’s easy to decommission in the cloud. If you’re done with the computationally intensive work but want to keep the data, then cloud-based object storage can keep your data in a low-cost warm archive until you need it again.

Build A Data Lake In The Cloud Today

If you're thinking about a data lab in the cloud, a good first step would be to build a data lake to experiment with. Try our free guided trial in the cloud and get started with your data lab. Or if you'd like to see the benefits of a data science platform, it might be time to see what you can accomplish with the right resources, tools, and services.

