Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

AI Influencer Blog Series - Why Data Engineering Is Crucial to Scalable Data Science

Andreas Kretz
Data Engineer and Author of The Data Engineering Cookbook

There was a lot of talk a few years ago about how huge volumes of data were going to transform business, but there wasn’t as much talk about the preparation and planning that actually yield business benefits. The hype was about the data itself, but delivering meaningful insights turned out to be quite another thing. It is much like the difference between a grocery list and a prepared meal: getting from one to the other can be a long journey.

I’m seeing the same thing with data science. Demand for data scientists is high, but it’s important that organizations scale up their data engineering as they grow their data science capabilities. In some cases, it might work to have data scientists do the work of data engineers, which involves designing the systems that pipe in data and prepare it for analysis and machine learning (ML) models. However, this setup does not scale as demand for data insights grows.

Businesses might also find that it’s simply hard to hire data scientists who can do both jobs. Some data scientists have a background that helps with data engineering, but data scientists come from a mix of backgrounds (including non-technical ones), and the high demand for data scientists suggests that this mix will persist.

Another factor is that roles on data teams are changing. The data science role is becoming more defined around analysis and modeling, with the support of data engineering. In fact, a new type of data engineer, the machine learning engineer, is joining data science teams to focus on machine learning automation.

The takeaway for big businesses is that an over-reliance on data scientists to “do it all” could cause project delays and failures.


The Role of Data Engineering

Here’s how a senior software engineer and data team member at Netflix describes his team’s evolution as the company grew: “Once you scale up an organization, the person who is building the algorithm is not the person who should be cleaning the data or building the tools. In a modern big data system, someone needs to understand how to lay the data out for the data scientist to take advantage of.”

Data engineers are essentially computer scientists with in-depth knowledge of the tools found in data science platforms and of how to process data at large scale. They work with MongoDB, Hadoop, Apache Spark, SQL and NoSQL databases, processing frameworks, web interfaces and APIs, and other components that are part of preparing the data for use. Their responsibilities include the security, resilience, and scalability of the systems they build.

Preparing data for discovery isn’t a straightforward task. Data engineers connect the different applications that hold source data, which then needs to be consolidated and prepared. They also need to choose the right system technologies to align with project goals, and combine them into an operational framework for analyzing the cleaned and prepped data. Finally, they troubleshoot and fix system problems when they occur, and they monitor the system over time, making updates and improvements based on what they learn and observe.
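To make the idea of consolidating and preparing source data a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the two sources (a CRM export as CSV and a shop export as JSON), the field names, and the merge rule are hypothetical, and a real system would use whatever tools and schemas fit the project.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two source applications.
CRM_CSV = "customer_id,email,country\n1, Alice@Example.com ,DE\n2,bob@example.com,\n"
SHOP_JSON = (
    '[{"id": 2, "mail": "bob@example.com", "country": "US"},'
    ' {"id": 3, "mail": "CAROL@example.com", "country": "FR"}]'
)


def normalize(customer_id, email, country):
    """Return one record in a shared schema, with basic cleaning applied."""
    return {
        "customer_id": int(customer_id),
        "email": email.strip().lower(),
        # Empty or missing countries become None so they can be filled later.
        "country": (country or "").strip().upper() or None,
    }


def read_crm(text):
    """Yield cleaned records from the (assumed) CRM CSV export."""
    for row in csv.DictReader(io.StringIO(text)):
        yield normalize(row["customer_id"], row["email"], row["country"])


def read_shop(text):
    """Yield cleaned records from the (assumed) shop JSON export."""
    for item in json.loads(text):
        yield normalize(item["id"], item["mail"], item.get("country"))


def consolidate(*sources):
    """Merge records by customer_id; later sources only fill in missing fields."""
    merged = {}
    for source in sources:
        for record in source:
            current = merged.setdefault(record["customer_id"], record)
            for key, value in record.items():
                if current.get(key) is None and value is not None:
                    current[key] = value
    return [merged[key] for key in sorted(merged)]


customers = consolidate(read_crm(CRM_CSV), read_shop(SHOP_JSON))
```

In this sketch the second customer has no country in the CRM data, so the value from the shop data fills the gap. The point is not the specific rule but that someone has to decide on a shared schema and a merge policy before the data scientist ever sees the data.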

The machine learning engineer is a new role that works with the data scientist to create and automate machine learning algorithms using languages such as Python and R. In some cases, the data scientist does all of this work, but again, as organizations scale up, more specialized support for the data scientist will be required to achieve quality outcomes.


How to Become a Data Engineer

I work as a data engineer for a large, global manufacturer, and I have a background in computer science. There are some online courses for data engineering, but most paths to this job are like mine: self-taught. I’ve written a free cookbook for people interested in this career path. If you’re just getting started, I suggest picking out a few tools that work together and creating a system for ingesting data. Learn from this, and do a project for yourself, a college class, or the business where you work.
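As a rough idea of what a first ingestion project could look like, here is a small sketch that combines two tools that ship with Python: the json module for parsing and SQLite for storage. The sensor events, the table name, and the column names are all made up for this example; it is a starting point to experiment with, not a recommended architecture.

```python
import json
import sqlite3

# Hypothetical newline-delimited JSON events, including one bad line.
RAW_EVENTS = """\
{"sensor": "line-1", "temperature_c": 71.5}
{"sensor": "line-2", "temperature_c": 64.0}
not valid json
{"sensor": "line-1", "temperature_c": 73.5}
"""


def ingest(lines, connection):
    """Parse each line, load valid events into SQLite, and count rejects."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor TEXT, temperature_c REAL)"
    )
    loaded, rejected = 0, 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        try:
            event = json.loads(line)
            row = (event["sensor"], float(event["temperature_c"]))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Malformed input is counted rather than allowed to stop the load.
            rejected += 1
            continue
        connection.execute("INSERT INTO readings VALUES (?, ?)", row)
        loaded += 1
    connection.commit()
    return loaded, rejected


connection = sqlite3.connect(":memory:")
loaded, rejected = ingest(RAW_EVENTS.splitlines(), connection)

# Once the data is loaded, it can be queried like any other table.
averages = dict(
    connection.execute(
        "SELECT sensor, AVG(temperature_c) FROM readings GROUP BY sensor"
    )
)
```

Even a toy like this raises the questions a data engineer deals with every day: what to do with bad records, how to structure the storage, and how to check that what went in matches what arrived.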

There are online tools you can use to learn about specific applications, so take advantage of them. Then dig deeper into the tools you’re focused on and build up expertise in using them.

The demand for data engineering is definitely there and will continue to grow. If you want a job as a data engineer, follow the advice in my cookbook. If you’re working with data scientists or overseeing a data team, do you have the engineering support you need? You might think you do today, but evaluate the situation for scalability. As an organization expects more from its data scientists, it should support them with the engineering expertise needed to form a productive, successful data science pipeline.


To learn more about AI and data science, visit the Oracle AI page. You can also try Oracle Cloud for free!

