Note: This post originally appeared on the Oracle AI & Data Science Blog.
The world of big data is full of buzzwords. Terms like “big data,” “artificial intelligence,” and “data science” have gone from tech-centric jargon to topics that executives and business analysts discuss. Among the next wave of terms is “automated machine learning,” frequently shortened to AutoML.
AutoML automates common machine learning steps that are often repetitive and involve a good amount of trial and error on the path to producing quality machine learning models. Some people think that AutoML can replace data scientists in an organization by turning data science into a one-touch solution. That’s not true. AutoML is best understood as a tool that extends what data scientists can do and speeds up their work, just like machine learning itself.
To fully grasp the difference between the terms AutoML and machine learning, it’s important to have a clear understanding of what machine learning actually is. The purpose of machine learning is to build a representation of patterns via models/algorithms and leverage those models to make inferences on new data. Much as a child learns basic language through a combination of supervised teaching, repetition, and exposure, machine learning algorithms improve models as they are trained on greater volumes of data. Over time, the accuracy of a model’s predictions generally improves with retraining cycles that use larger amounts of data or more recent/relevant data, until the model performs well enough to be applied to real-world problems.
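To make the idea of "build a model from patterns, then infer on new data" concrete, here is a minimal sketch in plain Python. It assumes a toy nearest-centroid classifier and made-up customer data; the function names, feature meanings, and values are hypothetical and are not taken from any particular library or product.

```python
from statistics import mean

def fit_centroids(rows, labels):
    """Training: learn one centroid (per-feature mean) for each class label."""
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()}

def predict(centroids, row):
    """Inference: return the label whose centroid is closest to the new row."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical training data: [hours_active, purchases] -> did the customer churn?
train_rows = [[1.0, 0.0], [2.0, 1.0], [8.0, 5.0], [9.0, 7.0]]
train_labels = [True, True, False, False]

model = fit_centroids(train_rows, train_labels)   # the learned "representation"
new_customer = [7.5, 6.0]                         # data the model has not seen
prediction = predict(model, new_customer)
print("Predicted churn:", prediction)
```

Retraining simply means calling `fit_centroids` again on a larger or more recent dataset and checking whether predictions on held-out data get better.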
Consider the full, unabbreviated term behind AutoML: automated machine learning. While it doesn’t make machine learning a completely turnkey service, AutoML attempts to streamline the overall process by automating some of the labor-intensive manual steps in training a machine learning model:
Algorithm Selection: For any given dataset, multiple algorithms can be used. For example, a binary (true/false) classification problem might use logistic regression, a support vector machine (SVM), a decision tree, gradient boosted trees, and more. Determining the best algorithm for the dataset can be an intensive process with significant evaluation and tweaking. AutoML uses automation to efficiently identify the algorithms/models that work best for the dataset.
Feature Selection: Features, also known as predictors, are essential to model building, though the best ones to use usually depend on the selected algorithm. The number of features used also affects model build and scoring times, potentially slowing down the overall process. AutoML assesses which combination of features works best through an automated evaluation process.
Model Tuning: Each algorithm has its own set of hyperparameters. The best set of hyperparameters yields the most accurate model, but finding that combination takes an iterative search with repeated training and manual evaluation, which makes tuning lengthy but critical. AutoML can efficiently automate this search, even with a large number of hyperparameter combinations, and find the best-performing set of hyperparameters among those it evaluates for a given model/algorithm.
Model Evaluation: How effective is the algorithm at producing the desired results? This evaluation dictates whether a different algorithm should be tried or hyperparameters need to be tweaked. AutoML scores each candidate model against the chosen evaluation metrics to assess its efficacy.
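The four steps above can be sketched as a single search loop. The example below is a toy illustration in plain Python, not how any specific AutoML product works: it assumes synthetic data, two hand-written candidate algorithms (nearest centroid and k-nearest neighbours), and an exhaustive search scored by cross-validated accuracy. All names, including `auto_ml`, are hypothetical, and real tools typically use smarter search strategies than trying every combination.

```python
import random
from itertools import combinations
from statistics import mean

def make_data(n=60, seed=0):
    """Synthetic binary dataset: features 0 and 1 carry signal, feature 2 is noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        label = i % 2
        center = 5.0 * label
        rows.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0), rng.uniform(0, 5)])
        labels.append(label)
    return rows, labels

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_predict(train_x, train_y, row):
    """Candidate algorithm 1: nearest centroid (no hyperparameters)."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return min(centroids, key=lambda lab: sq_dist(row, centroids[lab]))

def knn_predict(train_x, train_y, row, k=3):
    """Candidate algorithm 2: k-nearest neighbours (k is its hyperparameter)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: sq_dist(row, p[0]))[:k]
    votes = [y for _, y in nearest]
    return max(set(votes), key=votes.count)

def cv_accuracy(predict, params, rows, labels, features, folds=5):
    """Model evaluation: mean accuracy over k-fold cross-validation."""
    x = [[r[f] for f in features] for r in rows]
    scores = []
    for fold in range(folds):
        test_idx = set(range(fold, len(x), folds))
        train_x = [x[i] for i in range(len(x)) if i not in test_idx]
        train_y = [labels[i] for i in range(len(x)) if i not in test_idx]
        hits = sum(predict(train_x, train_y, x[i], **params) == labels[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)

def auto_ml(rows, labels):
    """Try every algorithm, feature subset, and hyperparameter setting; keep the best."""
    search_space = {
        "centroid": (centroid_predict, [{}]),
        "knn": (knn_predict, [{"k": k} for k in (1, 3, 5)]),
    }
    n_features = len(rows[0])
    best = None
    for name, (predict, grid) in search_space.items():           # algorithm selection
        for size in range(1, n_features + 1):
            for features in combinations(range(n_features), size):  # feature selection
                for params in grid:                               # model tuning
                    score = cv_accuracy(predict, params, rows, labels, features)
                    if best is None or score > best["score"]:
                        best = {"algorithm": name, "features": features,
                                "params": params, "score": score}
    return best

rows, labels = make_data()
best = auto_ml(rows, labels)
print(best)
```

Even in this small form, the loop shows why automation helps: two algorithms, seven feature subsets, and a handful of hyperparameter values already produce 28 configurations to train and score, which is tedious by hand and trivial for a script.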
What does AutoML not do? Simply put, it’s not a one-touch answer to producing the perfect machine learning model. Business problem definition, data understanding and acquisition, as well as data preparation are still required. AutoML accelerates the process of producing better models, faster, without requiring detailed understanding of each algorithm. Depending on the AutoML tool used, this may be available in a simplified drag-and-drop environment (that allows for less customization), or in pure code for data scientists who want to take advantage of AutoML but still have the ability to fully customize the script. AutoML can provide a final model, or a starting point from which a data scientist fine-tunes the model.
As for why AutoML is becoming popular now, several factors are involved. For one, AutoML has emerged largely because of advances in processing power. As compute power has become more affordable, tools like AutoML have become more accessible. This is particularly true for cloud-based tools, as platforms often provide the ability to scale compute power up as needed.
Another factor is the growth of the available libraries from both open source and commercial developers. This increase in tools expands the scope of what is easily handled by AutoML, which in turn invites greater overall usage by data scientists. In addition, solution vendors are investing in AutoML because of the benefits to data scientists and their organizations.
With all that in mind, does this mean that AutoML replaces data scientists? While AutoML is sometimes promoted as such, that couldn’t be further from the truth.
AutoML is not a replacement for data scientists. It simply can’t do what a data scientist can do in terms of critical thinking, engineering predictive features, or understanding the context and limitations of a project. What it does, however, is make everyone’s life easier. AutoML is a tool for automating steps in the machine learning model lifecycle, and it accelerates the process in different ways to create greater efficiency, regardless of whether the user is new to data science or a data scientist with years of experience.
For the experienced data scientist: Without AutoML, hours go to necessary but manual tasks such as selecting features and tuning hyperparameters rather than to deeper levels of analysis. AutoML seeks to remove those barriers by letting automated processes run in the background while data scientists focus on more complex issues. By eliminating or minimizing tedious tasks, AutoML does not replace data scientists but empowers them to do the things that only they can do.
For newer data scientists: Getting over the hump of creating effective machine learning models can prove to be an uphill battle at times. AutoML makes this much easier, expediting the path to assessing a more complete set of algorithms and hyperparameters. Because AutoML makes it easier to get started, newer data scientists can create useful projects in a more timely fashion while also accelerating their growth in fundamental skills such as algorithm selection and hyperparameter tuning.
A good analogy for how AutoML helps both new and experienced data scientists is the advent of the assembly line in manufacturing. With the assembly line, many tedious processes were automated, enabling workers to put their time and energy into bigger issues, from product quality to improving design and manufacturing processes. AutoML gives similar power to data scientists, freeing up more time to engineer predictive features, develop data acquisition strategies, improve data transformation pipelines, and more.
As compute power becomes more scalable and accessible, the use of AutoML will only become more prevalent and vital in the world of data science. In future posts, we’ll dive deeper into each of these elements of AutoML to help keep you ahead of the curve. Subscribe to the Oracle Big Data Blog to catch the latest on AutoML and machine learning, all delivered straight to your inbox, and don’t forget to follow us on Twitter @OracleBigData.