
Learn about Oracle Machine Learning for Oracle Database and Big Data, on-premises and Oracle Cloud

February 14, 2020

Automated Machine Learning – will automation replace Data Scientists?

Mark Hornick
Senior Director, Data Science and Machine Learning

The short answer is no, at least not entirely. There is a lot of anticipation and hype around artificial intelligence replacing workers in all sorts of roles — taxi and truck drivers, paralegals, even doctors when it comes to diagnosis. While the promise is great, today’s reality is something else. That doesn’t mean, of course, we can’t make major strides in increasing data scientist productivity through automation.

A significant part of the machine learning process can be automated by addressing some of its more time-intensive and repetitive aspects. At the same time, automation enables non-experts to leverage machine learning, even if they do not know the finer details of the algorithms and their settings (hyperparameters). Problem definition, data preparation, and evaluation of the final solution are still largely human activities.

With Oracle Machine Learning for Python (OML4Py), a new component of the Oracle Machine Learning product family that is coming soon, we introduce a feature called AutoML. The goal of AutoML is to increase data scientist productivity while reducing the overall compute time required to derive a high-quality model. AutoML comprises three main steps: algorithm selection, feature selection, and hyperparameter tuning.

Most machine learning toolkits provide a wide range of algorithms to solve a given type of problem, for example classification or regression. However, it’s not always clear which algorithm will work best on a given data set. Different algorithms “see” patterns in the data and relationships among predictors differently, and this can have a dramatic effect on model quality.

With OML4Py AutoML algorithm selection, we leverage a technique called meta-learning, where a pre-built model predicts which algorithm is most applicable to a given data set based on the distribution of values in that data set. This helps data scientists (and non-expert users) find the best algorithm much faster than exhaustive search techniques, which build and evaluate a model with every candidate algorithm.
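To make the meta-learning idea concrete, here is a minimal sketch in plain Python. It is not OML4Py's implementation: the meta-features, the `HISTORY` table (with made-up values), and the nearest-neighbor ranking are all assumptions chosen for illustration. The point is only that a data set can be summarized by a few distribution statistics, and those statistics can be used to rank algorithms without training each one.

```python
import math
from statistics import mean, pstdev


def meta_features(rows, labels):
    """Summarize a data set as a small vector of distribution statistics."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    # Average per-column spread (population standard deviation).
    col_std = mean(pstdev(col) for col in zip(*rows))
    # Share of the most frequent class, a simple class-imbalance signal.
    majority = max(labels.count(c) for c in set(labels)) / n_rows
    return (math.log10(n_rows), math.log10(n_cols), col_std, majority)


# Hypothetical "experience" table: meta-features of previously seen data sets,
# each paired with the algorithm that worked best there. Values are invented
# for this sketch and carry no empirical meaning.
HISTORY = [
    ((2.0, 0.5, 1.0, 0.5), "decision_tree"),
    ((4.0, 1.5, 1.0, 0.6), "neural_network"),
    ((3.0, 1.0, 0.5, 0.9), "naive_bayes"),
]


def rank_algorithms(features, history=HISTORY):
    """Rank algorithms by how similar their past data sets are to this one."""
    by_distance = sorted(history, key=lambda item: math.dist(features, item[0]))
    return [name for _, name in by_distance]


rows = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
labels = [0, 1, 0, 1]
print(rank_algorithms(meta_features(rows, labels)))
```

A production meta-learner would use a trained model over many more meta-features rather than a three-row lookup, but the shape of the computation is the same: describe the data, then predict the ranking.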

Next comes automated feature selection. Features are also known as predictors, and some data sets can have a lot of them. Which features to use for model building often depends on the algorithm chosen. OML4Py AutoML automatically assesses each of the features and identifies those that are most predictive of the target. By reducing the number of features used to build a model, we can not only reduce model build time (and scoring time), but also increase accuracy by removing features that contribute noise.
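As a rough illustration of scoring features against the target, the sketch below ranks numeric columns by the absolute value of their Pearson correlation with the target and keeps the top `k`. This is an assumed, simplified filter method, not the technique OML4Py uses; the function names and the toy data are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def select_features(columns, target, k):
    """Keep the k columns most strongly (linearly) related to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


columns = {
    "signal": [1, 2, 3, 4, 5],     # perfectly tracks the target
    "noise": [3, 1, 4, 1, 5],      # weakly related
    "constant": [7, 7, 7, 7, 7],   # carries no information
}
target = [2, 4, 6, 8, 10]
print(select_features(columns, target, k=2))  # ['signal', 'noise']
```

Correlation only captures linear, one-feature-at-a-time relationships. Because the best feature subset often depends on the algorithm, an algorithm-aware selector would score features by their effect on that algorithm's model quality instead.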

The last step in this process is hyperparameter tuning. Hyperparameters are really just the parameters we provide to the algorithm to direct its behavior. Each algorithm typically has its own set of hyperparameters, and some are easier to understand and tweak than others. For example, in a decision tree, we might specify the maximum depth of the tree that we want the algorithm to build. In a neural network, it may be the number of hidden layers and neurons per layer. Once we have selected the algorithm and the data, we can specify the set of hyperparameters. While default hyperparameters may be good, adjusting (tuning) these parameters often produces a better model. With automated hyperparameter tuning, OML4Py AutoML also uses a form of meta-learning with a gradient descent approach to significantly improve model accuracy while avoiding the manual or exhaustive search techniques that one might otherwise resort to. Reducing the compute time not only improves user productivity, but reduces compute costs as well, especially as more users are adopting pay-as-you-go cloud resources.
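The difference between a gradient-guided search and an exhaustive one can be sketched in a few lines. The example below is an assumption-laden toy, not OML4Py's tuner: `validation_loss` is a hypothetical stand-in for "train a model with this hyperparameter and score it on held-out data", and the single continuous hyperparameter (think of it as the log of a learning rate) is chosen so that a finite-difference gradient is well defined.

```python
def validation_loss(log_rate):
    """Hypothetical stand-in for training a model and scoring it on held-out data.

    The loss surface is invented for this sketch: smooth, with its minimum at -2.0.
    """
    return (log_rate + 2.0) ** 2 + 0.1


def tune(loss, start, step=0.25, eps=1e-3, iterations=50):
    """Gradient descent on one hyperparameter using a central finite difference."""
    value = start
    evaluations = 0
    for _ in range(iterations):
        # Estimate the slope of the loss from two nearby evaluations.
        gradient = (loss(value + eps) - loss(value - eps)) / (2 * eps)
        evaluations += 2
        # Move against the slope, toward lower validation loss.
        value -= step * gradient
    return value, evaluations


best, n_evals = tune(validation_loss, start=0.0)
print(f"best log_rate ~ {best:.4f} after {n_evals} loss evaluations")
```

Each loss evaluation corresponds to a model build, which is the expensive part in practice. A grid search over several hyperparameters multiplies those builds combinatorially, whereas a guided search spends them only along the path toward better settings. Discrete hyperparameters such as tree depth need a different treatment than the continuous case shown here.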

Automating these machine learning steps—algorithm selection, feature selection, and hyperparameter tuning—can reduce data scientist effort and yield useful models sooner. This also enables understanding sooner whether the data are sufficient to address the business problem.

AutoML is just one of the OML4Py features, which also include scalable in-database model building, scalable data exploration and preparation operations on database tables from Pandas DataFrame proxy objects, and embedded Python execution enabling Python script deployment from SQL. OML4Py will be available first on Oracle Autonomous Database, followed by on-premises availability for 18c, 19c, and 20c. See also the article on the Oracle Database 20c preview release.
