Oracle Machine Learning supports data science and machine learning with Oracle Database, Oracle Autonomous Database, and Oracle Big Data-related products and technologies. As introduced in my previous blog on the Oracle Machine Learning family of products, OML offers a wide range of features and capabilities; however, in this blog we'll focus on three key attributes: automated, scalable, and production-ready.
Data science in general, and machine learning in particular, can involve many repetitive activities, for example, data preparation, text processing, ensembles, model building and evaluation, algorithm selection, and model hyperparameter tuning.
For data preparation, OML supports automated data preparation (ADP) for steps such as binning, normalization, and outlier and missing value treatment. These can be applied automatically based on the requirements of the specific algorithm. For text processing, OML allows one or more columns in a data set to be identified as text columns. These are automatically processed to extract, for example, tokens with a TF–IDF score. The resulting sparse tokenized data is automatically integrated with the other structured data in the training data set and passed to the algorithm.
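To make the text processing concrete, here is a minimal sketch of how TF–IDF scores could be computed for a tiny corpus. This is purely illustrative Python, not OML code; in OML the tokenization and scoring happen automatically inside the database, and the function and variable names here are my own.

```python
import math

def tfidf(docs):
    """Compute per-document TF-IDF scores for a tiny corpus (conceptual sketch)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each token.
    df = {}
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    scores = []
    for toks in tokenized:
        tf = {t: toks.count(t) / len(toks) for t in set(toks)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

docs = ["engine failure reported", "engine running normally"]
sparse = tfidf(docs)
# A token unique to one document ("failure") scores higher than a token
# shared by all documents ("engine"), whose IDF is zero.
```

The sparse per-document token scores produced this way are what would then be joined with the remaining structured columns before model building.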
OML supports partitioned models as a type of ensemble model. The partitioned model feature enables specifying one or more columns on which to partition the data, then automatically builds one model per partition, resulting in a single model composed of multiple sub-models. Users score data against the top-level model, and the system chooses the proper sub-model based on the partition value in the row being scored. This eliminates the need to manually build and maintain separate models and to select individual models for scoring.
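The partitioned model idea can be sketched in a few lines of plain Python. The classes below are conceptual stand-ins, not the OML implementation: one sub-model is built per partition-column value, and each scored row is routed to the proper sub-model automatically.

```python
class PartitionedModel:
    """Conceptual sketch: one sub-model per partition value, routed at scoring."""
    def __init__(self, partition_col, model_factory):
        self.partition_col = partition_col
        self.model_factory = model_factory
        self.sub_models = {}

    def fit(self, rows):
        # Group training rows by partition value and build one model per group.
        groups = {}
        for row in rows:
            groups.setdefault(row[self.partition_col], []).append(row)
        for key, group in groups.items():
            self.sub_models[key] = self.model_factory(group)

    def predict(self, row):
        # Caller scores against the top-level model; the proper sub-model
        # is chosen from the row's partition value.
        return self.sub_models[row[self.partition_col]].predict(row)

class MeanModel:
    """Trivial stand-in 'model': predicts the partition's mean target value."""
    def __init__(self, rows):
        self.mean = sum(r["y"] for r in rows) / len(rows)
    def predict(self, row):
        return self.mean

rows = [
    {"region": "EU", "y": 10}, {"region": "EU", "y": 12},
    {"region": "US", "y": 100}, {"region": "US", "y": 102},
]
pm = PartitionedModel("region", MeanModel)
pm.fit(rows)
```

Scoring `{"region": "EU"}` uses the EU sub-model and `{"region": "US"}` the US one, without the caller ever referencing a sub-model directly.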
Oracle Data Miner automates several aspects of the data science / machine learning process. For example, with classification and regression, Oracle Data Miner automatically splits the data into training and test data sets, builds multiple models, and presents the evaluation of model quality. This is achieved through a drag-and-drop user interface for constructing analytical workflows. These workflows can be automatically turned into PL/SQL scripts for separate deployment in Oracle Database.
A new feature being introduced with the upcoming Oracle Machine Learning for Python (OML4Py) is AutoML, which consists of automatic model selection, feature selection, and hyperparameter tuning. AutoML employs meta-learning, that is, the use of machine learning to guide the machine learning process. Auto Model Selection identifies the algorithm that achieves the highest model quality, finding the best model faster than exhaustive search techniques can. Auto Feature Selection reduces the number of features by identifying those that are most predictive for the specified classification or regression target. By eliminating noisy data, this can not only improve performance but also significantly increase accuracy. Auto Tune avoids manual or exhaustive search when specifying the settings that control how the algorithm builds the model, which can significantly improve model accuracy while taking the grunt work out of manual exploration.
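For intuition, the search that AutoML avoids doing exhaustively can be pictured as a loop over candidate algorithms and hyperparameter settings. The toy code below uses made-up scoring functions in place of real algorithms and a tiny hypothetical grid; OML4Py's AutoML uses meta-learning to prune such a search rather than enumerating it as shown here.

```python
from itertools import product

# Toy stand-ins for candidate algorithms: each returns a made-up
# "accuracy" for a given hyperparameter value (not real models).
def model_a(depth):
    return 0.70 + 0.02 * min(depth, 5)

def model_b(depth):
    return 0.76 + 0.01 * min(depth, 5)

candidates = {"model_a": model_a, "model_b": model_b}
grid = [1, 3, 5]  # hypothetical hyperparameter values to try

# Exhaustive search: evaluate every (algorithm, setting) pair and keep
# the best. Meta-learning-based AutoML avoids most of these evaluations.
best = max(
    ((name, depth, fn(depth))
     for (name, fn), depth in product(candidates.items(), grid)),
    key=lambda t: t[2],
)
```

Even this tiny grid requires six evaluations; real model building is far more expensive per evaluation, which is why pruning the search matters.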
So not only does this automation increase data scientist productivity and, in the case of AutoML, reduce overall compute time, it also enables non-expert users to leverage machine learning, since they do not need to know the finer points of the machine learning process or of each algorithm.
As enterprise data volumes continue to grow, traditional data processing and machine learning implementations, whether proprietary or open source, often struggle to keep up. The key to handling big data volumes is rethinking and redesigning traditional algorithms to make them parallel and distributed. Algorithms must also make better use of available memory by not having to load all of the data prior to model building. Another major factor for scalability is avoiding moving data to analytical engines, which applies to both model building and data scoring.
The OML algorithms embedded in the kernel of Oracle Database eliminate data movement. These algorithms operate within the secure layer of Oracle Database, giving them the fastest possible access to database data. Furthermore, by partitioning the data and algorithm processing for parallel execution, they can fully utilize powerful platforms like Oracle Exadata and Oracle Autonomous Database. The OML algorithms that are part of OML4Spark are likewise implemented in a parallel, distributed manner that allows them to take full advantage of a Hadoop cluster.
For data processing, Oracle Database is unsurpassed in performance and scalability. This enables scalable data preparation, exploration, and analytics without moving data to external analytical engines. Oracle exposes these capabilities directly from SQL, but also through the transparency layers of OML4R and the upcoming OML4Py. The transparency layer uses proxy objects that reference data available as tables and views in Oracle Database. Standard data frame functions are overloaded, and the desired functionality is translated into SQL for execution in Oracle Database as a high-performance compute engine. This allows R and Python users to manipulate data via standard syntax. OML4Spark also provides a transparency layer for referencing data in file systems, HDFS, Hive, Impala, Spark DataFrame, and JDBC data sources in big data environments.
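The transparency-layer idea can be illustrated with a small sketch: a proxy object records dataframe-style operations and emits the SQL that would run in the database. The class, method names, and generated SQL shape below are my own simplifications for illustration, not the actual OML4R/OML4Py implementation.

```python
class TableProxy:
    """Conceptual sketch of a transparency-layer proxy over a database table."""
    def __init__(self, table, columns=None, predicate=None):
        self.table = table
        self.columns = columns
        self.predicate = predicate

    def select(self, *cols):
        # Record a column projection; no data is moved or copied.
        return TableProxy(self.table, list(cols), self.predicate)

    def filter(self, predicate):
        # Record a row filter; again, only metadata changes.
        return TableProxy(self.table, self.columns, predicate)

    def to_sql(self):
        # Translate the recorded operations into SQL for in-database execution.
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.table}"
        if self.predicate:
            sql += f" WHERE {self.predicate}"
        return sql

q = TableProxy("SALES").select("CUST_ID", "AMOUNT").filter("AMOUNT > 100")
# q.to_sql() -> 'SELECT CUST_ID, AMOUNT FROM SALES WHERE AMOUNT > 100'
```

The user writes familiar dataframe-style calls; the heavy lifting happens where the data lives, which is the whole point of the transparency layer.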
In other situations, data scientists want to leverage third-party packages from the open source R and Python ecosystems. These packages are normally single-threaded and require all data to fit in memory, which limits both performance and scalability. OML4R and OML4Py provide a feature called embedded execution that allows users to store their R and Python user-defined functions in Oracle Database and then execute those functions in a data-parallel, task-parallel, or non-parallel manner. By facilitating "embarrassingly parallel" execution under the control of Oracle Database, users can, for example, score very large data sets using an open source model at scale. Note that the underlying performance and scalability characteristics of the third-party packages are not changed, only the way they are used, so such packages must still be used judiciously.
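Here is a rough sketch of the data-parallel pattern that embedded execution enables. A thread pool stands in for the engines that Oracle Database would spawn and manage, and `score_chunk` stands in for a user-defined function wrapping an open source model's predict call; both are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def score_chunk(chunk):
    # Stand-in for a user-defined function that calls an open source
    # model's predict() on one chunk of rows; here it just doubles values.
    return [2 * x for x in chunk]

data = list(range(10))
# Split the data into independent chunks ("embarrassingly parallel").
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]

# Each chunk is scored independently; in OML, the database manages the
# parallel engines, data flow, and result assembly instead of this pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score_chunk, chunks))

scored = [y for part in results for y in part]
```

Because no chunk depends on another, adding workers scales scoring throughput without changing the single-threaded package itself, which is exactly the point made above.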
One of the biggest challenges for data science and machine learning projects is getting solutions into production. Even when data scientists and analysts provide a demonstrably valuable solution to a business problem, the challenges associated with integrating that solution with existing operational systems, applications, and dashboards either takes too long, or is fraught with infrastructure and plumbing quality issues. This, unfortunately, can derail even the best of initiatives.
Oracle Machine Learning approaches this need for faster and easier solution deployment in several ways. First, because machine learning models reside in the database, SQL users can immediately use them for scoring through SQL queries. This means that any application or dashboard that can work with SQL can immediately score data, either in batch or interactively on individual rows. With the introduction of Oracle REST Data Services, those same models can be used via REST invocations.
Second, as discussed above, embedded execution can enable some aspects of scalability, but equally or even more important is the ability to deploy R and Python user-defined functions in production easily through SQL invocation. If a data scientist produces powerful visualizations or uses algorithms available in open source packages, putting these results into production often involves ad hoc solutions that require spawning separate analytical engines, managing data movement, and addressing backup, recovery, and security issues explicitly. This can greatly increase solution deployment complexity and lead to the delay or failure of a data science / machine learning project. Using OML4R or OML4Py embedded execution allows immediate deployment of user-defined functions where the results can be returned as a database table. Structured data frame results are available directly as a table, while images from R or Python visualization capabilities can also be returned as a table with a BLOB column containing the PNG image(s). In addition, both structured and image data can be returned as XML for use by components that naturally work with XML.
These three key attributes—automated, scalable, and production-ready—enable users to increase their productivity, achieve enterprise goals faster and more easily, and with the time savings, innovate more.