Unfortunately, not all machine learning algorithm implementations are the same, and the differences can have a significant impact on the success of a data science project. Too often, a project that shows promise in the lab runs into scalability, performance, and deployment issues when it moves to production.
At one level, we can view an algorithm as a set of instructions that performs a particular computation. Ideally, these instructions are unambiguous, but even when they are clear, their implementation can take many forms, from research prototype to enterprise-ready software. From a role perspective, a scientist has the insight to design the algorithm, but often an engineer needs to implement it to meet production specifications.
Other things being equal, most algorithm implementations work effectively on small or even moderate-sized data sets. However, when placed into production and scaled up to enterprise workloads, many implementations run into performance and scalability issues. These often result from two design choices: the implementation is single-threaded, and it requires all the data to reside in memory for processing. Many open source packages offer an extensive and highly valuable set of machine learning algorithms and techniques, but they often share both of these limitations.
This is where applying state-of-the-art engineering techniques pays off. Many successful and useful algorithms can be redesigned to take advantage of parallelism (multi-threading) and distributed execution (across multiple nodes). Model building and scoring can then use multiple CPUs and compute nodes on large-scale hardware, e.g., Oracle Exadata, and on cloud-based solutions, e.g., Oracle Autonomous Database, which helps overcome performance issues.
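To make the idea of redesigning for parallelism concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any Oracle implementation. It assumes a one-variable linear regression, and the names Stats, partition_stats, and fit_parallel are hypothetical. Each worker summarizes its own partition of the data, and only the small summaries are merged before solving for the coefficients.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for fitting y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        # Summaries from different partitions combine by simple addition.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def partition_stats(rows):
    """Summarize one partition of (x, y) pairs in a single pass."""
    s = Stats()
    for x, y in rows:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit_parallel(partitions, max_workers=4):
    """Fit the line from per-partition summaries computed concurrently."""
    # A thread pool keeps the sketch short. In CPython, pure-Python work like
    # this gains little from threads; a process pool or native code would be
    # needed for a real speedup, but the merge pattern stays the same.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partition_stats, partitions))

    total = Stats()
    for p in partials:
        total = total.merge(p)

    denom = total.n * total.sxx - total.sx ** 2
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept
```

Because the merge step is just addition, the same pattern applies whether the partitions live on threads, processes, or separate nodes.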
To scale to larger data volumes, especially those that do not fit in memory, machine learning algorithms need to be redesigned to improve memory utilization. Two common approaches are to process the data incrementally in smaller batches and to use efficient internal data representations that minimize memory consumption, especially for sparse data.
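As a rough sketch of both ideas, the Python below computes per-feature means and variances one batch at a time over sparse rows. The class SparseRunningStats and the helper batches are hypothetical names, and rows are assumed to be dictionaries that map a feature index to its nonzero value, with missing keys treated as zero.

```python
from collections import defaultdict
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable, e.g. a file reader."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


class SparseRunningStats:
    """Per-feature mean and variance, updated one batch at a time."""

    def __init__(self):
        self.rows = 0
        self.sums = defaultdict(float)
        self.sq_sums = defaultdict(float)

    def update(self, batch):
        # Only nonzero entries are stored or visited; zeros are implicit.
        for row in batch:
            self.rows += 1
            for index, value in row.items():
                self.sums[index] += value
                self.sq_sums[index] += value * value

    def mean(self, index):
        if self.rows == 0:
            raise ValueError("no rows seen yet")
        return self.sums.get(index, 0.0) / self.rows

    def variance(self, index):
        # Population variance via E[x^2] - E[x]^2. This is simple but can lose
        # precision on large values; Welford's method is a more stable choice.
        m = self.mean(index)
        return self.sq_sums.get(index, 0.0) / self.rows - m * m


stats = SparseRunningStats()
data = [{0: 2.0}, {0: 4.0, 3: 1.0}, {}, {3: 3.0}]
for batch in batches(data, size=2):
    stats.update(batch)
```

Memory use depends on the batch size and the number of distinct nonzero features, not on the total number of rows.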
A kinder, gentler interface
We can go one step further and reflect on the requirements of the algorithm inputs. These come in two flavors: data and algorithm settings, which include hyperparameters (see What's the difference between a parameter and a hyperparameter?). Regarding data, many algorithms have explicit requirements on data format or representation. For example, neural networks normally require all data to be numeric and normalized. While data scientists may be familiar with the details of each algorithm, less expert users are often stymied by individual algorithm peculiarities. In terms of hyperparameters, some algorithms provide few, if any, “knobs” that can be adjusted to affect model quality or performance, while others provide a wide range of such knobs that may not be well understood by typical users. The combinatorial space of possible settings, which can include feature selection, can make the machine learning task tedious or mundane.
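To get a feel for how quickly this space grows, the Python sketch below enumerates every combination of a small grid of settings and every non-empty feature subset. The hyperparameter names, their values, and the feature names are all made up for illustration and do not belong to any particular algorithm.

```python
from itertools import combinations, product

# Hypothetical settings for an illustrative learner.
grid = {
    "max_depth": [4, 8, 16],
    "min_samples": [1, 5, 20],
    "learning_rate": [0.01, 0.1, 0.3],
}
# Hypothetical candidate features.
features = ["age", "income", "tenure", "region"]


def feature_subsets(names):
    """Yield every non-empty subset of the feature names."""
    for k in range(1, len(names) + 1):
        yield from combinations(names, k)


def candidate_configs(grid, names):
    """Yield (settings, feature_subset) pairs covering the whole space."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        for subset in feature_subsets(names):
            yield dict(zip(keys, values)), subset


total = sum(1 for _ in candidate_configs(grid, features))
```

Even this toy grid yields 27 setting combinations times 15 feature subsets, or 405 candidate models, which is far more than anyone would want to build and compare by hand.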
To address this, algorithm implementations can be augmented with support for automatic data preparation, so that data scientists do not need to perform perfunctory transformations on every data set (unless they want to) and can have those transformations applied automatically when scoring data. Since each algorithm may have a specific data input representation, the set of required transformations can become part of the model building process. Any statistics computed during the build can be maintained with the model and used during scoring. Typical transformations include binning, normalization, and outlier treatment. See Understanding Automatic Data Preparation for more details. Further, a degree of intelligence can be built into the algorithm to tune the hyperparameters automatically and efficiently, so the data scientist can minimize the time spent building and comparing many models that span a wide range of possible hyperparameters.
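The idea of keeping transformation statistics with the model can be sketched in a few lines of Python. This is a simplified illustration, not how Oracle implements automatic data preparation: the class AutoPrepCentroidModel is hypothetical, it uses z-score normalization as the only transformation, and a nearest-centroid classifier stands in for the real algorithm.

```python
import math
from statistics import mean, pstdev


class AutoPrepCentroidModel:
    """Nearest-centroid classifier that learns and reapplies z-score scaling."""

    def fit(self, rows, labels):
        columns = list(zip(*rows))
        # Normalization statistics are computed once, at build time,
        # and stored on the model itself.
        self.means_ = [mean(c) for c in columns]
        self.stds_ = [pstdev(c) or 1.0 for c in columns]  # guard constant columns

        grouped = {}
        for row, label in zip(rows, labels):
            grouped.setdefault(label, []).append(self._prepare(row))
        # One centroid per label, in the normalized feature space.
        self.centroids_ = {
            label: [mean(col) for col in zip(*members)]
            for label, members in grouped.items()
        }
        return self

    def _prepare(self, row):
        return [(v - m) / s for v, m, s in zip(row, self.means_, self.stds_)]

    def predict(self, row):
        # Scoring reuses the stored statistics, so callers pass raw values.
        z = self._prepare(row)
        return min(self.centroids_,
                   key=lambda label: math.dist(z, self.centroids_[label]))


# Raw, unscaled training data: the two columns differ by orders of magnitude.
rows = [[1, 1000], [2, 1100], [9, 5000], [10, 5200]]
labels = ["low", "low", "high", "high"]
model = AutoPrepCentroidModel().fit(rows, labels)
```

The caller never normalizes anything explicitly: model.predict([1.5, 1050]) receives raw values, and the model applies the same scaling it learned when it was built.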
Oracle Machine Learning
In short, not all algorithm implementations are created equal. Oracle Machine Learning, through the Oracle Advanced Analytics option to Oracle Database, addresses the need of enterprises for scalable and performant algorithms, while also providing automatic data preparation and support for hyperparameter tuning. Oracle software engineers apply decades of experience with Oracle Database parallelism and software optimization to deliver machine learning algorithms that execute in the Oracle Database kernel. Since enterprise data often reside in an Oracle database, there is no need to move data to external servers to perform machine learning. This avoids the data access latency and duplication that such movement causes, along with the corresponding security, backup, and recovery issues. The Oracle Machine Learning algorithms are available directly from a SQL API (OML4SQL), an R API (OML4R), a notebook interface, and the Oracle Data Miner user interface. See Oracle Machine Learning for more information.