Accelerate your model build process with the Intel® Extension for Scikit-learn

June 8, 2021 | 3 minute read
Praveen Patil
Principal Product Manager - Data Science
Jize Zhang
Software Developer

Earlier this month, Oracle Cloud Infrastructure (OCI) Data Science released support for the Intel scikit-learn extension, daal4py, to accelerate your scikit-learn applications. This extension speeds up your scikit-learn models by using the Intel oneAPI Data Analytics Library (oneDAL).

 

What is daal4py? 

Daal4py is designed to make machine learning in Python fast and easy to use. With minimal code changes, daal4py dynamically patches scikit-learn estimators to use the Intel oneDAL library as the underlying solver, so that they produce the same results faster.

 

Top features and benefits

  • Maximizing classic machine learning algorithms' performance
  • Seamless user experience: Two lines of code changes only (same code, same behavior)
  • Scikit-learn conformance with mathematical equivalence, defined by the scikit-learn Consortium, continuously vetted by public continuous integration

 

The following scikit-learn algorithms are available as part of the accelerator: 

  • Random forest regression and classification
  • SVC
  • kNN regression, classification, and search with kd-tree and brute force
  • Logistic regression
  • Principal component analysis (PCA)
  • K-means
  • Density-based spatial clustering of applications with noise (DBSCAN)
  • Linear and ridge regression
  • ElasticNet and LASSO
  • Dimensionality reduction (t-SNE)

 

Use case

The following example shows the improvement in performance when training a k-means model using the daal4py accelerator instead of stock scikit-learn.

1. Install the latest version of the Intel Extension for Scikit-learn.

             pip install scikit-learn-intelex

 

2. Load the necessary modules.

import daal4py.sklearn

import importlib

import logging

import numpy as np

import sklearn

import time

import warnings

 

from sklearn.datasets import make_blobs

from sklearn.cluster import KMeans

 

3. Prepare the dataset using the sklearn make_blobs function, which generates isotropic Gaussian blobs for clustering. The following command creates a dataset with 100K rows and 150 columns. 

rows, cols = 100000, 150

X, y = make_blobs(n_samples=rows, n_features=cols, centers=8, random_state=42)

 

4. Train the k-means model using sklearn on the dataset.

from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=8)

print("Module being used: " + estimator.__module__)

 

t0 = time.perf_counter()

trained = estimator.fit(X)

fit_elapsed = str(time.perf_counter() - t0)

print("Training took " + fit_elapsed + " seconds")
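
The same timing pattern is used again in the next step, so it can be convenient to wrap it in a small helper. The following is a minimal sketch: `time_fit` is a hypothetical name that is not part of scikit-learn or daal4py, and it assumes only that the object passed in exposes a `fit` method.

```python
import time


def time_fit(estimator, X):
    """Fit `estimator` on `X` and return (fitted_estimator, elapsed_seconds).

    Hypothetical helper for illustration; works with any object
    that has a `fit(X)` method.
    """
    t0 = time.perf_counter()
    fitted = estimator.fit(X)
    elapsed = time.perf_counter() - t0
    return fitted, elapsed
```

With the dataset from step 3, a call such as `trained, seconds = time_fit(KMeans(n_clusters=8), X)` returns the elapsed time as a float, which can then be formatted or compared directly.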

 

5. Train the k-means model using the daal4py accelerator. The accelerator requires only one extra step: dynamically patch the sklearn estimators so that they use oneDAL as the underlying solver. You get the same solution as before, but faster. The sklearn modules must be imported again after the patching is complete.

from sklearnex import patch_sklearn

patch_sklearn()

 

from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=8)

 

# After patching, this should indicate daal4py is being used

print("Module being used: " + estimator.__module__)

 

After patching, train the model using the daal4py accelerator.

t0 = time.perf_counter()

trained = estimator.fit(X)

elapsed = str(time.perf_counter() - t0)

print("Training took " + elapsed + " seconds")
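
Why do the sklearn modules need to be imported again after patching? A statement such as `from sklearn.cluster import KMeans` binds a local name to whatever object the module attribute points to at that moment, so a name bound before patching keeps pointing to the original class. The toy sketch below illustrates this with a stand-in module; `toy_cluster`, `StockKMeans`, and `FastKMeans` are made-up names, and this is a simplified picture of attribute replacement in general, not daal4py's actual patching code.

```python
import types

# A stand-in module with one class; this is a toy, not scikit-learn.
toy_cluster = types.ModuleType("toy_cluster")


class StockKMeans:
    backend = "stock"


class FastKMeans:
    backend = "accelerated"


toy_cluster.KMeans = StockKMeans

# Equivalent of `from toy_cluster import KMeans` before patching:
# the local name is bound to the object the attribute points to right now.
KMeans_before = toy_cluster.KMeans

# "Patching" replaces the attribute on the module object.
toy_cluster.KMeans = FastKMeans

# Equivalent of importing again after patching.
KMeans_after = toy_cluster.KMeans

print(KMeans_before.backend)  # stock
print(KMeans_after.backend)   # accelerated
```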

 

6. Finally, unpatch the accelerator and reload sklearn to go back to the stock sklearn modules.

from sklearnex import unpatch_sklearn

unpatch_sklearn()

sklearn = importlib.reload(sklearn)

# remember to re-import all the relevant modules
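
If you switch between the two implementations often, pairing the patch and unpatch calls in a context manager keeps them from getting out of sync, even when the code in between raises an exception. This is a minimal sketch under stated assumptions: `temporarily_patched` is a hypothetical helper, not something provided by the extension, and it accepts any two zero-argument callables.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_patched(patch, unpatch):
    """Call `patch()` on entry and always call `unpatch()` on exit.

    Hypothetical helper for illustration; `patch` and `unpatch`
    are any zero-argument callables.
    """
    patch()
    try:
        yield
    finally:
        # Runs whether the body finished normally or raised.
        unpatch()
```

With the extension installed, the two callables could be `patch_sklearn` and `unpatch_sklearn`. The sklearn estimators still need to be imported inside the `with` block to pick up the patched versions, and imported again after it to return to the stock ones.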

 

Conclusion

In this comparison, the daal4py accelerator provides an almost 50% improvement in training time. Hopefully, this brief overview gives an idea of how Intel accelerators for sklearn can help improve performance; they are now available within the Oracle Cloud Infrastructure Data Science environment.
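
To put your own two timings on the same footing, the improvement can be expressed either as a percentage reduction in training time or as a speedup factor. The sketch below assumes that "improvement" means a reduction relative to the stock sklearn time; the function names are hypothetical, and the numbers in the usage lines are placeholders chosen for easy arithmetic, not measured results.

```python
def percent_reduction(baseline_seconds, accelerated_seconds):
    """Return the reduction in training time as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return 100.0 * (baseline_seconds - accelerated_seconds) / baseline_seconds


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Placeholder timings, not measurements from this example.
print(percent_reduction(10.0, 5.0))  # 50.0
print(speedup(10.0, 5.0))            # 2.0
```

Under this reading, a 50% reduction in training time corresponds to a 2x speedup.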

For more information, see the following resources: 

Praveen Patil

Principal Product Manager - Data Science

Currently working as a product manager with the Data & AI group within Oracle Cloud Infrastructure.

Prior to moving to a product role, I was a practitioner in the data science space. Over the years, my experience has been in applying advanced analytics and data science methodologies to various domains: financial services, telecom, entertainment and gaming, and cloud business.

Jize Zhang

Software Developer

Jize is a Seattle-based software engineer at Oracle with a doctorate in applied mathematics from the University of Washington.

