
Accelerate machine learning in Oracle Cloud Infrastructure on NVIDIA GPUs with RAPIDS

JR Gauthier
Sr Principal Product Data Scientist
This is a syndicated post; view the original post here.

Earlier this month, Oracle Cloud Infrastructure (OCI) Data Science released conda environments for notebook sessions. One of the environments available for NVIDIA GPU virtual machines (VMs) is the RAPIDS (version 0.16) environment. In this post, I give an overview of NVIDIA RAPIDS and why it’s awesome!

RAPIDS is a suite of open source machine learning libraries that lets engineers run end-to-end machine learning and analytics pipelines entirely on GPUs. I like to think of RAPIDS as the pandas+sklearn for GPUs, a flexible collection of libraries that covers a wide range of algorithms typically used in industry. As you might expect, you can see substantial speedups when training models on GPUs compared to CPUs. I show a few examples in this post.

The RAPIDS user interface is also similar to what you experience in the PyData stack of libraries, including pandas, scikit-learn, and Dask. If you're familiar with those libraries, the RAPIDS learning curve is minimal.

Getting started

To get started with OCI Data Science and our notebook session environment, visit our Getting Started with Data Science tutorial page. This page guides you through the process of configuring your OCI tenancy for Data Science and launching a notebook session. You can also watch the video, Oracle Cloud Infrastructure Data Science Tenancy Setup, on our YouTube playlist.

Create a notebook session, or activate a deactivated notebook session, on a GPU shape. Both VM.GPU2.1 and VM.GPU3.X shapes work. Install the NVIDIA RAPIDS 0.16 conda environment from the Environment Explorer tool. RAPIDS works out of the box on GPUs.

If you want to monitor usage of the GPUs while you run RAPIDS commands, I recommend using the handy gpustat command from a JupyterLab terminal window. The following command refreshes statistics every three seconds.

gpustat -i 3 

After running the command, gpustat displays per-GPU utilization and memory statistics in the terminal window.

Overview of the NVIDIA RAPIDS Conda Environment for GPUs

One of the Data Science conda environments that we offer in notebook sessions is the NVIDIA RAPIDS version 0.16. It’s available on VM.GPU2.1 (NVIDIA P100) and VM.GPU3.X (NVIDIA V100 Tensor Core GPU) shapes. For more information about the new conda environments feature, watch our new video on Data Science conda environments, available in our YouTube playlist.

In the remainder of this post, I walk you through each of the RAPIDS libraries to showcase their functionality and performance improvements.

A screenshot of the version details of NVIDIA RAPIDS 0.16.

RAPIDS dataframe ETL library (cuDF)

RAPIDS CUDA data frame (cuDF) is probably the first library you need when using RAPIDS. cuDF is similar to pandas in that it handles data frames. It’s built on top of Apache Arrow, which supports columnar data formats. You can do most of the standard operations you can do with pandas or Dask with cuDF. cuDF supports the standard NumPy data types (dtypes).

Let’s download a CSV file from our public Object Storage bucket. The dataset is designed for training classifiers on imbalanced data. It has 294K rows and 36 columns, and the target column is called “anomalous.”

import pandas as pd 
import cudf as cdf 
from urllib.request import urlretrieve

urlretrieve("https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/synthetic%2Foracle_fraud_dataset1.csv", "./fraud.csv")

In the following code cell, I read the local fraud.csv file with pandas using read_csv(). I do the same operation in a separate cell with cudf read_csv() and compare the wall time for both executions.


%%time
pdf = pd.read_csv("fraud.csv")

With cuDF:


%%time
df = cdf.read_csv("fraud.csv")

Same dataset, same location. You can see a significant speedup with cuDF: I get about twice the speedup. Although the dataset is small, I’ve run similar tests on a much larger dataset (11M rows with 29 columns) and got a speedup of about 20 times. I ran my tests on a VM.GPU3.2 shape. cuDF can also ingest data in various formats, including Avro, CSV, Feather, HDF, ORC, Parquet, and JSON.

Data ingestion is fast, but operations on the data are much faster than their pandas equivalents. Next, let’s try a simple sort, first with pandas and then with cuDF.
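The timing cells for this comparison didn’t survive the post. Because cuDF mirrors the pandas API, the same sort_values call runs in both libraries; here is a minimal CPU-side sketch with a synthetic stand-in for fraud.csv (the column names are made up for the example; on a GPU shape, the commented cuDF lines run the identical operation):

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for fraud.csv: 294K rows of random floats.
rng = np.random.default_rng(0)
pdf = pd.DataFrame({"amount": rng.random(294_000), "score": rng.random(294_000)})

# In a notebook you would time this with the %%time magic; here we time it
# explicitly.
start = time.perf_counter()
sorted_pdf = pdf.sort_values(by="amount")
print(f"pandas sort: {time.perf_counter() - start:.4f} s")

# With cuDF, only the construction changes; the sort call is identical:
#   import cudf
#   df = cudf.DataFrame.from_pandas(pdf)
#   sorted_df = df.sort_values(by="amount")
```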
I get speedups of about 5–6 times. Now, try a custom transformation of your choice. cuDF supports various transformations on data frames including grouping, joining, filtering, and custom transformations applied to rows and columns. To learn more about the supported transformations, consult the cuDF documentation or refer to the cuDF GitHub repo. If a transformation isn’t supported with cuDF, you can convert to a pandas data frame, apply the transformations, convert back to a cuDF data frame, and then resume your workload.

Another interesting feature of cuDF is that it can be used with Dask through the dask_cudf library for scaling workloads to multiple GPUs on the same compute node or to multiple nodes with multiple GPUs. If your data doesn’t fit in the memory of a single GPU, use dask_cudf.

RAPIDS machine learning library (cuML)

RAPIDS cuML is the machine learning library of RAPIDS. It closely follows the scikit-learn API and provides implementations for the following algorithms and more:

  • DBSCAN and K-means for clustering analysis

  • PCA, tSVD, UMAP, and TSNE for dimensionality reduction

  • Linear regression with lasso, ridge, and elasticnet regularization for linear modeling of both regression and classification use cases

  • Random forest for both regression and classification use cases

  • K-Nearest neighbors for both classification and regression use cases

  • SVC and epsilon-SVR

  • Holt-Winters exponential smoothing and ARIMA modeling for time series

cuML is well suited to bread-and-butter machine learning work and to training models on tabular datasets. As I mentioned with cuDF, you can train most of the algorithms listed at scale, across more than one GPU, using Dask. You can benefit from this distributed approach by launching notebook sessions on VM.GPU3.2 or VM.GPU3.4 shapes.
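Because cuML closely tracks the scikit-learn API, most estimators are near drop-in replacements. A small sketch with K-means clustering on synthetic blobs illustrates this (shown with sklearn so it runs on CPU; on a GPU shape, `from cuml.cluster import KMeans` accepts the same calls):

```python
import numpy as np
from sklearn.cluster import KMeans  # on GPU: from cuml.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs of 2-D points.
X, _ = make_blobs(n_samples=1_000, centers=3, cluster_std=0.5, random_state=42)
X = X.astype(np.float32)  # cuML generally prefers float32 inputs

# Same constructor arguments and fit/predict calls in both libraries.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.predict(X)
print(km.cluster_centers_.shape)  # → (3, 2)
```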

Let’s look at an example. The following code snippet compares the training of a random forest algorithm using scikit-learn and cuML. It’s similar to what the NVIDIA team posted in their introductory notebook.

First, we import the required modules. Although the cuML random forest algorithm is different from the sklearn one, I tried to use a similar set of hyperparameters for both algorithms to allow for a meaningful comparison.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split as sk_train_test_split

import cuml
from cuml.preprocessing.model_selection import train_test_split as cu_train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF

# random forest depth and size. For both cuML and sklearn algos
n_estimators = 25
max_depth = 10

Next, we split the dataset column-wise by separating the features from the target variable. Then we do a train/test split using the sklearn and cuML functions. Both produce train and test datasets of the same sizes.

# splitting our datasets between features and target variable: 
X, y = pdf[pdf.columns.drop('anomalous').values], pdf['anomalous']
Xcu, ycu = df[df.columns.drop('anomalous').values], df['anomalous']

# converting the dataframes into float32. cuML random forest predict 
# supports only float32.  
X = X.astype(np.float32)
Xcu = Xcu.astype(np.float32)

# Train/test split for each dataset. By default, sklearn train_test_split function 
# generates a train dataset with size corresponding to 75% of the original dataset.  
X_train, X_test, y_train, y_test = sk_train_test_split(X, y)
Xcu_train, Xcu_test, ycu_train, ycu_test = cu_train_test_split(Xcu, ycu, train_size=0.75)

We’re now ready to instantiate estimator objects for both libraries, using similar hyperparameters:

# sklearn random forest estimator:
sk_est = RandomForestClassifier(n_jobs=-1, max_depth=max_depth, n_estimators=n_estimators)

# cuML random forest estimator:
cu_est = cuRF(max_depth=max_depth, n_estimators=n_estimators, random_state=0)

We can fit each estimator separately. Note that we fit on the training split, not the full dataset, because we evaluate on the held-out test set later. For sklearn:


sk_rfc = sk_est.fit(X_train, y_train)

For cuML:


cu_rfc = cu_est.fit(Xcu_train, ycu_train)

On a separate machine with an NVIDIA P100 GPU card, I noticed an approximately 50x speedup with cuML. Fitting the random forest estimator with sklearn took about five seconds, while it took about 100 ms with cuML, a significant speedup. I can also explore the hyperparameter space of my random forest far more efficiently with cuML than I can with sklearn.

Now, we can compare the performance of each estimator on the test dataset. Because I didn’t apply any up- or down-sampling correction to this imbalanced dataset, take the accuracy numbers with a grain of salt.

sk_predictions_test = sk_rfc.predict(X_test)
cu_predictions_test = cu_rfc.predict(Xcu_test)

cu_score = cuml.metrics.accuracy_score(ycu_test, cu_predictions_test)
sk_score = accuracy_score(y_test, sk_predictions_test)

print( " cuml accuracy: ", cu_score )
print( " sklearn accuracy : ", sk_score )
 cuml accuracy:  0.9993894100189209
 sklearn accuracy :  0.9995793872622181

Both algorithms perform similarly, but cuML’s random forest is 50 times faster to train. I invite you to generate datasets of different sizes and look at the impact of dataset size on cuML’s performance.
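To follow that invitation on the CPU side, a sketch like the following times the fit at a few dataset sizes (sklearn shown so it runs anywhere; on a GPU shape, the cuRF estimator from earlier is a near drop-in). The synthetic data from make_classification is a stand-in for fraud.csv, with the same hyperparameters as above:

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

timings = {}
for n_samples in (1_000, 10_000, 50_000):
    # Synthetic imbalanced binary dataset standing in for fraud.csv.
    X, y = make_classification(n_samples=n_samples, n_features=36,
                               weights=[0.95], random_state=0)
    X = X.astype(np.float32)
    est = RandomForestClassifier(n_estimators=25, max_depth=10, n_jobs=-1)
    start = time.perf_counter()
    est.fit(X, y)
    timings[n_samples] = time.perf_counter() - start

for n, t in timings.items():
    print(f"{n:>6} rows: {t:.2f} s")
```

Plotting these timings against the cuML equivalents makes the scaling behavior of each library obvious.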

You have a lot of resources on cuML to get you started with the library, including this great post on speeding up random forests by up to 45x using cuML.

Other RAPIDS libraries of note

cuGraph

cuGraph is a collection of graph algorithms and utility functions developed to run on GPUs. Probably the closest open source CPU equivalent is NetworkX. You can do a lot of analysis with cuGraph, including finding the shortest path between two nodes, running the PageRank algorithm, and measuring the similarity between the neighborhoods of two nodes, all on a single GPU. cuGraph also offers distributed graph analytics (using Dask!) for VMs with more than one GPU, most notably for the PageRank algorithm.
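As a CPU-only illustration of what cuGraph’s PageRank computes, here is a tiny power-iteration PageRank in plain Python; the toy graph and the standard 0.85 damping factor are chosen for the example:

```python
# Toy directed graph as an adjacency list of outgoing edges.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank(graph, damping=0.85, iters=100):
    """Power-iteration PageRank over a dict-of-out-edges graph."""
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in graph}
        for node, out_edges in graph.items():
            # Each node distributes its damped rank evenly to its successors.
            share = damping * ranks[node] / len(out_edges)
            for dest in out_edges:
                new[dest] += share
        ranks = new
    return ranks

ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))  # → ['c', 'a', 'b']
```

cuGraph runs the same computation over data frames of edges, which is what makes it practical on graphs far too large for a loop like this.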

To get started with graph analytics on GPUs, see the cuGraph blogs and presentations.

cuSignal and cuSpatial

cuSignal and cuSpatial are two other libraries offered as part of RAPIDS and are installed in the RAPIDS 0.16 conda environment. cuSignal focuses on signal processing. It’s a great tool for filtering, sampling, applying convolution and deconvolution of signals, or doing spectrum analysis. On the other hand, cuSpatial is great for spatial and trajectory data. It can compute various distances, such as Haversine and Hausdorff, and speed of trajectories, among other things.
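cuSpatial computes the Haversine distance in bulk on the GPU; as a CPU reference for what that distance is, here is a plain-Python version of the formula (Earth radius assumed to be 6371 km):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two (lon, lat) points, in kilometers."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# One degree of longitude at the equator is roughly 111 km.
print(round(haversine_km(0, 0, 1, 0), 1))  # → 111.2
```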


I hope that I gave you a good overview of the RAPIDS 0.16 conda environment available in Oracle Cloud Infrastructure Data Science notebook sessions. Go ahead and try it!

I also invite you to participate in our upcoming Oracle Developer Live: AI and ML for Your Enterprise event on January 26, 28, and February 2, 2021. I’m giving a workshop entitled “Hands-On Lab: GPU Accelerated Data Science with Nvidia RAPIDS” on January 26, during which you can use RAPIDS to train ML models on GPUs in OCI Data Science. Register today! 

RAPIDS is an open source project and is open to contributions. Join the developer community.

Keep in touch with Oracle Cloud Infrastructure Data Science!
