This is the third in a series of posts on executing notebooks in OCI Data Science. The previous two posts relied on OCI Functions and customized containers; this one illustrates how you can use the conda environments provided by OCI Data Science to do the job the easy way. We will use one of those environments, the TensorFlow environment, which has all of the packages we need to convert the notebook and execute it (see here for the packages in the environment). In this post we will look at the architecture for orchestrating and scheduling OCI Data Science notebooks as part of other data pipelines in OCI Data Integration; as mentioned, this is a follow-on with an alternative approach to what was discussed here. The following diagram illustrates the architecture. It leverages the Oracle Accelerated Data Science (ADS) SDK (see here), a Python SDK for the many Data Science features. The SDK is used from a Python script, running in a conda environment available from Data Science, that triggers the notebook in OCI Data Science. The REST task in OCI Data Integration executes the Data Science job, then polls the OCI Data Science job run until it completes using the GetJobRun API.

Architecture for using Conda environments

Let’s see how this is actually done.

Using the Accelerated Data Science SDK

The supporting script is shown below; you will need to download it.

We use Python to invoke the notebook. The notebook URL is passed into the script using the environment variable NOTEBOOK_URL; when the Data Science job is created, this must be passed along with the other environment variables that specify the Data Science conda environment.

demo.py

import os

# The notebook to execute is passed in via the NOTEBOOK_URL environment variable
notebook_url = os.getenv('NOTEBOOK_URL')

# Derive the notebook file name and the name of the converted Python script
filename = os.path.basename(notebook_url)
pyfilename = os.path.splitext(filename)[0]

# Download the notebook, convert it to a Python script, then run it
os.system("wget " + notebook_url + " && jupyter nbconvert --to python " + filename + " && python3 " + pyfilename + ".py")
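If you want the job run to be marked as failed when any of the steps fails, a slightly more defensive variant is possible. The sketch below is only an assumption of how you might harden demo.py using the standard library subprocess module in place of os.system; it is not part of the original script.

import os
import subprocess

# Sketch only: each step raises if it fails, so the Data Science job run fails rather than silently succeeding
notebook_url = os.getenv("NOTEBOOK_URL")
if not notebook_url:
    raise ValueError("NOTEBOOK_URL environment variable is not set")

filename = os.path.basename(notebook_url)
pyfilename = os.path.splitext(filename)[0]

subprocess.run(["wget", notebook_url], check=True)                                 # download the notebook
subprocess.run(["jupyter", "nbconvert", "--to", "python", filename], check=True)   # convert it to a .py script
subprocess.run(["python3", pyfilename + ".py"], check=True)                        # execute the converted script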

Permissions

Example permissions for testing OCI Data Science and invoking it from OCI Data Integration.

Resource principal policies for testing from a Data Integration workspace, for example (replace with your information):

  • allow any-user to manage data-science-family in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace', request.principal.id='YOURWORKSPACEID'}
  • allow any-user to manage object-family in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace', request.principal.id='YOURWORKSPACEID'}
  • allow any-user to manage log-groups in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace', request.principal.id='YOURWORKSPACEID'}
  • allow any-user to manage log-content in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace', request.principal.id='YOURWORKSPACEID'}

Orchestrating in OCI Data Integration

Create the Data Science job using the environment variable values (a full description of the CONDA_* environment variables is here);

This is done when creating the job;

Create job using environment variables

Then simply add your Python script to the job;

Add script to job

Set up the rest of the job definition as you would otherwise. Once the job is created, we will use its OCI Data Science job id when we run the job from within OCI Data Integration; it is a parameter in the REST task.
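The console steps above can also be done programmatically with the ADS SDK mentioned earlier. The following is only a sketch of what that could look like; the OCIDs, shape, conda slug, and notebook URL are placeholders you would replace with your own values, and with_service_conda takes care of the CONDA_* environment variables for you.

import ads
from ads.jobs import Job, DataScienceJob, ScriptRuntime

ads.set_auth("api_key")  # or "resource_principal", depending on where this runs

# Infrastructure: where the job runs (all OCIDs and the shape are placeholders)
infrastructure = (
    DataScienceJob()
    .with_compartment_id("ocid1.compartment.oc1..YOURCOMPARTMENT")
    .with_project_id("ocid1.datascienceproject.oc1..YOURPROJECT")
    .with_shape_name("VM.Standard2.1")
    .with_log_group_id("ocid1.loggroup.oc1..YOURLOGGROUP")
)

# Runtime: the demo.py script, a service conda environment (slug is a placeholder),
# and the NOTEBOOK_URL environment variable the script expects
runtime = (
    ScriptRuntime()
    .with_source("demo.py")
    .with_service_conda("tensorflow27_p37_cpu_v1")
    .with_environment_variable(NOTEBOOK_URL="https://your-host/path/to/notebook.ipynb")
)

job = Job(name="notebook-runner", infrastructure=infrastructure, runtime=runtime).create()
print(job.id)  # this job OCID is what the REST task in OCI Data Integration will reference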

See the post here for creating REST tasks from a sample collection (Invoking Data Science via REST Tasks) [https://blogs.oracle.com/dataintegration/post/oci-rest-task-collection-for-oci-data-integration]; the REST task calls the OCI Data Science API to create the job run and then polls the Data Science GetJobRun API until the run completes;
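To make the create-then-poll pattern concrete, here is a minimal Python sketch of the equivalent calls using the OCI Python SDK (the REST task itself issues these as REST calls). The OCIDs and the polling interval are assumptions; replace them with your own values.

import time
import oci

# Assumes an API key configuration in ~/.oci/config; all OCIDs below are placeholders
config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

details = oci.data_science.models.CreateJobRunDetails(
    project_id="ocid1.datascienceproject.oc1..YOURPROJECT",
    compartment_id="ocid1.compartment.oc1..YOURCOMPARTMENT",
    job_id="ocid1.datasciencejob.oc1..YOURJOB",
    display_name="notebook-run",
)

# Create the job run (the REST task's first call)
job_run = ds_client.create_job_run(details).data

# Poll GetJobRun until the run reaches a terminal state (the REST task's polling condition)
while True:
    job_run = ds_client.get_job_run(job_run.id).data
    if job_run.lifecycle_state in ("SUCCEEDED", "FAILED", "CANCELED", "DELETED"):
        break
    time.sleep(30)

print(job_run.lifecycle_state)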

REST task

The notebook execution can be orchestrated and scheduled from within OCI Data Integration. Use the REST task to execute the notebook; you can get this task using the Postman collection from here.

You can use this task in a data pipeline, run multiple notebooks in parallel, and add additional tasks before and after;

Pipeline invoking multiple notebooks.

You can schedule this task to run on a recurring basis or execute this via any of the supported SDKs;

Schedule a notebook execution

That’s a high-level summary of the capabilities; see the documentation links in the conclusion for more detailed information. As you can see, we can leverage OCI Data Science to trigger the notebook execution and monitor it from within OCI Data Integration.

Want to Know More?

For more information, review the Oracle Cloud Infrastructure Data Integration documentation, associated tutorials, and the Oracle Cloud Infrastructure Data Integration blogs.

Organizations are embarking on their next generation analytics journey with data lakes, autonomous databases, and advanced analytics with artificial intelligence and machine learning in the cloud. For this journey to succeed, they need to quickly and easily ingest, prepare, transform, and load their data into Oracle Cloud Infrastructure, and schedule and orchestrate many other types of tasks, including Data Science jobs. Oracle Cloud Infrastructure Data Integration’s journey is just beginning! Try it out today!