In this post we will look at the architecture for orchestrating and scheduling OCI Data Science notebooks as part of other data pipelines in OCI Data Integration. The following diagram illustrates this architecture. It leverages the Oracle Accelerated Data Science (ADS) SDK, a Python SDK for many of the Data Science features. The SDK is used from a function that triggers the notebook in OCI Data Science; the REST task in OCI Data Integration executes the function and then polls the OCI Data Science job run for completion using the GetJobRun API.

Architecture diagram: notebooks sourced from GitHub, OCI Object Storage, Google Cloud Storage, or Microsoft Azure Blob Storage; within an OCI region, OCI Data Integration (Developer Services) invokes the function and the OCI Data Science job in a VCN subnet, with Object Storage buckets for notebook output and Identity & Security policies for access.

Let’s see how this is actually done.

Using the Accelerated Data Science SDK

The function and supporting scripts are below (the source is at https://github.com/davidallan/oci_fn_execute_notebook). You will need to download these and deploy the function to OCI.

We use OCI Functions to invoke the notebook; the func.yaml file specifies Python as the function's runtime.

func.yaml
schema_version: 20180708
name: notebook
version: 0.0.14
runtime: python
build_image: fnproject/python:3.9-dev
run_image: fnproject/python:3.9
entrypoint: /python/bin/fdk /function/func.py handler
memory: 256
timeout: 300

The requirements.txt file depends on the Functions development kit (fdk) and the Data Science ADS SDK (oracle-ads):

requirements.txt
fdk
oracle-ads

The Python script is below; it has the notebook parameterized along with the OCI Data Science project and other information. Various properties are parameterized for demonstration purposes; more could be (such as shape and subnet), but for a publicly available ipynb this is all that is needed.

The function is based on the TensorFlow example in the OCI Accelerated Data Science documentation (https://accelerated-data-science.readthedocs.io/en/latest/user_guide/jobs/run_notebook.html#tensorflow-example). It uses a small shape for demonstration; with OCI you can run notebooks and Data Science jobs on all kinds of shapes, including GPUs!

func.py
import io
import json
import logging
from fdk import response
from ads.jobs import Job, DataScienceJob, NotebookRuntime
import oci
import ads
 
def handler(ctx, data: io.BytesIO=None):
  logging.getLogger().info("Inside notebook execution function")
  body = json.loads(data.getvalue())
  noteBook = body.get("noteBook")
  jobName = body.get("jobName")
  logGroupId = body.get("logGroupId")
  projectId = body.get("projectId")
  compartmentId = body.get("compartmentId")
  outputFolder = body.get("outputFolder")
 
  if (noteBook is None or logGroupId is None or projectId is None or compartmentId is None or outputFolder is None):
      resp_data = {"status": "400", "info": "Required parameters have not been supplied - noteBook, logGroupId, projectId, compartmentId, outputFolder need to be supplied"}
      return response.Response(
            ctx, response_data=json.dumps(resp_data), headers={"Content-Type": "application/json"}
      )
 
  ads.set_auth(auth='resource_principal')
 
  job = (
    Job(name=jobName)
    .with_infrastructure(
        DataScienceJob()
        # Configure logging for getting the job run outputs.
        .with_log_group_id(logGroupId)
        # Log resource will be auto-generated if log ID is not specified.
        #.with_log_id("<log_ocid>")
        # If you are in an OCI data science notebook session,
        # the following configurations are not required.
        # Configurations from the notebook session will be used as defaults.
        .with_compartment_id(compartmentId)
        .with_project_id(projectId)
        #.with_subnet_id("<subnet_ocid>")
        .with_shape_name("VM.Standard.E3.Flex")
        # Shape config details are applicable only for the flexible shapes.
        .with_shape_config_details(memory_in_gbs=16, ocpus=1)
        # Minimum/Default block storage size is 50 (GB).
        .with_block_storage_size(50)
    )
    .with_runtime(
        NotebookRuntime()
        .with_notebook(
            path=noteBook,
            encoding='utf-8'
        )
        .with_service_conda("tensorflow28_p38_cpu_v1")
        .with_environment_variable(GREETINGS="Welcome to OCI Data Science")
        .with_exclude_tag(["ignore", "remove"])
        .with_output(outputFolder)
    )
  )
 
  try:
    # Create the job on OCI Data Science
    job.create()
    # Start a job run
    run = job.run()
 
    # Return the job run id (id) and the job id (jobId).
    returnResponse = {"id": run.id, "jobId": job.id}

    return response.Response(
        ctx, response_data=json.dumps(returnResponse), headers={"Content-Type": "application/json"}
    )
  except oci.exceptions.ServiceError as inst:
      return response.Response(ctx, response_data=json.dumps({"error": str(inst)}), headers={"Content-Type": "application/json"})

Permissions

Example permissions for testing from OCI Functions and from OCI Data Integration are below.

Resource principal policies for testing from OCI Functions, for example (replace with your information):

  • allow any-user to manage data-science-family in compartment YOURCOMPARTMENT where ALL {request.principal.type='fnfunc'}
  • allow any-user to manage object-family in compartment YOURCOMPARTMENT where ALL {request.principal.type='fnfunc'}
  • allow any-user to manage log-groups in compartment YOURCOMPARTMENT where ALL {request.principal.type='fnfunc'}
  • allow any-user to manage log-content in compartment YOURCOMPARTMENT where ALL {request.principal.type='fnfunc'}

Resource principal policies for testing from OCI Data Integration workspaces, for example (replace with your information):

  • allow any-user to manage data-science-family in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace',request.principal.id='YOURWORKSPACEID'}
  • allow any-user to manage object-family in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace',request.principal.id='YOURWORKSPACEID'}
  • allow any-user to manage log-groups in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace',request.principal.id='YOURWORKSPACEID'}
  • allow any-user to manage log-content in compartment YOURCOMPARTMENT where ALL {request.principal.type='disworkspace',request.principal.id='YOURWORKSPACEID'}

Function Deployment

Follow the regular function deployment pattern. I will not go through this here; the tutorial below on OCI Functions is useful to go through:

OCI Functions Quickstart: https://docs.oracle.com/en-us/iaas/Content/Functions/Tasks/functionsquickstartguidestop.htm

There is also a very convenient way to create functions from the OCI Console's Cloud Editor; the function can be created directly from the git URL https://github.com/davidallan/oci_fn_execute_notebook and deployed.
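For reference, assuming you have already created an OCI Functions application named yourapp (the name used in the invocation example below), deploying the downloaded function from its directory is a single Fn CLI command:

fn -v deploy --app yourapp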

Sample Execution

You should first test the function from the command line to ensure it has been created and deployed correctly:

echo '{"jobName":"My Job", "logGroupId":"ocid1.loggroup.oc1.iad....", "compartmentId":"ocid1.compartment.oc1.....", "projectId":"ocid1.datascienceproject.oc1.iad.....",  "noteBook":"https://raw.githubusercontent.com/tensorflow/docs/master/site/en/tutorials/customization/basics.ipynb", "outputFolder":"oci://bucket@namespace/helloworld"}' | fn invoke yourapp notebook

Orchestrating in OCI Data Integration

See the post on creating REST tasks from a sample collection (https://blogs.oracle.com/dataintegration/post/oci-rest-task-collection-for-oci-data-integration); the REST task calls the OCI function and then polls the Data Science GetJobRun API.
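Outside of OCI Data Integration you can reproduce the same polling pattern with the OCI Python SDK. The following is a minimal sketch, assuming a local OCI config file for authentication and using the id value returned by the function as a placeholder:

import time

import oci

# Assumes a local OCI config file; the REST task itself authenticates
# via OCI Data Integration rather than this client.
config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

# The "id" value returned by the function (hypothetical placeholder).
job_run_id = "ocid1.datasciencejobrun.oc1.iad....."

# Poll GetJobRun until the run reaches a terminal lifecycle state.
while True:
    job_run = ds_client.get_job_run(job_run_id).data
    if job_run.lifecycle_state in ("SUCCEEDED", "FAILED", "CANCELED"):
        break
    time.sleep(30)

print("Job run finished with state:", job_run.lifecycle_state)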

The notebook execution can be orchestrated and scheduled from within OCI Data Integration. Use the REST task to execute the notebook. Here is a snippet of the REST task:

REST Task

You can get this task using the Postman collection from the post referenced above. Below you can see the task being executed and the notebook specified along with other arguments:

Run a task

You can use this task in a data pipeline, run multiple notebooks in parallel, and add additional tasks before and after:

Pipeline invoking multiple notebooks.

You can schedule this task to run on a recurring basis, or execute it via any of the supported SDKs (see the sketch after the image below):

Schedule a notebook execution
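For illustration, here is a minimal sketch of starting the REST task run with the OCI Python SDK; the workspace OCID, application key, and task key are hypothetical placeholders you would replace with values from your own workspace:

import oci

# Assumes a local OCI config file for authentication.
config = oci.config.from_file()
di_client = oci.data_integration.DataIntegrationClient(config)

# Hypothetical identifiers - replace with your workspace, application and task keys.
workspace_id = "ocid1.disworkspace.oc1.iad....."
application_key = "YOUR_APPLICATION_KEY"
task_key = "YOUR_REST_TASK_KEY"

# Trigger a run of the published REST task.
task_run = di_client.create_task_run(
    workspace_id=workspace_id,
    application_key=application_key,
    create_task_run_details=oci.data_integration.models.CreateTaskRunDetails(
        registry_metadata=oci.data_integration.models.RegistryMetadata(
            aggregator_key=task_key
        )
    ),
).data

print("Task run status:", task_run.status)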

That's a high-level summary of the capabilities; see the documentation links in the conclusion for more detailed information. As you can see, we can leverage OCI Functions to trigger the notebook execution and monitor it from within OCI Data Integration.

Want to Know More?

For more information, review the Oracle Cloud Infrastructure Data Integration documentation, associated tutorials, and the Oracle Cloud Infrastructure Data Integration blogs.

Organizations are embarking on their next generation analytics journey with data lakes, autonomous databases, and advanced analytics with artificial intelligence and machine learning in the cloud. For this journey to succeed, they need to quickly and easily ingest, prepare, transform, and load their data into Oracle Cloud Infrastructure, and schedule and orchestrate many other types of tasks including Data Science jobs. Oracle Cloud Infrastructure Data Integration's journey is just beginning! Try it out today!