

Execute a Python Process in the Oracle Cloud Infrastructure Data Science Notebook Session Environment

By Jean-Rene Gauthier, Sr. Principal Product Data Scientist

As you work in the notebook session environment of the Oracle Cloud Infrastructure Data Science service, you may want to launch Python processes outside of the notebook kernel. Running these Python jobs alongside your notebooks can be quite useful. Perhaps you want to train multiple models in the background, quickly iterate on a few permutations of a data transformation, or run jobs on a schedule without leaving your notebook session.

In this post, I will show you an easy way to execute ad hoc Python processes without leaving the notebook session environment (learn about the latest release of this environment here). I will also show you how to monitor those Python processes and how to run them on a schedule.

Although notebooks are great for interactive workloads, you sometimes want to run a Python job or script outside of code cells. Alternatively, you may have a long-running cell in which you train a model, and you do not want it to block the execution of other code cells. Or perhaps you want to train multiple models in parallel, but launching copies of the same notebook is cumbersome and difficult to manage.

The notebook interface was not really designed to handle long-running processes. Here, I provide a few tricks to launch separate Python processes (aka "jobs") in your notebook session environment without leaving the JupyterLab interface.


 

Creating a Python script from a notebook file

The first step is to create a Python script from a notebook file that you will later execute as a separate Python process. I explore a couple of options that are available to you in JupyterLab.

 

Using the writefile cell magic command

The first option is to use the %%writefile cell magic command. Simply write a Python script in a code cell, and pass the name of the script you want to create to %%writefile. If you are interested in learning more about Jupyter magic commands and what they can do, I recommend going over the IPython documentation. Magic commands (to name a few: %lsmagic, %time, %%bash, %%python) are quite useful. Try them in the notebook session environment!
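For instance, you can list the available magics and time a single statement directly in a code cell (a quick, generic illustration):


# list all available line and cell magics
%lsmagic

# time a single Python statement
%time sum(range(10_000_000))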

First, create a notebook named my-notebook.ipynb and copy the code block below into the first cell of your notebook. Executing this cell:

%%writefile my-file.py

import os
import sys
from time import sleep

def main():
    sleep(5)
    print("hello")

if __name__ == '__main__':
    main()

creates a Python script called my-file.py with the following content:


import os
import sys
from time import sleep

def main():
    sleep(5)
    print("hello")

if __name__ == '__main__':
    main()

 

Using nbconvert to convert your notebook to a Python script

Another way to create a Python script from a notebook file is to convert the Jupyter notebook (.ipynb) to a Python (.py) script. The Jupyter nbconvert command comes in handy.

Remove the %%writefile cell magic from your notebook, go to the JupyterLab terminal, and execute the following command:


(base) bash-4.2$ jupyter nbconvert --to python my-notebook.ipynb
 

You should see a new Python file, my-notebook.py, in your home directory. This command creates a Python script by concatenating all the code cells into a single file. It is quite useful if you have long notebooks.

Note that py is only one of the formats supported by nbconvert. You can also try html or pdf, for example.

 

Executing a Python script as a separate Python process

You can execute the script in the JupyterLab terminal window. Simply run:


(base) bash-4.2$ python my-file.py
 

or execute it as a bash command by putting the %%bash cell magic on the first line of a code cell in your notebook:


%%bash
python my-file.py

 

You can also use the %%python cell magic to run the contents of a code cell as a separate Python subprocess.
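For example, the body of this cell runs in a new Python interpreter rather than in the notebook kernel (a minimal illustration):


%%python
# this cell body is executed by a separate python subprocess,
# so variables defined here are not visible to the notebook kernel
import os
print("running in process", os.getpid())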

The last two options (%%bash, %%python) block the execution of additional cells in the notebook until the code in that cell has finished running.

That is one limitation we will now overcome. What we really want is to run that Python script alongside other notebooks. I recommend nohup as a good terminal command to execute the script. The nohup command prevents a hangup signal (SIGHUP) from being sent to the running process when, for example, you close the terminal window or log out of your ssh session. A good option is to run the nohup command in the background by adding & at the end of the command. For example, go to the terminal window and execute the script you just created:


(base) bash-4.2$ nohup python my-file.py &
 

In addition, you may want to redirect the I/O streams to files. For example, you can redirect stdout to a log file:


(base) bash-4.2$ nohup python my-file.py > my-stdout-log &
 

If you open the file my-stdout-log, you will see "hello" in it. You can also redirect stderr to a different log file:


(base) bash-4.2$ nohup python my-file.py > my-stdout-log 2> my-stderr-log &
 

or you can redirect both stdout and stderr to the same file:


(base) bash-4.2$ nohup python my-file.py > my-stdouterr-log 2>&1 &
 

If you take a look at the file my-stdouterr-log during execution, you will notice that the output of your print() statement is only written to the file when the script finishes. You can avoid output buffering by adding the -u option to the Python interpreter, which forces stdout and stderr to be unbuffered. Disabling buffering lets you monitor the status of your model training job if you print or log progress diagnostics. In summary, this is how I would execute a long-running process in the notebook session:


(base) bash-4.2$ nohup python -u my-file.py > my-stdouterr-log 2>&1 &
 

Try it in your notebook session. For example, wrap the print() and sleep() calls in a for loop to see new lines being appended to the log file during execution (a sketch follows).
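Here is one way my-file.py could look with that loop added (a minimal sketch; the iteration count and sleep interval are arbitrary):


from time import sleep

def main():
    # with python -u, each progress line is flushed to the redirected
    # log file as soon as it is printed
    for i in range(10):
        print("iteration", i)
        sleep(5)
    print("hello")

if __name__ == '__main__':
    main()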

 

Monitoring the Python process

After you launch a process in the background with &, the shell prints its process ID (PID) in the terminal window. You can use the Unix top command to monitor the process:


(base) bash-4.2$ top -p <insert-your-process-id>
 

I like to have side-by-side terminal windows in my notebook session: one window to execute processes, and the other window to monitor the running processes. top shows both the CPU and memory consumption of your Python process.

Side-by-side terminal windows to launch Python processes and monitor their status.
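If you would rather monitor the process from a notebook cell instead of top, the psutil package (assuming it is available in your conda environment) exposes similar information:


import psutil

# replace with the process ID printed when you launched the job with nohup ... &
pid = 12345

proc = psutil.Process(pid)
print("status:", proc.status())
print("cpu %:", proc.cpu_percent(interval=1))              # CPU usage sampled over 1 second
print("memory (MB):", proc.memory_info().rss / 1024 ** 2)  # resident set size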

 

Running the Python process on a schedule

Unfortunately, we cannot use cron to schedule a process/job in the notebook session environment, since we do not have access to cron on the VM. However, we can use the Python library schedule to schedule the execution of Python processes. The schedule library is pre-installed in the notebook session environment, and the schedule itself is defined in your Python code. Here is what my-file.py looks like with schedule:


%%writefile my-file.py

import os
import sys
import schedule
from time import sleep

def hello():
    print("hello")

def my_schedule():
    schedule.every(5).seconds.do(hello)
    #schedule.every().day.at("11:30").do(hello)

def main():
    my_schedule()
    while True:
        schedule.run_pending()
        sleep(1)

if __name__ == '__main__':
    main()

 

The function my_schedule() defines the frequency or time of execution using the schedule library. I moved the actual job definition (in this case, printing hello) to the hello() function, which is the function passed to the .do() method of the schedule object. Simply call nohup to execute the modified script:


(base) bash-4.2$ nohup python -u my-file.py > my-stdouterr-log 2>&1 &
 

Voila! You now have a Python process running on a schedule in your notebook session environment.

 

A few tips

You can go one step further and pass command line arguments to your Python script that define the frequency of execution. That way, you can set the schedule directly from the command line and run the process at different frequencies without hardcoding values in your script (see the sketch below).
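For example, you could expose the interval as a command line argument with argparse (a sketch; the --every-seconds argument name is my own choice):


import argparse
import schedule
from time import sleep

def hello():
    print("hello")

def main():
    parser = argparse.ArgumentParser()
    # number of seconds between executions of the job
    parser.add_argument("--every-seconds", type=int, default=5)
    args = parser.parse_args()

    schedule.every(args.every_seconds).seconds.do(hello)
    while True:
        schedule.run_pending()
        sleep(1)

if __name__ == '__main__':
    main()


You could then launch it with nohup python -u my-file.py --every-seconds 60 > my-stdouterr-log 2>&1 & to run the job once per minute.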

In my example, the job is simple: I just print hello. However, the job can be quite complex, involving data transformation, feature engineering, and model training. I recommend that you define the training job in a separate module and call it from a simple scheduler script like the one above. That way, you can separate the "scheduler" script from the module where the machine learning job is defined (a sketch follows).
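A minimal sketch of that separation, assuming your machine learning code lives in a hypothetical train.py module that exposes a run_training() function:


# scheduler.py -- only decides when the job runs
import schedule
from time import sleep

# hypothetical module containing the data transformation, feature
# engineering, and model training logic
from train import run_training

def main():
    schedule.every().day.at("11:30").do(run_training)
    while True:
        schedule.run_pending()
        sleep(1)

if __name__ == '__main__':
    main()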

 

In summary

In this post, I described how a data scientist can run a Python process in the Oracle Cloud Infrastructure Data Science service notebook session environment. I also showed how the process can be run in the background, how you can capture the stdout and stderr logs without buffering, how you can monitor the process from the terminal window, and finally, how you can run the job using the schedule library offered in the notebook session environment. This workflow provides a lightweight Python job execution framework that can be run directly in the notebook session.

 

Keep in touch!

-    Visit our service documentation

-    (Oracle Internal) Our slack channel #oci_datascience_users

-    Our YouTube Playlist

-    Our Qloudable Hands-on Lab

 

To learn more about how Oracle Data Science can benefit your business, visit the Oracle Data Science page, and follow us on Twitter @OracleDataSci.

 
