March 2, 2021

Using Oracle Machine Learning for Python - embedded Python execution

Mark Hornick
Senior Director, Data Science and Machine Learning

In this post, we highlight OML4Py Embedded Python Execution, which enables data scientists and other Python users to run user-defined Python functions in Python engines spawned and managed by the database environment.

User-defined functions can be run in a single Python engine, or in a data-parallel or task-parallel manner using multiple Python engines, for example, to score native Python models at scale. Results from these user-defined Python functions can include both structured content and PNG images, and can be accessed via Python and REST APIs.

User-defined Python functions can be managed as scripts in the database, using the Python script repository. Python objects can also be stored in the database – as opposed to being managed in flat files. These features facilitate collaboration across the data science team – enabling convenient hand-off of data science work products from data scientists to application developers for immediate deployment.

So let's take a look at embedded Python execution in more detail. We first define a user-defined function that builds a linear model, run it directly in Python, and then run it in the Autonomous Database environment.

In an earlier post, we loaded the iris data and combined target and predictors into a single DataFrame, which was stored in the database table IRIS. Here, we simply access this table using oml.sync to get an oml.DataFrame proxy object for the database table.

import oml
import pandas as pd

# Assumes the OML4Py connection to the database is already established
# (automatic in OML Notebooks, or via oml.connect() on-premises)
IRIS = oml.sync(table="IRIS")
IRIS.head(4)

   SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH  PETAL_WIDTH Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
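
Since IRIS is a proxy object, the data itself remains in the database. For the local test of our user-defined function below, we can pull the table into an in-memory pandas DataFrame, equivalent to the DataFrame assembled in the earlier post:

iris_df = IRIS.pull()   # materialize the IRIS table as a local pandas DataFrame
type(iris_df)           # pandas.core.frame.DataFrame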

Define your function

To illustrate embedded Python execution, first define a function, build_lm_1, that returns a linear model as the result. We go further and also score the training data and generate a plot of the model's predictions against the actual target values. Note that embedded Python execution can return images as well as structured content. We run this function locally to ensure it returns what we expect: in this case, both the image and the model.

def build_lm_1(dat):
    import oml
    from sklearn import linear_model
    import matplotlib.pyplot as plt

    lm = linear_model.LinearRegression()
    X = dat[["PETAL_WIDTH"]]
    y = dat[["PETAL_LENGTH"]]
    mod = lm.fit(X, y)
    
    pred = mod.predict(dat[["PETAL_WIDTH"]])
    plt.scatter(dat.loc[:,"PETAL_WIDTH"], dat.loc[:,"PETAL_LENGTH"])
    plt.plot(dat[["PETAL_WIDTH"]], pred, color='blue', linewidth=3)
    plt.xticks(()) # Disable ticks
    plt.yticks(())
    plt.show()
    return mod

build_lm_1(iris_df)   # run the user-defined function locally on the pulled pandas DataFrame

LinearRegression()

Invoke using table_apply

Next, we invoke this function using the embedded Python execution function table_apply. In this example, the table_apply function takes the proxy object IRIS as input, loads the data from the corresponding database table, and passes it to the user-defined function as a pandas DataFrame in the first argument. The user-defined function can be passed as a Python function object. (Note that you can also pass in the function definition as a string.) We see that the model is returned in the variable mod and the image is displayed automatically.

mod = oml.table_apply(data=IRIS, func = build_lm_1)

[The scatter plot of PETAL_WIDTH versus PETAL_LENGTH with the fitted regression line is displayed here.]

print("Model:",mod)
print("Type:",type(mod))
print("Coefficient", mod.coef_)
Model: LinearRegression()
Type: <class 'sklearn.linear_model._base.LinearRegression'>
Coefficient [[2.2299405]]

Use the Python script repository

Next, we save our user-defined function in the Python script repository by providing its definition as a string, and then invoke it on the same table by providing the function name as a string.

build_lm_1_str = """def build_lm_1(dat):
    import oml
    from sklearn import linear_model
    import matplotlib.pyplot as plt

    lm = linear_model.LinearRegression()
    X = dat[["PETAL_WIDTH"]]
    y = dat[["PETAL_LENGTH"]]
    mod = lm.fit(X, y)
    
    pred = mod.predict(dat[["PETAL_WIDTH"]])
    plt.scatter(dat.loc[:,"PETAL_WIDTH"], dat.loc[:,"PETAL_LENGTH"])
    plt.plot(dat[["PETAL_WIDTH"]], pred, color='blue', linewidth=3)
    plt.xticks(()) # Disable ticks
    plt.yticks(())
    plt.show()
    return mod"""
oml.script.create("build_lm_1", func=build_lm_1_str, overwrite = True) 
oml.script.dir() 
         name                                             script description                date
0  build_lm_1  def build_lm_1(dat):\n    import oml\n    from...        None 2021-03-02 01:08:19 

In the table_apply invocation, notice that the value of the "func" argument is the name of the function we saved in the script repository. The results, however, are the same.

mod = oml.table_apply(data=IRIS, func = 'build_lm_1')
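
As noted earlier, the func argument can alternatively be given the full function definition as a string rather than a name in the script repository. A minimal sketch reusing the string defined above; we also assume the oml.script.drop function for removing a saved script once it is no longer needed:

mod2 = oml.table_apply(data=IRIS, func=build_lm_1_str)  # pass the function definition string directly
oml.script.drop("build_lm_1")                           # optionally remove the script from the repository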

Use row_apply for parallel invocation

The row_apply function can be used to invoke a user-defined function on chunks of rows, which enables, for example, scoring native Python models in parallel. Here, the user-defined function score_lm_1 takes a pandas DataFrame and a linear model. The row_apply invocation below passes the proxy object IRIS as input; rows=10, so 10 rows are scored per invocation (resulting in 15 invocations over the 150 rows); the user-defined function; the linear model object, which is forwarded to the function's model argument; parallel=5, requesting up to 5 Python engines to process the 15 invocations; and func_value, a DataFrame describing the structure of the single table in which the results are returned.

def score_lm_1(dat, model):
    import pandas as pd
    from sklearn import linear_model
    # Score the chunk of rows and return the actual and predicted petal lengths
    pred = model.predict(dat[["PETAL_WIDTH"]])
    return pd.concat([dat[['Species', 'PETAL_LENGTH']], 
                      pd.DataFrame(pred, columns=['PRED_PETAL_LENGTH'])], axis=1)

res = oml.row_apply(IRIS, rows=10, func=score_lm_1, 
                    model=mod, parallel=5,
                    func_value=pd.DataFrame([('a', 1, 1)], 
                                            columns=['Species', 'PETAL_LENGTH', 'PRED_PETAL_LENGTH']))
res.head()
  Species  PETAL_LENGTH  PRED_PETAL_LENGTH
0  setosa           1.4           1.535749
1  setosa           1.4           1.535749
2  setosa           1.3           1.535749
3  setosa           1.5           1.535749
4  setosa           1.4           1.535749

For illustrative purposes, you can also score with the group_apply function, which partitions the data according to one or more columns, again passing the model as an argument. In this example, we specify the index argument with the Species column.

res = oml.group_apply(IRIS, index=IRIS[['Species']], func=score_lm_1, model=mod, parallel=2,
                    func_value=pd.DataFrame([('a', 1, 1)], 
                                            columns=['Species', 'PETAL_LENGTH', 'PRED_PETAL_LENGTH']))

res.head()
  Species  PETAL_LENGTH  PRED_PETAL_LENGTH
0  setosa           1.4           1.535749
1  setosa           1.4           1.535749
2  setosa           1.3           1.535749
3  setosa           1.5           1.535749
4  setosa           1.4           1.535749

Build one model per Species using Group Apply

A better use of group_apply, which automatically splits the table data based on values in one or more columns, is model building. The user-defined function is invoked on each group, or partition, of the data. Here, we build three models, one specific to each species, and return them as a dictionary.

mod = oml.group_apply(IRIS[:,["PETAL_LENGTH","PETAL_WIDTH","Species"]], 
                      index=oml.DataFrame(IRIS['Species']), 
                      func=build_lm_1, parallel=2,
                      oml_connect = True)
print("Type:",type(mod))
mod
Type: <class 'dict'>

{'setosa': LinearRegression(),
 'versicolor': LinearRegression(),
 'virginica': LinearRegression()}
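
Because the returned dictionary holds ordinary scikit-learn model objects on the client, they can be used directly. A minimal sketch, predicting petal length for an illustrative petal width value:

import pandas as pd

setosa_mod = mod['setosa']
setosa_mod.predict(pd.DataFrame({'PETAL_WIDTH': [0.3]}))  # predict PETAL_LENGTH for a petal width of 0.3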

Instead of returning the models in the result, we may want to persist them in the database using a datastore. To illustrate, we change the user-defined function to save each model in a datastore and return the assigned object name instead. The datastore allows storing Python objects, as well as OML4Py proxy objects, in the database under the provided datastore name. Each model is saved under an object name composed of the prefix "mod_" and the corresponding Species value.

def build_lm_2(dat, dsname):
    import oml
    from sklearn import linear_model
    lm = linear_model.LinearRegression()
    X = dat[["PETAL_WIDTH"]]
    y = dat[["PETAL_LENGTH"]]
    lm.fit(X, y)
    # Save the model in the datastore under a species-specific name, e.g., "mod_setosa"
    name = "mod_" + dat.loc[dat.index[0],'Species']
    oml.ds.save(objs = {name: lm}, name=dsname, 
                append=True) 
    return name

If the datastore exists, we delete it so that the group_apply function can use the same datastore to store all the objects. The group_apply invocation below passes the proxy object representing the input data; the index parameter, which specifies the column (or columns) to partition on; the user-defined function; the datastore name in dsname, which is an argument of our user-defined function; and oml_connect=True, so that each Python engine automatically connects to the database. This connection is necessary when using the datastore functionality, since the objects are written back to the database. We then print the outcome, a dictionary of three elements, each containing the assigned model object name.

try:
    oml.ds.delete('ds-1')       # remove the datastore if it already exists
except Exception:
    print("Datastore not found")
res = oml.group_apply(IRIS[:,["PETAL_LENGTH","PETAL_WIDTH","Species"]], 
                      index=oml.DataFrame(IRIS['Species']), 
                      func=build_lm_2, dsname="ds-1",
                      oml_connect = True)
print("Outcome:",res)
Outcome: {'setosa': 'mod_setosa', 'versicolor': 'mod_versicolor', 'virginica': 'mod_virginica'}

Next, we load the datastore contents into the client Python engine; each model is assigned to a variable matching its object name. We can then view the model for "versicolor".

print("Datastore objects:",oml.ds.load("ds-1"))
print("Versicolor model:",mod_versicolor)
Datastore objects: ['mod_setosa', 'mod_versicolor', 'mod_virginica']
Versicolor model: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
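
Datastores can also be listed and their objects loaded selectively. A minimal sketch, assuming the oml.ds.dir function and the objs and to_globals parameters of oml.ds.load; check the OML4Py documentation for the exact signatures:

oml.ds.dir()                                                           # list available datastores
objs = oml.ds.load("ds-1", objs=["mod_versicolor"], to_globals=False)  # load a single object into a dict
objs['mod_versicolor']                                                 # the versicolor model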

This gives you a taste of what's possible with the Python interface to OML4Py embedded Python execution. You can also invoke user-defined functions using a REST API for easy deployment in applications. We'll cover this feature in our next post, and then look at automated machine learning (AutoML).
