
Learn about Oracle Machine Learning for Oracle Database and Big Data, on-premises and Oracle Cloud

  • May 6, 2021

Cross Validation Using OML4Py Part I

Jie Liu
Data Scientist

Why Cross Validation

Cross validation is a widely used model validation approach. After a machine learning model is built, it is essential to measure its performance before deployment. Cross validation provides a sound and reliable way to understand model performance from the perspective of both bias and variance.
In many cases, to get a quick sense of model performance, you divide the dataset into train and test sets once, then compute one or more model quality metrics on the test set. But a measurement obtained this way can be biased. For instance, a single split could leave some patterns the model needs to learn entirely in the test set, unseen during training. The model will then perform poorly, leading us to underestimate its quality. On the other hand, the model is sometimes trained on data that includes sparse features; if, by coincidence, such a feature reappears in the test set, the result can look deceptively good even though the model is overfitted and may not perform as well on new data. This relates to the concept of model variance, which characterizes how the model's predictions vary across different train/test splits. With only one split, we cannot get a full picture of how the model behaves.
The widely accepted way to evaluate model quality while avoiding bias and assessing variance is K-fold cross validation: divide the dataset into K segments of nearly equal size, then train the model on K-1 segments and validate it on the remaining segment. Repeating this train-validation process K times yields K models and K sets of metrics, from which we can compute both the mean and the variance of each metric and so get a clear picture of how the model behaves.
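To make the mechanics concrete, here is a minimal sketch of a K-fold split in plain numpy (illustration only; the in-database KFold function used later performs the split without moving data):

import numpy as np

n, k = 13880, 5
indices = np.random.permutation(n)   # shuffle the row indices once
folds = np.array_split(indices, k)   # k nearly equal segments
for i in range(k):
    test_idx = folds[i]                                  # fold i is the test set
    train_idx = np.concatenate(folds[:i] + folds[i+1:])  # remaining k-1 folds train
    # train on train_idx, score on test_idx, and record the metric for fold i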
The newly released Oracle Machine Learning for Python (OML4Py) API brings benefits that are similar to those in OML4R: transparency layer, in-database algorithms, and embedded Python execution. New in OML4Py is automated machine learning.
Cross validation can be handled conveniently and efficiently using the OML4Py transparency layer. Moreover, by leveraging embedded Python execution, we can parallelize the K train/test processes. We will cover this topic in a two-part blog series: Part I focuses on cross validation for models trained with OML in-database algorithms, and Part II focuses on cross validation for open source models. In this blog, we show our approach to doing cross validation in OML4Py.

Data Overview

We use the customer insurance lifetime value dataset, an Oracle-produced dataset, for our demonstration. The use case involves an insurance company targeting customers likely to buy insurance, based on each customer's lifetime value and demographic and financial features. The following is a glimpse into this dataset with a subset of the columns.
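If the data has been loaded into a database table, we can take that glimpse through the transparency layer without pulling the full dataset to the client. A minimal sketch (the table name CUST_INSUR_LTV is an assumption for illustration):

import oml

# Create an OML DataFrame proxy for the (assumed) table name; no data moves yet
CUST_DF = oml.sync(table = 'CUST_INSUR_LTV')
print(CUST_DF.shape)   # number of rows and columns
CUST_DF.head()         # pulls only the first few rows to the client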

Based on the column names, we can see that the dataset contains customer demographic features such as state, region, gender, and marital status, and financial features like income and credit card limit.
The main business problem is to find out which customers are likely to buy an insurance policy. The ground truth, whether a customer purchased the policy, is given by the column BUY_INSURANCE. This is a typical binary classification problem, and we can use all the feature columns provided in this dataset to build a model.

Metrics for Model Evaluation

To validate the model, we need to choose a metric. The choice depends on the type of machine learning task. For classification problems, we can use metrics like accuracy, AUC (area under the curve), precision, recall, and F1 score; for regression problems, we can choose metrics like mean squared error (MSE), mean absolute error (MAE), and R squared. Our problem is a classification task, so we pick AUC as the metric since it does not depend on a specific threshold of the prediction score and is widely used in data science projects.
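As a quick local illustration of the metric itself (this uses scikit-learn on toy data, not the in-database computation shown later), AUC scores how well the predicted probabilities rank positives above negatives, independent of any threshold:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75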

Cross Validation using the OML4Py Transparency Layer

The OML4Py transparency layer provides the KFold function to split the data into K folds to support cross validation. Although it looks similar to sklearn.model_selection.KFold, the OML4Py KFold function returns the splits as OML DataFrames, with no data pulled from the database to produce them.
Let us see an example of a 5-fold cross validation with the code below.

fold = 5
pairs = CUST_SUBSET_DF.KFold(n_splits = fold)

After running the code, we will have a tuple that contains 5 pairs of OML DataFrames. Each pair is a tuple (train set, test set) and each element is an OML DataFrame.
While we could view the data directly, this would involve pulling it to the client. Since we are already familiar with the dataset as a whole, a better option is to print the number of elements in each pair along with the type and dimensions of each element:

for pair in pairs:
    print(len(pair), type(pair[0]), pair[0].shape, type(pair[1]), pair[1].shape)

The output is

2 <class 'oml.core.frame.DataFrame'> (11105, 14) <class 'oml.core.frame.DataFrame'> (2775, 14) 
2 <class 'oml.core.frame.DataFrame'> (11100, 14) <class 'oml.core.frame.DataFrame'> (2780, 14) 
2 <class 'oml.core.frame.DataFrame'> (11148, 14) <class 'oml.core.frame.DataFrame'> (2732, 14) 
2 <class 'oml.core.frame.DataFrame'> (11117, 14) <class 'oml.core.frame.DataFrame'> (2763, 14) 
2 <class 'oml.core.frame.DataFrame'> (11050, 14) <class 'oml.core.frame.DataFrame'> (2830, 14)

We can see that the entire dataset (13880 rows) is divided into 5 segments of the sizes (2775, 2780, 2732, 2763, 2830). Each pair contains the particular segment as the test set and the rest as the train set. This provides a convenient way to iterate on each pair for model training and scoring for model validation.
Consider the following use case: we want to train a generalized linear model and run cross validation to get the best estimate of model accuracy. For validation, we use the AUC (Area Under the Curve) metric to check model performance.
For scalability, we implement the AUC computation by issuing a SQL query through oml.cursor. Using Oracle SQL window functions makes the AUC computation efficient and fast.
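Conceptually, the query below orders rows by predicted probability, computes the cumulative true positive fraction (tpf) and false positive fraction (fpf) at each score, and sums the trapezoid areas under the resulting ROC curve: AUC = sum over i of 0.5 * (fpf_i - fpf_{i-1}) * (tpf_i + tpf_{i-1}).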

def auc_score(table_name, prob, target):
    """Compute the AUC for a table of scored rows via a SQL window-function query."""
    import oml
    cr = oml.cursor()
    query_template = """
    WITH
    pos_prob_and_counts AS (
    SELECT <PROB1> pos_prob,
           DECODE(<TARGET>, 1, 1, 0) pos_cnt
      FROM <TABLE>
    ),
    tpf_fpf AS (
    SELECT  pos_cnt,
           SUM(pos_cnt) OVER (ORDER BY pos_prob DESC) /SUM(pos_cnt) OVER () tpf,
           SUM(1 - pos_cnt) OVER (ORDER BY pos_prob DESC) / SUM(1 - pos_cnt) OVER () fpf
      FROM pos_prob_and_counts
    ),
    trapezoid_areas AS (
    SELECT 0.5 * (fpf - LAG(fpf, 1, 0) OVER (ORDER BY fpf, tpf))*(tpf + LAG(tpf, 1, 0) OVER (ORDER BY fpf, tpf)) area
      FROM tpf_fpf
     WHERE pos_cnt = 1
        OR (tpf = 1 AND fpf = 1)
    )
    SELECT SUM(area) auc
      FROM trapezoid_areas"""    
    
    query = query_template.replace('<PROB1>', prob)
    query = query.replace('<TARGET>', target)
    query = query.replace('<TABLE>', table_name)
    
    _ = cr.execute(query)
    auc = cr.fetchall()
    cr.close()
    
    return auc[0][0]
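For example, given a materialized table of scored rows (the table name GLM_RES_TBL_0 here is hypothetical), the call would look like:

# Hypothetical table holding scored rows with PROBABILITY_OF_1 and BUY_INSURANCE columns
auc = auc_score('GLM_RES_TBL_0', 'PROBABILITY_OF_1', 'BUY_INSURANCE')
print(auc)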

We put both the training and testing functionality into the following function.

def train_and_validate(TRAIN_DF, TEST_DF, target, case_id, idx_fold, prefix, stats_map):
    train_x = TRAIN_DF.drop([target])
    train_y = TRAIN_DF[target]
    test_x = TEST_DF
    test_y = TEST_DF[target]
    model_name = prefix + '_' + str(idx_fold)
    # Drop any model left over from a previous run
    try:
        oml.drop(model = model_name)
    except Exception:
        print(model_name + " not found")
    print('Training model ' + model_name)
    # Train a GLM classifier with ridge regularization enabled
    setting = {'GLMS_RIDGE_REGRESSION': 'GLMS_RIDGE_REG_ENABLE'}
    glm_mod = oml.glm("classification", **setting)
    glm_mod.fit(train_x, train_y, case_id = case_id, model_name = model_name)
    # Score the test set: predictions plus probability of the positive class
    GLM_RES_DF = glm_mod.predict(test_x,
                            supplemental_cols = test_x[[case_id, target]])
    GLM_RES_PROB = glm_mod.predict_proba(test_x, supplemental_cols = test_x[case_id])
    GLM_RES_DF = GLM_RES_DF.merge(GLM_RES_PROB, how = "inner", on = case_id, suffixes = ["", ""])
    # Materialize the scored rows into a named table so auc_score can query it
    res_tbl = model_name + '_RES'
    try:
        oml.drop(table = res_tbl)
    except Exception:
        pass
    GLM_RES_DF = GLM_RES_DF.materialize(table = res_tbl)
    stats_map[idx_fold] = auc_score(res_tbl, 'PROBABILITY_OF_1', target)
    return glm_mod

We can loop through all K pairs of train/test OML DataFrames and use a Python dictionary to record the AUC obtained for each fold, as follows.

fold = 5
pairs = CUST_SUBSET_DF.KFold(n_splits = fold)
stats_map = {}
models = []
for i, pair in enumerate(pairs):
    TRAIN_DF, TEST_DF = pair
    print('Running fold %s ' % i)
    print('Training data:')
    print(TRAIN_DF.shape)
    print('Test data:')
    print(TEST_DF.shape)    
    model = train_and_validate(TRAIN_DF, TEST_DF, 'BUY_INSURANCE', 'CUSTOMER_ID', i, 'LTV_MDL', stats_map)
    models.append(model)

After the loop, we want to check the K models built in this process. In the code above, we saved the models returned by the function, so we can inspect them from the models list right away. Another way to retrieve a model is to use the OML model constructor with the model name.

models_by_name = [oml.glm(model_name = 'LTV_MDL_' + str(i)) for i in range(fold)]
models_by_name[0]

We can take a look at the information for one of the models:

Model Name: LTV_MDL_0

Model Owner: JIE

Algorithm Name: Generalized Linear Model

Mining Function: CLASSIFICATION

Target: BUY_INSURANCE

Settings: 
                   setting name                  setting value
0                     ALGO_NAME  ALGO_GENERALIZED_LINEAR_MODEL
1         CLAS_WEIGHTS_BALANCED                            OFF
2               GLMS_CONF_LEVEL                            .95
3           GLMS_FTR_GENERATION    GLMS_FTR_GENERATION_DISABLE
4            GLMS_FTR_SELECTION     GLMS_FTR_SELECTION_DISABLE
5                  ODMS_DETAILS                    ODMS_ENABLE
6  ODMS_MISSING_VALUE_TREATMENT        ODMS_MISSING_VALUE_AUTO
7                 ODMS_SAMPLING          ODMS_SAMPLING_DISABLE
8                     PREP_AUTO                             ON

The benefit of cross validation is that we can compute the mean and standard deviation of the AUC scores to have a clearer picture of how the model performs across all the data.

import numpy as np

auc_scores = list(stats_map.values())
print(np.around(auc_scores, 4))
print('Average AUC %s, STD %s' % (np.round(np.mean(auc_scores), 4), np.round(np.std(auc_scores), 5)))

Let’s check the output:

[0.7677 0.7744 0.7788 0.764 0.7694]
Average AUC 0.7708, STD 0.00519 

The result shows that the average AUC score is 0.7708. We also have a sense of the variance: the standard deviation is 0.00519.

Run Cross Validation in Parallel

In the last example, we iterate through all K folds and repeat the same train and test process K times. Depending on the data size, this can be a time-consuming approach. Alternatively, we can run the K computations in parallel using Embedded Python Execution (EPE) in OML4Py. EPE allows the user to specify the desired number of Python engines that the database environment should start to run a user-defined Python function, and EPE functions support automatically or manually loading database table data into that function. In this example, we show how to submit multiple in-database model training and testing jobs in parallel through index_apply().
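To show the index_apply pattern in isolation, here is a minimal sketch with a toy function (not the blog's workload): each invocation receives its 1-based index and may run in a separate database-spawned Python engine.

import oml

def square(idx):
    # idx runs from 1 to times
    return idx ** 2

res = oml.index_apply(times=3, func=square)
res  # [1, 4, 9]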
The first thing we need to do is store the Python functions in the OML4Py script repository. This submits the source code of the two functions to the Autonomous Database (ADB) so that database-spawned and controlled Python engines can access and run them. We use the oml.script.create function to achieve that.

auc_src = '''
def auc_score(table_name, prob, target):
    import oml
    cr = oml.cursor()
    query_template = """
    WITH
    pos_prob_and_counts AS (
    SELECT <PROB1> pos_prob,
           DECODE(<TARGET>, 1, 1, 0) pos_cnt
      FROM <TABLE>
    ),
    tpf_fpf AS (
    SELECT  pos_cnt,
           SUM(pos_cnt) OVER (ORDER BY pos_prob DESC) /SUM(pos_cnt) OVER () tpf,
           SUM(1 - pos_cnt) OVER (ORDER BY pos_prob DESC) / SUM(1 - pos_cnt) OVER () fpf
      FROM pos_prob_and_counts
    ),
    trapezoid_areas AS (
    SELECT 0.5 * (fpf - LAG(fpf, 1, 0) OVER (ORDER BY fpf, tpf))*(tpf + LAG(tpf, 1, 0) OVER (ORDER BY fpf, tpf)) area
      FROM tpf_fpf
     WHERE pos_cnt = 1
        OR (tpf = 1 AND fpf = 1)
    )
    SELECT SUM(area) auc
      FROM trapezoid_areas"""
    query = query_template.replace('<PROB1>', prob)
    query = query.replace('<TARGET>', target)
    query = query.replace('<TABLE>', table_name)
    _ = cr.execute(query)
    auc = cr.fetchall()
    cr.close()
    return auc[0][0]'''
train_src = """
def train_and_validate(TRAIN_DF, TEST_DF, target, case_id, idx_fold, prefix, stats_map):
    import oml
    auc_score = oml.script.load('auc_score')
    train_x = TRAIN_DF.drop([target])
    train_y = TRAIN_DF[target]
    test_x = TEST_DF
    test_y = TEST_DF[target]
    model_name = prefix + '_' + str(idx_fold)
    try:
        oml.drop(model= model_name)
    except Exception:
        print(model_name + " not found")
    print('Training model' + model_name)
    setting = {'GLMS_RIDGE_REGRESSION': 'GLMS_RIDGE_REG_ENABLE'}
    glm_mod = oml.glm("classification", **setting)    
    glm_mod.fit(train_x, train_y, case_id = case_id, model_name = model_name)
    GLM_RES_DF = glm_mod.predict(test_x,
                            supplemental_cols = test_x[[case_id, target]])
    GLM_RES_PROB = glm_mod.predict_proba(test_x, supplemental_cols = test_x[case_id])
    GLM_RES_DF = GLM_RES_DF.merge(GLM_RES_PROB, how = "inner", on = case_id, suffixes = ["", ""])
    # Materialize the scored rows into a named table so auc_score can query it
    res_tbl = model_name + '_RES'
    try:
        oml.drop(table = res_tbl)
    except Exception:
        pass
    GLM_RES_DF = GLM_RES_DF.materialize(table = res_tbl)
    stats_map[idx_fold] = auc_score(res_tbl, 'PROBABILITY_OF_1', target)
    return glm_mod"""
oml.script.create('auc_score', auc_src, is_global = True, overwrite = True)
oml.script.create('train_and_validate', train_src, is_global = True, overwrite = True)

To further improve runtime performance, we can materialize the K fold train and test tables as follows.

fold = 5
pairs = CUST_SUBSET_DF.KFold(n_splits = fold)
for i, pair in enumerate(pairs):
    train_tbl_name = 'CUST_TRAIN_TBL_' + str(i+1)
    test_tbl_name = 'CUST_TEST_TBL_' + str(i+1)
    # Drop each table separately so one missing table does not skip the other
    for tbl_name in (train_tbl_name, test_tbl_name):
        try:
            oml.drop(table = tbl_name)
        except Exception:
            print("No such table: " + tbl_name)
    _ = pair[0].materialize(table = train_tbl_name)
    _ = pair[1].materialize(table = test_tbl_name)
    print(train_tbl_name)
    print(test_tbl_name)

After that, we can write a function that calls the train_and_validate script uploaded above to do the cross validation. The input argument is the run index of each job; in this example, it corresponds to the i-th fold (1-based, matching the table name suffixes).

def build_oml(idx):
    import oml
    train_and_validate = oml.script.load('train_and_validate')    
    stats_map = {}
    train_tbl_name = 'CUST_TRAIN_TBL_'
    test_tbl_name = 'CUST_TEST_TBL_'
    TRAIN_DF = oml.sync(table = train_tbl_name + str(idx))
    TEST_DF = oml.sync(table = test_tbl_name + str(idx))
    train_and_validate(TRAIN_DF, TEST_DF, 'BUY_INSURANCE', 'CUSTOMER_ID', idx, 'GLM_LTV_MDL', stats_map)
    return stats_map[idx]

Supplying this function as the input to oml.index_apply lets all K=5 folds run at the same time, with each job returning the AUC score for its fold.

res = oml.index_apply(times=5, func=build_oml, oml_connect=True, parallel = 5)
res  
[0.7127303485647599, 0.7054260012897073, 0.6969313963582076, 0.7049151140304264, 0.6859887524179962] 

One may wonder how much time this parallel execution saves. In general, it depends on many factors, such as the dataset size, the number of OCPUs, the service level (low, medium, high), and of course the number of cross validation folds K. In our example, the dataset is relatively small (13,880 rows). Testing at the medium service level with 16 OCPUs, the parallel approach takes around 28s while the loop approach takes around 36s, which is not a major difference. However, as data volume and the number of folds grow, the difference becomes more significant. On a dataset just 20 times larger, the loop approach takes 68 seconds while the parallel approach takes 37 seconds. Going even further to K=20 cross validation, the loop approach takes 4 minutes 34 seconds and the parallel approach takes 1 minute 15 seconds. This also illustrates how embedded Python execution can be used in creative ways to reduce elapsed runtimes, so we are better off with the parallel approach in this case.
As noted above, multiple factors impact the actual runtime. To get a sense of how performance changes in different settings, we have two blog posts discussing model build performance and scoring performance.
If the user has purely open source training/testing jobs, index_apply will likely yield performance benefits because each user-defined function invocation runs in its own Python engine. We will show how to achieve that in Part II of this cross validation series.

Conclusion

In this blog, we demonstrated the convenience of running cross validation on in-database models with OML4Py. Using the KFold function and the AUC computation through the transparency layer makes cross validation easy. We also showed how to submit multiple jobs in parallel using oml.index_apply and the Python script repository. We saw the benefit of running in-database algorithms in parallel for this dataset, and discussed how performance varies in different situations.
