Weight of evidence (WOE) is a powerful tool for feature representation and evaluation in data science. In a previous blog, we explained the importance and application of WOE and its byproduct, Information Value (IV). One important challenge in applying this powerful tool is the scalability of the computation, especially as the dataset grows large. In that blog, we presented a scalable approach to computing those values by leveraging the transparency layer provided in OML4R, with the assumption that the data resides in Oracle Database.
As one of the newly released products in Oracle Machine Learning, OML4Py brings the benefits provided by OML4R, and even more functionality such as AutoML, to Python. In this blog, we show how to compute the WOE and IV using OML4Py.
The dataset we use for demonstration here is the customer insurance lifetime value dataset, created by the Oracle Machine Learning team. It simulates a scenario in which an insurance company targets potential customers based on the lifetime value, demographic, and financial features of each individual customer. The dataset contains both categorical and numeric columns, which makes it a good example for showing how to do the WOE computation. The following is a glimpse into this dataset with a subset of the columns.
Based on the column names, we can see that the dataset contains demographic features such as state, region, gender, and marital status, as well as financial features such as income and credit card limits.
The main business problem is to find out which customers are likely to buy an insurance policy. The ground truth, whether a customer has purchased an insurance policy, is given by the column BUY_INSURANCE. This is a typical binary classification problem, and we can use all the feature columns provided in this dataset to build a model.
We provide the following Python code to compute the WOE and IV using the OML4Py transparency layer. Our task is to produce WOE-transformed columns for any given OML frame with multiple categorical and numerical columns.
Our main approach consists of several steps. We first look at each categorical column, find the distinct levels of that column, and compute the WOE value for each level using the conditional probabilities and the prior probability. At this point, the IV can also be computed from the obtained WOE values.
For each numerical column, we first bin the column, turning it into a categorical column whose levels are the bin indices. After that, we repeat the procedure described above.
After computing the mapping from the original values to the WOE values, we save the result into a table. By joining it back to the original dataset, we can add the transformed columns to the original table.
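Before turning to the OML4Py implementation, the per-level computation sketched above can be illustrated with a minimal pure-pandas example on a toy in-memory frame (the column names and data here are hypothetical; the add-one smoothing of the conditional counts matches what the implementation in this blog uses):

```python
import numpy as np
import pandas as pd

# Toy frame: one categorical feature and a binary target (hypothetical data).
df = pd.DataFrame({'COLOR':  ['red', 'red', 'blue', 'blue', 'blue', 'green'],
                   'TARGET': [1,     0,     1,      1,      0,      0]})

pos_cnt = (df['TARGET'] == 1).sum()   # total positives
neg_cnt = (df['TARGET'] == 0).sum()   # total negatives
K = df['COLOR'].nunique()             # number of levels, used for smoothing

woe = {}
for level, grp in df.groupby('COLOR'):
    # Add-one (Laplace) smoothed conditional probabilities P(level | class)
    cond_pos = ((grp['TARGET'] == 1).sum() + 1) / (pos_cnt + K)
    cond_neg = ((grp['TARGET'] == 0).sum() + 1) / (neg_cnt + K)
    woe[level] = np.log(cond_pos / cond_neg)

print(woe)
```

A level that is more common among positives ('blue') gets a positive WOE, a level more common among negatives ('green') gets a negative one, and a level with equal conditional probabilities ('red') maps to zero.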
Let us look at our method in more detail. First, we compute the prior probability of the target column; in our dataset, this is the probability of BUY_INSURANCE = 1 and BUY_INSURANCE = 0. The function takes the OML frame, the name of the target column, and the positive and negative labels as input.
It returns a tuple:
(the name of the target column, the positive level, the positive count, the negative level, the negative count)
which will be used as an input to the other functions.
The code is given below.
import oml
import numpy as np
import pandas as pd

def prior_count(DF, col, pos, neg):
    # Pull the distinct labels of the binary target column to the client.
    labels = DF[col].drop_duplicates().pull()
    assert len(labels) == 2
    assert pos in labels
    assert neg in labels
    # (target column, positive level, positive count, negative level, negative count)
    return col, pos, DF[DF[col] == pos].shape[0], neg, DF[DF[col] == neg].shape[0]
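Because the transparency layer mirrors the pandas API, the same logic can be sanity-checked locally on a plain pandas frame; in this hypothetical local analogue, `pull()` simply becomes `tolist()`:

```python
import pandas as pd

def prior_count_local(df, col, pos, neg):
    """Local pandas analogue of prior_count, for quick testing (hypothetical helper)."""
    labels = df[col].drop_duplicates().tolist()
    assert len(labels) == 2
    assert pos in labels and neg in labels
    return col, pos, df[df[col] == pos].shape[0], neg, df[df[col] == neg].shape[0]

df = pd.DataFrame({'BUY_INSURANCE': [1, 0, 0, 1, 0]})
print(prior_count_local(df, 'BUY_INSURANCE', 1, 0))   # → ('BUY_INSURANCE', 1, 2, 0, 3)
```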
After the prior counts are computed, we use the function woe_col to generate a data frame that contains all distinct levels of a column, the conditional probabilities, and the WOE values.
def woe_col(DF, col, priors):
    vals = DF[col].drop_duplicates().pull()
    target, pos, pos_cnt, neg, neg_cnt = priors
    res_df = pd.DataFrame(columns=['VAL', 'COND_POS', 'COND_NEG', 'WOE'])
    # Counts for each (level, target) combination, pulled to the client.
    GROUP_CNT = DF.crosstab([col, target]).pull()
    K = len(vals)
    for val in vals:
        # Defaults correspond to a zero count with add-one (Laplace) smoothing.
        cond_pos = 1.0 / (K + pos_cnt)
        cond_neg = 1.0 / (K + neg_cnt)
        deno = 1
        nomi = 1
        if GROUP_CNT[(GROUP_CNT[col] == val) & (GROUP_CNT[target] == pos)].shape[0] > 0:
            nomi = GROUP_CNT[(GROUP_CNT[col] == val) & (GROUP_CNT[target] == pos)]['count'].values[0] + 1
            cond_pos = nomi * 1.0 / (K + pos_cnt)
        if GROUP_CNT[(GROUP_CNT[col] == val) & (GROUP_CNT[target] == neg)].shape[0] > 0:
            deno = GROUP_CNT[(GROUP_CNT[col] == val) & (GROUP_CNT[target] == neg)]['count'].values[0] + 1
            cond_neg = deno * 1.0 / (K + neg_cnt)
        # WOE = log(cond_pos) - log(cond_neg), expanded term by term.
        woe = np.log(nomi) - np.log(K + pos_cnt) - (np.log(deno) - np.log(K + neg_cnt))
        res_df = res_df.append({'VAL': val, 'COND_POS': cond_pos, 'COND_NEG': cond_neg, 'WOE': woe},
                               ignore_index=True)
    return res_df
For example, the WOE values for the column MARITAL_STATUS are shown below.
The function attach_woe_cat accepts a categorical column, calls the woe_col function to generate the WOE table for that column, and attaches the WOE column to the original OML frame. The generated column has the suffix _WOE. So if we supply col = MARITAL_STATUS, a column of WOE values named MARITAL_STATUS_WOE is attached to the original OML frame.
The function is listed as follows.
def attach_woe_cat(DF, col, target, priors=None):
    if not priors:
        priors = prior_count(DF, target, 1, 0)
    woe_df = woe_col(DF, col, priors)
    # Push the local WOE table into the database so it can be joined in place.
    WOE_DF = oml.push(woe_df)
    DF = DF.merge(WOE_DF, left_on=col, right_on='VAL', how='inner', suffixes=['', ''])
    DF = DF.drop(['VAL', 'COND_POS', 'COND_NEG'])
    _ = DF.rename({'WOE': col + '_WOE'})  # rename happens in place
    return DF
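The join-back step can be pictured with plain pandas on a toy example (the data below is hypothetical); oml.push and the transparency-layer merge perform the same join inside the database:

```python
import pandas as pd

# Original frame and a per-level WOE lookup table (hypothetical values).
df = pd.DataFrame({'MARITAL_STATUS': ['SINGLE', 'MARRIED', 'SINGLE']})
woe_df = pd.DataFrame({'VAL': ['SINGLE', 'MARRIED'], 'WOE': [0.25, -0.40]})

# Join each row to its level's WOE, then drop the key and rename the new column.
out = df.merge(woe_df, left_on='MARITAL_STATUS', right_on='VAL', how='inner')
out = out.drop(columns=['VAL']).rename(columns={'WOE': 'MARITAL_STATUS_WOE'})
print(out)
```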
The function attach_woe_num assumes a given column is a numerical column. For numerical columns, the function first bins the column to discretize the numerical values into categorical values. Since we already have a function dealing with WOE values on categorical columns, we can reuse attach_woe_cat to complete the computation. The function is listed below.
def attach_woe_num(DF, col, target, priors=None, bin_num=20):
    if not priors:
        priors = prior_count(DF, target, 1, 0)
    binned_col = col + '_BIN'
    # Discretize the numerical column into bin_num bins, then reuse the categorical path.
    DF = DF.concat({binned_col: DF[col].cut(bin_num)})
    DF = attach_woe_cat(DF, binned_col, target, priors)
    DF = DF.drop([binned_col])
    return DF
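The `cut` method used above mirrors binning in pandas; locally, the discretization step looks roughly like this sketch with hypothetical income values and equal-width bins:

```python
import pandas as pd

income = pd.Series([12_000, 25_000, 40_000, 58_000, 99_000])

# Equal-width binning into 4 intervals; labels=False returns the bin index,
# which plays the role of the categorical level in the WOE computation.
bins = pd.cut(income, bins=4, labels=False)
print(bins.tolist())   # → [0, 0, 1, 2, 3]
```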
To complete the WOE transformation on the entire OML frame with multiple numerical and categorical columns, all we need to do is to combine the APIs above as follows.
target = 'BUY_INSURANCE'
priors = prior_count(CUST_SUBSET_DF, target, 1, 0)
CUST_WOE_DF = CUST_SUBSET_DF
for col in cat_cols:
    CUST_WOE_DF = attach_woe_cat(CUST_WOE_DF, col, target, priors)
for col in num_cols:
    CUST_WOE_DF = attach_woe_num(CUST_WOE_DF, col, target, priors)
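The lists cat_cols and num_cols above are assumed to be prepared beforehand; one hypothetical way to derive them locally is by dtype, sketched here on a small pandas frame:

```python
import pandas as pd

df = pd.DataFrame({'STATE': ['CA', 'TX'],
                   'GENDER': ['F', 'M'],
                   'INCOME': [52_000.0, 61_000.0],
                   'BUY_INSURANCE': [1, 0]})

target = 'BUY_INSURANCE'
features = df.drop(columns=[target])

# String columns go through the categorical path; numeric ones get binned first.
cat_cols = features.select_dtypes(include='object').columns.tolist()
num_cols = features.select_dtypes(include='number').columns.tolist()
print(cat_cols, num_cols)   # → ['STATE', 'GENDER'] ['INCOME']
```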
Now we have the OML frame with the WOE values attached. The following is a sample of the results. We can see that categorical features such as STATE and GENDER have been transformed into numerical columns, with each level corresponding to a number.
One byproduct of the WOE analysis is the Information Value. The function information_value computes the IV for the given column.
def information_value(DF, col, target, priors, bin_num=20):
    if not priors:
        priors = prior_count(DF, target, 1, 0)
    # Numerical columns are binned first, as in attach_woe_num.
    if type(DF[col]) == oml.Float:
        binned_col = col + '_BIN'
        DF = DF.concat({binned_col: DF[col].cut(bin_num)})
        col = binned_col
    woe_df = woe_col(DF, col, priors)
    woe_df['IV'] = woe_df['WOE'] * (woe_df['COND_POS'] - woe_df['COND_NEG'])
    return woe_df['IV'].sum()
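On a per-level table, the IV is just the sum of WOE * (cond_pos - cond_neg) over the levels. A small sketch with a hypothetical WOE table of the shape produced above:

```python
import pandas as pd

# Hypothetical per-level WOE table, in the format produced by woe_col.
woe_df = pd.DataFrame({'VAL':      ['A',  'B',  'C'],
                       'COND_POS': [0.50, 0.30, 0.20],
                       'COND_NEG': [0.20, 0.30, 0.50],
                       'WOE':      [0.916, 0.0, -0.916]})

# Each level contributes WOE * (cond_pos - cond_neg); the IV is the total.
iv = (woe_df['WOE'] * (woe_df['COND_POS'] - woe_df['COND_NEG'])).sum()
print(round(iv, 4))   # → 0.5496
```

Note that each term is non-negative (WOE and the probability difference always share a sign), so IV only grows as levels separate the classes more strongly.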
The IV provides a measure of the predictive power of each feature. We rank the feature columns based on the IV to check the top predictive features:
Based on the information values, we see that the most predictive features are related to the financial status of the customer, such as PROFESSION, along with demographic features like NUM_DEPENDENTS and MARITAL_STATUS.
The WOE and IV are frequently used in the domain of credit scoring. One might wonder how to interpret this number: for example, how predictive is a feature whose IV is around 0.5? Following a rule of thumb given in Intelligent Credit Scoring by Siddiqi, the level of predictive power associated with an IV value is:
IV < 0.02: not predictive
0.02 ≤ IV < 0.1: weak predictive power
0.1 ≤ IV < 0.3: medium predictive power
IV ≥ 0.3: strong predictive power
Therefore, the IVs for our example are
Following this rule, we remove the non-predictive features (those with IV < 0.02) and continue with the remaining features.
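With IV scores in hand, the filtering step itself is a one-liner; the feature names and scores below are hypothetical:

```python
# Hypothetical IV scores per feature.
iv_scores = {'PROFESSION': 0.41, 'MARITAL_STATUS': 0.18,
             'NUM_DEPENDENTS': 0.12, 'REGION': 0.008}

# Keep everything at or above the 0.02 "non-predictive" cutoff.
kept = [col for col, iv in iv_scores.items() if iv >= 0.02]
print(kept)   # → ['PROFESSION', 'MARITAL_STATUS', 'NUM_DEPENDENTS']
```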
Next, we will fit a model using the WOE columns and see if the WOE transformation can improve the classification performance.
We first split the data into a train set and a test set. This can be done using the split method of the OML frame object.
dat = CUST_WOE_DF.split(seed = 1)
train_x = dat[0].drop(['BUY_INSURANCE'])
train_y = dat[0]['BUY_INSURANCE']
test_x = dat[1]
test_y = dat[1]['BUY_INSURANCE']
Now, we train the model on the training set using the WOE transformed features.
setting = dict()
glm_mod = oml.glm("classification", **setting)
glm_mod.fit(train_x, train_y, case_id = 'CUSTOMER_ID')
Then we can apply the model to the test set and generate a table that consists of customer ID, prediction results, probability, and the target.
GLM_WOE_RES_DF = glm_mod.predict(test_x,
supplemental_cols = test_x[['CUSTOMER_ID', 'BUY_INSURANCE']])
GLM_WOE_RES_PROB = glm_mod.predict_proba(test_x, supplemental_cols = test_x['CUSTOMER_ID'])
GLM_WOE_RES_DF = GLM_WOE_RES_DF.merge(GLM_WOE_RES_PROB, how = "inner", on = 'CUSTOMER_ID', suffixes = ["", ""])
GLM_WOE_RES_DF = GLM_WOE_RES_DF.materialize()
It is interesting to ask what happens if we do not use WOE to train the model. For comparison, we build a model on the original dataset and check the difference in performance. We plot the ROC curves side by side to show the difference. We can see that the model based on WOE performs better than the one based on the original dataset, since its ROC curve lies further toward the upper left overall, with a higher AUC. This benefit comes from the WOE transformation, since the categorical columns are converted to numerical values of lower dimension. With a reduced dimension in the feature space, it is easier to fit a model with higher accuracy.
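As an aside, AUC can be checked without any plotting library: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as half. A small pure-Python sketch with hypothetical labels and scores:

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))   # → 0.8888888888888888
```

This rank-based view is why AUC is insensitive to any monotone rescaling of the predicted probabilities.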
Beyond accuracy, as we pointed out in our previous blog, WOE provides a good way to avoid exploding one categorical column into many one-hot encoded columns, which greatly reduces time and space complexity.
In this blog, we demonstrated an implementation of weight of evidence and information value using OML4Py. By leveraging the transparency layer of OML4Py, the user can manipulate data tables in Oracle Database and complete the computation and transformation in a scalable and efficient manner. The predictive power of each feature is also assessed by computing its information value. Moreover, we showed on the customer insurance lifetime value dataset that a model built with WOE can achieve higher prediction accuracy.