
  • Wednesday, August 2, 2017

Computing Weight of Evidence (WOE) and Information Value (IV)

By: Jie Liu | Data Scientist

Weight of evidence (WOE) is a powerful tool for feature representation and evaluation in data science. WOE can provide an interpretable transformation for both categorical and numerical features.
For categorical features, the levels within a feature often have no ordinal meaning and thus need to be transformed by either one-hot encoding or hashing. Although such transformations convert the feature into vectors that can be fed into machine learning algorithms, the 0-1 valued vectors are difficult to interpret as features. For example, an e-commerce company may want to predict the conversion rate of a group of consumers. One can extract demographic information such as the postcode of each person's address. The postcode can be viewed as a categorical feature and encoded into a one-hot vector, but it is unclear which postcodes are associated with relatively higher or lower conversion rates. One can check such an inclination in the coefficients of a GLM or SVM model; however, this information is not available until the model is trained. In such a case, WOE provides a score for each postcode, so one can clearly see the linkage between a postcode and the conversion rate. Moreover, this linkage, or inclination, generated by WOE can be used for feature transformation and thus benefit model training.
For numerical features, although there is a natural ordering of values, nonlinearity sometimes exists, and in such cases a linear model fails to capture it. For instance, the average income for a group of people may increase with age within the range 20-60 (see the figure below, from the WSJ), but may drop after that because of retirement. In such use cases, WOE provides a score for each binned segment (e.g., 30-40, 40-50, ..., 60-70), which captures the nonlinearity of the data.

[Figure: average income by age group (source: WSJ)]

Moreover, in many fields such as finance (e.g., credit risk analysis), machine learning models are preferred to be transparent and interpretable at each step. WOE provides a convenient way to show the actual transformation to auditors or supervisors.

Another useful byproduct of WOE analysis is Information Value (IV), which measures the importance of a feature. Note that IV depends only on frequency counts and does not require fitting a model to obtain an attribute importance score. The in-database attribute importance function ore.odmAI, which uses the minimum description length algorithm, can also be used for ranking attributes; however, those with a background in information theory may prefer the IV calculation.
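
For comparison, here is a minimal, hedged sketch of ranking attributes with ore.odmAI; it assumes an ore.frame with the response column high_tip, such as the NYC_TRAIN table created later in this post.

# A minimal sketch (hedged): in-database attribute importance via MDL.
# Assumes NYC_TRAIN is an ore.frame containing the high_tip column,
# as created later in this post.
ai <- ore.odmAI(high_tip ~ ., data = NYC_TRAIN)
ai$importance   # attributes ranked by importance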

A nice theoretical and practical overview of WOE analysis can be found in this blog post.

Computing WOE or IV can be involved and computationally intensive if the data size is large. Generally, it requires counting for each level of the categorical features, and both binning and counting for numerical features. If the data reside inside Oracle Database, it is desirable to compute these scores using in-database tools. In this blog, we provide an example showing how to compute WOE using Oracle R Enterprise.

Data Overview

The data we use here is the New York City Taxi data, a well-known public dataset covering transactional and trip information for both green and yellow taxis. General information about this data set can be found at this link. There is a nice GitHub repo that shows how to download the data (see the file raw_data_urls.txt). Since it is a huge data set, we picked the Green Taxi (the taxi that is allowed to pick up passengers only outside Manhattan) data for December 2016. The total size of the data is around 1.07 million records.

rm(list=ls())
library(ORE)
options(ore.warn.order=FALSE)
ore.connect(...)
ore.ls()
trip.file <- "green_tripdata_2016-12.csv"
nyc.data <- read.csv(file = trip.file, header = F, skip = 1)
nyc.df <- nyc.data[, 1:19]
headers <- read.csv(file = trip.file, header = F, nrows = 1, as.is = T)
colnames(nyc.df) <- as.character(headers[1:19])
ore.create(nyc.df, table = "NYC_GREEN_TAXI")

The code above reads the .csv data into R and creates a table in Oracle Database using ORE. Normally, such enterprise data would already be present in Oracle Database, eliminating this step. First, let us take a glimpse at the data.

head(NYC_GREEN_TAXI)
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
1        2  2016-12-01 00:38:14   2016-12-01 00:45:24                  N
2        2  2016-12-01 00:32:07   2016-12-01 00:35:01                  N
3        2  2016-12-01 00:47:06   2016-12-01 01:00:40                  N
4        2  2016-12-01 00:35:11   2016-12-01 00:53:38                  N
5        2  2016-12-01 00:24:49   2016-12-01 00:39:30                  N
6        2  2016-12-01 00:12:05   2016-12-01 00:15:58                  N
  RatecodeID PULocationID DOLocationID passenger_count trip_distance
1          1           75          262               1          1.30
2          1          193          179               1          1.21
3          1          168           94               2          3.97
4          1           25           61               1          4.29
5          1          223          129               1          2.66
6          1          129          129               2          0.89
  fare_amount extra mta_tax tip_amount tolls_amount ehail_fee
1         7.0   0.5     0.5       0.00            0        NA
2         5.5   0.5     0.5       1.70            0        NA
3        14.0   0.5     0.5       3.06            0        NA
4        16.5   0.5     0.5       3.56            0        NA
5        13.0   0.5     0.5       0.00            0        NA
6         5.0   0.5     0.5       0.00            0        NA
  improvement_surcharge total_amount payment_type trip_type
1                   0.3         8.30            2         1
2                   0.3         8.50            1         1
3                   0.3        18.36            1         1
4                   0.3        23.31            1         1
5                   0.3        14.30            1         1 
6                   0.3         6.30            2         1

Even without a data dictionary, the names of the features are self-explanatory. The data set contains transactional information such as total amount, payment type, fare amount, and tax, as well as trip-specific information such as pick-up and drop-off time and location, number of passengers, and trip distance.
What kind of insight can we get from such a dataset?

We all know that one common practice when taking a taxi in the US is that the passenger tips the driver. The tip amount has no clearly defined standard and can be controversial; this is part of the reason why Uber and Lyft have become popular, since one does not need to worry about tipping at all. In this dataset, we notice quite a few trips in which the passenger did not tip the driver or no tip amount was recorded (tip amount = 0), and there are also many cases where the tip is quite high.
[Figure: distribution of tip amounts]

A fair way to study tipping is through the tip as a percentage of the fare amount. The fare amount is usually calculated by the meter, and the tip is usually paid in proportion to the fare. Normally, in the US, the tip is around 20%, but we observed quite a lot of high-percentage tips (up to 100%).
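
As a quick, hedged sketch (assuming the NYC_GREEN_TAXI table created above), the tip percentage distribution can be inspected through the ORE transparency layer:

# Hedged sketch: inspect the tip percentage distribution.
# Restrict to positive fares to avoid division by zero.
TAXI <- subset(NYC_GREEN_TAXI, fare_amount > 0)
TAXI$tip_pct <- TAXI$tip_amount / TAXI$fare_amount
summary(ore.pull(TAXI$tip_pct))   # pull only the derived column to the client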

Can we predict whether a high tip is paid to the taxi driver, given all the trip and transactional data? For this purpose, we define a binary variable high_tip, where high_tip = 1 if the tip percentage exceeds 20% and high_tip = 0 otherwise. We can then build a classification model to predict this response variable.

Data Cleaning

The NYC taxi data is of high quality in general, but there are some outliers that do not make sense and may have been generated by mistake. For instance, there are quite a few records with zero trip distance or zero total amount. We remove such records before moving to the next step, using ORE transparency layer functions.

NYC_GREEN_TAXI_STG <- subset(NYC_GREEN_TAXI, passenger_count > 0)
NYC_GREEN_TAXI_STG <- subset(NYC_GREEN_TAXI_STG, trip_distance > 0)
NYC_GREEN_TAXI_STG <- subset(NYC_GREEN_TAXI_STG, fare_amount > 0)
NYC_GREEN_TAXI_STG$tip_percentage <- NYC_GREEN_TAXI_STG$tip_amount*1.0/NYC_GREEN_TAXI_STG$fare_amount
NYC_GREEN_TAXI_STG$high_tip <- ifelse(NYC_GREEN_TAXI_STG$tip_percentage > 0.2, 1, 0)
NYC_GREEN_TAXI_STG$PULocationID <- as.factor(NYC_GREEN_TAXI_STG$PULocationID)
NYC_GREEN_TAXI_STG$DOLocationID <- as.factor(NYC_GREEN_TAXI_STG$DOLocationID)
NYC_GREEN_TAXI_STG$RatecodeID <- as.factor(NYC_GREEN_TAXI_STG$RatecodeID)
NYC_GREEN_TAXI_STG$payment_type <- as.factor(NYC_GREEN_TAXI_STG$payment_type)

One subtlety here is that the tip is often paid in cash, and the amount is then not recorded; when the payment type is cash, the tip amount is recorded as zero. We only know the tip for sure when the fare is paid by credit card. Consequently, we exclude the records with cash payment.
We create the binary variable high_tip to indicate whether a tip above 20% was paid. This variable is used as the response variable.
Next, we split the entire dataset into training and test sets.

set.seed(1)   # enable repeatable results
N <- nrow(NYC_GREEN_TAXI_STG)
sampleSize <- round(N * 0.7)
ind        <- sample(1:N, sampleSize)
group      <- as.integer(1:N %in% ind)
row.names(NYC_GREEN_TAXI_STG) <- NYC_GREEN_TAXI_STG$VendorID
NYC.train  <- NYC_GREEN_TAXI_STG[group == TRUE, ]
NYC.test   <- NYC_GREEN_TAXI_STG[group == FALSE, ]
ore.create(NYC.train, table = 'NYC_TRAIN')
ore.create(NYC.test, table = 'NYC_TEST')
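
As an optional, hedged sanity check, we can confirm the split sizes and the class balance through the transparency layer:

# Optional sanity check (hedged sketch): split sizes and class balance.
nrow(NYC_TRAIN)
nrow(NYC_TEST)
table(ore.pull(NYC_TRAIN$high_tip))   # counts of high_tip = 0 vs 1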

Feature Engineering

As a first step, we pick the set of features empirically, because we want to verify their importance later using WOE and IV. We select categorical features such as PULocationID, DOLocationID, RatecodeID, and payment_type, and numerical features such as trip_distance, improvement_surcharge, fare_amount, and tolls_amount. Empirically, the tip percentage may have something to do with the neighborhoods people travel between, the trip distance, or the fare amount. WOE analysis provides both attribute importance and a way to convert the categorical features into numerical values, which reduces the computational load.

Weight of Evidence Analysis

The weight of evidence is defined for each level of a feature. Consider a categorical feature X containing a level x_j. The WOE of level x_j is defined as

WOE(x_j) = log( f(x_j | Y=1) / f(x_j | Y=0) )

where f(x_j | Y=y) denotes the conditional likelihood of observing level x_j given the outcome y. This is the logarithmic conditional likelihood ratio of a level x_j. Why is it defined this way?
Suppose we want to convert a categorical feature into numeric values. One straightforward way is to calculate the frequency of the response variable within each level. For example, for each location ID, we can calculate the frequency of high-percentage tips. This is actually the posterior probability P(Y=1|x_j), where Y is the high-percentage-tip indicator and x_j is the particular location. We can use this frequency score to show the inclination toward the positive or negative direction of the response variable. In this example, a natural way to measure whether a location ID is more indicative of high tips than of low tips is to consider the ratio P(Y=1|x_j)/P(Y=0|x_j). In most cases, this ratio falls within (0,1), so a logarithmic transform is desirable to flatten the distribution.

By Bayes' rule, the posterior ratio factors into the ratio of priors and the ratio of likelihoods: P(Y=1|x_j)/P(Y=0|x_j) = [P(Y=1)/P(Y=0)] * [f(x_j|Y=1)/f(x_j|Y=0)]. The priors are not related to x_j at all, so it is better to focus on the second factor, which leads to the definition of WOE.
In other words, weight of evidence captures the inclination of a feature level with no influence from the priors. If we used the posterior ratio, the class imbalance would weigh in and distort the picture: imagine a case where records with high tips dominate the data set, so that every level of the feature has a very high ratio. By removing the prior component, WOE provides a principled way to avoid this problem.
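
To make the definition concrete, here is a small toy computation in plain R; the counts are made up purely for illustration:

# Toy WOE computation for a single level x_j (hypothetical counts).
n.pos.level <- 300;  n.pos.total <- 1000   # records with Y = 1
n.neg.level <- 100;  n.neg.total <- 4000   # records with Y = 0
f.pos <- n.pos.level / n.pos.total         # f(x_j | Y=1) = 0.3
f.neg <- n.neg.level / n.neg.total         # f(x_j | Y=0) = 0.025
woe   <- log(f.pos / f.neg)                # log(12), about 2.48: leans positive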
The weight of evidence analysis also provides a measure for evaluating the predictive power of a feature. This measure is called Information Value (IV) and is defined as

IV = Sum_j ( f(x_j | Y=1) - f(x_j | Y=0) ) * WOE(x_j)

The formula can be interpreted as a weighted sum, over the levels of a feature, of the differences between the two likelihoods, where the weight is just the WOE value. For those familiar with information theory, the information value is simply the sum of the KL divergences in both directions: IV = D( f(x_j|Y=1) || f(x_j|Y=0) ) + D( f(x_j|Y=0) || f(x_j|Y=1) ), where the KL divergence measures the difference between two distributions. The underlying intuition is that the more different the two likelihoods f(x_j|Y=1) and f(x_j|Y=0) look, the more predictive power the feature has.
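
Continuing the toy computation above, each level contributes one term to the sum:

# Contribution of the toy level x_j to IV (continuing the example above).
iv.contrib <- (f.pos - f.neg) * woe   # (0.3 - 0.025) * 2.48, about 0.68
# IV is the sum of such contributions over all levels of the feature.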

Weight of Evidence Implementation

Nowadays, several R packages support WOE analysis. One library we have tried is klaR. However, as with most R packages, it requires all the data to be in memory and as a result cannot process large amounts of data. If the data is large and stored in Oracle Database, a good choice is to use Oracle R Enterprise algorithms to calculate the WOE values. Although there is no official API for WOE, we can use the in-database algorithms available in ORE to help with this task.
Note that WOE requires the likelihoods f(x_j|Y=1) and f(x_j|Y=0). Conveniently, these are also intermediate results of Naive Bayes. In ORE, the in-database Naive Bayes function, ore.odmNB, can be called to provide the likelihoods, which can be found in the result of the summary function applied to the Naive Bayes model.
ore.odmNB also provides automatic binning for continuous numerical features. The binning uses a minimum description length method that is optimized for predictive power. One can also turn binning off, if desired, via the auto.data.prep argument.
First, let us see how we calculate the WOE values for categorical features using ORE.

categorical.features <- c('PULocationID', 'DOLocationID', 'RatecodeID', 'payment_type')
categoricalModelFormula <- formula(paste0('high_tip ~ ', paste(categorical.features, collapse = ' + ')))
model.categorical <- ore.odmNB(categoricalModelFormula,
                               data = NYC.train,
                               auto.data.prep = FALSE)   # no binning needed for categorical features
summary.cat <- ore.pull(summary(model.categorical))

As with most R packages, the output of ore.odmNB is the learned model, and we can run summary() to view model details. Let us see what is inside the summary result; we omit some of the output for brevity:

summary.cat$tables
$DOLocationID
...
$PULocationID
...
$RatecodeID
             1            2            3            4
0 9.719172e-01 2.616082e-03 7.895998e-04 5.696033e-04
1 9.928279e-01 2.662321e-03 7.134569e-04 4.618695e-04
$payment_type
             1            2            3            4
0 2.391157e-01 7.536380e-01 4.377760e-03 2.786622e-03
1 9.999850e-01              7.510073e-06 7.510073e-06

We can see that for each feature there is a table of likelihoods. For instance, the column labeled 1 in the RatecodeID table has two values, in rows "0" and "1": these are P(RatecodeID = 1 | high_tip = 0) and P(RatecodeID = 1 | high_tip = 1). Next, we map each level to the corresponding column of the table and compute the WOE values.
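
As a hedged aside, the woe.lookup() helper defined below relies on as.data.frame() flattening one of these tables into long format, one row per (outcome, level) pair with the likelihood in a third column:

# Hedged illustration: flatten one summary table into long format.
# Each row becomes (outcome, level, likelihood), the shape woe.lookup() assumes.
head(as.data.frame(summary.cat$tables['RatecodeID']))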

library(dplyr)   # used for the group_by/do pipeline below

# calculate the WOE for one level: log of the likelihood ratio
make.woe <- function(grp) {
  woe <- grp[grp$OUTCOME == 1, ]$FREQ / grp[grp$OUTCOME == 0, ]$FREQ
  return(log(woe[[1]]))
}
# generate the WOE lookup table for one feature
woe.lookup <- function(df, N.pos, N.neg){
  colnames(df) <- c('OUTCOME', 'LEVEL', 'FREQ')
  # Laplace smoothing: replace a missing likelihood with 1/(class count + 1)
  na.idx <- is.na(df$FREQ)
  df$FREQ[na.idx] <- 1/ifelse(df$OUTCOME[na.idx] == 1, N.pos + 1, N.neg + 1)

  df.woe <- df %>%
    group_by(LEVEL) %>%
    do(WOE = make.woe(.)) %>%
    mutate(WOE = WOE[[1]])
  df.woe
}
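
A quick, hedged check of woe.lookup() on a toy long-format table (the likelihoods are made up):

# Hedged usage check on a toy table.
toy <- data.frame(OUTCOME = c(0, 1, 0, 1),
                  LEVEL   = c('a', 'a', 'b', 'b'),
                  FREQ    = c(0.8, 0.6, 0.2, 0.4))
woe.lookup(toy, N.pos = 100, N.neg = 100)
# LEVEL 'a': log(0.6/0.8), about -0.29 ; LEVEL 'b': log(0.4/0.2), about 0.69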

# generate WOE lookup tables for all features, stored in a list
make.lookup.cat <- function(summary.cat, N.pos, N.neg){
  lookup.cat <- list()
  length.tab <- length(summary.cat$tables)
  for(i in 1:length.tab){
    feature <- names(summary.cat$tables[i])
    # flatten the outcome-by-level table into long format
    df <- as.data.frame(summary.cat$tables[i])
    df.woe <- woe.lookup(df, N.pos, N.neg)
    colnames(df.woe)[colnames(df.woe) == 'WOE'] <- paste0(feature, '_WOE')
    lookup.cat[[i]] <- df.woe
  }
  return(lookup.cat)
}
# join the WOE lookup tables back onto the data, one feature at a time
add.woe.cat <- function(train.woe, lookup.cat, summary.cat){
  length.tab <- length(lookup.cat)
  for(i in 1:length.tab){
    feature <- names(summary.cat$tables[i])
    Lookup <- ore.push(lookup.cat[[i]])
    Lookup$LEVEL <- as.character(Lookup$LEVEL)
    train.woe <- merge(train.woe, Lookup, by.x = feature, by.y = 'LEVEL')
  }
  return(train.woe)
}
N.pos <- nrow(NYC.train[NYC.train$high_tip == 1, ])
N.neg <- nrow(NYC.train[NYC.train$high_tip == 0, ])
lookup.cat <- make.lookup.cat(summary.cat, N.pos, N.neg)
NYC.train.woe <- NYC_TRAIN
N.train <- nrow(NYC.train.woe)
NYC.train.woe$ID <- 1:N.train           # row ID used for merges later
NYC.train.woe <- add.woe.cat(NYC.train.woe, lookup.cat, summary.cat)

In the code snippet, a lookup table is created for each feature, storing each level and its corresponding WOE value in a data frame. Each lookup table is then joined with the original data frame on the level, giving us a new column in which levels are replaced by WOE values. The resulting ore.frame looks like the following:

PULocationID DOLocationID RatecodeID payment_type PULocationID_WOE DOLocationID_WOE RatecodeID_WOE payment_type_WOE
1           75           75          1            1      -0.06174966       -0.4190850     0.02128671        1.4307927
2          264          264          1            1      -0.69524126       -0.8111242     0.02128671        1.4307927
3          225           17          1            1      -0.45636689        0.0312819     0.02128671        1.4307927
4           49           97          1            1       0.16296406        0.2206145     0.02128671        1.4307927
5           41          142          1            1      -0.19541305        0.6142433     0.02128671        1.4307927
6          112           80          1            2       0.73283221        0.7252893     0.02128671        0.2828431

Next, we convert the numerical features. This is done separately because numerical features need to be binned first, and the lookup table is slightly different. The code for this part is:

rownames(NYC.train.woe) <- NYC.train.woe$ID
numerical.features <- c('trip_distance', 'fare_amount')
numericalModelFormula <- formula(paste0('high_tip ~ ', paste(numerical.features, collapse = ' + ')))
model.numerical <- ore.odmNB(numericalModelFormula,
                             data = NYC.train,
                             auto.data.prep = TRUE)   # enables automatic (MDL-based) binning
summary.num <- ore.pull(summary(model.numerical))
length.tab.num <- length(summary.num$tables)

We can see that this summary contains tables like

$fare_amount
    ( ; 3.25]   (3.25;  )
0 0.016610587 0.983389413
1 0.005294601 0.994705399

$trip_distance
    ( ; .235]   (.235;  )
0 0.026999879 0.973000121
1 0.007540113 0.992459887

We can further use the binning information and the WOE values to transform the numerical columns of the original data set. Since this involves more complicated operations, we use ORE embedded R execution, which allows us to run a function over chunks of rows of the ore.frame in parallel. Since embedded R execution does not materialize the data for us, we first create a table with the result so far.

# extract upper and lower bounds from the labels
extract.bounds <- function(s){ 
  s <- gsub("\\(",'',s)
  s <- gsub("\\)",'',s)
  s <- gsub("\\]",'',s)
  parts <- strsplit(s, ";")
  left <- parts[[1]][1]
  right <- parts[[1]][2]
  left <- gsub(' ', '', left)
  right <- gsub(' ', '', right)
  if(left =='') left <- -Inf
  if(right == '') right <- Inf
  c(left, right)
}
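# Example (hedged): extract.bounds("( ; 3.25]") returns c("-Inf", "3.25");
# the strings are converted to numeric in make.lookup.num() below.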
# look up the WOE of the bin each value belongs to
# (vectorized so it is correct for chunks of rows, not just single rows)
bin.woe <- function(DF, feature, df.woe){
  vals <- DF[[feature]]
  DF$bin_temp <- sapply(vals, function(v) df.woe$WOE[df.woe$MIN < v & v <= df.woe$MAX][1])
  DF
}

ore.create(NYC.train.woe, table='NYC_TRAIN_WOE')

# generate a list of WOE lookup tables for the numerical features
make.lookup.num <- function(summary.num, N.pos, N.neg){
  lookup.num <- list()
  length.num <- length(summary.num$tables)
  for(i in 1:length.num){
    df <- as.data.frame(summary.num$tables[i])
    df.woe <- woe.lookup(df, N.pos, N.neg)
    df.woe$LEVEL <- as.character(df.woe$LEVEL)
    # parse the bin labels into numeric MIN/MAX bounds
    bounds <- sapply(df.woe$LEVEL, extract.bounds)
    bounds.df <- as.data.frame(t(bounds))
    colnames(bounds.df) <- c('MIN', 'MAX')
    df.woe <- cbind(df.woe, bounds.df)
    df.woe$MIN <- as.numeric(as.character(df.woe$MIN))
    df.woe$MAX <- as.numeric(as.character(df.woe$MAX))
    lookup.num[[i]] <- df.woe
  }
  return(lookup.num)
}

lookup.num <- make.lookup.num(summary.num, N.pos, N.neg)
add.woe.num <- function(Train.woe, lookup.num, summary.num){
  length.num <- length(lookup.num)
  for(i in 1:length.num){
    feature <- names(summary.num$tables[i])
    Train.woe[, feature] <- as.numeric(as.character(Train.woe[, feature]))
    df.woe <- lookup.num[[i]]
    # output schema of the embedded R function: ID, the feature, and its WOE
    extended.schema <- setNames(data.frame(integer(), numeric(), numeric()),
                                c('ID', feature, 'bin_temp'))
    Feature.woe <- ore.rowApply(Train.woe[, c('ID', feature)],
                                bin.woe,
                                feature = feature,
                                df.woe = df.woe,
                                FUN.VALUE = extended.schema,
                                rows = 1e5,
                                parallel = TRUE)

    Train.woe <- merge(Train.woe, Feature.woe[, c('ID', 'bin_temp')], by.x = 'ID', by.y = 'ID')
    colnames(Train.woe)[colnames(Train.woe) == 'bin_temp'] <- paste0(feature, '_WOE')
  }
  Train.woe
}
NYC_TRAIN_WOE <- add.woe.num(NYC_TRAIN_WOE, lookup.num, summary.num)
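
As a quick, hedged check, we can preview the new WOE columns; column names follow the <feature>_WOE convention used above:

# Hedged sanity check: preview the numerical features and their WOE columns.
head(NYC_TRAIN_WOE[, c('trip_distance', 'trip_distance_WOE',
                       'fare_amount', 'fare_amount_WOE')])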

For processing numerical features, ORE has the advantage of automatic binning. In open source R packages such as klaR, WOE calculation is done only for categorical features, so extra work is needed for binning. Although packages like woe provide binning functions, the binning in ore.odmNB has an advantage for classification because the partitions come from feature splits of a decision tree, making the bins better suited to the prediction task.
After creating the WOE columns for the training set, we can create them for the test set. Since we stored the lookup tables, we can reuse them. The code is as follows:

NYC.test.woe <- NYC_TEST
rownames(NYC.test.woe) <- NYC.test.woe$VendorID
NYC.test.woe$ID <- (N.train + 1):(N.train + nrow(NYC.test.woe))   # IDs continue after the training set
NYC.test.woe <- add.woe.cat(NYC.test.woe, lookup.cat, summary.cat)
ore.create(NYC.test.woe, table = 'NYC_TEST_WOE')
NYC_TEST_WOE <- add.woe.num(NYC_TEST_WOE, lookup.num, summary.num)

Now, we can train the model using the WOE transformed features and evaluate the model on the test set.

all.features <- append(categorical.features, numerical.features)
all.features <- sapply(all.features, function(x){ paste0(x, '_WOE') })
binaryModelFormula <- formula(paste0('high_tip ~ ', paste(all.features, collapse = ' + ')))
model.lr <- ore.odmGLM(binaryModelFormula,
                       data = NYC_TRAIN_WOE,
                       type = "logistic")
pred.ore <- predict(model.lr, NYC_TEST_WOE, type = 'prob', supplemental.cols = 'high_tip')
calculate.AUC <- function(pred.ore){
  prob <- pred.ore[, 2]
  actualClassLabels <- pred.ore[, 1]
  library(ROCR)   # loaded inside the function for embedded R execution
  pred <- prediction(prob, actualClassLabels)
  perfROC <- performance(pred, measure = "tpr", x.measure = "fpr")
  perfAUC <- performance(pred, measure = "auc")
  auc <- perfAUC@y.values[[1]][1]
  return(auc)   # return the AUC explicitly
}

auc <- ore.tableApply(pred.ore,
                      calculate.AUC)
auc

The AUC (Area Under the Curve) we obtained here is 0.9. The performance is great!
It is interesting to ask what would happen if we did not use WOE to train the model. The answer is that we would need extra preprocessing. Notice that we have two high-cardinality features: PULocationID has 241 levels and DOLocationID has 260 levels. ML algorithms such as ore.randomForest require that a categorical feature not exceed 53 levels, and GLM algorithms are very slow on data with high-cardinality features. In such a case, a reasonable alternative is to combine one-hot encoding with the hashing trick and apply ML algorithms that can handle sparse data, such as glmnet or xgboost.
In contrast, WOE provides a convenient way to transform such categorical features into numerical values before training machine learning models, while keeping the model interpretable compared to hashed features.

Information Values

We can also calculate the Information Value using the summary of the ore.odmNB output. The code is as follows.

library(dplyr)   # for the group_by/do pipeline (also loaded earlier)

# difference of the two conditional likelihoods for one level
cond.diff <- function(grp){
  diff <- grp[grp$OUTCOME == 1, ]$FREQ - grp[grp$OUTCOME == 0, ]$FREQ
  return(diff[[1]])
}
compute.iv <- function(lookup.list, summary, N.pos, N.neg){
  length.tab <- length(summary$tables)
  iv <- rep(0, length.tab)
  for(i in 1:length.tab){
    df.woe <- lookup.list[[i]]
    cond <- as.data.frame(summary$tables[i])

    colnames(cond) <- c('OUTCOME', 'LEVEL', 'FREQ')
    # same Laplace smoothing as in woe.lookup()
    na.idx <- is.na(cond$FREQ)
    cond$FREQ[na.idx] <- 1/ifelse(cond$OUTCOME[na.idx] == 1, N.pos + 1, N.neg + 1)
    cond <- cond %>%
      group_by(LEVEL) %>%
      do(COND_DIFF = cond.diff(.)) %>%
      mutate(COND_DIFF = COND_DIFF[[1]])
    df.combined <- merge(df.woe, cond, by = 'LEVEL')
    # IV = sum over levels of (likelihood difference) * WOE
    iv[i] <- sum(df.combined[, 2] * df.combined$COND_DIFF)
  }

  iv.df <- as.data.frame(cbind(names(summary$tables), iv))
  colnames(iv.df)[1] <- 'feature'
  iv.df
}

compute.iv(lookup.num, summary.num, N.pos, N.neg)
compute.iv(lookup.cat, summary.cat, N.pos, N.neg)

The result looks like:

  feature                 iv
1   fare_amount 0.0130676330371582
2 trip_distance 0.0252081334107816
> compute.iv(lookup.cat, summary.cat, N.pos, N.neg)
       feature                 iv
1 DOLocationID  0.391818787450526
2 PULocationID  0.325000696270622
3   RatecodeID 0.0413093811488285
4 payment_type  0.919023041303648 

Information value provides a clue to the predictive power of the features. The dominant feature here is payment_type. One possible reason is that payment_type encodes statuses such as "voided trip" and "dispute", which may indicate that the customer had an unusual experience and thus the tip may not be high. The next two important features are the location features. It is possible that the neighborhood reflects the economic status of the passengers, which in turn is reflected in the tip.
The result from klaR looks broadly similar. The main difference lies in the handling of empty levels: some levels may have zero counts for high_tip = 1. In klaR, the related conditional probability is computed using a surrogate value supplied by the user, while the solution above uses Laplace smoothing, replacing the missing likelihood with 1/(N + 1), where N is the total count of the corresponding class (e.g., 1/(#{high_tip = 1} + 1) for the positive class).
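
For concreteness, a hedged one-line illustration of that smoothing:

# Hedged illustration: a level never observed with high_tip = 1 receives the
# likelihood 1/(N.pos + 1) instead of 0, keeping the WOE logarithm finite.
smoothed.likelihood <- 1 / (N.pos + 1)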

Conclusion

In this blog, we discussed WOE and IV and illustrated how to compute them while leveraging convenient Oracle R Enterprise in-database algorithms. The model successfully predicted high-percentage tips for the NYC green taxi data, as evidenced by the AUC, and the relative importance of each feature was obtained by computing the information value. The ORE-based solution has the advantage of computing the necessary statistics in-database, thereby scaling to large amounts of data. In addition, the ore.odmNB algorithm performs automatic binning, which is convenient for processing numerical features.
