
  • August 20, 2017

Analysing Credit Card default Datasets using Apache Spark and Scala

I started experimenting with the Kaggle dataset "Default Payments of Credit Card Clients in Taiwan" using Apache Spark and Scala.

I tried several techniques, namely plain logistic regression, logistic regression with a weight column, logistic regression with k-fold cross validation, decision trees, random forest and gradient boosting, to see which model performs best.

On inspecting the dataset given in the study, UCI_Credit_Card.csv, I found that it is unbalanced: out of 30,000 records, 23,364 have a default.payment.next.month value of "0" and the remaining 6,636 have "1". A value of 0 means the client paid the instalment and 1 means the client defaulted. The aim of this model is to predict who will default.

The data is clean and has no NA values, so no data cleaning was required. Refer to this link of the UCI Machine Learning Repository for more details about the dataset.

The following are categorical variables:

  • SEX : integer : possible values 1 or 2
  • MARRIAGE : integer : possible values 0, 1, 2, 3
  • AGE : integer : possible values from 21 to 79. Assuming it is in years, as per this link of the UCI Machine Learning Repository describing the dataset, where X5 is Age (year)
  • EDUCATION : integer : possible values 0 to 6
  • PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6 : integer : possible values from -2 to 8

Correlation plot using R rattle.

This shows that the bill amounts are highly correlated with each other, and so are the PAY_* variables. We should remove variables with such high multicollinearity, but to keep this blog simple I am not going into those details here and will address them in future blogs. Unlike multiple linear regression, we do not have to check for a normal distribution.

Install Apache Spark

Download spark-2.2.0-bin-hadoop2.7.tgz from https://spark.apache.org/downloads.html

Extract it.

$ cd spark-2.2.0-bin-hadoop2.7

Keep the CSV file UCI_Credit_Card.csv and the code explained below in this directory.

$./bin/spark-shell

Then run the code manually line by line, or load it inside the shell with :load code.scala

Or run it non-interactively: $ ./bin/spark-shell -i code.scala

Initial Steps

The first thing to do is to import the required packages.
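The exact imports depend on the code you run; a minimal sketch covering the steps below (my assumption, adjust as needed) is:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.{DecisionTreeRegressor, RandomForestRegressor, GBTRegressor}
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, RegressionEvaluator}
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.sql.functions._
// spark.implicits._ (needed for the $"column" syntax) is already imported in spark-shell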

Read the CSV file UCI_Credit_Card.csv. Make sure the header row is read, and let Spark infer the data types, otherwise every column will be read as a string.
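A sketch of the read step, assuming the file sits in the Spark home directory and the data frame is called df (a name I use in the rest of the sketches):

val df = spark.read
  .option("header", "true")       // first row holds the column names
  .option("inferSchema", "true")  // infer column types instead of reading everything as string
  .csv("UCI_Credit_Card.csv")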

Check the schema of the data frame.
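With the df from the sketch above:

df.printSchema()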

The column "ID" is detected as type "integer" but should be of type "string". The rest of the types were detected properly.

Convert "ID" to string. Also rename the column "default.payment.next.month" to "y".
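A sketch of this step, reusing df and calling the result df2 (the names are my assumption, not necessarily the original code's):

val df2 = df
  .withColumn("ID", $"ID".cast("string"))                 // ID as string
  .withColumnRenamed("default.payment.next.month", "y")   // shorter label name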

As the dataset is unbalanced (out of 30,000 records, 23,364 have y equal to 0 and the rest have y equal to 1), we weight the records; refer to this link for the approach.

The classWeightCol is set to 0.7789 when y is 1 and to 0.2212 when y is 0.
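A sketch of the weighting, assuming the renamed data frame df2 from above; the weights are simply the class frequencies computed from the data:

val total    = df2.count().toDouble
val numZeros = df2.filter($"y" === 0).count().toDouble    // 23,364
val weightForOne  = numZeros / total                      // roughly 0.78, applied to the minority class
val weightForZero = 1.0 - weightForOne                    // roughly 0.22, applied to the majority class

val weighted = df2.withColumn("classWeightCol",
  when($"y" === 1, weightForOne).otherwise(weightForZero))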

 

Convert Categorical Variables

Convert the categorical variables using OneHotEncoder (https://spark.apache.org/docs/latest/ml-features.html#onehotencoder). Create a StringIndexer for each column and then run a OneHotEncoder on its output. Put both into a Pipeline and run it to get the modified dataset. Check the schema again to observe that columns with "_Vec" in their names have been created. Then combine the useful columns into a single column named "features", on which the models will be run, and check the schema once more to see the new "features" column.
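A sketch of the encoding and assembly, assuming the weighted data frame from the previous step; the column lists below follow the categorical variables named earlier but are still my assumption about the original code:

val categoricalCols = Array("SEX", "EDUCATION", "MARRIAGE", "AGE",
  "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
val numericCols = Array("LIMIT_BAL",
  "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
  "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6")

// StringIndexer followed by OneHotEncoder for every categorical column
val encodingStages: Array[PipelineStage] = categoricalCols.flatMap { c =>
  Seq[PipelineStage](
    new StringIndexer().setInputCol(c).setOutputCol(c + "_Index"),
    new OneHotEncoder().setInputCol(c + "_Index").setOutputCol(c + "_Vec"))
}

// Combine the numeric columns and the encoded vectors into a single "features" column
val assembler = new VectorAssembler()
  .setInputCols(numericCols ++ categoricalCols.map(_ + "_Vec"))
  .setOutputCol("features")

val prepared = new Pipeline()
  .setStages(encodingStages :+ assembler)
  .fit(weighted)
  .transform(weighted)

prepared.printSchema()   // the *_Vec columns and "features" should now appear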

Split test and training data, create evaluators

Split the data into training (80%) and test (20%) sets, or use a 70-30 ratio. Create a BinaryClassificationEvaluator that looks at "rawPrediction" and uses the metric "areaUnderROC", and a RegressionEvaluator that looks at "prediction" and uses the metric "rmse". You can also try other metrics such as "mse", "r2" or "mae".
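A sketch, assuming the prepared data frame from the previous step:

// 80/20 split (a fixed seed only makes the run repeatable)
val Array(training, test) = prepared.randomSplit(Array(0.8, 0.2), seed = 42)

val binaryEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("y")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

val regressionEvaluator = new RegressionEvaluator()
  .setLabelCol("y")
  .setPredictionCol("prediction")
  .setMetricName("rmse")   // "mse", "r2" or "mae" also work here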

Try different models like Simple Logistic Regression, Logistic Regression with Weight column, Logistic Regression with K fold cross validation, Decision trees, Random forest and Gradient Boosting

Run logistic regression and print the area under the ROC curve and the RMSE. With logistic regression using the weight column, we get a better area under the ROC curve than with plain logistic regression. Then try 10-fold cross validation, which again gives a better area under the ROC curve than the plain model. Create a parameter grid with regularization parameters 0.05, 0.1 and 0.2 and maximum iterations 5, 10 and 15. Create a DecisionTreeRegressor with maximum bins 32 and maximum depth 5, try a random forest, and create a gradient boosting regressor with maximum iterations 10.
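A sketch of these model runs under the assumptions above (training, test and the two evaluators from the previous step); this is one way to wire the pieces together, not necessarily the original code:

// Plain logistic regression
val lr = new LogisticRegression().setLabelCol("y").setFeaturesCol("features")
val lrPred = lr.fit(training).transform(test)
println("LR            AUC  = " + binaryEvaluator.evaluate(lrPred))
println("LR            RMSE = " + regressionEvaluator.evaluate(lrPred))

// Logistic regression with the class-weight column
val lrW = new LogisticRegression()
  .setLabelCol("y").setFeaturesCol("features").setWeightCol("classWeightCol")
val lrWPred = lrW.fit(training).transform(test)
println("Weighted LR   AUC  = " + binaryEvaluator.evaluate(lrWPred))

// 10-fold cross validation over a small parameter grid
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.05, 0.1, 0.2))
  .addGrid(lr.maxIter, Array(5, 10, 15))
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(binaryEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)
val cvPred = cv.fit(training).transform(test)
println("CV LR         AUC  = " + binaryEvaluator.evaluate(cvPred))

// Tree-based regressors, compared on RMSE
val dtPred = new DecisionTreeRegressor()
  .setLabelCol("y").setFeaturesCol("features").setMaxBins(32).setMaxDepth(5)
  .fit(training).transform(test)
println("Decision tree RMSE = " + regressionEvaluator.evaluate(dtPred))

val rfPred = new RandomForestRegressor()
  .setLabelCol("y").setFeaturesCol("features")
  .fit(training).transform(test)
println("Random forest RMSE = " + regressionEvaluator.evaluate(rfPred))

val gbtPred = new GBTRegressor()
  .setLabelCol("y").setFeaturesCol("features").setMaxIter(10)
  .fit(training).transform(test)
println("GBT           RMSE = " + regressionEvaluator.evaluate(gbtPred))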

K-fold cross validation is a popular resampling technique that trains and tests the model k times on different subsets of the training data. For more details on k-fold cross validation, refer to this link on 7 Important Model Evaluation Error Metrics Everyone Should Know.

Output

Here is the output from the program with Area Under ROC curve and RMSE for different models.

Comparison of RMSE for different methods 

What is RMSE and why is it an important factor in choosing a model?

Root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed. It is the square root of the mean of the squared errors: RMSE <- sqrt(mean((y - yPred)^2))
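The same quantity can be checked directly on a predictions data frame in Scala; lrPred here is just the logistic regression example from the sketch above:

val rmse = math.sqrt(
  lrPred.select(pow($"y" - $"prediction", 2.0).alias("sqErr"))
        .agg(avg("sqErr"))
        .first()
        .getDouble(0))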

For more details on the area under the ROC curve, RMSE and k-fold cross validation, refer to this link on 7 Important Model Evaluation Error Metrics Everyone Should Know.
 

Comparing the outputs of all the models tried above 

Gradient boosting shows the minimum RMSE.

Will blog more about this dataset later.

I have posted this at my blogging site also.

