Not Just Machine Learning: 3 Lessons Learned Working with Credit Risk Datasets

Credit risk datasets matter to financial regulators, central banks, and governments because they help prevent financial crises. The bank systems that produce and process these datasets perform large numbers of calculations and predictions, and implementing those systems requires testing. A key objective of that testing is to reconcile the actual outputs from those systems with the expected outputs. In the big data era, these datasets frequently contain hundreds of thousands of records and can run to tens or hundreds of millions. Because manually checking actual against expected outputs is not feasible at that scale, automated testing solutions are the only realistic option.

Having worked at various multinational banks and the UK financial regulator, I have built several automation solutions in the credit risk space. I also have experience leading a team using a credit risk dataset for predictive modeling in competitive data science.

In this article, I will go over three things I learned from using credit risk datasets for automated testing and predictive modeling.

1. Unique IDs Make Joining Supplemental Data Vastly Easier 

In one particular greenfield project, the data I received from an upstream system initially arrived without a unique identifier. A unique identifier is a value that distinguishes each record within a database table and enables efficient linking between tables. For the automated testing solutions that I implemented, the objective was always to compare actual results with expected results. I developed the tools to do this first at the lowest, most granular level, which meant comparing specific records without any prior aggregation. The tool tested aggregated values only after the granular checks were complete.

While it’s true that a unique identifier is not directly useful for modeling, it is useful for operational activities such as testing, reporting, and reconciliation. At one point, I had to use 15 fields to create a unique join between two tables in the test system. Although this worked, the analysis was cumbersome and the processing was slower; a single ID would have enabled the same join. Along with the main credit risk inputs, the system under test took over 20 other parameter input files.
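To make the contrast concrete, here is a minimal sketch of a record-level reconciliation keyed on an identifier. It is an illustration only, not the tool described above: the field names `loan_id` and `expected_loss`, the sample values, and the helper names are all hypothetical. The same function accepts a long composite key, which is what joining on 15 fields amounts to, but every extra key field is one more thing to extract, align, and debug.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Record = Dict[str, object]


def index_by_key(records: List[Record], key_fields: Sequence[str]) -> Dict[Hashable, Record]:
    """Index records by a key built from one or more fields; reject duplicate keys."""
    index: Dict[Hashable, Record] = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key in index:
            raise ValueError(f"Key {key!r} is not unique")
        index[key] = record
    return index


def reconcile(
    expected: List[Record],
    actual: List[Record],
    key_fields: Sequence[str],
    value_field: str,
    tolerance: float = 1e-9,
) -> List[Tuple[Hashable, Optional[float], Optional[float]]]:
    """Return (key, expected_value, actual_value) for each mismatch or missing record."""
    expected_index = index_by_key(expected, key_fields)
    actual_index = index_by_key(actual, key_fields)
    breaks = []
    for key in sorted(expected_index.keys() | actual_index.keys()):
        exp = expected_index.get(key)
        act = actual_index.get(key)
        exp_value = exp[value_field] if exp is not None else None
        act_value = act[value_field] if act is not None else None
        if exp_value is None or act_value is None or abs(exp_value - act_value) > tolerance:
            breaks.append((key, exp_value, act_value))
    return breaks


# Hypothetical sample data: expected results from a test harness, actual from the system under test.
expected = [
    {"loan_id": "L001", "expected_loss": 120.0},
    {"loan_id": "L002", "expected_loss": 75.5},
    {"loan_id": "L003", "expected_loss": 10.0},
]
actual = [
    {"loan_id": "L001", "expected_loss": 120.0},
    {"loan_id": "L002", "expected_loss": 80.0},
]

# With a unique ID, the join key is a single field.
breaks = reconcile(expected, actual, key_fields=["loan_id"], value_field="expected_loss")
for key, exp_value, act_value in breaks:
    print(key, exp_value, act_value)
```

Without an ID, `key_fields` would have to list every field needed to make a record unique, and the uniqueness check in `index_by_key` becomes the first thing likely to fail.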

In preparing early synthetic versions of this data, it was important to ensure that all necessary combinations of parameter data were provided to the financial stress testing engine. This required a good understanding of the function of each parameter input. Improving the design of the datasets, especially by providing IDs, made this easier. The extract from the upstream system later included these IDs as the design and processes matured from the initial greenfield state.

When designing credit risk datasets, particularly when selecting fields from existing systems, it is prudent to take a broader view of how a unique ID could be useful and to add it to the dataset. It is easier to add ID fields early on than to add them later, when more complex operational pipelines are already in place.

2. Data Subset Selection Techniques Facilitate Good Test Coverage in a Reasonable Timeframe

Greenfield projects in credit risk may not have a ready-made, reliable source of expected results against which to check the model output. Consequently, it may be necessary to build test harnesses. These test tools can test various types of scenarios, such as financial stress testing, but may lack the hardware capacity to do so on full-sized datasets. Because they are functional test tools, they may be desktop-based and are unlikely to have the same processing or memory capacity as the system under test. The cost of enabling a functional test harness to match the processing power of a production system can be significant and is not always justified.

In response, I developed a tool with the objective of achieving maximal coverage of business scenarios using a minimal subset of the data provided. The tool took a parameterizable list of categorical fields defined by the user. It looped through all records and selected a record for the subset only when that record contained a value, in one of the listed categorical fields, that was not already present in the subset. The result was a subset of the original data in which every categorical value found in the full dataset appeared at least once.

This data subset allowed comprehensive testing of scenarios across all categorical values. Depending on the categories selected, the tool achieved coverage of a complete selection of realistic business scenarios using as little as 1% to 4% of the original data. Without this data subset selection approach, or something similar, it would not have been possible to cover the same breadth of test scenarios. Testing with the full dataset would have required processing the data in batches through the automation test tool, which would have increased the processing time to the point of being infeasible.
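The selection loop can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the original tool: the function name, the field names `product` and `region`, and the sample records are hypothetical, and records are assumed to be plain dictionaries held in memory.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Record = Dict[str, object]


def select_covering_subset(records: Iterable[Record], category_fields: Sequence[str]) -> List[Record]:
    """Keep a record only if it adds a (field, value) pair not yet seen in the subset."""
    seen: Set[Tuple[str, object]] = set()
    subset: List[Record] = []
    for record in records:
        pairs = {(field, record[field]) for field in category_fields}
        if not pairs <= seen:  # at least one categorical value is new
            subset.append(record)
            seen |= pairs
    return subset


# Hypothetical sample data.
loans = [
    {"loan_id": "L001", "product": "mortgage", "region": "UK"},
    {"loan_id": "L002", "product": "mortgage", "region": "UK"},
    {"loan_id": "L003", "product": "card", "region": "UK"},
    {"loan_id": "L004", "product": "card", "region": "EU"},
    {"loan_id": "L005", "product": "mortgage", "region": "EU"},
]

subset = select_covering_subset(loans, category_fields=["product", "region"])
print([record["loan_id"] for record in subset])
```

Note that this sketch covers every individual value of each listed field, not every combination of values. In the sample above, the mortgage and EU pairing in `L005` is skipped because both values are already represented.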

This data reduction approach did not preserve the distribution of the full dataset in the way that stratified sampling would, but it ensured that each unique categorical value was present in the reduced dataset. Any approach of this nature should be treated with caution for production data science modeling; however, it was very useful for its intended purpose of testing. It could also have some benefits for prototype modeling or for functional testing of early versions of machine learning (ML) models.

Data subset selection techniques can help ensure good test coverage when the supplied data is more than sufficient for functional testing purposes. Without this approach, it will take a significant amount of time to cover a wide breadth of test scenarios, or significant cost to scale up the processing power of the test harness.

3. Cross-Validation Techniques Can Lead to Substantial Improvements in Evaluating Predictive Modeling Efficacy

I recently had the privilege of leading a team of talented people in Kaggle’s largest featured competition to date, where the objective was to predict the relative likelihood of default for a dataset of cash loans and revolving loans. In the competition, the team used a stratified k-fold cross-validation (CV) approach with a constant seed. In k-fold CV, the training data is separated into k folds and the model is run k times. Each time, the model is trained on all the training data except one fold. The excluded fold changes each time and is treated as a validation set on which predictions are made. Those predictions can be stored before the next fold is run. At the end of this process, predictions are available for the whole of the training set.

In the competition, the dataset was unbalanced. This means that there was not an even split between the number of people recorded as having defaulted (i.e., failed to pay on their loans) and those not in default (i.e., people still paying or who had already paid off the loan). Less than 10% were in default. We decided that a stratified approach to k-fold CV was prudent since it meant that the unbalanced distribution of the whole would be reflected in each fold.
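The mechanics can be sketched in plain Python. This is not the team's competition code, which is not shown in this article; it is a minimal illustration in which the function names, the seed value of 42, the synthetic 10% default rate, and the placeholder "model" (which simply predicts the default rate of its training rows) are all assumptions. In practice, a library splitter such as scikit-learn's `StratifiedKFold` would normally do the fold assignment.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Row = List[float]
Predictor = Callable[[Row], float]


def stratified_fold_ids(labels: Sequence[int], k: int, seed: int) -> List[int]:
    """Assign each row a fold number so every fold keeps roughly the overall class mix."""
    rng = random.Random(seed)  # a constant seed reproduces the same folds on every run
    rows_by_class: Dict[int, List[int]] = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)
    fold_ids = [0] * len(labels)
    for label in sorted(rows_by_class):
        rows = rows_by_class[label]
        rng.shuffle(rows)
        for position, row in enumerate(rows):
            fold_ids[row] = position % k  # deal each class's rows round-robin across folds
    return fold_ids


def out_of_fold_predictions(
    features: Sequence[Row],
    labels: Sequence[int],
    k: int,
    seed: int,
    fit: Callable[[List[Row], List[int]], Predictor],
) -> List[float]:
    """Train k times, each time predicting only the fold that was held out."""
    fold_ids = stratified_fold_ids(labels, k, seed)
    oof = [0.0] * len(labels)
    for fold in range(k):
        train_rows = [i for i, f in enumerate(fold_ids) if f != fold]
        valid_rows = [i for i, f in enumerate(fold_ids) if f == fold]
        predict = fit([features[i] for i in train_rows], [labels[i] for i in train_rows])
        for i in valid_rows:
            oof[i] = predict(features[i])  # stored before the next fold is run
    return oof


def fit_base_rate(train_features: List[Row], train_labels: List[int]) -> Predictor:
    """Placeholder model: always predicts the default rate seen in training."""
    rate = sum(train_labels) / len(train_labels)
    return lambda row: rate


# Synthetic, unbalanced data: 10 defaults (1) and 90 non-defaults (0).
labels = [1] * 10 + [0] * 90
features = [[float(i)] for i in range(100)]

fold_ids = stratified_fold_ids(labels, k=5, seed=42)
oof = out_of_fold_predictions(features, labels, k=5, seed=42, fit=fit_base_rate)

for fold in range(5):
    members = [i for i, f in enumerate(fold_ids) if f == fold]
    print(fold, len(members), sum(labels[i] for i in members))
```

Every row receives exactly one prediction, made by a model that never saw that row during training, and each fold carries the same share of defaults as the full dataset.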

The significance of this CV approach in the competition was that it provided a more accurate guide to how well our models would generalize to data where the target was unseen. It also allowed more accurate comparisons between models. Alternative approaches, especially those adopting a single holdout set, did not fare as well. We also learned that using differing stratified k-fold seeds across base-level models made stacking those models less effective. The CV approach described proved effective in practice: the team finished 9th of 7,198 teams in the competition.

It is in a bank’s best interest to model the probability of each customer repaying their loan as effectively as possible. This can help banks make key decisions about that loan, starting from the point of application. Decisions concerning the interest rate to charge, the maximum loan amount, and the length of the loan are all affected. These and other key conditions allow a bank to manage its risk and maximize its profit. Having a CV approach that gives a more reliable estimate of performance on unseen data, such as the one I have described, can improve this important decision-making process.

CV is slower than simpler validation techniques such as a single holdout set, and it is not always used in industry, but it is a relatively efficient alternative when collecting more data is not possible or cost-effective. It also allows all available training data to be used without having to reserve some solely for validation.


Credit risk datasets have multiple uses in industry. Machine learning models use them, and so do testing, reporting, and reconciliation tasks. Each of those tasks uses the data in a different way to best serve its own requirements, but they all benefit from appropriate design, sourcing, selection, and utilization. As with many things in life, success stems from proper planning.
