Improving the art, craft and science of economic credit risk scorecards using random forests: why credit scorers and economists should use random forests.

Author: Sharma, Dhruv


The aim of this paper is to outline an approach to improving credit risk scorecards using Random Forests. We start with the benefits of random forests compared to logistic regression, the tool used most often for credit scoring systems. We then compare the out-of-the-box performance of random forests and logistic regression on a credit card dataset, a home equity loan dataset, and a proprietary dataset. We outline an approach to improving logistic regression using the random forest. We conclude by demonstrating how random forests can be used to develop a model using 8 variables which is almost as good as the FICO® score, highlighting the fact that datasets with complex interaction terms can benefit from random forest models in two ways: 1) clear insight into the most predictive and valuable variables, and 2) robust models which exploit predictive interactions and relationships in the data not detectable by traditional regression techniques.

For the purpose of this study, model performance will be compared using Receiver Operating Characteristic (ROC) curves, which plot the proportion of bad loans detected vs. incorrectly classified good loans at each model cutoff. Numerically this is summarized by the area under the ROC curve. All performance discussed is out-of-sample performance on a 30% holdout sample, with models built on the remaining 70% of the dataset. All investigations are conducted using R and the Rattle tool.
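
The evaluation protocol above can be sketched in a few lines. This is a hedged illustration only: the paper's datasets are not public, so a synthetic dataset stands in for them, scikit-learn is assumed available, and all parameter choices (sample size, number of trees) are illustrative rather than the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the credit datasets (which are not public);
# y = 1 marks a "bad" loan.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# 70% build sample / 30% holdout sample, as in the paper's protocol.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

aucs = {}
for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(n_estimators=500,
                                                            random_state=0))]:
    model.fit(X_tr, y_tr)
    # Area under the ROC curve on the 30% holdout sample.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: holdout AUC = {aucs[name]:.3f}")
```

The same protocol can be reproduced in R/Rattle; the point is only that each model is fit on 70% of the data and scored by holdout AUC.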


The biggest problem with traditional credit scoring based on logistic regression is that, as a scientist or economist, one cannot reliably interpret the importance of the underlying variables to the probability of a borrower experiencing financial difficulty.

The p-values of the regression are not reliable because regression assumes no multicollinearity. As a result, variables which make sense from a theoretical point of view, such as cash flow surrogates, and which may have strong predictive power, can fail to appear statistically significant based on p-values. This is a problem because credit data is notoriously correlated and biased. It is well known that biased estimation '... has been shown to predict and extrapolate better when predictor variables are highly correlated ...', a situation common in credit scoring (Overstreet, 1992).

Although modelers have used skill and judgment to work around this shortcoming, there is no way in traditional scorecards to assess the predictive value of variables in a robust and reliable manner. Thus many valuable variables and variable interactions may be overlooked given the limitations of the current tool.

Also, from a human factors and organizational point of view, people are biased toward testing theories they already hold rather than trying things that might not seem to make sense. Our ability to develop causal models is biased and arbitrary, despite the meanings we attach to things after the fact.

The history of credit scoring literature is rife with contradictory studies, from Durand's first study in the 1930s onward, on questions such as whether income is predictive. Yet mortgage risk models have shown the debt ratio (monthly expenses/income) to be predictive, as well as months' reserves (liquid assets/monthly payment). The successes of credit scoring in the mortgage industry show that financial worth and ability-to-pay variables can be used effectively in models, along with loan to value (loan amount/property value), to assess risk. If we step back we can see that interaction variables of affordability and credit risk have proven to be valuable predictive tools. This is also consistent with the judgmental theory of credit: character (willingness to pay), capacity (ability to pay), and collateral.
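
The three affordability ratios just mentioned are simple to compute; the sketch below restates them as code. The field names and sample figures are illustrative, not taken from the paper.

```python
def debt_ratio(monthly_expenses, monthly_income):
    """Debt ratio: monthly expenses / income."""
    return monthly_expenses / monthly_income

def months_reserves(liquid_assets, monthly_payment):
    """Months' reserves: liquid assets / monthly payment."""
    return liquid_assets / monthly_payment

def loan_to_value(loan_amount, property_value):
    """Loan to value: loan amount / property value."""
    return loan_amount / property_value

# Illustrative borrower: $2,100 expenses on $6,000 income, $12,000 in
# liquid assets against a $1,500 payment, $160,000 loan on a $200,000 home.
print(debt_ratio(2100, 6000))        # 0.35
print(months_reserves(12000, 1500))  # 8.0
print(loan_to_value(160000, 200000)) # 0.8
```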

The next leap in improvement to credit scoring is to find ways to test interaction terms in a meaningful and principled way. It stands to reason econometrically that if any variable should have an impact on human behavior in spending, consumption, and financial distress, it should be ability to pay. The measures of this are income, current debt usage, and the reserves and assets one has saved to absorb shocks or life events.

Is there a statistically reliable way to test the importance of variables relative to their predictive power?

Importance of Random Forests to Credit Risk and Economics in general

To date, the majority of credit scorecards used in industry are linear models, despite the known issues of the flat maximum and multicollinearity (Wainer, 1978; Overstreet et al., 1997). Random Forests are a powerful tool for economic science as they are able to deal successfully with correlated variables with complex interactions (Breiman, 2001).

A simple example of the power of Random Forests was shown by Breiman in the binary prediction case of hepatitis mortality, in which Stanford medical school had identified variables 6, 12, 14, and 19 as most predictive of risk using logistic regression. Subsequently, using the bootstrap technique, Efron showed that none of these variables was significant in the random resampling trials he ran. The Random Forest variable importance measure, created by Breiman, showed variables 7 and 11 to be critical, and improved the logit regression results by simplifying the model and reducing error from 17% to 12% (Breiman, 2002).

As Random Forests are non-parametric, the linear restrictions of the flat maximum do not come into play as such. That said, predictive models tend to perform well with regard to Pareto-optimal trade-offs between true positive and false positive rates, which approach an asymptote resembling the flat maximum effect. The interactions of economic variables such as macroeconomic forces and affordability are too complex to be studied with simple linear regression anymore. Random Forests serve as a good estimate of the asymptote of possible predictive power in this regard, and help us get past the psychological limit we may believe to exist for predictive power, much as Roger Bannister did with the preconceived limit on the minimum time for running the mile.

The way Random Forests work, building large quantities of weak classifiers on random selections of variables, grown with out-of-sample testing, is analogous to the way humans make decisions in a marketplace (see Gigerenzer's work on "fast and frugal trees" in human judgment models). Humans each look at the data available to them, make quick inferences, and take actions based on these data. Random Forests then take votes from these large quantities of predictors and use the decisions of all the predictors to make the final decision. The fact that diverse models, built on different variables and samples of data, outperform simple linear models when combined is profound.
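
The power of voting over many weak, diverse classifiers can be seen with a toy simulation (this is an illustration of the voting principle only, not Breiman's algorithm): if each of 101 simulated classifiers is independently right 60% of the time, the majority vote is right far more often.

```python
import random

random.seed(0)
n_classifiers, n_cases, p_correct = 101, 2000, 0.60

votes_correct = 0
for _ in range(n_cases):
    # Each weak classifier is independently correct with probability 0.60.
    correct_votes = sum(random.random() < p_correct for _ in range(n_classifiers))
    # The ensemble's answer is the majority vote.
    if correct_votes > n_classifiers // 2:
        votes_correct += 1

print(f"single classifier accuracy: {p_correct:.2f}")
print(f"majority-vote accuracy:     {votes_correct / n_cases:.2f}")
```

In a real Random Forest the trees' errors are correlated rather than independent, which is why the forest randomizes both the training sample (bagging) and the variables considered at each split: the more diverse the trees, the closer the ensemble gets to this idealized voting gain.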

That said, the critical aspects of Random Forests of interest to economic scientists are the features Breiman intended, such as:

* Random Forests never overfit the data, as they are built with out-of-sample (out-of-bag) testing for each submodel

* Variable importance (a measure of the contribution each variable makes to the model's overall accuracy, based on permutation tests that scramble each variable's values)

* Being able to see the effects of variables on predictions (Breiman, 2002).

* Handling thousands of variables efficiently by sampling variables.
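
The variable importance measure in the list above can be sketched directly: permute one variable at a time on the holdout sample and measure the drop in accuracy. This is a hedged illustration on synthetic data (scikit-learn assumed available), not the exact out-of-bag computation inside Breiman's algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 3 informative variables out of 6.
X, y = make_classification(n_samples=1500, n_features=6, n_informative=3,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)
base = rf.score(X_te, y_te)          # baseline holdout accuracy

rng = np.random.default_rng(1)
importances = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])        # destroy variable j's information
    # Importance = how much accuracy drops when variable j is scrambled.
    importances.append(base - rf.score(X_perm, y_te))

for j, imp in enumerate(importances):
    print(f"variable {j}: importance = {imp:.3f}")
```

Informative variables show a large accuracy drop when permuted; noise variables show a drop near zero. Unlike regression p-values, this measure remains meaningful when predictors are correlated.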

Random Forests help us see the true impact of complex, interrelated variables. As Breiman mentioned in his Wald lectures, complex phenomena cannot be modeled well with simplified goodness-of-fit models. A more scientific approach is to build as complex a model as needed to fit the phenomenon being studied, and then to use tools like variable importance to understand the relationships inside the phenomenon (Breiman, 2002). This is an important point, as economics is based on more and more complex realities.

Comparison of Random Forests to Logistic Regression

We now examine random forest performance out of the box on 3 datasets. The first is a private label credit card dataset from the 2010 KDD contest in Pacific Asia, the second is the widely used home equity loan dataset, and the third is a proprietary dataset.

  1. Random Forest vs. Logistic Regression on Credit Card...
