3. Modeling Probability of Default

Modeling the probability that a loan defaults is critical to our investment strategy. After our exploration of the 2013 and 2014 data, we select predictors that show promising separation between defaulted and repaid loans. We also make sure not to use any predictors capturing information that would not have been available at the time of selecting which loans to invest in.


Summary of Variables

Response:

A newly defined binary variable, paid, indicating whether a loan's final status is ‘Fully Paid’; loans with any other status are treated as defaulted.

Predictors:


Handling Imbalanced Classes

The machine learning algorithms we select to predict whether future loans default have trouble learning to predict underrepresented classes. As seen in the table below, the defaulted class is underrepresented in the training data. To address this class imbalance, we use the Synthetic Minority Over-sampling Technique (SMOTE). For each minority class instance, SMOTE finds its k nearest neighbors within the minority class; depending on the amount of oversampling required, one or more of those neighbors are selected and used to interpolate synthetic examples that augment the training dataset. We oversample the defaulted loan class to achieve a 50-50 split of classes using the imbalanced-learn library.

            number of loans   proportion of loans   SMOTE number of loans   SMOTE proportion of loans
repaid      289177            0.78                  289177                  0.50
defaulted    81266            0.22                  289177                  0.50
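
As a rough sketch of this step, the oversampling can be done with imbalanced-learn's SMOTE class; here X_train and y_train are placeholder names for our feature matrix and binary response, with 1 marking a defaulted loan (both names are assumptions, not the exact ones in our code):

```python
from imblearn.over_sampling import SMOTE

# X_train / y_train are placeholder names for the feature matrix and
# binary response (1 = defaulted, 0 = repaid); both are assumptions.
# sampling_strategy=1.0 oversamples the minority class to a 50-50
# split; k_neighbors=5 is the library default for neighbor selection.
smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```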

Machine Learning Models

Our goal is to achieve highly accurate predictions of the probability of a loan defaulting with our selected features, and to use this predicted probability as an input to our investment strategy. We use a suite of models to predict this probability and conduct hyperparameter searches using grid search. The models we explore are listed below with their tuned hyperparameters.
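
As an illustration of the search setup, a scikit-learn grid search for the logistic regression model might look like the sketch below; the parameter grid shown is an assumption, not the exact grid we searched:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative grid only; these hyperparameter values are assumptions.
# X_train_res / y_train_res are the SMOTE-balanced data from above.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid, scoring="roc_auc", cv=5,
)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_, grid.best_score_)
```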

The baseline accuracy, i.e. the accuracy achieved by always predicting the majority class (repaid loans), is 0.82 on the training set and 0.78 on the test set. While most of our models do not reach baseline accuracy, we attribute this to discarding variables we identified as potentially containing information about protected classes.
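
Concretely, this baseline is just the majority-class frequency, which can be checked in one line (y_train and y_test are the assumed binary response vectors):

```python
import numpy as np

# Baseline = frequency of the majority class; y_train / y_test are
# assumed binary response vectors (1 = defaulted, 0 = repaid).
baseline_train = max(np.mean(y_train), 1 - np.mean(y_train))
baseline_test = max(np.mean(y_test), 1 - np.mean(y_test))
```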

As displayed in the table below, the highest test accuracy is achieved by the neural network, followed by QDA. However, QDA achieves its high test accuracy by predominantly predicting that loans are repaid (only 0.07% of loans are predicted to default). A closer look at QDA’s training predictions reveals a similar pattern: only 3.4% of loans are predicted to default, despite 50% of the SMOTE-balanced training loans being true defaults. Given this trend, we do not trust that QDA has learned the nuances of our training data, so we take the next contender, logistic regression, forward to our return on investment analysis, along with the neural network.

Heuristically, it mattered to us that our models be able to assign distinctly different ranges of default probabilities to loans that went on to default compared with those that were repaid successfully. We considered the probability distributions associated with our two outcome classes in the training data, aware that we should not read too much into this since our models may well be overfit.
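
A sketch of how this comparison can be plotted; model, X_train_res, and y_train_res are placeholders for a fitted classifier and the balanced training data:

```python
import matplotlib.pyplot as plt

# Overlay predicted default probabilities for the two true outcome
# classes; `model` is any fitted classifier with predict_proba, and
# y_train_res uses 1 = defaulted, 0 = repaid (assumptions as above).
p_default = model.predict_proba(X_train_res)[:, 1]
plt.hist(p_default[y_train_res == 1], bins=50, alpha=0.5,
         density=True, label="defaulted")
plt.hist(p_default[y_train_res == 0], bins=50, alpha=0.5,
         density=True, label="repaid")
plt.xlabel("predicted probability of default")
plt.legend()
plt.show()
```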

Financial data is notoriously noisy, so it was promising that our models were generally able to differentiate between the two outcomes based on our input data. We were also interested in quantifying the predictive power of our models, so we used ROC AUC scores alongside accuracy scores to make our final decision on which models to take forward. As this is a highly imbalanced problem, we sought to confirm that our best contenders from an accuracy standpoint also maximize the area under the ROC curve. As shown in the table, the highest test AUC is achieved by the logistic regression model, followed by the neural network, confirming our selection.
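
A minimal evaluation helper along these lines might look as follows; evaluate is a hypothetical name, and both metrics come from scikit-learn:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical helper: score a fitted model on a held-out set.
# AUC is computed from predicted probabilities, accuracy from the
# 0.5-thresholded labels; y uses 1 = defaulted (assumption as above).
def evaluate(model, X, y):
    p_default = model.predict_proba(X)[:, 1]
    labels = (p_default >= 0.5).astype(int)
    return accuracy_score(y, labels), roc_auc_score(y, p_default)
```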


Training and test accuracy and AUC by model

Model                             Training Accuracy   Training AUC   Test Accuracy   Test AUC
Logistic Regression               0.655               0.713          0.746           0.724
Neural Network                    0.676               0.708          0.781           0.714
LightGBM                          0.786               0.702          0.702           0.641
Random Forest                     0.694               0.720          0.718           0.637
AdaBoost                          0.896               0.960          0.663           0.622
Quadratic Discriminant Analysis   0.535               0.535          0.778           0.502

Probability of Default Distributions for Training Data

[figure]


Area under the receiver operating characteristic (ROC) curve for training and test data

[figure]


Analysis

From these results, it was clear that we should take the logistic regression model and our MLP forward as our final models. Given how noisy financial data is, the fact that both models achieved accuracy scores close to or above the baseline, and that both maintained their AUC from the training set to the test set, was certainly promising as we looked to use them to predict return on investment.