Our first goal was to accurately model the probability of default of a given loan while blinding ourselves to features that contain information about protected classes (race, gender, etc.).
We achieved an improvement of 1 percentage point in testing accuracy over the baseline using a tuned multi-layer perceptron.
We consider this a success given how difficult it is to discriminate between the two classes (repaid vs. defaulted loans) using obvious predictors, as shown in the initial data exploration. We also addressed the class imbalance using synthetic methods, which allowed our model to capture the nuances of the underrepresented class (defaulted loans).
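To make the idea of synthetic rebalancing concrete, here is a minimal sketch that creates new minority-class rows by interpolating between existing ones. It is a simplified, SMOTE-like illustration and an assumption on our part, not necessarily the method used in this project; the function name `oversample_minority` and the two example features (a debt-to-income ratio and a loan amount) are hypothetical.

```python
import random


def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic rows by interpolating between pairs of minority rows.

    Simplification (assumption): real SMOTE interpolates toward nearest
    neighbours; here any other minority row may be used as the partner.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority rows
        t = rng.random()                # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Hypothetical defaulted loans: [debt-to-income ratio, loan amount]
defaulted = [[0.30, 15000.0], [0.35, 12000.0], [0.40, 20000.0]]
new_rows = oversample_minority(defaulted, n_new=5)
```

Each synthetic row lies on a line segment between two real defaulted loans, so the new rows stay inside the range of the observed minority class rather than being exact duplicates.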
Our second goal was to choose a subset of loans that would maximize our return on investment (ROI), given the predicted probability of default of loans.
We formalized our understanding of return on investment to define a real ROI metric that allowed us to rank loans and select the top subset to invest in.
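The ranking step can be sketched as follows. The expected-ROI formula below (interest earned if the loan is repaid, principal lost net of recoveries if it defaults) is an illustrative assumption, not necessarily the metric defined in this project, and `Loan`, `expected_roi`, and `select_top` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Loan:
    loan_id: str
    amount: float     # A: loan amount
    interest: float   # I: interest rate as a fraction, e.g. 0.10
    p_default: float  # P_d: predicted probability of default
    recovery: float   # R: amount expected to be recovered on default


def expected_roi(loan: Loan) -> float:
    """Expected profit per dollar invested (assumed, illustrative formula)."""
    gain_if_repaid = loan.amount * loan.interest
    loss_if_default = loan.amount - loan.recovery
    expected_profit = ((1 - loan.p_default) * gain_if_repaid
                       - loan.p_default * loss_if_default)
    return expected_profit / loan.amount


def select_top(loans: List[Loan], k: int) -> List[Loan]:
    """Rank loans by expected ROI, highest first, and keep the first k."""
    return sorted(loans, key=expected_roi, reverse=True)[:k]


# Made-up loans for illustration only
loans = [
    Loan("a", 10_000, 0.10, 0.05, 2_000),
    Loan("b", 10_000, 0.20, 0.30, 2_000),
    Loan("c", 5_000, 0.15, 0.08, 1_000),
]
top = select_top(loans, 2)
```

In this toy data, loan "b" has the highest interest rate but is dropped because its predicted default probability pushes its expected ROI below zero, which is the kind of trade-off a ranking metric is meant to capture.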
Our third goal was to assess the extent to which discrimination existed and then adjust the loans we chose to attain statistical parity.
We identified unintended flaws in our initial investment strategy that led to disproportionate investment in advantaged populations, such as highly educated areas.
We stratified our investments by relevant demographics and found that we can still achieve near-optimal return while distributing our investments evenly across protected groups.
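A minimal sketch of stratified selection, under assumptions: each loan carries a group label and a score (for example an expected ROI), and statistical parity is approximated by selecting the same fraction of loans from every group. The function names, the group labels `high_edu` and `low_edu`, and the scores are hypothetical and do not come from the project's data.

```python
from collections import defaultdict


def stratified_select(loans, k):
    """Select roughly k loans, taking the same fraction from every group.

    `loans` is a list of (loan_id, group, score) tuples. Within each group
    the highest-scoring loans are taken first. Rounding means the total can
    differ slightly from k.
    """
    rate = k / len(loans)  # shared selection rate across groups
    by_group = defaultdict(list)
    for loan in loans:
        by_group[loan[1]].append(loan)

    chosen = []
    for group in sorted(by_group):
        ranked = sorted(by_group[group], key=lambda l: l[2], reverse=True)
        chosen.extend(ranked[:round(rate * len(ranked))])
    return chosen


def selection_rates(loans, chosen):
    """Fraction of each group's loans that were selected."""
    total, picked = defaultdict(int), defaultdict(int)
    for _, group, _ in loans:
        total[group] += 1
    for _, group, _ in chosen:
        picked[group] += 1
    return {g: picked[g] / total[g] for g in total}


# Made-up loans: (id, group, score)
loans = [
    ("l1", "high_edu", 0.09), ("l2", "high_edu", 0.08),
    ("l3", "high_edu", 0.07), ("l4", "high_edu", 0.06),
    ("l5", "low_edu", 0.05), ("l6", "low_edu", 0.04),
]
naive = sorted(loans, key=lambda l: l[2], reverse=True)[:3]
fair = stratified_select(loans, 3)
```

Here the naive top-3 picks only `high_edu` loans, while the stratified version selects half of each group, giving up a small amount of score in exchange for equal selection rates.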
Future work
While we explored many methods and formulated a strong investment strategy, there are many extensions that could be explored.
Modeling Extensions
One potential area for exploration is to update our formula for expected ROI to include a term that accounts for average loan approval rates by zip code, incorporating another measure of fairness into our formulation.
The quantities involved are:
$A$ - loan amount
$P_d$ - probability of default (assessed by our models)
$I$ - interest rate charged on the loan
$R$ - expected return should a loan default
$\alpha$ - default penalty term
$\beta$ - tuning parameter
$P_a$ - probability of being approved for a loan
$P_{a\mid{z}}$ - probability of being approved for a loan given zip code
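One hypothetical way these quantities could be combined is sketched below. The functional form is entirely an assumption made for illustration: $\alpha$ scales the expected loss on default, and $\beta$ weights the gap between the overall approval rate $P_a$ and the zip-code approval rate $P_{a\mid{z}}$, so that loans from zip codes with below-average approval rates receive a boost. The function name `adjusted_roi` is also hypothetical.

```python
def adjusted_roi(amount, p_default, interest, recovery,
                 alpha, beta, p_approve, p_approve_zip):
    """Illustrative fairness-adjusted expected ROI (assumed form)."""
    expected_gain = (1 - p_default) * amount * interest
    expected_loss = alpha * p_default * (amount - recovery)
    base_roi = (expected_gain - expected_loss) / amount
    # Assumed fairness term: positive when the zip code's approval rate is
    # below the overall rate, negative when it is above.
    fairness_term = beta * (p_approve - p_approve_zip)
    return base_roi + fairness_term


# Same hypothetical loan, evaluated for two different zip codes
underserved = adjusted_roi(10_000, 0.05, 0.10, 2_000,
                           alpha=1.0, beta=0.1,
                           p_approve=0.6, p_approve_zip=0.4)
well_served = adjusted_roi(10_000, 0.05, 0.10, 2_000,
                           alpha=1.0, beta=0.1,
                           p_approve=0.6, p_approve_zip=0.8)
```

With $\beta = 0$ the score reduces to a plain expected ROI, so $\beta$ acts as a dial between pure return and the approval-rate fairness adjustment.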
We could also investigate a time-series approach to the data analysis behind our investment strategy.
Fairness Extensions
On the fairness front, a next step would be to implement a model that incorporates some notion of group-level fairness into the loss function. This is certainly an exciting avenue to explore, as it is a cutting-edge research area in machine learning.
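As a rough sketch of what a group-aware loss could look like, the example below adds a penalty on the gap in mean predicted default probability between two groups to an ordinary binary cross-entropy. This is one simple choice among many and an assumption on our part; the names `bce`, `parity_gap`, `fair_loss`, the weight `lam`, and the toy data are all hypothetical.

```python
import math


def bce(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def parity_gap(p_pred, groups):
    """Absolute gap in mean predicted probability between groups 0 and 1."""
    g0 = [p for p, g in zip(p_pred, groups) if g == 0]
    g1 = [p for p, g in zip(p_pred, groups) if g == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))


def fair_loss(y_true, p_pred, groups, lam=1.0):
    """Cross-entropy plus lam times the group parity gap (assumed penalty form)."""
    return bce(y_true, p_pred) + lam * parity_gap(p_pred, groups)


# Toy data: labels (1 = defaulted), predictions, and group membership
y_true = [0, 0, 1, 1]
p_pred = [0.2, 0.4, 0.6, 0.8]
groups = [0, 0, 1, 1]

plain = bce(y_true, p_pred)
penalized = fair_loss(y_true, p_pred, groups, lam=1.0)
```

A model trained on `fair_loss` would be pushed to keep average predictions similar across groups, with `lam` controlling how much accuracy is traded for that parity.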
It would also be wise to examine rejection data from Lending Club to inform us of characteristics that may make a loan less likely to be approved, determine if there is any implicit discrimination in that process, and potentially adjust our investment strategy.