Initial Steps — Project Default

Our model takes a set of predictors and returns whether a loan is likely to default. For simplicity, our baseline model uses a subset of the predictors: annual income, state (where the person lives), employment length, funded amount, grade, home ownership (e.g. rent, mortgage), and tax liens. The goal is to predict loan_status, which can take several values: current, in grace period, fully paid, late (either 16-30 days or 31-120 days), default, or charged off. The distinctions between some of these categories were not entirely clear to us, so we researched loan classifications. We found a report by Peter Renton, who contacted Lending Club to get their official definitions. They responded with this progression of states:

First, someone applies for a loan. They can either be accepted or rejected. We are only dealing with the data sets of those who were accepted, since we want to predict those that will default.
Then the person has a current loan. They start paying it off.
If they pay it off, then it’s fully paid.
If they instead miss a payment, the loan progresses to the grace period.
If they miss the grace period deadline, the loan is then late. Late loans are divided first into the 16-30 day period, and subsequently into the 31-120 day period.
After that, the loan finally defaults. The person has failed to repay the loan.
The final step is having the loan be charged off. Lending Club’s response says that this is “when there is no reasonable expectation of sufficient payment to prevent the charge off.”

We therefore decided to use the loans that had finished as training, and current loans as testing. “Finished” means they’d either defaulted or been charged off, or were fully paid. “Current” means current, in grace period, and late.
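
The split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not our actual pipeline: the status strings, the `split_by_status` helper, and the toy records are all assumptions made for the example, and the spellings should be checked against the real loan_status values.

```python
# Status spellings below are assumptions; adjust them to match the data.
FINISHED_PAID = {"Fully Paid"}
FINISHED_DEFAULT = {"Default", "Charged Off"}
ONGOING = {"Current", "In Grace Period", "Late (16-30 days)", "Late (31-120 days)"}


def split_by_status(loans):
    """Split loan records into (train, test) by loan_status.

    Finished loans go to train with a binary label (0 = default, 1 = paid).
    Ongoing loans go to test without a label, since their outcome is unknown.
    """
    train, test = [], []
    for loan in loans:
        status = loan["loan_status"]
        if status in FINISHED_PAID:
            train.append({**loan, "label": 1})
        elif status in FINISHED_DEFAULT:
            train.append({**loan, "label": 0})
        elif status in ONGOING:
            test.append(dict(loan))
        # Any other status is skipped rather than guessed.
    return train, test


# Toy records, made up for illustration.
loans = [
    {"id": 1, "loan_status": "Fully Paid"},
    {"id": 2, "loan_status": "Charged Off"},
    {"id": 3, "loan_status": "Current"},
    {"id": 4, "loan_status": "Late (16-30 days)"},
    {"id": 5, "loan_status": "Default"},
]
train, test = split_by_status(loans)
```

Here loans 1, 2, and 5 land in the training set with labels, while loans 3 and 4 land in the unlabeled test set.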

This is a clear division that would allow us to properly train our models and test them. It is a bit unbalanced, however: there are 4 times as many paid loans as there are loans that defaulted.

Model 0: Baseline

First, as a sanity check, we look at what happens if we predict everything to default, everything to be paid, or we just predict a random outcome.

(Class 0 is defaults, class 1 is paid loans.)

As expected, the negative model perfectly predicts defaults and the positive model perfectly predicts paid loans.

The overall accuracy is very high on the positive model and very low on the negative model, consistent with the fact that our data is skewed toward non-defaults.
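
To make the sanity check concrete, here is a small sketch of the three baselines on toy labels. The label counts are an assumption chosen to match the roughly 4:1 paid-to-default ratio mentioned earlier, and the helper names are hypothetical; the numbers are not our real results.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of that class predicted correctly}."""
    result = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        result[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return result


# Toy labels (assumed): 800 paid loans (class 1), 200 defaults (class 0).
y_true = [1] * 800 + [0] * 200

all_default = [0] * len(y_true)   # the "negative" model
all_paid = [1] * len(y_true)      # the "positive" model
rng = random.Random(0)            # fixed seed so the run is repeatable
coin_flip = [rng.randint(0, 1) for _ in y_true]

for name, y_pred in [
    ("all default", all_default),
    ("all paid", all_paid),
    ("random", coin_flip),
]:
    print(name, accuracy(y_true, y_pred), per_class_accuracy(y_true, y_pred))
```

On these toy labels each constant model is perfect on its own class and scores zero on the other, so its overall accuracy is simply that class's share of the data.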

Model 1: Alternative Models

We ran a slew of traditional models in order to see the spread of possible predictions. All of them had similar problems of extreme false positives. The weighted logistic, however, does passably: although it has a lower overall accuracy, it has the highest rates of prediction for defaults, which is our main goal. We will explore it further, later on.
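
For readers unfamiliar with weighting, one common scheme gives each class a weight inversely proportional to its frequency, which is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below assumes that scheme for illustration; it may not be the exact weighting used in our logistic model, and the label counts are toy values.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (assumed): 800 paid loans (class 1), 200 defaults (class 0).
labels = [1] * 800 + [0] * 200
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each default counts four times as much as each paid loan in the loss, so both classes contribute equally in total and the model can no longer score well by ignoring defaults.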

Model 2: Random Forest

Our first advanced model was a random forest. This seemed like the intuitive baseline model for us: basic classifications like “all default” and “all paid off” can be helpful at times, but to start off, we want a more accurate model.

We were thrilled when we saw that the random forest had 79% accuracy, but we then realized that was the overall accuracy. When separated into the defaulted vs. paid classifications, the predictions drop: the model scores 96% on the paid off loans, but 10% on the defaulted loans. This is because our data is imbalanced: a model can reach a high overall score by mostly predicting the majority class.
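
The gap between the overall and per-class numbers is just a weighted average. The sketch below assumes the evaluation set has the same roughly 4:1 paid-to-default split as the finished loans; the class shares are that assumption, and the per-class accuracies are the figures quoted above.

```python
def overall_accuracy(class_share, class_accuracy):
    """Overall accuracy as the share-weighted average of per-class accuracy."""
    return sum(class_share[c] * class_accuracy[c] for c in class_share)


share = {"paid": 0.8, "default": 0.2}    # assumed 4:1 class split
acc = {"paid": 0.96, "default": 0.10}    # per-class accuracy from the text
overall = overall_accuracy(share, acc)
print(round(overall, 3))
```

Under that assumption the weighted average comes to about 0.79, which matches the headline figure and shows how little the defaulted class contributes to it.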

Even so, which predictors did the model consider most important?

In this model, last_credit_pull_d is the most important feature (by a long shot). This metric records the last time someone's credit score was checked. This could be telling in that someone whose credit score was checked recently presumably has it checked often, which would be a sign that they are applying for lots of loans to consolidate debt. This is consistent with our earlier observation that most people who need to take out loans are trying to refinance their debt.

Since the weighted logistic worked best, we will explore it further.
