Data Exploration: Variable Analysis

We explore several variables of interest to see whether they are accurate predictors of loan default.

 
 

Location

How do loan statuses vary across the states? Are certain states more likely to have defaults?

Below is a map showing the fraction of loans that defaulted in each state. Red means a higher percentage defaulted; blue means a lower percentage.

The map shows that Mississippi is extremely hot, with 27% of all its loans defaulting.

Maine, on the other hand, does very well, with only 5% of its loans defaulting.
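The per-state fraction behind a map like this takes only a few lines to compute. The sketch below is a minimal illustration in plain Python, not the code that produced the map. The field names `state` and `status`, the choice of which statuses count as a default, and the sample rows are all assumptions made up for the example.

```python
from collections import defaultdict

# Statuses counted as a default in this sketch (an assumption for illustration).
DEFAULT_STATUSES = {"Default", "Charged Off"}

def default_fraction_by_state(loans):
    """Return {state: fraction of that state's loans that defaulted}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for loan in loans:
        totals[loan["state"]] += 1
        if loan["status"] in DEFAULT_STATUSES:
            defaults[loan["state"]] += 1
    return {state: defaults[state] / totals[state] for state in totals}

# Made-up rows with placeholder state codes, only to show the input shape.
sample_loans = [
    {"state": "XX", "status": "Charged Off"},
    {"state": "XX", "status": "Fully Paid"},
    {"state": "YY", "status": "Fully Paid"},
    {"state": "YY", "status": "Fully Paid"},
]

fractions = default_fraction_by_state(sample_loans)
```

Each value in `fractions` is what a map would translate into a color: the closer to 1, the redder the state.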

 

Loan Grades

Loan applications are given different grades. Again, the loans we are looking at are ones that have been accepted - we're not even looking at loan applications that were declined!

Of the loans that defaulted, what were their starting grades? What about the loans that didn't default?

The green and red bars are almost exactly even, across the different grades! This means that the initial grade is not a strong predictor of whether a loan will default: of the loans that default, 28% had an A grade!
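To make the comparison concrete, here is a rough sketch of how the grade breakdown within each group could be computed. It assumes each loan is a dictionary with hypothetical `grade` and `outcome` fields; the rows are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

def grade_shares(loans, outcome):
    """Return {grade: share of the loans with this outcome that carry that grade}."""
    grades = [loan["grade"] for loan in loans if loan["outcome"] == outcome]
    total = len(grades)
    if total == 0:
        return {}
    return {grade: count / total for grade, count in Counter(grades).items()}

# Made-up rows, only to show the calculation.
sample_loans = [
    {"grade": "A", "outcome": "default"},
    {"grade": "B", "outcome": "default"},
    {"grade": "A", "outcome": "paid"},
    {"grade": "A", "outcome": "paid"},
    {"grade": "B", "outcome": "paid"},
    {"grade": "C", "outcome": "paid"},
]

default_shares = grade_shares(sample_loans, "default")
paid_shares = grade_shares(sample_loans, "paid")
```

Note that this measures the share of each grade within an outcome group, which is what the bars compare. The default rate within each grade is a different quantity and would need the grouping done the other way around.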

Now that we've explored the statistics about the loan, let's look at some information about the people taking out the loans.

Borrower Demographics

Let's look at some basic stats about loan status based on the people who are taking out the loans. The borrower's income and the length of time they've been employed are simple enough statistics that we'd assume they would be related to loan status: you might think that someone who makes a lot of money is more capable of paying off the loan, or that someone who has been employed for 10 years is more responsible than someone who has been employed for one. However, the data contradicts these assumptions!

We'll first break down the stats among loans that defaulted, and then compare them with loans that were paid off. Among those who defaulted, how long have they usually been employed?

Counter-intuitively, the group that defaults most often is borrowers who have been employed for 10 years or more!

What about home-ownership? The prior assumption again would be that those who own their home have demonstrated fiscal responsibility, while those who are still paying off mortgages may be in situations beyond their control and may default more commonly.

The data corroborates this initial assumption.

Let's check out the borrowers' incomes. Are people with higher incomes more risk-averse and less likely to default?

Screen Shot 2016-12-14 at 10.27.18 AM.png

Whoa! That is a crazy graph. We show it to you just so you can see the absolute range of incomes among borrowers who defaulted. There's no one clear category - basically, no matter what income you have, you're still at risk of defaulting.

Now that we've broken down the stats about borrowers who defaulted, let's see how these stats compare across all loan statuses.

Default Vs. Paid Off:

Which variables matter?

Here's a graph of the borrower's income in relation to the loan status.

Turns out that income amount is pretty consistent across all borrowers!

(The x-axis is on a log scale.)

What about the amount of time the borrower has been employed?

There's some variation, but the time of employment doesn't appear to correlate with whether the loan will be fully paid or will default.

Let's take a more intense look at the comparison between defaults and repaid loans. In order to do so, we're going to combine some of the categories together, and ignore others.

We don't know the results of loans that are current, late, or in the grace period. They're still ongoing, so we can't use them for predictions. The only loans that have concluded are "Fully Paid," "Default," and "Charged Off." (Charged off means the loan defaulted and is definitely not going to be repaid.) We joined "Charged Off" with "Default" for our analysis.
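The relabeling described above can be sketched as follows. This is a minimal illustration, not our actual cleaning code: the `status` field name, the `paid`/`default` labels, and the sample rows are assumptions, and the exact status strings in the raw data may differ.

```python
# Concluded statuses that count as a default, per the rule described above.
DEFAULTED = {"Default", "Charged Off"}

def label_outcome(status):
    """Map a raw loan status to 'paid' or 'default'; None means not concluded."""
    if status == "Fully Paid":
        return "paid"
    if status in DEFAULTED:
        return "default"
    return None  # current, late, grace period, or anything else

def concluded_loans(loans):
    """Keep only concluded loans and attach a binary 'outcome' field."""
    labeled = []
    for loan in loans:
        outcome = label_outcome(loan["status"])
        if outcome is not None:
            labeled.append({**loan, "outcome": outcome})
    return labeled

# Made-up rows; "Current" stands in for any ongoing status.
sample_loans = [
    {"id": 1, "status": "Fully Paid"},
    {"id": 2, "status": "Charged Off"},
    {"id": 3, "status": "Current"},
    {"id": 4, "status": "Default"},
]

concluded = concluded_loans(sample_loans)
```

Returning `None` for every unrecognized status keeps the filter conservative: a loan is only used if it clearly falls into one of the three concluded categories.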

We're going to compare the paid loans vs. the defaulted loans as a function of two variables now. Green is paid, red is default.

Screen Shot 2016-12-14 at 11.15.13 PM.png
  • When we compare annual income to interest rate, we can see that as income goes up, the dots become mostly green, meaning people with higher incomes default less. As the interest rate increases, we see more red. So far, this makes sense.

 

 

Screen Shot 2016-12-14 at 11.29.13 PM.png

 

  • Let's look at annual income vs debt-to-income ratio. The lower your debt-to-income ratio, the higher the chance you will be able to pay back your loan. (Debt-to-income ratio refers to the percent of your monthly income that goes to servicing your debt, meaning paying it off including interest.) If we take the annual income into consideration as well, we can see that the higher the income, the lower the chance of default, which matches our other graph's conclusion.

 

  • Surprisingly, when we compare the amount that was loaned to the interest rate, we can see that people who take a large loan at a lower interest rate have a much smaller chance of defaulting. This is counterintuitive, since one would expect that as you take a larger loan, the chance that you might not be able to pay it back would increase.
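A numeric counterpart to these two-variable plots is to bin the loans on both variables and compute the default fraction in each cell. The sketch below is only an illustration of that idea, not something we ran for the graphs above. The field names (`income`, `rate`, `outcome`), the bin edges, and the rows are all made up.

```python
from bisect import bisect_right
from collections import defaultdict

def default_fraction_grid(loans, x_key, y_key, x_edges, y_edges):
    """Bin loans on two numeric fields; return {(x_bin, y_bin): default fraction}.

    Bin 0 holds values below the first edge, bin 1 the next range, and so on.
    Cells that contain no loans are simply absent from the result.
    """
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for loan in loans:
        cell = (bisect_right(x_edges, loan[x_key]),
                bisect_right(y_edges, loan[y_key]))
        totals[cell] += 1
        if loan["outcome"] == "default":
            defaults[cell] += 1
    return {cell: defaults[cell] / totals[cell] for cell in totals}

# Made-up rows, only to show the calculation.
sample_loans = [
    {"income": 30000, "rate": 20.0, "outcome": "default"},
    {"income": 30000, "rate": 18.0, "outcome": "paid"},
    {"income": 90000, "rate": 7.0, "outcome": "paid"},
    {"income": 90000, "rate": 8.0, "outcome": "paid"},
]

grid = default_fraction_grid(
    sample_loans, "income", "rate", x_edges=[50000], y_edges=[12.0]
)
```

With real data, a cell with a high fraction corresponds to a region of the scatter plot dominated by red dots, and a low fraction to a mostly green region.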

We've now got a strong understanding of our data. It's time to start manipulating it.
