Data Exploration: Loan Statistics
Basic stats about the LendingClub data.
First Steps
We analyze the loan data that LendingClub makes available on its website. The data is not provided as a single collated file; instead, there are separate files for different time frames. For the general loan data, the files cover 2007-2011, 2012-2013, 2014, 2015, 2016 Q1, and 2016 Q2. These periods are uneven, which makes data exploration slightly more difficult. The data on declined loan applications is split into a different set of periods, and we do not focus on it here.
Our first task was therefore to import each file into our workspace and append them all into a single dataframe. We then looked at the different loan attributes described in the data.
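The import-and-append step can be sketched as follows. This is a minimal illustration, not the exact code we ran: it assumes pandas, the helper name `load_loan_files` is hypothetical, and the two in-memory "files" contain made-up rows with assumed column names (`loan_amnt`, `loan_status`) standing in for the real per-period downloads.

```python
import io

import pandas as pd


def load_loan_files(sources):
    """Read each CSV source and stack the results into one DataFrame.

    `sources` can be file paths or file-like objects; this helper is a
    hypothetical convenience wrapper, not part of any library.
    """
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)


# Tiny in-memory stand-ins for two of the per-period files (made-up rows).
file_2007_2011 = io.StringIO(
    "loan_amnt,loan_status\n5000,Fully Paid\n12000,Charged Off\n"
)
file_2014 = io.StringIO("loan_amnt,loan_status\n8000,Current\n")

loans = load_loan_files([file_2007_2011, file_2014])
print(loans)
```

Collecting the frames in a list and calling `pd.concat` once is generally preferable to appending file by file: `DataFrame.append` was removed in pandas 2.0, and repeated concatenation copies the accumulated data each time.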
Loan Statistics
The first item we looked at was the distribution of the loan statuses.
The most common status is “current,” meaning the loan is still active, so nothing more can be said yet about its outcome. The second most common status is “fully paid,” which is encouraging.
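A status distribution like the one described above can be tabulated with a frequency count. The sketch below assumes pandas and a status column named `loan_status` (an assumed name); the six rows are invented purely for illustration and do not reflect the real proportions.

```python
import pandas as pd

# Made-up sample; the column name `loan_status` is an assumption.
loans = pd.DataFrame(
    {
        "loan_status": [
            "Current",
            "Current",
            "Current",
            "Fully Paid",
            "Fully Paid",
            "Charged Off",
        ]
    }
)

# Absolute counts, sorted from most to least common.
status_counts = loans["loan_status"].value_counts()

# The same distribution expressed as fractions of all loans.
status_share = loans["loan_status"].value_counts(normalize=True)

print(status_counts)
print(status_share)
```

Looking at shares alongside raw counts makes it easier to compare statuses once the combined data spans files of very different sizes.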
We next explored the loan amounts for each of those loan statuses. The distributions have similar shapes across the different statuses: loans in the $5,000-$10,000 range are the most common, while much larger loans (in the >$30,000 range) are the least common.
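One way to compare loan amounts across statuses is to bin the amounts and cross-tabulate the bins against status. The following is a sketch under stated assumptions: it uses pandas, the column names `loan_status` and `loan_amnt` are assumed, the bin edges are illustrative choices rather than the ones behind our plots, and the rows are invented.

```python
import pandas as pd

# Made-up sample; column names are assumptions.
loans = pd.DataFrame(
    {
        "loan_status": [
            "Current", "Current", "Fully Paid",
            "Fully Paid", "Charged Off", "Current",
        ],
        "loan_amnt": [6000, 9500, 7000, 32000, 8000, 15000],
    }
)

# Illustrative bin edges in dollars; each bin includes its right edge.
bins = [0, 5000, 10000, 20000, 30000, 40000]
labels = ["<=5k", "5k-10k", "10k-20k", "20k-30k", ">30k"]
loans["amount_band"] = pd.cut(loans["loan_amnt"], bins=bins, labels=labels)

# Rows are statuses, columns are amount bands, cells are loan counts.
band_by_status = pd.crosstab(loans["loan_status"], loans["amount_band"])

# Summary statistics of the raw amounts, one row per status.
amount_summary = loans.groupby("loan_status")["loan_amnt"].describe()

print(band_by_status)
print(amount_summary)
```

The cross-tabulation shows whether each status has a similar shape across bands, while the `describe()` table gives the median and quartiles for a quick numeric comparison.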
Borrower Statistics
Why are people taking out loans? Below are the reported reasons for the loans:
However, we caution against taking this data at face value: these are only the reasons that borrowers reported for their loans.
Now that we've explored the general layout of our data, let's take a closer look at some of the predictors, like income or interest rate.