Data Cleaning
The steps we took to get our data into the proper format.
Step 1: Correlation
We first checked the correlation of all the variables. This helps us see which variables are redundant: if two variables are highly correlated, they provide essentially the same information, and we don't need to use both.
Since there are 111 variables, we created a heatmap to quickly visualize the correlations. The image below is pretty big. To read it, look along the diagonal: variables that cluster together in red squares are correlated with each other.
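The check itself is short in pandas. Here is a minimal sketch, assuming the data has been loaded into a DataFrame called loans (the file name is hypothetical); the later snippets continue to work on this same DataFrame.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; the raw export has 111 columns.
loans = pd.read_csv("loans.csv")

# Correlation is only defined for numeric columns.
corr = loans.select_dtypes(include="number").corr()

# Clusters of red squares along the diagonal mark groups of variables
# that are highly correlated with each other.
plt.figure(figsize=(20, 20))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```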
There is high correlation between sets of similar variables. An example is loan_amnt and funded_amnt, which makes sense: if you request a larger loan and are approved, the amount funded will also be large. Correlation between predictors is a problem because it makes the model more complicated without adding new information, which makes us more likely to overfit: the model looks good on the training data but performs worse on new data. We deal with correlation by removing duplicate columns.
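One way to automate the removal, sketched here as an illustration rather than as the exact code we ran (it continues with corr and loans from the snippet above): keep the upper triangle of the correlation matrix and drop any column whose correlation with an earlier column exceeds a cutoff. The 0.9 below is an assumed threshold.

```python
import numpy as np

# Keep only the upper triangle so each pair of columns is considered once.
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns that duplicate information already carried by an earlier column.
redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
loans = loans.drop(columns=redundant)
```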
Step 2: Post-hoc Info
Next, we drop features that contain post-hoc information (things we only know once the loan is underway). We want to make our prediction using only the data available at the time of the loan application. For example, it's kind of cheating to use the last payment date to predict default: someone whose last payment was 119 days ago is much more likely to default than someone who made an installment yesterday.
Along the same lines, we also drop total payment and last payment amount.
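In pandas this is a single drop call. The column names below (last_pymnt_d, total_pymnt, last_pymnt_amnt) are assumptions about how these fields are labelled in the raw data and should be checked against the actual file.

```python
# Information that only exists after the loan is issued must not leak into the model.
post_hoc_cols = ["last_pymnt_d", "total_pymnt", "last_pymnt_amnt"]
loans = loans.drop(columns=post_hoc_cols, errors="ignore")
```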
Step 3: Meaningless Data
We also dropped columns that carry no predictive meaning. member_id, for example, is an arbitrary number assigned to each observation for accounting purposes. This number should not provide any information about the likelihood of default, so we removed it.
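A quick sanity check before dropping it (assuming the column is literally named member_id): if every row has a distinct value, the column is just an identifier.

```python
# member_id should be a unique label per row, carrying no predictive signal.
assert loans["member_id"].is_unique
loans = loans.drop(columns=["member_id"])
```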
Step 4: NaNs
Drop columns that have too much data missing.
For each column, we checked what percentage of the data is null/NaN/missing. Of the 111 variables, only 35 had missing values, and their share of nulls ranged anywhere from 10% to 90%. We chose 50% as the threshold for dropping columns: if a column was more than 50% null, we removed it from the dataset. For the remaining columns, we need some way to fill in the missing data; a quick and easy way is to impute (fill in) each gap with the column's median value.
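Both the threshold and the median fill take only a few lines. Here is a minimal sketch, continuing with the loans DataFrame and applying the median imputation to the numeric columns.

```python
# Fraction of missing values in each column.
null_frac = loans.isna().mean()

# Drop any column that is more than 50% missing.
loans = loans.drop(columns=null_frac[null_frac > 0.5].index)

# Fill the remaining gaps in numeric columns with each column's median.
numeric_cols = loans.select_dtypes(include="number").columns
loans[numeric_cols] = loans[numeric_cols].fillna(loans[numeric_cols].median())
```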
Step 5: Individualized Values
Some predictors are too specific. Applicants can fill in a free-form answer for some questions, which leads to a huge range of responses. For employer, a huge number of applicants have a unique answer, and zip codes are similarly too fine-grained. Since values this individualized won't be useful for predicting new data, we drop these columns.
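A quick way to spot such columns is to count the unique values in each one; emp_title and zip_code are assumed names for the employer and zip-code fields.

```python
# Columns where almost every row has its own value are effectively identifiers.
unique_frac = loans.nunique() / len(loans)
print(unique_frac.sort_values(ascending=False).head(10))

loans = loans.drop(columns=["emp_title", "zip_code"], errors="ignore")
```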
Step 6: Encoding
Now we're left with just the columns we want to use, but we still have some manipulation to do.
Our mathematical models don't know how to interpret strings like "debt-consolidation" or "mortgage." We need to convert these qualitative attributes into numbers that the model will be able to understand.
We encode categorical variables either with numerical labels (label encoding) or by splitting them into new binary predictors (one-hot encoding). We chose between the two based on the number of unique values in each column and how interpretable the encoding would be. We used label encoding when we could assume the variable had a natural order, such as columns that involve dates (like the issue date of the loan). When a numeric ordering didn't make sense, we used one-hot encoding (dummy variables), such as for home ownership.
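A sketch of both encodings; issue_d and home_ownership are assumed names for the issue-date and home-ownership columns.

```python
# Label encoding for a variable with a natural order: turn the issue date into
# a count of months, so later dates map to larger values.
issue = pd.to_datetime(loans["issue_d"])  # a format= argument may be needed for the raw strings
loans["issue_d"] = issue.dt.year * 12 + issue.dt.month

# One-hot encoding (dummy variables) for an unordered category like home ownership.
loans = pd.get_dummies(loans, columns=["home_ownership"], drop_first=True)
```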
Now that our data is all in order, it's time to build our model!