Loan Company Data Analysis


Introduction

  • As we all know, practice makes perfect, so I have chosen to write my reports in English from now on.
  • The fourth data analysis report of my weekly plan is about LendingClub, a company in the personal loan business.
  • The data was downloaded from LendingClub and covers the years 2007 to 2011.

Contents

1 Data loading and pre-processing

2 Data cleaning and feature analysis

3 Model training and prediction

4 Summary


Part 1 Data loading and pre-processing

After downloading the data from LendingClub to a local directory, we can start our journey of analyzing it.

First, we can use pandas to read the file and pre-process the original data.

import pandas as pd

# Load the original dataset and skip the first row
loans_2007 = pd.read_csv("lendingclub/LoanStats3a.csv", skiprows=1)
half_count = len(loans_2007) // 2
# Exclude columns that have missing values in more than half of the rows
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
# Save as a new csv file
# Problem record 1: add the encoding argument
loans_2007.to_csv("loans_2007.csv", index=False, encoding="utf-8")


Part 2 Data cleaning and feature analysis

Here we are going to check our newly saved data after the first step of pre-processing.

loans_2007 = pd.read_csv("loans_2007.csv")
# loans_2007.drop_duplicates()
# Check the info about this loan data
print(loans_2007.iloc[0])
print(loans_2007.shape[1])

OUTPUT:
loan_amnt                      5000
funded_amnt                    5000
funded_amnt_inv                4975
term                           36 months
int_rate                       10.65%
installment                    162.87
grade                          B
sub_grade                      B2
emp_title                      NaN
emp_length                     10+ years
home_ownership                 RENT
annual_inc                     24000
verification_status            Verified
issue_d                        Dec-2011
loan_status                    Fully Paid
pymnt_plan                     n
desc                           Borrower added on 12/22/11 > I need to upgra...
purpose                        credit_card
title                          Computer
zip_code                       860xx
addr_state                     AZ
dti                            27.65
delinq_2yrs                    0
earliest_cr_line               Jan-1985
inq_last_6mths                 1
open_acc                       3
pub_rec                        0
revol_bal                      13648
revol_util                     83.7%
total_acc                      9
initial_list_status            f
out_prncp                      0
out_prncp_inv                  0
total_pymnt                    5863.16
total_pymnt_inv                5833.84
total_rec_prncp                5000
total_rec_int                  863.16
total_rec_late_fee             0
recoveries                     0
collection_recovery_fee        0
last_pymnt_d                   Jan-2015
last_pymnt_amnt                171.62
last_credit_pull_d             Jan-2018
collections_12_mths_ex_med     0
policy_code                    1
application_type               Individual
acc_now_delinq                 0
chargeoff_within_12_mths       0
delinq_amnt                    0
pub_rec_bankruptcies           0
tax_liens                      0
hardship_flag                  N
disbursement_method            Cash
debt_settlement_flag           N
Name: 0, dtype: object
54

As shown above, there are 54 feature columns left in the current data, which is too many for training a model, so we still need to analyze each column, using the descriptions in the LCDataDictionary file downloaded from the website.

The specific columns are listed below:

"funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d" ,"zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv" ,"total_rec_prncp","total_rec_int","total_rec_late_fee", "recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt", "desc","hardship_flag","disbursement_method","debt_settlement_flag"

These seem to have no relation to the outcome of a loan, so I chose to drop them from the data.

loans_2007 = loans_2007.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2007 = loans_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
loans_2007 = loans_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
loans_2007 = loans_2007.drop(["desc", "hardship_flag", "disbursement_method", "debt_settlement_flag"], axis=1)

What does the data look like now? We can print it out and check the result.

print(loans_2007.iloc[0])
print(loans_2007.shape[1])

OUTPUT:
loan_amnt                     5000
term                          36 months
int_rate                      10.65%
installment                   162.87
emp_length                    10+ years
home_ownership                RENT
annual_inc                    24000
verification_status           Verified
loan_status                   Fully Paid
pymnt_plan                    n
purpose                       credit_card
title                         Computer
addr_state                    AZ
dti                           27.65
delinq_2yrs                   0
earliest_cr_line              Jan-1985
inq_last_6mths                1
open_acc                      3
pub_rec                       0
revol_bal                     13648
revol_util                    83.7%
total_acc                     9
initial_list_status           f
last_credit_pull_d            Jan-2018
collections_12_mths_ex_med    0
policy_code                   1
application_type              Individual
acc_now_delinq                0
chargeoff_within_12_mths      0
delinq_amnt                   0
pub_rec_bankruptcies          0
tax_liens                     0
Name: 0, dtype: object
32

The number of columns has decreased from 54 to 32.

We may wonder how we can know the final result of each loan request. According to the description in the LCDataDictionary, the feature column loan_status represents the outcome of each loan.

# Show the categories of the feature column loan_status
print(loans_2007["loan_status"].value_counts())

OUTPUT:
Fully Paid                                             34116
Charged Off                                             5670
Does not meet the credit policy. Status:Fully Paid      1988
Does not meet the credit policy. Status:Charged Off      761
Name: loan_status, dtype: int64

The result shows four different kinds of status. However, only Fully Paid (the loan was repaid in full) and Charged Off (the borrower defaulted) describe a definite final outcome, so we can drop the other two from our data.

# Keep rows with "Fully Paid" and "Charged Off" as labels and map them to int
# 1: the loan was fully paid
# 0: the loan was charged off
loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") | (loans_2007["loan_status"] == "Charged Off")]
status_replace = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans_2007 = loans_2007.replace(status_replace)

Before we go ahead, there is still some filtering to finish.

# Look for columns that contain only one unique value and remove them
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print(loans_2007.shape)
loans_2007.to_csv("filtered_loans_2007.csv", index=False, encoding="utf-8")

The data looks cleaner than before, but the cleaning needs to continue.

# Load the newly saved csv file
loans = pd.read_csv("filtered_loans_2007.csv")
# Check the null values
null_counts = loans.isnull().sum()
print(null_counts)

OUTPUT:
loan_amnt                 0
term                      0
int_rate                  0
installment               0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
title                    10
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util               50
total_acc                 0
last_credit_pull_d        2
pub_rec_bankruptcies    697
dtype: int64

As we can see above, there are 697 missing values in pub_rec_bankruptcies. We decide to drop that column and then drop the remaining rows containing nulls.

loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())

OUTPUT:
object     11
float64    10
int64       1
dtype: int64

Before we train on the data, we have to make sure that the values in each feature column are numerical. So let's handle the object features.

object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

OUTPUT:
term                    36 months
int_rate                   10.65%
emp_length              10+ years
home_ownership               RENT
verification_status      Verified
purpose               credit_card
title                    Computer
addr_state                     AZ
earliest_cr_line         Jan-1985
revol_util                  83.7%
last_credit_pull_d       Jan-2018
Name: 0, dtype: object

Let's check out some specific columns.

cols = ["home_ownership", "verification_status", "emp_length", "term", "addr_state"]
for c in cols:
    print(loans[c].value_counts())

OUTPUT:
RENT        18881
MORTGAGE    17688
OWN          3056
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16890
Verified           12833
Source Verified    10001
Name: verification_status, dtype: int64
10+ years    8897
< 1 year     4576
2 years      4389
3 years      4094
4 years      3435
5 years      3279
1 year       3240
6 years      2227
7 years      1771
8 years      1483
9 years      1259
n/a          1074
Name: emp_length, dtype: int64
 36 months    29041
 60 months    10683
Name: term, dtype: int64
CA    7095
NY    3815
FL    2869
TX    2729
NJ    1850
IL    1524
PA    1515
VA    1407
GA    1399
MA    1343
OH    1221
MD    1053
AZ     878
WA     841
CO     791
NC     788
CT     754
MI     722
MO     685
MN     613
NV     497
SC     472
WI     459
AL     451
OR     450
LA     436
KY     327
OK     299
KS     271
UT     259
AR     245
DC     212
RI     199
NM     190
WV     177
HI     173
NH     172
DE     113
MT      85
WY      83
AK      81
SD      63
VT      54
MS      19
TN      17
IN       9
ID       6
NE       5
IA       5
ME       3
Name: addr_state, dtype: int64

print(loans["purpose"].value_counts())
print(loans["title"].value_counts())

OUTPUT:
debt_consolidation    18661
credit_card            5134
other                  3985
home_improvement       2980
major_purchase         2182
small_business         1827
car                    1549
wedding                 947
medical                 693
moving                  581
house                   382
vacation                380
educational             320
renewable_energy        103
Name: purpose, dtype: int64
Debt Consolidation                          2189
Debt Consolidation Loan                     1732
Personal Loan                                661
Consolidation                                516
debt consolidation                           508
Home Improvement                             357
Credit Card Consolidation                    357
Debt consolidation                           334
Small Business Loan                          329
Credit Card Loan                             319
Personal                                     309
Consolidation Loan                           256
Home Improvement Loan                        249
personal loan                                234
personal                                     220
Loan                                         213
Wedding Loan                                 209
consolidation                                205
Car Loan                                     204
Other Loan                                   192
Credit Card Payoff                           155
Wedding                                      154
Major Purchase Loan                          145
Credit Card Refinance                        144
Consolidate                                  127
Medical                                      122
Credit Card                                  119
home improvement                             112
My Loan                                       94
Credit Cards                                  93
...
Right Track                                    1
Paying of medical and small card bills         1
Real Estate Investment                         1
2002 Timberland RV travel trailer              1
The Americrats Fund                            1
Erikas CC debt consolidation                   1
expanding business                             1
Guys Home Improvement                          1
TCA - CC payoff, Equip Purchase, Capital       1
Burning off fat.. and debt                     1
New Business & Transition Expenses             1
all done                                       1
Quick Pay                                      1
General Purpose Loan                           1
Boost Credit Score                             1
Opportunity to Excel                           1
Chase CreditCard Balance transfer              1
2008 Tax Payment                               1
Pay off College debt                           1
PGloan                                         1
Travel Trailer - Seeing the Sites              1
gouverneur home                                1
SAVE ME. IM DROWNING                           1
Consolidation2243                              1
Relocation expenses                            1
Green Home Energy Financing                    1
Ford Falcon                                    1
Pay it all off!                                1
Paid off credit debt                           1
Tommy                                          1
Name: title, Length: 19610, dtype: int64

As we can see, the addr_state and title columns contain a wide variety of values, so it is not a good idea to keep them. We also drop the two date columns, map emp_length to numbers, and strip the "%" suffix from the rate columns.

mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}
# Drop the columns with too many distinct values and the date columns,
# and clean the columns whose strings end with "%"
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)

The last step in converting all the object columns into numerical values is to apply the pandas get_dummies function.

cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
# Turn the categorical columns into indicator columns with the pd function get_dummies
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
# Note: emp_length is already numeric after the mapping above, so get_dummies
# passes it through unchanged and the drop then removes it entirely, which is
# why it no longer appears in the final column list below.
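To make the effect of get_dummies concrete, here is a tiny illustration of my own; the demo DataFrame is invented and is not part of the LendingClub data.

import pandas as pd

demo = pd.DataFrame({"term": [" 36 months", " 60 months", " 36 months"]})
print(pd.get_dummies(demo))
# Each category becomes its own 0/1 indicator column:
#    term_ 36 months  term_ 60 months
# 0                1                0
# 1                0                1
# 2                1                0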

Save the cleaned data as a csv file:

loans.to_csv("cleaned_loans2007.csv", index=False, encoding="utf-8")

Let's check it out.

loans = pd.read_csv("cleaned_loans2007.csv")print(loans.info())OUTPUT:<class pandas.core.frame.DataFrame>RangeIndex: 39724 entries, 0 to 39723Data columns (total 37 columns):loan_amnt 39724 non-null float64int_rate 39724 non-null float64installment 39724 non-null float64annual_inc 39724 non-null float64loan_status 39724 non-null int64dti 39724 non-null float64delinq_2yrs 39724 non-null float64inq_last_6mths 39724 non-null float64open_acc 39724 non-null float64pub_rec 39724 non-null float64revol_bal 39724 non-null float64revol_util 39724 non-null float64total_acc 39724 non-null float64home_ownership_MORTGAGE 39724 non-null int64home_ownership_NONE 39724 non-null int64home_ownership_OTHER 39724 non-null int64home_ownership_OWN 39724 non-null int64home_ownership_RENT 39724 non-null int64verification_status_Not Verified 39724 non-null int64verification_status_Source Verified 39724 non-null int64verification_status_Verified 39724 non-null int64purpose_car 39724 non-null int64purpose_credit_card 39724 non-null int64purpose_debt_consolidation 39724 non-null int64purpose_educational 39724 non-null int64purpose_home_improvement 39724 non-null int64purpose_house 39724 non-null int64purpose_major_purchase 39724 non-null int64purpose_medical 39724 non-null int64purpose_moving 39724 non-null int64purpose_other 39724 non-null int64purpose_renewable_energy 39724 non-null int64purpose_small_business 39724 non-null int64purpose_vacation 39724 non-null int64purpose_wedding 39724 non-null int64term_ 36 months 39724 non-null int64term_ 60 months 39724 non-null int64dtypes: float64(12), int64(25)memory usage: 11.2 MBNone

We have finally finished data cleaning and feature analysis. It's time to import a scikit-learn machine learning algorithm and do something interesting.


Part 3 Model training and prediction

As usual, we import the ML algorithm, and the first one to try should always be LogisticRegression. At the same time, we split our data into features and a target, as sketched below.
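The code that follows uses two objects, features and target, that the original post never constructs explicitly, so here is a minimal sketch of the assumed split: loan_status is the label, and every remaining column is a feature.

# Assumed step (not shown in the original post):
# the label is loan_status; all other columns are features
features = loans.drop("loan_status", axis=1)
target = loans["loan_status"]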

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold

lr = LogisticRegression()
kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives.
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
print(predictions[:20])

OUTPUT:
0.9992369324685234
0.9985843213590515
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
dtype: int64

According to the results shown above, the model seems to predict 1 for every item. That is definitely not a useful model. Considering the imbalance of the original data, we can try introducing the class_weight parameter of LogisticRegression and see how it affects the outcome.

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold

# Introduce the class_weight parameter and set its value to "balanced"
lr = LogisticRegression(class_weight="balanced")
kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives.
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
print(predictions[:20])

OUTPUT:
0.6609925747659438
0.3978056981065298
0     1
1     0
2     0
3     1
4     0
5     1
6     0
7     0
8     0
9     0
10    0
11    1
12    0
13    1
14    1
15    0
16    0
17    1
18    1
19    1
dtype: int64

The predictions are no longer all the same. However, what we expect is a tpr as high as possible and an fpr as low as possible. In the solution above, the class_weight parameter was set to "balanced"; let's make a small change and pass a custom weight dict instead.

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold

# A custom weight dict: an error on a charged-off loan (class 0) costs 5x
penalty = {
    0: 5,
    1: 1
}
lr = LogisticRegression(class_weight=penalty)
kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives.
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)

OUTPUT:
0.7313415314178382
0.4951336046717395

The tpr is higher than in the last solution, but so is the fpr; that is not what we want.
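Before switching algorithms, one could explore this trade-off more systematically. The sweep below is my own sketch, not part of the original analysis; it assumes the features, target, and kf objects defined earlier and simply varies the weight on class 0, printing tpr and fpr for each candidate.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold

kf = KFold(features.shape[0], random_state=1)
for w in [2, 5, 10, 20]:
    # A heavier weight on class 0 pushes the model toward predicting 0,
    # lowering fpr at the cost of tpr
    lr = LogisticRegression(class_weight={0: w, 1: 1})
    preds = pd.Series(cross_val_predict(lr, features, target, cv=kf))
    tp = float(((preds == 1) & (target == 1)).sum())
    fn = float(((preds == 0) & (target == 1)).sum())
    fp = float(((preds == 1) & (target == 0)).sum())
    tn = float(((preds == 0) & (target == 0)).sum())
    print(w, tp / (tp + fn), fp / (fp + tn))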

The LogisticRegression algorithm does not seem to work very well here. How about RandomForestClassifier?

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict, KFold

rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
# print(help(RandomForestClassifier))
kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(rf, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives.
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)

OUTPUT:
0.975493792739119
0.9419571757211113

Well, the RandomForestClassifier algorithm does not work as well as expected either: with tpr and fpr both close to 1, it is again predicting 1 for almost every row.
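As a side note, the tp/fp/fn/tn bookkeeping repeated in every block above can be folded into a small helper. The sketch below uses sklearn.metrics.confusion_matrix; the helper name tpr_fpr is my own invention, not something from the original post.

from sklearn.metrics import confusion_matrix

def tpr_fpr(y_true, y_pred):
    # With labels=[0, 1], confusion_matrix returns [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / float(tp + fn), fp / float(fp + tn)

# Example: score the random forest predictions from the block above
tpr, fpr = tpr_fpr(loans["loan_status"], predictions)
print(tpr, fpr)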


Part 4 Summary

  • In the real world, data is dirty and full of null values. It takes a lot of time to do feature analysis and to decide which columns should be dropped.
  • Considering the imbalance of the original data, we can use the related parameter (class_weight) to reduce its impact on the results.
