Credit Scorecard Modeling and Analysis with Python


Credit scoring is an applied-statistics technique for assigning a risk score to loan (or credit card) applicants. A credit scorecard assesses an applicant's creditworthiness from data such as the basic application details and credit bureau records.

Following the principles of scorecard development, the workflow is:

1. Modeling preparation (reject inference, deduplication, variable transformation, training-set construction)

2. Coarse variable screening

3. Variable cleaning

4. Fine variable screening and variable-level compression

5. Modeling and deployment

The main types of risk scoring models:

  • Application scoring: predicts, from the information available at application time, the statistical probability that a customer will default or become delinquent
  • Behavioral scoring: predicts, from a customer's past behavior, the statistical probability of future default or delinquency
  • Collection scoring: predicts, from a customer's past behavior, the statistical probability that an already-delinquent account will repay or deteriorate further

How it works

An application scorecard is a statistical model that evaluates an applicant's information and produces a single score, giving a quantitative prediction of the applicant's ability to repay.

An application scorecard consists of a series of characteristics, each corresponding to a question on the application form (for example: age, bank statements, income). Each characteristic has a set of possible attributes, corresponding to the possible answers to that question (for age, the answers might be "under 30", "30 to 45", and so on). To develop the scorecard model, one first establishes the relationship between each attribute and the applicant's future credit performance, then assigns each attribute a score weight that reflects that relationship: the larger the weight, the better the credit performance the attribute indicates. An application's score is the simple sum of its attribute scores. If an applicant's score is at or above the cutoff set by the lender, the application is deemed an acceptable risk and approved; applications below the cutoff are declined or flagged for further review.
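To make the additive mechanics concrete, here is a minimal sketch of how such a scorecard is applied; every bin, point value, and the cutoff below is invented for illustration only:

# Toy scorecard: each attribute maps to points; the application score is their sum.
# All bins, point values, and the cutoff are illustrative, not from the model below.
toy_scorecard = {
    'age':    lambda x: 15 if x < 30 else 25,    # older applicants score higher here
    'income': lambda x: 10 if x < 4000 else 30,  # a monthly-income threshold
}
base_score = 500
cutoff = 540

def application_score(applicant):
    return base_score + sum(points(applicant[attr])
                            for attr, points in toy_scorecard.items())

applicant = {'age': 35, 'income': 5200}
score = application_score(applicant)             # 500 + 25 + 30 = 555
print(score, 'approve' if score >= cutoff else 'refer for review')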

1. Modeling preparation

# Read the data
# accepts.csv: applicants who passed risk review and received a loan
# rejects.csv: applicants who did not pass risk review
import pandas as pd
import numpy as np
accepts = pd.read_csv('script_credit/accepts.csv')
rejects = pd.read_csv('script_credit/rejects.csv')

(1) Inspect the data structure

accepts.info()  # inspect missing values and dtypes

(2) Reject inference: predict bad_ind for the rejects file

# Reject inference
# First separate the target variable from the predictors
accepts_X = accepts[['tot_derog', 'age_oldest_tr', 'rev_util', 'fico_score', 'ltv']]
accepts_y = accepts['bad_ind']
rejects_X = rejects[['tot_derog', 'age_oldest_tr', 'rev_util', 'fico_score', 'ltv']]

(3) Fill in the missing values in accepts_X

# The plan was to impute with the KNN method from the fancyimpute package
# (use the 3 nearest rows that have a feature to fill in each row's missing features),
# but the package could not be installed on a 32-bit system, so mean imputation is used instead.
# import fancyimpute as fimp
# accepts_X_filled = pd.DataFrame(fimp.KNN(3).complete(accepts_X.as_matrix()))
# accepts_X_filled.columns = accepts_X.columns
# rejects_X_filled = pd.DataFrame(fimp.KNN(3).complete(rejects_X.as_matrix()))
# rejects_X_filled.columns = rejects_X.columns
accepts_X_filled = accepts_X.fillna(accepts_X.mean())
rejects_X_filled = rejects_X.fillna(rejects_X.mean())
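If a KNN-based fill is still wanted, newer scikit-learn versions (0.22+) ship a built-in imputer, which avoids the fancyimpute dependency entirely; a sketch using the same 3 neighbours as the commented-out code:

# KNN imputation with scikit-learn's built-in imputer (alternative sketch)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
accepts_X_filled = pd.DataFrame(imputer.fit_transform(accepts_X),
                                columns=accepts_X.columns)
rejects_X_filled = pd.DataFrame(imputer.fit_transform(rejects_X),
                                columns=rejects_X.columns)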

(4) Normalize the data

# Normalize the data
from sklearn.preprocessing import Normalizer
accepts_X_norm = pd.DataFrame(Normalizer().fit_transform(accepts_X_filled))
accepts_X_norm.columns = accepts_X_filled.columns
rejects_X_norm = pd.DataFrame(Normalizer().fit_transform(rejects_X_filled))
rejects_X_norm.columns = rejects_X_filled.columns
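One caveat worth knowing: Normalizer rescales each row to unit norm rather than standardizing each column. If column-wise scaling is what is actually intended before the KNN step, StandardScaler is the usual choice; a sketch, fitted on accepts only so both files share one scale:

# Column-wise standardization as an alternative to the row-wise Normalizer
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(accepts_X_filled)
accepts_X_norm = pd.DataFrame(scaler.transform(accepts_X_filled),
                              columns=accepts_X_filled.columns)
rejects_X_norm = pd.DataFrame(scaler.transform(rejects_X_filled),
                              columns=rejects_X_filled.columns)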

(5) Predict bad_ind for the rejects file

# Predict with a KNN model
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5, weights='distance')
neigh.fit(accepts_X_norm, accepts_y)
rejects['bad_ind'] = neigh.predict(rejects_X_norm)
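Before trusting the inferred labels, it is sensible to check how well this KNN model actually separates good and bad accounts on the accepts data; a quick cross-validated sanity check (sketch):

# Rough sanity check of the reject-inference model on accepts
from sklearn.model_selection import cross_val_score
cv_auc = cross_val_score(KNeighborsClassifier(n_neighbors=5, weights='distance'),
                         accepts_X_norm, accepts_y, cv=5, scoring='roc_auc')
print('CV AUC: %.3f +/- %.3f' % (cv_auc.mean(), cv_auc.std()))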

(6) Merge the data and build the training set

Merge the data of applicants who passed review with those who were declined.

# The accepts data oversamples defaulted customers,
# so rejects must be sampled at the same ratio
rejects_res = rejects[rejects['bad_ind'] == 0].sample(1340)
rejects_res = pd.concat([rejects_res, rejects[rejects['bad_ind'] == 1]], axis=0)
data = pd.concat([accepts.iloc[:, 2:-1], rejects_res.iloc[:, 1:]], axis=0)
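If the oversampling ratio should be derived rather than hard-coded, it can be computed from the data; a sketch, assuming the 1340 above was chosen to mirror the good:bad ratio already present in accepts:

# Derive the number of good rejects to sample so rejects matches
# the good:bad ratio in accepts
n_bad_rejects = (rejects['bad_ind'] == 1).sum()
good_bad_ratio = (accepts['bad_ind'] == 0).sum() / (accepts['bad_ind'] == 1).sum()
n_good_sample = int(n_bad_rejects * good_bad_ratio)  # rows of good rejects to sample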

2. Coarse variable screening

# bankruptcy_ind -- prior bankruptcy flag: N = no, Y = yes
bankruptcy_dict = {'N': 0, 'Y': 1}  # map to 0/1
data.bankruptcy_ind = data.bankruptcy_ind.map(bankruptcy_dict)

# Cap outliers in vehicle_year with quantile capping, then convert the year into an age

# Clip values below the 0.1 quantile up to it; clip values above the 0.99 quantile down to it
year_min = data.vehicle_year.quantile(0.1)
year_max = data.vehicle_year.quantile(0.99)
data.vehicle_year = data.vehicle_year.map(lambda x: year_min if x <= year_min else x)
data.vehicle_year = data.vehicle_year.map(lambda x: year_max if x >= year_max else x)
data.vehicle_year = data.vehicle_year.map(lambda x: 2018 - x)
data.drop(['vehicle_make'], axis=1, inplace=True)  # drop the vehicle manufacturer column
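The two map calls amount to winsorizing; pandas' built-in clip does the same capping in one step:

# Equivalent capping with Series.clip
data.vehicle_year = data.vehicle_year.clip(lower=year_min, upper=year_max)
data.vehicle_year = 2018 - data.vehicle_year  # convert year to vehicle age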

# Fill missing values

# KNN imputation via fancyimpute was the original plan; mean imputation is used instead
data_filled = data.fillna(data.mean())
data_filled.columns = data.columns

Define X and y

X = data_filled[['age_oldest_tr', 'bankruptcy_ind', 'down_pyt', 'fico_score',
                 'loan_amt', 'loan_term', 'ltv', 'msrp', 'purch_price',
                 'rev_util', 'tot_derog', 'tot_income', 'tot_open_tr',
                 'tot_rev_debt', 'tot_rev_line', 'tot_rev_tr', 'tot_tr',
                 'used_ind', 'veh_mileage', 'vehicle_year']]
y = data_filled['bad_ind']

# Coarse variable screening

# Screen variables with a random forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=5, random_state=0)
clf.fit(X, y)
# Rank the feature importances and keep the top 9 as the coarse screen
importances = list(clf.feature_importances_)
importances_order = importances.copy()
importances_order.sort(reverse=True)
cols = list(X.columns)
col_top = []
for i in importances_order[:9]:
    col_top.append((i, cols[importances.index(i)]))
col_top

[(0.32921535609407487, 'fico_score'),
 (0.12722011801837413, 'age_oldest_tr'),
 (0.10428283609878117, 'ltv'),
 (0.084528506996671832, 'tot_derog'),
 (0.074201234487731263, 'rev_util'),
 (0.071344607737941074, 'tot_tr'),
 (0.067959721613501806, 'tot_rev_line'),
 (0.027759028579637572, 'msrp'),
 (0.01973823706017484, 'tot_rev_debt')]

col = [i[1] for i in col_top]
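The same top-9 selection can be written more compactly with a pandas Series, which keeps the importances and column names together:

# Equivalent, more idiomatic top-9 selection
imp = pd.Series(clf.feature_importances_, index=X.columns)
col = imp.nlargest(9).index.tolist()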

3. Fine variable screening and data cleaning

# The WoE package comes from GitHub. In its source,
#     cuts, bins = pd.qcut(df["X"], self.qnt_num, retbins=True, labels=False)
# can raise an error; add duplicates='raise' or duplicates='drop' to the pd.qcut call.
from WoE import *
import warnings
warnings.filterwarnings("ignore")
iv_c = {}
for i in col:
    try:
        iv_c[i] = WoE(v_type='c').fit(data_filled[i], data_filled['bad_ind']).optimize().iv()
    except:
        print(i)
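For reference, the quantities the WoE package reports can be computed by hand: for each bin i, WoE_i = ln((good_i / good_total) / (bad_i / bad_total)), and IV sums (good share - bad share) times WoE over the bins. A small pandas sketch, independent of the package's API (this is one common sign convention; the package may use the opposite):

# WoE and IV by hand for one decile-binned variable
def woe_iv(binned, target):
    tab = pd.crosstab(binned, target)   # rows: bins; columns: 0 (good) / 1 (bad)
    good = tab[0] / tab[0].sum()        # share of goods falling in each bin
    bad = tab[1] / tab[1].sum()         # share of bads falling in each bin
    woe = np.log(good / bad)
    iv = ((good - bad) * woe).sum()
    return woe, iv

bins = pd.qcut(data_filled['fico_score'], 10, duplicates='drop')
woe, iv = woe_iv(bins, data_filled['bad_ind'])
print(iv)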

# WoE transformation of the variables: binning and standardization

woe_a = data_filled[col].apply(
    lambda x: WoE(v_type='c').fit(x, data_filled['bad_ind'])
                             .optimize()
                             .fit_transform(x, data_filled['bad_ind']))

4. Building the classification model

from sklearn.model_selection import train_test_split  # sklearn.cross_validation is long deprecated
X = woe_a  # the WoE-transformed features (named WOE_c in the original notebook)
y = data_filled['bad_ind']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build a logistic regression model to predict the probability of default

import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, classification_report

# liblinear is the solver that supports the l1 penalty
lr = LogisticRegression(C=1, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)
# Compute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
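The imports above already pull in classification_report, which shows per-class precision and recall at a glance and complements the single recall figure:

# Per-class precision/recall summary
print(classification_report(y_test, y_pred))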

# Plot the confusion matrix

# matplotlib must be imported before the def: plt.cm.Blues is evaluated in the signature
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Plot the non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# Accuracy: (1744 + 44) / (1744 + 44 + 433 + 38) = 79.1%

Recomputing with a cost-sensitive class weight

lr = LogisticRegression(C=1, penalty='l1', solver='liblinear', class_weight='balanced')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)
# Recompute the confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
# Accuracy: (1170 + 366) / (1170 + 622 + 366 + 111) = 68%

Model validation

# Validate the model with an ROC curve
from sklearn.metrics import roc_curve, auc
fpr, tpr, threshold = roc_curve(y_test, y_pred, drop_intermediate=False)  # true/false positive rates
roc_auc = auc(fpr, tpr)  # area under the curve
lw = 2
plt.figure(figsize=(10, 10))
plt.plot(fpr, tpr, color='darkorange', lw=lw,
         label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

# Plot the KS curve from the tpr/fpr computed by sklearn.metrics' roc_curve

fig, ax = plt.subplots()
# The KS curve is ordered by descending predicted probability, hence the 1 - threshold mirror
ax.plot(1 - threshold, tpr, label='tpr')
ax.plot(1 - threshold, fpr, label='fpr')
ax.plot(1 - threshold, tpr - fpr, label='KS')
plt.xlabel('score')
plt.title('KS Curve')
legend = ax.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.show()
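The KS statistic itself is simply the maximum vertical gap between the two curves:

# KS statistic: maximum distance between the TPR and FPR curves
ks = (tpr - fpr).max()
print('KS = %.3f' % ks)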

Scorecard development

# Compute the score for each level of each variable
n = 0
for i in X.columns:
    temp = WoE(v_type='c').fit(data_filled[i], data_filled['bad_ind']).optimize().bins
    temp['name'] = [i] * len(temp)
    if n == 0:
        scorecard = temp.copy()
    else:
        scorecard = pd.concat([scorecard, temp], axis=0)
    n += 1
# 28.8539 = 20 / ln(2): a 20-point increase doubles the good:bad odds (PDO = 20)
scorecard['score'] = scorecard['woe'].map(lambda x: -int(np.ceil(28.8539 * x)))
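The two magic numbers come from the standard odds-based score scaling: the factor 28.8539 equals PDO / ln 2 with PDO = 20, and the 513 added later plays the role of the base score. A sketch of the arithmetic; the base odds value here is an assumption for illustration only:

# Standard scorecard scaling: score = offset - factor * ln(odds)
PDO = 20                       # points needed to double the odds
factor = PDO / np.log(2)       # = 28.8539..., the multiplier used above
base_score = 513               # the constant added to each sample's score below
base_odds = 50                 # assumed good:bad odds at the base score
offset = base_score - factor * np.log(base_odds)
print(round(factor, 4), round(offset, 1))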

# Compute the score of every sample in the original data table

# Bin-to-score conversion functions, one per selected variable
def fico_score_cnvnt(x):
    return -21 if x < 6.657176e+02 else 16

def age_oldest_tr_cnvnt(x):
    return -9 if x < 1.618624e+02 else 20

def rev_util_cnvnt(x):
    return 7 if x < 7.050000e+01 else -19

def ltv_cnvnt(x):
    return 16 if x < 9.450000e+01 else -8

def tot_tr_cnvnt(x):
    if x < 1.085218e+01:
        return -13
    elif x < 1.330865e+01:
        return -4
    elif x < 1.798767e+01:
        return 3
    else:
        return 11

def tot_rev_line_cnvnt(x):
    return -12 if x < 1.201000e+04 else 19

def tot_derog_cnvnt(x):
    return 8 if x < 1.072596e+00 else -13

def purch_price_cnvnt(x):
    return -5 if x < 1.569685e+04 else 3

def tot_rev_debt_cnvnt(x):
    return -2 if x < 1.024000e+04 else 8

func = [fico_score_cnvnt, age_oldest_tr_cnvnt, rev_util_cnvnt, ltv_cnvnt,
        tot_tr_cnvnt, tot_rev_line_cnvnt, tot_derog_cnvnt, purch_price_cnvnt,
        tot_rev_debt_cnvnt]

Computing the scores

X_score_dict = {i: j for i, j in zip(X.columns, func)}
X_score = data_filled[X.columns].copy()
for i in X_score.columns:
    X_score[i] = X_score[i].map(X_score_dict[i])
X_score['SCORE'] = X_score[X.columns].apply(lambda x: sum(x) + 513, axis=1)  # 513 is the base score
X_score_label = pd.concat([X_score, data_filled['bad_ind']], axis=1)
X_score_label.head()

# Score distributions for good (0) and bad (1) accounts
import seaborn as sns
fig, ax = plt.subplots()
ax1 = sns.kdeplot(X_score_label[X_score_label['bad_ind'] == 1]['SCORE'], label='1')
ax2 = sns.kdeplot(X_score_label[X_score_label['bad_ind'] == 0]['SCORE'], label='0')
plt.show()

5. Summary and outlook

Following the principles of scorecard development, this article went from data preprocessing through modeling to build a simple automated credit scoring system.

An AI-based machine-learning scorecard system can be made considerably stronger by discarding stale data (say, older than two years), automatically re-fitting and re-evaluating the model, and continuously refining the feature variables.

6. References

Hellobi Live | 1小時學會建立信用評分卡 (Learn to Build a Credit Scorecard in 1 Hour: A Small Analysis of Financial Data with Python)

