信用評分卡建模分析——基於R語言

信用評分技術是一種應用統計模型,在金融系統中,其作用是對貸款申請人(信用卡申請人)做風險評估分值的方法。信用評分卡可以根據客戶提供的資料、客戶的歷史數據、第三方平台(芝麻分、京東、微信等)的數據,對客戶的信用進行評估。信用評分卡的建立是以對大量數據多維度的統計分析結果為基礎,具有相當高的準確性和可靠性

本文通過對kaggle上的Give Me Some Credit數據的挖掘分析,結合信用評分卡的建立原理,從數據的預處理建模分析創建信用評分卡建立自動評分系統,創建了一個簡單的信用評分系統。並對建立基於AI 的機器學習評分卡系統的路徑進行推測。

一、工作原理

客戶申請信用評分卡是一種統計模型,它可基於對當前申請人的各項資料進行評估並給出一個分數,該評分能定量對申請人的償債能力作出預判。

客戶申請評分卡由一系列特徵項組成,每個特徵項相當於申請表上的一個問題(例如,年齡、銀行流水、收入等)。每一個特徵項都有一系列可能的屬性,相當於每一個問題的一系列可能答案(例如,對於年齡這個問題,答案可能就有30歲以下、30到45等)。在開發評分卡系統模型中,先確定屬性與申請人未來信用表現之間的相互關係,然後給屬性分配適當的分數權重,分配的分數權重要反映這種相互關係。分數權重越大,說明該屬性表示的信用表現越好。一個申請的得分是其屬性分值的簡單求和。如果申請人的信用評分大於等於金融放款機構所設定的界限分數,此申請處於可接受的風險水平並將被批准;低於界限分數的申請人將被拒絕或給予標示以便進一步審查。

二、數據預處理

2.1載入相關的包

#載入需要的R包library(readr) # File read / writelibrary(ggplot2) # Data visualizationlibrary(ggthemes) # Data visualizationlibrary(scales) # Data visualizationlibrary(plyr)# Data manipulationlibrary(stringr) # String manipulationlibrary(InformationValue) # IV / WOE calculationlibrary(MLmetrics) # Mache learning metrics.e.g. Recall, Precision, Accuracy, AUClibrary(rpart) # Decision tree utilslibrary(randomForest) # Random Forestlibrary(dplyr) # Data manipulationlibrary(e1071) # SVMlibrary(Amelia) # Missing value utilslibrary(party) # Conditional inference treeslibrary(gbm) # AdaBoostlibrary(class) # KNNlibrary(mice)#patternlibrary(grid)library(DMwR)#knnImputationlibrary(corrplot)#corrplotlibrary(iterators)#迭代,caret依賴包library(caret)#createDataPartition(數據分割功能)library(pROC)#modelroc(用於分類器比較)library(VIM) #matrixplotlibrary(GGally)library(lattice)library(MASS)library(memisc)library(Rcpp)#使用c++語言library(gridExtra)library(tidyr)library(splines) #數據差值包

2.2載入數據

#設定路徑setwd(E:/CrediCard)#載入數據cs_training <- read_csv("cs-training.csv")#去掉第一列cs_training<- cs_training[,-1]

2.3數據預處理

2.3.1缺失值處理

#變數重命名names(cs_training)<-c("y","x1","x2","x3","x4", "x5","x6","x7","x8","x9","x10")#查看數據集缺失數據missmap(cs_training,main = "Missing values vs observed")#具體缺失參數,空值sapply(cs_training,function(z)sum(is.na(z)))#缺失值級聯表md.pattern(cs_training)

利用missmap函數對缺失值部分進行可視化展示,可以看到變數x5和x10變數有缺失值,即MonthlyIncome變數和NumberofDependents兩個變數存在缺失值,具體確實情況可以見上表,MonthlyIncome列共有缺失值29731個,NumberofDependents有3924個。

#x5(MonthlyIncome)缺失值處理(使用中位數)cs_training$x5 <- na.roughfix(cs_training$x5)#x10(NumberOfDependents)3924個缺失值,所佔比重3924/150000不大,故直接刪除cs_training <- cs_training[!is.na(cs_training$x10),]sapply(cs_training,function(z)sum(is.na(z)))

2.3.2異常值處理

首先對於x2變數,即客戶的年齡,我們可以定量分析,發現有以下值:

#對x2變數(客戶的年齡)定量分析unique(cs_training$x2)

年齡中存在0值,顯然是異常值,予以剔除

cs_training<-cs_training[-which(cs_training$x2==0),]boxplot(cs_training$x5)

而對於x3,x7,x9三個變數,由下面的箱線圖可以看出,均存在異常值,且由unique函數可以得知均存在96、98兩個異常值,因此予以剔除。同時會發現剔除其中一個變數的96、98值,其他變數的96、98兩個值也會相應被剔除。

#繪製x3,x7,x9三個變數的箱線圖

boxplot(cs_training$x3,cs_training$x7,cs_training$x9)

#對x3,x7,x9變數定量分析unique(cs_training$x3)unique(cs_training$x7)unique(cs_training$x9)

去掉異常值96和98,因為有96和98值的x3、x7、x9是在同一行,所以動一個變數即可

cs_training<-cs_training[-which(cs_training$x3==96),]cs_training<-cs_training[-which(cs_training$x3==98),]

其它變數暫不作處理。

2.3.3單變數分析

x2(age,年齡)分布

ggplot(cs_training, aes(x = x2, y = ..density..)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2) + geom_density()

可以看到年齡變數大致呈正態分布,符合統計分析的假設。

x5(MonthlyIncome,月收入)分布:

ggplot(cs_training, aes(x = x5, y = ..density..)) + geom_histogram(fill = "blue", colour = "grey60", size = 0.2, alpha = 0.2) + geom_density() + xlim(1, 20000)

月收入也大致呈正態分布,符合統計分析的需要。

2.3.4變數相關性分析

cor1<-cor(cs_training[,1:11])corrplot(cor1,method = "number")

由上圖可以看出,各變數之間的相關性是非常小的。其實Logistic回歸同樣需要檢驗多重共線性問題,不過此處由於各變數之間的相關性較小,可以初步判斷不存在多重共線性問題,當然我們在建模後還可以用VIF(方差膨脹因子)來檢驗多重共線性問題。如果存在多重共線性,即有可能存在兩個變數高度相關,需要降維或剔除處理。

2.4切分數據集

table(cs_training$y)

由上表看出,對於響應變數SeriousDlqin2yrs,存在明顯的類失衡問題,SeriousDlqin2yrs等於1的觀測為9712,僅為所有觀測值的6.6%。因此我們需要對非平衡數據進行處理,在這裡可以採用SMOTE演算法,用R對稀有事件進行超級採樣。

我們利用caret包中的createDataPartition(數據分割功能)函數將數據隨機分成相同的兩份。

#採用SMOTE演算法,用R對稀有事件進行超級採樣set.seed(1234) splitIndex<-createDataPartition(cs_training$y,time=1, p=0.5,list=FALSE) train<-cs_training[splitIndex,] test<-cs_training[-splitIndex,] prop.table(table(train$y))prop.table(table(test$y))

兩者的分類結果是平衡的,仍然有6.6%左右的代表,我們仍然處於良好的水平。因此可以採用這份切割的數據進行建模及預測。

2.5特徵變數選擇

全變數建模

2.5.1基本公式:略

2.5.2建立模型

fit<-glm(y~.,train,family = "binomial")summary(fit)

可以看出,利用全變數進行回歸,模型擬合效果並不是很好,其中x1,x4,x6三個變數的p值未能通過檢驗,在此直接剔除這三個變數,利用剩餘的變數對y進行回歸。

fit2<-glm(y~x2+x3+x5+x7+x8+x9+x10,train,family = "binomial")summary(fit2)

第二個回歸模型所有變數都通過了檢驗,甚至AIC值(赤池信息準則)更小,所以特徵變數選擇x2+x3+x5+x7+x8+x9+x10。

2.6數據分箱

(1)age

cutx2= c(-Inf,30,35,40,45,50,55,60,65,75,Inf)plot(cut(train$x2,cutx2))

(2)NumberOfTime30-59DaysPastDueNotWorse變數(x3):

cutx3 = c(-Inf,0,1,3,5,Inf)plot(cut(train$x3,cutx3))

(3)MonthlyIncome變數(x5):

cutx5 = c(-Inf,1000,2000,3000,4000,5000,6000,7500,9500,12000,Inf)plot(cut(train$x5,cutx5))

(4)NumberOfTimes90DaysLate變數(x7):

cutx7 = c(-Inf,0,1,3,5,10,Inf)plot(cut(train$x7,cutx7))

(5)NumberRealEstateLoansOrLines變數(x8):

cutx8= c(-Inf,0,1,2,3,5,Inf)plot(cut(train$x8,cutx8))

(6)NumberOfTime60-89DaysPastDueNotWorse變數(x9):

cutx9 = c(-Inf,0,1,3,5,Inf)plot(cut(train$x9,cutx9))

(7)NumberofDependents變數(x10):

cutx10 = c(-Inf,0,1,2,3,5,Inf)plot(cut(train$x10,cutx10))

3.建模分析

3.1建模

fit2<-glm(y~x2+x3+x5+x7+x8+x9+x10,train,family = "binomial")summary(fit2)

3.2模型評估

下面首先利用模型對test數據進行預測,生成概率預測值pre <- predict(fit2,test)量化模型預測效果modelroc <- roc(test$y,pre)plot(modelroc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2), grid.col=c("green", "red"), max.auc.polygon=TRUE, auc.polygon.col="skyblue", print.thres=TRUE)

在R中,可以利用pROC包,它能方便比較兩個分類器,還能自動標註出最優的臨界點,圖看起來也比較漂亮。在下圖中最優點FPR=1-TNR=0.842,TPR=0.639,AUC值為0.811,說明該模型的預測效果還是不錯的,正確率較高。

3.3特徵屬性woe計算

證據權重(Weight of Evidence,WOE)轉換可以將Logistic回歸模型轉變為標準評分卡格式。引入WOE轉換的目的並不是為了提高模型質量,只是一些變數不應該被納入模型,這或者是因為它們不能增加模型值,或者是因為與其模型相關係數有關的誤差較大,其實建立標準信用評分卡也可以不採用WOE轉換。這種情況下,Logistic回歸模型需要處理更大數量的自變數。儘管這樣會增加建模程序的複雜性,但最終得到的評分卡都是一樣的。

用WOE(x)替換變數x。WOE()=ln[(違約/總違約)/(正常/總正常)]。

通過上述的Logistic回歸,剔除x1,x4,x6三個變數,對剩下的變數進行WOE轉換。

#計算WOE的函數totalgood = as.numeric(table(train$y))[1]totalbad = as.numeric(table(train$y))[2]getWOE <- function(a,p,q){ Good <- as.numeric(table(train$y[a > p & a <= q]))[1] Bad <- as.numeric(table(train$y[a > p & a <= q]))[2] WOE <- log((Bad/totalbad)/(Good/totalgood),base = exp(1)) return(WOE)}

比如age變數(x2)

Agelessthan30.WOE=getWOE(train$x2,-Inf,30)Age30to35.WOE=getWOE(train$x2,30,35)Age35to40.WOE=getWOE(train$x2,35,40)Age40to45.WOE=getWOE(train$x2,40,45)Age45to50.WOE=getWOE(train$x2,45,50)Age50to55.WOE=getWOE(train$x2,50,55)Age55to60.WOE=getWOE(train$x2,55,60)Age60to65.WOE=getWOE(train$x2,60,65)Age65to75.WOE=getWOE(train$x2,65,75)Agemorethan.WOE=getWOE(train$x2,75,Inf)age.WOE=c(Agelessthan30.WOE,Age30to35.WOE,Age35to40.WOE, Age40to45.WOE,Age45to50.WOE,Age50to55.WOE, Age55to60.WOE,Age60to65.WOE,Age65to75.WOE, Agemorethan.WOE)age.WOE

類似地:

NumberOfTime30-59DaysPastDueNotWorse變數(x3)

PastDuelessthan0.WOE=getWOE(train$x3,-Inf,0)PastDue0to1.WOE=getWOE(train$x3,0,1)PastDue1to3.WOE=getWOE(train$x3,1,3)PastDue3to5.WOE=getWOE(train$x3,3,5)PastDuemorethan.WOE=getWOE(train$x3,5,Inf)PastDue.WOE=c(PastDuelessthan0.WOE,PastDue0to1.WOE, PastDue1to3.WOE,PastDue3to5.WOE, PastDuemorethan.WOE)PastDue.WOE

MonthIncome變數(x5)

MonthIncomelessthan1000.WOE=getWOE(train$x5,-Inf,1000)MonthIncome1000to2000.WOE=getWOE(train$x5,1000,2000)MonthIncome2000to3000.WOE=getWOE(train$x5,2000,3000)MonthIncome3000to4000.WOE=getWOE(train$x5,3000,4000)MonthIncome4000to5000.WOE=getWOE(train$x5,4000,5000)MonthIncome5000to6000.WOE=getWOE(train$x5,5000,6000)MonthIncome6000to7500.WOE=getWOE(train$x5,6000,7500)MonthIncome7500to9500.WOE=getWOE(train$x5,7500,9500)MonthIncome9500to12000.WOE=getWOE(train$x5,9500,12000)MonthIncomemorethan.WOE=getWOE(train$x5,12000,Inf)MonthIncome.WOE=c(MonthIncomelessthan1000.WOE,MonthIncome1000to2000.WOE, MonthIncome2000to3000.WOE,MonthIncome3000to4000.WOE, MonthIncome4000to5000.WOE,MonthIncome5000to6000.WOE, MonthIncome6000to7500.WOE,MonthIncome7500to9500.WOE, MonthIncome9500to12000.WOE,MonthIncomemorethan.WOE)MonthIncome.WOE

NumberOfTime90DaysPastDueNotWorse變數(x7)

Days90PastDuelessthan0.WOE = getWOE(train$x7,-Inf,0)Days90PastDue0to1.WOE=getWOE(train$x7,0,1)Days90PastDue1to3.WOE=getWOE(train$x7,1,3)Days90PastDue3to5.WOE=getWOE(train$x7,3,5)Days90PastDue5to10.WOE=getWOE(train$x7,5,10)Days90sPastDuemorethan.WOE=getWOE(train$x7,10,Inf)Days90sPastDue.WOE=c(Days90PastDuelessthan0.WOE,Days90PastDue0to1.WOE, Days90PastDue1to3.WOE,Days90PastDue3to5.WOE, Days90PastDue5to10.WOE,Days90sPastDuemorethan.WOE)Days90sPastDue.WOE

NumberRealEstateLoansOrLines變數(x8)

RealEstatelessthan0.WOE = getWOE(train$x8,-Inf,0)RealEstate0to1.WOE=getWOE(train$x8,0,1)RealEstate1to2.WOE=getWOE(train$x8,1,2)RealEstate2to3.WOE=getWOE(train$x8,2,3)RealEstate3to5.WOE=getWOE(train$x8,3,5)RealEstatemorethan.WOE=getWOE(train$x8,5,Inf)RealEstate.WOE=c(RealEstatelessthan0.WOE,RealEstate0to1.WOE, RealEstate1to2.WOE,RealEstate2to3.WOE, RealEstate3to5.WOE,RealEstatemorethan.WOE)RealEstate.WOE

NumberOfTime60.89DaysPastDueNotWorse變數(x9)

Days60.89PastDuelessthan0.WOE = getWOE(train$x9,-Inf,0)Days60.89PastDue0to1.WOE=getWOE(train$x9,0,1)Days60.89PastDue1to3.WOE=getWOE(train$x9,1,3)Days60.89PastDue3to5.WOE=getWOE(train$x9,3,5)Days60.89PastDuemorethan.WOE=getWOE(train$x9,5,Inf)Days60.89PastDue.WOE=c(Days60.89PastDuelessthan0.WOE,Days60.89PastDue0to1.WOE, Days60.89PastDue1to3.WOE,Days60.89PastDue3to5.WOE, Days60.89PastDuemorethan.WOE)Days60.89PastDue.WOE

NumberOfDependents變數(x10)

Dependentslessthan0.WOE = getWOE(train$x10,-Inf,0)Dependents0to1.WOE=getWOE(train$x10,0,1)Dependents1to2.WOE=getWOE(train$x10,1,2)Dependents2to3.WOE=getWOE(train$x10,2,3)Dependents3to5.WOE=getWOE(train$x10,3,5)Dependentsmorethan.WOE=getWOE(train$x10,5,Inf)Dependents.WOE=c(Dependentslessthan0.WOE,Dependents0to1.WOE, Dependents1to2.WOE,Dependents2to3.WOE, Dependents3to5.WOE,Dependentsmorethan.WOE)Dependents.WOE

3.4評分卡創建

3.4.1創建評分標準

依據以上論文資料(《信用評分卡模型的建立_黎玉華》)得到:

a=log(p_good/P_bad)

Score = offset + factor * log(odds)

下面開始設立評分,假設按好壞比15為600分,每高20分好壞比翻一倍算出factor,offset。如果後期結果不明顯,可以高30-50分好壞比才翻一倍。

620 = offset + factor * log(15*2,base = 10)600 = offset + factor * log(15)factor <- 20/(log(30,base = 10)-log(15,base = 10))offset <- 600-factor*log(15,base = 10)#個人總評分=基礎分+各部分得分#基礎分為:baseScore <- a*factor+offset

3.4.2評分卡創建

#構造計算分值函數:getscore<-function(i,x){ score = round(factor*as.numeric(coe[i])*x,0) return(score)}

評分卡

4.建立自動評分系統

自動評分系統示意代碼

#計算每一個借款人的信用評分#agescore.age <- 0for(i in 1:nrow(train)) { if(train$x2[i] <= 30) score.age[i] <- Agelessthan30.SCORE else if(train$x2[i] <= 35) score.age[i] <- Age30to35.SCORE else if(train$x2[i] <= 40) score.age[i] <- Age35to40.SCORE else if(train$x2[i] <= 45) score.age[i] <- Age40to45.SCORE else if(train$x2[i] <= 50) score.age[i] <- Age45to50.SCORE else if(train$x2[i] <= 55) score.age[i] <- Age50to55.SCORE else if(train$x2[i] <= 60) score.age[i] <- Age55to60.SCORE else if(train$x2[i] <= 65) score.age[i] <- Age60to65.SCORE else if(train$x2[i] <= 75) score.age[i] <- Age65to75.SCORE else score.age[i] <- Agemorethan.SCORE}train$score.agetable(score.age)score.age[1:10]#NumberOfTime30-59DaysPastDueNotWorse變數(x3)score.PastDue <- 0for(i in 1:nrow(train)) { if(train$x3[i] <= 0) score.PastDue[i] <- PastDuelessthan0.SCORE else if(train$x3[i] <= 1) score.PastDue[i] <- PastDue0to1.SCORE else if(train$x3[i] <= 3) score.PastDue[i] <- PastDue1to3.SCORE else if(train$x3[i] <= 5) score.PastDue[i] <- PastDue3to5.SCORE else score.PastDue[i] <- PastDuemorethan.SCORE}table(score.PastDue)#MonthIncome變數(x5)score.MonthIncome <- 0for(i in 1:nrow(train)) { if(train$x5[i] <= 1000) score.MonthIncome[i] <- MonthIncomelessthan1000.SCORE else if(train$x5[i] <= 2000) score.MonthIncome[i] <- MonthIncome1000to2000.SCORE else if(train$x5[i] <= 3000) score.MonthIncome[i] <- MonthIncome2000to3000.SCORE else if(train$x5[i] <= 4000) score.MonthIncome[i] <- MonthIncome3000to4000.SCORE else if(train$x5[i] <= 5000) score.MonthIncome[i] <- MonthIncome4000to5000.SCORE else if(train$x5[i] <= 6000) score.MonthIncome[i] <- MonthIncome5000to6000.SCORE else if(train$x5[i] <= 7500) score.MonthIncome[i] <- MonthIncome6000to7500.SCORE else if(train$x5[i] <= 9500) score.MonthIncome[i] <- MonthIncome7500to9500.SCORE else if(train$x5[i] <= 12000) score.MonthIncome[i] <- MonthIncome9500to12000.SCORE else score.MonthIncome[i] <- MonthIncomemorethan.SCORE}table(score.MonthIncome)score.MonthIncome[1:10]#NumberOfTimes90DaysLate變數(x7)score.Days90PastDue <- 0for(i in 1:nrow(train)) { if(train$x7[i] <= 0) score.Days90PastDue[i] <- Days90PastDuelessthan0.SCORE else if(train$x7[i] <= 1) score.Days90PastDue[i] <- Days90PastDue0to1.SCORE else if(train$x7[i] <= 3) score.Days90PastDue[i] <- Days90PastDue1to3.SCORE else if(train$x7[i] <= 5) score.Days90PastDue[i] <- Days90PastDue3to5.SCORE else if(train$x7[i] <= 10) score.Days90PastDue[i] <- Days90PastDue5to10.SCORE else score.Days90PastDue[i] <- Days90sPastDuemorethan.SCORE}table(score.Days90PastDue)#NumberRealEstateLoansOrLines變數(x8)score.RealEstate <- 0for(i in 1:nrow(train)) { if(train$x8[i] <= 0) score.RealEstate[i] <- RealEstatelessthan0.SCORE else if(train$x8[i] <= 1) score.RealEstate[i] <- RealEstate0to1.SCORE else if(train$x8[i] <= 2) score.RealEstate[i] <- RealEstate1to2.SCORE else if(train$x8[i] <= 3) score.RealEstate[i] <- RealEstate2to3.SCORE else if(train$x8[i] <= 5) score.RealEstate[i] <- RealEstate3to5.SCORE else score.RealEstate[i] <- RealEstatemorethan.SCORE}table(score.RealEstate)#NumberOfTime60.89DaysPastDueNotWorse變數(x9)score.Days60.89PastDue <- 0for(i in 1:nrow(train)) { if(train$x9[i] <= 0) score.Days60.89PastDue[i] <- Days60.89PastDuelessthan0.SCORE else if(train$x9[i] <= 1) score.Days60.89PastDue[i] <- Days60.89PastDue0to1.SCORE else if(train$x9[i] <= 3) score.Days60.89PastDue[i] <- Days60.89PastDue1to3.SCORE else if(train$x9[i] <= 5) score.Days60.89PastDue[i] <- Days60.89PastDue3to5.SCORE else score.Days60.89PastDue[i] <- Days60.89PastDuemorethan.SCORE}table(score.Days60.89PastDue)#NumberOfDependents變數(x10)score.Dependents <- 0for(i in 1:nrow(train)) { if(train$x8[i] <= 0) score.Dependents[i] <- Dependentslessthan0.SCORE else if(train$x8[i] <= 1) score.Dependents[i] <- Dependents0to1.SCORE else if(train$x8[i] <= 2) score.Dependents[i] <- Dependents1to2.SCORE else if(train$x8[i] <= 3) score.Dependents[i] <- Dependents2to3.SCORE else if(train$x8[i] <= 5) score.Dependents[i] <- Dependents3to5.SCORE else score.Dependents[i] <- Dependentsmorethan.SCORE}table(score.Dependents)#計算每個人的信用評分#baseScore <- a*factor+offsetcreditScore<-0for(i in 1:nrow(train)){ creditScore[i]<-score.age[i]+score.PastDue[i]+score.MonthIncome[i]+ score.Days90PastDue[i]+score.RealEstate[i]+score.Days60.89PastDue[i]+ score.Dependents[i]+baseScore}train$creditScore<-round(creditScore,0)

自動評分系統可以批量計算信用評分

5.總結及展望

本文通過對kaggle上的Give Me Some Credit數據的挖掘分析,結合信用評分卡的建立原理,從數據的預處理建模分析創建信用評分卡建立自動評分系統,創建了一個簡單的信用評分系統。

基於AI 的機器學習評分卡系統可通過把舊數據(某個時間點後,例如2年)剔除掉後再進行自動建模模型評估、並不斷優化特徵變數,使得系統更加強大。


推薦閱讀:

2018數據分析實踐計劃
No4:我為什麼爬取各種數據?
沫小姐學數據分析之Python入門篇
2017年3D列印行業大數據報告,3D列印品牌數據分析
重走數據分析之路 R In Action

TAG:數據分析 | R編程語言 | Kaggle |