Personal Notes on a Data Processing Workflow (Part 1)

I recently built a binary classification model for customer churn early warning, so here is a summary. Every module covered here has already been explained in much greater depth by experts; this post is just a reference for my own work. If anything is wrong, corrections and suggestions are very welcome, thanks!

Load the required packages and their dependencies:

library(readxl)       # read excel data
library(writexl)      # write xlsx files
library(lattice)
library(mice)         # process NA data
library(colorspace)
library(data.table)   # more efficient data processing
library(grid)
library(VIM)          # display NA data
library(plyr)         # tidy data
library(dplyr)        # tidy data
library(corrplot)     # correlation coefficient visualization
library(ggplot2)      # ggplot()
library(magrittr)     # pipe operator
library(ggpubr)       # ggviolin()
library(dfexplore)    # check NA data
library(boot)
library(pastecs)      # detailed summary stats
library(gmodels)      # build contingency tables
library(DT)           # datatable() interactive table
library(randomForest) # random forest algorithm
library(caret)        # model evaluation; createDataPartition() for stratified sampling
library(Matrix)
library(foreach)
library(glmnet)       # generalized linear models
library(ISLR)
library(e1071)        # naiveBayes()
library(pROC)         # plot ROC curves
library(kernlab)
library(Rmisc)        # multiplot(): split the plotting area
library(gridExtra)    # grid.arrange()
library(Hmisc)        # separate/bin data
library(proto)        # smbinning dependency
library(gsubfn)
library(RSQLite)      # SQL syntax support
library(sqldf)        # count Inf and -Inf when calling smbinning
library(partykit)     # partykit::ctree()
library(Formula)      # transform formulas
library(smbinning)    # optimal binning for scoring modeling
library(xts)
library(zoo)
library(PerformanceAnalytics) # correlation between numeric variables
library(DMwR)         # missing-value imputation and smote()
library(rpart)
library(rpart.plot)   # plot decision trees

Visual exploration of missing values:

mice::md.pattern(data)
VIM::aggr(data)
dfexplore::dfplot(data)
sapply(data, function(x) table(is.na(x)))

# extract the indices of numeric and factor variables
numeric.idx <- which(sapply(repayment, is.numeric))
factor.idx  <- which(sapply(repayment, is.factor))

Missing data falls into two types:

  1. MCAR: missing completely at random. This is the ideal situation for missing data.
  2. MNAR: missing not at random. This is a more serious problem. In this case you may need to examine the data collection process and try to understand why the data are missing. For example, if most respondents skip a particular question in a survey, why do they do so? Is the question unclear?

marginplot(data[c(1,2)]) uses a box plot to explore the missingness pattern; its drawback is that only two variables can be plotted at a time.

Handling missing values (assess how the values are missing and combine this with the business context):

(1) Impute with the median, mean, or mode;

(2) kNN imputation: DMwR::knnImputation() or caret::preProcess(x, method = "knnImpute", na.remove = TRUE, k = 5); a short usage sketch for both approaches follows the notes below.

Advantages of kNN imputation:

1. It can impute both qualitative and quantitative features;

2. It places no requirements on the labels associated with the attributes;

3. It takes the correlation structure of the data into account;

4. Attributes with multiple missing values are easy to handle.

Disadvantages:

1. On large data sets, searching for the nearest neighbours is very time-consuming;

2. The choice of k matters a great deal: a k that is too high pulls in attributes that are significantly different from what we need, while a k that is too low means important attributes are missed.
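
A minimal sketch of both approaches, assuming a data frame df with a numeric column age and a factor column grade that contain NAs (all names hypothetical):

# (1) simple imputation: median for a numeric column, mode for a factor column
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)
mode.value <- names(which.max(table(df$grade)))
df$grade[is.na(df$grade)] <- mode.value

# (2) kNN imputation with DMwR (uses the k nearest complete rows)
df.knn <- DMwR::knnImputation(df, k = 5)

# caret alternative: preProcess imputes (and centres/scales) the numeric columns only
pp <- caret::preProcess(df, method = "knnImpute", k = 5)
df.caret <- predict(pp, df)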

——————————————— Feature Exploration ———————————————

############# Categorical / factor variables #############

# Summary statistics for a categorical variable
get.categorical.variable.stats <- function(df, indep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  feature.name <- indep_var
  df1 <- data.frame(table(indep.var))
  colnames(df1) <- c(feature.name, "Frequency")
  df2 <- data.frame(prop.table(table(indep.var)))
  colnames(df2) <- c(feature.name, "Proportion")
  df <- merge(df1, df2, by = feature.name)
  ndf <- df[order(-df$Frequency), ]
  if (names(dev.cur()) != "null device") {
    dev.off()
  }
  gridExtra::grid.table(ndf)
}

# Contingency table, optionally with chi-square and Fisher tests
get.contingency.table <- function(df, dep_var, indep_var, stat.tests = FALSE){
  indep.var <- df[, which(names(df) == indep_var)]
  dep.var <- df[, which(names(df) == dep_var)]
  if (stat.tests == FALSE) {
    gmodels::CrossTable(dep.var, indep.var, digits = 1,
                        prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE)
  } else {
    gmodels::CrossTable(dep.var, indep.var, digits = 1,
                        prop.r = FALSE, prop.t = FALSE, prop.chisq = FALSE,
                        chisq = TRUE, fisher = TRUE)
  }
}

# Bar chart for a categorical variable
visualize.barchart <- function(df, indep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  qplot(indep.var, geom = "bar", fill = I("gray"), col = I("black"),
        xlab = indep_var) + theme_bw()
}

# Mosaic plot of the contingency table (dep_var must be a factor)
visualize.contingency.table <- function(df, dep_var, indep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  dep.var <- df[, which(names(df) == dep_var)]
  if (names(dev.cur()) != "null device") {
    dev.off()
  }
  graphics::mosaicplot(dep.var ~ indep.var, color = TRUE,
                       xlab = dep_var, ylab = indep_var,
                       main = "Contingency table plot")
}

################## Numeric variables ##################

# Summary statistics for a numeric variable
get.numeric.variable.stats <- function(df, indep_var, detailed = FALSE){
  indep.var <- df[, which(names(df) == indep_var)]
  options(scipen = 100)
  options(digits = 2)
  if (detailed) {
    var.stats <- stat.desc(indep.var)   # pastecs: detailed descriptive stats
  } else {
    var.stats <- summary(indep.var)
  }
  df <- data.frame(round(as.numeric(var.stats), 2))
  colnames(df) <- indep_var
  rownames(df) <- names(var.stats)
  if (names(dev.cur()) != "null device") {
    dev.off()
  }
  gridExtra::grid.table(t(df))
}

# Histogram / density plot for a numeric variable
visualize.distributions <- function(df, indep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  p11 <- ggplot(df, aes(indep.var)) +
    geom_histogram(binwidth = 5, fill = I("gray"), color = I("black")) +
    labs(x = indep_var) + theme_bw()
  p12 <- qplot(indep.var, geom = "density", fill = I("gray"), color = I("black")) +
    labs(x = indep_var) + theme_bw()
  gridExtra::grid.arrange(p11, p12, ncol = 2)
}

# Box plots: overall and grouped by the label (dep_var must be a factor)
visualize.boxplot <- function(df, indep_var, dep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  dep.var <- df[, which(names(df) == dep_var)]
  p11 <- qplot(factor(0), indep.var, geom = "boxplot",
               xlab = indep_var, ylab = "values") + theme_bw()
  p12 <- ggplot(df, aes(x = dep.var, y = indep.var, fill = dep.var)) +
    geom_boxplot(outlier.colour = "red", outlier.size = 2) +
    labs(x = dep_var, y = indep_var) + theme_bw()
  gridExtra::grid.arrange(p11, p12, ncol = 2)
}
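
For example, with a data frame repayment containing a factor label churn, a factor feature gender and a numeric feature income (the column names churn, gender and income are hypothetical), the helpers above can be called like this:

# categorical feature vs. the label
get.categorical.variable.stats(repayment, "gender")
get.contingency.table(repayment, "churn", "gender", stat.tests = TRUE)
visualize.barchart(repayment, "gender")
visualize.contingency.table(repayment, "churn", "gender")

# numeric feature
get.numeric.variable.stats(repayment, "income", detailed = TRUE)
visualize.distributions(repayment, "income")
visualize.boxplot(repayment, "income", "churn")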

Outlier detection methods include box plots (the most common), clustering (distance-based k-means and density-based DBSCAN), and random forests. Let me also plug a chart of clustering results I put together recently (details at zhuanlan.zhihu.com/p/33).

(Figure: overview of the related clustering methods)
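
A minimal sketch of the two cheapest checks on a single numeric column (the data frame repayment and column income are hypothetical names): the box-plot/IQR rule and the distance to the nearest k-means centre.

x <- repayment$income

# (a) box-plot / IQR rule: values beyond 1.5 * IQR of the quartiles
iqr.out <- boxplot.stats(x)$out
which(x %in% iqr.out)            # row indices of the flagged values

# (b) k-means: flag the points farthest from their cluster centre
set.seed(1)
xs <- as.numeric(scale(x))
km <- kmeans(xs, centers = 3)
dist.to.centre <- abs(xs - km$centers[km$cluster])
head(order(dist.to.centre, decreasing = TRUE), 10)   # 10 most outlying rows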

Impact of outliers:

1. They violate the assumptions underlying regression, ANOVA, and other models;

2. They inflate the error, bias the model estimates, and reduce the model's ability to generalize.

Handling:

1. Apply a feature transformation to reduce the variation caused by the outliers (see the capping sketch after this list);

2. If there are many outliers, consider modelling them separately.
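
One simple transformation that reduces the influence of outliers is capping (winsorising) at the box-plot fences; a minimal sketch for a hypothetical numeric column income:

cap.outliers <- function(x) {
  # cap values beyond Q1 - 1.5*IQR and Q3 + 1.5*IQR at the fences
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  pmin(pmax(x, qs[1] - fence), qs[2] + fence)
}

repayment$income_capped <- cap.outliers(repayment$income)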

—————————————— Feature Engineering —————————————

(1) Feature transformation:

1. For skewed distributions, common transforms are the square, square root, cube root, logarithm, and reciprocal;

2. Binning (equal-frequency, equal-width, chi-square, and optimal binning).

(2) Adding variables: derived variables, dummy variables, and one-hot encoding (for a brief explanation of the difference between the last two, see 離散型特徵編碼方式:one-hot與啞變數 - ML小菜鳥 - 博客園). A combined sketch of these transformations and encodings follows.
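
A sketch of the transformations in (1) and the encodings in (2), using hypothetical columns income (numeric, right-skewed) and gender (factor) of the repayment data frame:

# (1) common transforms for a right-skewed numeric column
repayment$income_log  <- log1p(repayment$income)   # log(1 + x), safe at zero
repayment$income_sqrt <- sqrt(repayment$income)

# equal-width vs. equal-frequency binning
repayment$income_bin_width <- cut(repayment$income, breaks = 5)       # equal-width
repayment$income_bin_freq  <- Hmisc::cut2(repayment$income, g = 5)    # equal-frequency
# chi-square / optimal binning is handled by smbinning in the feature-selection part below

# (2) dummy variables (k-1 columns) vs. one-hot encoding (k columns) for a factor
dummies <- model.matrix(~ gender, data = repayment)[, -1, drop = FALSE]
onehot  <- model.matrix(~ gender - 1, data = repayment)
# caret alternative:
# onehot <- predict(caret::dummyVars(~ gender, data = repayment), newdata = repayment)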

------------------------------- Feature Selection -------------------------------------

Basic univariate methods: correlation, analysis of variance, and statistical tests (chi-square, Fisher's exact test, etc.).

Model-based methods: cross-validation, PCA, Lasso regression, recursive feature elimination (RFE), the Boruta algorithm, etc. (a short RFE sketch follows the reference below).

Reference: 樣本採樣及特徵選擇 | 單向街的夏天
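
A minimal feature-selection sketch using recursive feature elimination from caret with random-forest ranking functions, assuming a data frame repayment with a factor label churn (hypothetical names); the Boruta alternative is left commented out because the Boruta package is not in the library list above:

set.seed(1)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
x <- repayment[, setdiff(names(repayment), "churn")]
y <- repayment$churn
rfe.fit <- rfe(x, y, sizes = c(5, 10, 15), rfeControl = ctrl)
predictors(rfe.fit)   # the selected subset of features

# Boruta alternative (requires the Boruta package):
# library(Boruta)
# boruta.fit <- Boruta(churn ~ ., data = repayment)
# getSelectedAttributes(boruta.fit)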

Feature selection can also be based on WoE and IV. Scoring models in internet finance generally keep features with an IV between 0.05 and 0.5: a value that is too small barely discriminates the target variable, while a value that is too large easily leads to overfitting (a filtering sketch follows the reference below). Credit scoring reference:

WOE信用評分卡--R語言實例 - CSDN博客
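
A hedged sketch of applying the 0.05–0.5 rule, assuming candidate.vars is a character vector of candidate feature names and churn is the label column (both hypothetical), and using the indep_var.iv helper defined in the binning listing further below:

iv.values <- sapply(candidate.vars, function(v) indep_var.iv(repayment, v, "churn"))
selected  <- names(iv.values)[iv.values >= 0.05 & iv.values <= 0.5]
selected   # features worth keeping for the scorecard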

A few points to note:

(1) The binning is based on weight of evidence (WoE) and information value (IV), and every bin must satisfy goodrate > 0 & badrate > 0;

(2) indep_var must be numeric or factor and dep_var must be an integer (it is advisable to keep a duplicate integer copy of the label column, which makes plotting under different requirements easier); in addition, length(unique(indep_var)) > 5 is required;

(3) The result contains ivtable (the calculation details), iv, bands (the bins / cut points), and ctree (the decision tree).

## Binning for numeric variables
smbinning.Numeric <- function(df, indep_var, dep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  dep.var <- df[, which(names(df) == dep_var)]
  # smbinning(df, y = <0/1 integer>, x = <numeric>, p = 0.05)
  result <- smbinning::smbinning(df, y = dep_var, x = indep_var)
  par(mfrow = c(2, 2))
  boxplot(indep.var ~ dep.var, horizontal = TRUE, xlab = indep_var, ylab = dep_var,
          frame = FALSE, col = c("steelblue", "red"), main = "Distribution")
  smbinning.plot(result, option = "dist", sub = "Credit Score")
  smbinning.plot(result, option = "badrate", sub = "Credit Score")
  smbinning.plot(result, option = "WoE")
  return(list(result = result))
}

## Binning for factor variables
smbinning.Factor <- function(df, indep_var, dep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  dep.var <- df[, which(names(df) == dep_var)]
  result <- smbinning::smbinning.factor(df, y = dep_var, x = indep_var, maxcat = 11)
  print(qplot(indep.var, geom = "bar", fill = I("gray"), col = I("black"), xlab = indep_var,
              main = paste(indep_var, "distribution")) + theme_bw())
  # skip the WoE plot when the WoE column contains -Inf / NaN / Inf
  novalue <- c("-Inf", "NaN", "Inf")
  if (any(as.character(result$ivtable$WoE) %in% novalue)) {
    par(mfrow = c(1, 3))
    smbinning.plot(result, option = "dist", sub = paste(indep_var, "level"))
    smbinning.plot(result, option = "goodrate", sub = paste(indep_var, "level"))
    smbinning.plot(result, option = "badrate", sub = paste(indep_var, "level"))
  } else {
    par(mfrow = c(2, 2))
    smbinning.plot(result, option = "dist", sub = paste(indep_var, "level"))
    smbinning.plot(result, option = "goodrate", sub = paste(indep_var, "level"))
    smbinning.plot(result, option = "badrate", sub = paste(indep_var, "level"))
    smbinning.plot(result, option = "WoE", sub = paste(indep_var, "level"))
  }
  return(list(result = result))
}

## Hand-written helper to compute the IV of a single variable
indep_var.iv <- function(df, indep_var, dep_var){
  indep.var <- df[, which(names(df) == indep_var)]
  dep.var <- df[, which(names(df) == dep_var)]
  temp1 <- table(indep.var, dep.var)
  temp2 <- as.data.frame(matrix(temp1, nrow(temp1), 2,
                                dimnames = list(rownames(temp1), colnames(temp1))))
  temp3 <- sapply(temp2, function(x) x / sum(x))   # column-wise proportions
  if (!is.matrix(temp3)) {
    iv <- 0
  } else {
    woe <- log(temp3[, 1] / temp3[, 2])   # good rate / bad rate
    iv <- sum((temp3[, 1] - temp3[, 2])[!is.infinite(woe)] * woe[!is.infinite(woe)])
  }
  return(iv)
}
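
A usage sketch for the wrappers above, following note (2). The names repayment, income, gender and churn are hypothetical, and the factor label is assumed to have levels "0"/"1" so that it can be copied into an integer column:

# integer copy of the 0/1 factor label, as required by smbinning
repayment$churn_int <- as.integer(as.character(repayment$churn))

num.result <- smbinning.Numeric(repayment, "income", "churn_int")
num.result$result$iv        # information value of the binned feature
num.result$result$ivtable   # per-bin counts, good/bad rates and WoE
num.result$result$bands     # the cut points
num.result$result$ctree     # the underlying conditional inference tree

fac.result <- smbinning.Factor(repayment, "gender", "churn_int")
fac.result$result$ivtable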

The above is only a quick walkthrough of the relevant data processing steps. To be continued!

