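Credit card fraud detection in Python (Part 2): two sampling methods for imbalanced data, a comparison of their effect, and a model-tuning example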

05-13

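Preface

I have already written Part 1 of this credit card fraud analysis (linked below). It is best to read that article first to get the overall picture before reading this one.
This article deals with the class imbalance mentioned in Part 1, applies undersampling and oversampling to work around it, and compares the results of the two.
Using logistic regression as the example, it also runs a simple experiment on the two parameters that matter most for the model's performance, C and Threshold, to see which values work better. The details are in the code, and I hope it gives you some ideas for tuning.
Before reading, it helps to know some basic model-evaluation terms such as recall, TP and FN. At the very least you should be able to read a confusion matrix, otherwise the later sections will be hard going.
Reading this article takes about 20 minutes. If you find a mistake, please leave a comment. Thanks!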

Credit card fraud detection in Python (Part 1): comparing logistic regression, random forest and SVM models

1. Data preparation

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv('./creditcard.csv')
    # Standardize the Amount column
    data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
    # Drop the raw Amount column and the Time column
    data = data.drop(['Amount', 'Time'], axis=1)
    data.shape, data.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 284807 entries, 0 to 284806
    Data columns (total 30 columns):
    V1            284807 non-null float64
    V2            284807 non-null float64
    V3            284807 non-null float64
    V4            284807 non-null float64
    V5            284807 non-null float64
    V6            284807 non-null float64
    V7            284807 non-null float64
    V8            284807 non-null float64
    V9            284807 non-null float64
    V10           284807 non-null float64
    V11           284807 non-null float64
    V12           284807 non-null float64
    V13           284807 non-null float64
    V14           284807 non-null float64
    V15           284807 non-null float64
    V16           284807 non-null float64
    V17           284807 non-null float64
    V18           284807 non-null float64
    V19           284807 non-null float64
    V20           284807 non-null float64
    V21           284807 non-null float64
    V22           284807 non-null float64
    V23           284807 non-null float64
    V24           284807 non-null float64
    V25           284807 non-null float64
    V26           284807 non-null float64
    V27           284807 non-null float64
    V28           284807 non-null float64
    Class         284807 non-null int64
    normAmount    284807 non-null float64
    dtypes: float64(29), int64(1)
    memory usage: 65.2 MB
    ((284807, 30), None)

    # Look at the distribution of the Class column
    count_classes = data['Class'].value_counts(sort=True).sort_index()
    print(count_classes)
    count_classes.plot(kind='bar')
    plt.title("Fraud class histogram")
    plt.xlabel("Class")
    plt.ylabel("Frequency")

Output:

    0    284315
    1       492
    Name: Class, dtype: int64
    Text(0,0.5,'Frequency')

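As shown above, the data is severely imbalanced: the minority class, fraud (Class = 1), accounts for only 492 of 284,807 rows, about 0.17%. If we train on this data without any treatment, the resulting model will be very poor at detecting fraud.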
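To make the problem concrete, consider a hypothetical model that labels every transaction as normal. The sketch below uses only the class counts printed above; the helper name `always_normal_scores` is made up for illustration.

```python
# Class counts taken from the value_counts output above.
N_NORMAL = 284315
N_FRAUD = 492


def always_normal_scores(n_normal, n_fraud):
    """Accuracy and fraud recall of a model that always predicts Class 0."""
    true_negatives = n_normal   # every normal transaction is labelled correctly
    true_positives = 0          # no fraud is ever flagged
    accuracy = (true_negatives + true_positives) / (n_normal + n_fraud)
    recall = true_positives / n_fraud
    return accuracy, recall


accuracy, recall = always_normal_scores(N_NORMAL, N_FRAUD)
print("fraud share: {:.4%}".format(N_FRAUD / (N_NORMAL + N_FRAUD)))
print("accuracy: {:.4f}, recall: {:.1f}".format(accuracy, recall))
```

Such a model is right more than 99.8% of the time and still catches no fraud at all, which is why accuracy alone says little on this dataset.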

So the sample data has to be processed first. There are two main approaches:

Undersampling
Oversampling

Each is covered in turn below.

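2. Undersampling the data

2.1 Undersampling:

When the two classes in a dataset differ greatly in size, randomly pick from the larger class as many samples as there are in the smaller class, then combine them into a training set with equal numbers of both classes.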
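The idea can be sketched in plain Python before turning to the pandas version. This is only an illustration on made-up labels: the function name `undersample` is hypothetical, and it assumes class 1 is the smaller class. Unlike an unseeded random draw, it fixes a seed so that the same rows are picked on every run.

```python
import random


def undersample(labels, seed=0):
    """Return the indices of all class-1 rows plus an equally sized
    random subset of class-0 rows (assumes class 1 is the minority)."""
    rng = random.Random(seed)  # fixed seed, so the draw is reproducible
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, len(minority))  # without replacement
    return sorted(minority + kept_majority)


# Made-up labels: 95 normal rows and 5 fraud rows.
labels = [0] * 95 + [1] * 5
indices = undersample(labels)
print(len(indices), sum(labels[i] for i in indices))
```

The returned indices cover both classes equally, at the cost of discarding most of the majority class.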

    # Build the original feature and label sets
    X = data.loc[:, data.columns != 'Class']
    Y = data.loc[:, data.columns == 'Class']
    X.shape, Y.shape

Output:

    ((284807, 29), (284807, 1))

    # Number of fraud samples (Class == 1)
    number_record_fraud = len(Y[Y.Class == 1])
    # Row indices of the fraud samples and of the normal samples
    fraud_indices = np.array(data[data.Class == 1].index)
    normal_indices = np.array(data[data.Class == 0].index)
    # Randomly choose number_record_fraud indices from normal_indices, without replacement.
    # No random seed is set here, so the selection differs from run to run.
    random_normal_indices = np.array(np.random.choice(normal_indices, number_record_fraud, replace=False))
    # Combine the fraud indices and the sampled normal indices
    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # Extract the undersampled dataset using the combined indices
    under_sample_data = data.iloc[under_sample_indices, :]
    # Separate features and labels in the undersampled dataset
    X_under_sample = under_sample_data.iloc[:, under_sample_data.columns != 'Class']
    Y_under_sample = under_sample_data.iloc[:, under_sample_data.columns == 'Class']
    # Check the resulting feature and label sets
    X_under_sample.shape, Y_under_sample.shape

Output:

    ((984, 29), (984, 1))

    # Split the undersampled features and labels
    # (train_test_split was already imported from sklearn.model_selection above)
    X_train_under_sample, X_test_under_sample, Y_train_under_sample, Y_test_under_sample = train_test_split(
        X_under_sample, Y_under_sample, test_size=0.3, random_state=0)
    # Also split the original, untreated features and labels for later use
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
    # Check the shapes after splitting; do this often to catch anomalies early
    print(X_train_under_sample.shape, Y_train_under_sample.shape,
          X_test_under_sample.shape, Y_test_under_sample.shape)

Output:

    (688, 29) (688, 1) (296, 29) (296, 1)

    # Shapes of the original, untreated data after splitting
    print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)

Output:

    (199364, 29) (199364, 1) (85443, 29) (85443, 1)

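3. Cross-validation and parameter tuning

Once a model has been trained, validating it is an essential step: it tells us how well the model performs and whether it is fit for use. Parameter tuning, in turn, is the key factor in how good the model turns out.

In machine learning, once the algorithm has been chosen, training a model essentially comes down to settling a series of parameters (tuning). Tuning is largely trial and error, but it does follow a pattern.

1. First, train a model using some of the data and a given parameter value,

2. then feed some other data into the trained model,

3. compare the output with the labels and compute an evaluation metric,

4. and use that metric to judge whether the parameter value was a good one.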
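The four steps above can be sketched as a small k-fold loop in plain Python. Everything here is illustrative rather than the article's model: the data is made up, the "model" is a toy nearest-mean classifier, and the names `kfold_indices`, `fit_centroids`, `predict`, `recall` and `mean_cv_recall` are hypothetical. The tuned parameter is a weight that makes the classifier more or less eager to predict class 1.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into contiguous folds (no shuffling)."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_centroids(xs, ys):
    """Step 1: 'train' by computing the mean of each class."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return sum(class0) / len(class0), sum(class1) / len(class1)


def predict(model, x, weight):
    """Predict 1 when x is closer to the class-1 mean; weight > 1 favours class 1."""
    mu0, mu1 = model
    return 1 if abs(x - mu1) < weight * abs(x - mu0) else 0


def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def mean_cv_recall(xs, ys, weight, n_folds):
    scores = []
    for test_idx in kfold_indices(len(xs), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit_centroids([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i], weight) for i in test_idx]   # step 2
        scores.append(recall([ys[i] for i in test_idx], preds))     # step 3
    return sum(scores) / len(scores)


# Made-up 1-D data, interleaved so that every fold contains both classes.
xs = [0.1, 2.0, 0.4, 1.6, 0.3, 2.2, 0.9, 1.1, 0.2, 1.9, 0.5, 1.4]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

results = {w: mean_cv_recall(xs, ys, w, n_folds=3) for w in [0.5, 1.0, 2.0]}
best_weight = max(results, key=results.get)                         # step 4
print(results, best_weight)
```

Each candidate value is scored only on rows the model did not see during fitting, and the value with the best mean score wins. Selecting on recall alone tends to reward parameters that predict the positive class more often, so the other counts in the confusion matrix are worth checking as well.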

So we need evaluation metrics to measure the result. Two important ones are introduced here:

Accuracy
Recall
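As a quick reference, both metrics can be computed directly from the four confusion-matrix counts. The function below is a minimal sketch with a hypothetical name, and the counts in the example are invented purely to show the arithmetic. Precision is included as well, although it is not in the list above, because it is the counterpart that reacts to false positives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                       # share of all predictions that are right
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of flagged cases that are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of real cases that were flagged
    return accuracy, precision, recall


# Invented counts, chosen only to illustrate the formulas.
accuracy, precision, recall = classification_metrics(tp=90, fp=30, fn=10, tn=870)
print(accuracy, precision, recall)
```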

    import sklearn
    from sklearn.linear_model import LogisticRegression
    # Note: sklearn.cross_validation was removed in scikit-learn 0.20. The code and the
    # printed output below come from an older release; newer releases provide KFold in
    # sklearn.model_selection, with a different call signature.
    from sklearn.cross_validation import KFold, cross_val_score
    from sklearn.metrics import confusion_matrix, recall_score, classification_report

    # Run 5-fold cross-validation for each candidate C and return the C with the best mean recall
    def printing_Kfold_scores(X_train_data, Y_train_data):
        fold = KFold(len(Y_train_data), 5, shuffle=False)
        print(fold)
        c_param_range = [0.01, 0.1, 1, 10, 100]
        # results_table is a DataFrame that stores the cross-validated recall for each C
        results_table = pd.DataFrame(index=range(len(c_param_range)),
                                     columns=['C_Parameter', 'Mean recall score'])
        results_table['C_Parameter'] = c_param_range
        j = 0
        for c_param in c_param_range:
            print('c_param:', c_param)
            recall_accs = []
            # enumerate yields (index, element) pairs; start=1 makes the index start at 1
            # (the default is 0). Each element holds the train indices and the test indices.
            for iteration, indices in enumerate(fold, start=1):
                # With scikit-learn 0.22 and later, penalty='l1' also needs
                # solver='liblinear' (or 'saga'), because the default solver changed.
                lr = LogisticRegression(C=c_param, penalty='l1')
                lr.fit(X_train_data.iloc[indices[0], :], Y_train_data.iloc[indices[0], :].values.ravel())
                Y_pred_undersample = lr.predict(X_train_data.iloc[indices[1], :].values)
                recall_acc = recall_score(Y_train_data.iloc[indices[1], :].values, Y_pred_undersample)
                recall_accs.append(recall_acc)
                print('Iteration:', iteration, 'recall_acc:', recall_acc)
            # Mean recall for this value of C
            print('Mean recall score', np.mean(recall_accs))
            results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
            j += 1
        # Pick the best C.
        # Careful: results_table['Mean recall score'] has dtype object and must be
        # converted to float64 before calling idxmax().
        results_table['Mean recall score'] = results_table['Mean recall score'].astype('float64')
        best_c = results_table['C_Parameter'].iloc[results_table['Mean recall score'].idxmax()]
        print('best_c is :', best_c)
        return best_c

    # Run it on the undersampled training data
    best_c = printing_Kfold_scores(X_train_under_sample, Y_train_under_sample)

Output:

    sklearn.cross_validation.KFold(n=688, n_folds=5, shuffle=False, random_state=None)
    c_param: 0.01
    Iteration: 1 recall_acc: 0.931506849315
    Iteration: 2 recall_acc: 0.917808219178
    Iteration: 3 recall_acc: 1.0
    Iteration: 4 recall_acc: 0.959459459459
    Iteration: 5 recall_acc: 0.954545454545
    Mean recall score 0.9526639965
    c_param: 0.1
    Iteration: 1 recall_acc: 0.835616438356
    Iteration: 2 recall_acc: 0.86301369863
    Iteration: 3 recall_acc: 0.915254237288
    Iteration: 4 recall_acc: 0.918918918919
    Iteration: 5 recall_acc: 0.893939393939
    Mean recall score 0.885348537427
    c_param: 1
    Iteration: 1 recall_acc: 0.849315068493
    Iteration: 2 recall_acc: 0.890410958904
    Iteration: 3 recall_acc: 0.966101694915
    Iteration: 4 recall_acc: 0.945945945946
    Iteration: 5 recall_acc: 0.893939393939
    Mean recall score 0.90914261244
    c_param: 10
    Iteration: 1 recall_acc: 0.86301369863
    Iteration: 2 recall_acc: 0.904109589041
    Iteration: 3 recall_acc: 0.966101694915
    Iteration: 4 recall_acc: 0.932432432432
    Iteration: 5 recall_acc: 0.909090909091
    Mean recall score 0.914949664822
    c_param: 100
    Iteration: 1 recall_acc: 0.890410958904
    Iteration: 2 recall_acc: 0.904109589041
    Iteration: 3 recall_acc: 0.983050847458
    Iteration: 4 recall_acc: 0.959459459459
    Iteration: 5 recall_acc: 0.909090909091
    Mean recall score 0.929224352791
    best_c is : 0.01

4. Confusion matrix

Define a function plot_confusion_matrix that draws the confusion matrix:

    import itertools

    def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
        # Draw the matrix as a colored grid
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title, fontsize=22)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=0)
        plt.yticks(tick_marks, classes)
        # Write each count into its cell, in white on dark cells and black on light ones
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, cm[i, j], horizontalalignment="center", fontsize=15,
                     color="white" if cm[i, j] > thresh else "black")
        plt.tight_layout()
        plt.ylabel('True label', fontsize=15)
        plt.xlabel('Predicted label', fontsize=15)

Feed the test data from the undersampled set into the model, then plot the confusion matrix from the predictions and the true labels.

    lr = LogisticRegression(C=best_c, penalty='l1')
    lr.fit(X_train_under_sample, Y_train_under_sample.values.ravel())
    Y_pred_undersample = lr.predict(X_test_under_sample.values)
    # Rows of the confusion matrix are true labels, columns are predicted labels
    cnf_matrix = confusion_matrix(Y_test_under_sample, Y_pred_undersample)
    np.set_printoptions(precision=2)
    # recall = TP / (FN + TP)
    print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    class_names = [0, 1]
    f, ax = plt.subplots(figsize=(8, 6))
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

Output:

    Recall metric in the testing dataset:  0.925170068027

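From the confusion matrix in the figure above, recall is:

recall = TP/(TP+FN) = 136/(136+11) ≈ 0.925

As the formula shows, recall depends only on TP and FN. A large FP count (transactions that are actually 0, with no fraud risk, but are predicted as 1, that is, flagged as risky) does not lower recall at all. So when tuning, look not only at recall but also at the confusion matrix, to check FP and the other counts.

The recall and confusion matrix above were computed on test data taken from the undersampled set. That set is tiny compared with the original data, so the result is not very convincing. We therefore test on the original data (the data that was not undersampled).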

    lr = LogisticRegression(C=best_c, penalty='l1')
    lr.fit(X_train_under_sample, Y_train_under_sample.values.ravel())
    # Predict on the test split of the original (not undersampled) data
    Y_pred = lr.predict(X_test.values)
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y_test, Y_pred)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    f, ax = plt.subplots(figsize=(8, 6))
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

Output:

    Recall metric in the testing dataset:  0.918367346939

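As the figure above shows, the logistic regression model trained on undersampled data has a high recall, but its FP count is also very high at 8404. In other words, the false-alarm rate is very high. This is one drawback of undersampling; oversampling, discussed below, may give much better results.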
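A few lines of arithmetic show how severe this is. The false-positive count (8404) and the test-set size (85443 rows) come from the text and output above. The split of fraud cases into 135 caught and 12 missed is an assumption: it reproduces the printed recall of 0.918367346939, but the matrix itself is not reproduced here.

```python
# Reported in the text for the full test set.
FALSE_POSITIVES = 8404
TEST_ROWS = 85443

# Assumption: 135 of 147 fraud cases were caught (135/147 matches the printed recall).
TRUE_POSITIVES = 135
FALSE_NEGATIVES = 12

recall = TRUE_POSITIVES / (TRUE_POSITIVES + FALSE_NEGATIVES)
precision = TRUE_POSITIVES / (TRUE_POSITIVES + FALSE_POSITIVES)
normal_rows = TEST_ROWS - (TRUE_POSITIVES + FALSE_NEGATIVES)
false_positive_rate = FALSE_POSITIVES / normal_rows
false_alarms_per_hit = FALSE_POSITIVES / TRUE_POSITIVES

print("recall              = {:.4f}".format(recall))
print("precision           = {:.4f}".format(precision))
print("false positive rate = {:.4f}".format(false_positive_rate))
print("false alarms per caught fraud = {:.1f}".format(false_alarms_per_hit))
```

Under these assumptions, fewer than 2 in 100 flagged transactions are real fraud, and roughly one normal transaction in ten is flagged, even though recall stays above 0.9.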

Now let's see how the model does when trained on the original data X_train, Y_train:

    best_c = printing_Kfold_scores(X_train, Y_train)
    lr = LogisticRegression(C=best_c, penalty='l1')
    lr.fit(X_train, Y_train.values.ravel())
    Y_pred = lr.predict(X_test.values)
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(Y_test, Y_pred)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

Output:

    sklearn.cross_validation.KFold(n=199364, n_folds=5, shuffle=False, random_state=None)
    c_param: 0.01
    Iteration: 1 recall_acc: 0.492537313433
    Iteration: 2 recall_acc: 0.602739726027
    Iteration: 3 recall_acc: 0.683333333333
    Iteration: 4 recall_acc: 0.569230769231
    Iteration: 5 recall_acc: 0.45
    Mean recall score 0.559568228405
    c_param: 0.1
    Iteration: 1 recall_acc: 0.567164179104
    Iteration: 2 recall_acc: 0.616438356164
    Iteration: 3 recall_acc: 0.683333333333
    Iteration: 4 recall_acc: 0.584615384615
    Iteration: 5 recall_acc: 0.525
    Mean recall score 0.595310250644
    c_param: 1
    Iteration: 1 recall_acc: 0.55223880597
    Iteration: 2 recall_acc: 0.616438356164
    Iteration: 3 recall_acc: 0.716666666667
    Iteration: 4 recall_acc: 0.615384615385
    Iteration: 5 recall_acc: 0.5625
    Mean recall score 0.612645688837
    c_param: 10
    Iteration: 1 recall_acc: 0.55223880597
    Iteration: 2 recall_acc: 0.616438356164
    Iteration: 3 recall_acc: 0.733333333333
    Iteration: 4 recall_acc: 0.615384615385
    Iteration: 5 recall_acc: 0.575
    Mean recall score 0.61847902217
    c_param: 100
    Iteration: 1 recall_acc: 0.55223880597
    Iteration: 2 recall_acc: 0.616438356164
    Iteration: 3 recall_acc: 0.733333333333
    Iteration: 4 recall_acc: 0.615384615385
    Iteration: 5 recall_acc: 0.575
    Mean recall score 0.61847902217
    best_c is : 10.0
    Recall metric in the testing dataset:  0.619047619048

5. Adjusting the Threshold parameter

    lr = LogisticRegression(C=0.01, penalty='l1')
    lr.fit(X_train_under_sample, Y_train_under_sample.values.ravel())
    # Predicted probabilities; column 1 is the probability of Class 1 (fraud)
    y_pred_undersample_proba = lr.predict_proba(X_test_under_sample.values)
    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    plt.figure(figsize=(15, 15))
    recall_accs = []
    j = 1
    for i in thresholds:
        # Predict fraud when the fraud probability exceeds the threshold
        y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
        plt.subplot(3, 3, j)
        j += 1
        # Compute confusion matrix
        cnf_matrix = confusion_matrix(Y_test_under_sample, y_test_predictions_high_recall)
        np.set_printoptions(precision=2)
        recall_acc = float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1])
        print('Threshold>=%s Recall: ' % i, recall_acc)
        recall_accs.append(recall_acc)
        # Plot non-normalized confusion matrix
        class_names = [0, 1]
        plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold>=%s' % i)

Output:

    Threshold>=0.1 Recall:  1.0
    Threshold>=0.2 Recall:  1.0
    Threshold>=0.3 Recall:  1.0
    Threshold>=0.4 Recall:  0.986394557823
    Threshold>=0.5 Recall:  0.925170068027
    Threshold>=0.6 Recall:  0.863945578231
    Threshold>=0.7 Recall:  0.823129251701
    Threshold>=0.8 Recall:  0.734693877551
    Threshold>=0.9 Recall:  0.571428571429

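Recall falls steadily as the threshold rises, from 1.0 at thresholds of 0.1 to 0.3 down to about 0.57 at 0.9. Recall alone would favor the lowest thresholds, so the choice has to weigh it against the false positives shown in each confusion matrix. Judging from the figure above, Threshold = 0.5 gives the best result.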
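The trade-off behind this choice can be reproduced on a handful of made-up scores. The sketch below is plain Python with hypothetical names and invented probabilities; it only illustrates that raising the threshold lowers recall and false positives together.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, FN, TN when score > threshold is predicted as fraud."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score > threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 0 and predicted == 1:
            fp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented labels and predicted fraud probabilities.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.95, 0.85, 0.65, 0.55, 0.35, 0.60, 0.45, 0.25, 0.15, 0.05]

rows = []
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    tp, fp, fn, tn = confusion_counts(y_true, y_score, threshold)
    recall = tp / (tp + fn)
    rows.append((threshold, recall, fp))
    print("threshold={:.1f} recall={:.2f} false_positives={}".format(threshold, recall, fp))
```

On this toy data the lowest threshold catches every fraud case but also flags four of the five normal ones, while the highest threshold flags no normal case and misses most of the fraud.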

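6. Oversampling

Unlike undersampling, which throws data away, oversampling takes a different approach:

Oversampling: generate additional samples for the smaller class until it matches the size of the larger class.

So how should the data be generated to bring the smaller class up to a matching size?

The most commonly used method is the SMOTE algorithm. For a detailed introduction to SMOTE, see this reference:

SMOTE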
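The core idea of SMOTE is to create each synthetic minority sample on the line segment between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified plain-Python illustration of that idea, not the imblearn implementation: the function name `smote_like_samples`, the brute-force neighbour search, the choice of k = 2 and the sample points are all assumptions made for the example.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force nearest neighbours by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Invented 2-D minority-class points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4)]
new_points = smote_like_samples(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every new point lies between two real minority points, the synthetic data stays inside the region the minority class already occupies instead of simply duplicating rows.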

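The steps are worked through below.

Generating the data

Separate the features and the labels.
Split the data into training and test sets at a ratio of 7:3.
Apply SMOTE to the training samples to obtain a balanced training set.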

    columns = data.columns
    # NOTE: the last column of data is normAmount, not Class, so this line drops
    # normAmount and leaves Class among the features (see the Index printed below).
    # That leaks the label into the model. columns.drop('Class') would avoid it,
    # but the outputs shown below were produced with this original line.
    features_columns = columns.delete(len(columns) - 1)
    features = data[features_columns]
    labels = data['Class']
    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    oversampler = SMOTE(random_state=0)
    # Newer imbalanced-learn releases name this method fit_resample
    os_features, os_labels = oversampler.fit_sample(features_train, labels_train)
    os_features = pd.DataFrame(os_features)
    os_labels = pd.DataFrame(os_labels)
    # Note: a boolean mask on a DataFrame keeps its shape, so this prints the
    # total number of rows, not the number of rows labelled 1
    print(len(os_labels[os_labels == 1]))

Output:

    398038

    features_columns

Output:

    Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
           'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
           'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Class'],
          dtype='object')

    # Check the dataset produced by oversampling
    os_features.shape, os_labels.shape

Output:

    ((398038, 29), (398038, 1))

    best_c = printing_Kfold_scores(os_features, os_labels)
    lr = LogisticRegression(C=best_c, penalty='l1')
    lr.fit(os_features, os_labels.values.ravel())
    y_pred = lr.predict(features_test.values)
    # Pass the data to the confusion-matrix plotting function
    cnf_matrix = confusion_matrix(labels_test, y_pred)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset: ", float(cnf_matrix[1,1])/(cnf_matrix[1,0]+cnf_matrix[1,1]))
    class_names = [0, 1]
    plt.figure()
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
    plt.show()

Output:

    sklearn.cross_validation.KFold(n=398038, n_folds=5, shuffle=False, random_state=None)
    c_param: 0.01
    Iteration: 1 recall_acc: 1.0
    Iteration: 2 recall_acc: 1.0
    Iteration: 3 recall_acc: 1.0
    Iteration: 4 recall_acc: 1.0
    Iteration: 5 recall_acc: 1.0
    Mean recall score 1.0
    c_param: 0.1
    Iteration: 1 recall_acc: 1.0
    Iteration: 2 recall_acc: 1.0
    Iteration: 3 recall_acc: 1.0
    Iteration: 4 recall_acc: 1.0
    Iteration: 5 recall_acc: 1.0
    Mean recall score 1.0
    c_param: 1
    Iteration: 1 recall_acc: 1.0
    Iteration: 2 recall_acc: 1.0
    Iteration: 3 recall_acc: 1.0
    Iteration: 4 recall_acc: 1.0
    Iteration: 5 recall_acc: 1.0
    Mean recall score 1.0
    c_param: 10
    Iteration: 1 recall_acc: 1.0
    Iteration: 2 recall_acc: 1.0
    Iteration: 3 recall_acc: 1.0
    Iteration: 4 recall_acc: 1.0
    Iteration: 5 recall_acc: 1.0
    Mean recall score 1.0
    c_param: 100
    Iteration: 1 recall_acc: 1.0
    Iteration: 2 recall_acc: 1.0
    Iteration: 3 recall_acc: 1.0
    Iteration: 4 recall_acc: 1.0
    Iteration: 5 recall_acc: 1.0
    Mean recall score 1.0
    best_c is : 0.01
    Recall metric in the testing dataset:  1.0

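On these numbers, oversampling raises recall further (with more training data the model has more to learn from), and above all the false alarms drop sharply, from 8404 to 0. The perfect scores should be read with caution, though: the features_columns index printed above still contains Class, so the model was given the label as one of its features, and a recall of 1.0 with no false alarms is what such leakage produces. Oversampling remains a useful remedy for imbalance in large datasets, but its real benefit here needs to be measured with Class excluded from the features.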
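The column mix-up behind the perfect scores is easy to reproduce with plain lists. The sketch below assumes the column order shown by data.info() earlier (V1 to V28, then Class, then normAmount) and compares dropping the last column by position with dropping the label by name; the variable names are made up for the example.

```python
# Column order after preprocessing, as listed by data.info():
# V1..V28, then Class, then normAmount (which was appended last).
columns = ["V{}".format(i) for i in range(1, 29)] + ["Class", "normAmount"]

# Dropping the last column by position removes normAmount, not the label.
by_position = columns[:-1]
# Dropping the label by name removes Class and keeps normAmount.
by_name = [c for c in columns if c != "Class"]

print("Class kept when dropping by position:", "Class" in by_position)
print("Class kept when dropping by name:    ", "Class" in by_name)
```

A one-line check such as `assert "Class" not in feature_columns` before training is a cheap way to catch this kind of label leakage.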

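7. Summary

When you get a dataset, first look at its structure and check whether it is imbalanced.
If it is, use undersampling or oversampling to build a new dataset before choosing a model or algorithm.
Tuning a model is a painful process: only by repeated trials can you find the best parameters.
When judging predictions, weigh accuracy, recall, the confusion matrix and other measures together rather than fixating on a single one.

That is all for this article. If I find the time later, I will try a decision tree model on this dataset.

Thank you for reading.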

(Readership is thin and I could use some encouragement. If you have read this far, please like and bookmark the article. Thank you, friends!)