[Digest] Solutions for Imbalanced Data

01-28

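Overview

Using fundraising data as an example (positive:negative ratio below 1:20), the author runs a series of experiments comparing several ways of handling imbalanced data.

With no special handling, a random forest reaches 97% accuracy, but it still produces many false positives and false negatives; its balanced accuracy is only about 77%.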

    library(caret)    # confusionMatrix()
    library(forcats)  # fct_collapse()

    confusionMatrix(
      # the original model predicted for leadership levels, too, which I don't care in terms of accuracy
      fct_collapse(predictedoutcomes$rf, donor = c("donor", "leadership"))
      , fct_collapse(predictedoutcomes$actual, donor = c("donor", "leadership"))
    )

    ## Confusion Matrix and Statistics
    ## not the exact real-life numbers
    ##
    ##           Reference
    ## Prediction donor no gift
    ##    donor     300     250
    ##    no gift   250   19500
    ##
    ##                Accuracy : 0.9744
    ##                  95% CI : (0.9721, 0.9765)
    ##     No Information Rate : 0.9691
    ##     P-Value [Acc > NIR] : 4.31e-06
    ##
    ##                   Kappa : 0.5615
    ##  Mcnemar's Test P-Value : 0.1732
    ##
    ##             Sensitivity : 0.56000
    ##             Specificity : 0.98761
    ##          Pos Pred Value : 0.59022
    ##          Neg Pred Value : 0.98600
    ##              Prevalence : 0.03089
    ##          Detection Rate : 0.01730
    ##    Detection Prevalence : 0.02931
    ##       Balanced Accuracy : 0.77380
    ##
    ##        'Positive' Class : donor
    ##
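To make the gap between accuracy and balanced accuracy concrete, the sketch below recomputes the main statistics by hand from the four counts printed in the confusion matrix. Because the source says those counts are not the exact real-life numbers, the results land near the printed statistics rather than matching them digit for digit. The variable names are illustrative only.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference). These are the rounded,
# "not exact" counts, so the statistics differ slightly from the printout.
cm <- matrix(
  c(300,   250,     # predicted donor:   300 true donors, 250 true non-donors
    250, 19500),    # predicted no gift: 250 true donors, 19500 true non-donors
  nrow = 2, byrow = TRUE,
  dimnames = list(prediction = c("donor", "no gift"),
                  reference  = c("donor", "no gift"))
)

tp <- cm["donor", "donor"]      # donors predicted as donors
fp <- cm["donor", "no gift"]    # non-donors predicted as donors
fn <- cm["no gift", "donor"]    # donors predicted as non-donors
tn <- cm["no gift", "no gift"]  # non-donors predicted as non-donors

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)   # share of real donors that were found
specificity       <- tn / (tn + fp)   # share of real non-donors that were found
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 4)
```

Accuracy is dominated by the 19,500 correctly rejected non-donors, while balanced accuracy gives the two classes equal say, which is why it exposes the weak sensitivity.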

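Two approaches:

1. Weighting. The article mainly penalizes the majority class; one could also up-weight the minority class.

2. Sampling. The article again only downsamples the majority class; correspondingly, one could also upsample the minority class.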
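The article itself does not try upsampling, so the following is only a sketch of the alternative mentioned in item 2: minority-class rows are resampled with replacement until both classes have the same size. The helper `upsample_minority`, the column name `outcome`, and the toy data are all hypothetical.

```r
# Hypothetical helper: resample minority rows with replacement until the
# minority class is as large as the majority class.
upsample_minority <- function(data, label = "outcome", positive = "donor") {
  pos_rows <- which(data[[label]] == positive)
  neg_rows <- which(data[[label]] != positive)
  boosted  <- pos_rows[sample.int(length(pos_rows), length(neg_rows),
                                  replace = TRUE)]
  data[c(boosted, neg_rows), , drop = FALSE]
}

# Toy data (made up): 5 donors and 100 non-donors.
toy <- data.frame(
  outcome = rep(c("donor", "no gift"), times = c(5, 100)),
  id      = 1:105
)

set.seed(3)
balanced <- upsample_minority(toy)
table(balanced$outcome)
```

Upsampling repeats the same few minority rows many times, so it should be applied to the training split only; otherwise copies of one row can leak into the evaluation data.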

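Downsampling

The author ran 18 downsampling experiments; n_x and n_y are the numbers of positive and negative examples sampled, respectively.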

    library(dplyr)    # %>%
    library(ggplot2)  # ggplot(), geom_jitter(), ggtitle()

    possiblesizes

    ## # A tibble: 18 × 2
    ##      n_x   n_y
    ##  1    50    50
    ##  2    50   500
    ##  3    50  1000
    ##  4    50  5000
    ##  5    50 20000
    ##  6    50 60000
    ##  7   500    50
    ##  8   500   500
    ##  9   500  1000
    ## 10   500  5000
    ## 11   500 20000
    ## 12   500 60000
    ## 13  1000    50
    ## 14  1000   500
    ## 15  1000  1000
    ## 16  1000  5000
    ## 17  1000 20000
    ## 18  1000 60000

    # plot the possible sizes for clarity
    possiblesizes %>%
      ggplot(aes(x = n_x, y = n_y)) +
      geom_jitter(size = 3, width = 50) +
      ggtitle("Possible Sample Sizes")
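The source does not show how the grid was built or how each sample was drawn, so the following is a minimal base-R sketch under assumptions: the grid is every combination of the sizes listed above, and each experiment draws n_x positives and n_y negatives without replacement. The names `possible_sizes`, `downsample`, `outcome`, and the toy data are hypothetical.

```r
# Rebuild the 18-row grid: every combination of the listed sizes.
# expand.grid() varies its first argument fastest, so n_y goes first
# to reproduce the row order of the printed tibble.
possible_sizes <- expand.grid(
  n_y = c(50, 500, 1000, 5000, 20000, 60000),
  n_x = c(50, 500, 1000)
)[, c("n_x", "n_y")]

# Hypothetical helper: draw n_x positives and n_y negatives without
# replacement, capping each request at the number of rows available.
downsample <- function(data, n_x, n_y, label = "outcome", positive = "donor") {
  pos_rows <- which(data[[label]] == positive)
  neg_rows <- which(data[[label]] != positive)
  keep <- c(
    pos_rows[sample.int(length(pos_rows), min(n_x, length(pos_rows)))],
    neg_rows[sample.int(length(neg_rows), min(n_y, length(neg_rows)))]
  )
  data[keep, , drop = FALSE]
}

# Toy data (made up): 60 donors and 1,940 non-donors, about 3% positives.
toy <- data.frame(
  outcome = rep(c("donor", "no gift"), times = c(60, 1940)),
  feature = seq_len(2000)
)

set.seed(1)
train <- downsample(toy, n_x = 50, n_y = 500)
table(train$outcome)
```

Looping over the rows of `possible_sizes` and fitting one model per sample would then give one candidate model per grid cell.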

Weighted Models

Similarly, the author ran 25 experiments with different class weights.

    library(dplyr)    # %>%
    library(ggplot2)  # ggplot(), geom_point(), ggtitle()

    possibleweights

    ## # A tibble: 25 × 2
    ##      p_x   p_y
    ##  1   0.1   0.1
    ##  2   0.1   0.3
    ##  3   0.1   0.5
    ##  4   0.1   0.7
    ##  5   0.1   0.9
    ##  6   0.3   0.1
    ##  7   0.3   0.3
    ##  8   0.3   0.5
    ##  9   0.3   0.7
    ## 10   0.3   0.9
    ## # ... with 15 more rows

    # plot the possible weights for clarity
    possibleweights %>%
      ggplot(aes(x = p_x, y = p_y)) +
      geom_point(size = 3) +
      ggtitle("Possible Class Weights")
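The source does not say how p_x and p_y were passed to the model. As an illustration only, the sketch below assumes they are per-class weights (p_x for positives, p_y for negatives) and swaps in a weighted logistic regression from base R in place of the author's random forest, simply to show the effect of class weights on predictions. All names and the simulated data are hypothetical.

```r
# Rebuild the 25-row grid of class weights shown above.
weight_values    <- c(0.1, 0.3, 0.5, 0.7, 0.9)
possible_weights <- expand.grid(p_y = weight_values,
                                p_x = weight_values)[, c("p_x", "p_y")]

# Simulated imbalanced data (made up): positives are rare.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3.5 + 1.2 * x))

# Assumption: p_x weights the positive class, p_y the negative class.
p_x <- 0.9
p_y <- 0.1
obs_weight <- ifelse(y == 1, p_x, p_y)

fit_plain    <- glm(y ~ x, family = binomial())
# quasibinomial() gives the same coefficients as binomial() but does not
# warn about non-integer weights.
fit_weighted <- glm(y ~ x, family = quasibinomial(), weights = obs_weight)

# Weighting positives more heavily pushes predicted probabilities up.
c(plain = mean(fitted(fit_plain)), weighted = mean(fitted(fit_weighted)))
```

Raising the weight of the rare class mostly shifts the intercept, so the ranking of cases (and therefore AUC) can change far less than the predicted probabilities do.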

Comparing Models

    # plot all the ROCs
    plot(FY16sampledrocs[[1]], main = "ROC")

    foo
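The ROC objects above come from the author's own pipeline, which the source does not show. For readers who want to see what the AUC summarizes, here is a minimal base-R sketch that computes it with the rank-sum (Mann-Whitney) identity: AUC is the probability that a randomly chosen positive scores higher than a randomly chosen negative. The function name `auc_rank` is hypothetical.

```r
# AUC via the rank-sum identity; tied scores receive average ranks,
# which counts each tied positive/negative pair as half a win.
auc_rank <- function(score, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  ranks <- rank(score)
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Perfect separation: every positive outranks every negative.
auc_perfect <- auc_rank(c(0.9, 0.8, 0.3, 0.2), c(TRUE, TRUE, FALSE, FALSE))

# One of the four positive/negative pairs is ordered the wrong way round.
auc_mixed <- auc_rank(c(0.9, 0.2, 0.8, 0.3), c(TRUE, FALSE, FALSE, TRUE))

c(perfect = auc_perfect, mixed = auc_mixed)
```

Because AUC depends only on the ordering of scores, it is a natural yardstick for comparing models trained on differently sampled or weighted data, whose raw probabilities are not on the same scale.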

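The AUC comparison for all of the experiments above is shown in the figure below:

AUC Dot Plot

The AUC dot plot is drawn with the following code: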

    library(dplyr)    # %>%
    library(ggplot2)  # ggplot(), geom_point(), geom_hline()

    FY16allaucs %>%
      ggplot(aes(x = rownum, y = auc, color = modeltype)) +
      geom_point() +
      ylim(.75, 1) +
      # 0.8507755 for bog standard RF model
      geom_hline(aes(yintercept = rfreferenceauc), color = "gray") +
      # 0.8410452 for caret model's AUC which is what we actually used
      geom_hline(aes(yintercept = caretreference), color = "orange")

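The final result is shown in the figure below. The gray line marks the plain ("bog standard") random forest model, with an AUC of 0.85; the orange line marks the caret model, with an AUC of 0.84.

The figure shows that the sampled models perform much better than the weighted models.

The Best Models

Interestingly, the best models all use only 50 minority-class samples; the top 3 models use 500, 50, and 1000 majority-class samples, respectively.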

    roundedauc  sampleratio  n_x  n_y
    ----------  -----------  ---  ----
    0.910       10           50   500
    0.907       1            50   50
    0.900       20           50   1000

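Oddly, two of the three worst sampled models also use 50 minority-class samples (the third uses 500), but all three use far more majority-class samples: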

    roundedauc  sampleratio  n_x  n_y
    ----------  -----------  ---  -----
    0.836       120          500  60000
    0.831       400          50   20000
    0.795       1200         50   60000
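The source never defines the sampleratio column, but in all six rows reported above it equals n_y / n_x, the number of majority-class samples per minority-class sample. The sketch below checks that reading against the reported rows and looks at how AUC moves with the logarithm of the ratio. Only the six reported rows are used; the data frame name `reported` is hypothetical.

```r
# The six rows reported in the two tables above (best three, worst three).
reported <- data.frame(
  roundedauc  = c(0.910, 0.907, 0.900, 0.836, 0.831, 0.795),
  sampleratio = c(10, 1, 20, 120, 400, 1200),
  n_x         = c(50, 50, 50, 500, 50, 50),
  n_y         = c(500, 50, 1000, 60000, 20000, 60000)
)

# Assumed definition: majority-class samples per minority-class sample.
reported$ratio_check <- reported$n_y / reported$n_x
all(reported$ratio_check == reported$sampleratio)

# Direction of the relationship between log(sample ratio) and AUC.
# Six rows are far too few for a real fit; this only shows the sign.
cor(log(reported$sampleratio), reported$roundedauc)
```

On these rows the correlation is negative: the more heavily the majority class outnumbers the minority class in the training sample, the lower the AUC.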

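The sample-ratio vs. AUC plot shows that AUC falls as the sample ratio rises; for sample ratios above 25, AUC declines roughly logarithmically:

For sample ratios below 25, the results fluctuate considerably, as shown below: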

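Summary and Outlook

Long story short, sampling looks like the winner over weighting in this article, but the author did not try combining the two approaches, for example using downsampling to reach a good sample ratio and then using weighting to penalize the majority class.

Some useful links: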
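The combination was not tried in the source, so the following is only a sketch of what it could look like: first downsample the majority class to a chosen sample ratio, then down-weight the majority-class rows that remain. A weighted logistic regression from base R stands in for the author's random forest, and the ratio of 10, the weight of 0.5, and the simulated data are all hypothetical choices.

```r
# Simulated imbalanced data (made up): positives are rare.
set.seed(7)
n   <- 5000
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(-4 + 1.5 * x))
toy <- data.frame(x = x, y = y)

# Step 1: downsample the majority class to a chosen ratio n_y / n_x.
sample_ratio <- 10                          # hypothetical choice
pos          <- which(toy$y == 1)
neg          <- which(toy$y == 0)
n_neg_keep   <- min(length(neg), sample_ratio * length(pos))
train        <- toy[c(pos, neg[sample.int(length(neg), n_neg_keep)]), ]

# Step 2: penalize the majority class with a lower per-row weight.
majority_weight <- 0.5                      # hypothetical choice
train$w <- ifelse(train$y == 1, 1, majority_weight)

# quasibinomial() avoids the non-integer-weight warning of binomial().
fit <- glm(y ~ x, family = quasibinomial(), data = train, weights = w)
coef(fit)
```

Both knobs (`sample_ratio` and `majority_weight`) would need to be tuned on held-out data that keeps the original class balance, since the training sample no longer reflects it.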

Great paper on strategies for imbalanced data
Super detailed answer on how to model with downsampling
More info from Stack Exchange about weighted random forests

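Editor's Note from 我愛機器學習 (http://52ml.net)

This article is a simple experiment by the author on imbalanced data, and it is not complete: for example, the combined approach the author mentions was never tried. The experimental data is also a single case, so treat the conclusions with caution.

Three further related articles are recommended:

The Imbalanced Data Problem
Solving Real-World Problems: How to Use Machine Learning on Imbalanced Classes?
[Digest] Learning from Imbalanced Classes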

Author: 我愛機器學習 (52ml.net)
Original author: jaket
Original article: Solutions for Modeling Imbalanced Data
Sections of the original article:

What to do when modeling really imbalanced data?

AF16 Model

Dealing with Rare Cases

Comparing Models

The Best Models

Summary and General Ending Thoughts