Boosting in Ensemble Learning: An AdaBoost Implementation
The general AdaBoost algorithm:
Input: training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ with $y_i \in \{-1, +1\}$, a base learner $G(x)$, and the number of boosting rounds $M$.

1. Initialize the weight distribution: $w_1^{(i)} = \dfrac{1}{N},\ i = 1, 2, \ldots, N$
2. for m = 1 to M:
(a) Fit a base learner $G_m(x)$ to the training set weighted by $w_m$.
(b) Compute the weighted error rate of $G_m(x)$ on the training set: $err_m = \sum_{i=1}^{N} w_m^{(i)}\, \mathbb{I}\big(G_m(x_i) \ne y_i\big)$
(c) Compute the coefficient of $G_m(x)$: $\alpha_m = \dfrac{1}{2}\ln\dfrac{1 - err_m}{err_m}$
(d) Update the sample weight distribution: $w_{m+1}^{(i)} = \dfrac{w_m^{(i)}\exp\big(-\alpha_m y_i G_m(x_i)\big)}{Z_m}$, where $Z_m = \sum_{i=1}^{N} w_m^{(i)}\exp\big(-\alpha_m y_i G_m(x_i)\big)$ is the normalization factor that makes $w_{m+1}$ a valid distribution.
3. Output the final model: $G(x) = \operatorname{sign}\Big(\sum_{m=1}^{M}\alpha_m G_m(x)\Big)$
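As a quick numeric illustration of steps (b) to (d) (a toy example of my own, not from the original post), one Discrete AdaBoost round on five made-up samples:

    import numpy as np

    # Toy illustration of one Discrete AdaBoost round (made-up data).
    y      = np.array([1,  1, -1, -1, 1])   # true labels
    y_pred = np.array([1, -1, -1, -1, 1])   # predictions of the current base learner G_m
    w      = np.full(5, 1 / 5)              # current weight distribution

    err   = np.sum(w * (y_pred != y))       # weighted error rate = 0.2
    alpha = 0.5 * np.log((1 - err) / err)   # learner coefficient, about 0.693
    w_new = w * np.exp(-alpha * y * y_pred)
    w_new /= w_new.sum()                    # normalize by Z_m
    print(err, alpha, w_new)                # the misclassified sample's weight grows to 0.5, the rest shrink to 0.125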
- In addition, the code implements Real AdaBoost, early stopping (early_stopping), weight_trimming, and staged prediction (stage_predict; see the full code, and the brief sketch after the code block below).
import time
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier


class AdaBoost(object):
    def __init__(self, M, clf, learning_rate=1.0, method="discrete",
                 tol=None, weight_trimming=None):
        self.M = M
        self.clf = clf
        self.learning_rate = learning_rate
        self.method = method
        self.tol = tol
        self.weight_trimming = weight_trimming

    def fit(self, X, y):
        # tol is the early-stopping threshold; when early stopping is used,
        # a validation set is split off from the training data
        if self.tol is not None:
            X, X_val, y, y_val = train_test_split(X, y, random_state=2)
            former_loss = 1
            count = 0
            tol_init = self.tol

        w = np.array([1 / len(X)] * len(X))  # initialize all weights to 1/n
        self.clf_total = []
        self.alpha_total = []

        for m in range(self.M):
            classifier = clone(self.clf)

            if self.method == "discrete":
                if m >= 1 and self.weight_trimming is not None:
                    # weight trimming: sort the weights, take the cumulative sum,
                    # then drop the samples whose weights are too small
                    sort_w = np.sort(w)[::-1]
                    cum_sum = np.cumsum(sort_w)
                    percent_w = sort_w[np.where(cum_sum >= self.weight_trimming)][0]
                    w_fit, X_fit, y_fit = w[w >= percent_w], X[w >= percent_w], y[w >= percent_w]
                    y_pred = classifier.fit(X_fit, y_fit, sample_weight=w_fit).predict(X)
                else:
                    y_pred = classifier.fit(X, y, sample_weight=w).predict(X)

                loss = np.zeros(len(X))
                loss[y_pred != y] = 1
                err = np.sum(w * loss)  # weighted error rate
                alpha = 0.5 * np.log((1 - err) / err) * self.learning_rate  # coefficient alpha of the base learner
                w = (w * np.exp(-y * alpha * y_pred)) / np.sum(w * np.exp(-y * alpha * y_pred))  # update the weight distribution
                self.alpha_total.append(alpha)
                self.clf_total.append(classifier)

            elif self.method == "real":
                if m >= 1 and self.weight_trimming is not None:
                    sort_w = np.sort(w)[::-1]
                    cum_sum = np.cumsum(sort_w)
                    percent_w = sort_w[np.where(cum_sum >= self.weight_trimming)][0]
                    w_fit, X_fit, y_fit = w[w >= percent_w], X[w >= percent_w], y[w >= percent_w]
                    y_pred = classifier.fit(X_fit, y_fit, sample_weight=w_fit).predict_proba(X)[:, 1]
                else:
                    y_pred = classifier.fit(X, y, sample_weight=w).predict_proba(X)[:, 1]

                y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
                clf = 0.5 * np.log(y_pred / (1 - y_pred)) * self.learning_rate
                w = (w * np.exp(-y * clf)) / np.sum(w * np.exp(-y * clf))
                self.clf_total.append(classifier)

            # early stopping
            if m % 10 == 0 and m > 300 and self.tol is not None:
                if self.method == "discrete":
                    p = np.array([self.alpha_total[i] * self.clf_total[i].predict(X_val)
                                  for i in range(m)])
                elif self.method == "real":
                    p = []
                    for i in range(m):
                        ppp = self.clf_total[i].predict_proba(X_val)[:, 1]
                        ppp = np.clip(ppp, 1e-15, 1 - 1e-15)
                        p.append(self.learning_rate * 0.5 * np.log(ppp / (1 - ppp)))
                    p = np.array(p)

                stage_pred = np.sign(p.sum(axis=0))
                later_loss = zero_one_loss(stage_pred, y_val)

                if later_loss > (former_loss + self.tol):
                    count += 1
                    self.tol = self.tol / 2
                else:
                    count = 0
                    self.tol = tol_init
                if count == 2:
                    self.M = m - 20
                    print("early stopping in round {}, best round is {}, M = {}".format(m, m - 20, self.M))
                    break
                former_loss = later_loss

        return self

    def predict(self, X):
        if self.method == "discrete":
            pred = np.array([self.alpha_total[m] * self.clf_total[m].predict(X)
                             for m in range(self.M)])
        elif self.method == "real":
            pred = []
            for m in range(self.M):
                p = self.clf_total[m].predict_proba(X)[:, 1]
                p = np.clip(p, 1e-15, 1 - 1e-15)
                pred.append(0.5 * np.log(p / (1 - p)))
        return np.sign(np.sum(pred, axis=0))


if __name__ == "__main__":
    # compare accuracy and training time of each variant
    X, y = datasets.make_hastie_10_2(n_samples=20000, random_state=1)  # data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    start_time = time.time()
    model_discrete = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1),
                              learning_rate=1.0, method="discrete", weight_trimming=None)
    model_discrete.fit(X_train, y_train)
    pred_discrete = model_discrete.predict(X_test)
    acc = np.zeros(pred_discrete.shape)
    acc[np.where(pred_discrete == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_discrete)
    print("Discrete Adaboost accuracy: ", accuracy)
    print("Discrete Adaboost time: ", "{:.2f}".format(time.time() - start_time), "\n")

    start_time = time.time()
    model_real = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1),
                          learning_rate=1.0, method="real", weight_trimming=None)
    model_real.fit(X_train, y_train)
    pred_real = model_real.predict(X_test)
    acc = np.zeros(pred_real.shape)
    acc[np.where(pred_real == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_real)
    print("Real Adaboost accuracy: ", accuracy)
    print("Real Adaboost time: ", "{:.2f}".format(time.time() - start_time), "\n")

    start_time = time.time()
    model_discrete_weight = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1),
                                     learning_rate=1.0, method="discrete", weight_trimming=0.995)
    model_discrete_weight.fit(X_train, y_train)
    pred_discrete_weight = model_discrete_weight.predict(X_test)
    acc = np.zeros(pred_discrete_weight.shape)
    acc[np.where(pred_discrete_weight == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_discrete_weight)
    print("Discrete Adaboost(weight_trimming 0.995) accuracy: ", accuracy)
    print("Discrete Adaboost(weight_trimming 0.995) time: ", "{:.2f}".format(time.time() - start_time), "\n")

    start_time = time.time()
    model_real_weight = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1),
                                 learning_rate=1.0, method="real", weight_trimming=0.999)
    model_real_weight.fit(X_train, y_train)
    pred_real_weight = model_real_weight.predict(X_test)
    acc = np.zeros(pred_real_weight.shape)
    acc[np.where(pred_real_weight == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_real_weight)
    print("Real Adaboost(weight_trimming 0.999) accuracy: ", accuracy)
    print("Real Adaboost(weight_trimming 0.999) time: ", "{:.2f}".format(time.time() - start_time), "\n")

    start_time = time.time()
    model_discrete = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1),
                              learning_rate=1.0, method="discrete", weight_trimming=None, tol=0.0001)
    model_discrete.fit(X_train, y_train)
    pred_discrete = model_discrete.predict(X_test)
    acc = np.zeros(pred_discrete.shape)
    acc[np.where(pred_discrete == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_discrete)
    print("Discrete Adaboost accuracy (early_stopping): ", accuracy)
    print("Discrete Adaboost time (early_stopping): ", "{:.2f}".format(time.time() - start_time), "\n")

    start_time = time.time()
    model_real = AdaBoost(M=2000, clf=DecisionTreeClassifier(max_depth=1, random_state=1),
                          learning_rate=1.0, method="real", weight_trimming=None, tol=0.0001)
    model_real.fit(X_train, y_train)
    pred_real = model_real.predict(X_test)
    acc = np.zeros(pred_real.shape)
    acc[np.where(pred_real == y_test)] = 1
    accuracy = np.sum(acc) / len(pred_real)
    print("Real Adaboost accuracy (early_stopping): ", accuracy)
    print("Real Adaboost time (early_stopping): ", "{:.2f}".format(time.time() - start_time), "\n")
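The staged prediction (stage_predict) referred to above appears only in the full code. As a rough idea of what it might look like for the discrete case, a generator method along these lines could be added to the AdaBoost class (my sketch, not necessarily the author's exact implementation):

    def stage_predict(self, X):
        # Sketch only: yield the ensemble prediction after each boosting round,
        # so the train/test error can be tracked round by round (discrete case).
        score = np.zeros(len(X))
        for m in range(self.M):
            score += self.alpha_total[m] * self.clf_total[m].predict(X)
            yield np.sign(score)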
Output:
Discrete Adaboost accuracy: 0.954
Discrete Adaboost time: 43.47

Real Adaboost accuracy: 0.9758
Real Adaboost time: 41.15

Discrete Adaboost(weight_trimming 0.995) accuracy: 0.9528
Discrete Adaboost(weight_trimming 0.995) time: 39.58

Real Adaboost(weight_trimming 0.999) accuracy: 0.9768
Real Adaboost(weight_trimming 0.999) time: 25.39

early stopping in round 750, best round is 730, M = 730
Discrete Adaboost accuracy (early_stopping): 0.9268
Discrete Adaboost time (early_stopping): 14.60

early stopping in round 539, best round is 519, M = 519
Real Adaboost accuracy (early_stopping): 0.974
Real Adaboost time (early_stopping): 11.64
- We can see that weight_trimming speeds up Discrete AdaBoost very little, but gives Real AdaBoost a noticeable speed-up. A likely reason is that the weights in Discrete AdaBoost stay fairly spread out from round to round, whereas in Real AdaBoost the weights concentrate on a small number of samples.
- Early stopping kicks in at rounds 750 and 539 respectively, and the final accuracy remains acceptable.
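The point about weight concentration can be probed directly. Below is a rough, self-contained check of my own (not from the original post): it replays the discrete weight update for a few hundred rounds and counts how many samples end up holding 99% of the total weight; the analogous loop using predict_proba and the real-valued update would show the much stronger concentration under Real AdaBoost.

    import numpy as np
    from sklearn import datasets
    from sklearn.tree import DecisionTreeClassifier

    # Illustrative check of weight concentration under the discrete update.
    X, y = datasets.make_hastie_10_2(n_samples=2000, random_state=1)
    w = np.full(len(X), 1 / len(X))
    for m in range(200):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()

    sorted_w = np.sort(w)[::-1]
    n_99 = np.searchsorted(np.cumsum(sorted_w), 0.99) + 1  # samples covering 99% of the weight
    print("samples holding 99% of the weight:", n_99, "out of", len(X))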
The next two figures show that with weight_trimming the accuracy is nearly the same as plain AdaBoost (except in the 0.95 case).
Discrete AdaBoost vs. Real AdaBoost - Overfitting
AdaBoost has an attractive property: it "does not overfit", or more precisely, continuing to train after the training error has dropped to zero can still improve generalization. As the figure below shows, with 10,000 trees the training error of Real AdaBoost drops to zero very early, while the test error stays almost flat. It is also clear that Real AdaBoost beats Discrete AdaBoost in both training speed and accuracy.
Margin theory offers an explanation of this phenomenon: as the number of boosting rounds grows, the margins of the predictions on the training samples keep increasing even after the training error reaches zero, which amounts to continually raising the confidence of the predictions. This theory has been debated in the research community for well over a decade; see the paper by the inventor of AdaBoost [Schapire, Explaining AdaBoost] for details.
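The margins are easy to compute from the model above. For a sample $(x_i, y_i)$ the normalized margin is $y_i \sum_m \alpha_m G_m(x_i) \big/ \sum_m |\alpha_m|$; here is a rough sketch of mine for the discrete model (it assumes a fitted model such as model_discrete from the script above):

    import numpy as np

    def margins(model, X, y):
        # Normalized margins of a fitted discrete-AdaBoost model: larger margins
        # mean more confident predictions on those samples.
        score = np.zeros(len(X))
        for alpha, clf in zip(model.alpha_total, model.clf_total):
            score += alpha * clf.predict(X)
        return y * score / np.sum(np.abs(model.alpha_total))

    # e.g. plotting the distribution of margins(model_discrete, X_train, y_train) at
    # different round counts shows it shifting right even after the training error is zero.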
Learning Curve
The learning curve is another way to evaluate a model: it shows how the training error and test error change as the training set grows. Typically, if the two curves are close together and both errors are high, the model is underfitting; if the training error is low while the test error is high, leaving a large gap between the two curves, the model is overfitting.
Let's look at AdaBoost's learning curves on the dataset above:
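(The curves come from a helper in the full code; the following is only a rough sketch of the idea, using the AdaBoost class defined above, with my own choice of subset step and number of boosting rounds, so the details may differ from the author's script.)

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Refit the model on growing subsets of the training data and record the
    # train/test error at each size; plotting these gives the learning curve.
    X, y = datasets.make_hastie_10_2(n_samples=5000, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

    sizes = range(100, len(X_train) + 1, 100)   # coarser steps than the post, to keep it cheap
    train_err, test_err = [], []
    for n in sizes:
        model = AdaBoost(M=100, clf=DecisionTreeClassifier(max_depth=1),   # fewer rounds than the post
                         learning_rate=1.0, method="discrete").fit(X_train[:n], y_train[:n])
        train_err.append(np.mean(model.predict(X_train[:n]) != y_train[:n]))
        test_err.append(np.mean(model.predict(X_test) != y_test))
    # plot sizes against train_err and test_err to obtain the learning curve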
Only 5,000 samples are used here (2,500 for training + 2,500 for testing), because plotting a learning curve typically requires fitting N models (N being the number of training samples), which is computationally expensive. From the figure, Discrete AdaBoost looks underfit, while Real AdaBoost looks more like it is overfitting; with more data, the test error of Real AdaBoost might drop further.
Full code