GBDT in Practice

05-08

Environment

Python environment: Anaconda3

Operating system: Windows 7 x64

Code source: scikit-learn

GBDT for classification:

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

if __name__ == "__main__":
    # Generate the binary classification data (12000 samples, 10 features)
    X, y = make_hastie_10_2(random_state=0)
    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]

    # 100 boosting stages of depth-1 trees (decision stumps)
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                     max_depth=1, random_state=0).fit(X_train, y_train)

    # Mean accuracy on the held-out samples
    score = clf.score(X_test, y_test)
    print(score)

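Explanation:

make_hastie_10_2:

Generates the binary classification problem from Hastie et al. (Example 10.2, hence the function name). The data has 10 features and, by default, 12000 samples.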
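As a quick check of the description above, the sketch below inspects the generated data. The 9.34 threshold comes from the scikit-learn documentation for make_hastie_10_2, not from this post, so treat it as an assumption about the installed version.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(random_state=0)

print(X.shape)        # (12000, 10)
print(np.unique(y))   # the two class labels, -1.0 and 1.0

# Labels follow a fixed rule: +1 when the squared norm of a row exceeds 9.34
rule = np.where((X ** 2).sum(axis=1) > 9.34, 1.0, -1.0)
print(np.array_equal(y, rule))
```

Because the label depends on the squared norm of all 10 features together, no single feature separates the classes, which is why the example needs many boosting stages.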

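GradientBoostingClassifier:

loss (loss function):

Options: {'deviance', 'exponential'}. The default is 'deviance' (the logistic loss, as in logistic regression). Newer scikit-learn releases name this option 'log_loss'.

learning_rate:

Defaults to 0.1. It trades off against n_estimators: a smaller learning rate generally needs more trees to reach the same fit.

max_depth:

Sets the maximum depth of each residual-fitting regression tree.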
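To make the effect of max_depth concrete, here is a minimal sketch that fits a small model and inspects its first tree. The sample size and n_estimators=10 are arbitrary choices for illustration, not values from this post.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(n_samples=2000, random_state=0)

clf = GradientBoostingClassifier(n_estimators=10, max_depth=1, random_state=0)
clf.fit(X, y)

# estimators_ holds the fitted regression trees, one row per boosting stage
first_tree = clf.estimators_[0, 0]
print(first_tree.tree_.max_depth)    # depth of the first tree
print(first_tree.tree_.node_count)   # a stump has one split node and two leaves
```

With max_depth=1 every tree is a decision stump, so each boosting stage can only use a single feature.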

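clf.score:

Returns the mean accuracy of the classifier on the given test data and labels. It does not evaluate the loss function.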
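For a classifier, score reports the fraction of correctly predicted labels. The sketch below checks this against a manual computation and against accuracy_score; the smaller sample size and n_estimators=50 are assumptions made only to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_hastie_10_2(n_samples=4000, random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)

score = clf.score(X_test, y_test)
manual = np.mean(clf.predict(X_test) == y_test)   # fraction of correct labels
print(score, manual, accuracy_score(y_test, clf.predict(X_test)))
```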

GBDT for regression (numeric prediction):

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

if __name__ == "__main__":
    # Generate the Friedman #1 regression data; train on the first 200 samples
    X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
    X_train, X_test = X[:200], X[200:]
    y_train, y_test = y[:200], y[200:]

    # 'ls' is least squares; newer scikit-learn releases call it 'squared_error'
    est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    max_depth=1, random_state=0,
                                    loss='ls').fit(X_train, y_train)

    # Mean squared error on the held-out samples
    squared = mean_squared_error(y_test, est.predict(X_test))
    print(squared)

Explanation:

make_friedman1:

Generates the "Friedman #1" regression problem, a synthetic benchmark dataset.
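The targets of make_friedman1 follow a fixed formula, which the sketch below reproduces. The formula is taken from the scikit-learn documentation rather than from this post, and noise is set to 0.0 here (unlike the example above) so the comparison can be exact.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# noise=0.0 so the targets can be reproduced exactly from the formula
X, y = make_friedman1(n_samples=200, noise=0.0, random_state=0)

expected = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20 * (X[:, 2] - 0.5) ** 2
            + 10 * X[:, 3]
            + 5 * X[:, 4])

print(X.shape)                   # (200, 10): only the first 5 features drive y
print(np.allclose(y, expected))
```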

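GradientBoostingRegressor:

loss (loss function):

Options: {'ls', 'lad', 'huber', 'quantile'}. The default is 'ls' (least squares).

learning_rate:

Defaults to 0.1. It trades off against n_estimators: a smaller learning rate generally needs more trees to reach the same fit.

max_depth:

Sets the maximum depth of each residual-fitting regression tree.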
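The trade-off between learning_rate and n_estimators can be explored with a small sweep. This is only a sketch: the four (learning_rate, n_estimators) pairs are arbitrary illustrative choices, and the resulting numbers depend on the data and the scikit-learn version.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# (learning_rate, n_estimators) pairs chosen only for illustration
settings = [(1.0, 100), (0.1, 100), (0.1, 500), (0.01, 500)]

results = {}
for learning_rate, n_estimators in settings:
    est = GradientBoostingRegressor(n_estimators=n_estimators,
                                    learning_rate=learning_rate,
                                    max_depth=1, random_state=0)
    est.fit(X_train, y_train)
    results[(learning_rate, n_estimators)] = mean_squared_error(
        y_test, est.predict(X_test))

for (learning_rate, n_estimators), mse in results.items():
    print(f"learning_rate={learning_rate}, n_estimators={n_estimators}: MSE={mse:.4f}")
```

Comparing the printed test errors shows how the two parameters interact on this dataset.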

mean_squared_error:

Computes the mean of the squared differences between the true and predicted values.
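A small worked example makes the definition concrete. The four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

manual = np.mean((y_true - y_pred) ** 2)   # (0.25 + 0.25 + 0 + 1) / 4
print(manual)                              # 0.375
print(mean_squared_error(y_true, y_pred))  # 0.375
```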

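GBDT for house price prediction:

Predict house prices for new samples
Compute and compare the loss on the training data and the test data at each boosting stage
Compute and compare how important each feature is to the trained model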

import numpy as np
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

# #############################################################################
# Load data (load_boston ships with older scikit-learn releases; removed in 1.2)
boston = datasets.load_boston()
X, y = shuffle(boston.data, boston.target, random_state=13)
X = X.astype(np.float32)
offset = int(X.shape[0] * 0.9)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]

# #############################################################################
# Fit regression model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)

clf.fit(X_train, y_train)
mse = mean_squared_error(y_test, clf.predict(X_test))
print("MSE: %.4f" % mse)

# #############################################################################
# Plot training deviance

# compute test set deviance
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
         label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

# #############################################################################
# Plot feature importance
feature_importance = clf.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
plt.subplot(1, 2, 2)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, boston.feature_names[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()

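Output: the MSE is printed and a two-panel figure (Deviance, Variable Importance) is shown.

Code walkthrough:

boston = datasets.load_boston()

Loads the Boston housing dataset bundled with scikit-learn (no remote download takes place).

mse = mean_squared_error(y_test, clf.predict(X_test))

Computes the mean squared error on the test set.

This is the end of the house price prediction code.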

==============================================

The remaining code draws the plots for comparison.

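for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(y_test, y_pred)

staged_predict yields the prediction after each boosting stage in turn, from stage 1 up to stage n_estimators.

enumerate pairs each of those predictions with its index.

clf.loss_(y_test, y_pred): computes the loss using the method set by the loss parameter, here the least squares loss.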
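The per-stage test loss can also be computed without clf.loss_, which is not available in every scikit-learn release. The sketch below swaps in mean_squared_error and uses make_friedman1 data instead of the Boston data; both substitutions, and the model settings, are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=600, random_state=0, noise=1.0)
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

n_estimators = 50
est = GradientBoostingRegressor(n_estimators=n_estimators, learning_rate=0.1,
                                max_depth=2, random_state=0).fit(X_train, y_train)

# One entry per boosting stage: test MSE of the model truncated at that stage
test_mse = np.zeros(n_estimators)
for i, y_pred in enumerate(est.staged_predict(X_test)):
    test_mse[i] = mean_squared_error(y_test, y_pred)

# The last stage is the full model, so it matches est.predict
print(test_mse[0], test_mse[-1])
print(np.isclose(test_mse[-1], mean_squared_error(y_test, est.predict(X_test))))
```

Using staged_predict avoids refitting the model once per stage count, since a single fit already contains every intermediate model.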

plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-', label='Training Set Deviance')

np.arange(params['n_estimators']) + 1: the horizontal axis runs from 1 to n_estimators.

clf.train_score_: the training loss at each boosting stage, stored on the fitted estimator.
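A minimal sketch of what train_score_ holds after fitting. The dataset and model settings are arbitrary, and the exact scale of the stored loss values can differ between scikit-learn versions, so only the length and the overall trend are shown.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=400, random_state=0, noise=1.0)

est = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1,
                                max_depth=2, random_state=0).fit(X, y)

# train_score_ is filled in during fit, one value per boosting stage
print(len(est.train_score_))                       # 30
print(est.train_score_[0] > est.train_score_[-1])  # compare first and last stage
```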

plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-', label='Test Set Deviance')

np.arange(params['n_estimators']) + 1: the same horizontal axis, from 1 to n_estimators.

test_score: the test set loss computed in the loop above.

feature_importance = clf.feature_importances_

Gets the feature importances:

Each importance is divided by the largest one and scaled to a 0-100 range.
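The rescaling step can be tried on its own. This sketch uses make_friedman1 data and a small model, both chosen only for illustration, and ranks the features from most to least important.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0, noise=1.0)
est = GradientBoostingRegressor(n_estimators=50, max_depth=2, random_state=0).fit(X, y)

importance = est.feature_importances_          # non-negative, sums to 1
relative = 100.0 * (importance / importance.max())
order = np.argsort(relative)[::-1]             # most important feature first

for idx in order:
    print(f"feature {idx}: {relative[idx]:.1f}")
```

After rescaling, the most important feature always reads 100 and the others are expressed relative to it.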

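How feature importance (the score) is computed [1-2]:

The method Friedman proposed in the GBM paper:
The global importance of feature $j$ is the average of its importance in the individual trees:
$\hat{J_{j}^2}=\frac{1}{M}\sum_{m=1}^{M}\hat{J_{j}^2}(T_m)$
where $M$ is the number of trees. The importance of feature $j$ in a single tree $T$ is:
$\hat{J_{j}^2}(T)=\sum_{t=1}^{L-1}\hat{i_{t}^2}\,1(v_{t}=j)$
where $L$ is the number of leaf nodes, so $L-1$ is the number of non-leaf nodes (every tree built is a binary tree whose internal nodes have left and right children), $v_{t}$ is the feature associated with node $t$, and $\hat{i_{t}^2}$ is the reduction in squared loss after splitting node $t$.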
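Friedman's measure described above can be sketched in a few lines: within one tree, sum the squared-loss reduction of every non-leaf node that splits on feature $j$, then average that per-tree score over all trees. The trees below are hypothetical hand-written data, and the helper names tree_importance and global_importance are made up for this sketch. scikit-learn's own feature_importances_ may differ in detail, for example in normalization.

```python
from collections import defaultdict

# Hypothetical trees: each is a list of (split_feature, squared_loss_reduction)
# pairs, one pair per non-leaf node. The numbers are made up for illustration.
trees = [
    [(0, 4.0), (1, 1.0), (0, 2.0)],
    [(1, 3.0), (2, 1.0)],
]

def tree_importance(tree):
    """Sum the loss reduction of every node that splits on each feature."""
    scores = defaultdict(float)
    for feature, reduction in tree:
        scores[feature] += reduction
    return scores

def global_importance(trees, n_features):
    """Average the per-tree importance of each feature over all M trees."""
    totals = [0.0] * n_features
    for tree in trees:
        for feature, score in tree_importance(tree).items():
            totals[feature] += score
    return [total / len(trees) for total in totals]

print(global_importance(trees, n_features=3))  # [3.0, 2.0, 0.5]
```

Feature 0 appears only in the first tree (4.0 + 2.0 = 6.0), so its average over the two trees is 3.0.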

References:

[1] Feature importance computation in tree ensemble algorithms

[2] Gradient Boosted Feature Selection

Closing notes

Note

Classification with more than 2 classes requires the induction of n_classes regression trees at each iteration, thus, the total number of induced trees equals n_classes * n_estimators. For datasets with a large number of classes we strongly recommend to use RandomForestClassifier as an alternative to GradientBoostingClassifier.

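As the scikit-learn documentation notes, when there are more than 2 classes, boosting actually induces n_classes * n_estimators regression trees, where n_estimators is the configured number of boosting stages (not a tree depth). So when the number of classes is large, RandomForestClassifier (random forest) is the recommended alternative to GradientBoostingClassifier (GBDT).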
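The tree count can be checked directly on a fitted model. This sketch uses the 3-class iris dataset bundled with scikit-learn and n_estimators=20, both picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

clf = GradientBoostingClassifier(n_estimators=20, max_depth=1, random_state=0).fit(X, y)

# One regression tree per class at every boosting stage
print(clf.estimators_.shape)   # (20, 3)
print(clf.estimators_.size)    # 60 = n_estimators * n_classes
```

With hundreds of classes the same setting would train hundreds of trees per stage, which is the cost the note above warns about.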