A Python source-code implementation of GBDT

05-10

1. Preface

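This time we walk through a Python implementation of Gradient Boosting Decision Tree (GBDT). I read many textbooks and blogs on how GBDT works and found none that explained it really clearly. Later, after getting past the firewall to search Google, I found a slide deck that explains it thoroughly and made everything click. The slides are entirely in English, but they are genuinely not hard to read, so have a look: http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf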

Suggested reading order: read the source code first, then come back to the explanations of its key methods below. Source: RRdmlearning/Machine-Learning-From-Scratch

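For some reason the code formatting on Zhihu is harder to follow than in the original post, so you may prefer to read it on the cs229 forum community (deep learning | machine learning | artificial intelligence community).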

2. Source code walkthrough

Like random forest, GBDT relies on subclasses of the decision tree. I explained the code of those subclasses in my previous article.

If you have not come across decision trees before, see my article "The decision tree subclasses behind random forest, GBDT and xgboost".

2.1 __init__()

    """
    Parameters:
    -----------
    n_estimators: int
        The number of trees that are used.
    learning_rate: float
        The learning rate of the gradient descent, i.e. the step length taken
        when following the negative gradient during training.
    min_samples_split: int
        The minimum number of samples needed to make a split when building a
        tree (nodes with fewer samples are not split further).
    min_impurity: float
        The minimum impurity required to split the tree further (below this,
        splitting stops).
    max_depth: int
        The maximum depth of a tree (beyond this, splitting stops).
    regression: boolean
        True or False depending on whether we're doing regression or
        classification.
    """
    def __init__(self, n_estimators, learning_rate, min_samples_split,
                 min_impurity, max_depth, regression):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.min_samples_split = min_samples_split
        self.min_impurity = min_impurity
        self.max_depth = max_depth
        self.regression = regression

        # Progress bar
        self.bar = progressbar.ProgressBar(widgets=bar_widgets)

        self.loss = SquareLoss()
        if not self.regression:
            self.loss = SotfMaxLoss()

        # Classification also uses regression trees, learning the
        # probabilities from the residuals
        self.trees = []
        for i in range(self.n_estimators):
            self.trees.append(RegressionTree(min_samples_split=self.min_samples_split,
                                             min_impurity=self.min_impurity,
                                             max_depth=self.max_depth))

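This creates a GBDT made of n_estimators trees. Note that classification problems use regression trees too: the trees learn the class probabilities by fitting residuals.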
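For regression trees to learn class probabilities, each class needs its own numeric target column. A minimal sketch of one way to get there is one-hot encoding the integer labels before calling fit(). The helper name to_one_hot is hypothetical, and the assumption that labels are encoded this way before training is mine rather than something shown in this post.

```python
import numpy as np

def to_one_hot(labels, n_classes=None):
    """Hypothetical helper: turn integer class labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    if n_classes is None:
        n_classes = int(labels.max()) + 1
    one_hot = np.zeros((labels.shape[0], n_classes))
    # Put a 1.0 in the column that matches each sample's class
    one_hot[np.arange(labels.shape[0]), labels] = 1.0
    return one_hot

y_one_hot = to_one_hot([0, 2, 1, 2])
print(y_one_hot)
```

With targets in this shape, a regression tree can output one score per class, and the residual for every class is well defined.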

2.2 fit()

    def fit(self, X, y):
        # Fit the first tree to the targets themselves
        self.trees[0].fit(X, y)
        y_pred = self.trees[0].predict(X)
        for i in self.bar(range(1, self.n_estimators)):
            gradient = self.loss.gradient(y, y_pred)
            self.trees[i].fit(X, gradient)
            y_pred -= np.multiply(self.learning_rate, self.trees[i].predict(X))

Each pass of the for loop fits the next tree to the "residual" (the gradient) left over by the trees before it.
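To see the loop in isolation, here is a minimal, self-contained sketch of the same idea for square-loss regression. It is not the repository's code: Stump is a hypothetical depth-1 regression tree standing in for RegressionTree, boost is a hypothetical function name, and the gradient is written out directly as y_pred - y instead of going through a loss class.

```python
import numpy as np

class Stump:
    """Hypothetical depth-1 regression tree standing in for RegressionTree."""

    def fit(self, X, target):
        best_error = np.inf
        # Fallback when no split is possible: predict the mean everywhere
        self.feature, self.threshold = 0, np.inf
        self.left_value = self.right_value = target.mean()
        for feature in range(X.shape[1]):
            # Dropping the largest value keeps both sides of the split non-empty
            for threshold in np.unique(X[:, feature])[:-1]:
                left = X[:, feature] <= threshold
                left_value = target[left].mean()
                right_value = target[~left].mean()
                error = (np.sum((target[left] - left_value) ** 2)
                         + np.sum((target[~left] - right_value) ** 2))
                if error < best_error:
                    best_error = error
                    self.feature, self.threshold = feature, threshold
                    self.left_value, self.right_value = left_value, right_value
        return self

    def predict(self, X):
        left = X[:, self.feature] <= self.threshold
        return np.where(left, self.left_value, self.right_value)

def boost(X, y, n_estimators=20, learning_rate=0.5):
    # The first tree fits the targets themselves
    trees = [Stump().fit(X, y)]
    y_pred = trees[0].predict(X)
    for _ in range(1, n_estimators):
        # Gradient of 0.5 * (y - F)^2 with respect to the prediction F
        gradient = y_pred - y
        tree = Stump().fit(X, gradient)
        # Step against the gradient, scaled by the learning rate
        y_pred = y_pred - learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, y_pred

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

_, one_tree_pred = boost(X, y, n_estimators=1)
_, boosted_pred = boost(X, y, n_estimators=20)
mse_one = np.mean((y - one_tree_pred) ** 2)
mse_boosted = np.mean((y - boosted_pred) ** 2)
print(mse_one, mse_boosted)
```

Each extra stump only has to correct what the ensemble so far still gets wrong, which is why the training error with 20 stumps ends up below the error of the first stump alone.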

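The "residual" is derived from the gradient. For square loss, L = 1/2 * (yi - F(xi))^2, the negative gradient with respect to F(xi) is yi - F(xi), which is exactly the residual (a true residual in this case). Note that fit() subtracts each tree's prediction, which is consistent with self.loss.gradient returning the gradient itself, F(xi) - yi, rather than its negative.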
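As a quick numerical check of the square-loss case, the sketch below compares the analytic gradient with a finite-difference estimate and with the residual. SquareLossSketch is an illustrative stand-in written for this check, not the repository's SquareLoss class, and its sign convention (gradient = -(y - y_pred)) is an assumption.

```python
import numpy as np

class SquareLossSketch:
    """Illustrative stand-in, not the repository's SquareLoss class."""

    def loss(self, y, y_pred):
        return 0.5 * (y - y_pred) ** 2

    def gradient(self, y, y_pred):
        # dL/dF = -(y - F): the gradient is the negative of the residual
        return -(y - y_pred)

y = np.array([3.0, -1.0, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
square_loss = SquareLossSketch()

residual = y - y_pred
neg_gradient = -square_loss.gradient(y, y_pred)

# Central finite difference of the loss with respect to the prediction
eps = 1e-6
numeric_gradient = (square_loss.loss(y, y_pred + eps)
                    - square_loss.loss(y, y_pred - eps)) / (2 * eps)

print(residual, neg_gradient, numeric_gradient)
```

The negative gradient and the residual coincide element by element, and the finite-difference estimate agrees with the analytic gradient.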

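With other loss functions, what the trees fit is really the gradient rather than a literal residual. The slides I recommended above cover the details very thoroughly.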
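One common example of such a loss is cross-entropy on softmax scores, where the gradient with respect to the raw scores is softmax(scores) - y. The sketch below checks that formula numerically. The function names are hypothetical, and this is not claimed to be what the repository's SotfMaxLoss class computes.

```python
import numpy as np

def softmax(scores):
    # Subtracting the row maximum avoids overflow and leaves the result unchanged
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(y, scores):
    """Per-sample loss: -sum_k y_k * log(softmax(scores)_k)."""
    return -np.sum(y * np.log(softmax(scores)), axis=1)

def cross_entropy_gradient(y, scores):
    """Gradient with respect to the raw scores: softmax(scores) - y."""
    return softmax(scores) - y

y = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
scores = np.array([[0.2, -0.4, 1.0], [0.5, 0.1, 0.3]])
analytic = cross_entropy_gradient(y, scores)

# Finite-difference check of a single entry, scores[0, 1]
eps = 1e-6
bumped_up, bumped_down = scores.copy(), scores.copy()
bumped_up[0, 1] += eps
bumped_down[0, 1] -= eps
numeric = (cross_entropy(y, bumped_up)[0]
           - cross_entropy(y, bumped_down)[0]) / (2 * eps)

print(analytic[0, 1], numeric)
```

Here the quantity a tree would fit, y - softmax(scores), is a difference between the one-hot target and the predicted probability, so it plays the role of a residual without being the residual of the raw scores.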

2.3 predict()

    def predict(self, X):
        y_pred = self.trees[0].predict(X)
        for i in range(1, self.n_estimators):
            y_pred -= np.multiply(self.learning_rate, self.trees[i].predict(X))

        if not self.regression:
            # Turn into probability distribution
            y_pred = np.exp(y_pred) / np.expand_dims(np.sum(np.exp(y_pred), axis=1), axis=1)
            # Set label to the value that maximizes probability
            y_pred = np.argmax(y_pred, axis=1)
        return y_pred

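The for loop accumulates every later tree's prediction of the residual, scaled by the learning rate, on top of the first tree's prediction to get the final result. For classification, the accumulated scores are then turned into probabilities with a softmax, and the most probable class is returned as the label.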
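The classification branch of predict() can be tried on its own. The sketch below applies the same two NumPy steps to a small hand-made score matrix. The function name scores_to_labels and the example scores are made up for illustration.

```python
import numpy as np

def scores_to_labels(scores):
    """Softmax over each row, then pick the most probable class."""
    exp = np.exp(scores)
    # expand_dims turns the row sums into a column so the division broadcasts per row
    proba = exp / np.expand_dims(np.sum(exp, axis=1), axis=1)
    return proba, np.argmax(proba, axis=1)

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
proba, labels = scores_to_labels(scores)
print(proba)
print(labels)
```

Because softmax preserves the ordering within each row, the argmax of the probabilities is the same as the argmax of the raw scores. The softmax step matters when the probabilities themselves are needed.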

3. Source code location

https://github.com/RRdmlearning/Machine-Learning-From-Scratch/tree/master/gradient_boosting_decision_tree

Simply run gbdt_classifier_example.py or gbd_regressor_example.py.

The project contains concise implementations of many machine learning algorithms.

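I wrote this article to record my own learning journey, and I hope it also gives other beginners a little help. If you spot mistakes or have questions, feel free to discuss them with me.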