Learn R | GBDT of Data Mining (Part 4)

01-24

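This is the fourth article in the GBDT series. Having covered the basics of XGBoost and its mathematical derivation, we now look at what sets XGBoost apart from GBDT and at how to use the algorithm in R.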

1. Strengths of XGBoost

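Both are gradient boosting and both are ensemble learning, so where does XGBoost improve on GBDT? Drawing on the earlier derivation and on related blog posts (see the references at the end), the main points are as follows: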

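GBDT uses CART as its base learner. XGBoost additionally supports a linear booster, in which case it is equivalent to logistic regression (for classification) or linear regression (for regression) with $L_1$ and $L_2$ regularization terms.
XGBoost adds a regularization term to the objective function to control model complexity. The term contains the number of leaf nodes in a tree and the squared $L_2$ norm of the scores output at each tree's leaves. In terms of the bias-variance tradeoff, regularization lowers the model's variance, yields a simpler learned model, and guards against overfitting.
Traditional GBDT uses only first-order derivatives during optimization, whereas XGBoost takes a second-order Taylor expansion of the objective and uses both first- and second-order derivatives. (XGBoost also supports custom loss functions, as long as their first and second derivatives exist.)
When splitting a tree node, we need the gain of every candidate split point of every feature, that is, a greedy enumeration of all possible split points. When the data cannot be loaded into memory at once, or in a distributed setting, the exact greedy algorithm becomes very inefficient, so XGBoost offers an approximate algorithm. The rough idea is to propose a handful of candidate split points from the percentiles of each feature, and then pick the best of those candidates using the split-gain formula.
Shrinkage is equivalent to a learning rate (eta in XGBoost). After each iteration XGBoost multiplies the leaf weights by this factor, mainly to weaken the influence of each individual tree and leave more room for later trees to learn. In practice eta is usually set small and the number of iterations large. (Ordinary GBDT implementations also have a learning rate.)
Feature columns are sorted and stored in memory as blocks that are reused across iterations. Although boosting iterations must run serially, the work over feature columns can be parallelized.
Column subsampling: borrowing from random forests, XGBoost supports column subsampling, which both reduces overfitting and cuts computation. This is another feature that distinguishes XGBoost from traditional GBDT.
Beyond this, XGBoost also considers how to use disk effectively when the data is large and memory is insufficient, combining multithreading, data compression, and sharding to make the algorithm as efficient as possible.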
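
To make the split-finding point above concrete, here is a minimal base-R sketch of percentile-based candidate search. It assumes the usual notation from the XGBoost paper, which may differ from the symbols used earlier in this series: $G$ and $H$ are the sums of first- and second-order gradients in a node, and a split is scored by $\mathrm{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$. The function names `split_gain` and `best_split_approx` are made up for illustration, and plain quantiles are used in place of the weighted quantile sketch that XGBoost itself uses.

```r
# Gain of one split under the second-order objective.
# lambda: L2 penalty on leaf weights; gamma: penalty per extra leaf.
split_gain <- function(g, h, left, lambda = 1, gamma = 0) {
  score <- function(G, H) G^2 / (H + lambda)
  0.5 * (score(sum(g[left]), sum(h[left])) +
         score(sum(g[!left]), sum(h[!left])) -
         score(sum(g), sum(h))) - gamma
}

# Approximate search: evaluate only a few percentile candidates
# instead of every distinct value of the feature.
# Simplification: unweighted quantiles (an assumption of this sketch).
best_split_approx <- function(x, g, h, n_candidates = 10,
                              lambda = 1, gamma = 0) {
  probs <- seq(0, 1, length.out = n_candidates + 2)
  probs <- probs[-c(1, length(probs))]          # drop 0% and 100%
  candidates <- unique(quantile(x, probs = probs, names = FALSE))
  gains <- vapply(candidates,
                  function(s) split_gain(g, h, x < s, lambda, gamma),
                  numeric(1))
  list(threshold = candidates[which.max(gains)], gain = max(gains))
}

# Toy data: the label flips at x = 0.6.
set.seed(1)
x <- runif(200)
y <- as.numeric(x > 0.6)

# Gradients of the logistic loss at an initial prediction of 0.5.
p <- rep(0.5, length(y))
g <- p - y          # first-order gradient
h <- p * (1 - p)    # second-order gradient

res <- best_split_approx(x, g, h)
print(res)
```

With only ten candidates the chosen threshold cannot land exactly on 0.6, but it falls on the nearest percentile, which is the trade-off the approximate algorithm accepts in exchange for far fewer gain evaluations.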

2. Installing the xgboost Package and Preparing the Data

In R, the xgboost package implements the algorithm. First, install it:

# Installing the xgboost package requires R 3.3.0 or later; otherwise installation fails
> install.packages("xgboost")
# Alternatively, install the GitHub version with the devtools package
> devtools::install_github("dmlc/xgboost", subdir="R-package")
> library(xgboost)

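The package ships with a mushroom dataset. Our goal is to predict whether a mushroom is edible (a classification task). The dataset is already split into training and test data.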

> data(agaricus.train, package="xgboost")
> data(agaricus.test, package="xgboost")
> train <- agaricus.train
> test <- agaricus.test

# The whole dataset is a list made up of data and label
> class(train)
[1] "list"
# Check the dimensions of the data
> dim(train$data)
[1] 6513  126
> dim(test$data)
[1] 1611  126

# In this dataset, data is a sparse matrix of class dgCMatrix and label is a numeric vector of {0,1} values
> str(train)
List of 2
 $ data :Formal class "dgCMatrix" [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  .. ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  .. ..@ Dim     : int [1:2] 6513 126
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  .. ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..@ factors : list()
 $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...
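
The i, p and x slots in the str() output above are the compressed sparse column layout used by dgCMatrix. As a small illustration (the 3 x 3 matrix below is made up and is not part of the mushroom data), the following sketch builds a sparse matrix with the Matrix package and inspects the same slots:

```r
library(Matrix)

# A made-up 3 x 3 matrix with four non-zero entries:
#   row 1: 1 0 0
#   row 2: 0 0 1
#   row 3: 1 1 0
m <- sparseMatrix(i = c(1, 3, 3, 2),
                  j = c(1, 1, 2, 3),
                  x = c(1, 1, 1, 1),
                  dims = c(3, 3))

class(m)   # a dgCMatrix, like train$data
m@i        # zero-based row index of each non-zero value, column by column
m@p        # where each column starts inside i and x (length = ncol + 1)
m@x        # the non-zero values themselves
```

Reading the slots this way explains why p has 127 entries for 126 columns in the mushroom data: it holds one start offset per column plus a final end offset.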

3. Building the Model and Making Predictions

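The xgboost package provides two functions for building models, xgboost() and xgb.train(). The former covers the basic parameter settings, while the latter builds on it with more advanced features.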

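# data and label specify the features and the labels
# max.depth: depth of each tree; the default is 6, but this classification problem is simple, so 2 is enough
# nthread: number of CPU threads used for parallel computation, set to 2
# nrounds: number of trees to grow
# objective = "binary:logistic": logistic regression for binary classification
> xgboost_model <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
# Training error after each of the two iterations
[1] train-error:0.046522
[2] train-error:0.022263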

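The xgboost function accepts many parameters, which we will not cover in detail here. See the section "Using parameters in xgboost" of the blog post "[Translation] Quick Start: Using the XGBoost Algorithm in R", which groups the parameters into three classes (general, booster, and task parameters) and is a great help for understanding the algorithm and tuning it.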

# Set the verbose parameter to show the internal learning process
> xgboost_model <- xgboost(data = train$data, label = train$label,
+                          max.depth = 2, eta = 1, nthread = 2, nrounds = 2, verbose = 2,
+                          objective = "binary:logistic")
[13:56:36] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
[1] train-error:0.046522
[13:56:36] amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
[2] train-error:0.022263

# Use the fitted model to predict on new data
> xgboost_pred <- predict(xgboost_model, test$data)
> head(xgboost_pred)
[1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391
# The values above are predicted probabilities for each sample; thresholding them gives the predicted class
> prediction <- as.numeric(xgboost_pred > 0.5)
> head(prediction)
[1] 0 1 0 0 0 1
> model_accuracy <- table(prediction, test$label)
> model_accuracy

prediction   0   1
         0 813  13
         1  22 763
> model_accuracy_1 <- sum(diag(model_accuracy))/sum(model_accuracy)
> model_accuracy_1
[1] 0.9782744
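
Accuracy is only one summary of the confusion matrix above. As a small sketch, the same four counts can be re-entered by hand to derive precision and recall for the positive class (label 1). The matrix below is typed in from the output above rather than recomputed from the model, and the variable names are chosen for illustration:

```r
# Confusion matrix copied from the output above:
# rows are predictions, columns are true labels.
cm <- matrix(c(813, 22, 13, 763), nrow = 2,
             dimnames = list(prediction = c("0", "1"),
                             label = c("0", "1")))

tp <- cm["1", "1"]   # predicted 1, truly 1
fp <- cm["1", "0"]   # predicted 1, truly 0
fn <- cm["0", "1"]   # predicted 0, truly 1

accuracy  <- sum(diag(cm)) / sum(cm)   # matches model_accuracy_1 above
precision <- tp / (tp + fp)            # share of predicted 1s that are right
recall    <- tp / (tp + fn)            # share of true 1s that are found

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```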

4. Advanced Features of XGBoost

The xgb.train() function offers some advanced features that help us optimize the model further.

# Before using the function, convert the datasets to xgb.DMatrix format
> dtrain <- xgb.DMatrix(data = train$data, label = train$label)
> dtest <- xgb.DMatrix(data = test$data, label = test$label)

# With the watchlist parameter, the errors on the training and test data are reported together
> watchlist <- list(train = dtrain, test = dtest)
> xgboost_model <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2,
+                            nrounds = 3, objective = "binary:logistic", watchlist = watchlist)
[1] train-error:0.046522 test-error:0.042831
[2] train-error:0.022263 test-error:0.021726
[3] train-error:0.007063 test-error:0.006207

# Specify the evaluation metrics, so that two metrics can be tracked at the same time
# Values accepted by eval.metric include "logloss", "error", "rmse", and others
> xgboost_model <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2,
+                            nrounds = 3, watchlist = watchlist, eval.metric = "error",
+                            eval.metric = "logloss", objective = "binary:logistic")
[1] train-error:0.046522 train-logloss:0.233376 test-error:0.042831 test-logloss:0.226686
[2] train-error:0.022263 train-logloss:0.136658 test-error:0.021726 test-logloss:0.137874
[3] train-error:0.007063 train-logloss:0.082531 test-error:0.006207 test-logloss:0.080461
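
To clarify what the two metrics in the log above measure, here is a base-R sketch of their usual definitions: "error" is the share of samples misclassified at a 0.5 threshold, and "logloss" is the mean negative log-likelihood of the predicted probabilities. The helper names and the four toy predictions are invented for illustration and are not taken from the mushroom model.

```r
# Binary classification error rate at a 0.5 threshold.
error_rate <- function(y, p) mean(as.numeric(p > 0.5) != y)

# Mean negative log-likelihood of the predicted probabilities.
log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

# Toy labels and predicted probabilities (made up).
y <- c(1, 0, 1, 0)
p <- c(0.9, 0.2, 0.4, 0.1)

err <- error_rate(y, p)   # one of four samples is on the wrong side of 0.5
ll  <- log_loss(y, p)     # also penalizes the confident and the hesitant differently

c(error = err, logloss = ll)
```

The error rate only changes when a prediction crosses the threshold, while logloss keeps falling as probabilities move toward the correct label, which is why the two columns in the training log decrease at different rates.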

# Inspect feature importance, which helps with feature selection when optimizing the model
> importance_matrix <- xgb.importance(model = xgboost_model)
> importance_matrix
   Feature       Gain      Cover Frequency
1:      28 0.60036585 0.41841659     0.250
2:      55 0.15214681 0.16140352     0.125
3:      59 0.10936624 0.13772146     0.125
4:     101 0.04843973 0.07979724     0.125
5:     110 0.03391602 0.04120512     0.125
6:      66 0.02973248 0.03859211     0.125
7:     108 0.02603288 0.12286396     0.125
# Visualize the result with xgb.plot.importance()
> xgb.plot.importance(importance_matrix)

# Use xgb.dump() to inspect the tree structure of the model
> xgb.dump(xgboost_model, with_stats = TRUE)
 [1] "booster[0]"
 [2] "0:[f28<-9.53674e-007] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
 [3] "1:[f55<-9.53674e-007] yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
 [4] "3:leaf=0.513653,cover=812"
 [5] "4:leaf=-0.510132,cover=112.5"
 [6] "2:[f108<-9.53674e-007] yes=5,no=6,missing=5,gain=198.174,cover=703.75"
 [7] "5:leaf=-0.582213,cover=690.5"
 [8] "6:leaf=0.557895,cover=13.25"
---
# Draw the result above as a tree diagram
> xgb.plot.tree(model = xgboost_model)
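
To show how the dump is read, the sketch below writes out the first tree (booster[0]) by hand as nested conditions. It rests on two assumptions that the text above does not state: a feature that is absent from the sparse row is treated as missing and follows the "missing" branch, and the model's prediction is the sigmoid of the sum of leaf values over all trees (with the default base score contributing zero). The function name `first_tree_leaf` is made up for illustration.

```r
# booster[0] from the dump, transcribed by hand.
# A value goes down the "yes" branch when it is missing (NA here)
# or below the split threshold; in this tree "missing" always
# points to the same child as "yes".
first_tree_leaf <- function(f28, f55, f108) {
  goes_yes <- function(v) is.na(v) || v < -9.53674e-07
  if (goes_yes(f28)) {                      # node 0 -> node 1
    if (goes_yes(f55)) 0.513653             # node 3
    else -0.510132                          # node 4
  } else {                                  # node 0 -> node 2
    if (goes_yes(f108)) -0.582213           # node 5
    else 0.557895                           # node 6
  }
}

# Logistic link that maps a summed margin to a probability.
sigmoid <- function(z) 1 / (1 + exp(-z))

# A sample with none of the three features present lands in node 3.
leaf <- first_tree_leaf(f28 = NA, f55 = NA, f108 = NA)

# Contribution of this single tree only; the full model would add
# the leaf values of the remaining trees before applying sigmoid().
sigmoid(leaf)
```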

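This concludes our brief introduction to the XGBoost algorithm and its R implementation. Although we seem to have covered a lot, we have only scratched the surface. XGBoost's strong performance, its support for different tasks through extensive parameter tuning, and its high predictive accuracy in practice all suggest that the algorithm has great potential and is worth continued exploration.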

References:

Get Started with XGBoost
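嚴酷的魔王: xgboost: A Fast and Effective Boosting Model | Capital of Statistics (統計之都)
xgboost: A Fast and Effective Boosting Model
What Are the Differences Between GBDT and XGBOOST in Machine Learning Algorithms? - Zhihu
[Translation] Quick Start: Using the XGBoost Algorithm in R - FinanceR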