Machine Learning in Practice | Comparing GBDT, XGBoost, and LightGBM
Recognizing the MNIST dataset
Using Sklearn's GBDT
GradientBoostingClassifier
GradientBoostingRegressor
```python
import gzip
import pickle as pkl
from sklearn.model_selection import train_test_split

def load_data(path):
    f = gzip.open(path, 'rb')
    try:
        # Python 3
        train_set, valid_set, test_set = pkl.load(f, encoding='latin1')
    except:
        # Python 2
        train_set, valid_set, test_set = pkl.load(f)
    f.close()
    return (train_set, valid_set, test_set)

path = 'mnist.pkl.gz'
train_set, valid_set, test_set = load_data(path)

Xtrain, _, ytrain, _ = train_test_split(train_set[0], train_set[1], test_size=0.9)
Xtest, _, ytest, _ = train_test_split(test_set[0], test_set[1], test_size=0.9)
print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)
```
(5000, 784) (5000,) (1000, 784) (1000,)
Parameter descriptions:
- learning_rate: Controls the magnitude of the update contributed by each tree. (default=0.1)
- n_estimators: The number of sequential trees to be modeled. (default=100)
- max_depth: The maximum depth of a tree. (default=3)
- min_samples_split: The minimum number of samples (or observations) required in a node for it to be considered for splitting. (default=2)
- min_samples_leaf: The minimum number of samples (or observations) required in a terminal node or leaf. (default=1)
- min_weight_fraction_leaf: Similar to min_samples_leaf, but defined as a fraction of the total number of observations instead of an integer. (default=0.)
- subsample: The fraction of observations to be selected for each tree. Selection is done by random sampling. (default=1.0)
- max_features: The number of features to consider when searching for the best split. These are selected at random. (default=None)
- max_leaf_nodes: The maximum number of terminal nodes or leaves in a tree. (default=None)
- min_impurity_decrease: A node will be split if the split induces a decrease in impurity greater than or equal to this value. (default=0.)
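For orientation, these parameters map directly onto the GradientBoostingClassifier constructor. The sketch below simply spells out the defaults listed above (illustrative only; the actual training cell that follows overrides n_estimators):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative only: these mirror the default values described above,
# not tuned settings; the training code below uses a smaller n_estimators.
clf_defaults = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=100,
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    subsample=1.0,
    max_features=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
)
```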
```python
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import time

clf = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1, max_depth=3)

# start training
start_time = time.time()
clf.fit(Xtrain, ytrain)
end_time = time.time()
print("The training time = {}".format(end_time - start_time))

# prediction and evaluation
pred = clf.predict(Xtest)
accuracy = np.sum(pred == ytest) / pred.shape[0]
print("Test accuracy = {}".format(accuracy))
```
The training time = 11.989675521850586
Test accuracy = 0.825

Ensemble algorithms can also report feature importances. Put simply, this counts how the individual trees use each feature: the more often a feature is used, the more important the classifier considers it.
```python
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(clf.feature_importances_)
print(max(clf.feature_importances_), min(clf.feature_importances_))
```
0.0249318971528 0.0
In practice, we usually filter the features by importance as well.
```python
from collections import OrderedDict

d = {}
for i in range(len(clf.feature_importances_)):
    if clf.feature_importances_[i] > 0.01:
        d[i] = clf.feature_importances_[i]

sorted_feature_importances = OrderedDict(sorted(d.items(), key=lambda x: x[1], reverse=True))
D = sorted_feature_importances
rects = plt.bar(range(len(D)), list(D.values()), align='center')
plt.xticks(range(len(D)), list(D.keys()), rotation=90)
plt.show()
```
Because the features here are individual pixels, the chart is not very intuitive; with ordinary, named features the result is usually much easier to read.
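One way to make the pixel-level importances easier to interpret is to reshape the 784 values back into the 28×28 image grid and view them as a heat map. A minimal sketch, assuming the fitted clf from the cells above (this plot is not part of the original article):

```python
import matplotlib.pyplot as plt

# Reshape the 784 per-pixel importances back into the 28x28 MNIST grid
# so the pixels the trees rely on can be viewed as a heat map.
importance_image = clf.feature_importances_.reshape(28, 28)
plt.imshow(importance_image, cmap='hot')
plt.colorbar()
plt.title('GBDT feature importances on the 28x28 pixel grid')
plt.show()
```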
XGBoost
Compared with GBDT, XGBoost adds more pruning strategies and regularization terms to control the risk of overfitting. Traditional GBDT uses CART trees as base learners, while XGBoost supports more kinds of base learners, including linear models. GBDT uses only the first-order derivative of the loss, whereas XGBoost performs a second-order Taylor expansion of the loss function and also allows custom loss functions.
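To illustrate the second-order point: XGBoost's Python API accepts a custom objective that returns the gradient and Hessian of the loss with respect to the raw predictions, which is exactly what the second-order Taylor expansion needs. A minimal sketch for a squared-error objective (the name squared_error_obj is illustrative and not from the original article; the multi-class run below does not use it):

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: return the first and second derivatives of the
    loss with respect to the raw predictions, as XGBoost expects."""
    labels = dtrain.get_label()
    grad = preds - labels          # first-order term of the Taylor expansion
    hess = np.ones_like(preds)     # second-order term (constant for squared error)
    return grad, hess

# It would be passed via the `obj` argument of xgb.train, e.g.:
# bst = xgb.train(params, dtrain, num_boost_round=10, obj=squared_error_obj)
```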
```python
import xgboost as xgb
import numpy as np
import time

# read data into XGBoost DMatrix format
dtrain = xgb.DMatrix(Xtrain, label=ytrain)
dtest = xgb.DMatrix(Xtest, label=ytest)

# specify parameters via map
params = {
    'booster': 'gbtree',        # tree-based models
    'objective': 'multi:softmax',
    'num_class': 10,
    'eta': 0.1,                 # same as learning_rate
    'gamma': 0,                 # similar to min_impurity_decrease in GBDT
    'alpha': 0,                 # L1 regularization term on weights (analogous to Lasso regression)
    'lambda': 2,                # L2 regularization term on weights (analogous to Ridge regression)
    'max_depth': 3,             # same as max_depth of GBDT
    'subsample': 1,             # same as subsample of GBDT
    'colsample_bytree': 1,      # similar to max_features in GBM
    'min_child_weight': 1,      # minimum sum of instance weight (Hessian) needed in a child
    'nthread': 1,               # defaults to the maximum number of threads available if not set
}
num_round = 10

# start training
start_time = time.time()
bst = xgb.train(params, dtrain, num_round)
end_time = time.time()
print("The training time = {}".format(end_time - start_time))

# get prediction and evaluate
ypred = bst.predict(dtest)
accuracy = np.sum(ypred == ytest) / ypred.shape[0]
print("Test accuracy = {}".format(accuracy))
```
The training time = 13.496984481811523
Test accuracy = 0.821

XGBoost parameters
LightGBM
LightGBM is saved for last because it comes with a long list of advantages:
- Faster training speed
- Lower memory usage
- Better accuracy
- Support for parallel learning
- Able to handle large-scale data
It abandons the level-wise tree-growth strategy used by most current GBDT implementations in favor of a leaf-wise strategy with a depth limit. Level-wise growth splits all leaves of the same level in a single pass over the data, which makes multi-threaded optimization easy and keeps model complexity under control, so it is less prone to overfitting. In practice, however, level-wise growth is inefficient: it treats all leaves of the same level indiscriminately, incurring a lot of unnecessary overhead, because many of those leaves have low split gain and there is no need to search or split them.
Leaf-wise growth is a more efficient strategy: at each step it picks the leaf with the largest split gain among all current leaves, splits it, and repeats. With the same number of splits, leaf-wise growth therefore reduces the loss more than level-wise growth and reaches better accuracy. Its drawback is that it can grow very deep trees and overfit, so LightGBM adds a maximum-depth limit on top of the leaf-wise strategy to prevent overfitting while staying efficient.
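In the LightGBM API this trade-off shows up as the num_leaves and max_depth parameters, both of which appear in the training code below. The tuning guide linked at the end of this article recommends keeping num_leaves below 2^max_depth; a minimal sketch with illustrative values (bst_demo is a throwaway name, not part of the original code):

```python
import lightgbm as lgb

# Leaf-wise growth with a depth cap: num_leaves bounds how many leaves the
# tree may add greedily by split gain, while max_depth keeps it from growing
# too deep. Keeping num_leaves < 2 ** max_depth limits overfitting.
max_depth = 7
num_leaves = 80                     # below 2 ** 7 = 128
params = {
    'objective': 'multiclass',
    'num_class': 10,
    'max_depth': max_depth,
    'num_leaves': num_leaves,
}
bst_demo = lgb.train(params, lgb.Dataset(Xtrain, label=ytrain), num_boost_round=5)
```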
Installation guide
```python
import lightgbm as lgb

train_data = lgb.Dataset(Xtrain, label=ytrain)
test_data = lgb.Dataset(Xtest, label=ytest)

# specify parameters via map
params = {
    'num_leaves': 31,               # same as max_leaf_nodes in GBDT, but GBDT's default value is None
    'max_depth': -1,                # same as max_depth of xgboost
    'tree_learner': 'serial',
    'application': 'multiclass',    # same as objective of xgboost
    'num_class': 10,                # same as num_class of xgboost
    'learning_rate': 0.1,           # same as eta of xgboost
    'min_split_gain': 0,            # same as gamma of xgboost
    'lambda_l1': 0,                 # same as alpha of xgboost
    'lambda_l2': 0,                 # same as lambda of xgboost
    'min_data_in_leaf': 20,         # same as min_samples_leaf of GBDT
    'bagging_fraction': 1.0,        # same as subsample of xgboost
    'bagging_freq': 0,
    'bagging_seed': 0,
    'feature_fraction': 1.0,        # same as colsample_bytree of xgboost
    'feature_fraction_seed': 2,
    'min_sum_hessian_in_leaf': 1e-3,  # same as min_child_weight of xgboost
    'num_threads': 1,
}
num_round = 10

# start training
start_time = time.time()
bst = lgb.train(params, train_data, num_round)
end_time = time.time()
print("The training time = {}".format(end_time - start_time))

# get prediction and evaluate
ypred_onehot = bst.predict(Xtest)
ypred = []
for i in range(len(ypred_onehot)):
    ypred.append(ypred_onehot[i].argmax())
accuracy = np.sum(ypred == ytest) / len(ypred)
print("Test accuracy = {}".format(accuracy))
```
The training time = 4.891559839248657
Test accuracy = 0.902

Parameter explanations
Comparison of results
| Model    | time (s) | accuracy |
|----------|----------|----------|
| GBDT     | 11.98    | 0.825    |
| XGBoost  | 13.49    | 0.821    |
| LightGBM | 4.89     | 0.902    |
LightGBM parameter tuning guide: http://lightgbm.apachecn.org/cn/latest/Parameters-Tuning.html
—END—