Day 7: Study Notes on "An Introduction to Statistical Learning"

05-08

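Chapter 8: Tree-Based Methods

Tree-based methods partition the predictor space into a set of small regions. To predict the response for a new observation, we find the region its predictor values fall into and use the mean response of the training observations in that region (for classification, the most common class).

Trees are easy to interpret and can be drawn directly as a tree diagram. Their weakness is limited predictive accuracy, so methods such as bagging, random forests, and boosting are used to improve it.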

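8.1 Basics of Decision Trees

8.1.1 Regression Trees

8.1.2 Classification Trees

8.1.3 Trees Versus Linear Models

8.1.4 Advantages and Disadvantages of Trees

8.2 Bagging, Random Forests, Boosting

8.2.1 Bagging

8.2.2 Random Forests

8.2.3 Boosting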

8.1 Basics of Decision Trees

8.1.1 Regression Trees

--Growing the full tree:

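STEP 1: For simplicity and ease of interpretation, we divide the predictor space into rectangles (boxes in higher dimensions). Specifically, we first choose the predictor $X_j$ and the cutpoint $s$ that minimize:

$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$

where $R_1(j,s) = \{X \mid X_j < s\}$ and $R_2(j,s) = \{X \mid X_j \ge s\}$, and $\hat{y}_{R_1}$, $\hat{y}_{R_2}$ are the mean responses of the training observations in each region.

STEP 2: Within $R_1(j,s)$ and $R_2(j,s)$ separately, find the splitting predictor and cutpoint that minimize the RSS of that region.

STEP 3: Repeat the steps above until every region contains fewer than, say, 5 training observations. Every observation that falls into a given region receives the same prediction: the mean response of the training observations in that region.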
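The search in STEP 1 can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data are simulated, `best_split` is a hypothetical helper written for these notes (not a function from the tree package), and candidate cutpoints are taken as midpoints between consecutive observed values.

```r
# Toy data (made up): y jumps when x1 crosses 5; x2 is pure noise
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- ifelse(x1 < 5, 2, 8) + rnorm(n, sd = 0.5)
X  <- data.frame(x1 = x1, x2 = x2)

# RSS of a region when every point is predicted by the region mean
rss <- function(v) sum((v - mean(v))^2)

# best_split (hypothetical helper): try every predictor j and cutpoint s,
# and return the pair that minimizes RSS(R1) + RSS(R2)
best_split <- function(X, y) {
  best <- list(var = NA, cut = NA, rss = Inf)
  for (j in names(X)) {
    vals <- sort(unique(X[[j]]))
    if (length(vals) < 2) next
    # candidate cutpoints: midpoints between consecutive observed values
    cuts <- (head(vals, -1) + tail(vals, -1)) / 2
    for (s in cuts) {
      left  <- X[[j]] < s                       # R1 = {X | Xj < s}
      total <- rss(y[left]) + rss(y[!left])     # RSS(R1) + RSS(R2)
      if (total < best$rss) best <- list(var = j, cut = s, rss = total)
    }
  }
  best
}

split1 <- best_split(X, y)
split1   # chosen predictor, cutpoint, and resulting RSS
```

Applying `best_split` again to the observations on each side of the chosen cutpoint gives STEP 2, and repeating until a stopping rule is met gives STEP 3.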

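If the stopping rule lets the tree grow far enough, the tree becomes complex enough to predict the training data almost perfectly. That is overfitting, and such a tree usually performs poorly on test data. We therefore look for a subtree that minimizes the test error.

--Pruning down to a simpler tree

There are far too many subtrees to estimate the test error of each one by cross-validation, so we use pruning instead.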

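Pruning considers only one subtree for each value of $\alpha$. The subtree $T$ that corresponds to a given $\alpha$ is the one that minimizes the following criterion (similar in spirit to the lasso):

$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|$$

Here $|T|$ is the number of terminal nodes. The tuning parameter $\alpha$ controls the trade-off between fit to the training data and tree complexity. If $\alpha = 0$, only the training fit counts, so the full tree is chosen; a large $\alpha$ puts a heavy penalty on additional terminal nodes and yields a smaller tree.

This gives a list that pairs each $\alpha$ with a subtree, and each subtree has its own MSE. We use cross-validation to pick the $\alpha$ with the lowest estimated test error, which gives the best subtree.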
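To see how $\alpha$ selects a subtree, here is a small worked sketch in base R. The subtree sizes and RSS values are invented for illustration, and `pick_subtree` is a hypothetical helper; in practice cv.tree and prune.tree from the tree package do this work on a fitted tree.

```r
# Hypothetical nested subtrees: number of terminal nodes |T| and training RSS
subtrees <- data.frame(size = c(1, 2, 3, 5, 8),
                       rss  = c(100, 60, 45, 38, 35))

# For a given alpha, return the size of the subtree minimizing RSS + alpha * |T|
pick_subtree <- function(alpha) {
  cost <- subtrees$rss + alpha * subtrees$size
  subtrees$size[which.min(cost)]
}

alphas <- c(0, 2, 5, 20, 50)
data.frame(alpha = alphas, chosen_size = sapply(alphas, pick_subtree))
```

With these made-up numbers the chosen size shrinks from 8 to 5, 3, 2, and finally 1 terminal node as $\alpha$ increases, which is the trade-off the penalty term is meant to express.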

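8.1.2 Classification Trees

The classification error rate takes the place of the RSS used in regression. Two measures that are more sensitive than the error rate are used to judge which split is better:

The Gini index $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$ and the cross-entropy $D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$, where $\hat{p}_{mk}$ is the proportion of training observations in region $m$ that belong to class $k$. The closer each $\hat{p}_{mk}$ is to 0 or 1 (that is, the purer the node and the more reliable its prediction), the smaller G and D become.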
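A quick numeric check of how these measures behave, as a base R sketch. The three example nodes are made up, and the function names are chosen here for illustration; the formulas are the usual definitions, with p the vector of class proportions in one node.

```r
# p: vector of class proportions in one node (sums to 1)
class_error   <- function(p) 1 - max(p)          # classification error rate
gini          <- function(p) sum(p * (1 - p))    # Gini index G
cross_entropy <- function(p) {
  p <- p[p > 0]                                  # treat 0 * log(0) as 0
  -sum(p * log(p))                               # cross-entropy D
}

# Three hypothetical two-class nodes, from pure to evenly mixed
nodes <- list(pure = c(1, 0), mostly = c(0.9, 0.1), mixed = c(0.5, 0.5))
sapply(nodes, function(p) c(error   = class_error(p),
                            gini    = gini(p),
                            entropy = cross_entropy(p)))
```

All three measures are 0 for the pure node and largest for the evenly mixed node.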

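8.1.3 Trees Versus Linear Models

A linear model assumes that f has the form:

$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$

A tree model assumes that f has the form:

$$f(X) = \sum_{m=1}^{M} c_m \cdot 1_{(X \in R_m)}$$

When the relationship between the response and the predictors is highly complex and non-linear, trees do better; when it is close to linear, a linear model will usually do better.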

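8.1.4 Advantages and Disadvantages of Trees

Disadvantages: predictive accuracy is generally lower than that of other regression and classification approaches; trees are not robust and have high variance, so a small change in the data can change the entire tree.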

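8.2 Bagging, Random Forests, Boosting

8.2.1 Bagging

Use the bootstrap to resample the original training set repeatedly and obtain B data sets (a large B does not cause overfitting, so B is simply taken large enough), fit a decision tree to each data set (unpruned, hence low bias and high variance), and then average the predictions or, for classification, take a majority vote:

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$

Improvement: bagging keeps the low bias while reducing the variance considerably.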

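8.2.2 Random Forests

The difference from bagging: each time a split is considered, only a random sample of m of the p predictors (typically m = $\sqrt{p}$) is allowed as split candidates.

Improvement: if one predictor is very strong, then in bagging nearly every tree uses it for the first split, so the trees all look alike, their predictions are highly correlated, and averaging them does not reduce the variance much. By splitting on a random sample of m predictors, random forests reduce the correlation between the trees.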
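Why correlation limits the benefit of averaging can be made precise with a standard identity. As a simplifying assumption (not stated in the notes), treat the B tree predictions $Z_1, \dots, Z_B$ at a fixed point as identically distributed with variance $\sigma^2$ and common pairwise correlation $\rho$:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} Z_b\right)
  = \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows the second term vanishes, but the first term $\rho\sigma^2$ remains. Under this assumption, increasing B alone cannot push the variance below $\rho\sigma^2$, which is why lowering the correlation between trees matters.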

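8.2.3 Boosting

Unlike bagging and random forests, boosting does not use the bootstrap to create multiple data sets. Each tree is fit with the residuals left by the previous trees as the response; a shrunken version of the new tree is added to the existing model and the residuals are updated. The model learns step by step in this way for B trees.

The smaller the shrinkage parameter $\lambda$ (typically 0.01 or 0.001), the larger the B that is needed. Too large a B can lead to overfitting.

Trees with a single split often work well, because each new tree is grown taking into account the trees already in the model, so small trees are usually sufficient.

In general, statistical learning methods that learn slowly tend to perform better.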
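The update described above can be sketched by hand with one-split trees (stumps) in base R. This is a minimal illustration, not the gbm implementation: the data are simulated, `fit_stump` and `predict_stump` are hypothetical helpers, and the values of B and lambda are arbitrary choices.

```r
# Simulated one-predictor regression problem (made up)
set.seed(1)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.2)

# fit_stump (hypothetical helper): one-split regression tree on a single predictor
fit_stump <- function(x, r) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  rss  <- sapply(cuts, function(s) {
    left <- x < s
    sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
  })
  s <- cuts[which.min(rss)]
  list(cut = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
predict_stump <- function(stump, x) ifelse(x < stump$cut, stump$left, stump$right)

B      <- 500          # number of trees (arbitrary here)
lambda <- 0.1          # shrinkage (arbitrary here)
fhat   <- rep(0, n)    # current model, starts at 0
r      <- y            # current residuals, start at y
for (b in seq_len(B)) {
  stump <- fit_stump(x, r)                    # fit a tree to the residuals
  step  <- lambda * predict_stump(stump, x)
  fhat  <- fhat + step                        # add a shrunken copy of the new tree
  r     <- r - step                           # update the residuals
}

mse_boost    <- mean((y - fhat)^2)      # training MSE after boosting
mse_baseline <- mean((y - mean(y))^2)   # training MSE of a constant prediction
c(boosted = mse_boost, baseline = mse_baseline)
```

Both numbers are training errors, so they say nothing about overfitting; in practice B and lambda would be chosen by cross-validation or on a held-out set.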

R code:

# Classification tree
library(ISLR)
# Turn Sales into a binary factor so we can classify; Sales <= 8 is coded as "No"
# factor() is needed because tree() requires a factor response for classification
high=factor(ifelse(Carseats$Sales<=8,"No","Yes"))
carseats=data.frame(Carseats,high)
# Split into a training set and a test set
set.seed(2)
train=sample(1:nrow(carseats),nrow(carseats)/2)
test=(-train)
# tree() is called like lm(); exclude Sales from the predictors
library(tree)
tree.carseats=tree(high~.-Sales,data=carseats[train,])
plot(tree.carseats)
text(tree.carseats,pretty=0)
# Use the fitted model to predict on the test set
# predict() takes the new data as its second argument, not as data=; remember type="class"
pred=predict(tree.carseats,carseats[test,],type="class")
table(pred,carseats$high[test])

# The full tree may be too complex and overfit, so we prune it to a simpler one
# cv.tree() runs cross-validation to find the tree size with the lowest CV error,
# which serves as an estimate of the test error
# Remember to set the seed before cross-validation
set.seed(3)
cv=cv.tree(tree.carseats,FUN=prune.misclass);cv
par(mfrow=c(1,2))
plot(cv$size,cv$dev,type="b")
plot(cv$k,cv$dev,type="b")
# Once the best size is known (9 here), use prune.misclass() to obtain the pruned tree
prune=prune.misclass(tree.carseats,best=9)
plot(prune)
text(prune,pretty=0)
pred=predict(prune,carseats[test,],type="class")
table(pred,carseats$high[test])

# Regression tree
library(MASS)
set.seed(1)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=(-train)
tree.boston=tree(medv~.,data=Boston[train,])
par(mfrow=c(1,1))
plot(tree.boston)
text(tree.boston,pretty=0)
pred=predict(tree.boston,Boston[test,])
# table() cannot give an accuracy for a numeric response; use the test MSE instead
mean((pred-Boston$medv[test])^2)
# FUN=prune.misclass is not needed for a regression tree
set.seed(1)
cv=cv.tree(tree.boston);cv
par(mfrow=c(1,2))
plot(cv$size,cv$dev,type="b")
plot(cv$k,cv$dev,type="b")
# Prune to 5 terminal nodes and re-evaluate on the test set
prune=prune.tree(tree.boston,best=5)
plot(prune)
text(prune,pretty=0)
pred=predict(prune,Boston[test,])
plot(pred,Boston$medv[test])
mean((pred-Boston$medv[test])^2)

# Bagging and random forests
# Bagging is the special case of a random forest with m = p
library(randomForest)
set.seed(1)
train=sample(1:nrow(Boston),nrow(Boston)/2)
test=(-train)
# Remember to set the seed before calling randomForest()
set.seed(1)
# mtry=13 uses all 13 predictors of Boston at every split, i.e. bagging
bag.boston=randomForest(medv~.,data=Boston[train,],mtry=13,importance=TRUE);bag.boston
pred=predict(bag.boston,Boston[test,])
plot(pred,Boston$medv[test])
# Add the line y = x (intercept 0, slope 1)
abline(0,1)
mean((pred-Boston$medv[test])^2)
# By default randomForest() uses mtry = p/3 for regression and mtry = sqrt(p) for classification
set.seed(1)
randomforest.boston=randomForest(medv~.,data=Boston[train,],mtry=6,importance=TRUE);randomforest.boston
pred=predict(randomforest.boston,Boston[test,])
mean((pred-Boston$medv[test])^2)
# Variable importance as a table and as a plot
importance(randomforest.boston)
varImpPlot(randomforest.boston)

# Boosting
library(gbm)
# Do not forget to set the seed
# Use distribution="gaussian" for regression and "bernoulli" for binary classification
set.seed(1)
boosting.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boosting.boston)
par(mfrow=c(1,2))
# Partial dependence plots for rm and lstat
plot(boosting.boston,i="rm")
plot(boosting.boston,i="lstat")
# Remember to pass n.trees to predict() as well
pred=predict(boosting.boston,Boston[test,],n.trees=5000)
mean((pred-Boston$medv[test])^2)
# shrinkage is the lambda from the notes; its default was 0.001 in the gbm version used here
# and may differ in other versions, so set it explicitly; verbose=FALSE suppresses progress output
boosting.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=FALSE)
pred=predict(boosting.boston,Boston[test,],n.trees=5000)
mean((pred-Boston$medv[test])^2)