Day 8 - Study Notes on "An Introduction to Statistical Learning"

05-09

Chapter 9 -- Support Vector Machines (SVM)

maximal margin classifier (linear boundary, perfect separation) --- support vector classifier (linear boundary, partial separation) --- support vector machine (nonlinear boundary)

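9.1 Maximal margin classifier

9.1.1 What is a hyperplane

9.1.2 Perfect separation with a hyperplane

9.1.3 The maximal margin classifier / optimal separating hyperplane

9.1.4 Constructing the maximal margin classifier

9.1.5 The non-separable case

9.2 Support vector classifiers

9.2.1 Overview

9.2.2 Details

9.3 Support vector machines

9.3.1 Classification with nonlinear boundaries

9.3.2 SVM

9.3.3 Application example

9.4 SVMs with more than two response classes

9.4.1 One-versus-one classification

9.4.2 One-versus-all classification

9.5 Relationship between SVM and logistic regression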

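9.1 Maximal margin classifier

9.1.1 What is a hyperplane

In 2-dimensional space (a plane), a hyperplane is a flat 1-dimensional affine subspace (a line): the set of all points $X=(X_{1},X_{2})$ satisfying $\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}=0$. The hyperplane splits the space into two halves: the points where the left-hand side is $>0$ and the points where it is $<0$.

In 3-dimensional space, a hyperplane is a flat 2-dimensional affine subspace (a plane).

In p-dimensional space, a hyperplane is a flat affine subspace of dimension p-1: the set of all points $X=(X_{1},X_{2},\dots,X_{p})$ satisfying $\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}=0$. It again splits the space into two halves: $>0$ and $<0$.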

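9.1.2 Perfect separation with a hyperplane

Perfect separation means that for all $i=1,2,\dots,n$:

$\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip}>0$ if $y_{i}=1$, and $\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip}<0$ if $y_{i}=-1$

which is equivalent to:

$y_{i}(\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip})>0$

A separating hyperplane is usually not unique. [Figure: three different hyperplanes (lines), each of which separates the two classes perfectly.]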
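The separation condition can be checked directly in a few lines of R. This is a minimal sketch: the helper names `hyperplane_value` and `separates` and the toy data are made up for illustration and are not part of the book's lab.

```r
# Evaluate f(x) = beta0 + beta_1 x_1 + ... + beta_p x_p for each row of X.
hyperplane_value <- function(X, beta0, beta) {
  as.vector(beta0 + X %*% beta)
}

# TRUE when every observation satisfies y_i * f(x_i) > 0,
# i.e. the hyperplane separates the two classes perfectly.
separates <- function(X, y, beta0, beta) {
  all(y * hyperplane_value(X, beta0, beta) > 0)
}

# Toy data (made up): two observations per class, labels coded as +1 / -1.
X <- rbind(c(1, 1), c(2, 3), c(-1, -2), c(-3, -1))
y <- c(1, 1, -1, -1)

ok_line  <- separates(X, y, beta0 = 0, beta = c(1, 1))   # the line x1 + x2 = 0
bad_line <- separates(X, y, beta0 = 0, beta = c(1, -1))  # the line x1 - x2 = 0
print(c(ok_line, bad_line))
```

The first line separates the toy data and the second does not, because some observations fall on the wrong side of it (or exactly on it).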

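9.1.3 The maximal margin classifier / optimal separating hyperplane

Among the hyperplanes that separate the data perfectly, how do we choose the best one?

The optimal hyperplane is defined as the one with the largest margin.

Every separating hyperplane has its own margin.

The margin is a minimum distance: the smallest of the distances from the training observations to the hyperplane.

The further $\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip}$ is from 0, the more confident we are about the classification of observation i. The margin (the minimum distance) therefore measures how confident we are about the observation we are least sure of, so the optimal hyperplane is the separating hyperplane with the largest margin.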
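To make "every separating hyperplane has its own margin" concrete, the sketch below computes the margin of two candidate lines on the same toy data. The function name `margin_of` and the data are assumptions for illustration. Dividing by the length of the coefficient vector turns $y_{i}f(x_{i})$ into a perpendicular distance.

```r
# Margin of a separating hyperplane: the smallest perpendicular distance
# from any training observation to the hyperplane.
margin_of <- function(X, y, beta0, beta) {
  signed_dist <- y * as.vector(beta0 + X %*% beta) / sqrt(sum(beta^2))
  min(signed_dist)  # negative if some observation is misclassified
}

# Toy data (made up): labels coded as +1 / -1.
X <- rbind(c(1, 1), c(2, 3), c(-1, -2), c(-3, -1))
y <- c(1, 1, -1, -1)

m1 <- margin_of(X, y, beta0 = 0, beta = c(1, 1))  # line x1 + x2 = 0
m2 <- margin_of(X, y, beta0 = 0, beta = c(1, 3))  # line x1 + 3*x2 = 0
print(c(m1, m2))
```

Both lines separate the data, but the first has the larger margin, so of these two candidates it is the one the maximal margin criterion prefers.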

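9.1.4 Constructing the maximal margin classifier

The optimal hyperplane is the solution of the following optimization problem:

$\max_{\beta_{0},\beta_{1},\dots,\beta_{p},M} M$

subject to $\sum_{j=1}^{p}\beta_{j}^{2}=1$ (first constraint)

and $y_{i}(\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip})\geq M$ for all $i=1,\dots,n$ (second constraint)

Second constraint: perfect separation is equivalent to $y_{i}(\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip})>0$ for every i. The constraint is stricter: it requires this quantity to be at least the positive number M, which guarantees that every observation lies on the correct side of the hyperplane with some cushion.

First constraint: this is a normalization. For any $k\neq 0$, $k(\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip})=0$ describes the same hyperplane as $\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip}=0$, so the coefficients are only determined up to scale; fixing $\sum_{j}\beta_{j}^{2}=1$ removes that freedom. Under this constraint it can be shown that $y_{i}(\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip})$ is the perpendicular distance from observation i to the hyperplane. Combined with the second constraint, every observation is on the correct side and at least M away from the hyperplane, i.e. outside the slab around the hyperplane (whose thickness is twice the margin) rather than inside it. M is therefore the margin, and it is what we maximize.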
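The distance claim can be sketched in one short derivation. The shorthand $f(x)=\beta_{0}+\beta_{1}x_{1}+\dots+\beta_{p}x_{p}$ and $\beta=(\beta_{1},\dots,\beta_{p})$ is introduced here for brevity, and $d_{i}$ denotes the perpendicular distance from observation i to the hyperplane $f(x)=0$:

$$
\begin{aligned}
d_{i} &= \frac{|f(x_{i})|}{\lVert\beta\rVert} = \frac{|f(x_{i})|}{\sqrt{\sum_{j=1}^{p}\beta_{j}^{2}}} \\
      &= |f(x_{i})| \quad \text{when } \sum_{j=1}^{p}\beta_{j}^{2}=1 \\
      &= y_{i}\,f(x_{i}) \quad \text{when } y_{i}\in\{-1,1\} \text{ has the same sign as } f(x_{i})
\end{aligned}
$$

So requiring $y_{i}f(x_{i})\geq M$ under the normalization is the same as requiring every observation to be correctly classified and at distance at least M from the hyperplane.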

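9.1.5 The non-separable case

The maximal margin classifier has no solution when the classes cannot be separated perfectly. Even when they can, requiring every observation to be on the correct side and outside the margin can overfit the training set and perform poorly on the test set.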

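9.2 Support vector classifiers

9.2.1 Overview

The support vector classifier (SVC) allows some observations to fall inside the slab, or even on the wrong side of the hyperplane.

9.2.2 Details

The SVC is the solution of the following optimization problem:

$\max_{\beta_{0},\dots,\beta_{p},\epsilon_{1},\dots,\epsilon_{n},M} M$

subject to $\sum_{j=1}^{p}\beta_{j}^{2}=1$,

$y_{i}(\beta_{0}+\beta_{1}x_{i1}+\dots+\beta_{p}x_{ip})\geq M(1-\epsilon_{i})$,

$\epsilon_{i}\geq 0$, $\sum_{i=1}^{n}\epsilon_{i}\leq C$

The slack variables $\epsilon_{i}$ describe where each observation sits relative to the hyperplane and the slab. If $\epsilon_{i}=0$, the observation is on the correct side and outside the slab. If $0<\epsilon_{i}<1$, it is on the correct side but inside the slab. If $\epsilon_{i}>1$, it is on the wrong side of the hyperplane.

The tuning parameter C is the budget for violations. When C=0, all $\epsilon_{i}=0$ and the problem is the same as the maximal margin classifier. When C>0, every observation on the wrong side of the hyperplane has $\epsilon_{i}>1$ and the $\epsilon_{i}$ must sum to at most C, so no more than C observations can be on the wrong side. As C grows, more violations are tolerated, the margin widens, and there are more support vectors.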
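The three cases for the slack variables can be illustrated numerically. This is a sketch under stated assumptions: the hyperplane and the margin M are fixed by hand rather than optimized, the function name `slack` and the toy data are made up, and each slack is taken as the smallest value satisfying the constraint, $\epsilon_{i}=\max(0,\,1-y_{i}f(x_{i})/M)$ with normalized coefficients.

```r
# Smallest slack that satisfies y_i f(x_i) >= M (1 - eps_i), eps_i >= 0,
# for a given hyperplane and margin M (neither is optimized here).
slack <- function(X, y, beta0, beta, M) {
  signed_dist <- y * as.vector(beta0 + X %*% beta) / sqrt(sum(beta^2))
  pmax(0, 1 - signed_dist / M)
}

# Toy data (made up): all points on the x1 axis, hyperplane x1 = 0, M = 1.
X <- rbind(c(2, 0), c(0.5, 0), c(-0.5, 0), c(-2, 0), c(-1, 0))
y <- c(1, 1, 1, -1, -1)

eps <- slack(X, y, beta0 = 0, beta = c(1, 0), M = 1)

position <- ifelse(eps == 0, "outside the slab",
            ifelse(eps <= 1, "inside the slab", "wrong side"))

print(data.frame(x1 = X[, 1], y = y, eps = eps, position = position))
print(sum(eps))  # the budget C must be at least this large for this configuration
```

Only the third observation is on the wrong side of the hyperplane, and it alone uses more than one unit of the budget.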

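C effectively controls the bias-variance trade-off and is usually chosen by cross-validation.

It can be shown that the boundary is determined only by the observations that lie on the margin or violate it (on the edge of the slab, inside it, or on the wrong side of the hyperplane); these observations are called support vectors.

Like logistic regression, the SVC is barely affected by observations far from the boundary, whereas LDA depends on all observations.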

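9.3 Support vector machines

9.3.1 Classifiers with nonlinear boundaries

By analogy with regression, where functions of the predictors (such as polynomials) are used to fit a nonlinear relationship between the predictors and the response, the support vector classifier can be fitted on an enlarged set of features, for example $x_{1}$, $x_{2}$, $x_{1}^{2}$, $x_{2}^{2}$ instead of $x_{1}$, $x_{2}$.

In the enlarged feature space $x_{1}$, $x_{2}$, $x_{1}^{2}$, $x_{2}^{2}$ the boundary is still linear; in the original feature space $x_{1}$, $x_{2}$ it is nonlinear.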
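A small simulated example shows the idea. The data and coefficients below are made up: the classes are separated by the unit circle, which is nonlinear in $(x_{1},x_{2})$ but is the linear hyperplane $-1+x_{1}^{2}+x_{2}^{2}=0$ in the enlarged space.

```r
set.seed(1)

# Simulated data (assumption): class +1 outside the unit circle, -1 inside.
x <- matrix(runif(200 * 2, min = -2, max = 2), ncol = 2)
y <- ifelse(x[, 1]^2 + x[, 2]^2 > 1, 1, -1)

# Enlarged feature space: x1, x2, x1^2, x2^2.
z <- cbind(x1 = x[, 1], x2 = x[, 2], x1sq = x[, 1]^2, x2sq = x[, 2]^2)

# A hyperplane that is linear in z: -1 + 0*x1 + 0*x2 + 1*x1^2 + 1*x2^2 = 0.
beta0 <- -1
beta  <- c(0, 0, 1, 1)
f <- as.vector(beta0 + z %*% beta)

# Perfect separation in the enlarged space: y_i * f(z_i) > 0 for every i.
print(all(y * f > 0))
```

Plotting `x` coloured by `y` shows the circular boundary in the original two features.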

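9.3.2 SVM

The SVM uses kernels to enlarge the feature space and so produce nonlinear boundaries, in a way that is computationally efficient.

It can be shown that the linear support vector classifier can be written as:

$f(x)=\beta_{0}+\sum_{i=1}^{n}\alpha_{i}\langle x,x_{i}\rangle$

where the inner product is defined as:

$\langle x_{i},x_{i'}\rangle=\sum_{j=1}^{p}x_{ij}x_{i'j}$

That is, to classify a point x (with p features), compute its inner product with each of the n training observations and take the weighted sum of the n inner products. In fact, the weights $\alpha_{i}$ of all non-support vectors are 0, so it is enough to compute the inner products with the support vectors and take their weighted sum.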
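The inner-product form can be checked numerically. In this sketch the weights `alpha` and the intercept `beta0` are hypothetical values chosen by hand, not the output of a fitted model; the zeros in `alpha` play the role of non-support vectors.

```r
set.seed(1)

X <- matrix(rnorm(5 * 2), ncol = 2)  # 5 training observations, p = 2
alpha <- c(0.7, 0, -0.3, 0, -0.4)    # hypothetical weights (zeros: non-support vectors)
beta0 <- 0.5                         # hypothetical intercept
x_new <- c(1, -1)                    # point to classify

# Weighted sum of inner products with all n training observations.
f_all <- beta0 + sum(alpha * as.vector(X %*% x_new))

# Same sum restricted to the observations with non-zero weight.
sv <- which(alpha != 0)
f_sv <- beta0 + sum(alpha[sv] * as.vector(X[sv, , drop = FALSE] %*% x_new))

# Equivalent coefficient form: beta = sum_i alpha_i x_i, f(x) = beta0 + beta . x
beta <- as.vector(t(X) %*% alpha)
f_coef <- beta0 + sum(beta * x_new)

print(c(f_all, f_sv, f_coef))
```

All three numbers agree, which is why only the support vectors need to be stored to make predictions.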

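Replacing the inner product with a kernel function $K(x_{i},x_{i'})$ gives the support vector machine, which can produce nonlinear boundaries. The inner product itself is the linear kernel, which corresponds to the d=1 case of the polynomial kernel.

--- Polynomial kernel of degree d:

$K(x_{i},x_{i'})=(1+\sum_{j=1}^{p}x_{ij}x_{i'j})^{d}$

--- Gaussian (radial) kernel:

$K(x_{i},x_{i'})=\exp(-\gamma\sum_{j=1}^{p}(x_{ij}-x_{i'j})^{2})$

If a test point x is far from a training point $x_{i}$, the squared Euclidean distance $\sum_{j=1}^{p}(x_{j}-x_{ij})^{2}$ is large, so $K(x,x_{i})$ is small and $x_{i}$ has little influence on the predicted class of x.

The larger the key parameter $\gamma$, the more flexible the fit and the more closely it matches the training set; it may then overfit and perform worse on the test set.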
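The two kernels are easy to write out by hand, which makes the effect of distance and of $\gamma$ visible. The function names `poly_kernel` and `radial_kernel` and the points are made up for illustration; this is not how e1071 computes kernels internally.

```r
# Polynomial kernel of degree d and radial kernel, written out directly.
poly_kernel   <- function(a, b, d)     (1 + sum(a * b))^d
radial_kernel <- function(a, b, gamma) exp(-gamma * sum((a - b)^2))

x_train <- c(0, 0)    # one training point
near    <- c(0.5, 0)  # test point close to it
far     <- c(3, 0)    # test point far from it

k_near       <- radial_kernel(near, x_train, gamma = 1)   # exp(-0.25)
k_far        <- radial_kernel(far,  x_train, gamma = 1)   # exp(-9)
k_near_sharp <- radial_kernel(near, x_train, gamma = 10)  # exp(-2.5)

print(c(k_near, k_far, k_near_sharp))

# With d = 1 the polynomial kernel is the inner product plus a constant 1.
print(poly_kernel(c(1, 2), c(3, 4), d = 1))
```

The far point gets a much smaller kernel value than the near one, and a larger $\gamma$ shrinks the kernel value even for the near point, so each training point influences a smaller neighbourhood.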

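9.4 SVMs with more than two response classes

9.4.1 One-versus-one / all-pairs

If the response has K classes, fit one SVM for every pair of classes, K(K-1)/2 in total; with 3 classes this gives 3 pairwise classifiers. Run a test observation through each pairwise classifier and record which of the two classes it is assigned to. The final prediction is the class the observation was assigned to most often.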
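The voting scheme can be sketched in base R. To keep the example self-contained, each pairwise classifier here is a nearest-centroid rule standing in for a two-class SVM; that substitution, the function names, and the toy data are all assumptions made for illustration.

```r
# Stand-in pairwise classifier (assumption): nearest class centroid, NOT an SVM.
pairwise_predict <- function(X, y, class_a, class_b, x_new) {
  centre_a <- colMeans(X[y == class_a, , drop = FALSE])
  centre_b <- colMeans(X[y == class_b, , drop = FALSE])
  if (sum((x_new - centre_a)^2) <= sum((x_new - centre_b)^2)) class_a else class_b
}

# One-versus-one: one classifier per pair of classes, then a majority vote.
one_vs_one <- function(X, y, x_new) {
  classes <- sort(unique(y))
  pairs <- combn(classes, 2)  # one column per pair: K(K-1)/2 columns
  votes <- apply(pairs, 2, function(p) pairwise_predict(X, y, p[1], p[2], x_new))
  tally <- table(factor(votes, levels = classes))
  list(n_classifiers = ncol(pairs),
       tally = tally,
       prediction = classes[which.max(tally)])  # ties go to the first class
}

# Toy data (made up): three classes, two observations each.
X <- rbind(c(0, 0), c(0, 1), c(4, 0), c(4, 1), c(2, 4), c(2, 5))
y <- c(1, 1, 2, 2, 3, 3)

res <- one_vs_one(X, y, x_new = c(3.5, 0.5))
print(res$n_classifiers)
print(res$tally)
print(res$prediction)
```

With 3 classes there are 3 pairwise comparisons; the test point wins two of them for class 2, so class 2 is the final prediction.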

9.4.2 One-versus-all

9.5 Relationship between SVM and logistic regression

R code:

# support vector classifier: linear boundary

set.seed(1)

x=matrix(rnorm(20*2),ncol=2)

y=c(rep(-1,10),rep(1,10))

x[y==1,]=x[y==1,]+1

# remember to convert the qualitative response into a factor

# build the data frame with x= so that the training and test sets get the same column names

traindata=data.frame(x=x,y=as.factor(y))

plot(x,col=(3-y))

library(e1071)

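# choose the best cost by cross-validation (in e1071 a larger cost penalizes margin violations more, so the margin gets narrower)

set.seed(1)

cv=tune(svm,y~.,data=traindata,kernel="linear",ranges = list(cost=c(0.001,0.01,0.1,1,5,10,100)))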

summary(cv)

# tune() returns not only the best cost but also the best fitted model

best=cv$best.model

summary(best)

# the first argument to plot is the fitted model (the boundary), the second is the data

plot(best,traindata)

x=matrix(rnorm(20*2),ncol=2)

y=sample(c(-1,1),20,rep=TRUE)

x[y==1,]=x[y==1,]+1

# the column names in testdata should be the same as those in traindata

testdata=data.frame(x=x,y=as.factor(y))

pred=predict(best,testdata)

table(pred,testdata$y)

# try a different cost

trainsvm=svm(y~.,traindata,kernel="linear",cost=0.01,scale=FALSE)

summary(trainsvm)

plot(trainsvm,traindata)

pred=predict(trainsvm,testdata)

table(pred,testdata$y)

# support vector machine: nonlinear boundary

library(e1071)

set.seed(1)

x=matrix(rnorm(200*2),ncol=2)

x[1:100,]=x[1:100,]+2

x[101:150,]=x[101:150,]-2

y=c(rep(1,150),rep(2,50))

dat=data.frame(x=x,y=as.factor(y))

plot(x,col=y)

train=sample(200,100)

test=(-train)

svmfit=svm(y~.,data=dat[train,],kernel="radial",gamma=1,cost=1)

summary(svmfit)

plot(svmfit,dat[train,])

# choose the best cost and gamma by cross-validation

set.seed(1)

cv=tune(svm,y~.,data=dat[train,],kernel="radial",
        ranges=list(cost=c(0.1,1,10,100,1000),gamma=c(0.5,1,2,3,4)))

summary(cv)

best=cv$best.model

summary(best)

plot(best,dat[train,])

pred=predict(best,dat[test,])

table(pred,dat$y[test])

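# more than two classes: the call is the same as in the two-class case (svm() uses one-versus-one)

# new names (x3, y3, dat3) keep the two-class dat, train and test intact for the ROC curves below

set.seed(1)

x3=rbind(x,matrix(rnorm(50*2),ncol=2))

y3=c(y,rep(0,50))

x3[y3==0,]=x3[y3==0,]+2

dat3=data.frame(x=x3,y=as.factor(y3))

plot(x3,col=(5-y3))

svmfit=svm(y~.,dat3,kernel="radial",cost=10,gamma=2)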

# plot the ROC curve

library(gplots)

library(ROCR)

# pred: decision values, truth: true class labels; ... is passed on to plot (main, add, col)

rocplot=function(pred,truth,...){

predob=prediction(pred,truth)

perf=performance(predob,"tpr","fpr")

plot(perf,...)}

par(mfrow=c(1,2))

svmfit.opt=svm(y~.,data=dat[train,],kernel="radial",cost=1,gamma=2,decision.values=T)

# the sign of the decision values depends on the class order; if a curve falls below the diagonal, pass -fitted instead

fitted=attributes(predict(svmfit.opt,dat[train,],decision.values=TRUE))$decision.values

rocplot(fitted,dat$y[train],main="Training Data")

svmfit.flex=svm(y~.,data=dat[train,],kernel="radial",cost=1,gamma=50,decision.values=T)

fitted=attributes(predict(svmfit.flex,dat[train,],decision.values=T))$decision.values

rocplot(fitted,dat$y[train],add=T,col="red")

fitted=attributes(predict(svmfit.opt,dat[test,],decision.values=TRUE))$decision.values

rocplot(fitted,dat$y[test],main="Test Data")

fitted=attributes(predict(svmfit.flex,dat[test,],decision.values=T))$decision.values

rocplot(fitted,dat$y[test],add=T,col="red")

#example

library(ISLR)

names(Khan)

dat.train=data.frame(x=Khan$xtrain,y=as.factor(Khan$ytrain))

library(e1071)

# there are far more predictors than observations, so a linear kernel is already flexible enough

svmtrain=svm(y~.,dat.train,kernel="linear",cost=10)

summary(svmtrain)

table(svmtrain$fitted,dat.train$y)

dat.test=data.frame(x=Khan$xtest,y=as.factor(Khan$ytest))

pred=predict(svmtrain,dat.test)

table(pred,dat.test$y)