Classification with LibSVM

05-06

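The previous five posts covered the SVM problem definition, the dual optimization problem, the derivation of the SMO algorithm, and related topics. In this post we use LibSVM to solve a practical problem.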

1. Data Preprocessing

1.1 Categorical Feature

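SVM requires each sample to be represented as a vector of real numbers, so categorical attributes must first be converted to numeric data. We recommend representing an m-category attribute with m numbers, exactly one of which is 1 while the remaining m-1 are 0.

For example, if the color of a fruit is one of {red, green, blue}, the categorical encoding gives (0,0,1), (0,1,0) and (1,0,0). In our experience, as long as the number of attribute values is not too large, this encoding is more stable than using a single number.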
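As an illustration of the encoding described above, here is a minimal Python sketch. The function name `one_hot` and the category order are assumptions made for this example; any order works as long as it stays the same for training and test data.

```python
def one_hot(value, categories):
    """Encode `value` as len(categories) numbers: a single 1, the rest 0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Assumed category order for this example; it only has to stay fixed.
colors = ["blue", "green", "red"]
encoded = {color: one_hot(color, colors) for color in colors}
# With this order: red -> [0, 0, 1], green -> [0, 1, 0], blue -> [1, 0, 0]
```

The encoded values would then be written as ordinary numeric attributes of the sample.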

1.2 Scaling

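Scale the data before applying SVM. This is done to:

avoid attributes with large numeric ranges dominating those with small ranges
avoid numerical difficulties during computation, because kernel values usually depend on the inner products of feature vectors

We recommend linearly scaling each attribute to [-1, +1] or [0, 1]. The same scaling must of course be applied to both the training set and the test set. For example, if an attribute that ranges over [-10, +10] in the training set is scaled to [-1, +1], and the same attribute ranges over [-11, +8] in the test set, then the test values must be scaled to [-1.1, +0.8].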
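The rule that the test set reuses the training-set scaling can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the svm-scale implementation; `fit_range` and `scale` are hypothetical helper names, and the handling of a constant attribute is an arbitrary choice of this sketch.

```python
def fit_range(train_values):
    """Return (min, max) of one attribute, computed on the training set only."""
    return min(train_values), max(train_values)


def scale(value, lo, hi, lower=-1.0, upper=1.0):
    """Linearly map [lo, hi] onto [lower, upper]; values outside are extrapolated."""
    if hi == lo:
        # Constant attribute: this sketch simply returns the lower bound.
        return lower
    return lower + (upper - lower) * (value - lo) / (hi - lo)


train = [-10.0, -2.5, 0.0, 10.0]
test = [-11.0, 8.0]

lo, hi = fit_range(train)                       # parameters come from training data
scaled_train = [scale(v, lo, hi) for v in train]
scaled_test = [scale(v, lo, hi) for v in test]  # close to [-1.1, 0.8]
```

Because the test values are mapped with the training range, they may fall slightly outside [-1, +1], which is the intended behavior.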

2. Model Selection

2.1 Model Selection

We previously mentioned four common kernels:

linear: $K(x,z)=x^{T}z$
polynomial: $K(x,z)=(\gamma x^{T}z+r)^{d},\ \gamma>0$
radial basis function (RBF): $K(x,z)=e^{-\gamma \|x-z\|^{2}},\ \gamma>0$
sigmoid: $K(x,z)=\tanh(\gamma x^{T}z+r)$

where $\gamma$, $r$ and $d$ are kernel parameters.
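To make the four formulas concrete, the following is a small pure-Python sketch that evaluates each kernel on two vectors given as lists of floats. The function names are chosen for this example, and the offset in the polynomial and sigmoid kernels is written as `r`, matching the parameter list $\gamma, r, d$ above.

```python
import math


def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, gamma, r, d):
    return (gamma * dot(x, z) + r) ** d


def rbf_kernel(x, z, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma, r):
    return math.tanh(gamma * dot(x, z) + r)


x = [1.0, 2.0]
z = [3.0, -1.0]
values = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z, gamma=0.5, r=1.0, d=3),
    "rbf": rbf_kernel(x, z, gamma=0.25),
    "sigmoid": sigmoid_kernel(x, z, gamma=0.5, r=1.0),
}
```

Note that the RBF kernel of a vector with itself is always 1, since the squared distance is 0.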

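We must decide which kernel to try first, and then choose the penalty parameter C and the kernel parameters.

In general, the RBF kernel is a reasonable first choice, because:

The RBF kernel maps samples nonlinearly into a higher-dimensional space, so it can handle the case where the relation between class labels and attributes is nonlinear. The linear kernel is in fact a special case of RBF: for some parameters $(C,\gamma)$ the RBF kernel achieves the same performance as the linear kernel. Likewise, the sigmoid kernel behaves like the RBF kernel for certain parameter combinations.
The number of hyperparameters affects the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel.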

2.2 Cross-Validation and Grid-search

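In 2.1 we chose the RBF kernel, which has two parameters, $C$ and $\gamma$. For a given problem we do not know in advance which values are best, so a parameter search is needed. The goal is to find a good pair $(C,\gamma)$ such that the classifier predicts unseen data accurately. Achieving high accuracy on the training data is not necessarily useful. A good approach is cross-validation, which comes in several forms:

hold-out cross-validation
k-fold cross-validation
leave-one-out cross-validation

We will cover these in a separate post on model selection and feature selection, so we do not go into detail here. In this post we use k-fold cross-validation: the training set is split into k parts, and each part in turn serves as the test set while the other k-1 parts are used for training. The cross-validation accuracy is the average of the results over all k held-out parts.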
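The k-fold procedure described above can be sketched as follows. This is a simplified illustration, not what svm-train does internally: `train_fn` and `predict_fn` are hypothetical callables standing in for any classifier, and the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_accuracy(samples, labels, k, train_fn, predict_fn):
    """Average, over the k folds, of the accuracy on the held-out fold."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

With LibSVM itself none of this has to be written by hand: the `-v n` option of svm-train, listed in section 3.2, runs n-fold cross-validation directly.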

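Cross-validation can partly address the overfitting problem. In the figure, the left panel shows a classifier that overfits the training data: it separates the training data perfectly, but as the right panel shows, it performs poorly on the test data.

Panel (c) of the figure shows a classifier that does not overfit, and panel (d) shows that it achieves better test accuracy.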

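We recommend a "grid-search" over $C$ and $\gamma$ using cross-validation: try different pairs $(C,\gamma)$ and pick the pair with the best cross-validation accuracy. We found that exponentially growing sequences are a practical choice, for example $C=2^{-5},2^{-3},\ldots,2^{15},2^{17}$ and $\gamma=2^{-15},2^{-13},\ldots,2^{3}$.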
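The exponential grid above is easy to generate programmatically. The sketch below builds the list of $(C,\gamma)$ pairs from the exponent ranges quoted in the text and picks the best pair according to a scoring function. `cv_accuracy` is a hypothetical callable that would return the cross-validation accuracy for a given pair; it is not part of LibSVM.

```python
from itertools import product


def exponent_range(start, stop, step):
    """Exponents start, start+step, ..., up to and including stop (step > 0)."""
    return list(range(start, stop + 1, step))


log2_c = exponent_range(-5, 17, 2)       # C = 2^-5, 2^-3, ..., 2^17
log2_gamma = exponent_range(-15, 3, 2)   # gamma = 2^-15, 2^-13, ..., 2^3
grid = [(2.0 ** c, 2.0 ** g) for c, g in product(log2_c, log2_gamma)]


def grid_search(grid, cv_accuracy):
    """Return the (C, gamma) pair with the highest cross-validation accuracy."""
    return max(grid, key=lambda pair: cv_accuracy(*pair))
```

Each pair is evaluated independently of the others, so in a real run the calls to `cv_accuracy` could be distributed across several processes or machines.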

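Grid-search is straightforward but naive. There are several more advanced methods that reduce the computation, for example approximating the cross-validation rate. We nevertheless prefer the simple approach for two reasons:

Psychologically, we may not feel safe using methods that avoid an exhaustive parameter search through approximations or heuristics.
The computation time is not much higher than that of the other methods because there are only two parameters. Moreover, grid-search is easy to parallelize because each pair $(C,\gamma)$ is evaluated independently, whereas many other methods are iterative processes, such as walking along a path, which are hard to parallelize.

Since a complete grid-search can still be time-consuming, we recommend first using a coarse grid to identify a promising region and then running a finer search within that region, as shown in the figure below.

The coarse search finds the best pair $(C,\gamma)=(2^{3},2^{-5})$ with a cross-validation rate of 77.5%. A finer search around $(C,\gamma)=(2^{3},2^{-5})$ then finds a best cross-validation rate of 77.6% at $(C,\gamma)=(2^{3.25},2^{-5.25})$.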
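The refinement step can be sketched by generating a finer grid of exponents centered on the best coarse result. The half-width of 2 and the step of 0.25 are assumptions made for this illustration; the text only tells us that the fine search landed on exponents 3.25 and -5.25.

```python
def fine_grid(best_log2_c, best_log2_gamma, half_width=2.0, step=0.25):
    """Exponent pairs on a fine grid centered on the best coarse-grid point."""
    n = int(round(half_width / step))
    offsets = [i * step for i in range(-n, n + 1)]
    return [(best_log2_c + dc, best_log2_gamma + dg)
            for dc in offsets for dg in offsets]


# Best coarse-grid point from the text: (C, gamma) = (2^3, 2^-5).
candidates = fine_grid(3, -5)
fine_pairs = [(2.0 ** log2_c, 2.0 ** log2_g) for log2_c, log2_g in candidates]
```

Each candidate pair would then be scored by cross-validation exactly as in the coarse search.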

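The approach above works well for problems with a few thousand data points. For very large data sets, a better approach is to randomly choose a subset, run grid-search on that subset, and then run a better-region-only grid-search on the whole data set.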

3. Implementation

3.1 Data Format

LibSVM requires the training and test data to be in the following format:

label1 index1:value1 index2:value2 ...
label2 index1:value1 index2:value2 ...

The data set used in this post is svmguide1 from LIBSVM Data. The data needs to be scaled. The first column is the label, in $\{0,1\}$, and the remaining four columns are the four attributes:

1 1:5.490192e+01 2:2.300120e+02 3:1.124727e-01 4:1.082362e+02
1 1:2.431879e+01 2:7.146220e+01 3:-3.444115e-01 4:1.214914e+02
1 1:2.481550e+01 2:8.496351e+01 3:2.453685e-01 4:1.399707e+02
1 1:4.530499e+01 2:2.619430e+02 3:-2.311574e-01 4:1.553381e+02
1 1:6.451801e+01 2:1.884440e+02 3:7.265563e-02 4:1.333321e+02
1 1:8.675299e+01 2:3.088610e+02 3:-9.522417e-02 4:1.430497e+02
1 1:5.171198e+01 2:2.807610e+02 3:-1.852275e-01 4:1.526079e+02
0 1:1.785300e+01 2:1.493100e+01 3:1.706039e-01 4:6.352117e+01
0 1:1.681499e+01 2:2.620200e+01 3:1.487285e-01 4:4.935408e+01
0 1:1.794760e+01 2:3.439160e+01 3:6.074293e-01 4:1.535747e+02
0 1:1.643700e+01 2:2.080002e-01 3:4.028665e-01 4:3.551385e+01
0 1:1.635220e+01 2:3.144360e+01 3:-2.683164e-01 4:3.086381e+01
0 1:1.820799e+01 2:4.019299e+01 3:5.212566e-01 4:1.246281e+02
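For readers who want to inspect such files from their own scripts, here is a minimal Python sketch that parses one line of the `label index:value ...` format. The function name `parse_libsvm_line` is made up for this example; the LibSVM distribution ships its own Python interface, which is not used here.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value index:value ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# First row of the svmguide1 excerpt shown above.
row = "1 1:5.490192e+01 2:2.300120e+02 3:1.124727e-01 4:1.082362e+02"
label, features = parse_libsvm_line(row)
```

Storing the attributes in a dictionary keyed by index keeps the sketch valid for sparse data, where attributes with value 0 may be omitted from a line.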

3.2 Overview of the LibSVM Programs

Three programs are mainly used:

svm-train
svm-predict
svm-scale

svm-train usage

Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
    0 -- C-SVC (multi-class classification)
    1 -- nu-SVC (multi-class classification)
    2 -- one-class SVM
    3 -- epsilon-SVR (regression)
    4 -- nu-SVR (regression)
-t kernel_type : set type of kernel function (default 2)
    0 -- linear: u*v
    1 -- polynomial: (gamma*u*v + coef0)^degree
    2 -- radial basis function: exp(-gamma*|u-v|^2)
    3 -- sigmoid: tanh(gamma*u*v + coef0)
    4 -- precomputed kernel (kernel values in training_set_file)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-q : quiet mode (no outputs)

svm-predict usage

Usage: svm-predict [options] test_file model_file output_file
options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported
-q : quiet mode (no outputs)

svm-scale usage

Usage: svm-scale [options] data_filename
options:
-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename

3.3 Training and Prediction without Scaling, Default Parameters

./svm-train svmguide1.txt
...*..*
optimization finished, #iter = 5371
nu = 0.606150
obj = -1061.528918, rho = -0.495266
nSV = 3053, nBSV = 722
Total nSV = 3053
./svm-predict svmguide1.t svmguide1.txt.model svmguide1.t.predict
Accuracy = 66.925% (2677/4000) (classification)

With default parameters and no scaling, the accuracy is 66.925%.

3.4 Training and Prediction with Scaling, Default Parameters

First scale the training set, writing the scaled data to svmguide1.scale and the scaling parameters to the file range1:

./svm-scale -l -1 -u 1 -s range1 svmguide1.txt > svmguide1.scale

The scaled data looks like this:

1 1:-0.630352 2:-0.198921 3:0.177151 4:0.164773
1 1:-0.836265 2:-0.74039 3:-0.444672 4:0.319044
1 1:-0.83292 2:-0.694281 3:0.358023 4:0.534116
1 1:-0.694967 2:-0.0898725 3:-0.290532 4:0.712971
1 1:-0.565608 2:-0.340882 3:0.12296 4:0.456853
1 1:-0.415903 2:0.0703588 3:-0.105526 4:0.569952
1 1:-0.65183 2:-0.0256065 3:-0.228021 4:0.681195
0 1:-0.879798 2:-0.933452 3:0.256268 4:-0.355646
0 1:-0.886787 2:-0.89496 3:0.226495 4:-0.520531
0 1:-0.879161 2:-0.866991 3:0.850791 4:0.692447
0 1:-0.889332 2:-0.983733 3:0.572379 4:-0.681611

Then scale the test set with the same scaling parameters:

./svm-scale -r range1 svmguide1.t > svmguide1.t.scale

The scaled data looks like this:

0 1:-0.912194 2:-0.924597 3:0.531213 4:0.035627
0 1:-0.930858 2:-0.944502 3:0.41847 4:0.280396
0 1:-0.869272 2:-0.89418 3:0.425484 4:0.783106
0 1:-0.863384 2:-0.811593 3:-0.388981 4:0.293535
0 1:-0.88641 2:-0.925034 3:0.395975 4:0.160618
0 1:-0.900741 2:-0.877244 3:0.828971 4:0.688611
0 1:-0.886672 2:-0.890022 3:0.591167 4:-0.559946
0 1:-0.888692 2:-0.815783 3:-0.304775 4:-0.669528
1 1:-0.475186 2:-0.323007 3:0.239772 4:0.328027
1 1:-0.66172 2:0.081458 3:-0.202763 4:0.999777
1 1:-0.765757 2:-0.556267 3:0.221668 4:0.141725
1 1:-0.711618 2:-0.242799 3:-0.169259 4:0.524334

Train on the scaled training set:

./svm-train svmguide1.scale
*
optimization finished, #iter = 496
nu = 0.202599
obj = -507.307046, rho = 2.627039
nSV = 630, nBSV = 621
Total nSV = 630

The resulting model is saved to the default file svmguide1.scale.model:

svm_type c_svc
kernel_type rbf
gamma 0.25
nr_class 2
total_sv 630
rho 2.62704
label 1 0
nr_sv 316 314
SV
1 1:-0.823781 2:-0.783405 3:-0.233795 4:0.361305
1 1:-0.740805 2:-0.842831 3:-0.232668 4:0.347846
1 1:-0.838324 2:-0.724825 3:0.412097 4:0.0563095
1 1:-0.849309 2:-0.70294 3:-0.359031 4:-0.0476145
1 1:-1 2:-0.688314 3:0.595954 4:0.416735
...
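The header of the model file is plain text, so it can be inspected with a few lines of Python. The sketch below assumes the layout shown in this section, with one `key value ...` field per line followed by a line containing only `SV`. The name `parse_model_header` is hypothetical and not part of LibSVM.

```python
def parse_model_header(lines):
    """Collect the 'key value ...' fields that precede the 'SV' marker line."""
    header = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "SV":
            break  # support vectors follow; they are not part of the header
        header[tokens[0]] = tokens[1:]
    return header


model_lines = [
    "svm_type c_svc",
    "kernel_type rbf",
    "gamma 0.25",
    "nr_class 2",
    "total_sv 630",
    "rho 2.62704",
    "label 1 0",
    "nr_sv 316 314",
    "SV",
    "1 1:-0.823781 2:-0.783405 3:-0.233795 4:0.361305",
]
header = parse_model_header(model_lines)
support_vectors_per_class = [int(n) for n in header["nr_sv"]]
```

A quick sanity check on the header is that the per-class counts in `nr_sv` add up to `total_sv`, which here is 316 + 314 = 630.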

Predict on the scaled test data:

./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
Accuracy = 96.15% (3846/4000) (classification)

The accuracy is 96.15%.

The predictions are written to the file svmguide1.t.predict, which looks like this:

0
0
0
0
0
0
0
0
0
0
0
0
0
...
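The accuracy that svm-predict prints can be recomputed from the prediction file and the labels in the test file. The following sketch assumes the default mode without `-b 1`, in which the prediction file contains one label per line; `read_labels` and `accuracy` are helper names invented for this example.

```python
def read_labels(lines):
    """Return the first token of each non-empty line as a float label."""
    return [float(line.split()[0]) for line in lines if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Return (number correct, total, accuracy in percent)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    total = len(true_labels)
    return correct, total, 100.0 * correct / total


# Toy data; real input would be the lines of svmguide1.t.scale
# and svmguide1.t.predict read from disk.
test_lines = ["0 1:-0.912194 2:-0.924597", "1 1:-0.475186 2:-0.323007", "1 1:-0.66172 2:0.081458"]
predict_lines = ["0", "1", "0"]
result = accuracy(read_labels(test_lines), read_labels(predict_lines))
```

`read_labels` works for both files because the label is the first token on each line in either format.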

3.5 Training and Prediction with Scaling and Parameter Selection

First use grid-search to select the best parameters $C$ and $\gamma$:

python grid.py ../svmguide1.scale
[local] 5 -7 95.5973 (best c=32.0, g=0.0078125, rate=95.5973)
[local] -1 -7 85.2056 (best c=32.0, g=0.0078125, rate=95.5973)
[local] 5 -1 96.8598 (best c=32.0, g=0.5, rate=96.8598)
......
[local] 13 3 94.9822 (best c=8.0, g=2.0, rate=96.9246)
[local] 13 -9 96.1476 (best c=8.0, g=2.0, rate=96.9246)
[local] 13 -3 96.7627 (best c=8.0, g=2.0, rate=96.9246)
8.0 2.0 96.9246

The selected parameters are therefore $(C,\gamma)=(8.0,2.0)$, with a cross-validation rate of 96.9246%.
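In the grid.py log above, the two numbers after `[local]` appear to be the base-2 exponents of $C$ and $\gamma$, since 2^5 = 32.0 and 2^-7 = 0.0078125 match the reported best values on the first line. The sketch below parses one such progress line under that reading. The regular expression is an assumption derived only from the lines shown here, not from the grid.py source.

```python
import re

NUMBER = r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
PROGRESS_LINE = re.compile(
    r"^\[\w+\]\s+" + NUMBER + r"\s+" + NUMBER + r"\s+" + NUMBER
    + r"\s+\(best c=" + NUMBER + r", g=" + NUMBER + r", rate=" + NUMBER + r"\)$"
)


def parse_grid_line(line):
    """Parse one grid.py progress line; return None if it does not match."""
    match = PROGRESS_LINE.match(line.strip())
    if match is None:
        return None
    log2_c, log2_g, rate, best_c, best_g, best_rate = map(float, match.groups())
    return {
        "c": 2.0 ** log2_c,      # assumed: first number is log2(C)
        "gamma": 2.0 ** log2_g,  # assumed: second number is log2(gamma)
        "rate": rate,
        "best": (best_c, best_g, best_rate),
    }


first = parse_grid_line("[local] 5 -7 95.5973 (best c=32.0, g=0.0078125, rate=95.5973)")
```

Lines that do not follow this pattern, such as the final summary line `8.0 2.0 96.9246`, simply yield `None`.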

Then train with the new parameters:

./svm-train -s 0 -t 2 -c 2 -g 2 svmguide1.scale

The trained model is saved to svmguide1.scale.model:

svm_type c_svc
kernel_type rbf
gamma 2
nr_class 2
total_sv 368
rho 0.0558534
label 1 0
nr_sv 188 180
SV
2 1:-1 2:-0.688314 3:0.595954 4:0.416735
2 1:-0.89062 2:-0.819471 3:0.773802 4:-0.78854
2 1:-0.867871 2:-0.816313 3:0.609679 4:-0.00541841
2 1:-0.860513 2:-0.813079 3:0.246757 4:0.28727
......

Predicting with the new model gives an accuracy of 96.875%, an improvement over the 96.15% obtained with default parameters in 3.4.

./svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
Accuracy = 96.875% (3875/4000) (classification)
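When the train and predict steps are repeated for several parameter settings, it can be convenient to build the command lines from a script. The sketch below only assembles argument lists using the options listed in section 3.2. The helper names and the default `./svm-train` and `./svm-predict` paths are assumptions for this example, and nothing is executed.

```python
def svm_train_args(c, gamma, data_file, binary="./svm-train"):
    """Arguments for C-SVC (-s 0) with an RBF kernel (-t 2)."""
    return [binary, "-s", "0", "-t", "2",
            "-c", repr(float(c)), "-g", repr(float(gamma)), data_file]


def svm_predict_args(test_file, model_file, output_file, binary="./svm-predict"):
    """Arguments in the order svm-predict expects: test, model, output."""
    return [binary, test_file, model_file, output_file]


train_cmd = svm_train_args(2, 2, "svmguide1.scale")
predict_cmd = svm_predict_args("svmguide1.t.scale",
                               "svmguide1.scale.model",
                               "svmguide1.t.predict")
```

Each list could then be passed to `subprocess.run(cmd, check=True)` from the directory that contains the LibSVM binaries.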

LibSVM already provides a script that performs this whole sequence of steps:

python easy.py ../svmguide1.txt ../svmguide1.t
Scaling training data...
Cross validation...
Best c=8.0, g=2.0 CV rate=96.9246
Training...
Output model: svmguide1.txt.model
Scaling testing data...
Testing...
Accuracy = 96.95% (3878/4000) (classification)
Output prediction: svmguide1.t.predict

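The accuracy is even slightly higher than what we obtained by running the steps manually. A likely reason is that easy.py trains with the best c=8.0 found by the grid-search, whereas the manual run above passed -c 2 to svm-train.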