Kaggle in Practice: Digit Recognition -- A Beginner's Introduction to the SVM Classification Algorithm (Python)

01-27

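Introduction

As "Machine Learning in Action" puts it: "The main task of machine learning is classification." This digit recognition competition tests how well you have mastered classification in machine learning. This article is written for beginners, so it naturally has its limits, but it is well suited to newcomers who do not know where to start with this competition. As a machine learning beginner myself, I also learned a great deal from working through it.

First we import the relevant libraries: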

import pandas as pd
import matplotlib.pyplot as plt, matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
from sklearn import svm
# IPython/Jupyter magic: draw plots inside the notebook
%matplotlib inline

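Loading the data

First, use the read_csv() function from the pandas package to read train.csv and store the data in a DataFrame;
then separate the images from the labels for the supervised learning that follows;
finally, use train_test_split() to divide the data into two parts, one for training and one for testing (80% and 20% here). The results on the test part tell us how well our model classifies.

To save time, we will use only 5000 images. You can increase or decrease the number of images to see how the amount of data affects the trained model.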

labeled_images = pd.read_csv('../input/train.csv')
# Column 0 holds the label; the remaining columns hold the pixel values
images = labeled_images.iloc[0:5000,1:]
labels = labeled_images.iloc[0:5000,:1]
train_images, test_images, train_labels, test_labels = train_test_split(images, labels, train_size=0.8, random_state=0)
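To see what the split produces without downloading the competition files, here is a minimal sketch on a made-up table. The DataFrame `demo`, its random contents and its column names are assumptions for illustration only; the slicing and the `train_test_split()` call mirror the code above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv: 100 rows, a label column
# followed by 784 pixel columns (28 x 28 images flattened row by row).
rng = np.random.RandomState(0)
pixels = rng.randint(0, 256, size=(100, 784))
demo = pd.DataFrame(pixels, columns=["pixel%d" % k for k in range(784)])
demo.insert(0, "label", rng.randint(0, 10, size=100))

demo_images = demo.iloc[:, 1:]   # every column except the first
demo_labels = demo.iloc[:, :1]   # first column only, kept as a DataFrame

tr_x, te_x, tr_y, te_y = train_test_split(
    demo_images, demo_labels, train_size=0.8, random_state=0)

print(tr_x.shape, te_x.shape)    # (80, 784) (20, 784)
print(tr_y.shape, te_y.shape)    # (80, 1) (20, 1)
```

The rows are shuffled before splitting, and images and labels are shuffled together, so each image keeps its own label.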

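Previewing an image

Each image is stored as one-dimensional data (a single row of pixel values), so we can load it into a numpy array and reshape it into a two-dimensional array (28x28 pixels);
then we use matplotlib to draw the image and add its label as the title.

You can change the value of i here to look at other images and labels.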

i=1
# .values replaces as_matrix(), which was removed in pandas 1.0
img=train_images.iloc[i].values
img=img.reshape((28,28))
plt.imshow(img,cmap='gray')
plt.title(train_labels.iloc[i,0])

Output:

Out[3]: <matplotlib.text.Text at 0x7f76efe57978>
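The reshape step can be checked on its own with a small sketch. The array `flat` below is a made-up stand-in for one row of pixel values, not real image data; it only shows where each flat position ends up in the 28x28 grid.

```python
import numpy as np

flat = np.arange(784)            # stand-in for one row of 784 pixel values
grid = flat.reshape((28, 28))    # row-major: 28 consecutive values per image row

k = 100
row, col = divmod(k, 28)         # pixel k lands at (k // 28, k % 28)
print(grid.shape)                # (28, 28)
print(row, col, grid[row, col])  # 3 16 100
```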

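Checking the pixel values

Note: these images are not pure black and white. They are grayscale, with pixel values ranging from 0 to 255.

Plot a histogram of this image's pixel values: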

plt.hist(train_images.iloc[i])

Output:

Out[4]: (array([ 682., 9., 10., 7., 10., 18., 7., 17., 7., 17.]),
 array([ 0. , 25.5, 51. , 76.5, 102. , 127.5, 153. , 178.5,
 204. , 229.5, 255. ]),
 <a list of 10 Patch objects>)
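The two arrays in this output are the bin counts and the bin edges. The same numbers can be obtained without plotting through numpy.histogram, as in the sketch below. The pixel values are invented for illustration (700 background pixels and 84 fully dark ones), so the counts differ from the real image above.

```python
import numpy as np

# Hypothetical image: 700 background pixels (0) and 84 ink pixels (255).
pixels = np.array([0] * 700 + [255] * 84)

# Ten equal-width bins between the smallest and largest value.
counts, edges = np.histogram(pixels, bins=10)
print(counts.sum())   # 784, one count per pixel
print(edges)          # 0, 25.5, 51, ... up to 255
```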

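Training the model

First, use the sklearn.svm module to create a support vector classifier;
then pass the training data to the classifier's fit method, which trains our model;
finally, pass the test images and labels to the score method to see how well the model does. The score method returns a float between 0 and 1 that represents the model's accuracy on the test data.

You can also experiment with the parameters of svm.SVC() and see how the result changes.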

clf = svm.SVC()
# ravel() flattens the one-column label frame into the 1-D array fit() expects
clf.fit(train_images, train_labels.values.ravel())
clf.score(test_images,test_labels)

Output:

Out[5]: 0.10000000000000001
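The suggestion to try different svm.SVC() parameters can be sketched as a small loop. This is only an illustration under stated assumptions: scikit-learn's bundled 8x8 digits dataset stands in for the Kaggle images, and the three settings are arbitrary picks, not tuned values. `kernel` and `C` are real SVC parameters; the scores you get will depend on your scikit-learn version and data.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: the small built-in digits dataset replaces the Kaggle data.
digits = load_digits()
x_tr, x_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, train_size=0.8, random_state=0)

# Hypothetical settings to compare.
settings = [
    {"kernel": "rbf", "C": 1.0},
    {"kernel": "rbf", "C": 10.0},
    {"kernel": "linear", "C": 1.0},
]

scores = {}
for params in settings:
    model = svm.SVC(**params)
    model.fit(x_tr, y_tr)
    scores[(params["kernel"], params["C"])] = model.score(x_te, y_te)

for key, value in scores.items():
    print(key, round(value, 3))
```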

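How can we improve the model?

The test above gave an accuracy of about 0.10, which is a very poor result: 10% is what you would get by guessing a digit at random. There are many ways to raise the accuracy, including not using a support vector classifier at all, but here I will show a fairly simple one. First, reduce the images to black and white.

That is, every pixel in an image becomes either 1 or 0;
then we draw the image again to see the effect.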

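# Every non-zero pixel becomes 1; zero pixels stay 0
test_images[test_images>0]=1
train_images[train_images>0]=1

# .values replaces as_matrix(), which was removed in pandas 1.0
img=train_images.iloc[i].values.reshape((28,28))
plt.imshow(img,cmap='binary')
plt.title(train_labels.iloc[i,0])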

Output:

Out[6]: /opt/conda/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/opt/conda/lib/python3.5/site-packages/pandas/core/frame.py:2392: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.where(-key, value, inplace=True)
/opt/conda/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

<matplotlib.text.Text at 0x7f76efd54da0>
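The SettingWithCopyWarning appears to come from assigning in place to frames that pandas treats as slices of another DataFrame. One way to sidestep it, shown here as a sketch rather than as the original notebook's method, is to build a new binary frame instead of modifying the slice. The tiny table `gray` is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature "image" table with grayscale values 0-255.
gray = pd.DataFrame({"pixel0": [0, 12, 255], "pixel1": [0, 0, 130]})
subset = gray.iloc[0:2]              # a slice, like the output of a split

# Comparison yields True/False; astype(int) turns that into 1/0.
binary = (subset > 0).astype(int)    # a new frame, nothing assigned in place

print(binary.values.tolist())        # [[0, 0], [1, 0]]
print(gray.values.tolist())          # [[0, 0], [12, 0], [255, 130]]
```

Because nothing is written back into `subset`, the original grayscale values stay available for comparison.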

Plot the histogram again:

plt.hist(train_images.iloc[i])

Output:

Out[7]: (array([ 668., 0., 0., 0., 0., 0., 0., 0., 0., 116.]),
 array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <a list of 10 Patch objects>)

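Retraining the model

The process is the same as before, except that the training and test data are now black-and-white images rather than grayscale ones. The score is still not great, but it is a large improvement.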

clf = svm.SVC()
clf.fit(train_images, train_labels.values.ravel())
clf.score(test_images,test_labels)

Output:

Out[8]: 0.88700000000000001
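For a classifier, score reports the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch with invented labels shows the arithmetic; `y_true` and `y_pred` are hypothetical and not taken from the model above.

```python
import numpy as np

y_true = np.array([3, 8, 1, 1, 7])    # hypothetical true labels
y_pred = np.array([3, 8, 1, 7, 7])    # hypothetical predictions

accuracy = np.mean(y_pred == y_true)  # fraction of matching positions
print(accuracy)                       # 0.8 (4 of 5 correct)
```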

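Labelling the test data

Now we can load the unlabelled data in test.csv and predict its labels. Again we use only the first 5000 images, and then write the predictions to results.csv.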

test_data=pd.read_csv('../input/test.csv')
# Binarize the test images the same way as the training images
test_data[test_data>0]=1
results=clf.predict(test_data[0:5000])

Output:

Out[10]: array([2, 0, 9, ..., 1, 7, 3])

Save to a file:

df = pd.DataFrame(results)
df.index.name='ImageId'
# Number the rows from 1 instead of 0
df.index+=1
df.columns=['Label']
df.to_csv('results.csv', header=True)
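To check what the saved file looks like without touching the disk, the sketch below writes three made-up predictions to an in-memory buffer. The array `results_demo` is hypothetical; the `ImageId` and `Label` column names match the code above.

```python
import io

import numpy as np
import pandas as pd

results_demo = np.array([2, 0, 9])   # hypothetical predictions

# Index starts at 1 and is named ImageId, as in the code above.
submission = pd.DataFrame(
    {"Label": results_demo},
    index=pd.RangeIndex(1, len(results_demo) + 1, name="ImageId"))

buffer = io.StringIO()
submission.to_csv(buffer, header=True)
lines = buffer.getvalue().splitlines()
print(lines)   # ['ImageId,Label', '1,2', '2,0', '3,9']
```

Each row pairs an image number with its predicted digit, so a file built from only the first 5000 predictions covers only those 5000 images.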

All done!