cs231n Notes 1: Image Classification with the KNN Algorithm
1. Downloading the Dataset
Download the Python version of the CIFAR-10 archive and extract it into the ./input folder.
2. Getting Familiar with the Dataset
First, list the files:
from subprocess import check_output
print(check_output(["ls", "input/cifar-10-batches-py"]).decode())
Output:
batches.meta
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
readme.html
test_batch
The batch files above are pickled Python objects. Following the instructions on the CIFAR-10 website, we can unpickle each of them into a dictionary:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
Let's take a look at data_batch_1:
data_batch_1 = unpickle("input/cifar-10-batches-py/data_batch_1")
data_batch_1.keys()
Output:
dict_keys([b'filenames', b'data', b'labels', b'batch_label'])
Check the type of each item:
for k in data_batch_1.keys():
    print(k.decode(), ':', type(data_batch_1[k]))
Output:
filenames : <class 'list'>
data : <class 'numpy.ndarray'>
labels : <class 'list'>
batch_label : <class 'bytes'>
Check the shape of data and the length of labels:
print(data_batch_1[b'data'].shape)
print(len(data_batch_1[b'labels']))
Output:
(10000, 3072)
10000
In fact, data is a collection of 10000 samples. Each sample is a 32*32-pixel RGB image, so each sample is represented by 32*32*3 = 3072 numbers. One row of data is one sample and looks like this:
array([ 59, 154, 255, ..., 71, 250, 62], dtype=uint8)
labels holds the labels of the 10000 samples, each an integer from 0 to 9. The actual class names can be looked up in batches.meta:
meta = unpickle("input/cifar-10-batches-py/batches.meta")
meta[b'label_names']
Output:
[b'airplane', b'automobile', b'bird', b'cat', b'deer', b'dog', b'frog', b'horse', b'ship', b'truck']
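As a quick sanity check (not in the original notes), we can reshape one row of data back into an image and display it. This sketch assumes the channel-first layout described on the CIFAR-10 website: each row stores 1024 red values, then 1024 green, then 1024 blue.

import matplotlib.pyplot as plt

img_row = data_batch_1[b'data'][0]
label = data_batch_1[b'labels'][0]
# reshape the 3072 values into (3, 32, 32), then move the channel axis last for imshow
img = img_row.reshape(3, 32, 32).transpose(1, 2, 0)

plt.imshow(img)
plt.title(meta[b'label_names'][label].decode())
plt.axis('off')
plt.show()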
In summary, the CIFAR-10 training set has 50000 samples in total, split across the 5 data_batch files, plus 10000 test samples in test_batch that serve as the test set.
Next we unpickle all training and test samples, and merge the 5 data_batch files into a 50000*3072 training set:
import os
import numpy as np

def load_CIFAR_batch(filepath):
    datadict = unpickle(filepath)
    X = datadict[b'data'].astype(float)
    Y = datadict[b'labels']
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the five 10000*3072 arrays in the list into one 50000*3072 array
    Ytr = np.concatenate(ys)
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte

cifar10_dir = 'input/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Output:
Training data shape:  (50000, 3072)
Training labels shape:  (50000,)
Test data shape:  (10000, 3072)
Test labels shape:  (10000,)
However, to speed things up, we only keep 5000 training samples and 500 test samples:
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
3. Implementing the KNN Algorithm
KNN has no real training phase: it simply memorizes all training samples. At prediction time it computes the distance from the test sample to every training sample (we use Euclidean distance here), takes the K training samples with the smallest distances, and predicts the class that appears most often among them.
Since there are 500 test samples and 5000 training samples, we store the distances in a 500*5000 2D array, where each row holds the Euclidean distances from one test sample to all 5000 training samples.
The question now is how to compute this 500*5000 distance array. The assignment asks for three different implementations: two loops, one loop, and no loops.
Since numpy array operations can easily get confusing, let's first work through a simplified example. Suppose X_train is 4*2 (instead of the real 5000*3072) and X_test is 3*2 (instead of 500*3072), and write the rows of X_train as (a1, b1), ..., (a4, b4) and the rows of X_test as (x1, y1), (x2, y2), (x3, y3) (this notation is just for convenience; don't worry about the data types). However we compute it, we must end up with the 3*4 dists matrix whose entry in row i, column j is the Euclidean distance from the i-th test sample to the j-th training sample, i.e. sqrt((xi - aj)^2 + (yi - bj)^2) (the sqrt is taken at the very end).
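To make this concrete, here is a tiny sketch with made-up numbers (my own, purely for illustration) that builds such a 3*4 dists matrix with two explicit loops:

import numpy as np

# hypothetical toy data: 4 "training" samples and 3 "test" samples, 2 features each
X_train_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)
X_test_toy = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)

dists_toy = np.zeros((3, 4))
for i in range(3):        # test samples
    for j in range(4):    # training samples
        dists_toy[i, j] = np.sqrt(np.sum(np.square(X_test_toy[i] - X_train_toy[j])))
print(dists_toy)          # the 3*4 matrix of Euclidean distances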
The two-loop version is naive, so here is the code directly:
def compute_distances_two_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
    return dists
The one-loop version is really much the same, because in numpy the expression X[i] - self.X_train broadcasts the single test row against every row of X_train, giving a 4*2 matrix whose j-th row is (xi - aj, yi - bj). Applying np.square to that result squares every element, and summing along the second axis (axis=1) then gives the squared distances from this one test sample to all training samples (the square root is taken afterwards). Although described as a matrix above, what comes out of the sum is actually a 1D numpy array of shape (4,), not (4,1), so it can be assigned directly to dists[i].
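Continuing the toy example above (again just an illustration), one iteration of the loop looks like this:

diff = X_test_toy[0] - X_train_toy          # broadcasting: shape (2,) against (4, 2) gives (4, 2)
sq_dists = np.sum(np.square(diff), axis=1)  # shape (4,): one squared distance per training sample
print(sq_dists.shape)                       # (4,)
print(np.sqrt(sq_dists))                    # equals the first row of dists_toy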
So the one-loop code is:
def compute_distances_one_loop(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
    return dists
How do we manage the last method, with no loops at all? The answer is to expand the squared distance with the perfect-square formula: for a test sample x and a training sample a, sum_d (x_d - a_d)^2 = sum_d x_d^2 - 2 * sum_d x_d * a_d + sum_d a_d^2.
First, since X_train is 4*2 and X_test is 3*2, to get a 3*4 result we start by multiplying X_test by the transpose of X_train: the entry in row i, column j of X_test.dot(X_train.T) is the dot product xi*aj + yi*bj. Comparing this with dists (before the square root), the difference is that after multiplying this matrix by -2, we still need to add to every row the vector of squared norms of the training samples, [a1^2 + b1^2, ..., a4^2 + b4^2], which can be computed from X_train, and add to every column the vector of squared norms of the test samples, [x1^2 + y1^2, ..., x3^2 + y3^2], which can be computed from X_test.
The code:
def compute_distances_no_loops(self, X):
    test_sums = np.sum(np.square(X), axis=1)
    train_sums = np.sum(np.square(self.X_train), axis=1)
    return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums + test_sums.reshape(-1, 1))
That concludes the distance computations. With the distance matrix in hand, we can now make a prediction for each test sample. To predict the i-th sample, we need the indices of the K smallest distances in row i of the distance matrix, so that we can look up the corresponding classes in y_train. numpy's argsort function does exactly this.
For example, argsort on [8, 7, 10, 9] gives some array [?, ?, ?, ?].
Since 7 is the smallest and its index is 1, the result starts as [1, ?, ?, ?].
8 is the second smallest and its index is 0, so we get [1, 0, ?, ?].
...
In the end, the result is [1, 0, 3, 2].
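A quick check in numpy:

import numpy as np
print(np.argsort(np.array([8, 7, 10, 9])))  # [1 0 3 2]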
So argsort(dists[i])[:k] gives exactly the indices we need. Finally, we only have to find the mode (the most frequent value) of
y_train[np.argsort(dists[i])[:k]]
which can be done with the bincount and argmax functions.
What does bincount do? Here is an example: applying bincount to [1, 5, 3, 4, 4], since the largest number is 5, counts how many times each number from 0 to 5 appears. The result is out = [0, 1, 0, 1, 2, 1], where out[i] = x means the number i appears x times.
argmax, as the name suggests, returns the index of the largest value.
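For example:

votes = np.bincount(np.array([1, 5, 3, 4, 4]))
print(votes)             # [0 1 0 1 2 1]
print(np.argmax(votes))  # 4, the value that occurs most often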
So the prediction routine looks like this:
def predict_labels(self, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        closest_y = self.y_train[np.argsort(dists[i])[:k]]
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred
Finally, the complete KNN class is:
class KNearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        return dists

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        return dists

    def compute_distances_no_loops(self, X):
        test_sums = np.sum(np.square(X), axis=1)
        train_sums = np.sum(np.square(self.X_train), axis=1)
        return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums + test_sums.reshape(-1, 1))

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            closest_y = self.y_train[np.argsort(dists[i])[:k]]
            y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred
Let's run it:
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Result:
Got 137 / 500 correct => accuracy: 0.274000
27% accuracy, which is at least better than random guessing (10%).
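As an extra sanity check (my own addition, not part of the original notes), we can verify that the vectorized implementations produce the same distance matrix as the two-loop version, using the classifier trained above:

dists_two = classifier.compute_distances_two_loops(X_test)
dists_one = classifier.compute_distances_one_loop(X_test)
dists_none = classifier.compute_distances_no_loops(X_test)

# the Frobenius norms of the differences should be close to 0
print(np.linalg.norm(dists_two - dists_one))
print(np.linalg.norm(dists_two - dists_none))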
4. Choosing a Suitable K
We use cross-validation: for each candidate k, hold out one fold as a validation set, train on the remaining folds, and average the validation accuracy over all folds.
Split the training set into num_folds folds:
num_folds = 5
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
Try a range of different k values:
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
k_to_accuracies = {}
for k_ in k_choices:
    k_to_accuracies.setdefault(k_, [])

for i in range(num_folds):
    classifier = KNearestNeighbor()
    X_val_train = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
    y_val_train = np.vstack(y_train_folds[0:i] + y_train_folds[i+1:])
    y_val_train = y_val_train.flatten()
    classifier.train(X_val_train, y_val_train)
    for k_ in k_choices:
        y_val_pred = classifier.predict(X_train_folds[i], k=k_)
        num_correct = np.sum(y_val_pred == y_train_folds[i].flatten())
        accuracy = float(num_correct) / len(y_val_pred)
        k_to_accuracies[k_] = k_to_accuracies[k_] + [accuracy]
Print the results:
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
Result:
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
None of the accuracies above exceed 30%.
Visualize the results:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
Judging by the means, k = 10 gives the highest mean accuracy, although its standard deviation is also fairly large.
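With these results in hand, we would pick the best k and evaluate once more on the test set. A minimal sketch, assuming best_k = 10 based on the means above (the exact choice and the resulting accuracy may differ):

best_k = 10  # assumed choice based on the cross-validation means above

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))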