cs231n Notes 1: Image Classification with the KNN Algorithm
1. Downloading the Dataset
Download the Python version of the CIFAR-10 archive and extract it into the ./input folder.
2. Getting Familiar with the Dataset
First, list the files:
from subprocess import check_output
print(check_output(["ls", "input/cifar-10-batches-py"]).decode())
Output:
batches.meta
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
readme.html
test_batch
The batch files above are pickled Python objects. Following the instructions on the CIFAR-10 website, we can unpickle each of them into a dictionary:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
Let's take a look at data_batch_1:
data_batch_1 = unpickle("input/cifar-10-batches-py/data_batch_1")
data_batch_1.keys()
Output:
dict_keys([b'filenames', b'data', b'labels', b'batch_label'])
Check the type of each item:
for k in data_batch_1.keys():
    print(k.decode(), ':', type(data_batch_1[k]))
Output:
filenames : <class 'list'>
data : <class 'numpy.ndarray'>
labels : <class 'list'>
batch_label : <class 'bytes'>
Check the shape of data and the length of labels:
print(data_batch_1[b'data'].shape)
print(len(data_batch_1[b'labels']))
Output:
(10000, 3072)
10000
In fact, data is a collection of 10000 samples. Each sample is a 32*32-pixel RGB image, so each sample is represented by 32*32*3 = 3072 numbers. One row of data is one sample and looks like this:
array([ 59, 154, 255, ..., 71, 250, 62], dtype=uint8)
labels holds the labels of the 10000 samples, each an integer from 0 to 9. The actual class names can be looked up in batches.meta:
meta = unpickle("input/cifar-10-batches-py/batches.meta")
meta[b'label_names']
Output:
[b'airplane', b'automobile', b'bird', b'cat', b'deer', b'dog', b'frog', b'horse', b'ship', b'truck']
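As a quick sanity check (not in the original notes), we can reshape one row of data back into an image and display it. This sketch assumes the channel-first layout described on the CIFAR-10 website: each row stores 1024 red values, then 1024 green, then 1024 blue.

import matplotlib.pyplot as plt

img_row = data_batch_1[b'data'][0]
label = data_batch_1[b'labels'][0]
# reshape the 3072 values into (3, 32, 32), then move the channel axis last for imshow
img = img_row.reshape(3, 32, 32).transpose(1, 2, 0)

plt.imshow(img)
plt.title(meta[b'label_names'][label].decode())
plt.axis('off')
plt.show()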
In summary, the CIFAR-10 training set has 50000 samples in total, split across the 5 data_batch files, plus 10000 test samples in test_batch that serve as the test set.
Next we unpickle all training and test samples, and merge the 5 data_batch files into a 50000*3072 training set:
import os
import numpy as np

def load_CIFAR_batch(filepath):
    datadict = unpickle(filepath)
    X = datadict[b'data'].astype(float)
    Y = datadict[b'labels']
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the five 10000*3072 arrays in the list into one 50000*3072 array
    Ytr = np.concatenate(ys)
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte

cifar10_dir = 'input/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Output:
Training data shape:  (50000, 3072)
Training labels shape:  (50000,)
Test data shape:  (10000, 3072)
Test labels shape:  (10000,)
However, to speed things up, we only keep 5000 training samples and 500 test samples:
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
3. Implementing the KNN Algorithm
KNN has no real training phase: it simply memorizes all training samples. At prediction time it computes the distance from the test sample to every training sample (we use Euclidean distance here), takes the K training samples with the smallest distances, and predicts the class that appears most often among them.
Since there are 500 test samples and 5000 training samples, we store the distances in a 500*5000 2D array, where each row holds the Euclidean distances from one test sample to all 5000 training samples.
The question now is how to compute this 500*5000 distance array. The assignment asks for three different implementations: two loops, one loop, and no loops.
Since numpy array operations can easily get confusing, let's first work through a simplified example. Suppose X_train is 4*2 (instead of the real 5000*3072) and X_test is 3*2 (instead of 500*3072), and write the rows of X_train as (a1, b1), ..., (a4, b4) and the rows of X_test as (x1, y1), (x2, y2), (x3, y3) (this notation is just for convenience; don't worry about the data types). However we compute it, we must end up with the 3*4 dists matrix whose entry in row i, column j is the Euclidean distance from the i-th test sample to the j-th training sample, i.e. sqrt((xi - aj)^2 + (yi - bj)^2) (the sqrt is taken at the very end).
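To make this concrete, here is a tiny sketch with made-up numbers (my own, purely for illustration) that builds such a 3*4 dists matrix with two explicit loops:

import numpy as np

# hypothetical toy data: 4 "training" samples and 3 "test" samples, 2 features each
X_train_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)
X_test_toy = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)

dists_toy = np.zeros((3, 4))
for i in range(3):        # test samples
    for j in range(4):    # training samples
        dists_toy[i, j] = np.sqrt(np.sum(np.square(X_test_toy[i] - X_train_toy[j])))
print(dists_toy)          # the 3*4 matrix of Euclidean distances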
The two-loop version is naive, so here is the code directly:
def compute_distances_two_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
    return dists
The one-loop version is really much the same, because in numpy the expression X[i] - self.X_train broadcasts the single test row against every row of X_train, giving a 4*2 matrix whose j-th row is (xi - aj, yi - bj). Applying np.square to that result squares every element, and summing along the second axis (axis=1) then gives the squared distances from this one test sample to all training samples (the square root is taken afterwards). Although described as a matrix above, what comes out of the sum is actually a 1D numpy array of shape (4,), not (4,1), so it can be assigned directly to dists[i].
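Continuing the toy example above (again just an illustration), one iteration of the loop looks like this:

diff = X_test_toy[0] - X_train_toy          # broadcasting: shape (2,) against (4, 2) gives (4, 2)
sq_dists = np.sum(np.square(diff), axis=1)  # shape (4,): one squared distance per training sample
print(sq_dists.shape)                       # (4,)
print(np.sqrt(sq_dists))                    # equals the first row of dists_toy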
So the one-loop code is:
def compute_distances_one_loop(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
    return dists
How do we manage the last method, with no loops at all? The answer is to expand the squared distance with the perfect-square formula: for a test sample x and a training sample a, sum_d (x_d - a_d)^2 = sum_d x_d^2 - 2 * sum_d x_d * a_d + sum_d a_d^2.
First, since X_train is 4*2 and X_test is 3*2, to get a 3*4 result we start by multiplying X_test by the transpose of X_train: the entry in row i, column j of X_test.dot(X_train.T) is the dot product xi*aj + yi*bj. Comparing this with dists (before the square root), the difference is that after multiplying this matrix by -2, we still need to add to every row the vector of squared norms of the training samples, [a1^2 + b1^2, ..., a4^2 + b4^2], which can be computed from X_train, and add to every column the vector of squared norms of the test samples, [x1^2 + y1^2, ..., x3^2 + y3^2], which can be computed from X_test.
The code:
def compute_distances_no_loops(self, X):
    test_sums = np.sum(np.square(X), axis=1)
    train_sums = np.sum(np.square(self.X_train), axis=1)
    return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums + test_sums.reshape(-1, 1))
That concludes the distance computations. With the distance matrix in hand, we can now make a prediction for each test sample. To predict the i-th sample, we need the indices of the K smallest distances in row i of the distance matrix, so that we can look up the corresponding classes in y_train. numpy's argsort function does exactly this.
For example, argsort on [8, 7, 10, 9] gives some array [?, ?, ?, ?].
Since 7 is the smallest and its index is 1, the result starts as [1, ?, ?, ?].
8 is the second smallest and its index is 0, so we get [1, 0, ?, ?].
...
In the end, the result is [1, 0, 3, 2].
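A quick check in numpy:

import numpy as np
print(np.argsort(np.array([8, 7, 10, 9])))  # [1 0 3 2]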
So argsort(dists[i])[:k] gives exactly the indices we need. Finally, we only have to find the mode (the most frequent value) of
y_train[np.argsort(dists[i])[:k]]
which can be done with the bincount and argmax functions.
What does bincount do? Here is an example: applying bincount to [1, 5, 3, 4, 4], since the largest number is 5, counts how many times each number from 0 to 5 appears. The result is out = [0, 1, 0, 1, 2, 1], where out[i] = x means the number i appears x times.
argmax, as the name suggests, returns the index of the largest value.
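For example:

votes = np.bincount(np.array([1, 5, 3, 4, 4]))
print(votes)             # [0 1 0 1 2 1]
print(np.argmax(votes))  # 4, the value that occurs most often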
So the prediction routine looks like this:
def predict_labels(self, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        closest_y = self.y_train[np.argsort(dists[i])[:k]]
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred
Finally, the complete KNN class is:
class KNearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        return dists

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        return dists

    def compute_distances_no_loops(self, X):
        test_sums = np.sum(np.square(X), axis=1)
        train_sums = np.sum(np.square(self.X_train), axis=1)
        return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums + test_sums.reshape(-1, 1))

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            closest_y = self.y_train[np.argsort(dists[i])[:k]]
            y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred
Let's run it:
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
Result:
Got 137 / 500 correct => accuracy: 0.274000
27% accuracy, which is at least better than random guessing (10%).
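As an extra sanity check (my own addition, not part of the original notes), we can verify that the vectorized implementations produce the same distance matrix as the two-loop version, using the classifier trained above:

dists_two = classifier.compute_distances_two_loops(X_test)
dists_one = classifier.compute_distances_one_loop(X_test)
dists_none = classifier.compute_distances_no_loops(X_test)

# the Frobenius norms of the differences should be close to 0
print(np.linalg.norm(dists_two - dists_one))
print(np.linalg.norm(dists_two - dists_none))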
4. Choosing a Suitable K
We use cross-validation: for each candidate k, hold out one fold as a validation set, train on the remaining folds, and average the validation accuracy over all folds.
Split the training set into num_folds folds:
num_folds = 5
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
Try a range of different k values:
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
k_to_accuracies = {}
for k_ in k_choices:
    k_to_accuracies.setdefault(k_, [])

for i in range(num_folds):
    classifier = KNearestNeighbor()
    X_val_train = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
    y_val_train = np.vstack(y_train_folds[0:i] + y_train_folds[i+1:])
    y_val_train = y_val_train.flatten()
    classifier.train(X_val_train, y_val_train)
    for k_ in k_choices:
        y_val_pred = classifier.predict(X_train_folds[i], k=k_)
        num_correct = np.sum(y_val_pred == y_train_folds[i].flatten())
        accuracy = float(num_correct) / len(y_val_pred)
        k_to_accuracies[k_] = k_to_accuracies[k_] + [accuracy]
Print the results:
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
Result:
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
None of the accuracies above exceed 30%.
Visualize the results:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
Judging by the means, k = 10 gives the highest mean accuracy, although its standard deviation is also fairly large.
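With these results in hand, we would pick the best k and evaluate once more on the test set. A minimal sketch, assuming best_k = 10 based on the means above (the exact choice and the resulting accuracy may differ):

best_k = 10  # assumed choice based on the cross-validation means above

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))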