CS231n: Assignment 1 (Q1-Q3, Gradients Illustrated)


Assignment 1 (Q1-Q3) mainly involves the following topics:

1. Distance metrics

The commonly used ones are:

L1 norm (Manhattan distance): $d_1(I_1,I_2)=\sum_p |I_1^p - I_2^p|$

L2 norm (Euclidean distance): $d_2(I_1,I_2)=\sqrt{\sum_p (I_1^p - I_2^p)^2}$
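
A minimal sketch with made-up pixel values (not part of the assignment), just to show how compactly both distances are written in NumPy:

import numpy as np

# L1 and L2 distances between two flattened images (toy values).
I1 = np.array([12.0, 34.0, 56.0, 78.0])
I2 = np.array([10.0, 20.0, 30.0, 40.0])

d1 = np.sum(np.abs(I1 - I2))           # L1 (Manhattan) distance
d2 = np.sqrt(np.sum((I1 - I2) ** 2))   # L2 (Euclidean) distance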

2. The KNN algorithm

3. Splitting the data into training, validation and test sets, and tuning hyperparameters

Hyperparameters are usually tuned with cross-validation. For example, the value of k in the KNN algorithm is chosen via cross-validation.

When the dataset is large, cross-validation consumes a lot of compute, so in practice the training data is simply split into a training set and a validation set, with 50%-90% of the data used for training.

When the dataset is small, or there are many hyperparameters, cross-validation is still the better choice.
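
As a rough sketch of the simple split described above (the 80/20 ratio is my own illustrative choice, and X_train / y_train are assumed to be the training arrays loaded in the notebook; the full k-fold code appears under Q1 below):

import numpy as np

# Hold out part of the training data as a validation set
# (80/20 split chosen only for illustration).
num_train = X_train.shape[0]
num_val = int(0.2 * num_train)

idx = np.random.permutation(num_train)
X_val, y_val = X_train[idx[:num_val]], y_train[idx[:num_val]]
X_tr,  y_tr  = X_train[idx[num_val:]], y_train[idx[num_val:]]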

4. Score function

Linear classifier: $f(x_i, W, b) = W x_i + b$

After merging the weights and the bias (the bias trick): $f(x_i, W) = W x_i$
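
A minimal sketch of the bias trick with toy shapes (append a constant 1 to the input and a column containing $b$ to $W$, so a single matrix multiply also applies the bias):

import numpy as np

# Bias trick sketch: D = 4 features, C = 3 classes (toy shapes).
np.random.seed(0)
W = np.random.randn(3, 4)
b = np.random.randn(3)
x = np.random.randn(4)

scores = W.dot(x) + b                      # f(x, W, b) = Wx + b

W_ext = np.hstack([W, b.reshape(-1, 1)])   # fold b into W as an extra column
x_ext = np.append(x, 1.0)                  # append a constant 1 to x
scores_ext = W_ext.dot(x_ext)              # f(x, W) = Wx, same result

assert np.allclose(scores, scores_ext)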

5. Loss function

5.1 Multiclass Support Vector Machine Loss

For a given image $x_i$, the corresponding loss is:

$$L_i=\sum_{j\neq y_i}\max(0,\, s_j - s_{y_i} + \Delta)=\sum_{j\neq y_i}\max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$

The $\max(0,-)$ function is usually called the hinge loss.

When the score of the correct class is higher than the score of every other class by at least $\Delta$, the loss is 0; otherwise a loss is incurred.
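
For example (toy scores I made up, with $\Delta = 1$), the per-class terms can be computed directly:

import numpy as np

# Toy example: 3 classes, correct class y_i = 0, delta = 1.
scores = np.array([13.0, -7.0, 11.0])   # s_0, s_1, s_2
y_i, delta = 0, 1.0

margins = scores - scores[y_i] + delta  # [1, -19, -1]
margins[y_i] = 0                        # the correct class is skipped
L_i = np.sum(np.maximum(0, margins))    # max(0,-19) + max(0,-1) = 0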

Regularization

The most common regularization penalty is the L2 norm, which discourages large weights by penalizing the square of every parameter element-wise:

$$R(W)=\sum_k \sum_l W_{k,l}^2$$

With the regularization term, the full loss becomes:

$$L=\underbrace{\frac{1}{N}\sum_i L_i}_{\text{data loss}}+\underbrace{\lambda R(W)}_{\text{regularization loss}}=\frac{1}{N}\sum_i\sum_{j\neq y_i}\left[\max(0,\, f(x_i;W)_j - f(x_i;W)_{y_i} + \Delta)\right]+\frac{\lambda}{2}\sum_k\sum_l W_{k,l}^2$$

Both $\Delta$ and $\lambda$ are hyperparameters that balance the data loss against the regularization loss; since scaling $W$ changes the score differences between classes, $\Delta$ is usually simply set to 1, while $\lambda$ is chosen via cross-validation.

Why add the regularization term: penalizing large weights keeps the final classifier from relying on just one or two dimensions, so the weights are more spread out, which improves generalization.
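
A small sketch of how the two terms combine (toy shapes and a placeholder data loss; the 0.5 factor matches the $\frac{\lambda}{2}$ convention used above and in the assignment code below):

import numpy as np

# Combining data loss and the L2 penalty (toy numbers; `data_loss` stands
# in for (1/N) * sum_i L_i from whichever loss function is used).
np.random.seed(0)
W = 0.001 * np.random.randn(3073, 5)   # D x C, as in the assignment (C = 5 here)
data_loss = 0.9                        # placeholder value
reg = 2.5e4                            # lambda, typically chosen on the validation set

R_W = np.sum(W * W)                    # R(W) = sum_k sum_l W_{k,l}^2
loss = data_loss + 0.5 * reg * R_W     # total loss with the lambda/2 convention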

The matrix dimensions below match those used in the programming assignment (for convenience only 5 classes are shown, i.e. C = 5).

Figure 1. Schematic from the image data to the loss and the gradients.

Computing the gradient

The hardest part of the assignment is computing the gradient in vectorized form.

The lecture notes give the gradient formula, but it is not very explicit. Here is my own understanding, which I hope is helpful.

Suppose the correct class of image $x_i$ is class $k$. Then the loss $L_i$ for $x_i$ is:

$$\begin{aligned} L_i &= \max(0,\, x_i w_0 - x_i w_k + \Delta) \\ &+ \max(0,\, x_i w_1 - x_i w_k + \Delta) \\ &+ \dots \\ &+ \max(0,\, x_i w_{k-1} - x_i w_k + \Delta) \\ &+ \max(0,\, x_i w_{k+1} - x_i w_k + \Delta) \\ &+ \dots \end{aligned} \tag{1}$$

Each row above is the contribution of class $j$'s score to $L_i$ (the first row is the contribution of class 0, the second row the contribution of class 1, and so on), while the score of the correct class contributes nothing to $L_i$ (there is no row for class $k$).

Let $M_j = x_i w_j - x_i w_k + \Delta$ (the margins).

Then $\frac{\partial L_i}{\partial w_j}$ is nonzero only when $M_j > 0$.

Now take the partial derivatives with respect to $w_0, w_1, \dots$:

$$\frac{\partial L_i}{\partial w_0} = \mathbb{1}\{M_0>0\} \cdot x_i^T$$

$$\frac{\partial L_i}{\partial w_1} = \mathbb{1}\{M_1>0\} \cdot x_i^T$$

...

$$\frac{\partial L_i}{\partial w_{k-1}} = \mathbb{1}\{M_{k-1}>0\} \cdot x_i^T$$

$$\begin{aligned} \frac{\partial L_i}{\partial w_k} &= -\mathbb{1}\{M_0>0\}\cdot x_i^T - \mathbb{1}\{M_1>0\}\cdot x_i^T - \mathbb{1}\{M_2>0\}\cdot x_i^T - \dots \\ &= \Big[-\sum_{j\neq k}\mathbb{1}\{M_j>0\}\Big]\cdot x_i^T \end{aligned}$$

$$\frac{\partial L_i}{\partial w_{k+1}} = \mathbb{1}\{M_{k+1}>0\} \cdot x_i^T$$

...

Therefore the gradient of the SVM loss is:

When $j \neq k$: $\frac{\partial L_i}{\partial w_j} = \mathbb{1}\{M_j>0\} \cdot x_i^T$

When $j = k$: $\frac{\partial L_i}{\partial w_j} = \Big[-\sum_{j\neq k}\mathbb{1}\{M_j>0\}\Big] \cdot x_i^T$

Where:

  1. $\mathbb{1}\{\cdot\}$ is the indicator function: its value is 1 when the expression in the braces is true and 0 otherwise. You can think of it as an integer that is either 0 or 1.
  2. The partial derivative of $L_i$ with respect to $w_k$ differs from those with respect to the other $w_j$. As equation (1) shows, $w_k$ appears in every row of the sum, while each $w_j$ appears only in row $j$.
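
To make the two cases concrete, here is a minimal sketch (the helper name and shapes are mine, not the assignment's) that applies exactly these formulas to a single example:

import numpy as np

# SVM gradient for a single example, following the formulas above.
# x_i: (D,), W: (D, C), k: index of the correct class.
def svm_grad_single(x_i, W, k, delta=1.0):
    scores = x_i.dot(W)                      # (C,)
    margins = scores - scores[k] + delta     # M_j for every class j
    margins[k] = 0.0                         # the correct class contributes nothing
    indicator = (margins > 0).astype(float)  # 1{M_j > 0}
    dW = np.outer(x_i, indicator)            # dL_i/dw_j = 1{M_j>0} * x_i^T for j != k
    dW[:, k] = -indicator.sum() * x_i        # dL_i/dw_k = -sum_{j!=k} 1{M_j>0} * x_i^T
    return dW                                # shape (D, C)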

5.2 Softmax Loss

For an image $x_i$, the corresponding loss is:

$$L_i=-\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)=-f_{y_i}+\log\left(\sum_j e^{f_j}\right)$$

With the regularization term, the full loss is:

$$L=\frac{1}{N}\sum_i L_i + \lambda R(W)=\frac{1}{N}\sum_i\left[-\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)\right]+\frac{\lambda}{2}\sum_k\sum_l W_{k,l}^2$$

A few points to understand about the Softmax loss:

First: Softmax is just a squashing function:

$$\mathrm{Softmax}(a_j)=\frac{e^{a_j}}{\sum_i e^{a_i}}$$

Given an input vector $\bar{a}=[a_1, a_2, \dots, a_i, \dots]$, the output of the Softmax function is also a vector, in which every element lies between 0 and 1 and all elements sum to 1.
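
A minimal sketch with made-up input values:

import numpy as np

# Softmax as a squashing function (toy input).
a = np.array([2.0, 1.0, 0.1])
softmax_a = np.exp(a) / np.sum(np.exp(a))   # every entry lies in (0, 1)
print(softmax_a, softmax_a.sum())           # the entries sum to 1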

Second: the Softmax loss interprets the score $f_j = x_i w_j$ as an unnormalized log probability, so:

$e^{f_j}$ is an unnormalized probability,

$\mathrm{Softmax}(f_j)$ is a normalized probability,

$L_i = -\log(\mathrm{Softmax}(f_{y_i}))$ is the negative log of the normalized probability of the correct class.

Third: in practice $\mathrm{Softmax}(\cdot)$ contains the function $e^{(\cdot)}$, so when the scores are large (e.g. 999) the computation overflows. The lecture notes give the correct trick: $\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}=\frac{C e^{f_{y_i}}}{C\sum_j e^{f_j}}=\frac{e^{f_{y_i}+\log C}}{\sum_j e^{f_j+\log C}}$, with $\log C=-\max_j f_j$.

For example, given an input vector $a=[a_1, a_2, \dots, a_i, \dots]$, set $a=[a_1-\max(a_i),\, a_2-\max(a_i), \dots, a_i-\max(a_i), \dots]$; then every value in the vector is $\leq 0$, and $e^x \in (0,1]$ for $x \leq 0$.
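
A minimal sketch of the shift (the scores [123, 456, 789] are just illustrative large values):

import numpy as np

# Max-shift trick: mathematically the output is unchanged,
# but the naive version overflows for large scores.
f = np.array([123.0, 456.0, 789.0])
# p_naive = np.exp(f) / np.sum(np.exp(f))          # overflows: exp(789) is inf
f_shifted = f - np.max(f)                          # largest entry becomes 0
p = np.exp(f_shifted) / np.sum(np.exp(f_shifted))  # safe: exp values lie in (0, 1]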

Computing the gradient

For an image $x_i$ whose correct class is class $k$, the loss $L_i$ is:

$$\begin{aligned} L_i &= -f_k + \log\Big(\sum_j e^{f_j}\Big) \\ &= -w_k x_i + \log\Big(\sum_j e^{w_j x_i}\Big) \\ &= -w_k x_i + \log\big(e^{w_0 x_i} + e^{w_1 x_i} + \dots + e^{w_k x_i} + \dots\big) \end{aligned} \tag{2}$$

$$\frac{\partial L_i}{\partial w_0} = \frac{e^{w_0 x_i}}{\sum_j e^{w_j x_i}} \cdot x_i^T = P_0 \cdot x_i^T$$

$$\frac{\partial L_i}{\partial w_1} = \frac{e^{w_1 x_i}}{\sum_j e^{w_j x_i}} \cdot x_i^T = P_1 \cdot x_i^T$$

...

$$\frac{\partial L_i}{\partial w_k} = \frac{e^{w_k x_i}}{\sum_j e^{w_j x_i}} \cdot x_i^T - x_i^T = (P_k - 1) \cdot x_i^T$$

...

where $P_j = \frac{e^{w_j x_i}}{\sum_j e^{w_j x_i}}$.

Therefore the gradient of the Softmax loss is:

$$\frac{\partial L_i}{\partial w_j} = \big(P_j - \mathbb{1}\{j = k\}\big) \cdot x_i^T$$
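
Again, a minimal per-example sketch (the helper name is mine, not the assignment's) applying this formula directly:

import numpy as np

# Softmax gradient for a single example, following the formula above.
# x_i: (D,), W: (D, C), k: index of the correct class.
def softmax_grad_single(x_i, W, k):
    scores = x_i.dot(W)
    scores -= np.max(scores)                     # numerical-stability shift
    p = np.exp(scores) / np.sum(np.exp(scores))  # P_j for every class j
    p[k] -= 1.0                                  # subtract the indicator 1{j == k}
    return np.outer(x_i, p)                      # dL_i/dW, shape (D, C)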

6. Gradient descent

This includes batch gradient descent, mini-batch gradient descent, and stochastic gradient descent (SGD, also called online gradient descent).
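
A minimal mini-batch SGD sketch on random toy data (`loss_and_grad` here is a softmax stand-in, not the assignment's function); it follows the same update pattern as LinearClassifier.train shown later:

import numpy as np

# Mini-batch SGD on random toy data (shapes chosen for illustration).
np.random.seed(0)
X = np.random.randn(500, 10)              # 500 examples, 10 features
y = np.random.randint(0, 3, 500)          # 3 classes
W = 0.001 * np.random.randn(10, 3)
learning_rate, batch_size = 1e-3, 64

def loss_and_grad(W, X_b, y_b):
    # stand-in for svm_loss_vectorized / softmax_loss_vectorized (softmax shown)
    scores = X_b.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(p[range(len(y_b)), y_b]))
    p[range(len(y_b)), y_b] -= 1
    return loss, X_b.T.dot(p) / len(y_b)

for it in range(100):
    idx = np.random.choice(X.shape[0], batch_size, replace=True)  # sample a mini-batch
    loss, grad = loss_and_grad(W, X[idx], y[idx])
    W -= learning_rate * grad                                     # update the weights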

7. Miscellaneous

Image data preprocessing: images are usually zero-centered by subtracting the mean.

Note: the mean here is computed from the training set only.

# first: compute the image mean based on the training data
mean_image = np.mean(X_train, axis=0)
# second: subtract the mean image from train and test data
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image
X_dev -= mean_image

Q1 assignment code

k_nearest_neighbor.py

import numpy as np
from scipy.stats import mode


class KNearestNeighbor(object):
    def __init__(self):
        pass

    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                dists[i][j] = np.sqrt(np.sum((X[i] - self.X_train[j]) ** 2))
        return dists

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            dists[i] = np.sqrt(np.sum((X[i, :] - self.X_train) ** 2, axis=1))
        return dists

    def compute_distances_no_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        # expand (x - y)^2 = x^2 - 2xy + y^2 and use broadcasting
        dists = np.dot(X, self.X_train.T) * -2
        sq1 = np.sum(np.square(X), axis=1, keepdims=True)
        sq2 = np.sum(np.square(self.X_train), axis=1)
        dists = np.add(dists, sq1)
        dists = np.add(dists, sq2)
        dists = np.sqrt(dists)
        return dists

    def predict_labels(self, dists, k=1):
        # take the labels of the k nearest neighbors and vote with the mode
        y_pred = mode(self.y_train[np.argsort(dists)[:, :k]], axis=1)[0]
        return y_pred.ravel()

There are two main ways to compute the distances between the test images and the training images in vectorized form (the compute_distances_no_loops function):

import numpy as np

u = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([[1, 0, 1], [4, 6, 9]])

# way 1: expand (u - v)^2 = u^2 - 2uv + v^2
dists = np.multiply(np.dot(u, v.T), -2)
sq1 = np.sum(np.square(u), axis=1, keepdims=True)
sq2 = np.sum(np.square(v), axis=1)
dists = np.add(dists, sq1)
dists = np.add(dists, sq2)
dists = np.sqrt(dists)

# way 2: broadcast over a new axis
diff = u[np.newaxis, :, :] - v[:, np.newaxis, :]
distance = np.sqrt(np.sum(diff ** 2, axis=-1))
ind = np.argmin(distance, axis=0)

The second way expands the 2-D arrays to 3-D; it is more concise and easier to understand (see the reference link). Its drawback is that it uses an enormous amount of memory: when I tried it in the assignment, it ran out of memory immediately. A dimensional strike, I suppose.

The cross-validation code in knn.ipynb:

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

k_to_accuracies = {}
classifier = KNearestNeighbor()
for k in k_choices:
    accuracy = []
    for n in range(num_folds):
        temp_X = X_train_folds[:]
        temp_y = y_train_folds[:]
        X_val = temp_X.pop(n)
        y_val = temp_y.pop(n)
        X_train_ = np.vstack(temp_X)
        y_train_ = np.vstack(temp_y).ravel()
        # retrain on the remaining folds every iteration
        classifier.train(X_train_, y_train_)
        y_test_pred = classifier.predict(X_val, k=k)
        # Compute and print the fraction of correctly predicted examples
        num_correct = np.sum(y_test_pred == y_val)
        accuracy.append(float(num_correct) / len(y_val))
    k_to_accuracies[k] = accuracy

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

Q2 assignment code

linear_svm.py

import numpy as np


def svm_loss_naive(W, X, y, reg):
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            # see equation (1): y[i] is the correct class k,
            # and the correct class contributes nothing to the loss
            if j == y[i]:
                continue
            # this is M_j
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            # only margins > 0 contribute to the loss
            if margin > 0:
                # compare with equation (1): the two loops take the partial
                # derivatives of each row of L_i with respect to w_j and w_k;
                # every row of L_i involves w_k, so dW[:, y[i]] is updated each time
                dW[:, y[i]] += -X[i].T
                dW[:, j] += X[i].T
                loss += margin

    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W
    return loss, dW


def svm_loss_vectorized(W, X, y, reg):
    num_train = X.shape[0]
    # all scores
    scores = X.dot(W)
    # scores of the correct classes
    correct_class_score = scores[range(num_train), y].reshape(-1, 1)
    # compute the margins
    margins = scores - correct_class_score + 1
    margins[margins < 0] = 0
    # the correct class must not contribute to the loss
    margins[range(num_train), y] = 0
    # total loss
    loss = np.sum(margins) / num_train + 0.5 * reg * np.sum(W * W)
    # compute the gradient
    margins[margins > 0] = 1.0
    row_sum = np.sum(margins, axis=1)  # shape (N,)
    margins[range(num_train), y] -= row_sum
    dW = np.dot(X.T, margins) / num_train + reg * W  # D by C
    return loss, dW

The gradient formulas for the SVM loss given above should be enough to write the svm_loss_naive function, but they are far from enough for the vectorized version. I spent a long time on the vectorized gradient; the figures below give a rough explanation that I hope aids understanding.

Figure 2. Gradient computation (schematic).

The Margins in Figure 2 differ from the Scores in Figure 1; they go through the following transformation:

# all scores
scores = X.dot(W)
# scores of the correct classes
correct_class_score = scores[range(num_train), y].reshape(-1, 1)
# compute the margins
margins = scores - correct_class_score + 1
margins[margins < 0] = 0

We now know the formula for $\frac{\partial L_i}{\partial w_j}$; the tricky part of vectorizing it is that the formula for $j = k$ differs from the other cases.

As shown in the figure above, suppose the correct class is class 3, i.e. $k = 3$. Then:

When $j = 3$: $\frac{\partial L_i}{\partial w_3} = \Big[-\sum_{j\neq 3}\mathbb{1}\{M_j>0\}\Big] \cdot x_i^T$

When $j = 0, 1, 2, 4$: $\frac{\partial L_i}{\partial w_j} = x_i^T$ (provided $M_j > 0$, otherwise 0)

The overall form, however, does not change: the gradient of $L_i$ with respect to $w_j$ is always some coefficient multiplied by $x_i^T$:

$$\frac{\partial L_i}{\partial w_j} = (\bullet) \cdot x_i^T$$

The key is to obtain this coefficient matrix.

Think carefully about what the coefficient matrix means: when $j \neq k$, the coefficient should be 1 (provided the margin is greater than 0); when $j = k$, the coefficient should be the number of elements in the margins that are greater than 0, with a negative sign.

Once this is understood, the coefficient matrix can be obtained in the following three steps:

Figure 3. Obtaining the coefficient matrix.

Multiplying this coefficient matrix by the data matrix X (mind the dimensions) then gives the final gradient.
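
Here is a minimal sketch of those three steps on toy data (shapes and values are made up; `margins` is assumed to be already clamped at 0 as in the snippet above):

import numpy as np

# Toy data: N = 4 examples, C = 5 classes.
np.random.seed(0)
N, C = 4, 5
margins = np.maximum(0, np.random.randn(N, C))
y = np.array([3, 0, 1, 2])              # correct class k for each example
margins[range(N), y] = 0                # the correct class never contributes

coeff = (margins > 0).astype(float)     # step 1: 1 where M_j > 0, else 0
row_sum = coeff.sum(axis=1)             # step 2: count of positive margins per example
coeff[range(N), y] = -row_sum           # step 3: coefficient for j == k

# final gradient (before averaging / regularization): X.T (D x N) dot coeff (N x C)
# dW = X.T.dot(coeff) / N + reg * W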

The cross-validation code in svm.ipynb:

learning_rates = [1e-7, 5e-5]
regularization_strengths = [2.5e4, 5e4]

results = {}
best_val = -1    # The highest validation accuracy that we have seen so far.
best_svm = None  # The LinearSVM object that achieved the highest validation rate.

for lr in np.linspace(learning_rates[0], learning_rates[1], 5):
    for reg in np.linspace(regularization_strengths[0], regularization_strengths[1], 5):
        svm = LinearSVM()
        loss_hist = svm.train(X_train, y_train, learning_rate=lr, reg=reg,
                              num_iters=1500, verbose=False)
        y_train_pred = svm.predict(X_train)
        y_val_pred = svm.predict(X_val)
        accuracy_train = np.mean(y_train.ravel() == y_train_pred)
        accuracy_val = np.mean(y_val.ravel() == y_val_pred)
        results[(lr, reg)] = [accuracy_train, accuracy_val]
        if best_val < accuracy_val:
            best_val = accuracy_val
            best_svm = svm

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
        lr, reg, train_accuracy, val_accuracy))

print('best validation accuracy achieved during cross-validation: %f' % best_val)

Figure: cross-validation loss.

Figure: the learned weights.

Q3 assignment code

softmax.py

import numpy as np


def softmax_loss_naive(W, X, y, reg):
    loss = 0.0
    dW = np.zeros_like(W)

    num_classes = W.shape[1]
    num_train = X.shape[0]
    for i in range(num_train):
        scores = X[i].dot(W)
        scores -= np.max(scores)  # numerical-stability shift
        loss += -np.log(np.exp(scores[y[i]]) / np.sum(np.exp(scores)))
        for j in range(num_classes):
            if j == y[i]:
                dW[:, j] += X[i].T * np.exp(scores[j]) / np.sum(np.exp(scores)) - X[i].T
            else:
                dW[:, j] += X[i].T * np.exp(scores[j]) / np.sum(np.exp(scores))

    loss /= num_train
    loss += 0.5 * reg * np.sum(W ** 2)
    dW /= num_train
    dW += reg * W
    return loss, dW


def softmax_loss_vectorized(W, X, y, reg):
    num_train = X.shape[0]

    # compute the loss
    scores = X.dot(W)
    scores -= np.max(scores, axis=1).reshape(-1, 1)
    loss = -np.sum(np.log(np.exp(scores[range(num_train), y]) / np.sum(np.exp(scores), axis=1)))
    loss /= num_train
    loss += 0.5 * reg * np.sum(W ** 2)

    # compute the gradient
    p = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
    t = np.zeros_like(p)
    t[range(num_train), y] = 1  # one-hot indicator 1{j == y_i}
    dW = X.T.dot(p - t)
    dW /= num_train
    dW += reg * W
    return loss, dW

Vectorizing the gradient of the softmax loss is much simpler than for the SVM loss; the key is again obtaining the coefficient matrix, so it is not described again here.

The cross-validation code in softmax.ipynb:

from cs231n.classifiers import Softmax

results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [2.5e4, 5e4]

for lr in np.linspace(learning_rates[0], learning_rates[1], 5):
    for reg in np.linspace(regularization_strengths[0], regularization_strengths[1], 5):
        softmax = Softmax()
        loss_hist = softmax.train(X_train, y_train, learning_rate=lr, reg=reg,
                                  num_iters=1500, verbose=True)
        y_train_pred = softmax.predict(X_train)
        y_val_pred = softmax.predict(X_val)
        accuracy_train = np.mean(y_train.ravel() == y_train_pred)
        accuracy_val = np.mean(y_val.ravel() == y_val_pred)
        results[(lr, reg)] = [accuracy_train, accuracy_val]
        if best_val < accuracy_val:
            best_val = accuracy_val
            best_softmax = softmax

# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
        lr, reg, train_accuracy, val_accuracy))

print('best validation accuracy achieved during cross-validation: %f' % best_val)

linear_classifier.py

import numpy as np
from cs231n.classifiers.linear_svm import *
from cs231n.classifiers.softmax import *


class LinearClassifier(object):
    def __init__(self):
        self.W = None

    def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
              batch_size=200, verbose=False):
        num_train, dim = X.shape
        num_classes = np.max(y) + 1  # assume y takes values 0...K-1 where K is number of classes
        if self.W is None:
            # lazily initialize W
            self.W = 0.001 * np.random.randn(dim, num_classes)

        # Run stochastic gradient descent to optimize W
        loss_history = []
        for it in range(num_iters):
            # sample a mini-batch
            indices = np.random.choice(num_train, size=batch_size, replace=True)
            X_batch = X[indices]
            y_batch = y[indices]

            # evaluate loss and gradient
            loss, grad = self.loss(X_batch, y_batch, reg)
            loss_history.append(loss)

            # parameter update
            self.W -= learning_rate * grad

            if verbose and it % 100 == 0:
                print('iteration %d / %d: loss %f' % (it, num_iters, loss))

        return loss_history

    def predict(self, X):
        y_pred = np.argmax(X.dot(self.W), axis=1)
        return y_pred

    def loss(self, X_batch, y_batch, reg):
        pass


class LinearSVM(LinearClassifier):
    """ A subclass that uses the Multiclass SVM loss function """

    def loss(self, X_batch, y_batch, reg):
        return svm_loss_vectorized(self.W, X_batch, y_batch, reg)


class Softmax(LinearClassifier):
    """ A subclass that uses the Softmax + Cross-entropy loss function """

    def loss(self, X_batch, y_batch, reg):
        return softmax_loss_vectorized(self.W, X_batch, y_batch, reg)
