CS231n Assignment3

04-03

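Background:

I. Spatial Localization and Detection

A single picture illustrates the difference between classification, localization, and detection:

1 Classification plus localization:

Input: an image

Output: a class label and a box in the image (x, y, w, h) (an overlap of at least 50% is required)

Evaluation metric: accuracy, Intersection over Union (IoU)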
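IoU can be computed directly from two (x, y, w, h) boxes. The following is a minimal sketch, assuming (x, y) is the top-left corner of the box (the notes do not say which corner is meant) and assuming the 50% requirement means IoU >= 0.5. The function name `iou` and the two example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner of the box.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 100, 100)   # hypothetical predicted box
truth = (30, 20, 100, 100)       # hypothetical ground-truth box
score = iou(predicted, truth)
is_correct = score >= 0.5        # assumed reading of the 50% requirement
```

Here the overlap is 80 x 90 = 7200 and the union is 12800, so the score is 0.5625 and the localization would count as correct under that threshold.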

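Implementation 1: treat localization as a regression problem:

Implementation 2: Sliding windows (OverFeat)

For images of different sizes, build on method 1 by sliding a window over the image at several scales (an image pyramid) and checking which scale and position scores highest to find the object's location. Sliding windows are very inefficient, however. Efficiency improves when the network's fully connected layers are replaced with convolutional layers, which also lets the network accept inputs of different sizes.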
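Why a fully connected layer can be swapped for a convolution is easiest to see on a toy case. The sketch below is not OverFeat itself: it assumes a single-channel image and one "fully connected" unit trained on k x k crops, and shows that cropping every window and applying that unit gives the same score map as using the same weights as a k x k convolution kernel. All sizes and values are arbitrary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
k = 3
w_fc = rng.standard_normal(k * k)      # weights of one FC unit on a flattened k x k crop
b_fc = 0.1
image = rng.standard_normal((5, 6))    # a larger input than the unit was "trained" on

# (a) Explicit sliding window: crop every k x k patch and run the FC unit on it
out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
slow = np.empty((out_h, out_w))
for r in range(out_h):
    for c in range(out_w):
        patch = image[r:r + k, c:c + k]
        slow[r, c] = patch.reshape(-1).dot(w_fc) + b_fc

# (b) The same weights reshaped into a k x k kernel and applied as a
#     (cross-correlation style) convolution over the whole image at once
kernel = w_fc.reshape(k, k)
windows = sliding_window_view(image, (k, k))           # shape (out_h, out_w, k, k)
fast = np.einsum('rcij,ij->rc', windows, kernel) + b_fc
```

Both `slow` and `fast` hold one score per window position, so a convolutional network produces the whole score map in one pass and works for any input at least k x k in size.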

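2 Object detection

The number of objects differs from image to image, so the number of coordinates to predict differs too, which makes regression a poor fit. Another approach is to treat detection as classification, but then we have to classify at many different positions, which is slow (because we have to choose the positions ourselves).

R-CNN:

Given an input image, we first propose about 2k candidate object boxes, then use a CNN to extract a 4096-dimensional feature vector from the image region in each box, and finally classify the object in each box with an SVM.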

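II. Recurrent Neural Networks

RNNs are mainly used to process sequence data. In a traditional neural network, data flows from the input layer to the hidden layer to the output layer; adjacent layers are fully connected, while the nodes within a layer are not connected to each other. Such ordinary networks are powerless on many problems. For example, predicting the next word of a sentence generally requires the preceding words, because the words in a sentence are not independent of each other. RNNs are called recurrent because the current output of a sequence also depends on the earlier outputs. Concretely, the network remembers earlier information and uses it when computing the current output: the hidden-layer nodes are now connected across time steps, and the hidden layer's input includes both the output of the input layer and the hidden layer's own output from the previous time step.

A network with one RNN layer and one output layer

To illustrate an RNN better, we can unroll the network over time:

In an RNN, every time step uses the same parameters (U, W, V). In general, the input and output differ at each time step: for sequence data, the items are fed in one after another, and each item maps to its own output (for example, the next item in the sequence).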
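The weight sharing is easy to see when the unrolled network is written as a loop. This is a minimal sketch, assuming the common convention that U maps input to hidden, W maps hidden to hidden, and V maps hidden to output (the notes name the three matrices but do not define them), with a tanh hidden layer and no biases. The function name `rnn_unrolled` and all sizes are illustrative.

```python
import numpy as np


def rnn_unrolled(xs, h0, U, W, V):
    """Run a simple RNN over a sequence, reusing U, W, V at every step.

    xs: (T, D) inputs, h0: (H,) initial hidden state,
    U: (D, H), W: (H, H), V: (H, M).
    Returns hidden states (T, H) and outputs (T, M).
    """
    h = h0
    hs, ys = [], []
    for x_t in xs:                      # one iteration per time step
        h = np.tanh(x_t @ U + h @ W)    # same U and W at every step
        hs.append(h)
        ys.append(h @ V)                # same V at every step
    return np.stack(hs), np.stack(ys)


rng = np.random.default_rng(0)
T, D, H, M = 4, 3, 5, 2
xs = rng.standard_normal((T, D))
U = 0.1 * rng.standard_normal((D, H))
W = 0.1 * rng.standard_normal((H, H))
V = 0.1 * rng.standard_normal((H, M))
hs, ys = rnn_unrolled(xs, np.zeros(H), U, W, V)
```

The parameter count does not depend on the sequence length T; only the number of loop iterations does.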

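An RNN can be used for image captioning

In the figure above, a CNN extracts features from the input image, and those features are fed in as the initial state of the RNN hidden layer (equivalent to the hidden layer's output at t = -1) for the first time step (t = 0). The RNN's output at each time step is the item that follows the current input item (for example, input "straw", output "hat").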
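The "input one word, output the next word" pairing amounts to shifting the caption by one position. A small sketch with a made-up caption follows; the <START> and <END> tokens match the special tokens used later in these notes, but the sentence itself is only an example.

```python
# A made-up caption wrapped in the special start and end tokens
caption = ["<START>", "a", "straw", "hat", "<END>"]

inputs = caption[:-1]    # everything except the last token is fed to the RNN
targets = caption[1:]    # everything except the first token is what it should emit

# Each pair is (word fed in at step t, word expected as output at step t)
pairs = list(zip(inputs, targets))
for word_in, word_out in pairs:
    print(word_in, "->", word_out)
```

The third pair is ("straw", "hat"), matching the example in the text.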

See the figure below for the detailed flow:

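III. LSTM

RNNs are prone to exploding and vanishing gradients during training, and LSTMs work better. Like an RNN, an LSTM repeats the same module along the time sequence, but each LSTM module is more complex than an RNN module and has four layers (3 gates + 1 memory cell). The horizontal line along the top of the box in the figure below is called the cell state. The LSTM uses its gates to modify the information on the memory cell linearly, which helps keep the link between earlier and later information from decaying when the sequence becomes very long.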
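The vanishing-gradient effect can be seen numerically on a toy case. The sketch below assumes a scalar RNN h_t = tanh(w * h_{t-1}) with no input, which is far simpler than a real network, and multiplies the per-step derivatives w * (1 - h_t^2) along the chain rule. The last line is only a rough illustration of the direct cell-state path of an LSTM, where the per-step factor is the forget gate, held here at a hypothetical constant 0.99; it ignores every other gradient path.

```python
import math


def gradient_through_time(w, steps, h0=0.5):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # chain rule through one time step
    return abs(grad)


short_range = gradient_through_time(w=0.9, steps=5)
long_range = gradient_through_time(w=0.9, steps=50)

# Direct cell-state path of an LSTM with a hypothetical constant forget gate
cell_path = 0.99 ** 50
```

With w = 0.9 every factor is at most 0.9, so the gradient over 50 steps is below 0.9^50, roughly 0.005, while the illustrative cell-state path keeps a factor of about 0.6.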

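Assignment:

In this assignment you will implement recurrent networks and apply them to image captioning on Microsoft COCO. We will also introduce the TinyImageNet dataset and use a pretrained model on it to explore different applications of image gradients. The goals of this assignment are:

Understand the architecture of recurrent neural networks (RNNs) and how they operate on sequences by sharing weights over time.
Understand the difference between vanilla RNNs and Long Short-Term Memory (LSTM) RNNs.
Understand how to sample sequences from an RNN at test time.
Understand how to combine convolutional and recurrent networks to implement image captioning.
Understand how a trained convolutional network can be used to compute gradients with respect to the input image.
Perform efficient cross-validation and find the best hyperparameters for the network architecture.
Implement different applications of image gradients, such as saliency maps, fooling images, class visualization, feature inversion, and DeepDream.

I. Image captioning with a vanilla RNN

In this exercise we use the 2014 release of the Microsoft COCO dataset, which has become the standard testbed for image captioning. The dataset contains about 80,000 training images and 40,000 validation images. The data have already been preprocessed and the features extracted: for all images, features were taken from the fc7 layer of a VGG-16 network pretrained on ImageNet, so there is no need to run a CNN for feature extraction. To reduce processing time and memory requirements, the feature dimensionality was reduced from 4096 to 512.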

RNN_Captioning (main)

import numpy as np
from cs231n.coco_utils import load_coco_data

data = load_coco_data(pca_features=True)

# Print out all the keys and values from the data dictionary
for k, v in data.items():
    if type(v) == np.ndarray:
        print(k, type(v), v.shape, v.dtype)
    else:
        print(k, type(v), len(v))

Output (as recorded from the original Python 2 run):

idx_to_word <type 'list'> 1004
train_captions <type 'numpy.ndarray'> (400135, 17) int32
val_captions <type 'numpy.ndarray'> (195954, 17) int32
train_image_idxs <type 'numpy.ndarray'> (400135,) int32
val_features <type 'numpy.ndarray'> (40504, 512) float32
val_image_idxs <type 'numpy.ndarray'> (195954,) int32
train_features <type 'numpy.ndarray'> (82783, 512) float32
train_urls <type 'numpy.ndarray'> (82783,) |S63
val_urls <type 'numpy.ndarray'> (40504,) |S63
word_to_idx <type 'dict'> 1004

rnn.layers

import numpy as np


def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    The input data has dimension D, the hidden state has dimension H, and we
    use a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Dict of values needed for the backward pass.
    """
    next_h = np.tanh(np.dot(prev_h, Wh) + np.dot(x, Wx) + b)
    cache = {
        'prev_h': prev_h,
        'x': x,
        'Wx': Wx,
        'Wh': Wh,
        'b': b,
        'next_h': next_h,
    }
    return next_h, cache


def rnn_step_backward(dnext_h, cache):
    """
    Derivatives used here:
    1. sigmoid: f(z) = 1 / (1 + exp(-z)), derivative f'(z) = f(z) * (1 - f(z))
    2. tanh:    f(z) = tanh(z),           derivative f'(z) = 1 - f(z)^2

    Inputs:
    - dnext_h: Gradient of loss with respect to next hidden state
    - cache: Cache object from the forward pass

    Returns a tuple of:
    - dx: Gradients of input data, of shape (N, D)
    - dprev_h: Gradients of previous hidden state, of shape (N, H)
    - dWx: Gradients of input-to-hidden weights, of shape (D, H)
    - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
    - db: Gradients of bias vector, of shape (H,)
    """
    x = cache['x']
    prev_h = cache['prev_h']
    Wx = cache['Wx']
    Wh = cache['Wh']
    next_h = cache['next_h']

    # Backprop through tanh: d tanh(a) / da = 1 - tanh(a)^2
    daffine_output = dnext_h * (1 - next_h * next_h)
    dx = daffine_output.dot(Wx.T)
    dprev_h = daffine_output.dot(Wh.T)
    dWx = x.T.dot(daffine_output)
    dWh = prev_h.T.dot(daffine_output)
    db = np.sum(daffine_output, axis=0)
    return dx, dprev_h, dWx, dWh, db


def rnn_forward(x, h0, Wx, Wh, b):
    """
    We assume each input sequence consists of T vectors, each of dimension D.
    The RNN uses a hidden size of H, and there are N sequences.

    Inputs:
    - x: Input data for the entire timeseries, of shape (N, T, D).
    - h0: Initial hidden state, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - h: Hidden states for the entire timeseries, of shape (N, T, H).
    - cache: Values needed in the backward pass
    """
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    h_interm = h0
    cache = []
    for i in range(T):
        h[:, i, :], cache_sub = rnn_step_forward(x[:, i, :], h_interm, Wx, Wh, b)
        h_interm = h[:, i, :]
        cache.append(cache_sub)
    return h, cache


def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H)
    - cache: List of per-timestep caches from rnn_forward

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    N, T, H = dh.shape
    D = cache[0]['x'].shape[1]
    dx = np.zeros((N, T, D))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    dprev_h_ = np.zeros((N, H))
    for i in range(T - 1, -1, -1):
        # Note the accumulation: the gradient reaching h_i is the upstream
        # dh[:, i, :] plus dprev_h_ flowing back from step i + 1 (zero at the
        # last step, where only dh contributes).
        dx_, dprev_h_, dWx_, dWh_, db_ = rnn_step_backward(
            dh[:, i, :] + dprev_h_, cache[i])
        dx[:, i, :] = dx_
        dWx += dWx_
        dWh += dWh_
        db += db_
    dh0 = dprev_h_
    return dx, dh0, dWx, dWh, db


def word_embedding_forward(x, W):
    """
    In deep learning systems we usually represent words as vectors. Each word
    is associated with a vector, and these vectors are learned jointly with the
    rest of the system. (The words themselves are represented by integers.)
    There are N sequences, each of length T, and each word maps to a
    D-dimensional vector.

    Inputs:
    - x: Integer array of shape (N, T) giving indices of words. Each element
      idx of x must be in the range 0 <= idx < V; each integer corresponds to
      one word.
    - W: Weight matrix of shape (V, D) giving word vectors for all words
      (this is W_embed).

    Returns a tuple of:
    - out: Array of shape (N, T, D) giving word vectors for all input words.
    - cache: Values needed for the backward pass
    """
    out = W[x]   # look up one row of W per word index
    cache = (x, W)
    return out, cache


def word_embedding_backward(dout, cache):
    """
    Inputs:
    - dout: Upstream gradients of shape (N, T, D)
    - cache: Values from the forward pass

    Returns:
    - dW: Gradient of word embedding matrix, of shape (V, D).
    """
    x, W = cache
    dW = np.zeros_like(W)
    # Accumulate gradients; the same word index may occur many times in x
    np.add.at(dW, x, dout)
    return dW


# At every timestep we use an affine function to turn the RNN hidden vector at
# that timestep into scores for each word in the vocabulary. This is very
# similar to the affine layer from Assignment 2.
def temporal_affine_forward(x, w, b):
    """
    Forward pass for a temporal affine layer. The input is a set of
    D-dimensional vectors arranged into a minibatch of N timeseries, each of
    length T. We use an affine function to transform each of those vectors into
    a new vector of dimension M.

    Inputs:
    - x: Input data of shape (N, T, D) (words already converted to vectors)
    - w: Weights of shape (D, M)
    - b: Biases of shape (M,)

    Returns a tuple of:
    - out: Output data of shape (N, T, M)
    - cache: Values needed for the backward pass
    """
    N, T, D = x.shape
    M = b.shape[0]
    out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b
    cache = x, w, b, out
    return out, cache


def temporal_affine_backward(dout, cache):
    """
    Backward pass for temporal affine layer.

    Input:
    - dout: Upstream gradients of shape (N, T, M)
    - cache: Values from forward pass

    Returns a tuple of:
    - dx: Gradient of input, of shape (N, T, D)
    - dw: Gradient of weights, of shape (D, M)
    - db: Gradient of biases, of shape (M,)
    """
    x, w, b, out = cache
    N, T, D = x.shape
    M = b.shape[0]
    dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D)
    dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T
    db = dout.sum(axis=(0, 1))
    return dx, dw, db


def temporal_softmax_loss(x, y, mask, verbose=False):
    """
    A temporal version of softmax loss for use in RNNs. We assume that we are
    making predictions over a vocabulary of size V for each timestep of a
    timeseries of length T, over a minibatch of size N. The input x gives
    scores for all vocabulary elements at all timesteps, and y gives the
    indices of the ground-truth element at each timestep. We use a
    cross-entropy loss at each timestep, summing the loss over all timesteps
    and averaging across the minibatch.

    As an additional complication, we may want to ignore the model output at
    some timesteps, since sequences of different length may have been combined
    into a minibatch and padded with NULL tokens. Captions are padded to a
    common length by appending <NULL> tokens, and these should not count toward
    the loss or gradient. The mask argument tells us which elements should
    contribute to the loss.

    Inputs:
    - x: Input scores, of shape (N, T, V)
    - y: Ground-truth indices, of shape (N, T) where each element is in the
      range 0 <= y[i, t] < V
    - mask: Boolean array of shape (N, T) where mask[i, t] tells whether or not
      the scores at x[i, t] should contribute to the loss.

    Returns a tuple of:
    - loss: Scalar giving loss
    - dx: Gradient of loss with respect to scores x.
    """
    N, T, V = x.shape
    x_flat = x.reshape(N * T, V)
    y_flat = y.reshape(N * T)
    mask_flat = mask.reshape(N * T)

    # Softmax with the row maximum subtracted for numerical stability
    probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N
    dx_flat = probs.copy()
    dx_flat[np.arange(N * T), y_flat] -= 1
    dx_flat /= N
    dx_flat *= mask_flat[:, None]

    if verbose:
        print('dx_flat: ', dx_flat.shape)

    dx = dx_flat.reshape(N, T, V)
    return loss, dx
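A backward pass like rnn_step_backward is usually verified against a numerical gradient. The following is a self-contained sketch of such a check, not the course's own gradient-check utility: `step_forward` and `step_backward` restate the tanh step in compact form, and `numeric_grad` is a hypothetical helper that uses centered differences. Shapes and the random seed are arbitrary.

```python
import numpy as np


def step_forward(x, prev_h, Wx, Wh, b):
    # One vanilla RNN step with a tanh nonlinearity
    return np.tanh(x @ Wx + prev_h @ Wh + b)


def step_backward(dnext_h, x, prev_h, Wx, Wh, next_h):
    # d tanh(a) / da = 1 - tanh(a)^2, then backprop through the affine part
    da = dnext_h * (1 - next_h ** 2)
    return da @ Wx.T, da @ Wh.T, x.T @ da, prev_h.T @ da, da.sum(axis=0)


def numeric_grad(f, arr, dout, eps=1e-5):
    """Centered-difference gradient of sum(f() * dout) with respect to arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = f()
        arr[idx] = old - eps
        minus = f()
        arr[idx] = old                      # restore the original value
        grad[idx] = np.sum((plus - minus) * dout) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
N, D, H = 3, 4, 5
x = rng.standard_normal((N, D))
prev_h = rng.standard_normal((N, H))
Wx = rng.standard_normal((D, H))
Wh = rng.standard_normal((H, H))
b = rng.standard_normal(H)
dnext_h = rng.standard_normal((N, H))

next_h = step_forward(x, prev_h, Wx, Wh, b)
dx, dprev_h, dWx, dWh, db = step_backward(dnext_h, x, prev_h, Wx, Wh, next_h)

f = lambda: step_forward(x, prev_h, Wx, Wh, b)
max_err_Wx = np.max(np.abs(dWx - numeric_grad(f, Wx, dnext_h)))
max_err_b = np.max(np.abs(db - numeric_grad(f, b, dnext_h)))
```

Both errors should be tiny in float64; a large value usually points to a wrong transpose or a missing tanh derivative.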

Rnn

import numpy as np

from cs231n.layers import *
from cs231n.rnn_layers import *


class CaptioningRNN(object):
    """
    The RNN takes D-dimensional input vectors and has a vocabulary of V words.
    Note that CaptioningRNN does not use any form of regularization.
    """

    def __init__(self, word_to_idx, input_dim=512, wordvec_dim=128,
                 hidden_dim=128, cell_type='rnn', dtype=np.float32):
        """
        Construct a new CaptioningRNN instance.

        Inputs:
        - word_to_idx: A dictionary giving the vocabulary. It contains V
          entries, and maps each string to a unique integer in the range
          [0, V), so every word corresponds to one unique number.
        - input_dim: Dimension D of input image feature vectors.
        - wordvec_dim: Dimension W of word vectors.
        - hidden_dim: Dimension H for the hidden state of the RNN.
        - cell_type: What type of RNN to use; either 'rnn' or 'lstm'.
        - dtype: numpy datatype to use; use float32 for training and float64
          for numeric gradient checking.
        """
        if cell_type not in {'rnn', 'lstm'}:
            raise ValueError('Invalid cell_type "%s"' % cell_type)

        self.cell_type = cell_type
        self.dtype = dtype
        self.word_to_idx = word_to_idx
        self.idx_to_word = {i: w for w, i in word_to_idx.items()}
        self.params = {}

        vocab_size = len(word_to_idx)

        self._null = word_to_idx['<NULL>']
        self._start = word_to_idx.get('<START>', None)
        self._end = word_to_idx.get('<END>', None)

        # Initialize word vectors
        self.params['W_embed'] = np.random.randn(vocab_size, wordvec_dim)
        self.params['W_embed'] /= 100

        # Initialize CNN -> hidden state projection parameters.
        # The image features are (N, D), so W_proj is (D, H).
        self.params['W_proj'] = np.random.randn(input_dim, hidden_dim)
        self.params['W_proj'] /= np.sqrt(input_dim)
        self.params['b_proj'] = np.zeros(hidden_dim)

        # Initialize parameters for the RNN; Wx is (W, dim_mul * H)
        dim_mul = {'lstm': 4, 'rnn': 1}[cell_type]
        self.params['Wx'] = np.random.randn(wordvec_dim, dim_mul * hidden_dim)
        self.params['Wx'] /= np.sqrt(wordvec_dim)
        self.params['Wh'] = np.random.randn(hidden_dim, dim_mul * hidden_dim)
        self.params['Wh'] /= np.sqrt(hidden_dim)
        self.params['b'] = np.zeros(dim_mul * hidden_dim)

        # Initialize output to vocab weights; W_vocab is (H, V)
        self.params['W_vocab'] = np.random.randn(hidden_dim, vocab_size)
        self.params['W_vocab'] /= np.sqrt(hidden_dim)
        self.params['b_vocab'] = np.zeros(vocab_size)

        # Cast parameters to correct dtype
        for k, v in self.params.items():
            self.params[k] = v.astype(self.dtype)

    def loss(self, features, captions):
        """
        Compute training-time loss for the RNN. We input image features and
        ground-truth captions for those images, and use an RNN (or LSTM) to
        compute loss and gradients on all parameters.

        Inputs:
        - features: Input image features, of shape (N, D)
        - captions: Ground-truth captions; an integer array of shape (N, T)
          where each element is in the range 0 <= y[i, t] < V

        Returns a tuple of:
        - loss: Scalar loss
        - grads: Dictionary of gradients parallel to self.params
        """
        # Cut captions into two pieces: captions_in has everything but the
        # last word and will be input to the RNN; captions_out has everything
        # but the first word and this is what we will expect the RNN to
        # generate. These are offset by one relative to each other because the
        # RNN should produce word (t+1) after receiving word t. The first
        # element of captions_in will be the START token, and the first
        # element of captions_out will be the first word.
        captions_in = captions[:, :-1]
        captions_out = captions[:, 1:]

        mask = (captions_out != self._null)

        # Weight and bias for the affine transform from image features to
        # initial hidden state
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

        # Word embedding matrix
        W_embed = self.params['W_embed']

        # Input-to-hidden, hidden-to-hidden, and biases for the RNN
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

        # Weight and bias for the hidden-to-vocab transformation.
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        loss, grads = 0.0, {}
        # In the forward pass you will need to do the following:
        # (1) Use an affine transformation to compute the initial hidden state
        #     from the image features. This should produce an array of shape
        #     (N, H).
        # (2) Use a word embedding layer to transform the words in captions_in
        #     from indices to vectors, giving an array of shape (N, T, W).
        # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type)
        #     to process the sequence of input word vectors and produce hidden
        #     state vectors for all timesteps, producing an array of shape
        #     (N, T, H).
        # (4) Use a (temporal) affine transformation to compute scores over
        #     the vocabulary at every timestep using the hidden states, giving
        #     an array of shape (N, T, V).
        # (5) Use (temporal) softmax to compute loss using captions_out,
        #     ignoring the points where the output word is <NULL> using the
        #     mask above.
        #
        # In the backward pass you will need to compute the gradient of the
        # loss with respect to all model parameters. Use the loss and grads
        # variables defined above to store loss and gradients; grads[k] should
        # give the gradients for self.params[k].

        # (1) (N, D) x (D, H)
        initial_h = np.dot(features, W_proj) + b_proj
        # (2)
        embed_word, embed_word_cache = word_embedding_forward(captions_in, W_embed)
        # (3)
        if self.cell_type == 'rnn':
            h, h_cache = rnn_forward(embed_word, initial_h, Wx, Wh, b)
        elif self.cell_type == 'lstm':
            h, h_cache = lstm_forward(embed_word, initial_h, Wx, Wh, b)
        # (4)
        affine_forward_out, affine_forward_cache = temporal_affine_forward(
            h, W_vocab, b_vocab)
        # (5)
        loss, dscore = temporal_softmax_loss(
            affine_forward_out, captions_out, mask, verbose=False)

        # Backprop, in the reverse order of the forward pass
        daffine_out, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(
            dscore, affine_forward_cache)
        if self.cell_type == 'rnn':
            dword_vector, dh0, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(
                daffine_out, h_cache)
        elif self.cell_type == 'lstm':
            dword_vector, dh0, grads['Wx'], grads['Wh'], grads['b'] = lstm_backward(
                daffine_out, h_cache)
        grads['W_embed'] = word_embedding_backward(dword_vector, embed_word_cache)
        grads['W_proj'] = features.T.dot(dh0)
        grads['b_proj'] = np.sum(dh0, axis=0)

        return loss, grads

    def sample(self, features, max_length=30):
        """
        Run a test-time forward pass for the model, sampling captions for
        input feature vectors.

        At each timestep, we embed the current word, pass it and the previous
        hidden state to the RNN to get the next hidden state, use the hidden
        state to get scores for all vocab words, and choose the word with the
        highest score as the next word. The initial hidden state is computed
        by applying an affine transform to the input image features, and the
        initial word is the <START> token.

        For LSTMs you will also have to keep track of the cell state; in that
        case the initial cell state should be zero.

        Inputs:
        - features: Array of input image features of shape (N, D).
        - max_length: Maximum length T of generated captions.

        Returns:
        - captions: Array of shape (N, max_length) giving sampled captions,
          where each element is an integer in the range [0, V). The first
          element of captions should be the first sampled word, not the
          <START> token.
        """
        N = features.shape[0]
        captions = self._null * np.ones((N, max_length), dtype=np.int32)

        # Unpack parameters
        W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
        W_embed = self.params['W_embed']
        Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
        W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

        # Initialize the hidden state of the RNN by applying the learned
        # affine transform to the input image features. The first word that
        # you feed to the RNN should be the <START> token; its value is stored
        # in the variable self._start. At each timestep you will need to:
        # (1) Embed the previous word using the learned word embeddings
        # (2) Make an RNN step using the previous hidden state and the
        #     embedded current word to get the next hidden state.
        # (3) Apply the learned affine transformation to the next hidden state
        #     to get scores for all words in the vocabulary
        # (4) Select the word with the highest score as the next word, writing
        #     it to the appropriate slot in the captions variable
        #
        # For simplicity, you do not need to stop generating after an <END>
        # token is sampled, but you can if you want to.
        #
        # HINT: You will not be able to use the rnn_forward or lstm_forward
        # functions; you'll need to call rnn_step_forward or
        # lstm_step_forward in a loop.
        prev_h = features.dot(W_proj) + b_proj
        prev_c = np.zeros(prev_h.shape)

        # self._start is the index of the word <START>
        current_word_index = [self._start] * N

        for i in range(max_length):
            x = W_embed[current_word_index]  # word vectors for the word indices
            if self.cell_type == 'rnn':
                next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)
            elif self.cell_type == 'lstm':
                next_h, next_c, _ = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
                prev_c = next_c
            prev_h = next_h

            # temporal_affine_forward expects (N, T, D), so add a T = 1 axis
            next_h = np.expand_dims(next_h, axis=1)
            score, _ = temporal_affine_forward(next_h, W_vocab, b_vocab)
            # score is (N, 1, V); take the best word for each sequence
            captions[:, i] = np.argmax(score[:, 0, :], axis=1)
            current_word_index = captions[:, i]

        return captions
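The integer array returned by sample still has to be mapped back to words before it can be read. This is a minimal sketch of that step; `decode_caption` is a hypothetical helper written for this note (the course code ships its own decoding utility, which may behave differently), and the six-word vocabulary is a toy stand-in for the real idx_to_word list.

```python
def decode_caption(indices, idx_to_word):
    """Turn a sequence of word indices into a string.

    Stops at the first <END> token and skips <START> and <NULL> padding.
    """
    words = []
    for idx in indices:
        word = idx_to_word[idx]
        if word == "<END>":
            break
        if word not in ("<START>", "<NULL>"):
            words.append(word)
    return " ".join(words)


# Toy vocabulary; the real one has about a thousand entries
idx_to_word = ["<NULL>", "<START>", "<END>", "a", "straw", "hat"]
sampled = [3, 4, 5, 2, 0, 0]          # one row of a sampled captions array
text = decode_caption(sampled, idx_to_word)
print(text)
```

Because sample does not stop after <END>, anything generated after that token is discarded here.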

captioning_solver

import numpy as np

from cs231n import optim
from cs231n.coco_utils import sample_coco_minibatch


class CaptioningSolver(object):
    """
    A CaptioningSolver encapsulates all the logic necessary for training
    image captioning models. The CaptioningSolver performs stochastic gradient
    descent using different update rules defined in optim.py.

    The solver accepts both training and validation data and labels so it can
    periodically check classification accuracy on both training and validation
    data to watch out for overfitting.

    To train a model, you will first construct a CaptioningSolver instance,
    passing the model, dataset, and various options (learning rate, batch
    size, etc) to the constructor. You will then call the train() method to
    run the optimization procedure and train the model.

    After the train() method returns, model.params will contain the parameters
    that performed best on the validation set over the course of training.
    In addition, the instance variable solver.loss_history will contain a list
    of all losses encountered during training and the instance variables
    solver.train_acc_history and solver.val_acc_history will be lists
    containing the accuracies of the model on the training and validation set
    at each epoch.

    Example usage might look something like this:

    data = load_coco_data()
    model = MyAwesomeModel(hidden_dim=100)
    solver = CaptioningSolver(model, data,
                    update_rule='sgd',
                    optim_config={
                      'learning_rate': 1e-3,
                    },
                    lr_decay=0.95,
                    num_epochs=10, batch_size=100,
                    print_every=100)
    solver.train()

    A CaptioningSolver works on a model object that must conform to the
    following API:

    - model.params must be a dictionary mapping string parameter names to
      numpy arrays containing parameter values.

    - model.loss(features, captions) must be a function that computes
      training-time loss and gradients, with the following inputs and outputs:

      Inputs:
      - features: Array giving a minibatch of features for images, of shape
        (N, D)
      - captions: Array of captions for those images, of shape (N, T) where
        each element is in the range [0, V).

      Returns:
      - loss: Scalar giving the loss
      - grads: Dictionary with the same keys as self.params mapping parameter
        names to gradients of the loss with respect to those parameters.
    """

    def __init__(self, model, data, **kwargs):
        """
        Construct a new CaptioningSolver instance.

        Required arguments:
        - model: A model object conforming to the API described above
        - data: A dictionary of training and validation data from
          load_coco_data

        Optional arguments:
        - update_rule: A string giving the name of an update rule in optim.py.
          Default is 'sgd'.
        - optim_config: A dictionary containing hyperparameters that will be
          passed to the chosen update rule. Each update rule requires
          different hyperparameters (see optim.py) but all update rules
          require a 'learning_rate' parameter so that should always be present.
        - lr_decay: A scalar for learning rate decay; after each epoch the
          learning rate is multiplied by this value.
        - batch_size: Size of minibatches used to compute loss and gradient
          during training.
        - num_epochs: The number of epochs to run for during training.
        - print_every: Integer; training losses will be printed every
          print_every iterations.
        - verbose: Boolean; if set to false then no output will be printed
          during training.
        """
        self.model = model
        self.data = data

        # Unpack keyword arguments
        self.update_rule = kwargs.pop('update_rule', 'sgd')
        self.optim_config = kwargs.pop('optim_config', {})
        self.lr_decay = kwargs.pop('lr_decay', 1.0)
        self.batch_size = kwargs.pop('batch_size', 100)
        self.num_epochs = kwargs.pop('num_epochs', 10)

        self.print_every = kwargs.pop('print_every', 10)
        self.verbose = kwargs.pop('verbose', True)

        # Throw an error if there are extra keyword arguments
        if len(kwargs) > 0:
            extra = ', '.join('"%s"' % k for k in kwargs.keys())
            raise ValueError('Unrecognized arguments %s' % extra)

        # Make sure the update rule exists, then replace the string
        # name with the actual function
        if not hasattr(optim, self.update_rule):
            raise ValueError('Invalid update_rule "%s"' % self.update_rule)
        self.update_rule = getattr(optim, self.update_rule)

        self._reset()

    def _reset(self):
        """
        Set up some book-keeping variables for optimization. Don't call this
        manually.
        """
        # Set up some variables for book-keeping
        self.epoch = 0
        self.best_val_acc = 0
        self.best_params = {}
        self.loss_history = []
        self.train_acc_history = []
        self.val_acc_history = []

        # Make a deep copy of the optim_config for each parameter
        self.optim_configs = {}
        for p in self.model.params:
            d = {k: v for k, v in self.optim_config.items()}
            self.optim_configs[p] = d

    def _step(self):
        """
        Make a single gradient update. This is called by train() and should
        not be called manually.
        """
        # Make a minibatch of training data
        minibatch = sample_coco_minibatch(self.data,
                                          batch_size=self.batch_size,
                                          split='train')
        captions, features, urls = minibatch

        # Compute loss and gradient
        loss, grads = self.model.loss(features, captions)
        self.loss_history.append(loss)

        # Perform a parameter update
        for p, w in self.model.params.items():
            dw = grads[p]
            config = self.optim_configs[p]
            next_w, next_config = self.update_rule(w, dw, config)
            self.model.params[p] = next_w
            self.optim_configs[p] = next_config

    # TODO: This does nothing right now; maybe implement BLEU?
    def check_accuracy(self, X, y, num_samples=None, batch_size=100):
        """
        Check accuracy of the model on the provided data.

        Inputs:
        - X: Array of data, of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,)
        - num_samples: If not None, subsample the data and only test the model
          on num_samples datapoints.
        - batch_size: Split X and y into batches of this size to avoid using
          too much memory.

        Returns:
        - acc: Scalar giving the fraction of instances that were correctly
          classified by the model.
        """
        return 0.0

        # Everything below is unreachable until the early return is removed.
        # Maybe subsample the data
        N = X.shape[0]
        if num_samples is not None and N > num_samples:
            mask = np.random.choice(N, num_samples)
            N = num_samples
            X = X[mask]
            y = y[mask]

        # Compute predictions in batches
        num_batches = N // batch_size
        if N % batch_size != 0:
            num_batches += 1
        y_pred = []
        for i in range(num_batches):
            start = i * batch_size
            end = (i + 1) * batch_size
            scores = self.model.loss(X[start:end])
            y_pred.append(np.argmax(scores, axis=1))
        y_pred = np.hstack(y_pred)
        acc = np.mean(y_pred == y)

        return acc

    def train(self):
        """
        Run optimization to train the model.
        """
        num_train = self.data['train_captions'].shape[0]
        iterations_per_epoch = max(num_train // self.batch_size, 1)
        num_iterations = self.num_epochs * iterations_per_epoch

        for t in range(num_iterations):
            self._step()

            # Maybe print training loss
            if self.verbose and t % self.print_every == 0:
                print('(Iteration %d / %d) loss: %f' % (
                    t + 1, num_iterations, self.loss_history[-1]))

            # At the end of every epoch, increment the epoch counter and decay
            # the learning rate.
            epoch_end = (t + 1) % iterations_per_epoch == 0
            if epoch_end:
                self.epoch += 1
                for k in self.optim_configs:
                    self.optim_configs[k]['learning_rate'] *= self.lr_decay

            # Check train and val accuracy on the first iteration, the last
            # iteration, and at the end of each epoch.
            # TODO: Implement some logic to check Bleu on validation set
            # periodically

        # At the end of training swap the best params into the model
        # self.model.params = self.best_params

Overfit a small batch of data:

small_data = load_coco_data(max_train=50)

small_rnn_model = CaptioningRNN(
    cell_type='rnn',
    word_to_idx=data['word_to_idx'],
    input_dim=data['train_features'].shape[1],
    hidden_dim=512,
    wordvec_dim=256,
)

small_rnn_solver = CaptioningSolver(
    small_rnn_model, small_data,
    update_rule='adam',
    num_epochs=50,
    batch_size=25,
    optim_config={
        'learning_rate': 5e-3,
    },
    lr_decay=0.95,
    verbose=True, print_every=10,
)

small_rnn_solver.train()

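Once the model has overfit the small set, we move on to testing. Unlike classification models, image captioning models behave very differently at training time and at test time. At training time we feed the ground-truth word into the RNN at every time step. At test time we pick a word from the distribution over the vocabulary at each time step and feed that word to the RNN as the input for the next time step (the sample method above picks the highest-scoring word).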
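At test time, "picking a word" from the per-step scores can mean either taking the highest-scoring word or drawing from the softmax distribution. The sketch below contrasts the two on one made-up score vector; the function names `pick_greedy` and `pick_sampled` and the scores are assumptions for illustration, and the loop that feeds the chosen word back into the RNN is left out.

```python
import math
import random


def softmax(scores):
    """Convert raw scores to probabilities (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(scores):
    # Always take the highest-scoring word index
    return max(range(len(scores)), key=lambda i: scores[i])


def pick_sampled(scores, rng):
    # Draw a word index at random, weighted by the softmax probabilities
    probs = softmax(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


scores = [0.1, 2.0, 0.3, 1.5]            # made-up scores for a 4-word vocabulary
greedy = pick_greedy(scores)

rng = random.Random(0)
draws = [pick_sampled(scores, rng) for _ in range(1000)]
```

Greedy selection always returns index 1 here, while sampling returns index 1 about half the time and the other words in proportion to their probabilities, which gives more varied captions.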

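II. Image captioning with an LSTM

As with the vanilla RNN, at each time step we receive an input $x_t$ (D-dimensional) and the previous hidden state $h_{t-1}$ (H-dimensional).

The LSTM also maintains an H-dimensional cell state, so we also receive the previous cell state $c_{t-1}$.

The learnable parameters of the LSTM are the input-to-hidden matrix $W_x$, of shape (D, 4H),

the hidden-to-hidden matrix $W_h$, of shape (H, 4H),

and the bias vector $b$, of shape (4H,). At each time step we first compute the activation vector $a$, of length 4H: $a = x_t W_x + h_{t-1} W_h + b$

We then split $a$ into four vectors $a_i, a_f, a_o, a_g$, which hold consecutive blocks of H elements of $a$. From these we compute, in order, the input gate, forget gate, output gate, and block input: $i = \sigma(a_i)$, $f = \sigma(a_f)$, $o = \sigma(a_o)$, $g = \tanh(a_g)$.

Finally, we compute the next cell state and the next hidden state, where $\odot$ is the elementwise product: $c_t = f \odot c_{t-1} + i \odot g$, $h_t = o \odot \tanh(c_t)$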

rnn.layers

# sigmoid(z) is the logistic function 1 / (1 + exp(-z)); it is assumed to be
# defined earlier in this file.


def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we
    use a minibatch size of N.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    N, D = x.shape
    _, H = prev_h.shape
    a = np.dot(x, Wx) + np.dot(prev_h, Wh) + b
    i = sigmoid(a[:, 0:H])            # input gate
    f = sigmoid(a[:, H:2 * H])        # forget gate
    o = sigmoid(a[:, 2 * H:3 * H])    # output gate
    g = np.tanh(a[:, 3 * H:4 * H])    # block input
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)

    cache = (x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_h, next_c)
    return next_h, next_c, cache


def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_h, next_c = cache

    # next_h = o * tanh(next_c): add the gradient reaching next_c via next_h
    dnext_c = dnext_c + o * (1 - np.tanh(next_c) ** 2) * dnext_h
    di = dnext_c * g                  # next_c = f * prev_c + i * g
    df = dnext_c * prev_c             # next_c = f * prev_c + i * g
    do = dnext_h * np.tanh(next_c)    # next_h = o * tanh(next_c)
    dg = dnext_c * i                  # next_c = f * prev_c + i * g
    dprev_c = f * dnext_c             # next_c = f * prev_c + i * g

    # Backprop through the gate nonlinearities; four blocks of width H
    da = np.hstack((i * (1 - i) * di, f * (1 - f) * df,
                    o * (1 - o) * do, (1 - g ** 2) * dg))
    dx = da.dot(Wx.T)
    dprev_h = da.dot(Wh.T)
    dWx = x.T.dot(da)
    dWh = prev_h.T.dot(da)
    db = np.sum(da, axis=0)

    return dx, dprev_h, dprev_c, dWx, dWh, db


def lstm_forward(x, h0, Wx, Wh, b):
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N, T, H))
    c = np.zeros((N, T, H))
    c0 = np.zeros((N, H))   # the initial cell state is zero
    cache = {}
    for t in range(T):
        if t == 0:
            h[:, t, :], c[:, t, :], cache[t] = lstm_step_forward(
                x[:, t, :], h0, c0, Wx, Wh, b)
        else:
            h[:, t, :], c[:, t, :], cache[t] = lstm_step_forward(
                x[:, t, :], h[:, t - 1, :], c[:, t - 1, :], Wx, Wh, b)
    return h, cache


def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    N, T, H = dh.shape
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_h, next_c = cache[T - 1]
    N, D = x.shape
    dx = np.zeros((N, T, D))
    dWx = np.zeros(Wx.shape)
    dWh = np.zeros(Wh.shape)
    db = np.zeros(4 * H)
    dprev = np.zeros(prev_h.shape)
    dprev_c = np.zeros(prev_c.shape)

    for t in range(T - 1, -1, -1):
        # Note the accumulation: upstream dh at step t plus the gradient
        # flowing back from step t + 1
        dx[:, t, :], dprev, dprev_c, dWx_local, dWh_local, db_local = \
            lstm_step_backward(dh[:, t, :] + dprev, dprev_c, cache[t])
        dWx += dWx_local
        dWh += dWh_local
        db += db_local

    dh0 = dprev
    return dx, dh0, dWx, dWh, db

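I will pause updates on the assignment here. Time is tight now that the semester has started, and I have recently been porting a neural network to a Raspberry Pi car. Once that is done I will come back to fill in the gaps and decide what to study next. There is a long road ahead.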
