Learning RNNs by Reading the PyTorch Source (1)

04-19

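PyTorch implements RNNs in two versions: 1) a GPU version; 2) a CPU version. The GPU version calls cuDNN's RNN API directly, so we skip it here. This post describes how PyTorch 0.2.0 implements the basic RNN model.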

RNN, or more precisely torch.nn.RNN, implements the simple recurrent neural network (SRNN) that Jeffrey Elman proposed in 1990. It also goes by a more widely used name: the Elman network.

The hidden state of the RNN is computed as follows:

$h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})$

The model has two parameter matrices, $w_{ih}$ and $w_{hh}$ , and two bias vectors, $b_{ih}$ and $b_{hh}$ .

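So that is all an RNN is. Whenever RNNs come up, papers and blogs talk about sequences, recursion, and all sorts of convoluted terms, but in essence the model contains only these four parameters, and training is just the process of learning/solving for them. Simple indeed~ // (Note: more precisely, a single-layer, unidirectional RNN contains only these four parameters.)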
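To make the formula concrete, here is a minimal pure-Python sketch of the hidden state update. It is not PyTorch code: the helper names `linear` and `elman_step` and the toy sizes and weight values are assumptions chosen only for illustration.

```python
import math

def linear(x, w, b):
    """Compute w @ x + b for a vector x, a row-major matrix w, and a bias vector b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(w, b)]

def elman_step(x_t, h_prev, w_ih, w_hh, b_ih, b_hh):
    """One step of h_t = tanh(w_ih * x_t + b_ih + w_hh * h_{t-1} + b_hh)."""
    from_input = linear(x_t, w_ih, b_ih)
    from_hidden = linear(h_prev, w_hh, b_hh)
    return [math.tanh(a + r) for a, r in zip(from_input, from_hidden)]

# Toy sizes (assumed for illustration): input_size = 3, hidden_size = 2.
w_ih = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1]]
w_hh = [[0.5, -0.5], [0.25, 0.25]]
b_ih = [0.0, 0.1]
b_hh = [0.0, -0.1]

h = [0.0, 0.0]  # h_0
for x_t in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h = elman_step(x_t, h, w_ih, w_hh, b_ih, b_hh)
```

The same four parameters are reused at every time step; only `h` changes as the sequence is consumed.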

Before going further, three side questions.

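1. Readers familiar with RNN in PyTorch may know that it supports only two activation functions: tanh and ReLU. Why not sigmoid or the various ReLU variants? As mentioned above, PyTorch does not implement the GPU version itself but calls cuDNN directly, and the RNN in cuDNN supports only tanh and ReLU. For consistency, the CPU version of RNN supports only tanh and ReLU as well.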

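2. Many blogs explain RNNs by saying that "at each time step the input is $x_t$ and the previous hidden state $h_{t-1}$ , and the output is the current hidden state $h_t$ and the model output $y_t$ ". In fact, as the hidden state formula shows, the RNN outputs only $h_t$ ! There is no such thing as $y_t$ . People say this because they conflate two things: when an RNN is applied to a concrete task such as sentiment classification, the hidden state $h_t$ is usually not tied directly to the class label. Instead, a fully connected layer is attached after $h_t$ , and the output of that fully connected layer is the predicted class $y_t$ . So keep the RNN's $h_t$ and the classifier's $y_t$ apart.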

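3. Anyone who has used PyTorch probably knows that, although input shapes differ from layer to layer, the first dimension is usually batch_size: torch.nn.Linear takes (batch_size, in_features) and torch.nn.Conv2d takes (batch_size, $C_{in},H_{in},W_{in}$ ). The RNN input, however, is (seq_len, batch_size, input_size), with batch_size in the second dimension! You can swap batch_size and seq_len by setting batch_first to True, but why is the default not batch first? Same reason as above: cuDNN's RNN API puts batch_size in the second dimension. And why does cuDNN do that? Because batch first means the model input (a Tensor) is laid out in memory one sequence after another: the whole first sequence, then the second, and so on. With seq_len first, memory holds the first unit of every sequence, then the second unit of every sequence, and so on. The figure below shows the difference: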

batch first vs seq_len first

seq_len first means that the input units of different sequences at the same time step are adjacent in memory, and only then can the computation be truly batched.
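The layout argument can be checked with a little index arithmetic. The sketch below is an illustration only: it treats each input unit as a single memory slot (ignoring the input_size dimension) and assumes a contiguous row-major array; the function names are hypothetical.

```python
def offset_batch_first(b, t, seq_len):
    """Flat index of unit (b, t) in a contiguous (batch_size, seq_len) array."""
    return b * seq_len + t

def offset_seq_first(t, b, batch_size):
    """Flat index of unit (t, b) in a contiguous (seq_len, batch_size) array."""
    return t * batch_size + b

batch_size, seq_len = 3, 4
t = 1  # look at the unit at time step 1 of every sequence

batch_first_offsets = [offset_batch_first(b, t, seq_len) for b in range(batch_size)]
seq_first_offsets = [offset_seq_first(t, b, batch_size) for b in range(batch_size)]

print(batch_first_offsets)  # [1, 5, 9]: strided, one jump of seq_len per sequence
print(seq_first_offsets)    # [3, 4, 5]: one contiguous run
```

With seq_len first, everything the cell needs for one time step sits in one contiguous slice.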

Before reading the RNN source, it helps to review one Python feature: closures.
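As a quick refresher, a closure is an inner function that remembers variables from the enclosing function's scope even after that outer function has returned. A minimal sketch, with `make_scaler` being a made-up name for illustration:

```python
def make_scaler(factor):
    """Return a function that multiplies its argument by `factor`."""
    def scale(x):
        # `factor` is not a parameter of scale; it is captured from make_scaler.
        return factor * x
    return scale

double = make_scaler(2)
triple = make_scaler(3)

print(double(5))  # 10
print(triple(5))  # 15
```

Each returned function carries its own captured `factor`, which is the pattern the RNN backend code relies on: configure once, call later.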

Okay, let's start reading the RNN source (torch/nn/modules/rnn.py).

In PyTorch, RNN, LSTM, and GRU all inherit from the same base class, RNNBase, and the three differ only slightly in their constructors (__init__):

Taking RNN as an example,

    class RNN(RNNBase):
        def __init__(self, *args, **kwargs):
            if 'nonlinearity' in kwargs:
                if kwargs['nonlinearity'] == 'tanh':
                    mode = 'RNN_TANH'
                elif kwargs['nonlinearity'] == 'relu':
                    mode = 'RNN_RELU'
                else:
                    raise ValueError("Unknown nonlinearity {}".format(
                        kwargs['nonlinearity']))
                del kwargs['nonlinearity']
            else:
                mode = 'RNN_TANH'

            super(RNN, self).__init__(mode, *args, **kwargs)

The constructor does exactly one thing: it declares whether the RNN type is RNN_TANH or RNN_RELU. Everything else is left to RNNBase.

Next, the RNNBase code. The core of __init__():

    # Create one set of parameters w_ih, w_hh, b_ih, b_hh for every layer and every direction,
    # and register each of them as an attribute of the module via setattr().
    for layer in range(num_layers):
        for direction in range(num_directions):
            layer_input_size = input_size if layer == 0 else hidden_size * num_directions

            w_ih = Parameter(torch.Tensor(gate_size, layer_input_size))
            w_hh = Parameter(torch.Tensor(gate_size, hidden_size))
            b_ih = Parameter(torch.Tensor(gate_size))
            b_hh = Parameter(torch.Tensor(gate_size))
            layer_params = (w_ih, w_hh, b_ih, b_hh)

            suffix = '_reverse' if direction == 1 else ''
            param_names = ['weight_ih_l{}{}', 'weight_hh_l{}{}']  # 1 layer, 1 direction: ['weight_ih_l0', 'weight_hh_l0']
            if bias:
                param_names += ['bias_ih_l{}{}', 'bias_hh_l{}{}']
            param_names = [x.format(layer, suffix) for x in param_names]

            for name, param in zip(param_names, layer_params):
                setattr(self, name, param)  # self.name = param, adds an attribute to the instance
            self._all_weights.append(param_names)

            self._param_buf_size += sum(p.numel() for p in layer_params)

    self.flatten_parameters()  # only works for RNN based on GPU and cuDNN
    self.reset_parameters()
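The naming scheme is easier to see in isolation. The following sketch reproduces only the name-generation logic from the loop above, without creating any tensors; the function name `rnn_param_names` is made up for this example.

```python
def rnn_param_names(num_layers, bidirectional=False, bias=True):
    """List the parameter names per (layer, direction), in creation order."""
    num_directions = 2 if bidirectional else 1
    all_names = []
    for layer in range(num_layers):
        for direction in range(num_directions):
            suffix = '_reverse' if direction == 1 else ''
            names = ['weight_ih_l{}{}', 'weight_hh_l{}{}']
            if bias:
                names += ['bias_ih_l{}{}', 'bias_hh_l{}{}']
            all_names.append([n.format(layer, suffix) for n in names])
    return all_names

names = rnn_param_names(num_layers=2, bidirectional=True)
for group in names:
    print(group)
```

For a 2-layer bidirectional network this gives four groups: layer 0 forward, layer 0 reverse, layer 1 forward, layer 1 reverse.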

The self.flatten_parameters() method only has an effect for the GPU version of RNN, so we skip it here.

The self.reset_parameters() method initializes the model's parameters:

    def reset_parameters(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)

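As you can see, all biases and weights are randomly initialized from a uniform distribution. Why initialize this way? In "Weight initialization when using ReLUs" I found the experience that Soumith Chintala, a core PyTorch developer, shared at the time (September 2014) about initializing neural network parameters: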

「I initialized my weights with a uniform distribution, mean 0 and std-deviation such that the output neurons would be reasonably bounded for the next layer (so this depended on fanin and fanout)」

「anyways, for most practical purposes, I found the torch defaults to work well.

For conv layers:

stdv = 1/math.sqrt(self.kW*self.kH*self.nInputPlane)

For linear layers:

stdv = 1./math.sqrt(inputSize)」

And an RNN is, in essence, just linear layers.
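As a rough illustration of what reset_parameters does, the sketch below draws values from the same uniform range using only the standard library. The helper name `uniform_init`, the sizes, and the fixed seed are assumptions for this example; it does not touch real PyTorch parameters.

```python
import math
import random

def uniform_init(rows, cols, hidden_size, rng):
    """Fill a rows x cols matrix with samples from U(-stdv, stdv), stdv = 1/sqrt(hidden_size)."""
    stdv = 1.0 / math.sqrt(hidden_size)
    return [[rng.uniform(-stdv, stdv) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
hidden_size, input_size = 16, 8

w_ih = uniform_init(hidden_size, input_size, hidden_size, rng)
w_hh = uniform_init(hidden_size, hidden_size, hidden_size, rng)

bound = 1.0 / math.sqrt(hidden_size)  # 0.25 for hidden_size = 16
```

Note that the bound depends only on hidden_size, even for w_ih whose fan-in is input_size.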

// If you are interested in how neural network parameters are initialized, I strongly recommend reading these two links: 1) Weight initialization when using ReLUs 2) weight initialization discussion

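Next, the forward method of RNNBase. An RNN processes sequences of all kinds (a sentence, an article), and these sequences usually differ in length, i.e. they are variable length sequences. For now we only analyze the simplest case: all sequences have the same length.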

    def forward(self, input, hx=None):
        batch_sizes = None  # input is not packed, so batch_sizes = None
        max_batch_size = input.size(0) if self.batch_first else input.size(1)  # batch_size, why call it max

        if hx is None:  # the caller may omit hidden; an all-zero hidden state is created automatically
            num_directions = 2 if self.bidirectional else 1
            hx = torch.autograd.Variable(input.data.new(self.num_layers *
                                                        num_directions,
                                                        max_batch_size,
                                                        self.hidden_size).zero_())
            if self.mode == 'LSTM':  # h_0, c_0
                hx = (hx, hx)

        flat_weight = None  # if cpu

        func = self._backend.RNN(  # self._backend = thnn_backend # backend = THNNFunctionBackend(), FunctionBackend
            self.mode,
            self.input_size,
            self.hidden_size,
            num_layers=self.num_layers,
            batch_first=self.batch_first,
            dropout=self.dropout,
            train=self.training,
            bidirectional=self.bidirectional,
            batch_sizes=batch_sizes,
            dropout_state=self.dropout_state,
            flat_weight=flat_weight
        )
        output, hidden = func(input, self.all_weights, hx)

        return output, hidden

As you can see, when training an RNN you do not have to pass in $h_0$ ; in that case PyTorch automatically creates an all-zero $h_0$ .
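The shape of that default $h_0$ follows directly from the code above: (num_layers * num_directions, batch_size, hidden_size). A minimal sketch with nested lists standing in for a tensor; `default_h0` is a hypothetical helper, not a PyTorch function.

```python
def default_h0(num_layers, bidirectional, batch_size, hidden_size):
    """Build an all-zero initial hidden state as nested lists."""
    num_directions = 2 if bidirectional else 1
    return [[[0.0] * hidden_size for _ in range(batch_size)]
            for _ in range(num_layers * num_directions)]

h0 = default_h0(num_layers=2, bidirectional=True, batch_size=3, hidden_size=5)
shape = (len(h0), len(h0[0]), len(h0[0][0]))
print(shape)  # (4, 3, 5)
```

There is one zero matrix per (layer, direction) pair, each holding one hidden vector per sequence in the batch.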

The most important part of forward, and the part that actually runs the forward computation, is the following two statements:

    func = self._backend.RNN(
        self.mode,
        self.input_size,
        self.hidden_size,
        num_layers=self.num_layers,
        batch_first=self.batch_first,
        dropout=self.dropout,
        train=self.training,
        bidirectional=self.bidirectional,
        batch_sizes=batch_sizes,
        dropout_state=self.dropout_state,
        flat_weight=flat_weight
    )
    output, hidden = func(input, self.all_weights, hx)

Remember the closures mentioned earlier? Here func is a closure. To see why, look at the source of RNN:

    def RNN(*args, **kwargs):
        def forward(input, *fargs, **fkwargs):
            func = AutogradRNN(*args, **kwargs)  # if no gpu, RNN=AutogradRNN # func is also a closure
            return func(input, *fargs, **fkwargs)
        return forward

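I see~ the func mentioned earlier really is a closure. The function inside this closure is the forward defined in RNN. Calling func(input, self.all_weights, hx) is equivalent to calling AutogradRNN(*args, **kwargs)(input, self.all_weights, hx), where args and kwargs are the configuration values passed to self._backend.RNN.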

Note that the func inside the forward of the RNN function is itself a closure too~
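The two-level pattern can be imitated with toy functions. In the sketch below, `autograd_rnn` and `rnn` are hypothetical stand-ins for AutogradRNN and the backend RNN function; they return strings instead of tensors and only mirror the call structure.

```python
def autograd_rnn(mode, hidden_size):
    """Stand-in for AutogradRNN: captures the configuration, returns a forward function."""
    def forward(input, weight, hidden):
        return "{}(hidden_size={}) applied to {}".format(mode, hidden_size, input)
    return forward

def rnn(*args, **kwargs):
    """Stand-in for the backend RNN: defers building the inner closure until call time."""
    def forward(input, *fargs, **fkwargs):
        func = autograd_rnn(*args, **kwargs)  # inner closure, built on every call
        return func(input, *fargs, **fkwargs)
    return forward

func = rnn('RNN_TANH', hidden_size=4)
via_wrapper = func('x', 'weights', 'h0')
direct = autograd_rnn('RNN_TANH', hidden_size=4)('x', 'weights', 'h0')
print(via_wrapper == direct)  # True
```

Configuration goes in at factory time, data goes in at call time, and the wrapper adds nothing beyond forwarding the arguments.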

Next, the implementation of AutogradRNN, to see how the RNN model is actually implemented:

    def AutogradRNN(mode, input_size, hidden_size, num_layers=1, batch_first=False,
                    dropout=0, train=True, bidirectional=False, batch_sizes=None,
                    dropout_state=None, flat_weight=None):

        if mode == 'RNN_RELU':
            cell = RNNReLUCell
        elif mode == 'RNN_TANH':
            cell = RNNTanhCell
        elif mode == 'LSTM':
            cell = LSTMCell
        elif mode == 'GRU':
            cell = GRUCell
        else:
            raise Exception('Unknown mode: {}'.format(mode))

        rec_factory = Recurrent

        if bidirectional:
            layer = (rec_factory(cell), rec_factory(cell, reverse=True))  # (forward from Recurrent, forward from Recurrent)
        else:
            layer = (rec_factory(cell),)  # Recurrent(RNNTanhCell)

        # func is another closure o..o
        func = StackedRNN(layer,
                          num_layers,
                          (mode == 'LSTM'),
                          dropout=dropout,
                          train=train)

        def forward(input, weight, hidden):
            if batch_first and batch_sizes is None:
                input = input.transpose(0, 1)  # even if the input is batch_first, it is converted to seq first internally

            nexth, output = func(input, hidden, weight)

            if batch_first and batch_sizes is None:
                output = output.transpose(0, 1)

            return output, nexth

        return forward

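AutogradRNN once again uses a closure to wrap the code that actually performs the RNN computation... One thing to note: even if the RNN input is batch first, it is converted to seq_len first internally.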
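What transpose(0, 1) does to a batch-first input can be shown with nested lists. This is only an analogy for the tensor operation; `transpose01` is a made-up helper and the string labels stand for input units.

```python
def transpose01(x):
    """Swap the first two dimensions of a nested list."""
    return [list(column) for column in zip(*x)]

# (batch_size=2, seq_len=3): sequence a and sequence b, three time steps each.
batch_first = [['a0', 'a1', 'a2'],
               ['b0', 'b1', 'b2']]

seq_first = transpose01(batch_first)  # (seq_len=3, batch_size=2)
print(seq_first)  # [['a0', 'b0'], ['a1', 'b1'], ['a2', 'b2']]
```

After the swap, indexing the first dimension yields one time step across the whole batch, which is what the recurrent loop iterates over. Applying the same swap to the output restores the batch-first layout.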

So let's move on to StackedRNN:

    def StackedRNN(inners, num_layers, lstm=False, dropout=0, train=True):

        num_directions = len(inners)  # 2 or 1
        total_layers = num_layers * num_directions

        def forward(input, hidden, weight):
            assert(len(weight) == total_layers)
            next_hidden = []

            if lstm:
                hidden = list(zip(*hidden))

            for i in range(num_layers):
                all_output = []
                for j, inner in enumerate(inners):
                    l = i * num_directions + j

                    hy, output = inner(input, hidden[l], weight[l])  # calls Recurrent()
                    next_hidden.append(hy)
                    all_output.append(output)

                input = torch.cat(all_output, input.dim() - 1)

                if dropout != 0 and i < num_layers - 1:  # dropout only exists for multi-layer RNNs, applied to output
                    input = F.dropout(input, p=dropout, training=train, inplace=False)

            if lstm:
                next_h, next_c = zip(*next_hidden)
                next_hidden = (
                    torch.cat(next_h, 0).view(total_layers, *next_h[0].size()),
                    torch.cat(next_c, 0).view(total_layers, *next_c[0].size())
                )
            else:
                next_hidden = torch.cat(next_hidden, 0).view(
                    total_layers, *next_hidden[0].size())

            return next_hidden, input

        return forward

Yay~ we have finally found the code that actually runs the forward computation.

Namely these few lines:

    for i in range(num_layers):
        all_output = []
        for j, inner in enumerate(inners):
            l = i * num_directions + j

            hy, output = inner(input, hidden[l], weight[l])  # calls Recurrent()
            next_hidden.append(hy)
            all_output.append(output)

        input = torch.cat(all_output, input.dim() - 1)

        if dropout != 0 and i < num_layers - 1:  # dropout only exists for multi-layer RNNs, applied to output
            input = F.dropout(input, p=dropout, training=train, inplace=False)
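The index `l = i * num_directions + j` is what ties each (layer, direction) pair to its slot in `hidden` and `weight`. A tiny sketch of that mapping, with `flat_index` as a made-up name:

```python
def flat_index(layer, direction, num_directions):
    """Position of a (layer, direction) pair in the flat hidden/weight lists."""
    return layer * num_directions + direction

num_layers, num_directions = 2, 2
order = [(i, j, flat_index(i, j, num_directions))
         for i in range(num_layers)
         for j in range(num_directions)]
print(order)  # [(0, 0, 0), (0, 1, 1), (1, 0, 2), (1, 1, 3)]
```

This matches the order in which the parameters were created in __init__: all directions of layer 0 first, then all directions of layer 1.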

For every layer and every direction, Recurrent is called to run one forward pass:

    def Recurrent(inner, reverse=False):
        def forward(input, hidden, weight):
            output = []
            steps = range(input.size(0) - 1, -1, -1) if reverse else range(input.size(0))
            # steps=[seq_len-1, ...,1,0] or [0,1,...,seq_len-1]
            for i in steps:
                hidden = inner(input[i], hidden, *weight)
                # hack to handle LSTM
                output.append(hidden[0] if isinstance(hidden, tuple) else hidden)

            if reverse:
                output.reverse()
            output = torch.cat(output, 0).view(input.size(0), *output[0].size())

            return hidden, output

        return forward
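The time loop, including the reverse direction, can be tried out with a toy cell. In this sketch `recurrent` mirrors the structure of Recurrent but works on plain lists and takes no weights, and `running_sum` is a made-up cell whose "hidden state" is simply the sum of the inputs seen so far.

```python
def recurrent(cell, reverse=False):
    """Return a function that unrolls `cell` over a sequence, forwards or backwards."""
    def forward(inputs, hidden):
        outputs = []
        steps = range(len(inputs) - 1, -1, -1) if reverse else range(len(inputs))
        for i in steps:
            hidden = cell(inputs[i], hidden)
            outputs.append(hidden)
        if reverse:
            outputs.reverse()  # put outputs back in original time order
        return hidden, outputs
    return forward

def running_sum(x, h):
    """Toy cell: the new hidden state is the old one plus the input."""
    return h + x

inputs = [1, 2, 3]
h_fwd, out_fwd = recurrent(running_sum)(inputs, 0)
h_bwd, out_bwd = recurrent(running_sum, reverse=True)(inputs, 0)

print(out_fwd, h_fwd)  # [1, 3, 6] 6
print(out_bwd, h_bwd)  # [6, 5, 3] 6
```

In the reverse case, `outputs[t]` still lines up with `inputs[t]`, but the returned final hidden state belongs to time step 0, the last one processed.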

And the code that actually computes the hidden state at each time step is:

    def RNNReLUCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
        hy = F.relu(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
        return hy


    def RNNTanhCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
        hy = F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
        return hy

OK, that covers the forward computation of the RNN~ There are still some special cases we have not touched on; we will continue next time.

[Figure: flowchart]

To be continued...