Learning RNN by Reading the PyTorch Source Code (1)

PyTorch ships two implementations of the RNN: (1) a GPU version and (2) a CPU version. Since the GPU version simply calls cuDNN's RNN API, we will set it aside here. This article walks through how PyTorch 0.2.0 implements the basic RNN model.

RNN, or more precisely torch.nn.RNN, implements the simple recurrent neural network (SRNN) proposed by Jeffrey Elman in 1990, more commonly known as the Elman network.

The hidden state of the RNN is computed as:

h_t = \tanh(w_{ih} x_t + b_{ih} + w_{hh} h_{t-1} + b_{hh})

The model contains two weight matrices, w_{ih} and w_{hh}, and two bias vectors, b_{ih} and b_{hh}.

So that is all an RNN is. Papers and blog posts wrap it in talk of sequences, recursion, and other head-spinning vocabulary, yet at its core the model contains just these four parameters, and the so-called training process is simply learning/solving for them. Simple indeed~ // (Note: strictly speaking, this holds for a single-layer, unidirectional RNN, which contains exactly these four parameters.)
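To see these four parameters concretely, here is a quick check with a recent PyTorch build (a minimal sketch, not from the 0.2.0 source; the parameter names follow the convention we will meet later in RNNBase.__init__):

import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1)
for name, p in rnn.named_parameters():
    print(name, tuple(p.size()))
# weight_ih_l0 (3, 4)
# weight_hh_l0 (3, 3)
# bias_ih_l0   (3,)
# bias_hh_l0   (3,)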

Before going further, three small asides.

1. Those familiar with PyTorch's RNN may know that it supports only two activation functions: tanh and ReLU. Why not sigmoid or the various ReLU variants? As mentioned above, the GPU version is not implemented by PyTorch itself but delegates to cuDNN, and cuDNN's RNN supports only tanh and ReLU. For consistency, the CPU version of RNN supports only tanh and ReLU as well.

2. Many blog posts explaining RNNs say that "at each timestep the RNN takes the input x_t and the previous hidden state h_{t-1}, and outputs the current hidden state h_t and the model output y_t". In fact, as the hidden-state formula shows, the RNN outputs only h_t; there is no such thing as y_t. People who say this are conflating two things: when an RNN is applied to a concrete task, say sentiment classification, the hidden state h_t is usually not tied to the class label directly; instead, a fully connected layer is stacked on top of h_t, and it is the output of that fully connected layer that is the predicted class y_t. So keep the RNN's h_t and the classifier's y_t apart (see the sketch right after this item).
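A minimal sketch of that separation for a sentiment-style classifier (illustrative code, not from the PyTorch source; SentimentRNN and its layer sizes are made up for the example):

import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SentimentRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size)     # produces hidden states h_t only
        self.fc = nn.Linear(hidden_size, num_classes)  # maps the last hidden state to y

    def forward(self, x):            # x: (seq_len, batch, input_size)
        output, h_n = self.rnn(x)    # h_n: (1, batch, hidden_size)
        y = self.fc(h_n[-1])         # y: (batch, num_classes), the classifier's output, not the RNN's
        return y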

3. Anyone who has used PyTorch knows that although different layers take inputs of different dimensionality, the first dimension is usually batch_size: torch.nn.Linear takes (batch_size, in_features), torch.nn.Conv2d takes (batch_size, C_{in}, H_{in}, W_{in}). Yet the RNN's input is (seq_len, batch_size, input_size), with batch_size in the second dimension! You can swap batch_size and seq_len by setting batch_first=True, but why isn't the default batch first? Same reason as above: cuDNN's RNN API puts batch_size in the second dimension. And why does cuDNN do that? Because batch first means that when the input tensor is laid out in memory, the first sequence is stored in full, then the second, and so on; with seq_len first, the first element of every sequence is stored first, then every second element, and so on. The difference is shown in the figure below.

(Figure: batch first vs. seq_len first)

With seq_len first, the input elements that belong to the same timestamp across different sequences are adjacent in memory, which is what makes truly batched computation possible.
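A small sketch of the difference (the shapes are made up for illustration): with seq-first storage, input[t] is one contiguous block holding timestep t of every sequence, whereas a batch-first view obtained by transposing is no longer contiguous.

import torch

seq_len, batch, input_size = 5, 3, 4
seq_first = torch.randn(seq_len, batch, input_size)   # default RNN layout
batch_first_view = seq_first.transpose(0, 1)          # (batch, seq_len, input_size)

print(seq_first[0].is_contiguous())       # True: timestep 0 of all sequences sits together in memory
print(batch_first_view.is_contiguous())   # False: the transpose only changes strides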

Before reading the RNN source, let's review one piece of Python syntax: closures. See the linked article.
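For readers who skip the link, a two-line refresher (a toy example of my own, not from the source): the inner function keeps access to variables of the enclosing function even after the outer function has returned.

def make_adder(n):
    def add(x):
        return x + n   # n is captured from the enclosing scope
    return add

add3 = make_adder(3)   # the closure remembers n = 3
print(add3(10))        # 13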

Okay, let's start reading the RNN source code (torch/nn/modules/rnn.py).

In PyTorch, RNN, LSTM, and GRU all inherit from the same base class, RNNBase, and the three differ only slightly in their constructors (__init__).

Take RNN as an example:

class RNN(RNNBase):
    def __init__(self, *args, **kwargs):
        if 'nonlinearity' in kwargs:
            if kwargs['nonlinearity'] == 'tanh':
                mode = 'RNN_TANH'
            elif kwargs['nonlinearity'] == 'relu':
                mode = 'RNN_RELU'
            else:
                raise ValueError("Unknown nonlinearity '{}'".format(
                    kwargs['nonlinearity']))
            del kwargs['nonlinearity']
        else:
            mode = 'RNN_TANH'

        super(RNN, self).__init__(mode, *args, **kwargs)

The constructor does exactly one thing: it records whether the RNN's mode is RNN_TANH or RNN_RELU. Everything else is left to RNNBase.
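From the user's side this is just the nonlinearity keyword argument (a small usage sketch; both objects below end up in RNNBase, one with mode RNN_TANH and one with RNN_RELU):

import torch.nn as nn

rnn_tanh = nn.RNN(10, 20)                        # default nonlinearity: tanh
rnn_relu = nn.RNN(10, 20, nonlinearity='relu')   # switches the mode to RNN_RELU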

Next, the core of RNNBase's __init__():

# For every layer and every direction of the RNN, create one set of parameters:
# w_ih, w_hh, b_ih, b_hh, and register them all as attributes of the module
# via setattr().
for layer in range(num_layers):
    for direction in range(num_directions):
        layer_input_size = input_size if layer == 0 else hidden_size * num_directions

        w_ih = Parameter(torch.Tensor(gate_size, layer_input_size))
        w_hh = Parameter(torch.Tensor(gate_size, hidden_size))
        b_ih = Parameter(torch.Tensor(gate_size))
        b_hh = Parameter(torch.Tensor(gate_size))
        layer_params = (w_ih, w_hh, b_ih, b_hh)

        suffix = '_reverse' if direction == 1 else ''
        param_names = ['weight_ih_l{}{}', 'weight_hh_l{}{}']  # 1 layer, 1 direction: ['weight_ih_l0', 'weight_hh_l0']
        if bias:
            param_names += ['bias_ih_l{}{}', 'bias_hh_l{}{}']
        param_names = [x.format(layer, suffix) for x in param_names]

        for name, param in zip(param_names, layer_params):
            setattr(self, name, param)  # self.name = param: register the attribute on the instance
        self._all_weights.append(param_names)

        self._param_buf_size += sum(p.numel() for p in layer_params)

self.flatten_parameters()  # only works for RNN based on GPU and cuDNN
self.reset_parameters()
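To see the resulting naming scheme, here is what a two-layer bidirectional RNN registers (a sketch that peeks at the internal _all_weights list on a recent PyTorch build, so treat it as illustrative):

import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=2, bidirectional=True)
print(rnn._all_weights)
# [['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0'],
#  ['weight_ih_l0_reverse', 'weight_hh_l0_reverse', 'bias_ih_l0_reverse', 'bias_hh_l0_reverse'],
#  ['weight_ih_l1', 'weight_hh_l1', 'bias_ih_l1', 'bias_hh_l1'],
#  ['weight_ih_l1_reverse', 'weight_hh_l1_reverse', 'bias_ih_l1_reverse', 'bias_hh_l1_reverse']]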

The self.flatten_parameters() method only has an effect for the GPU/cuDNN version of RNN, so we skip it here.

self.reset_parameters() initializes the model's parameters:

def reset_parameters(self):
    stdv = 1.0 / math.sqrt(self.hidden_size)
    for weight in self.parameters():
        weight.data.uniform_(-stdv, stdv)

As you can see, every bias and weight is randomly initialized from a uniform distribution. Why initialize this way? In the thread "Weight initialization when using ReLUs" I found the experience that PyTorch core developer Soumith Chintala shared back then (September 2014) on initializing neural network parameters:

"I initialized my weights with a uniform distribution, mean 0 and std-deviation such that the output neurons would be reasonably bounded for the next layer (so this depended on fanin and fanout)"

"anyways, for most practical purposes, I found the torch defaults to work well.

For conv layers:

stdv = 1/math.sqrt(self.kW*self.kH*self.nInputPlane)

For linear layers:

stdv = 1./math.sqrt(inputSize)"

And an RNN is, at its core, just linear layers.

// If you are interested in how neural network parameters are initialized, I strongly recommend these two links: 1) Weight initialization when using ReLUs 2) weight initialization discussion
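To round off the initialization discussion, a quick sanity check of the bound (a sketch; hidden_size = 100 is arbitrary): every parameter of a freshly built RNN should lie within [-1/sqrt(hidden_size), 1/sqrt(hidden_size)].

import math
import torch.nn as nn

hidden_size = 100
rnn = nn.RNN(input_size=10, hidden_size=hidden_size)
stdv = 1.0 / math.sqrt(hidden_size)   # 0.1 here

print(all(float(p.abs().max()) <= stdv for p in rnn.parameters()))   # True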

On to RNNBase's forward method. RNNs process all kinds of sequences (a sentence, an article, ...), and those sequences usually have different lengths, i.e. variable-length sequences. For now we only analyze the simplest case: all sequences have the same length.

def forward(self, input, hx=None):
    batch_sizes = None  # the input is not packed, so batch_sizes = None
    max_batch_size = input.size(0) if self.batch_first else input.size(1)  # this is just the batch size (why call it "max"?)

    if hx is None:
        # The caller may omit the hidden state; an all-zero hidden state is created automatically
        num_directions = 2 if self.bidirectional else 1
        hx = torch.autograd.Variable(input.data.new(self.num_layers *
                                                    num_directions,
                                                    max_batch_size,
                                                    self.hidden_size).zero_())
        if self.mode == 'LSTM':  # h_0, c_0
            hx = (hx, hx)

    flat_weight = None  # if on CPU

    # self._backend = thnn_backend, i.e. THNNFunctionBackend (a FunctionBackend)
    func = self._backend.RNN(
        self.mode,
        self.input_size,
        self.hidden_size,
        num_layers=self.num_layers,
        batch_first=self.batch_first,
        dropout=self.dropout,
        train=self.training,
        bidirectional=self.bidirectional,
        batch_sizes=batch_sizes,
        dropout_state=self.dropout_state,
        flat_weight=flat_weight
    )
    output, hidden = func(input, self.all_weights, hx)

    return output, hidden

As you can see, when running an RNN you may omit h_0; PyTorch automatically creates an all-zero h_0.
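A usage sketch of that default (written against the current torch API rather than 0.2.0's Variable wrapper; the shapes follow the conventions above):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=1)
x = torch.randn(5, 2, 4)            # (seq_len, batch, input_size)

out1, h1 = rnn(x)                   # h_0 omitted: defaults to zeros
h0 = torch.zeros(1, 2, 3)           # (num_layers * num_directions, batch, hidden_size)
out2, h2 = rnn(x, h0)

print(torch.allclose(out1, out2))   # True: the two calls are equivalent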

The most important part of forward, the two lines that actually perform the forward computation, is:

func = self._backend.RNN(
    self.mode,
    self.input_size,
    self.hidden_size,
    num_layers=self.num_layers,
    batch_first=self.batch_first,
    dropout=self.dropout,
    train=self.training,
    bidirectional=self.bidirectional,
    batch_sizes=batch_sizes,
    dropout_state=self.dropout_state,
    flat_weight=flat_weight
)
output, hidden = func(input, self.all_weights, hx)

Remember the closures mentioned earlier? func here is a closure. Why? A look at the RNN function's source makes it clear:

def RNN(*args, **kwargs):
    def forward(input, *fargs, **fkwargs):
        func = AutogradRNN(*args, **kwargs)  # if no GPU, RNN = AutogradRNN
        # func is also a closure
        return func(input, *fargs, **fkwargs)

    return forward

Aha~ so the func mentioned above really is a closure. The function captured inside func is the forward defined in RNN, so calling func(input, self.all_weights, hx) amounts to constructing AutogradRNN(*args, **kwargs) and calling the result with (input, self.all_weights, hx).

Note that the func inside RNN's forward is itself another closure~

Let's continue with AutogradRNN to see how the RNN model is actually implemented:

def AutogradRNN(mode, input_size, hidden_size, num_layers=1, batch_first=False,
                dropout=0, train=True, bidirectional=False, batch_sizes=None,
                dropout_state=None, flat_weight=None):

    if mode == 'RNN_RELU':
        cell = RNNReLUCell
    elif mode == 'RNN_TANH':
        cell = RNNTanhCell
    elif mode == 'LSTM':
        cell = LSTMCell
    elif mode == 'GRU':
        cell = GRUCell
    else:
        raise Exception('Unknown mode: {}'.format(mode))

    rec_factory = Recurrent

    if bidirectional:
        layer = (rec_factory(cell), rec_factory(cell, reverse=True))  # (forward from Recurrent, forward from Recurrent with reverse=True)
    else:
        layer = (rec_factory(cell),)  # Recurrent(RNNTanhCell)

    # func is another closure o..o
    func = StackedRNN(layer,
                      num_layers,
                      (mode == 'LSTM'),
                      dropout=dropout,
                      train=train)

    def forward(input, weight, hidden):
        if batch_first and batch_sizes is None:
            input = input.transpose(0, 1)  # even if the input is batch_first, it is converted to seq-first internally

        nexth, output = func(input, hidden, weight)

        if batch_first and batch_sizes is None:
            output = output.transpose(0, 1)

        return output, nexth

    return forward

AutogradRNN once again wraps the code that actually runs the RNN computation in a closure... One thing to note: even if the RNN's input is batch first, it is transposed to seq_len first internally.

So let's keep going with StackedRNN:

def StackedRNN(inners, num_layers, lstm=False, dropout=0, train=True):

    num_directions = len(inners)  # 2 or 1
    total_layers = num_layers * num_directions

    def forward(input, hidden, weight):
        assert(len(weight) == total_layers)
        next_hidden = []

        if lstm:
            hidden = list(zip(*hidden))

        for i in range(num_layers):
            all_output = []
            for j, inner in enumerate(inners):
                l = i * num_directions + j

                hy, output = inner(input, hidden[l], weight[l])  # calls the forward returned by Recurrent()
                next_hidden.append(hy)
                all_output.append(output)

            input = torch.cat(all_output, input.dim() - 1)

            if dropout != 0 and i < num_layers - 1:
                # dropout only exists for multi-layer RNNs, applied to the output between layers
                input = F.dropout(input, p=dropout, training=train, inplace=False)

        if lstm:
            next_h, next_c = zip(*next_hidden)
            next_hidden = (
                torch.cat(next_h, 0).view(total_layers, *next_h[0].size()),
                torch.cat(next_c, 0).view(total_layers, *next_c[0].size())
            )
        else:
            next_hidden = torch.cat(next_hidden, 0).view(
                total_layers, *next_hidden[0].size())

        return next_hidden, input

    return forward

Yay~ we have finally found the code that actually runs the forward pass.

Namely these lines:

for i in range(num_layers):
    all_output = []
    for j, inner in enumerate(inners):
        l = i * num_directions + j

        hy, output = inner(input, hidden[l], weight[l])  # calls the forward returned by Recurrent()
        next_hidden.append(hy)
        all_output.append(output)

    input = torch.cat(all_output, input.dim() - 1)

    if dropout != 0 and i < num_layers - 1:
        # dropout only exists for multi-layer RNNs, applied to the output between layers
        input = F.dropout(input, p=dropout, training=train, inplace=False)
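The torch.cat(all_output, input.dim() - 1) line above is also why a bidirectional RNN's output has a doubled feature dimension, while next_hidden grows to total_layers entries. A quick shape check through the public API (a sketch with made-up sizes):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=3, num_layers=2, bidirectional=True)
x = torch.randn(5, 2, 4)    # (seq_len, batch, input_size)
output, h_n = rnn(x)

print(output.size())   # (5, 2, 6): both directions concatenated along the last dimension
print(h_n.size())      # (4, 2, 3): total_layers = num_layers * num_directions = 4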

For each layer and each direction, the forward returned by Recurrent is called once:

def Recurrent(inner, reverse=False):
    def forward(input, hidden, weight):
        output = []
        steps = range(input.size(0) - 1, -1, -1) if reverse else range(input.size(0))
        # steps = [seq_len-1, ..., 1, 0] or [0, 1, ..., seq_len-1]
        for i in steps:
            hidden = inner(input[i], hidden, *weight)
            # hack to handle LSTM
            output.append(hidden[0] if isinstance(hidden, tuple) else hidden)

        if reverse:
            output.reverse()
        output = torch.cat(output, 0).view(input.size(0), *output[0].size())

        return hidden, output

    return forward

And the computation of the hidden state at each timestep is simply:

def RNNReLUCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    hy = F.relu(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
    return hy


def RNNTanhCell(input, hidden, w_ih, w_hh, b_ih=None, b_hh=None):
    hy = F.tanh(F.linear(input, w_ih, b_ih) + F.linear(hidden, w_hh, b_hh))
    return hy
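As a closing sanity check, one tanh step computed "by hand" with F.linear reproduces the formula h_t = tanh(w_ih x_t + b_ih + w_hh h_{t-1} + b_hh) from the beginning of the article (a sketch with random parameters; the names w_ih, w_hh, b_ih, b_hh mirror the arguments above):

import torch
import torch.nn.functional as F

batch, input_size, hidden_size = 2, 4, 3
x_t = torch.randn(batch, input_size)
h_prev = torch.zeros(batch, hidden_size)

w_ih = torch.randn(hidden_size, input_size)
w_hh = torch.randn(hidden_size, hidden_size)
b_ih = torch.randn(hidden_size)
b_hh = torch.randn(hidden_size)

h_t = torch.tanh(F.linear(x_t, w_ih, b_ih) + F.linear(h_prev, w_hh, b_hh))
print(h_t.size())   # torch.Size([2, 3])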

OK, with that we have traced the RNN's entire forward computation~ There are still a few special cases we have not covered; we will continue next time.

(Figure: flow chart)

To be continued...

