TensorFlow Source Code Walkthrough (1): The Attention Seq2Seq Model

01-28

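TensorFlow version: r0.12
Source code on GitHub
P.S. The code in 0.12 is largely the same as in 1.0, 1.1 and 1.2, so this article is also a useful reference for those later versions.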

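The Seq2Seq model is a classic model for tasks such as machine translation and dialogue generation, and attention mechanisms swept through all kinds of NLP tasks in 2016. Both are well worth studying in depth. This article walks through the source code of the embedding_attention_seq2seq function used in TensorFlow's official translation example. It draws on another blog post [1] and the official GitHub source; the formulas and derivation in the attention part come from the paper that the source code references [2].

The tf.nn.seq2seq module implements five seq2seq functions. This article focuses on the last one, so the first four are only introduced briefly.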

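1. basic_rnn_seq2seq: the simplest version. Inputs and outputs are already embeddings; the encoder's final state vector is used as the decoder's initial state; the encoder and decoder use the same type of RNN cell but do not share weights.
2. tied_rnn_seq2seq: same as 1, but the encoder and decoder share weights.
3. embedding_rnn_seq2seq: same as 1, but inputs and outputs are ids; the function internally creates separate embedding matrices for the encoder and the decoder.
4. embedding_tied_rnn_seq2seq: same as 2, but inputs and outputs are ids; the function internally creates a single embedding matrix that the encoder and decoder share.
5. embedding_attention_seq2seq: same as 3, with an attention mechanism added.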

Now for the main topic!

tf.nn.seq2seq.embedding_attention_seq2seq

    # T is time_steps, the sequence length
    def embedding_attention_seq2seq(encoder_inputs,  # [T, batch_size]
                                    decoder_inputs,  # [T, batch_size]
                                    cell,
                                    num_encoder_symbols,
                                    num_decoder_symbols,
                                    embedding_size,
                                    num_heads=1,  # number of attention heads
                                    output_projection=None,  # projection matrix for the decoder
                                    feed_previous=False,
                                    dtype=None,
                                    scope=None,
                                    initial_state_attention=False):

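Parameters

Input

encoder_inputs: the encoder inputs, a list of int32 id tensors
decoder_inputs: the decoder inputs, a list of int32 id tensors
cell: an RNNCell instance
num_encoder_symbols, num_decoder_symbols: the number of symbols on the encoder and decoder side, i.e. the vocabulary sizes
embedding_size: the dimension of the word vectors
num_heads: the number of attention heads; each head is one way of computing a weighted sum, as explained further below
output_projection: the projection matrix and bias (W, B) used to project the decoder's output vector into vocabulary space; W has shape [output_size, num_decoder_symbols] and B has shape [num_decoder_symbols]. If this argument is provided and feed_previous=True, the previous decoder output is first multiplied by W and offset by B before it is used to form the next decoder input
feed_previous: if True, only the first decoder input (the "GO" symbol) is used, and every later decoder input is derived from the previous step's output. This is normally used at test time (the source also notes that it can be used during training to emulate the test-time setting, e.g. for Scheduled Sampling)
initial_state_attention: defaults to False, in which case the initial attention is zero; if True, attention is initialized from the initial state and the attention states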

Output

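An (outputs, state) tuple. outputs is a list of 2D Tensors, one per decoder step; each Tensor has shape [batch_size, num_decoder_symbols] when output_projection is None, and [batch_size, cell.output_size] otherwise. state is the decoder cell's state at the last time step, with shape [batch_size, cell.state_size].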

Encoder

Create an embedding matrix.
Compute the encoder's outputs and state.
Build the attention states, which are later used to compute attention.

    encoder_cell = rnn_cell.EmbeddingWrapper(
        cell, embedding_classes=num_encoder_symbols,
        embedding_size=embedding_size)
    encoder_outputs, encoder_state = rnn.rnn(
        encoder_cell, encoder_inputs, dtype=dtype)  # T * [batch_size, size]

    # Give every output a time axis of length 1, then concatenate along it.
    top_states = [array_ops.reshape(e, [-1, 1, cell.output_size])
                  for e in encoder_outputs]  # T * [batch_size, 1, size]
    attention_states = array_ops.concat(1, top_states)  # [batch_size, T, size]
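To make the shape bookkeeping in the encoder code concrete, here is a minimal sketch that uses nested Python lists in place of tensors. The helper name build_attention_states and the toy values are illustrative assumptions, not part of TensorFlow; the sketch only mirrors the rearrangement from T tensors of shape [batch_size, size] into one structure of shape [batch_size, T, size].

```python
# Toy stand-in for encoder_outputs: T "tensors", each of shape [batch_size, size].
T, batch_size, size = 3, 2, 4
encoder_outputs = [
    [[float(100 * t + 10 * b + k) for k in range(size)] for b in range(batch_size)]
    for t in range(T)
]


def build_attention_states(outputs):
    """Rearrange T * [batch_size, size] into [batch_size, T, size]."""
    n_batch = len(outputs[0])
    return [[outputs[t][b] for t in range(len(outputs))] for b in range(n_batch)]


attention_states = build_attention_states(encoder_outputs)
# attention_states[b][t] is the encoder output of example b at time step t.
```

After this step each example in the batch carries its own sequence of T encoder outputs, which is the layout the attention computation expects.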

The EmbeddingWrapper used in the encoder code puts an embedding layer in front of an RNNCell. With the wrapped cell as encoder_cell, the input can simply be word ids.

    class EmbeddingWrapper(RNNCell):
        def __init__(self, cell, embedding_classes, embedding_size, initializer=None):
            ...
        def __call__(self, inputs, state, scope=None):
            # Creates the embedding matrix [embedding_classes, embedding_size]
            # inputs: [batch_size, 1]
            # return: (output, state)
            ...
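The idea behind the wrapper can be sketched in a few lines of plain Python. ToyCell and ToyEmbeddingWrapper are hypothetical stand-ins written for this illustration (the real classes create TensorFlow variables and operate on tensors); the sketch assumes the toy cell's state size equals embedding_size so that the addition inside the cell is well defined.

```python
import random


class ToyCell:
    """Hypothetical stand-in for an RNN cell: new_state = input + state."""

    def __call__(self, inputs, state):
        new_state = [[x + s for x, s in zip(x_row, s_row)]
                     for x_row, s_row in zip(inputs, state)]
        return new_state, new_state  # (output, state)


class ToyEmbeddingWrapper:
    """Looks up an embedding for each input id, then calls the wrapped cell."""

    def __init__(self, cell, embedding_classes, embedding_size, seed=0):
        rng = random.Random(seed)
        self._cell = cell
        # Embedding matrix of shape [embedding_classes, embedding_size].
        self.embedding = [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
                          for _ in range(embedding_classes)]

    def __call__(self, input_ids, state):
        embedded = [self.embedding[i] for i in input_ids]  # [batch_size, embedding_size]
        return self._cell(embedded, state)


wrapper = ToyEmbeddingWrapper(ToyCell(), embedding_classes=5, embedding_size=3)
zero_state = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
output, state = wrapper([1, 4], zero_state)  # a batch of two word ids
```

The caller only ever passes ids; the lookup is hidden inside the cell, which is why the encoder can be fed id tensors directly.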

Decoder

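Build the decoder's cell. When output_projection is None, this is done by wrapping the cell instance passed in as an argument with the OutputProjectionWrapper class.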

    # Decoder.
    output_size = None
    if output_projection is None:
        cell = rnn_cell.OutputProjectionWrapper(cell, num_decoder_symbols)
        output_size = num_decoder_symbols
    if isinstance(feed_previous, bool):
        return embedding_attention_decoder(
            ...
        )

The OutputProjectionWrapper used here maps the cell's output to the desired dimension.

    class OutputProjectionWrapper(RNNCell):
        # Constructing the wrapper yields an rnn_cell with an output projection.
        def __init__(self, cell, output_size):  # output_size: the size after projection
            ...
        def __call__(self, inputs, state, scope=None):
            ...
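As a rough illustration of what such a wrapper does, the sketch below runs a wrapped cell and then applies a linear map to its output. EchoCell and ToyOutputProjectionWrapper are hypothetical names; unlike the real class, which creates its own projection variables, this sketch takes the weights and bias as explicit arguments so the arithmetic is easy to follow.

```python
class EchoCell:
    """Hypothetical cell that returns its input as output and leaves the state alone."""

    def __call__(self, inputs, state):
        return inputs, state


class ToyOutputProjectionWrapper:
    """Runs the wrapped cell, then maps its output to a new size."""

    def __init__(self, cell, weights, bias):
        self._cell = cell
        self._weights = weights  # [cell_output_size][output_size]
        self._bias = bias        # [output_size]

    def __call__(self, inputs, state):
        output, new_state = self._cell(inputs, state)
        columns = list(zip(*self._weights))
        projected = [
            [sum(o * w for o, w in zip(row, col)) + b
             for col, b in zip(columns, self._bias)]
            for row in output
        ]
        return projected, new_state  # the state is passed through unchanged


W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
B = [0.5, 0.5, 0.5]
cell = ToyOutputProjectionWrapper(EchoCell(), W, B)
projected, state = cell([[1.0, 2.0]], state=None)
```

Only the output is projected; the recurrent state keeps its original size, so the wrapper can be dropped in wherever a plain cell is expected.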

Next, let's look inside embedding_attention_decoder:

    def embedding_attention_decoder(decoder_inputs,
                                    initial_state,
                                    attention_states,
                                    cell,
                                    num_symbols,
                                    embedding_size,
                                    num_heads=1,
                                    output_size=None,
                                    output_projection=None,
                                    feed_previous=False,
                                    update_embedding_for_previous=True,
                                    dtype=None,
                                    scope=None,
                                    initial_state_attention=False):
        # Core code
        embedding = variable_scope.get_variable("embedding",
                                                [num_symbols, embedding_size])
        loop_function = _extract_argmax_and_embed(
            embedding, output_projection,
            update_embedding_for_previous) if feed_previous else None
        emb_inp = [
            embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs]
        # T * [batch_size, embedding_size]
        return attention_decoder(
            emb_inp,
            initial_state,
            attention_states,
            cell,
            output_size=output_size,
            num_heads=num_heads,
            loop_function=loop_function,
            initial_state_attention=initial_state_attention)

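In short, embedding_attention_decoder first creates the embedding used for decoding. Second, it creates a loop function, loop_function, which maps the previous step's output into vocabulary space and returns a word embedding to be used as the next step's input. Finally, the part we care about most, attention_decoder, does the actual decoding.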
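The behaviour of the loop function described above can be sketched with plain Python lists. make_loop_function is a hypothetical name for this illustration, and the sketch is only an approximation of _extract_argmax_and_embed: it projects the previous output when a projection is given, takes the argmax over the vocabulary, and looks up the embedding of the winning id. Gradient handling (update_embedding_for_previous) is left out.

```python
def make_loop_function(embedding, output_projection=None):
    """Build a function that turns the previous output into the next input."""

    def loop_function(prev, _step):
        if output_projection is not None:
            W, B = output_projection  # W: [output_size][num_symbols], B: [num_symbols]
            columns = list(zip(*W))
            prev = [[sum(p * w for p, w in zip(row, col)) + b
                     for col, b in zip(columns, B)]
                    for row in prev]
        # Pick the highest-scoring symbol id for every example in the batch.
        prev_symbol = [max(range(len(row)), key=row.__getitem__) for row in prev]
        # Return the embeddings of those symbols: [batch_size, embedding_size].
        return [embedding[s] for s in prev_symbol]

    return loop_function


embedding = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # 3 symbols, embedding size 2

# Without a projection, prev already holds one score per symbol.
loop = make_loop_function(embedding)
next_input = loop([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]], 0)

# With a projection, prev is first mapped from output_size=2 to 3 symbol scores.
W = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
B = [0.0, 0.0, 0.0]
loop_projected = make_loop_function(embedding, (W, B))
next_input_projected = loop_projected([[0.2, 0.8]], 0)
```

With feed_previous=True, the decoder calls such a function at every step after the first, so the ground-truth decoder inputs beyond the "GO" symbol are never read.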

tf.nn.seq2seq.attention_decoder

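The paper [2] uses three formulas:

$$u^{t}_{i} = v^{T} \tanh(W_{1} h_{i} + W_{2} d_{t})$$

$$a^{t}_{i} = \mathrm{softmax}(u^{t}_{i})$$

$$d'_{t} = \sum^{T_{A}}_{i=1} a^{t}_{i} h_{i}$$

Here $(h_{1},...,h_{T_{A}})$ are the encoder's hidden states and $(d_{1},...,d_{T_{B}})$ are the decoder's hidden states. $v$, $W_{1}$ and $W_{2}$ are parameters the model learns. Attention means that, at every decoding time step, we take a weighted sum of the encoder's hidden states, paying different amounts of attention to different pieces of information. The key question is therefore how to obtain the weight for each hidden state. The attention mechanism in the source code is the most common kind and works in three steps: (1) compute a score $u^{t}_{i}$ from the current decoder hidden state $d_{t}$ and the encoder hidden state $h_{i}$ being attended to; (2) normalize the scores into probabilities $a^{t}_{i}$ with a softmax; (3) use these as weights to sum the encoder hidden states into a single context vector $d'_{t}$. How $d'_{t}$ is used afterwards depends on the task.

In the formulas above, $a^{t}_{i}$ is the weight given to $h_{i}$ at time step t.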
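The three steps described above can be traced on a tiny example. This is a minimal sketch for a single example (no batch dimension) with plain Python lists; attention_step and matvec are hypothetical helper names, the parameter values are made up, and the bias term that the source's linear function adds is omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def attention_step(h, d_t, W1, W2, v):
    """Return (weights a, context vector) for one decoder time step.

    h: list of T_A encoder hidden states; d_t: current decoder hidden state.
    """
    w2d = matvec(W2, d_t)
    # Step 1: scores u_i = v . tanh(W1 h_i + W2 d_t)
    u = [sum(v_k * math.tanh(p + q) for v_k, p, q in zip(v, matvec(W1, h_i), w2d))
         for h_i in h]
    # Step 2: softmax over the scores (shifted by the max for numerical stability)
    shift = max(u)
    exp_u = [math.exp(x - shift) for x in u]
    total = sum(exp_u)
    a = [e / total for e in exp_u]
    # Step 3: weighted sum of the encoder hidden states
    context = [sum(a_i * h_i[k] for a_i, h_i in zip(a, h)) for k in range(len(h[0]))]
    return a, context


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a, context = attention_step(h, d_t=[0.0, 0.0], W1=identity, W2=identity, v=[1.0, 1.0])
```

With these toy parameters the third encoder state gets the largest score, so it receives the largest weight, and the weights always sum to 1.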

Now for the code!

    def attention_decoder(decoder_inputs,    # T * [batch_size, input_size]
                          initial_state,     # [batch_size, cell.state_size]
                          attention_states,  # [batch_size, attn_length, attn_size]
                          cell,
                          output_size=None,
                          num_heads=1,
                          loop_function=None,
                          dtype=None,
                          scope=None,
                          initial_state_attention=False):

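As for the num_heads parameter, remember the loose end we left earlier? :) Attention is a weighted sum over information, and each attention head corresponds to one way of computing that weighted sum, with its own parameters. This parameter sets how many heads are used, so the third formula is evaluated once per head: $d'_{t,j} = \sum^{T_{A}}_{i=1} a^{t}_{i,j} h_{i}$ for $j = 1, \dots, num\_heads$. In the source, the per-head vectors are kept in a list (ds in the code below) and fed together into the following linear layers rather than being added up.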
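A small sketch of how several heads combine, under the assumption (based on my reading of the r0.12 source, where the list of per-head vectors is passed to linear) that the per-head results are concatenated rather than summed. The function names and the weights are made up for illustration; in the real code each head computes its own weights with its own AttnW, AttnV and linear parameters.

```python
def weighted_sum(weights, h):
    """One attention read: sum_i weights[i] * h[i]."""
    return [sum(w * h_i[k] for w, h_i in zip(weights, h)) for k in range(len(h[0]))]


def multi_head_read(head_weights, h):
    """Compute one context vector per head and concatenate them."""
    contexts = [weighted_sum(a_j, h) for a_j in head_weights]
    concatenated = [x for d in contexts for x in d]
    return contexts, concatenated


h = [[1.0, 2.0], [3.0, 4.0]]             # two encoder hidden states
head_weights = [[1.0, 0.0], [0.5, 0.5]]  # attention weights of two heads
contexts, concatenated = multi_head_read(head_weights, h)
```

Each head contributes a vector of size attn_size, so the layer that consumes the attention reads sees num_heads * attn_size extra input features.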

$W_{1} h_{i}$ is implemented as a convolution; the returned tensor has shape [batch_size, attn_length, 1, attention_vec_size].

    # To calculate W1 * h_t we use a 1-by-1 convolution
    hidden = array_ops.reshape(
        attention_states, [-1, attn_length, 1, attn_size])
    hidden_features = []
    v = []
    attention_vec_size = attn_size  # Size of query vectors for attention.
    for a in xrange(num_heads):
        k = variable_scope.get_variable("AttnW_%d" % a,
                                        [1, 1, attn_size, attention_vec_size])
        hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))
        v.append(
            variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))
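Why a 1-by-1 convolution computes $W_{1} h_{i}$: a kernel of shape [1, 1, attn_size, attention_vec_size] looks at one position at a time, so it applies the same attn_size by attention_vec_size matrix to every encoder state. The sketch below shows that equivalence for a single example with plain lists; conv1x1 is a hypothetical helper, not a TensorFlow function.

```python
def conv1x1(hidden, kernel):
    """Apply the same linear map to every position.

    hidden: [attn_length][attn_size]
    kernel: [attn_size][attention_vec_size]
    returns: [attn_length][attention_vec_size]
    """
    out_size = len(kernel[0])
    return [
        [sum(h_k * kernel[k][j] for k, h_k in enumerate(h_i)) for j in range(out_size)]
        for h_i in hidden
    ]


hidden = [[1.0, 2.0], [3.0, 4.0]]            # two positions, attn_size = 2
kernel = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # attn_size = 2, attention_vec_size = 3
hidden_features = conv1x1(hidden, kernel)
```

Because these features do not depend on the decoder state, they can be computed once before decoding starts and reused at every time step, which is what the loop in the source does.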

The $W_{2} d_{t}$ term is implemented with the linear mapping function linear, as shown below.

    for a in xrange(num_heads):
        with variable_scope.variable_scope("Attention_%d" % a):
            # query corresponds to the current decoder hidden state d_t
            y = linear(query, attention_vec_size, True)
            y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
            # Compute the scores u^t
            s = math_ops.reduce_sum(
                v[a] * math_ops.tanh(hidden_features[a] + y), [2, 3])
            a = nn_ops.softmax(s)
            # Compute the attention-weighted vector d.
            d = math_ops.reduce_sum(
                array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden,
                [1, 2])
            ds.append(array_ops.reshape(d, [-1, attn_size]))

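That covers all of the core code of embedding_attention_seq2seq. In practice, the individual functions can be used flexibly as needed, especially attention_decoder. Hopefully those of you who have read this far now have a deeper understanding of this API :)

References:

[1] TensorFlow study notes (11): an introduction to the seq2seq model interfaces (tensorflow學習筆記（十一）：seq2seq Model相關介面介紹)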

[2] Grammar as a Foreign Language