Reasoning about Sentence Entailment with Attention

08-11


From the column: Natural Language Understanding Paper Notes

Paper title: REASONING ABOUT ENTAILMENT WITH NEURAL ATTENTION

https://arxiv.org/pdf/1509.06664v4.pdf?


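An ICLR 2016 paper on recognizing textual entailment.

First, a brief introduction to textual entailment: given two sentences, decide which relation holds between them: (1) they contradict each other; (2) they are unrelated; (3) the first sentence entails the second.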

Recognizing textual entailment (RTE) is the task of determining whether two natural language sentences are (i) contradicting each other, (ii) not related, or whether (iii) the first sentence (called premise) entails the second sentence (called hypothesis).

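This is the first paper to apply an attention mechanism to the entailment task. Its main contribution is a word-by-word attention mechanism that reasons about entailment using the relations between individual words of the two sentences; on the SNLI dataset it achieved state-of-the-art results. The paper's architecture diagram shows the model structure: the backbone is Conditional-Encoding, and on top of it the authors try two different ways of computing Attention, both described below.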

Conditional-Encoding

The premise is first encoded with an LSTM. The hypothesis is then encoded with a second LSTM, which is initialized with the final state of the first LSTM.
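The conditional encoding step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the gate layout inside `lstm_step`, the helper names (`run_lstm`, `init_lstm`, `conditional_encode`), the random initialization, and the choice to carry over both the hidden state and the cell state are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W has shape (4k, d + k), b has shape (4k,)."""
    k = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0:k])            # input gate
    f = sigmoid(z[k:2 * k])        # forget gate
    o = sigmoid(z[2 * k:3 * k])    # output gate
    g = np.tanh(z[3 * k:4 * k])    # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_lstm(X, h0, c0, W, b):
    """Run an LSTM over the rows of X; return outputs (k x T) and the final (h, c)."""
    h, c = h0, c0
    outputs = []
    for x in X:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs, axis=1), h, c

def init_lstm(rng, d, k):
    """Hypothetical initializer: small random weights, zero biases."""
    return rng.normal(scale=0.1, size=(4 * k, d + k)), np.zeros(4 * k)

def conditional_encode(premise, hypothesis, params_p, params_h):
    """Encode the premise, then encode the hypothesis starting from the premise's final state."""
    k = params_p[0].shape[0] // 4
    Y, h_p, c_p = run_lstm(premise, np.zeros(k), np.zeros(k), *params_p)
    # Assumption: both h and c of the first LSTM initialize the second one.
    H, h_N, _ = run_lstm(hypothesis, h_p, c_p, *params_h)
    return Y, H, h_N

rng = np.random.default_rng(0)
d, k = 5, 4                              # embedding size, hidden size
params_p, params_h = init_lstm(rng, d, k), init_lstm(rng, d, k)
premise = rng.normal(size=(7, d))        # 7 premise word vectors
hypothesis = rng.normal(size=(3, d))     # 3 hypothesis word vectors
Y, H, h_N = conditional_encode(premise, hypothesis, params_p, params_h)
print(Y.shape, H.shape, h_N.shape)       # (4, 7) (4, 3) (4,)
```

Here `Y` collects the premise outputs column by column, which is the $k \times L$ layout the Attention formulas in this note rely on, and `h_N` is the last hypothesis output.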

Attention

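Let the output of the first LSTM be $Y \in \mathbb{R}^{k \times L}$, where $k$ is the dimension of the LSTM hidden state and $L$ is the length of the premise, and let the output of the second LSTM at the last time step be $h_N$. The Attention of $h_N$ over $Y$ is then computed as follows:

$M = \tanh(W^y Y + W^h h_N \otimes e_L) \in \mathbb{R}^{k \times L}$, where $e_L$ is a vector of $L$ ones, so the outer product simply repeats $W^h h_N$ across $L$ columns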

$\alpha = \text{softmax}(w^T M) \in \mathbb{R}^L$

$r = Y \alpha^T \in \mathbb{R}^k$

$h^{*} = \tanh(W^p r + W^x h_N) \in \mathbb{R}^k$

$h^{*}$ is the final sentence-pair representation.

The sentence-pair representation is then fed to a softmax for 3-way classification to obtain the final prediction.
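The four Attention equations and the 3-way softmax can be written out in NumPy roughly as below. This is a sketch under assumptions: the weights are random placeholders, and the classifier parameters `W_c` and `b_c` are hypothetical names introduced only for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def attention(Y, h_N, W_y, W_h, w, W_p, W_x):
    """Y: (k, L) premise outputs, h_N: (k,) last hypothesis output."""
    # Adding a (k, 1) column broadcasts it over all L columns,
    # which is the same as the outer product with the all-ones vector e_L.
    M = np.tanh(W_y @ Y + (W_h @ h_N)[:, None])    # (k, L)
    alpha = softmax(w @ M)                          # (L,) attention weights
    r = Y @ alpha                                   # (k,) weighted premise summary
    h_star = np.tanh(W_p @ r + W_x @ h_N)           # (k,) sentence-pair representation
    return h_star, alpha

def classify(h_star, W_c, b_c):
    """Hypothetical 3-way softmax layer over the sentence-pair representation."""
    return softmax(W_c @ h_star + b_c)

rng = np.random.default_rng(0)
k, L = 4, 6
Y = rng.normal(size=(k, L))
h_N = rng.normal(size=k)
W_y, W_h, W_p, W_x = (rng.normal(scale=0.1, size=(k, k)) for _ in range(4))
w = rng.normal(size=k)
W_c, b_c = rng.normal(size=(3, k)), np.zeros(3)

h_star, alpha = attention(Y, h_N, W_y, W_h, w, W_p, W_x)
probs = classify(h_star, W_c, b_c)
print(alpha.shape, probs.shape)   # (6,) (3,)
```

The vector `alpha` holds one weight per premise word and sums to 1, so `r` is a convex combination of the columns of `Y`.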

Word-By-Word Attention

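The model above uses only the last output $h_N$ of the second LSTM to compute Attention. In theory an LSTM can encode long-range dependencies, but for longer sentences many details are hard to capture. The idea is therefore to compute Attention at every output of the second LSTM, with each step also depending on the Attention result of the previous step. This gives word-by-word Attention:

$M_t = \tanh(W^y Y + (W^h h_t + W^r r_{t-1}) \otimes e_L) \in \mathbb{R}^{k \times L}$, where the outer product with $e_L$ again simply repeats the vector across $L$ columns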

$\alpha_t = \text{softmax}(w^T M_t) \in \mathbb{R}^L$

$r_t = Y \alpha_t^T + \tanh(W^t r_{t-1}) \in \mathbb{R}^k$

$h^{*} = \tanh(W^p r_N + W^x h_N) \in \mathbb{R}^k$

$h^{*}$ is the sentence-pair representation computed by word-by-word Attention.
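The recurrence over hypothesis positions can be sketched as a simple loop. As before this is only an illustration with random placeholder weights; starting from $r_0 = 0$ is an assumption of this sketch, since the note does not state the initial value.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def word_by_word_attention(Y, H, W_y, W_h, W_r, w, W_t, W_p, W_x):
    """Y: (k, L) premise outputs, H: (k, N) hypothesis outputs."""
    k, L = Y.shape
    r = np.zeros(k)                       # r_0, assumed to be the zero vector
    alphas = []
    for t in range(H.shape[1]):
        h_t = H[:, t]
        # The (k, 1) column is broadcast over all L premise positions.
        M_t = np.tanh(W_y @ Y + (W_h @ h_t + W_r @ r)[:, None])
        alpha_t = softmax(w @ M_t)        # attention of hypothesis word t over the premise
        r = Y @ alpha_t + np.tanh(W_t @ r)
        alphas.append(alpha_t)
    h_star = np.tanh(W_p @ r + W_x @ H[:, -1])   # uses r_N and h_N
    return h_star, np.stack(alphas)       # alphas has shape (N, L)

rng = np.random.default_rng(0)
k, L, N = 4, 6, 3
Y = rng.normal(size=(k, L))
H = rng.normal(size=(k, N))
W_y, W_h, W_r, W_t, W_p, W_x = (rng.normal(scale=0.1, size=(k, k)) for _ in range(6))
w = rng.normal(size=k)

h_star, alphas = word_by_word_attention(Y, H, W_y, W_h, W_r, w, W_t, W_p, W_x)
print(h_star.shape, alphas.shape)   # (4,) (3, 6)
```

Each row of `alphas` is the attention distribution of one hypothesis word over the premise words, which is the kind of matrix a word-to-word attention heat map would display.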

Result and Discussion

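As the paper's results table shows, even without Attention the model in this paper outperforms the LSTM of Bowman et al. (2015). The authors attribute this mainly to two factors:

1. Processing the premise and the hypothesis one after the other models the dependencies between them better;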

We argue this is due to information being able to flow from the part of the model that processes the premise to the part that processes the hypothesis.
One interpretation is that the LSTM is approximating a finite-state automaton for RTE

2. The word embeddings are not fine-tuned during training, so overfitting is less severe.

Another difference to Bowman et al.'s model is that we are using word2vec instead of GloVe for word representations and, more importantly, do not fine-tune these word embeddings. The drop in accuracy from train to test set is less severe for our models, which suggest that fine-tuning word embeddings could be a cause of overfitting.

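On top of this baseline, both Attention and word-by-word Attention bring further improvements. The authors also tried two-way Attention, which simply swaps the two sentences and computes the Attention a second time, but it did not help. They suggest this may be because the entailment relation is not symmetric.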
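The two-way variant amounts to running the same pair encoder twice with the roles swapped. The sketch below assumes the two resulting representations are concatenated before classification, which this note does not spell out. `toy_encode_pair` is a hypothetical stand-in for the real attention model, chosen only because it is asymmetric in its arguments.

```python
import numpy as np

def two_way_representation(encode_pair, premise, hypothesis):
    """Encode the pair in both directions and concatenate (the combination is an assumption)."""
    forward = encode_pair(premise, hypothesis)    # usual direction
    backward = encode_pair(hypothesis, premise)   # same encoder, roles swapped
    return np.concatenate([forward, backward])

def toy_encode_pair(first, second):
    """Hypothetical stand-in encoder; not the model from the paper."""
    return np.tanh(first.mean(axis=0) - 2.0 * second.mean(axis=0))

rng = np.random.default_rng(0)
premise = rng.normal(size=(7, 4))      # 7 word vectors of size 4
hypothesis = rng.normal(size=(3, 4))   # 3 word vectors of size 4
pair_repr = two_way_representation(toy_encode_pair, premise, hypothesis)
print(pair_repr.shape)   # (8,)
```

Because entailment is directional, the swapped pass answers a different question (does the hypothesis entail the premise?), which fits the authors' explanation for why the extra pass did not help.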

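Finally, the authors visualize the Attention weights to show their effect intuitively. In the paper's examples: (a) rides aligns with riding and camel with animal, which leads to entailment; (b) pink aligns with blue, which leads to contradiction.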

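Overall, the main contribution of this paper is word-by-word Attention for computing the sentence-pair representation, and it was the first time an end-to-end neural network model achieved state-of-the-art results on the RTE task. This way of computing a sentence-pair representation also reappears in much later research, especially work that involves pair encoding, such as the design of the question-context interaction in machine reading comprehension. I had not read any RTE papers before. I came across this one because many of the machine reading comprehension papers I have been reading cite it, so I came over to study it.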