Attention: A Review, Some Thoughts, and Experiments

07-18


From the column 數據革命 (Data Revolution)

(1) Intuition in Deep Learning

Replacing a 3 × 3 convolution with a 3 × 1 and a 1 × 3 convolution

Gate design in LSTM

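The attention mechanism is inspired by human visual attention. When people look at a scene, they generally do not take in the whole scene from start to finish every time; instead they attend to a particular part as needed. Moreover, once people find that the thing they want to observe often appears in a certain part of a scene, they learn to direct their attention to that part when a similar scene appears again.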

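Attention concentrates more focus on the useful parts; in essence, attention is weighting. Note, however, that for the same image, the attention weight distribution should differ depending on the task the viewer is performing.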

Based on the intuition above, attention can be used for:

1. Learning a weight distribution:

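The weighting can keep all components and weight each of them (soft attention), or it can select a subset of components from the distribution by some sampling strategy (hard attention), which is commonly trained with RL;
The weighting can be applied to the original image or to a feature map;
The weighting can act along the time dimension, the spatial dimension, the mapping dimension, or the feature dimension.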

2. Task focusing and decoupling (via an attention mask)

In a multi-task model, attention can redistribute the weights over the features so that each task focuses on its own key features.

(2) History

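The attention mechanism was first proposed in computer vision. The idea probably dates back to the 1990s, but it really took off with the 2014 Google DeepMind paper "Recurrent Models of Visual Attention", which used attention in an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate", used an attention-like mechanism to perform translation and alignment jointly on a machine translation task; their work is regarded as the first application of attention to NLP. Attention was then widely adopted in all kinds of NLP tasks built on RNN, CNN, and other neural network models. In 2017, the Google machine translation team published "Attention Is All You Need", which makes heavy use of self-attention to learn text representations. Self-attention has since become a research hotspot and is being explored across many NLP tasks. The figure below shows the rough trend of attention research: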

(3) Attention Design

3.1 Definition

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Google's 2017 paper "Attention Is All You Need" gives an abstract definition of attention:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

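Attention maps a query and a set of key-value pairs to an output. Q, K, and V are all vectors; the output is a weighted sum over V, where the weights are given by the similarity between Q and K.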

Computing the attention-weighted value takes three steps:

Compute the similarity score between Q and K
Normalize the scores (attention weights)
Weight V by the normalized scores
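
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration for a single query vector, assuming the scaled dot product from the formula in 3.1 as the similarity function; the names `softmax` and `attention` and the toy vectors are made up for this example.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (illustrative helper)."""
    d_k = len(query)
    # Step 1: similarity score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    # Step 2: normalize the scores into attention weights.
    weights = softmax(scores)
    # Step 3: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy data (assumed for illustration only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
output, weights = attention(query, keys, values)
```

Keys that point in the same direction as the query receive the larger weights, so their values dominate the output.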

3.2 Taxonomy

3.2.1 By output

Soft attention
Hard attention

Soft attention outputs a probability distribution over the candidates; hard attention outputs a one-hot vector.
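
To make the contrast concrete, here is a small sketch under stated assumptions: the hard variant samples one index from the softmax distribution, which is only one possible sampling strategy, and all function names and numbers are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(scores, values):
    """Keep every value and weight it by its probability."""
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

def hard_attention(scores, values, rng):
    """Sample a single value; the weight vector is one-hot."""
    probs = softmax(scores)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    one_hot = [1.0 if i == index else 0.0 for i in range(len(probs))]
    return values[index], one_hot

# Toy scores and values (assumed for illustration only).
scores = [2.0, 0.5, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft_out, soft_w = soft_attention(scores, values)
hard_out, hard_w = hard_attention(scores, values, random.Random(0))
```

Because the sampling step is not differentiable, the hard variant cannot be trained by plain backpropagation, which is why RL-style training is commonly used for it.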

3.2.2 By scope of attention

Effective Approaches to Attention-based Neural Machine Translation

Global attention

Global attention, as the name suggests, applies attention weights over the entire feature mapping.

Local attention

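There are two kinds of local attention. The first uses a hard global attention to lock onto a position, then applies attention weights within a local window around that position.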
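
The first kind of local attention can be sketched as follows. This is an illustration under assumptions rather than the method of the cited paper: the position is chosen here by a simple argmax over the global scores, and `half_width` is a hypothetical window parameter.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(scores, values, half_width):
    """Pick a center position, then attend only inside a window around it."""
    n = len(scores)
    # Hard step: lock onto the highest-scoring position (assumed strategy).
    center = max(range(n), key=lambda i: scores[i])
    lo = max(0, center - half_width)
    hi = min(n, center + half_width + 1)
    # Soft step: normalize only the scores inside the window.
    weights = [0.0] * n
    weights[lo:hi] = softmax(scores[lo:hi])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights, center

# Toy data (assumed for illustration only).
scores = [0.1, 0.2, 3.0, 0.5, 0.1, 0.0]
values = [[float(i)] for i in range(6)]
output, weights, center = local_attention(scores, values, half_width=1)
```

Positions outside the window get a weight of exactly zero, so the cost of the weighted sum depends on the window size rather than on the full sequence length.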

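The second kind arises in certain application scenarios. For example, in the question "Where is the football?", the words "where" and "football" summarize the sentence. This kind of attention depends only on each word in the sentence itself. "Location-based" means that the attention has no additional object to attend to: the attention vector is q itself, i.e. Q=K, and the attention score is: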

$score(Q,K)=activation(W^TQ+b)$
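
A minimal sketch of this location-based score, assuming `tanh` as the activation and hand-picked values for `w` and `b` (in a real model both would be learned):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def location_based_attention(tokens, w, b):
    """Score each token vector from the token alone, then pool the sentence."""
    # score_i = activation(w^T x_i + b); no separate query is involved.
    scores = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    summary = [sum(a * x[j] for a, x in zip(weights, tokens)) for j in range(dim)]
    return summary, weights

# Toy token vectors and parameters (assumed for illustration only).
tokens = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0]]
w = [1.0, 0.0]
b = 0.0
summary, weights = location_based_attention(tokens, w, b)
```

Each weight depends only on its own token vector, which matches the description that this attention relates only to each word itself.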

3.2.3 By score function

(4) Applications

Chatbot intent classification

Approach: self-attention + dot-product score

Results:

Observations:

Attention automatically masked out the <PAD> tokens;
Keywords that matter more for classification received higher attention weights.
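
One way to inspect which tokens receive attention is sketched below. It is a toy illustration, not the author's model: the embeddings are hand-made, `<PAD>` is assumed to be a zero vector, and Q = K = V with no learned projections, so the numbers do not reproduce the results above.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    outputs, weight_rows = [], []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weight_rows.append(w)
        outputs.append([sum(wi * x[j] for wi, x in zip(w, X)) for j in range(d)])
    return outputs, weight_rows

def received_attention(weight_rows):
    """Average weight each token receives across all query positions."""
    n = len(weight_rows)
    return [sum(row[j] for row in weight_rows) / n for j in range(n)]

# Hypothetical tokens and embeddings (assumed for illustration only).
tokens = ["book", "flight", "<PAD>"]
X = [[1.0, 1.0], [1.0, 0.5], [0.0, 0.0]]
outputs, weight_rows = self_attention(X)
received = received_attention(weight_rows)
```

Printing `received` next to `tokens` gives a per-token view of the kind described in the observations; in a trained model the low weight on `<PAD>` would come from learned parameters rather than from a zero embedding.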

(5) Thoughts

Multi-step load forecasting

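In a multi-task, multi-output model, each forecast step should focus on different features, so one could learn a mask attention over the feature mapping for each step.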
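
As a rough sketch of this idea, each forecast step could own a vector of mask logits that is turned into a softmax mask over the shared features. Everything here is hypothetical: the feature values, the hand-set logits (which would be learned in practice), and the function name.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step_feature_masks(features, mask_logits):
    """Apply one softmax feature mask per forecast step."""
    masks, masked = [], []
    for logits in mask_logits:
        mask = softmax(logits)  # attention weights over features for this step
        masks.append(mask)
        masked.append([m * f for m, f in zip(mask, features)])
    return masked, masks

# Hypothetical shared features, e.g. recent load, temperature, hour of day.
features = [0.8, 0.3, 0.5]
# One row of logits per forecast step; hand-set here, learned in a real model.
mask_logits = [[2.0, 0.0, 0.0],
               [0.0, 2.0, 0.0]]
masked, masks = step_feature_masks(features, mask_logits)
```

Each step's output head would then consume its own row of `masked`, so the steps share the feature mapping but weight it differently.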

Load forecasting with anomalous data masked

Append an attention layer after the original feature mapping to automatically mask anomalous inputs and improve the model's robustness.

(6) References

Paper

Hierarchical Attention Networks for Document Classification
Attention Is All You Need
Neural Machine Translation by Jointly Learning to Align and Translate
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Fully Convolutional Network with Task Partitioning for Inshore Ship Detection in Optical Remote Sensing Images
Effective Approaches to Attention-based Neural Machine Translation

GitHub

pytorch-attention
seq2seq
PyTorch-Batch-Attention-Seq2seq

Blog

一文讀懂「Attention is All You Need」| 附代碼實現
Attention Model（mechanism）的套路
【計算機視覺】深入理解Attention機制
自然語言處理中的自注意力機制
Encoder-Decoder模型和Attention模型