Tags:

CVPR, Deep Learning, Computer Vision

[Paper Notes] Harmonious Attention Network for Person Re-Identification

04-29

Paper: Harmonious Attention Network for Person Re-... Accepted in CVPR 2018

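Motivation:

1. Most current deep learning methods for person re-identification reuse existing classic deep network architectures. These networks usually have a very large number of parameters, so they overfit easily when trained on small datasets and fail to reach good results.

2. Existing methods use coarse region-level attention and ignore fine-grained pixel-level saliency. As a result, the persons in the images cannot be aligned, and the methods are easily affected by cluttered backgrounds.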

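Method:

HA-CNN architecture:

HA-CNN uses a multi-branch network architecture in order to reduce the number of network parameters and minimize model complexity.

The HA-CNN network contains two branches: a local branch and a global branch.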

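To reduce the number of parameters, the global and local branches share the parameters of the first conv layer, and all local streams at the same Inception level share parameters. A cross-entropy classification loss is used to optimize person classification for both the global and the local branch.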
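As a rough illustration of training both branches with a classification loss, the sketch below computes a cross-entropy term per branch with NumPy. Summing the two terms with equal weight, along with the names `cross_entropy` and `total_loss` and the example logits, is an assumption made for illustration and is not taken from the paper.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw class scores and the true class index."""
    shifted = logits - logits.max()                  # subtract max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def total_loss(global_logits, local_logits, label):
    """Assumed combination: equal-weight sum of the two branch losses."""
    return cross_entropy(global_logits, label) + cross_entropy(local_logits, label)

# Hypothetical identity scores for 3 identities from each branch.
global_logits = np.array([2.0, 0.5, -1.0])
local_logits = np.array([1.5, 0.2, 0.1])
loss = total_loss(global_logits, local_logits, label=0)
```

Each branch has its own classifier in this sketch, so either term can also be monitored separately during training.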

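The global branch is built from 3 Inception-A and 3 Inception-B blocks.

Each local stream is built from 3 Inception-B blocks, and all local streams have the same structure.

The network learns a set of complementary attention maps and uses cross-attention interaction learning to improve their harmony and compatibility.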

HA module

(a) Soft Attention, which consists of (b) Spatial Attention and (c) Channel Attention.

(d) Hard Regional Attention (part-wise).

The HA module combines hard regional attention [11], soft spatial attention [34], and channel attention [9].

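(I) Soft Spatial-Channel Attention:

The input to the HA module is a 3-D tensor $X^{l} \in \mathbb{R}^{h \times w \times c}$, where h, w, and c denote the number of pixels along the height, width, and channel dimensions, and l denotes the level of the module within the whole network. $S^{l} \in \mathbb{R}^{h \times w \times 1}$ and $C^{l} \in \mathbb{R}^{1 \times 1 \times c}$ denote the spatial and channel attention maps, respectively. The authors factorize the attention tensor with a two-branch unit: one branch builds the spatial attention $S^{l}$ and the other builds the channel attention $C^{l}$. The full soft attention is then obtained from $S^{l}$ and $C^{l}$ by tensor multiplication:

$A^{l} = S^{l} \times C^{l}$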
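The tensor multiplication of a spatial map and a channel map can be sketched with NumPy broadcasting. The sizes and random values below are arbitrary assumptions for illustration. The sketch only shows how an h×w×1 map and a 1×1×c map expand to an h×w×c attention tensor, and how many values the factorized form has to predict.

```python
import numpy as np

def full_soft_attention(spatial, channel):
    """Combine an (h, w, 1) spatial map and a (1, 1, c) channel map into (h, w, c)."""
    # Broadcasting multiplies every spatial weight with every channel weight.
    return spatial * channel

h, w, c = 4, 2, 3                      # illustrative sizes, not from the paper
rng = np.random.default_rng(0)
S = rng.random((h, w, 1))              # stand-in spatial attention map
C = rng.random((1, 1, c))              # stand-in channel attention map
A = full_soft_attention(S, C)

# The factorized form predicts h*w + c values instead of h*w*c.
factorized_size = S.size + C.size
full_size = A.size
```

With these sizes the two factors hold 11 values while the full tensor holds 24, and the gap grows quickly for realistic feature-map sizes.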

(1) Spatial Attention

It consists of four layers:

1. a global cross-channel averaging pooling layer;

$S_{input}^{l} = \frac{1}{c} \sum_{i=1}^{c} X_{1:h,1:w,i}^{l}$

2. a conv layer with stride 2 and a 3 × 3 filter;

3. a bilinear resizing layer;

4. a conv layer with a 1 × 1 filter.
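The four layers listed above can be sketched in plain NumPy as follows. Zero padding of 1 in the conv layer, the corner-aligned bilinear grid, the omission of biases and activations, and all function names are assumptions made for this sketch rather than details taken from the paper.

```python
import numpy as np

def conv3x3_stride2(x, kernel):
    """3x3 convolution with stride 2 and zero padding 1 on a 2-D map."""
    padded = np.pad(x, 1)
    out_h = (x.shape[0] + 1) // 2
    out_w = (x.shape[1] + 1) // 2
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = padded[2 * i:2 * i + 3, 2 * j:2 * j + 3]
            out[i, j] = np.sum(patch * kernel)
    return out

def resize_bilinear(x, out_h, out_w):
    """Bilinear resize of a 2-D map on a corner-aligned sampling grid."""
    in_h, in_w = x.shape
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, in_h - 1)
    c1 = np.minimum(c0 + 1, in_w - 1)
    fr = (rows - r0)[:, None]                # vertical interpolation weights
    fc = (cols - c0)[None, :]                # horizontal interpolation weights
    top = x[r0][:, c0] * (1 - fc) + x[r0][:, c1] * fc
    bottom = x[r1][:, c0] * (1 - fc) + x[r1][:, c1] * fc
    return top * (1 - fr) + bottom * fr

def spatial_attention(x, kernel, scale):
    """x: (h, w, c) features -> (h, w, 1) spatial attention map."""
    h, w, _ = x.shape
    s = x.mean(axis=2)                       # 1. cross-channel average pooling
    s = conv3x3_stride2(s, kernel)           # 2. 3x3 conv with stride 2
    s = resize_bilinear(s, h, w)             # 3. bilinear resize back to h x w
    s = scale * s                            # 4. 1x1 conv on one channel is a scalar scale
    return s[:, :, None]

rng = np.random.default_rng(0)
x = rng.random((8, 4, 3))                    # illustrative feature tensor
kernel = rng.standard_normal((3, 3))         # stand-in for learned conv weights
S = spatial_attention(x, kernel, scale=0.5)
```

Because the pooling and resizing steps have no weights, this sketch has only the 9 kernel values and the single scale as learnable quantities.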

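(2) Channel Attention

Channel attention is modeled by a 4-layer squeeze-and-excitation sub-network:

1. global averaging pooling;

2. a conv layer;

3. a conv layer;

4. a 1×1×c conv layer added after the spatial attention and channel attention are multiplied, which yields the full soft attention.

Finally, a sigmoid operation normalizes the full soft attention into the range between 0.5 and 1.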
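A minimal NumPy sketch of these steps is given below. The reduction ratio `r`, the ReLU activations, the random weights, and the function names are assumptions made for illustration. With ReLU before the sigmoid, the sigmoid input is non-negative, which is one way the stated range of 0.5 to 1 can arise.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w_squeeze, w_excite):
    """x: (h, w, c) features -> (1, 1, c) channel attention."""
    pooled = x.mean(axis=(0, 1))             # 1. global average pooling -> (c,)
    hidden = relu(pooled @ w_squeeze)        # 2. 1x1 conv reducing c -> c // r
    out = relu(hidden @ w_excite)            # 3. 1x1 conv restoring c // r -> c
    return out[None, None, :]

def full_soft_attention(spatial, channel, w_mix):
    """Multiply the two maps, mix channels with a 1x1 conv, then apply sigmoid."""
    combined = spatial * channel             # (h, w, 1) * (1, 1, c) -> (h, w, c)
    mixed = relu(combined @ w_mix)           # 4. 1x1 conv over the channel axis
    return sigmoid(mixed)

h, w, c, r = 4, 2, 8, 4                      # illustrative sizes; r is an assumed ratio
rng = np.random.default_rng(0)
x = rng.random((h, w, c))
S = rng.random((h, w, 1))                    # stand-in spatial attention map
w_squeeze = rng.standard_normal((c, c // r))
w_excite = rng.standard_normal((c // r, c))
w_mix = rng.standard_normal((c, c))

C = channel_attention(x, w_squeeze, w_excite)
A = full_soft_attention(S, C, w_mix)
```

The attended features would then be the element-wise product of `A` with the input tensor, so no position or channel is suppressed below half of its original value in this sketch.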

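(II) Hard Regional Attention

Hard attention learns the coordinates of T latent discriminative regions in each input image at stage $l$. The authors model this regional attention by learning a transformation matrix:

$\begin{bmatrix} s_{h} & 0 & t_{x} \\ 0 & s_{w} & t_{y} \end{bmatrix}$

where $s_{h}$, $s_{w}$ are scale factors and $t_{x}$, $t_{y}$ give the 2-D spatial position. To reduce model complexity, $s_{h}$ and $s_{w}$ are fixed, so only the two variables $t_{x}$, $t_{y}$ remain per region and the output has dimension 2T.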

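Hard attention is learned with a two-layer network:

1. the first-layer output of the channel attention is used as the input to the first FC layer;

2. the second layer uses tanh to convert the region position parameters into percentages.

Hard regional attention applies the output position parameters to the corresponding network block to generate T different parts, which are then fed into the local branch.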
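The two-layer position predictor and the part extraction can be sketched as follows. This is a simplification under stated assumptions: regions are cut out with integer cropping instead of a differentiable sampler, offsets in [-1, 1] are mapped linearly to a window centre, and the scale factors, sizes, and function names are hypothetical.

```python
import numpy as np

def region_positions(features, w_fc, b_fc, num_regions):
    """Map a c-D vector to T (t_x, t_y) pairs in [-1, 1] via an FC layer and tanh."""
    raw = features @ w_fc + b_fc             # FC layer with 2T outputs
    return np.tanh(raw).reshape(num_regions, 2)

def crop_region(x, t_x, t_y, s_h, s_w):
    """Crop a fixed-size window from x (h, w, c) centred at the relative offset."""
    h, w, _ = x.shape
    win_h = max(1, int(round(s_h * h)))      # fixed scale factors set the window size
    win_w = max(1, int(round(s_w * w)))
    # Convert offsets in [-1, 1] to a window centre in pixel coordinates.
    center_y = (t_y + 1.0) / 2.0 * (h - 1)
    center_x = (t_x + 1.0) / 2.0 * (w - 1)
    # Clamp the top-left corner so the window stays inside the feature map.
    top = int(np.clip(round(center_y - win_h / 2), 0, h - win_h))
    left = int(np.clip(round(center_x - win_w / 2), 0, w - win_w))
    return x[top:top + win_h, left:left + win_w, :]

T, c = 4, 6                                  # illustrative region and channel counts
rng = np.random.default_rng(0)
features = rng.random(c)                     # stand-in for the channel-attention vector
w_fc = rng.standard_normal((c, 2 * T))
b_fc = np.zeros(2 * T)
x = rng.random((16, 8, c))                   # feature map to be split into parts

positions = region_positions(features, w_fc, b_fc, T)
parts = [crop_region(x, t_x, t_y, s_h=0.25, s_w=0.5) for t_x, t_y in positions]
```

Since the scale factors are fixed, every part has the same size, which lets the T local streams share one architecture.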

(III) Cross-Attention Interaction Learning

Exchanging local features and global features across branches makes the joint learning of soft attention and hard attention more harmonious:

$\bar{X}_{L}^{(l,k)} = X_{L}^{(l,k)} + X_{G}^{(l,k)}$

where $X_{G}^{(l,k)}$ is generated by the HA module at stage $\left( l+1 \right)$.
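The element-wise sum can be sketched as below, assuming the k-th global region is an axis-aligned window of the global feature map with the same shape as the local stream feature. The window coordinates `top` and `left` and the function name are hypothetical and stand in for the output of the hard attention.

```python
import numpy as np

def cross_attention_interaction(local_feat, global_feat, top, left):
    """Add the matching region of the global feature to a local stream feature."""
    rh, rw, _ = local_feat.shape
    global_region = global_feat[top:top + rh, left:left + rw, :]
    if global_region.shape != local_feat.shape:
        raise ValueError("region falls outside the global feature map")
    return local_feat + global_region

global_feat = np.arange(8 * 4 * 3, dtype=float).reshape(8, 4, 3)   # illustrative values
local_feat = np.ones((2, 4, 3))
fused = cross_attention_interaction(local_feat, global_feat, top=2, left=0)
```

The shape check makes the assumption explicit: the sum is only defined when the local feature and the selected global region cover the same extent.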

Experiments

Datasets: Market-1501, DukeMTMC-reID, CUHK03

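Comparison of the importance of different attention types

Comparison of the importance of global and local features

Visualization of SA and HA at different stages

Comparison of parameter counts with classic networks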
