Notes on Learning to Detect Human-Object Interaction

04-25

1、To reach a higher level of semantic image understanding, the paper proposes the HICO-DET dataset, which covers 600 HOI categories over 80 object categories (each object category has multiple interaction classes).

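2、The core idea is the Interaction Pattern, which characterizes the spatial relationship between the two bounding boxes. (Does this lose other kinds of relationship information? Reading further answers this: once the features from the first two streams are added, the information is much more complete.) Side note: datasets available in this area include V-COCO, VG, and SVD.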

3、HO-RCNN = proposals of human-object pairs + HOI classification scores produced by a ConvNet

4、All pairs are first filtered according to certain rules, using a method similar to the language priors proposed in this article

5、

Multi-stream architecture of HO-RCNN

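（1）The first two streams use a DNN to extract local information around humans and objects.

The third stream extracts features of the human-object spatial pair; its output is a binary classifier for each HOI class. (I first thought the FC layers of the three streams were concatenated, which would increase the parameter count. That was a misreading: the outputs are combined by an element-wise sum, not concatenation, so the FC size does not change.)

（2）The FC layers of the three streams each produce confidence scores for each HOI class of interest (is K here 600?).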
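The element-wise sum described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `fuse_stream_scores` and the toy scores are hypothetical, and K is set to 3 rather than the full number of HOI classes.

```python
def fuse_stream_scores(human_scores, object_scores, pairwise_scores):
    """Combine per-class scores from the three streams by element-wise sum.

    Each argument is a list of K scores, one per HOI class. The result
    also has K entries, so fusing does not enlarge any layer.
    """
    if not (len(human_scores) == len(object_scores) == len(pairwise_scores)):
        raise ValueError("all streams must score the same K classes")
    return [h + o + p
            for h, o, p in zip(human_scores, object_scores, pairwise_scores)]


# Toy example with K = 3 HOI classes (hypothetical numbers).
human = [0.5, -1.0, 2.0]
obj = [1.5, 0.0, -0.5]
pair = [-1.0, 0.5, 0.25]
fused = fuse_stream_scores(human, obj, pair)
print(fused)  # [1.0, -0.5, 1.75]
```

Concatenating the three outputs would instead give a vector of length 3K and require an extra layer on top, which is the difference the note above points out.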

6、

Interaction Pattern for the pairwise stream

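An Interaction Pattern is a two-channel binary image; features are extracted from it with 2D filters.

Two points to note: (1)、the representation of a bounding-box pair needs to be translation invariant

(2)、the aspect ratio varies with the attention window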
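A minimal sketch of how such a pattern could be built, to make the two points concrete. It rests on assumptions not stated in these notes: the attention window is taken to be the tightest box enclosing both boxes, boxes are `(x1, y1, x2, y2)` tuples, and the varying aspect ratio is handled by scaling the longer side to `size` and zero padding the rest. The function name `interaction_pattern` is hypothetical.

```python
import math


def interaction_pattern(human_box, object_box, size=64):
    """Return a [2][size][size] binary pattern for one human-object pair.

    Channel 0 marks the human box and channel 1 the object box, both
    expressed relative to the attention window.
    """
    # Assumed attention window: tightest box enclosing both boxes.
    wx1 = min(human_box[0], object_box[0])
    wy1 = min(human_box[1], object_box[1])
    wx2 = max(human_box[2], object_box[2])
    wy2 = max(human_box[3], object_box[3])
    side = max(wx2 - wx1, wy2 - wy1)
    if side <= 0:
        raise ValueError("boxes must span a positive extent")

    pattern = []
    for box in (human_box, object_box):
        # Subtracting the window origin removes absolute position, so a
        # joint translation of both boxes gives the same pattern.
        # Dividing both axes by the same side keeps the aspect ratio;
        # the unused rows or columns stay zero (padding).
        c1 = math.floor((box[0] - wx1) * size / side)
        r1 = math.floor((box[1] - wy1) * size / side)
        c2 = math.ceil((box[2] - wx1) * size / side)
        r2 = math.ceil((box[3] - wy1) * size / side)
        channel = [[1 if (r1 <= r < r2 and c1 <= c < c2) else 0
                    for c in range(size)]
                   for r in range(size)]
        pattern.append(channel)
    return pattern


# The same pair shifted by (100, 50) produces an identical pattern.
p1 = interaction_pattern((10, 20, 50, 100), (40, 60, 90, 100), size=8)
p2 = interaction_pattern((110, 70, 150, 150), (140, 110, 190, 150), size=8)
print(p1 == p2)  # True
```

Warping the window to a square without preserving the aspect ratio would be another way to get a fixed-size input; zero padding is chosen here only to keep the sketch short.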

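7、Multi-label classification loss

For a given human and object, more than one interaction verb may describe their relationship. A sigmoid cross entropy loss is therefore applied to each HOI category, and the per-category losses are summed to form the total loss.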
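As a rough illustration of this loss, the sketch below computes a sigmoid cross entropy per class from raw scores and sums over classes. The function names are hypothetical, and the numerically stable form `max(z, 0) - z*y + log(1 + exp(-|z|))` is a standard rewriting of the cross entropy, not something taken from the paper.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Binary cross entropy on a raw score, with label in {0, 1}.

    Equals -[y*log(s) + (1-y)*log(1-s)] for s = sigmoid(logit), written
    in a form that avoids overflow for large |logit|.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multi_label_loss(logits, labels):
    """Sum of independent per-class losses over all HOI classes."""
    if len(logits) != len(labels):
        raise ValueError("need one label per class score")
    return sum(sigmoid_cross_entropy(z, y) for z, y in zip(logits, labels))


# One human-object pair, K = 3 classes, two of which are positive
# (hypothetical scores and labels).
loss = multi_label_loss([2.0, -1.0, 0.0], [1, 0, 1])
print(loss)
```

Because each class gets its own sigmoid rather than sharing a softmax, several classes can be positive for the same pair without competing with each other.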

8、Experiments

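（1）、The 600 HOI classes are split into 138 rare and 462 non-rare classes.

（2）、Evaluation is run in two settings: on only the images that contain the target object, and on all test images.

（3）、Human-object proposals are generated from object detectors trained with Fast RCNN. (Why pick the top 10 humans and the top 10 objects in an image to produce 100 human-object proposals? It feels somewhat unnecessary.)

（4）、The pairwise stream is tried with two structures, one fully connected (FCN) and one convolutional (Conv). The Conv version appears to give somewhat better results.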

Two structures of the pairwise stream
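The proposal step in (3) can be sketched as pairing the highest-scoring human detections with the highest-scoring object detections. The detection format (a dict with `box` and `score` keys) and the function names are hypothetical, chosen only for this example.

```python
from itertools import product


def top_k(detections, k=10):
    """Keep the k detections with the highest confidence score."""
    return sorted(detections, key=lambda d: d["score"], reverse=True)[:k]


def human_object_proposals(human_dets, object_dets, k=10):
    """Pair every kept human with every kept object: at most k * k pairs."""
    return list(product(top_k(human_dets, k), top_k(object_dets, k)))


# Hypothetical detections: 12 humans and 15 objects in one image.
humans = [{"box": (i, 0, i + 10, 20), "score": i / 20} for i in range(12)]
objects = [{"box": (0, i, 15, i + 15), "score": i / 20} for i in range(15)]
pairs = human_object_proposals(humans, objects)
print(len(pairs))  # 100
```

The cap keeps the number of pairs to classify at k × k regardless of how many detections an image contains.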