Paper Notes: "Detecting and Recognizing Human-Object Interactions"

05-01

Introduction

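1. A human-object interaction can be represented as a triplet <human, verb, object>. The paper's model, InteractNet, uses the person's appearance (features extracted from the detected person box, which may capture cues such as pose and action) to estimate an action-type specific density: for each action type, a probability density over where the target object is likely to be.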

2. A brief note on why this matters: human-centric understanding could tie in with our lab's service-robot work... worth thinking through carefully.

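3. The paper comes from several well-known researchers at FAIR, so it builds on the Faster R-CNN framework. Specifically, action classification and density estimation are performed on the human RoI. The density estimator predicts a 4-d Gaussian; for each action type, the model outputs the position of the target object relative to the person.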

4. The model is evaluated on the V-COCO and HICO-DET datasets using Average Precision on triplets, called role AP, and shows a large improvement.

Method

5. Model architecture (see the architecture figure in the paper):

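(1) The human-centric branch uses the features that Faster R-CNN extracts from the human box to perform, for each action, action classification and estimation of the probability density over the target object's location.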

(2) Given a set of candidate boxes, in addition to the object boxes and per-box labels produced by Faster R-CNN, the model assigns each candidate triplet a score:

$S_{h,o}^{a} = s_{h} \cdot s_{o} \cdot s_{h}^{a} \cdot g_{h,o}^{a}$

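The four factors are: the human score, the object score, the score for assigning the action to the human, and the likelihood that the object is the actual target of the interaction. Here $g_{h,o}^{a}$ is computed from $\mu_{h}^{a}$, the 4-d mean location of the target object predicted from the human box and the action.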
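The triplet score can be sketched in a few lines of NumPy. This is only an illustration of the product above for a single human box; the function name `triplet_scores`, the array layout, and the toy numbers are my own assumptions, not taken from the paper's code.

```python
import numpy as np

def triplet_scores(s_h, s_o, s_a_h, g_a_ho):
    """Return S[a, o] = s_h * s_o[o] * s_a_h[a] * g_a_ho[a, o] for one human box.

    All names and shapes are hypothetical, chosen for this sketch only.
    """
    s_o = np.asarray(s_o, dtype=float)        # (num_objects,) object detection scores
    s_a_h = np.asarray(s_a_h, dtype=float)    # (num_actions,) action scores for this human
    g_a_ho = np.asarray(g_a_ho, dtype=float)  # (num_actions, num_objects) target compatibility
    # Broadcasting expands the per-action and per-object vectors to a full table.
    return s_h * s_a_h[:, None] * s_o[None, :] * g_a_ho

# Toy values: one human, two candidate objects, two actions.
s_h = 0.9
s_o = [0.8, 0.6]
s_a_h = [0.7, 0.2]
g = [[0.9, 0.1],
     [0.5, 0.5]]

S = triplet_scores(s_h, s_o, s_a_h, g)
best_object = S.argmax(axis=1)  # index of the highest-scoring object for each action
```

Because the score is a plain product, a confident detection is not enough on its own: an object far from the predicted target location gets a small $g_{h,o}^{a}$ and therefore a low triplet score.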

Model Components

There are four main components: Object Detection, Action Classification, Target Localization, and Interaction Recognition.

(1) Action Classification: one person can be involved in several actions at once, so this is a multi-label classification problem, handled with binary sigmoid classifiers.

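(2) Target Localization: predicting the target object's location directly from human appearance is difficult, so the authors instead predict a probability density over possible locations. This density (a Gaussian whose mean is predicted from the person's appearance and the action) is then combined with the actual locations of the detected objects to pinpoint the target object.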

(3) The predicted density is a 4-d Gaussian; its mean $\mu_{h}^{a}$ is the 4-d offset of the target object for action a.
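The components above can be sketched as follows: independent sigmoids for the multi-label action scores, and a Gaussian compatibility between a candidate object's 4-d offset and the predicted mean $\mu_{h}^{a}$. These notes do not spell out the offset encoding or the Gaussian width, so this sketch assumes the usual box-regression encoding (center offsets normalised by the human box size, log size ratios) and an illustrative `sigma=0.3`. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Independent per-action sigmoid, so several actions can be active at once."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def center_size(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1

def encode_relative_box(human, obj):
    """4-d offset of obj relative to human (assumed box-regression style encoding)."""
    hx, hy, hw, hh = center_size(human)
    ox, oy, ow, oh = center_size(obj)
    return np.array([(ox - hx) / hw, (oy - hy) / hh,
                     np.log(ow / hw), np.log(oh / hh)])

def compatibility(b_oh, mu, sigma=0.3):
    """Unnormalised Gaussian score; sigma is an assumed, illustrative width."""
    d2 = np.sum((np.asarray(b_oh) - np.asarray(mu)) ** 2)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

# Toy example: a person box and an object box to its right.
human = (0.0, 0.0, 10.0, 20.0)
obj = (10.0, 5.0, 20.0, 15.0)
b = encode_relative_box(human, obj)

g_match = compatibility(b, b)                                    # predicted mean hits the object
g_shift = compatibility(b, b + np.array([0.3, 0.0, 0.0, 0.0]))   # mean is off by one sigma in x

action_scores = sigmoid([2.0, -1.0, 0.0])  # logits for three hypothetical actions
```

The compatibility is 1 when the predicted mean coincides with the object's encoded offset and decays smoothly as they move apart, which is what lets detected objects "snap" the prediction to a concrete box.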

Multi-task Training: the overall loss is the sum of all loss terms: (1) the classification and regression losses of the object detection branch, (2) the action classification and target localization losses of the human-centric branch, and (3) the action classification loss of the interaction branch. All loss terms have a weight of one, except the action classification term in the human-centric branch, which has a weight of two.
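As a minimal sketch of this weighting, the overall loss is a weighted sum in which only the human-centric action classification term gets weight two. The dictionary keys and the toy loss values below are hypothetical names for illustration, not identifiers from the paper.

```python
# Weights as stated in the notes: all ones, except the human-centric
# action classification term, which is weighted by two.
LOSS_WEIGHTS = {
    "det_cls": 1.0,                 # object detection: classification
    "det_reg": 1.0,                 # object detection: box regression
    "human_action_cls": 2.0,        # human-centric branch: action classification
    "human_target_loc": 1.0,        # human-centric branch: target localization
    "interaction_action_cls": 1.0,  # interaction branch: action classification
}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the per-task losses; raises KeyError if a term is missing."""
    return sum(weights[name] * losses[name] for name in weights)

# Toy per-task loss values (made up for the example).
example_losses = {
    "det_cls": 0.5,
    "det_reg": 0.25,
    "human_action_cls": 0.5,
    "human_target_loc": 0.25,
    "interaction_action_cls": 0.5,
}
overall = total_loss(example_losses)
```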