Temporal Modeling for Video-based Person ReID

05-15

Paper: [1805.02104] Revisiting Temporal Modeling for Video-based Person ReID

Code: jiyanggao/Video-Person-ReID

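After the ECCV deadline I had about two months before my internship started, so I wanted to work on something different. I happened to notice that a lot of people have been working on person reid recently, and I read Liang Zheng's survey on person reid [1]. The problem struck me as quite interesting, and the field is indeed moving very fast. So I read a number of person reid papers. For the sub-problem of video-based person reid, three factors mostly determine performance: the image-level feature extractor (turns an image into a feature); temporal modeling (merges a sequence of consecutive features into one feature); and the loss function (e.g. softmax cross-entropy, hinge loss, triplet loss). What distinguishes video-based person reid from image-based person reid is the video itself, so naturally most work studies how to do temporal modeling (i.e. temporal feature aggregation). One clarification first: a video is generally split into many clips, each containing anywhere from a few frames to a dozen or so. Temporal modeling here means how to merge the frames within one clip into a single feature, i.e. clip-level temporal modeling, not how to merge all the clips of a video (video-level modeling).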
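To make the clip-level vs. video-level distinction concrete, here is a minimal sketch in plain Python. The function names, the clip length of 4, the non-overlapping split, and dropping leftover frames are all assumptions for illustration; the sampling strategy in the paper and repo may differ. A simple mean stands in for whatever clip-level temporal modeling is used.

```python
from typing import List, Sequence


def split_into_clips(frames: Sequence, clip_len: int) -> List[list]:
    """Split a frame sequence into consecutive, non-overlapping clips.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    n_clips = len(frames) // clip_len
    return [list(frames[i * clip_len:(i + 1) * clip_len]) for i in range(n_clips)]


def mean_feature(features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length feature vectors (stand-in aggregator)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


# Toy "video": 10 frames, each already mapped to a 2-dim image-level feature.
frame_features = [[float(t), 1.0] for t in range(10)]

# Clip-level temporal modeling: one feature per clip.
clips = split_into_clips(frame_features, clip_len=4)
clip_features = [mean_feature(clip) for clip in clips]

# Video-level modeling would be a separate step that merges clip_features;
# it is not what "temporal modeling" refers to in this post.
```

With 10 frames and a clip length of 4, this yields two clips (frames 0-3 and 4-7) and therefore two clip-level features.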

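But I feel many past papers share a common problem: at evaluation time they do not standardize the image-level feature extractor and the loss function. Some use a self-designed 3-layer CNN, some use a pretrained VGG; some use softmax cross-entropy, others use hinge loss. As a consequence the reported performance is not directly comparable, because the feature and the loss function affect final performance so strongly that you cannot tell which temporal modeling method actually works. In addition, quite a few papers simply take video analysis techniques from the past few years and apply them to person reid, which is more or less paper padding; personally I do not find the motivation very strong.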

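So I wondered whether I could distill these temporal modeling methods from past work and test them in a unified framework (fixed feature extractor and loss function) to see which one actually works best. I tested four methods: temporal pooling, temporal attention [2,3], RNN [4,5] and 3D conv [6]. The first three work on top of an image feature extractor, whereas 3D conv is itself feature extractor plus temporal modeling and maps a video clip directly to one feature. The image feature extractor is ResNet-50, the loss is triplet + softmax cross-entropy, and the dataset is MARS; for the other settings, see the paper ([1805.02104] Revisiting Temporal Modeling for Video-based Person ReID). I submitted this to BMVC since the timing worked out well. To be honest there is not much novelty here; essentially I just ran some experiments.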
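For readers unfamiliar with the "triplet + softmax cross-entropy" combination, a rough sketch of the two terms on toy clip-level features follows. This is not the code from the repo: the margin of 0.3, the equal weighting of the two terms, the Euclidean distance, and all function names are assumptions for illustration.

```python
import math
from typing import Sequence


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of softmax(logits) against an identity label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin: float = 0.3) -> float:
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin value is an assumption, not taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy clip-level features: positive shares the anchor's identity, negative does not.
anchor = [0.0, 0.0]
positive = [0.3, 0.4]
negative = [3.0, 4.0]
logits = [2.0, 0.5, -1.0]  # hypothetical classifier scores over 3 identities

# Assumed equal weighting of the two terms.
total_loss = softmax_cross_entropy(logits, label=0) + triplet_loss(anchor, positive, negative)
```

In this toy case the negative is already much farther from the anchor than the positive, so the triplet term is zero and only the classification term contributes.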

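My code (based on pytorch; this was my first time learning it, and it feels great to use) is at jiyanggao/Video-Person-ReID, forked from Kaiyang's repo (KaiyangZhou/deep-person-reid). Kaiyang's code is really well written, hats off. I mainly changed the training and test sampling strategies and added temporal modeling; the image feature extractor and loss function parts use the original code.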

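I won't go into the details; let me talk about the experimental results and conclusions. Note that although these methods are distilled from earlier papers, my results and conclusions only reflect performance under my experimental conditions and do not say whether the methods in the earlier papers are good or bad.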

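temporal modeling is useful: it brings an improvement of 2-3 points in mAP and CMC-1.
RNN performs relatively poorly, even worse than single image. My guess as to why RNN worked in earlier work is that their image feature extractors were mostly shallow and not yet optimal in structure; adding an RNN layer effectively deepens the network, so the feature perhaps becomes more representative.
temporal pooling performs very strongly, on par with attention.
As for why attention is not better than pooling, my personal guess is that a clip is actually very short, only 1/4 to 1/2 second, so the quality differences between frames are small (and you can hardly see any change), and there is no need for attention to do a weighted average. Differences between clips, on the other hand, can be large, since a video can be very long; perhaps that is where attention would actually be useful.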
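The guess about attention vs. pooling can be illustrated with a small sketch: attention is a softmax-weighted average of frame features, and when the per-frame scores are nearly equal it collapses to plain average pooling. In a real model the scores would come from a learned layer; here they are passed in by hand, and all names and numbers are assumptions for illustration.

```python
import math
from typing import List, Sequence


def temporal_pooling(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average pooling over the time axis."""
    n = len(frame_features)
    return [sum(column) / n for column in zip(*frame_features)]


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, scores) -> List[float]:
    """Weighted average of frame features with softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_features)) for d in range(dim)]


# One toy clip: 4 frames with 2-dim features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

pooled = temporal_pooling(frames)
# Equal scores (frames of similar quality): same result as pooling.
uniform = temporal_attention(frames, scores=[0.2, 0.2, 0.2, 0.2])
# One frame scored much higher: the output is dominated by that frame.
peaked = temporal_attention(frames, scores=[5.0, 0.0, 0.0, 0.0])
```

Only when the scores differ substantially, as in the `peaked` case, does attention produce something different from pooling, which matches the intuition that it may matter more across clips than within one short clip.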

References

[1] Liang Zheng, Yi Yang, and Alexander G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint, arXiv:1610.02984, 2016

[2] Zhen Zhou, Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In CVPR 2017

[3] Yu Liu, Junjie Yan, and Wanli Ouyang. Quality aware network for set to set recognition. In CVPR 2017

[4] Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang. Person re-identification via recurrent feature aggregation. In ECCV 2016

[5] Niall McLaughlin, Jesus Martinez del Rincon, and Paul Miller. Recurrent convolutional network for video-based person re-identification. In CVPR 2016

[6] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? arXiv preprint, arXiv:1711.09577, 2017