【2D Single-person Pose Estimation】

04-21

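This post is based on the 【極市】 talk by 張鋒, "2D Single-Person Human Pose Estimation and Its Applications". Video: 【極市】張鋒-2D單人人體姿態估計及其應用 (on bilibili). Watching it at 1.5x speed is enough for a quick pass.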

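I bookmarked this video ages ago and never watched it. Today, after finishing the presentation for one of my courses, I finally did, so here are my notes.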

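This is an introductory livestream from January 2018 on 2D single-person pose estimation. It covers applications, current challenges, and mainstream methods, and ends with a summary and Q&A. Because time was limited, the speaker @張曉 picked a set of key papers and methods to walk through. More and more people have been working on pose estimation in recent years and the number of papers grows every year, so it is worth skimming the methods below and keeping what is useful.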

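Thanks again to @張曉. I have followed him on Zhihu for a long time, and he has also set up a community and a discussion group about pose estimation. That open, sharing attitude is something I would like to learn from.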

I have long wanted to write up a summary of recent pose papers, including a comparison of single-person and multi-person methods. Hopefully I will get around to it soon.

OK, on to the main course.

I. Applications

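Pose estimation is a foundational research direction that supplies basic results to many others: Person Re-ID and Action Recognition for uses such as video surveillance; lifting 2D pose to 3D pose, and pose tracking; human-computer interaction, as in Douyin's dance machine (尬舞機) and QQ's 高能舞室 (I have played neither, T T), which presumably work directly from the single RGB camera of a phone or computer; and the familiar AR/VR games and character animation for film production (these are mostly applications of 3D pose).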

II. Current Challenges

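1. Occlusion: either self-occlusion or occlusion by other objects or other people.
Mainstream remedy: use large convolution kernels to enlarge the receptive field.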

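2. Cluttered backgrounds
Mainstream remedy: use some mechanism that steers the network toward the position of the person to be localized. For example, CPM [1] is given a center map, and Hourglass [2] uses a transformation matrix (see its code for details) to place the person at the center of the image.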

3. Lighting
Mainstream remedy: preprocess the input to inject lighting variation by applying an offset to each channel. See Hourglass [2].

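4. Wide variety of poses
Mainstream remedy: pay attention to network depth, among other factors.

5. Varying person scale (mainly, people appear at different sizes in the image), different camera angles, and so on
Mainstream remedy: use multi-scale features.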
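The lighting remedy in item 3 can be illustrated with a rough sketch. It assumes an image stored as nested lists of RGB floats in [0, 1]; the function name `jitter_channels` and the `max_offset` range are hypothetical choices for illustration, not values taken from the talk or from the Hourglass code.

```python
import random

def jitter_channels(image, max_offset=0.1, rng=None):
    """Add one random offset per color channel, then clamp to [0, 1].

    image: nested list, image[row][col] = [r, g, b] with floats in [0, 1].
    max_offset: hypothetical jitter range; tune it for your data.
    """
    rng = rng or random.Random()
    n_channels = len(image[0][0])
    # One offset per channel, shared by every pixel in the image.
    offsets = [rng.uniform(-max_offset, max_offset) for _ in range(n_channels)]
    return [
        [[min(1.0, max(0.0, v + o)) for v, o in zip(pixel, offsets)]
         for pixel in row]
        for row in image
    ]

image = [[[0.5, 0.5, 0.5], [0.95, 0.02, 0.5]]]
augmented = jitter_channels(image, rng=random.Random(0))
```

Sharing a single offset across all pixels of a channel shifts the overall color balance, which is a crude stand-in for a change in illumination.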

III. Mainstream Methods

1. Traditional methods: based on pictorial structures, DPM, etc.
2. Deep-learning-based methods:

2.1 Direct coordinate regression (DeepPose [3])

Motivation: CNNs classify well, so try using a CNN to regress joint coordinates directly.

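The dominant approach at the time was part-based models. These are efficient but have limited expressive power, because they use only local features. Holistic methods had also been proposed, but they did not perform well enough on real problems.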

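This paper treats pose estimation as a joint regression problem and is the first to solve it with a DNN (AlexNet-based): a 7-layer convolutional neural network takes the whole image as input. This has two advantages: first, it uses global information; second, it is simpler than hand-designed features. The paper also cascades networks to improve accuracy.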
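To make "regressing joint coordinates" concrete, here is a small sketch of the regression target. As I understand [3], DeepPose regresses coordinates normalized by a bounding box (subtract the box center, divide by box width and height) and trains with an L2 loss. The function names and the `(cx, cy, w, h)` box layout below are my own assumptions for illustration.

```python
def normalize_joints(joints, box):
    """Map pixel coordinates to box-relative coordinates.

    joints: list of (x, y) in pixels.
    box: (cx, cy, w, h), the box center and size (assumed layout).
    """
    cx, cy, w, h = box
    return [((x - cx) / w, (y - cy) / h) for x, y in joints]

def denormalize_joints(norm_joints, box):
    """Inverse of normalize_joints: back to pixel coordinates."""
    cx, cy, w, h = box
    return [(nx * w + cx, ny * h + cy) for nx, ny in norm_joints]

def l2_loss(pred, target):
    """Sum of squared coordinate errors over all joints."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target))

box = (100.0, 200.0, 50.0, 100.0)
joints = [(100.0, 200.0), (125.0, 250.0)]
targets = normalize_joints(joints, box)  # what the network would be trained to output
```

Normalizing by the box makes the targets independent of where the person sits in the image and how large they appear.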

2.1.1 Multi-stage CNN regression, e.g. Combining local appearance and holistic view [4]

Motivation: can some prior knowledge be added to the network?

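The paper proposes a dual-source CNN, which, as its title says, combines local appearance with the holistic view.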

2.1.2 Multi-stage CNN regression with feedback, e.g. Human Pose Estimation with Iterative Error Feedback [5]

Motivation: can the network learn a multi-stage feedback model?

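This paper is mainly about using feedback to correct errors. Instead of predicting joint positions directly, as earlier methods did, the network predicts a correction to the current estimate, which mostly helps fix mistakes made early on. The authors' feedback approach draws on Recurrent models of visual attention.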
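The "predict a correction, not a position" idea can be sketched as the update y_{t+1} = y_t + e_t. In the toy loop below an oracle that already knows the target stands in for the network, so this shows only the shape of the update rule, not the model in [5]. The step cap `max_step` reflects my recollection that the paper bounds each correction; the value 20 and all function names are assumptions.

```python
import math

def bounded_correction(target, current, max_step=20.0):
    """Step from current toward target, with its length capped at max_step."""
    dx, dy = target[0] - current[0], target[1] - current[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return (dx, dy)
    scale = max_step / dist
    return (dx * scale, dy * scale)

def iterative_refine(initial, target, steps=4, max_step=20.0):
    """Apply y_{t+1} = y_t + e_t for a fixed number of steps."""
    y = initial
    trajectory = [y]
    for _ in range(steps):
        # Oracle stand-in: a trained network would predict e from the
        # image and the current estimate instead of reading the target.
        e = bounded_correction(target, y, max_step)
        y = (y[0] + e[0], y[1] + e[1])
        trajectory.append(y)
    return trajectory

trajectory = iterative_refine((0.0, 0.0), (30.0, 40.0))
```

Starting 50 pixels away with a 20-pixel cap, the estimate needs three updates to reach the target, and the fourth update is a zero correction.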

2.2 Coordinate regression via heatmaps (CPM, Hourglass)
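For readers new to heatmap regression, here is a minimal sketch of the usual encode/decode pair: the training target is a 2D Gaussian centered on the joint, and the prediction is read back as the location of the maximum. The `sigma` value and the function names are assumptions for illustration, and real pipelines would typically use an array library rather than nested lists.

```python
import math

def make_heatmap(width, height, cx, cy, sigma=1.5):
    """Render a 2D Gaussian peaked at (cx, cy); heatmap[y][x] lies in (0, 1]."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_heatmap(heatmap):
    """Return (x, y, score) of the highest-scoring pixel."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best

heatmap = make_heatmap(width=8, height=6, cx=5, cy=2)
print(decode_heatmap(heatmap))  # (5, 2, 1.0)
```

Compared with regressing two numbers per joint, the network predicts a dense map per joint, which keeps the spatial structure of the image in the output.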

2.2.1 CNN + graphical model (pairwise relation, tree-structured relation)

Motivation: people appear at different scales; can a network overcome this and also learn the relations between joints (pairwise relations)?

So in 2014, Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler proposed a CNN + graphical model [6], with a pyramid-style network structure.

Motivation: earlier relation modeling was pairwise; can the tree structure formed by all of a person's joints be modeled as a whole?

In 2016, Xiaogang Wang's group proposed a CNN + tree-structured graphical model [7].

2.2.2 Multi-stage CNN regression

Motivation: graphical models are too inefficient to compute; can we drop them and use multi-stage regression to improve accuracy instead?

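In 2016, Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh proposed Convolutional Pose Machines (CPM) [8], which use large convolution kernels to enlarge the receptive field, together with multi-stage regression.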
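How kernel size affects the receptive field can be checked with the standard recurrence: each layer adds (effective kernel - 1) times the accumulated stride, where a dilated kernel has effective size dilation * (kernel - 1) + 1. The helper below is a generic calculator under that formula, not code from CPM; the layer lists are made-up examples.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.

    layers: list of (kernel, stride, dilation) tuples, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent outputs, in input pixels
    for kernel, stride, dilation in layers:
        effective = dilation * (kernel - 1) + 1
        rf += (effective - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(9, 1, 1)]))             # 9: one 9x9 kernel
print(receptive_field([(3, 1, 1)] * 4))         # 9: four stacked 3x3 kernels
print(receptive_field([(3, 2, 1), (3, 1, 1)]))  # 7: downsampling first
print(receptive_field([(3, 1, 2)]))             # 5: one 3x3 kernel, dilation 2
```

The same formula covers the options listed in the summary: large kernels, dilated convolution, and downsampling all enlarge the receptive field, at different computational cost.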

Motivation: large kernels are still too computationally expensive; can a new architecture reduce computation while still enlarging the receptive field?

In 2016, Alejandro Newell, Kaiyu Yang, and Jia Deng proposed Stacked Hourglass Networks [9], which greatly enlarge the receptive field and reduce computation, again with multi-stage regression.
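"Multi-stage regression" in CPM and stacked hourglass is usually trained with intermediate supervision: every stage outputs a heatmap, and each one is compared with the same ground truth. The sketch below assumes a plain sum of per-stage mean squared errors with equal weights; the function names and the tiny 2x2 heatmaps are illustrative only.

```python
def mse(pred, target):
    """Mean squared error between two equally sized 2D heatmaps."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def multi_stage_loss(stage_outputs, target):
    """Sum the per-stage losses so every stage receives a training signal."""
    return sum(mse(output, target) for output in stage_outputs)

target = [[0.0, 1.0], [0.0, 0.0]]
stages = [
    [[0.0, 0.0], [0.0, 0.0]],  # stage 1: misses the joint
    [[0.0, 0.5], [0.0, 0.0]],  # stage 2: partially right
    [[0.0, 1.0], [0.0, 0.0]],  # stage 3: matches the target
]
print(multi_stage_loss(stages, target))  # 0.3125
```

Supervising every stage, rather than only the last, is what lets later stages refine earlier predictions without the gradient having to travel through the whole stack.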

Motivation: since using a graphical model directly is too slow, why not try to achieve the same goal directly with convolution kernels?

In 2016, Xiaogang Wang's group proposed tree-structured feature learning, Structured feature learning for pose estimation [10].

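Motivation: since using a graphical model directly is too slow, why not regress the joints one after another and let the network learn an implicit dependency among them?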

In 2016, Georgia Gkioxari, Alexander Toshev, and Navdeep Jaitly proposed a chained joint prediction model [11].

Motivation: existing models all focus on performance; can efficiency be improved as well?

In 2016, U. Rafi, I. Kostrikov, J. Gall, and B. Leibe proposed An Efficient Convolutional Network for Human Pose Estimation [12].

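Deep learning has greatly improved pose estimation performance, but at the price of heavy computation. Many methods also train on multiple datasets, apply extra post-processing, and give few details about their design choices. This makes it hard both to compare methods and to reproduce published results. This paper designs an efficient, low-computation network that is trained on a single dataset, without pre-training, and still achieves good results.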

2.2.3 Detection model + regression model

Motivation: feed the information in the detection network's output to the joint regression network [13].

Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning.

Motivation: provide multi-scale features to improve accuracy.

In 2017, a feature pyramid network for pose estimation was proposed [14].

IV. Summary

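• Use multi-scale, multi-resolution network structures

• Build the network from Residual Blocks (hourglass does this too)

• Enlarge the receptive field (large kernel, dilation convolution, hourglass module)

• Preprocessing matters (place the person at the center of the input image, normalize the person's scale to a common scale as far as possible, flip the image, rotate the image)

• Post-processing matters just as much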
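One detail of the flip augmentation listed above is easy to get wrong: mirroring the image also exchanges left and right joints, so their labels must be swapped. The sketch below assumes joints stored as (x, y) pixel coordinates and a hypothetical `swap_pairs` list of left/right index pairs; the actual pairs depend on the dataset's joint order.

```python
def flip_horizontal(image, joints, swap_pairs):
    """Mirror an image and its joint annotations left-to-right.

    image: nested list, image[row][col].
    joints: list of (x, y) pixel coordinates.
    swap_pairs: (left_index, right_index) pairs; dataset-specific.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Mirror the x coordinate of every joint.
    flipped_joints = [(width - 1 - x, y) for x, y in joints]
    # After mirroring, the former right joint is the new left one, and vice versa.
    for left, right in swap_pairs:
        flipped_joints[left], flipped_joints[right] = (
            flipped_joints[right], flipped_joints[left])
    return flipped_image, flipped_joints

image = [[1, 2, 3, 4]]
joints = [(0, 0), (2, 0), (1, 0)]  # left shoulder, right shoulder, head
flipped_image, flipped_joints = flip_horizontal(image, joints, swap_pairs=[(0, 1)])
print(flipped_joints)  # [(1, 0), (3, 0), (2, 0)]
```

Without the swap, the network would be trained with a "left shoulder" label on what is visually a right shoulder in the mirrored image.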

References:

[1]Wei S E, Ramakrishna V, Kanade T, et al. Convolutional pose machines[J]. CoRR, abs/1602.00134, 2016.

[2]Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation[J]. CoRR, abs/1603.06937, 2016.

[3]Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 1653-1660. DOI: http://dx.doi.org/10.1109/CVPR.2014.214

[4]Fan X, Zheng K, Lin Y, et al. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 1347-1355.

[5]Carreira J, Agrawal P, Fragkiadaki K, et al. Human pose estimation with iterative error feedback[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4733-4742.

[6]Tompson J J, Jain A, LeCun Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[C]//Advances in neural information processing systems. 2014: 1799-1807.

[7]Yang W, Ouyang W, Li H, et al. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 3073-3082.

[8]Wei S E, Ramakrishna V, Kanade T, et al. Convolutional pose machines[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4724-4732.

[9]Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 483-499.

[10]Chu X, Ouyang W, Li H, et al. Structured feature learning for pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 4715-4723.

[11]Gkioxari G, Toshev A, Jaitly N. Chained predictions using convolutional neural networks[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 728-743.

[12]Rafi U, Leibe B, Gall J, et al. An Efficient Convolutional Network for Human Pose Estimation[C]//BMVC. 2016, 1: 2

[13]Bulat A, Tzimiropoulos G. Human pose estimation via convolutional part heatmap regression[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 717-732.

[14]Yang W, Li S, Ouyang W, et al. Learning feature pyramids for human pose estimation[C]//The IEEE International Conference on Computer Vision (ICCV). 2017, 2.