F-PC_CNN Paper Walkthrough

08-12

From the column 自動駕駛與視覺感知 (Autonomous Driving and Visual Perception)

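This post walks through the ICRA 2018 paper "A General Pipeline for 3D Detection of Vehicles". As the title suggests, the paper proposes a general pipeline for 3D vehicle detection in autonomous-driving scenes.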

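Core idea in brief: first run 2D detection on the RGB image to get a 2D bounding box for each vehicle, then project it onto the corresponding region of the point cloud and use the RANSAC algorithm to generate 3D box proposals. Next, 3D CAD models of vehicles are used to filter the 3D box proposals (a process similar to template matching). Finally, a two-stage 2D CNN further refines the 3D bounding boxes and classifies them. The F-PC_CNN network architecture is shown in the figure below.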

Figure 1: Network Architecture from F-PC_CNN
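To make the RANSAC step more concrete, here is a minimal sketch. This post does not spell out what model the paper fits with RANSAC, so the example below is only a generic illustration: it fits a 2D line to bird's-eye-view points, which could stand in for the visible side of a car. The function name `ransac_line`, the tolerance, and the synthetic data are all assumptions made for this sketch.

```python
import numpy as np

def ransac_line(points_xy, n_iters=200, inlier_tol=0.05, seed=0):
    """Fit a 2D line with RANSAC.

    Returns (point_on_line, unit_direction, inlier_mask).
    Illustrative only; not the paper's exact formulation.
    """
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points_xy), dtype=bool)
    best_model = None
    for _ in range(n_iters):
        # Sample a minimal set: two distinct points define a line.
        i, j = rng.choice(len(points_xy), size=2, replace=False)
        p, q = points_xy[i], points_xy[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue  # coincident points, no line defined
        d = d / norm
        normal = np.array([-d[1], d[0]])
        # Perpendicular distance of every point to the candidate line.
        dist = np.abs((points_xy - p) @ normal)
        mask = dist < inlier_tol
        if mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (p, d)
    if best_model is None:
        raise ValueError("could not sample two distinct points")
    return best_model[0], best_model[1], best_mask

# Synthetic data (assumption): 100 noisy points on y = 0.5x + 1 plus 30 clutter points.
rng = np.random.default_rng(1)
x = rng.uniform(0, 4, size=100)
side = np.stack([x, 0.5 * x + 1.0], axis=1) + rng.normal(0, 0.01, size=(100, 2))
clutter = rng.uniform(-5, 5, size=(30, 2))
pts = np.vstack([side, clutter])

_, direction, inliers = ransac_line(pts)
print("inliers on the true side:", int(inliers[:100].sum()), "of 100")
```

The point of RANSAC here is robustness: the points selected through a 2D box usually include background and ground points, and a consensus-based fit ignores them instead of averaging them in.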

I. The pipeline is similar to F-PointNet. (The names look alike too?)

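Both start by finding the 2D bounding box of the target object in the 2D RGB image, then locate the corresponding region in the point cloud and perform 3D detection there. This avoids searching the entire point cloud, which greatly improves detection efficiency. The F-PointNet pipeline is shown in Figure 2.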

Figure 2: Pipeline from F-PointNet
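The shared first step, keeping only the 3D points whose image projection falls inside a 2D box, can be sketched as follows. This is a minimal illustration under stated assumptions: the points are already in camera coordinates (a real setup would first apply a LiDAR-to-camera transform), and the projection matrix `P`, the box, and the function name `points_in_2d_box` are made up for the example.

```python
import numpy as np

def points_in_2d_box(points_cam, P, box):
    """Return a boolean mask of points whose projection lies inside `box`.

    points_cam: (N, 3) points in camera coordinates (assumption).
    P:          (3, 4) camera projection matrix.
    box:        (x1, y1, x2, y2) in pixels.
    """
    n = points_cam.shape[0]
    homo = np.hstack([points_cam, np.ones((n, 1))])  # (N, 4) homogeneous
    uvw = homo @ P.T                                 # (N, 3)
    in_front = uvw[:, 2] > 1e-6                      # drop points behind the camera
    depth = np.where(in_front, uvw[:, 2], 1.0)       # avoid dividing by zero
    u = uvw[:, 0] / depth
    v = uvw[:, 1] / depth
    x1, y1, x2, y2 = box
    return in_front & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)

# Hypothetical pinhole camera: focal length 700 px, principal point (600, 180).
P = np.array([[700.0, 0.0, 600.0, 0.0],
              [0.0, 700.0, 180.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
points = np.array([[0.0, 0.0, 10.0],    # projects to (600, 180): inside
                   [5.0, 0.0, 10.0],    # projects to (950, 180): outside
                   [0.0, 0.0, -10.0],   # behind the camera
                   [0.5, 0.2, 10.0]])   # projects to (635, 194): inside
mask = points_in_2d_box(points, P, box=(550, 150, 650, 210))
print(mask)  # [ True False False  True]
```

Only the points selected by the mask are passed on to the 3D stage, which is why neither method has to search the whole point cloud.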

II. The main difference between the two is how they find (precisely localize) the target object within the point cloud.

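1. F-PC_CNN combines two classical CV algorithms, RANSAC and template matching, for "localization", using predefined hand-crafted features (CAD models) to refine the 3D proposals, whereas F-PointNet uses the powerful PointNet to learn the object's spatial geometric features and performs 3D instance segmentation.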

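F-PC_CNN emphasizes a general pipeline for vehicle detection, so template matching is a simple and workable idea; the three vehicle models used in the paper are shown in the figure below. The drawbacks are also obvious, though: first, it is quite limited, since it can only detect these three types of vehicles and is fairly sensitive to noise; second, template matching is time-consuming.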

Figure 3: Vehicles 3D CAD Model from F-PC_CNN
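As a rough illustration of the template-matching idea, the sketch below scores how well a set of observed points fits each of several templates and picks the best one. It is not the paper's scoring function, which this post does not describe. The assumptions: templates are simple rectangle outlines in bird's-eye view rather than real CAD models, the observed points are already aligned to the template frame, and the score is the mean nearest-neighbor distance. The names `small`, `medium`, `large` and all sizes are hypothetical.

```python
import numpy as np

def rectangle_outline(length, width, n_per_side=50):
    """Points on the outline of a rectangle centered at the origin (stand-in for a CAD model)."""
    t = np.linspace(-0.5, 0.5, n_per_side)
    long_sides = [np.stack([t * length, np.full_like(t, s * width / 2)], axis=1)
                  for s in (-1, 1)]
    short_sides = [np.stack([np.full_like(t, s * length / 2), t * width], axis=1)
                   for s in (-1, 1)]
    return np.vstack(long_sides + short_sides)

def fit_score(points, template):
    """Mean distance from each observed point to its nearest template point (lower is better)."""
    d = np.linalg.norm(points[:, None, :] - template[None, :, :], axis=2)
    return d.min(axis=1).mean()

def best_template(points, templates):
    scores = {name: fit_score(points, t) for name, t in templates.items()}
    return min(scores, key=scores.get), scores

# Three hypothetical templates (length x width in meters).
templates = {
    "small": rectangle_outline(3.8, 1.6),
    "medium": rectangle_outline(4.5, 1.8),
    "large": rectangle_outline(5.2, 2.0),
}

# Simulated observation: a sensor typically sees only two sides of a vehicle.
rng = np.random.default_rng(0)
medium = templates["medium"]
observed = np.vstack([medium[:50], medium[100:150]])
observed = observed + rng.normal(0, 0.02, size=observed.shape)

name, scores = best_template(observed, templates)
print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

The sketch also hints at both drawbacks mentioned above: a vehicle that matches none of the templates gets a poor score everywhere, and the pairwise distance computation grows with the number of points, templates, and candidate poses.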

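By comparison, F-PointNet uses deep learning end to end, and both its detection accuracy and its runtime efficiency are better than F-PC_CNN's. (This seems to be turning into a post about F-PointNet.)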

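That said, F-PC_CNN combines classical methods with deep learning and achieves results that are reasonably comparable to the SOTA, which is valuable. The comparison is shown in the figure below, although it does not include F-PointNet.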

Figure 4: Comparison with Other Methods from F-PC_CNN

III. Two-stage refinement CNN

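After model fitting selects the 3D proposals, they are fed into a 2D CNN for the final 3D bounding box regression. The points in a 3D proposal cannot be fed directly into a 2D CNN, however, so they are first normalized and voxelized, giving a 24x54x32 matrix. In practice this mainly amounts to treating the height dimension as the channel, as many earlier methods do.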
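A minimal sketch of normalization plus voxelization is shown below. The 24x54x32 grid size comes from the text; everything else is an assumption made for illustration: min-max normalization per axis, binary occupancy as the voxel value, and the choice of the last axis (32) as the height axis that becomes the channel dimension.

```python
import numpy as np

def voxelize(points, grid_shape=(24, 54, 32)):
    """Normalize points into the unit cube and mark occupied voxels.

    Min-max normalization and binary occupancy are assumptions,
    not necessarily what the paper does.
    """
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    extent = np.where(extent > 0, extent, 1.0)   # guard against flat axes
    unit = (points - mins) / extent              # each coordinate now in [0, 1]
    shape = np.asarray(grid_shape)
    # Scale to grid indices; clamp the max coordinate into the last cell.
    idx = np.minimum((unit * shape).astype(int), shape - 1)
    grid = np.zeros(grid_shape, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

# Hypothetical proposal: 500 points inside a car-sized box (width, length, height).
rng = np.random.default_rng(0)
proposal_points = rng.uniform([0.0, 0.0, 0.0], [1.8, 4.5, 1.6], size=(500, 3))

grid = voxelize(proposal_points)
# Treat height as the channel axis for a 2D CNN (channels-first layout).
chw = np.moveaxis(grid, -1, 0)
print(grid.shape, chw.shape)  # (24, 54, 32) (32, 24, 54)
```

With height as the channel axis, the 2D convolutions slide over the two ground-plane axes, and each output location sees the full vertical column of occupancy values.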

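The first-stage CNN regresses the final 3D bounding box and classifies it; the second-stage CNN performs a further classification. Oddly, though, no two-stage CNN is visible in Figure 1; it clearly shows only one stage @-@.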
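The output structure of the two stages can be sketched as below: stage one has two parallel heads (box regression and classification) on a shared trunk, and stage two has a single classification head. This is only a structural illustration with randomly initialized dense layers in place of convolutions. The 7-value box parameterization, the layer sizes, and the input features given to stage two are all assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class StageOne:
    """Shared trunk with two parallel heads: box regression and classification."""
    def __init__(self, in_dim, hidden=16, box_dim=7, seed=0):
        rng = np.random.default_rng(seed)
        self.w_trunk = rng.normal(0, 0.1, (in_dim, hidden))
        self.w_box = rng.normal(0, 0.1, (hidden, box_dim))
        self.w_cls = rng.normal(0, 0.1, (hidden, 1))

    def __call__(self, x):
        h = np.maximum(x @ self.w_trunk, 0.0)      # ReLU trunk features
        box = h @ self.w_box                       # head 1: box parameters
        score = sigmoid(h @ self.w_cls)[:, 0]      # head 2: vehicle probability
        return box, score

class StageTwo:
    """Single output: classification only."""
    def __init__(self, in_dim, hidden=16, seed=1):
        rng = np.random.default_rng(seed)
        self.w_trunk = rng.normal(0, 0.1, (in_dim, hidden))
        self.w_cls = rng.normal(0, 0.1, (hidden, 1))

    def __call__(self, x):
        h = np.maximum(x @ self.w_trunk, 0.0)
        return sigmoid(h @ self.w_cls)[:, 0]

# Five hypothetical proposals, each flattened to a 64-dim feature vector.
rng = np.random.default_rng(2)
features = rng.normal(size=(5, 64))
box, score1 = StageOne(in_dim=64)(features)
# Assumption: stage two sees features recomputed from the refined box;
# here the same features are reused just to show the shapes.
score2 = StageTwo(in_dim=64)(features)
print(box.shape, score1.shape, score2.shape)  # (5, 7) (5,) (5,)
```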

Finally, here are the key sentences from the F-PC_CNN paper, with brief comments:

1. In this paper, we propose a flexible 3D vehicle detection pipeline which can make use of any 2D detection network and provide accurate 3D detection results by fusing the 2D network with a 3D point cloud.

This describes what F-PC_CNN does. It mentions that any 2D detection network can be plugged into the pipeline; personally I think this is quite common, as many BEV-based 3D detection methods do the same.

2. The raw image is passed to a 2D detection network which provides 2D boxes around the vehicles in the image plane. Subsequently, a set of 3D points which fall into the 2D bounding box after projection is selected. With this set, a model fitting algorithm detects the 3D location and 3D bounding box of the vehicle. And then another CNN network, which takes the points that fit into the 3D bounding box as input, carries out the final 3D box regression and classification.

The whole pipeline, very clear.

3. However, point sets cannot be input to the CNN directly. We apply normalization and voxelization strategies in order to formalize the points in matrix form in order to fit to the CNN.

Why normalization and voxelization are needed.

4. The first stage CNN has two parallel outputs, one for 3D box regression and the other for classification, while the second stage CNN only has one output, classification.

The mysterious two-stage.