
Mask R-CNN Paper Translation (1)

Paper: arxiv.org/pdf/1703.0687

Abstract

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.

1. Introduction

The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN [12, 34] and Fully Convolutional Network (FCN) [29] frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.

Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Given this, one might expect that a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.

Our method, called Mask R-CNN, extends Faster R-CNN [34] by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1).

The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.
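
To make the mask branch concrete, here is a minimal PyTorch-style sketch of a small per-RoI FCN head of the kind described; the layer sizes, module name, and 28×28 output resolution are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Small per-RoI FCN (illustrative sketch): maps pooled RoI features
    to one binary-mask logit map per class, pixel to pixel."""

    def __init__(self, in_channels: int = 256, num_classes: int = 80):
        super().__init__()
        # A short stack of 3x3 convs keeps the head lightweight.
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 2x upsample
        self.predictor = nn.Conv2d(256, num_classes, 1)          # K mask logit maps

    def forward(self, roi_features: torch.Tensor) -> torch.Tensor:
        # roi_features: (num_rois, C, 14, 14) -> logits (num_rois, K, 28, 28)
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))
        return self.predictor(x)
```

Such a head runs in parallel with the existing classification and box-regression branch on the same RoI features, which is why its overhead stays small.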

In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool [18, 12], the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.
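
Both points lend themselves to short sketches. For RoIAlign-style pooling, torchvision ships a bilinear, quantization-free operator; the scale and sampling values below are illustrative. The second function shows the decoupled mask loss described above: a per-class sigmoid with binary cross-entropy evaluated only on the channel of the RoI's ground-truth class, so classes never compete. Tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def pool_rois(features, boxes):
    # Quantization-free RoI pooling: box boundaries stay in continuous
    # coordinates and features are sampled bilinearly (no rounding).
    # features: (B, C, H, W); boxes: list of (num_boxes, 4) tensors per image.
    return roi_align(features, boxes, output_size=(14, 14),
                     spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)

def mask_loss(mask_logits, gt_classes, gt_masks):
    """Decoupled mask loss (sketch).
    mask_logits: (N, K, 28, 28) per-class logits from the mask head
    gt_classes:  (N,) ground-truth class index of each positive RoI
    gt_masks:    (N, 28, 28) binary targets resampled to the RoI grid
    """
    idx = torch.arange(mask_logits.shape[0], device=mask_logits.device)
    # Only the ground-truth class channel is trained; the other K-1
    # channels contribute no loss, so there is no inter-class competition.
    selected = mask_logits[idx, gt_classes]
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())
```

Because the loss touches only one mask channel per RoI, the mask branch never has to discriminate between classes; that job stays with the classification branch.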

Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [28], including the heavily engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.

Our models can run at about 200 ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.

Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset [28]. By viewing each keypoint as a one-hot binary mask, Mask R-CNN can be applied with minimal modification to detect instance-specific poses. Without tricks, Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN can therefore be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
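
The one-hot encoding mentioned here admits a compact sketch (shapes and names are illustrative assumptions, not the authors' code): each keypoint type gets an m×m logit map, the training target is the single grid cell containing the keypoint, and a softmax cross-entropy over the m*m locations drives prediction toward that one hot cell.

```python
import torch
import torch.nn.functional as F

def keypoint_loss(kp_logits: torch.Tensor, kp_xy: torch.Tensor) -> torch.Tensor:
    """kp_logits: (N, K, m, m) one logit map per keypoint type per RoI
    kp_xy: (N, K, 2) integer (x, y) keypoint locations on the m x m grid."""
    n, k, m, _ = kp_logits.shape
    flat = kp_logits.reshape(n * k, m * m)   # softmax runs over spatial locations
    # Flatten each one-hot target mask down to the index of its single hot cell.
    target = (kp_xy[..., 1] * m + kp_xy[..., 0]).reshape(n * k)
    return F.cross_entropy(flat, target)
```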

We will release code to facilitate future research.

2. Related Work

R-CNN: The Region-based CNN (R-CNN) approach [13] to bounding-box object detection is to attend to a manageable number of candidate object regions [38, 20] and evaluate convolutional networks [25, 24] independently on each RoI. R-CNN was extended [18, 12] to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy.

Faster R-CNN [34] advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements (e.g., [35, 27, 21]), and is the current leading framework in several benchmarks.

Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods [13, 15, 16, 9] resorted to bottom-up segments [38, 2]. DeepMask [32] and following works [33, 8] learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. [10] proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.

Most recently, Li et al. [26] combined the segment proposal system in [8] and the object detection system in [11] for "fully convolutional instance segmentation" (FCIS). The common idea in [8, 11, 26] is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 5), showing that it is challenged by the fundamental difficulties of segmenting instances.
