mAP: The Evaluation Metric for Object Detection

09-03

From the column: The eyes of computer

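1. Recall & Precision

mAP stands for mean Average Precision. The Average Precision here is computed over different recall levels, so to understand mAP we first need to understand recall and precision.

Recall and precision are common evaluation metrics for binary classification. The class of interest is usually treated as the positive class and everything else as the negative class. On the test data, each prediction of the classifier falls into one of 4 cases: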

              | Actual 1             | Actual 0
----------------------------------------------------------
Predicted 1   | TP (True Positive)   | FP (False Positive)
Predicted 0   | FN (False Negative)  | TN (True Negative)

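The formulas are:

$P=\frac{TP}{TP+FP}$ (the probability that a sample predicted as positive is actually positive)

$R=\frac{TP}{TP+FN}$ (the probability that an actually positive sample is predicted as positive)

$accuracy=\frac{TP+TN}{TP+TN+FP+FN}$ (the usual formula for accuracy)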

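A concrete example:

Suppose we have trained a model that recognizes cats. The test set contains 100 samples: 60 cat images and 40 dog images. The model labels 52 images as cats, of which 50 really are cats. In other words, 10 cats were missed and 2 of the detections are false alarms. Cats are cuter, so we care more about how well cats are detected and treat cat as the positive class:

So TP=50, TN=38, FN=10, FP=2, P=50/52, R=50/60, acc=(50+38)/(50+38+10+2)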
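The counts above can be checked with a few lines of Python. This is only a sketch: the label lists are rebuilt from the numbers in the example (their order is arbitrary), and `confusion_counts` is a helper name made up for this illustration.

```python
# Rebuild the cat/dog example as label lists: 1 = cat (positive), 0 = dog (negative).
# Only the counts come from the text; the ordering of the samples is arbitrary.
y_true = [1] * 60 + [0] * 40
y_pred = [1] * 50 + [0] * 10 + [1] * 2 + [0] * 38


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp, fp, fn, tn)  # 50 2 10 38
print(round(precision, 3), round(recall, 3), round(accuracy, 3))  # 0.962 0.833 0.88
```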

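Why introduce recall and precision?

Recall and precision measure model performance along two different dimensions:

In image classification, accuracy is often what gets reported, as in the ImageNet benchmark. For a single class, however, the two metrics point to different failure modes. If recall is high but precision is low (most cars are recognized, but many trucks are also misclassified as cars), that points to one cause. If recall is low but precision is high (the airplanes that are detected are correct, but many airplanes are missed), that points to a different one.

Recall measures completeness: were all the positive samples found? In tumor prediction, for example, the model needs high recall, because no tumor should be missed.

Precision measures exactness: of all the samples flagged as positive, how many really are positive? In spam filtering, for example, high precision is required, so that everything moved to the trash really is spam.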

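2. mAP (mean Average Precision)

While looking for material, I found that mAP is easier to understand from the information retrieval point of view.

In information retrieval, suppose we search for an item. The database contains 5 relevant entries, and the search returns 10 results, 4 of which are relevant. Precision = relevant results returned / total results returned, which is 4/10 here. Recall = relevant results returned / total relevant entries, which is 4/5 here. For a search system, however, the order of the relevant results strongly affects the user experience: we want relevant results as close to the top as possible. In this example, having the 4 relevant entries at ranks (1, 2, 4, 7), as in query 1, is better than having them at ranks (3, 5, 6, 8), as in query 2, yet both have the same precision. Precision alone cannot tell which system is better, so AP (Average Precision) is introduced: the average of the precision values at the different recall levels. For the two queries above, query 1: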

rank | correct | P    | R
-----------------------------
1    | right   | 1/1  | 1/5
2    | right   | 2/2  | 2/5
3    | wrong   | 2/3  | 2/5
4    | right   | 3/4  | 3/5
5    | wrong   | 3/5  | 3/5
6    | wrong   | 3/6  | 3/5
7    | right   | 4/7  | 4/5
8    | wrong   | 4/8  | 4/5
9    | wrong   | 4/9  | 4/5
10   | wrong   | 4/10 | 4/5

Query 2:

rank | correct | P    | R
-----------------------------
1    | wrong   | 0    | 0
2    | wrong   | 0    | 0
3    | right   | 1/3  | 1/5
4    | wrong   | 1/4  | 1/5
5    | right   | 2/5  | 2/5
6    | right   | 3/6  | 3/5
7    | wrong   | 3/7  | 3/5
8    | right   | 4/8  | 4/5
9    | wrong   | 4/9  | 4/5
10   | wrong   | 4/10 | 4/5

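AP(query 1) = (1+1+3/4+4/7+0)/5 = 0.664

AP(query 2) = (1/3+2/5+3/6+4/8+0)/5 = 0.347

So mAP = (0.664+0.347)/2 = 0.51

Analysis: in the example above, the best possible result is that all 5 entries are retrieved and ranked at positions 1, 2, 3, 4, 5, which gives AP=1. So even when the same entries are retrieved, the order of the results decides how good the system is. This conclusion carries over to object detection.

Note: when computing AP, take the maximum precision at each recall level, because the maximum precision at a given recall corresponds to the rank at which that entry first appears. The fifth relevant entry is never returned, so it contributes the 0 term in the sums above.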
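The two AP values and the resulting mAP can be reproduced with a short sketch. The function names `average_precision` and `ranks_to_relevance` are made up for this illustration; the only inputs taken from the text are the ranks of the relevant results and the total of 5 relevant entries.

```python
def average_precision(relevance, num_relevant):
    """AP of one ranked result list.

    relevance: list of booleans, True where the result at that rank is relevant.
    num_relevant: total number of relevant entries in the database; relevant
    entries that are never returned contribute a precision of 0.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this new recall level
    return precision_sum / num_relevant


def ranks_to_relevance(relevant_ranks, length=10):
    """Turn a set of 1-based ranks into a True/False list of the given length."""
    return [rank in relevant_ranks for rank in range(1, length + 1)]


ap1 = average_precision(ranks_to_relevance({1, 2, 4, 7}), num_relevant=5)
ap2 = average_precision(ranks_to_relevance({3, 5, 6, 8}), num_relevant=5)
mean_ap = (ap1 + ap2) / 2

print(round(ap1, 3), round(ap2, 3), round(mean_ap, 2))  # 0.664 0.347 0.51
```

A perfect ranking, `ranks_to_relevance({1, 2, 3, 4, 5})`, gives an AP of 1.0 with the same function.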

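3. mAP in Object Detection

Image classification usually uses accuracy to measure a model. For object detection, suppose the test images contain 1000 objects in total (objects, not images, since one image may contain several objects), and two models each correctly detect 900 of them (IoU above the specified threshold). Unlike image classification, object detection can produce duplicate detections, so it is not an n-to-n problem. In this example, classification accuracy therefore cannot be used directly to measure performance: model A may have made 2000 predictions to get 900 right, while model B may have made only 1200. Model B is clearly better than A, because model A is casting a wide net and has a higher false detection rate. Imagine using model A in a self-driving car: frequent false detections would seriously affect both safety and comfort.

So how should a detection model be measured? One criterion is to do what information retrieval does: measure not only how many correct objects are detected, but also whether the model detects them at high precision. That is, for a given class, how many mistakes are made before the correct objects are found. The higher the AP, the fewer the detection mistakes. Averaging the AP over all classes gives the mAP.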
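To make the comparison between the two models concrete, here is a small sketch that turns the counts from the example into precision and recall. `precision_recall` is a hypothetical helper, and the numbers are exactly the ones quoted above.

```python
def precision_recall(num_correct, num_predictions, num_objects):
    """Precision and recall of a detector from raw counts."""
    return num_correct / num_predictions, num_correct / num_objects


# 1000 ground-truth objects; both models get 900 right.
results = {
    "A": precision_recall(900, 2000, 1000),
    "B": precision_recall(900, 1200, 1000),
}

for name, (p, r) in results.items():
    print(f"model {name}: precision={p:.2f} recall={r:.2f}")
# model A: precision=0.45 recall=0.90
# model B: precision=0.75 recall=0.90
```

Both models have the same recall, so only precision separates them, which is why a metric that combines the two is needed.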

4. Calculation Methods and Code

The VOC2007 method:

To compute AP, first sort the results by confidence. The formula is:

$AP=\frac{1}{11}\sum_{r\in\{0,0.1,\ldots,1\}} P_{interp}(r)$, where $P_{interp}(r)=\max\limits_{\tilde{r}:\tilde{r}\ge r} p(\tilde{r})$ and $p(\tilde{r})$ is the measured precision at recall $\tilde{r}$

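The VOC2010 method:

Compared with 2007, the method used from 2010 onward takes all the recall values that actually occur. After obtaining all the recall/precision data points as in the 2007 method, it computes the area under the recall/precision curve: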

Compute a version of the measured precision/recall curve with precision monotonically decreasing, by setting the precision for recall r to the maximum precision obtained for any recall r′ ≥ r.
Compute the AP as the area under this curve by numerical integration. No approximation is involved since the curve is piecewise constant.
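The first step, making precision monotonically decreasing, can be sketched with NumPy as a running maximum taken from the right-hand end of the array. `precision_envelope` is a name chosen for this illustration, and the `measured` values are made-up numbers, not data from this article.

```python
import numpy as np


def precision_envelope(precision):
    """Replace each precision by the maximum precision at this or any later rank.

    Assumes `precision` is ordered by decreasing confidence, so that recall is
    non-decreasing along the array.
    """
    precision = np.asarray(precision, dtype=float)
    # A running maximum from the right gives the max over all r' >= r.
    return np.maximum.accumulate(precision[::-1])[::-1]


measured = [1.0, 0.5, 0.67, 0.5, 0.4, 0.5]  # hypothetical precision values
print(precision_envelope(measured).tolist())  # [1.0, 0.67, 0.67, 0.5, 0.5, 0.5]
```

The result never increases from left to right, which is what makes the curve piecewise constant and the area exact.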

A worked example:

For the Aeroplane class we have the following output (BB is the bounding box index; GT=1 when IoU>0.5):

BB  | confidence | GT
----------------------
BB1 | 0.9        | 1
BB2 | 0.9        | 1
BB1 | 0.8        | 1
BB3 | 0.7        | 0
BB4 | 0.7        | 0
BB5 | 0.7        | 1
BB6 | 0.7        | 0
BB7 | 0.7        | 0
BB8 | 0.7        | 1
BB9 | 0.7        | 1

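So we have TP=5 (BB1, BB2, BB5, BB8, BB9) and FP=5 (the duplicate detection of BB1 also counts as an FP). Besides the 5 ground truths detected in the table, 2 ground truths were not detected, so FN=2 and there are 7 ground truths in total. We can now list the precision and recall at each rank in confidence order: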

rank=1   precision=1.00  recall=0.14
rank=2   precision=1.00  recall=0.29
rank=3   precision=0.67  recall=0.29
rank=4   precision=0.50  recall=0.29
rank=5   precision=0.40  recall=0.29
rank=6   precision=0.50  recall=0.43
rank=7   precision=0.43  recall=0.43
rank=8   precision=0.38  recall=0.43
rank=9   precision=0.44  recall=0.57
rank=10  precision=0.50  recall=0.71

The 2007 method: for Recall >= {0, 0.1, ..., 1} we take the maximum precision at each of the 11 points: 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0, 0. AP = 5.5 / 11 = 0.5
The VOC2010 and later method: for Recall >= {0, 0.14, 0.29, 0.43, 0.57, 0.71, 1} we take the maximum precision at each point: 1, 1, 1, 0.5, 0.5, 0.5, 0. The area under the recall/precision curve is: AP = (0.14-0)x1 + (0.29-0.14)x1 + (0.43-0.29)x0.5 + (0.57-0.43)x0.5 + (0.71-0.57)x0.5 + (1-0.71)x0 = 0.5
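The Aeroplane example can be recomputed end to end with a short NumPy sketch. The `is_tp` array encodes the table in confidence order, with the duplicate BB1 at rank 3 treated as a false positive as described above. `ap_11_point` and `ap_area` are illustrative names that follow the two methods as this article describes them; they are not taken from the VOC devkit.

```python
import numpy as np

# 1 = true positive, 0 = false positive, in confidence order.
# The duplicate BB1 at rank 3 has GT=1 in the table but counts as an FP.
is_tp = np.array([1, 1, 0, 0, 0, 1, 0, 0, 1, 1], dtype=float)
num_gt = 7  # 5 detected ground truths + 2 missed ones

tp = np.cumsum(is_tp)
fp = np.cumsum(1.0 - is_tp)
recall = tp / num_gt
precision = tp / (tp + fp)


def ap_11_point(recall, precision):
    """VOC2007 style: mean of the max precision at recall >= 0, 0.1, ..., 1."""
    total = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        total += precision[mask].max() if mask.any() else 0.0
    return total / 11.0


def ap_area(recall, precision):
    """VOC2010 style: area under the monotonically decreasing PR envelope."""
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]  # points where recall changes
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


print(np.round(precision, 2).tolist())
print(np.round(recall, 2).tolist())
print(round(ap_11_point(recall, precision), 3), round(ap_area(recall, precision), 3))  # 0.5 0.5
```

In this particular example both methods give 0.5; in general the two values differ, since the 11-point version only samples the curve.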

Once the AP of each class has been computed, averaging the AP over all classes gives the mAP.

Code:

# Compute recall, precision and AP.
# Excerpt from the body of voc_eval(); it assumes `import numpy as np` and that
# recs, imagenames, classname, detpath, ovthresh and use_07_metric are already
# defined earlier in the function.

    # collect the ground-truth objects of this class, image by image
    class_recs = {}
    npos = 0
    for imagename in imagenames:
        R = [obj for obj in recs[imagename] if obj['name'] == classname]
        bbox = np.array([x['bbox'] for x in R])
        difficult = np.array([x['difficult'] for x in R]).astype(bool)
        det = [False] * len(R)  # flags used to spot duplicate detections
        npos = npos + sum(~difficult)
        class_recs[imagename] = {'bbox': bbox,
                                 'difficult': difficult,
                                 'det': det}

    # read dets
    detfile = detpath.format(classname)
    with open(detfile, 'r') as f:
        lines = f.readlines()

    splitlines = [x.strip().split(' ') for x in lines]
    image_ids = [x[0] for x in splitlines]
    confidence = np.array([float(x[1]) for x in splitlines])
    BB = np.array([[float(z) for z in x[2:]] for x in splitlines])

    # sort by confidence (highest first)
    sorted_ind = np.argsort(-confidence)
    BB = BB[sorted_ind, :]
    image_ids = [image_ids[x] for x in sorted_ind]

    # go down dets and mark TPs and FPs
    nd = len(image_ids)
    tp = np.zeros(nd)
    fp = np.zeros(nd)
    for d in range(nd):
        R = class_recs[image_ids[d]]
        bb = BB[d, :].astype(float)
        ovmax = -np.inf
        BBGT = R['bbox'].astype(float)

        if BBGT.size > 0:
            # compute overlaps
            # intersection
            ixmin = np.maximum(BBGT[:, 0], bb[0])
            iymin = np.maximum(BBGT[:, 1], bb[1])
            ixmax = np.minimum(BBGT[:, 2], bb[2])
            iymax = np.minimum(BBGT[:, 3], bb[3])
            iw = np.maximum(ixmax - ixmin + 1., 0.)
            ih = np.maximum(iymax - iymin + 1., 0.)
            inters = iw * ih

            # union
            uni = ((bb[2] - bb[0] + 1.) * (bb[3] - bb[1] + 1.) +
                   (BBGT[:, 2] - BBGT[:, 0] + 1.) *
                   (BBGT[:, 3] - BBGT[:, 1] + 1.) - inters)

            overlaps = inters / uni
            ovmax = np.max(overlaps)
            jmax = np.argmax(overlaps)

        if ovmax > ovthresh:
            if not R['difficult'][jmax]:
                if not R['det'][jmax]:
                    tp[d] = 1.
                    # duplicate check: after the first match the flag
                    # changes from False to 1
                    R['det'][jmax] = 1
                else:
                    fp[d] = 1.
        else:
            fp[d] = 1.

    # compute precision recall
    fp = np.cumsum(fp)
    tp = np.cumsum(tp)
    rec = tp / float(npos)
    # avoid divide by zero in case the first detection matches a difficult
    # ground truth
    prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
    ap = voc_ap(rec, prec, use_07_metric)

    return rec, prec, ap
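The overlap computation inside the loop is easier to follow for a single pair of boxes. Below is a minimal standalone sketch of the same IoU formula, assuming boxes are given as (xmin, ymin, xmax, ymax) in inclusive pixel coordinates, which is what the `+ 1.` terms in the code above imply. The function name `iou` and the example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax), inclusive pixels."""
    # intersection rectangle
    ixmin = max(box_a[0], box_b[0])
    iymin = max(box_a[1], box_b[1])
    ixmax = min(box_a[2], box_b[2])
    iymax = min(box_a[3], box_b[3])
    iw = max(ixmax - ixmin + 1.0, 0.0)  # zero width if the boxes do not overlap
    ih = max(iymax - iymin + 1.0, 0.0)
    inter = iw * ih

    # union = area A + area B - intersection
    area_a = (box_a[2] - box_a[0] + 1.0) * (box_a[3] - box_a[1] + 1.0)
    area_b = (box_b[2] - box_b[0] + 1.0) * (box_b[3] - box_b[1] + 1.0)
    return inter / (area_a + area_b - inter)


print(iou((0, 0, 9, 9), (0, 0, 9, 9)))      # identical boxes: 1.0
print(iou((0, 0, 9, 9), (5, 0, 14, 9)))     # half overlap: 50 / 150
print(iou((0, 0, 9, 9), (20, 20, 29, 29)))  # disjoint boxes: 0.0
```

A detection is then counted as a match when this value exceeds the threshold (`ovthresh` in the code above).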

Computing AP:

import numpy as np


def voc_ap(rec, prec, use_07_metric=False):
    """Compute VOC AP given precision and recall.

    If use_07_metric is true, uses the VOC 07 11-point method (default: False).
    """
    if use_07_metric:
        # 11 point metric
        ap = 0.
        for t in np.arange(0., 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                p = np.max(prec[rec >= t])
            ap = ap + p / 11.
    else:
        # correct AP calculation
        # first append sentinel values at both ends
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))

        # compute the precision envelope
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

        # find the points where the recall value changes
        i = np.where(mrec[1:] != mrec[:-1])[0]

        # and sum (Delta recall) * prec, i.e. the area under the curve
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap

Computing mAP:

def mAP():
    detpath, annopath, imagesetfile, cachedir, class_path = get_dir('kitti')
    # these two locals are not passed on: the voc_eval call below sets its
    # own ovthresh=0.5 and use_07_metric=False
    ovthresh = 0.3
    use_07_metric = False
    rec = 0
    prec = 0
    mean_ap = 0
    class_list = get_classlist(class_path)

    for classname in class_list:
        rec, prec, ap = voc_eval(detpath,
                                 annopath,
                                 imagesetfile,
                                 classname,
                                 cachedir,
                                 ovthresh=0.5,
                                 use_07_metric=False,
                                 kitti=True)
        # rec[-1] and prec[-1] are the values after the last detection
        print('on {}, the ap is {}, recall is {}, precision is {}'.format(
            classname, ap, rec[-1], prec[-1]))
        mean_ap += ap

    # average the per-class AP values
    mean_ap = float(mean_ap) / len(class_list)

    return mean_ap

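The COCO calculation method, the detailed code, and training YOLOv3 on the KITTI dataset can be found on GitHub.

This is my first article. If anything is incorrect, corrections and discussion are welcome.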
