SSD: Single Shot MultiBox Detector

05-01

This article covers:

A summary of the theory behind SSD (SSD: Single Shot MultiBox Detector)
A walkthrough of the key source code:

balancap/SSD-Tensorflow (github.com)

Model

The SSD model uses VGG16 as its base network and appends extra layers after the base network, as shown in the figure below:

[Figure: SSD network structure]

1. Multi-scale feature maps for detection

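Extra convolutional layers are added after the base network (the first five convolutional blocks of VGG16). Specifically, the atrous (dilated) convolution algorithm is used to convert the fc6 and fc7 layers into two convolutional layers, and then 3 more convolutional layers (Conv 1×1 + Conv 3×3) and an average pooling layer are appended (in the paper the pooling layer is another Conv 1×1 + Conv 3×3 pair, which serves the same purpose).
A 3×3 convolution is applied to each feature map used for prediction. Predictions from lower layers help with smaller objects, because lower-layer feature maps have smaller receptive fields. Objects of different sizes can therefore be handled by feature maps whose receptive field is of a similar size, which is the goal of multi-scale feature map detection.
Key code analysis: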

# Selected initialization parameters
class SSDNet(object):
    """Implementation of the SSD VGG-based 300 network.

    The default features layers with 300x300 image input are
    (the positions used for multi-scale feature map detection):
      conv4 ==> 38 x 38
      conv7 ==> 19 x 19
      conv8 ==> 10 x 10
      conv9 ==> 5 x 5
      conv10 ==> 3 x 3
      conv11 ==> 1 x 1
    The default image size used to train this network is 300x300.
    """
    default_params = SSDParams(
        img_shape=(300, 300),  # input image size
        num_classes=21,  # number of classes: 20 + 1 (background)
        no_annotation_label=21,
        # layers used for multi-scale feature map detection
        feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
        # feature map sizes
        feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
        # default box scale for the lowest and highest layers; adjust as needed
        anchor_size_bounds=[0.15, 0.90],
        # anchor_size_bounds=[0.20, 0.90],  (value used in the original paper)
        # default box sizes
        anchor_sizes=[(21., 45.),
                      (45., 99.),
                      (99., 153.),
                      (153., 207.),
                      (207., 261.),
                      (261., 315.)],
        # anchor_sizes=[(30., 60.),
        #               (60., 111.),
        #               (111., 162.),
        #               (162., 213.),
        #               (213., 264.),
        #               (264., 315.)],
        # default box aspect ratios
        anchor_ratios=[[2, .5],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5],
                       [2, .5]],
        # spacing between default box centers
        anchor_steps=[8, 16, 32, 64, 100, 300],
        anchor_offset=0.5,  # center offset within a cell
        # whether to L2-normalize the feature map (normalized when > 0)
        normalizations=[20, -1, -1, -1, -1, -1],
        prior_scaling=[0.1, 0.1, 0.2, 0.2]
    )

The network is defined with the tensorflow-slim library; the source code has been modified slightly here:

# Define the SSD network structure
def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """SSD net definition."""
    # End_points collect relevant activations for external use.
    # Stores the output of each feature map layer.
    end_points = {}
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        # ======== Original VGG-16 blocks ========
        # Block 1.
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1', padding='SAME')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2', padding='SAME')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3', padding='SAME')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        # First feature map used for prediction, shape (batch_size, 38, 38, 512)
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4', padding='SAME')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5', padding='SAME')

        # Additional SSD blocks.
        # Block 6: 3x3 dilated (atrous) convolution with rate 6.
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # Note: `rate` in tf.layers.dropout is the drop probability; with 0.5
        # it happens to equal the keep probability.
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
        # Block 7: 1x1 conv.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        # Second feature map used for prediction, shape (batch_size, 19, 19, 1024)
        end_points['block7'] = net
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)

        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts)
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        # Third feature map used for prediction, shape (batch_size, 10, 10, 512)
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        # Fourth feature map used for prediction, shape (batch_size, 5, 5, 256)
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        # Fifth feature map used for prediction, shape (batch_size, 3, 3, 256)
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        # Sixth feature map used for prediction, shape (batch_size, 1, 1, 256)
        end_points[end_point] = net

        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                # Predict the bbox position (offset relative to the default box) and the class
                p, l = ssd_multibox_layer(end_points[layer],
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
            # softmax: class probabilities
            predictions.append(prediction_fn(p))
            logits.append(p)
            # bbox offsets relative to the default boxes
            localisations.append(l)

        return predictions, localisations, logits, end_points
ssd_net.default_image_size = 300
# Tested with tf-1.1.0: with a 300*300 input image, the feature map shapes
# differed from the expected ones, so the source code was changed by adding
# padding='SAME' to the max_pool layers.
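The feature map sizes quoted in the comments above can be checked with plain arithmetic. The sketch below is not part of the repository; the helper names (`same_pool`, `valid_pool`, `padded_valid_conv`, `ssd300_feature_sizes`) are made up for illustration, and it assumes the usual output-size rules for SAME padding (round up) and VALID padding (round down). It also suggests why the `padding='SAME'` change mattered: with VALID pooling, pool3 would give 37 instead of 38.

```python
import math

def same_pool(size, stride=2):
    """Output size with SAME padding: ceil(size / stride)."""
    return math.ceil(size / stride)

def valid_pool(size, kernel=2, stride=2):
    """Output size with VALID padding: floor((size - kernel) / stride) + 1."""
    return (size - kernel) // stride + 1

def padded_valid_conv(size, kernel=3, stride=1, pad=0):
    """Output size of a VALID conv applied after explicit zero padding."""
    return (size + 2 * pad - kernel) // stride + 1

def ssd300_feature_sizes(input_size=300):
    sizes = {}
    s = input_size
    s = same_pool(s)                    # pool1: 300 -> 150
    s = same_pool(s)                    # pool2: 150 -> 75
    s = same_pool(s)                    # pool3: 75 -> 38 (SAME rounds up)
    sizes["block4"] = s                 # conv4 keeps the size
    s = same_pool(s)                    # pool4: 38 -> 19
    sizes["block7"] = s                 # pool5 (stride 1), conv6, conv7 keep the size
    s = padded_valid_conv(s, 3, 2, 1)   # block8: pad 1, 3x3, stride 2
    sizes["block8"] = s
    s = padded_valid_conv(s, 3, 2, 1)   # block9: pad 1, 3x3, stride 2
    sizes["block9"] = s
    s = padded_valid_conv(s)            # block10: 3x3 VALID
    sizes["block10"] = s
    s = padded_valid_conv(s)            # block11: 3x3 VALID
    sizes["block11"] = s
    return sizes

print(ssd300_feature_sizes())
print("pool3 with VALID padding:", valid_pool(75))
```

The resulting sizes match `feat_shapes` in `default_params`.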

2. Convolutional predictors for detection

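Each feature layer used for prediction (block4 through block11 above) applies a set of convolutional filters to produce a fixed set of predictions (that is, the scales predicted by each feature map are fixed). For an m×n feature map with p channels, the convolutional filters are 3×3 kernels, which predict the class and the position offsets of each default box.
YOLO, by contrast, uses a fully connected layer in place of these convolutional layers, and a fully connected layer forces the input size to be fixed.
Key code analysis: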

# Predict on a feature map (position offsets and class probabilities)
# inputs: one of [block4, block7, block8, block9, block10, block11]
# num_classes: 21
# sizes: [(21.,45.),(45.,99.),(99.,153.),(153.,207.),(207.,261.),(261.,315.)]
# ratios: [[2, .5],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5, 3, 1./3],[2, .5],[2, .5]]
# The parameters correspond to each other one by one.
def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    # L2 normalization
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    # Number of default boxes at each location of this feature map:
    # len(sizes) counts the boxes with aspect ratio 1,
    # len(ratios) counts the boxes with the other aspect ratios.
    num_anchors = len(sizes) + len(ratios)

    # Location.
    num_loc_pred = num_anchors * 4
    # Convolutional predictor: predicts the position of every bbox.
    # Output shapes:
    #   (batch_size, 38, 38, num_loc_pred)
    #   (batch_size, 19, 19, num_loc_pred)
    #   (batch_size, 10, 10, num_loc_pred)
    #   (batch_size, 5, 5, num_loc_pred)
    #   (batch_size, 3, 3, num_loc_pred)
    #   (batch_size, 1, 1, num_loc_pred)
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    loc_pred = custom_layers.channel_to_last(loc_pred)
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
    # Class prediction.
    # Convolutional predictor: predicts the class of every bbox.
    num_cls_pred = num_anchors * num_classes
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
    return cls_pred, loc_pred
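To make the channel counts concrete, the following sketch tallies, for each prediction layer, the number of default boxes per location and the output channels of the two 3×3 predictors. It reuses the values from `default_params`; the helper name `anchors_per_location` and the `per_layer` structure are illustrative only.

```python
feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
anchor_sizes = [(21., 45.), (45., 99.), (99., 153.),
                (153., 207.), (207., 261.), (261., 315.)]
anchor_ratios = [[2, .5], [2, .5, 3, 1./3], [2, .5, 3, 1./3],
                 [2, .5, 3, 1./3], [2, .5], [2, .5]]
num_classes = 21

def anchors_per_location(sizes, ratios):
    # Same rule as in ssd_multibox_layer: two ratio-1 boxes plus one per extra ratio.
    return len(sizes) + len(ratios)

per_layer = []
for (fh, fw), sizes, ratios in zip(feat_shapes, anchor_sizes, anchor_ratios):
    k = anchors_per_location(sizes, ratios)
    per_layer.append({
        "anchors": k,
        "loc_channels": 4 * k,            # 4 offsets per default box
        "cls_channels": num_classes * k,  # one score per class per default box
        "boxes": fh * fw * k,
    })

total_boxes = sum(layer["boxes"] for layer in per_layer)
for shape, layer in zip(feat_shapes, per_layer):
    print(shape, layer)
print("total default boxes:", total_boxes)
```

With these settings the six layers contribute 5776, 2166, 600, 150, 36, and 4 boxes, for 8732 default boxes per image.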

3. Default boxes and aspect ratios

Default boxes are generated on every feature map used for prediction. Their number, sizes, and aspect ratios are fixed by the network structure.
Key code analysis:

import math
import numpy as np

# Generate the fixed default boxes for one feature map
def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Compute SSD default anchor boxes for one feature layer.

    Determine the relative position grid of the centers, and the relative
    width and height.

    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      sizes: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.

    Return:
      y, x, h, w: Relative x and y grids, and height and width.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...

    # Example: generating default boxes for the (38*38) feature map.
    # y and x are the row and column coordinates of the feature map cells.
    # y has shape (38, 38) with values:
    #   [[ 0,  0,  0, ...,  0,  0,  0],
    #    [ 1,  1,  1, ...,  1,  1,  1],
    #    ...
    #    [37, 37, 37, ..., 37, 37, 37]]
    # x has shape (38, 38) with values:
    #   [[0, 1, 2, ..., 35, 36, 37],
    #    [0, 1, 2, ..., 35, 36, 37],
    #    ...
    #    [0, 1, 2, ..., 35, 36, 37]]
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # Map each feature map cell onto the original image, normalized to [0, 1]:
    #   y = (y + 0.5) * 8 / 300
    #   x = (x + 0.5) * 8 / 300
    # x, y are the default box centers in the original image, normalized to [0, 1].
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]

    # Expand dims to support easy broadcasting.
    # The shape becomes (38, 38, 1).
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)

    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    # Number of default boxes at each feature map location.
    num_anchors = len(sizes) + len(ratios)
    # Heights and widths of the default boxes.
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    # Default box with aspect ratio 1: height and width are both 21/300.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    # For aspect ratio 1, add an extra default box of size sqrt(s_k * s_{k+1}).
    if len(sizes) > 1:
        # Height and width are both sqrt(21*45)/300.
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    # Default boxes for the remaining aspect ratios.
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
    # Return the default box centers together with their heights and widths:
    # y, x have shape (38, 38, 1); h, w have shape (4,).
    return y, x, h, w


def ssd_anchors_all_layers(img_shape,      # shape of the original image
                           layers_shape,   # feature map shapes
                           anchor_sizes,   # default box sizes
                           anchor_ratios,  # aspect ratios
                           anchor_steps,
                           offset=0.5,
                           dtype=np.float32):
    """Compute anchor boxes for all feature layers."""
    # params:
    #   img_shape: (300, 300)
    #   layers_shape: [(38,38),(19,19),(10,10),(5,5),(3,3),(1,1)]
    #   anchor_sizes: [(21,45),(45,99),(99,153),(153,207),(207,261),(261,315)]
    #   anchor_ratios: [[2,.5],[2,.5,3,1./3],[2,.5,3,1./3],[2,.5,3,1./3],[2,.5],[2,.5]]
    #   anchor_steps: [8,16,32,64,100,300]
    #   offset: 0.5
    layers_anchors = []
    # enumerate is a Python built-in that yields (index, item) pairs, i.e.:
    #   0,(38,38)  1,(19,19)  2,(10,10)  3,(5,5)  4,(3,3)  5,(1,1)
    for i, s in enumerate(layers_shape):
        anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
                                             anchor_sizes[i],
                                             anchor_ratios[i],
                                             anchor_steps[i],
                                             offset=offset, dtype=dtype)
        layers_anchors.append(anchor_bboxes)
    return layers_anchors
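The reason `y, x` keep a trailing axis of length 1 while `h, w` are flat vectors is NumPy broadcasting: combining a `(38, 38, 1)` array with a `(4,)` array yields one value per cell and per default box. The sketch below re-derives the first layer's anchors in a compact form and expands them into corner boxes; `anchors_one_layer` and `anchors_to_corners` are illustrative names, not repository functions.

```python
import math
import numpy as np

def anchors_one_layer(img_shape, feat_shape, sizes, ratios, step, offset=0.5):
    """Compact restatement of the per-layer anchor computation shown above."""
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = ((y + offset) * step / img_shape[0])[..., np.newaxis]  # (H, W, 1)
    x = ((x + offset) * step / img_shape[1])[..., np.newaxis]  # (H, W, 1)
    sides = [sizes[0], math.sqrt(sizes[0] * sizes[1])]         # two ratio-1 boxes
    h = [s / img_shape[0] for s in sides]
    w = [s / img_shape[1] for s in sides]
    for r in ratios:
        h.append(sizes[0] / img_shape[0] / math.sqrt(r))
        w.append(sizes[0] / img_shape[1] * math.sqrt(r))
    return y, x, np.array(h), np.array(w)

def anchors_to_corners(y, x, h, w):
    """Broadcast (H, W, 1) centers against (K,) sizes into (H*W*K, 4) corner boxes."""
    ymin, xmin = y - h / 2, x - w / 2   # each (H, W, K)
    ymax, xmax = y + h / 2, x + w / 2
    return np.stack([ymin, xmin, ymax, xmax], axis=-1).reshape(-1, 4)

y, x, h, w = anchors_one_layer((300, 300), (38, 38), (21., 45.), [2, .5], step=8)
corners = anchors_to_corners(y, x, h, w)
print(y.shape, h.shape, corners.shape)
```

Boxes near the image border extend slightly outside `[0, 1]`; this sketch does not clip them.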

Training

1. Generating default boxes

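For each feature map size, a fixed number of default boxes is generated at every location according to the corresponding scales and aspect ratios; in other words, the default boxes in SSD are fixed by the network structure. In the figure below (for illustration only), the red points represent a 5×5 feature map and each location predicts 3 default boxes with size 168 and width-to-height ratios 1, 1/2, and 2, so the default box widths and heights are [168, 168], [$168 \div \sqrt{2}$, $168 \times \sqrt{2}$], and [$168 \times \sqrt{2}$, $168 \div \sqrt{2}$].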
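The three boxes in this example can be reproduced with a few lines. This is only a numeric check of the example above; `box_width_height` is a hypothetical helper name.

```python
import math

def box_width_height(size, ratio):
    """Width and height of a default box with the given size and width/height ratio."""
    return size * math.sqrt(ratio), size / math.sqrt(ratio)

for ratio in (1, 0.5, 2):
    width, height = box_width_height(168, ratio)
    print(f"ratio={ratio}: width={width:.2f}, height={height:.2f}")
```

Every box keeps the same area (168²); only its shape changes with the ratio.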

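Generating the default boxes:

First choose the minimum and maximum default box scales [$s_{min}, s_{max}$]. The lower the feature map, the smaller its default boxes (a smaller receptive field is better suited to detecting small objects). The paper uses [0.2, 0.9]; the code above uses [0.15, 0.9].
The default box scale for each feature map (from the lowest layer to the highest) is computed as $s_{k} = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1),\ k \in [1, m]$, where m is the number of feature maps.
The width and height of each default box are computed from its scale and aspect ratio as follows:

Width: $w_{k}^{a} = s_{k}\sqrt{a_{r}}$, height: $h_{k}^{a} = s_{k}/\sqrt{a_{r}}$, where $a_{r}$ is the width-to-height ratio and $s_{k}$ is the default box scale;

For aspect ratio 1, an extra default box with scale $s'_{k} = \sqrt{s_{k}s_{k+1}}$ is added;

The center of each default box is set to $(\frac{i+0.5}{|f_{k}|}, \frac{j+0.5}{|f_{k}|})$, where $|f_{k}|$ is the feature map size;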
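The formulas above can be evaluated directly. The sketch below uses the paper's bounds (0.2 and 0.9) with m = 6; the function names are illustrative. Note that the `anchor_sizes` hard-coded in the repository follow the Caffe implementation and are not reproduced by this formula.

```python
import math

def default_box_scales(s_min=0.2, s_max=0.9, m=6):
    """s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k = 1..m."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def extra_scale(s_k, s_k_next):
    """Scale of the additional ratio-1 box: sqrt(s_k * s_{k+1})."""
    return math.sqrt(s_k * s_k_next)

def box_centers(f_k):
    """Centers ((i + 0.5) / f_k, (j + 0.5) / f_k) for an f_k x f_k feature map."""
    return [((i + 0.5) / f_k, (j + 0.5) / f_k)
            for i in range(f_k) for j in range(f_k)]

scales = default_box_scales()
print([round(s, 2) for s in scales])
print(round(extra_scale(scales[0], scales[1]), 4))
print(box_centers(5)[:3])
```

The scales are evenly spaced between the two bounds, so with six feature maps each step is (0.9 − 0.2) / 5 = 0.14.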

2. Generating training data

Training data is generated from the image's ground truth and the default boxes. The key code is analyzed below:

# Ground truth (gt) encoding function
# labels: classes of the gt boxes
# bboxes: positions of the gt boxes
# anchors: positions of the default boxes
# num_classes: number of classes
# no_annotation_label: 21
# ignore_threshold=0.5: threshold
# prior_scaling=[0.1, 0.1, 0.2, 0.2]: scaling
def tf_ssd_bboxes_encode(labels,
                         bboxes,
                         anchors,
                         num_classes,
                         no_annotation_label,
                         ignore_threshold=0.5,
                         prior_scaling=[0.1, 0.1, 0.2, 0.2],
                         dtype=tf.float32,
                         scope='ssd_bboxes_encode'):
    """Encode groundtruth labels and bounding boxes using SSD net anchors.
    Encoding boxes for all feature layers.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors: List of Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores):
        Each element is a list of target Tensors.
    """
    with tf.name_scope(scope):
        target_labels = []
        target_localizations = []
        target_scores = []
        for i, anchors_layer in enumerate(anchors):
            with tf.name_scope('bboxes_encode_block_%i' % i):
                # Process the default boxes of one size (one feature map layer)
                # and generate the training data for that layer.
                t_labels, t_loc, t_scores = \
                    tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,
                                               num_classes, no_annotation_label,
                                               ignore_threshold,
                                               prior_scaling, dtype)
                target_labels.append(t_labels)
                target_localizations.append(t_loc)
                target_scores.append(t_scores)
        return target_labels, target_localizations, target_scores

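The default boxes of each size (one feature map layer) are processed to generate training data. The key code is analyzed below, using the feature map of shape (38, 38) as an example: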

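In this code block, the overlap between every anchor and all ground truth (gt) boxes is computed. Each anchor takes the class of the gt it overlaps most, and its offset is computed relative to that gt;
Given an input image and the ground truth of each object, first, for each gt, the default box with the largest overlap is taken as a positive sample (matched to that ground truth box). Then, among the remaining default boxes, those whose IoU with any ground truth box is greater than 0.5 are also taken as positive samples (matched to that ground truth box). The remaining default boxes are negative samples;
One anchor corresponds to one gt, while one gt may correspond to multiple anchors;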
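The two-step matching rule described above can be sketched in NumPy on a handful of boxes. This is a simplified illustration, not the repository's TensorFlow code: the names `iou_matrix` and `match_anchors` and the sample boxes are made up, and boxes are given as `[ymin, xmin, ymax, xmax]` in normalized coordinates.

```python
import numpy as np

def iou_matrix(anchors, gts):
    """Jaccard overlap between every anchor (A, 4) and every gt box (G, 4) -> (A, G)."""
    ymin = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    xmin = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    ymax = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    xmax = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(ymax - ymin, 0, None) * np.clip(xmax - xmin, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def match_anchors(anchors, gts, threshold=0.5):
    """Return, per anchor, the index of its matched gt and whether it is positive."""
    iou = iou_matrix(anchors, gts)
    matched_gt = iou.argmax(axis=1)             # gt with the highest overlap
    positive = iou.max(axis=1) > threshold      # step 2: IoU above the threshold
    best_anchor = iou.argmax(axis=0)            # step 1: best anchor for each gt
    matched_gt[best_anchor] = np.arange(len(gts))
    positive[best_anchor] = True
    return matched_gt, positive

anchors = np.array([[0.0, 0.0, 0.5, 0.5],
                    [0.0, 0.5, 0.5, 1.0],
                    [0.5, 0.0, 1.0, 0.5],
                    [0.4, 0.4, 1.0, 1.0]])
gts = np.array([[0.0, 0.0, 0.5, 0.5],
                [0.5, 0.5, 1.0, 1.0]])
matched_gt, positive = match_anchors(anchors, gts)
print(matched_gt, positive)
```

Here the first anchor coincides with the first gt and the last anchor covers the second gt with an IoU of about 0.69, so those two are positives and the other two are negatives.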

# labels: classes of the gt boxes
# bboxes: positions of the gt boxes
# anchors_layer: default box positions for one particular feature map
# num_classes: number of classes
# no_annotation_label: 21
# ignore_threshold=0.5: threshold
# prior_scaling=[0.1, 0.1, 0.2, 0.2]: scaling
def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.

    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.

    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    # Anchor centers, heights and widths;
    # shapes are (38,38,1), (38,38,1), (4,), (4,).
    yref, xref, href, wref = anchors_layer
    ymin = yref - href / 2.  # minimum-y edge of the anchor, (38,38,4)
    xmin = xref - wref / 2.  # left edge of the anchor, (38,38,4)
    ymax = yref + href / 2.  # maximum-y edge of the anchor, (38,38,4)
    xmax = xref + wref / 2.  # right edge of the anchor, (38,38,4)
    vol_anchors = (xmax - xmin) * (ymax - ymin)  # anchor area, (38,38,4)

    # Initialize tensors...
    # shape is (38,38,4)
    shape = (yref.shape[0], yref.shape[1], href.size)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)

    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)

    # Compute the Jaccard overlap.
    # bbox holds the four normalized edges of one gt box.
    def jaccard_with_anchors(bbox):
        """Compute jaccard score between a box and the anchors.
        """
        # Intersection of the gt box with the anchors.
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        # Volumes.
        inter_vol = h * w  # intersection area
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)
        return jaccard  # the overlap

    # Ratio of the intersection area to the anchor area.
    def intersection_with_anchors(bbox):
        """Compute intersection between score a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors)
        return scores

    # Condition of the tf.while_loop.
    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        """
        # Returns whether i < tf.shape(labels) holds.
        r = tf.less(i, tf.shape(labels))
        return r[0]

    # Body of the tf.while_loop.
    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        # Class and position of the i-th gt.
        label = labels[i]
        bbox = bboxes[i]
        # Overlap between this gt and every anchor.
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        # Element-wise comparison, True where greater; shape (38,38,4).
        # feat_scores holds the highest overlap seen so far for each anchor.
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        # logical AND
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)
        # Update values using mask.
        # Update classes and positions according to imask: imask marks the
        # anchors whose overlap with the current gt exceeds their overlap with
        # the previous gts; (1 - imask) keeps the earlier result.
        # Update the class label of each anchor.
        feat_labels = imask * label + (1 - imask) * feat_labels
        # Where mask is True take jaccard, otherwise keep feat_scores, so each
        # anchor retains its maximum overlap.
        feat_scores = tf.where(mask, jaccard, feat_scores)
        # Update the gt assigned to each anchor (the one with maximum overlap).
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax

        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)

        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]

    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])
    # Transform to center / size.
    # Center, height and width of the gt assigned to each anchor.
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    # Offsets between each anchor and its assigned gt.
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    # Return each anchor's class label, its offsets to the assigned gt, and
    # its overlap with that gt.
    return feat_labels, feat_localizations, feat_scores
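The "Encode features" step at the end of the function is easiest to see on a single box. The NumPy sketch below mirrors those four lines and adds the inverse transform; `encode` and `decode` are illustrative names, the sample boxes are made up, and boxes are given as `(cy, cx, h, w)`.

```python
import numpy as np

PRIOR_SCALING = (0.1, 0.1, 0.2, 0.2)

def encode(gt, anchor, scaling=PRIOR_SCALING):
    """Offsets of a gt box relative to an anchor, ordered (cx, cy, w, h)."""
    gcy, gcx, gh, gw = gt
    acy, acx, ah, aw = anchor
    return np.array([
        (gcx - acx) / aw / scaling[1],
        (gcy - acy) / ah / scaling[0],
        np.log(gw / aw) / scaling[3],
        np.log(gh / ah) / scaling[2],
    ])

def decode(offsets, anchor, scaling=PRIOR_SCALING):
    """Inverse of encode: recover (cy, cx, h, w) from offsets and the anchor."""
    tcx, tcy, tw, th = offsets
    acy, acx, ah, aw = anchor
    cy = tcy * scaling[0] * ah + acy
    cx = tcx * scaling[1] * aw + acx
    h = ah * np.exp(th * scaling[2])
    w = aw * np.exp(tw * scaling[3])
    return np.array([cy, cx, h, w])

anchor = (0.5, 0.5, 0.2, 0.2)      # cy, cx, h, w
gt = (0.52, 0.48, 0.3, 0.25)
offsets = encode(gt, anchor)
print(offsets)
print(decode(offsets, anchor))
```

Dividing by `prior_scaling` enlarges the regression targets; the same factors must be multiplied back when decoding predictions at inference time.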

3. Loss function

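The SSD loss function has two parts:

localization loss (loc)
confidence loss (conf)

Define $x_{ij}^{p} \in \{1, 0\}$, where 1 indicates that the i-th default box is matched to the j-th ground truth box of class p, and 0 indicates no match.

The training objective is:

$L(x,c,l,g) = \frac{1}{N}(L_{conf}(x,c) + \alpha L_{loc}(x,l,g))$, where N is the number of matched default boxes; if N = 0, the loss is set to 0;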

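$L_{loc}$ is the Smooth L1 loss between the predicted box $l$ and the ground truth box $g$, defined as follows:

[Figure: localization loss]

$l$ is the predicted box, $g$ is the ground truth, and $d$ is the default box; the regression targets are the position offsets.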
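Since the formula image did not survive, the localization loss is restated here as it appears in the SSD paper, where $(cx, cy)$ is the box center and $w$, $h$ are its width and height:

$$
L_{loc}(x,l,g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right)
$$

$$
\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad
\hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad
\hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad
\hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}}
$$

These offsets $\hat{g}$ have the same form as the encoding in tf_ssd_bboxes_encode_layer, which additionally divides each term by prior_scaling.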

$L_{conf}$ is the multi-class softmax loss, defined as follows; $\alpha$ is set to 1 by cross validation:

[Figure: confidence loss]
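Since the formula image did not survive, the confidence loss is restated here as it appears in the SSD paper, with class 0 denoting the background:

$$
L_{conf}(x,c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right),
\quad \text{where} \quad
\hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p} \exp\left(c_{i}^{p}\right)}
$$

The first sum covers the matched (positive) default boxes and the second covers the negatives selected for training.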

Key code analysis:

# SSD loss function definition
# logits: predicted classes
# localisations: predicted position offsets
# gclasses: class assigned to each default box from its matched gt
# glocalisations: offset of each default box relative to its matched gt
# gscores: overlap between each default box and its matched gt
def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               device='/cpu:0',
               scope=None):
    with tf.name_scope(scope, 'ssd_losses'):
        lshape = tfe.get_shape(logits[0], 5)
        # number of classes
        num_classes = lshape[-1]
        batch_size = lshape[0]

        # Flatten out all vectors!
        flogits = []
        fgclasses = []
        fgscores = []
        flocalisations = []
        fglocalisations = []
        # Process the predictions of every feature map size:
        # (38,38),(19,19),(10,10),(5,5),(3,3),(1,1)
        for i in range(len(logits)):
            # predicted classes (38*38*4, 21)
            flogits.append(tf.reshape(logits[i], [-1, num_classes]))
            # ground truth classes (38*38*4)
            fgclasses.append(tf.reshape(gclasses[i], [-1]))
            # overlaps (38*38*4)
            fgscores.append(tf.reshape(gscores[i], [-1]))
            # predicted offsets (38*38*4, 4)
            flocalisations.append(tf.reshape(localisations[i], [-1, 4]))
            # ground truth offsets (38*38*4, 4)
            fglocalisations.append(tf.reshape(glocalisations[i], [-1, 4]))
        # And concatenate everything.
        logits = tf.concat(flogits, axis=0)
        gclasses = tf.concat(fgclasses, axis=0)
        gscores = tf.concat(fgscores, axis=0)
        localisations = tf.concat(flocalisations, axis=0)
        glocalisations = tf.concat(fglocalisations, axis=0)
        dtype = logits.dtype

        # Compute positive matching mask...
        # Default boxes with overlap > 0.5 are the positives; their count is
        # the N of the loss function.
        pmask = gscores > match_threshold
        fpmask = tf.cast(pmask, dtype)
        n_positives = tf.reduce_sum(fpmask)

        # Hard negative mining...
        no_classes = tf.cast(pmask, tf.int32)
        # softmax over the predicted classes
        predictions = slim.softmax(logits)
        # logical AND: positions of the negative samples
        nmask = tf.logical_and(tf.logical_not(pmask),
                               gscores > -0.5)
        fnmask = tf.cast(nmask, dtype)
        # background probability of the negative samples
        nvalues = tf.where(nmask,
                           predictions[:, 0],
                           1. - fnmask)
        nvalues_flat = tf.reshape(nvalues, [-1])
        # Number of negative entries to select.
        # Keeps the positive-to-negative ratio at 1:3.
        max_neg_entries = tf.cast(tf.reduce_sum(fnmask), tf.int32)
        n_neg = tf.cast(negative_ratio * n_positives, tf.int32) + batch_size
        n_neg = tf.minimum(n_neg, max_neg_entries)

        val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
        max_hard_pred = -val[-1]
        # Final negative mask.
        nmask = tf.logical_and(nmask, nvalues < max_hard_pred)
        fnmask = tf.cast(nmask, dtype)

        # Add cross-entropy loss.
        # Classification loss of the positive samples.
        with tf.name_scope('cross_entropy_pos'):
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=gclasses)
            loss = tf.div(tf.reduce_sum(loss * fpmask), batch_size, name='value')
            tf.losses.add_loss(loss)

        # Classification loss of the negative samples.
        with tf.name_scope('cross_entropy_neg'):
            loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits,
                                                                  labels=no_classes)
            loss = tf.div(tf.reduce_sum(loss * fnmask), batch_size, name='value')
            tf.losses.add_loss(loss)

        # Add localization loss: smooth L1, L2, ...
        # Localization loss.
        with tf.name_scope('localization'):
            # Weights Tensor: positive mask + random negative.
            weights = tf.expand_dims(alpha * fpmask, axis=-1)
            loss = custom_layers.abs_smooth(localisations - glocalisations)
            loss = tf.div(tf.reduce_sum(loss * weights), batch_size, name='value')
            tf.losses.add_loss(loss)
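The localization term relies on the smooth L1 function. The NumPy sketch below assumes the standard definition (0.5·x² when |x| < 1, otherwise |x| − 0.5), which is what `custom_layers.abs_smooth` is understood to compute, and mimics the weighting by the positive mask on a toy batch. The names `smooth_l1` and `localization_loss` and the sample values are illustrative.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

def localization_loss(pred, target, positive, batch_size=1, alpha=1.0):
    """Sum of smooth L1 over the positive boxes only, divided by the batch size."""
    weights = alpha * positive.astype(pred.dtype)[:, None]
    return float(np.sum(smooth_l1(pred - target) * weights) / batch_size)

pred = np.array([[0.5, 0.0, 2.0, -2.0],
                 [3.0, 3.0, 3.0, 3.0]])
target = np.zeros_like(pred)
positive = np.array([True, False])
print(smooth_l1(pred[0]))
print(localization_loss(pred, target, positive))
```

Only the first row contributes because the second box is not a positive match, which is the role of `fpmask` in `ssd_losses`.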

4. Hard Negative Mining

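The vast majority of default boxes are negative samples, which leaves positives and negatives heavily imbalanced. During training, a Hard Negative Mining strategy keeps the positive-to-negative ratio at 1:3 to rebalance them.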
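The selection can be sketched in NumPy as follows. This is a simplified version of the idea rather than the repository's exact code (which uses `tf.nn.top_k` and a threshold, and adds `batch_size` to the negative count): negatives are ranked by their predicted background probability, and the ones the network is least sure about are kept. The name `hard_negative_mask` and the sample values are illustrative.

```python
import numpy as np

def hard_negative_mask(background_prob, positive, negative_ratio=3):
    """Select up to negative_ratio * n_positives negatives with the lowest
    background probability, i.e. the hardest ones."""
    negative = ~positive
    n_neg = min(int(negative_ratio * positive.sum()), int(negative.sum()))
    # Positives get +inf so they are never picked as negatives.
    scores = np.where(negative, background_prob, np.inf)
    hardest = np.argsort(scores)[:n_neg]
    mask = np.zeros(positive.shape, dtype=bool)
    mask[hardest] = True
    return mask

positive = np.array([True, False, False, False, False, False])
background_prob = np.array([0.10, 0.90, 0.20, 0.80, 0.30, 0.95])
selected = hard_negative_mask(background_prob, positive)
print(selected)
```

With one positive and a ratio of 3, the three negatives with background probabilities 0.20, 0.30, and 0.80 are kept, while the two easiest negatives (0.90 and 0.95) are dropped from the loss.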

Summary

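This article summarizes the key points of the paper and analyzes the key source code. Many details remained unclear after reading the paper alone; reading the source code made them clear.
The article covered SSD's multi-scale feature map detection, default box generation, training data preprocessing, and the objective function.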

References

SSD: Single Shot MultiBox Detector

https://www.zhihu.com/people/shan-ren-87-35/posts

https://www.zhihu.com/people/xiao-lei-75-81/posts