2.3 Code Walkthrough of the Inception Network for Face Recognition

01-23

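TensorFlow still has a lot of fundamentals to cover, and any quick treatment would leave too much out, so I am skipping Chapter 2.2 for now and will come back to it once I find a good way to explain it.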

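This chapter walks through the code of the Inception network, which can handle face classification. Broadly, face recognition work splits into three parts: face detection, face alignment, and face recognition. Remember the small box that appears around a face when you take a photo with your phone? That is face detection. The detector used here is MTCNN. The name sounds intimidating, but it is really just three simple convolutional neural networks combined (figure taken from the paper Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks):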

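NMS (non-maximum suppression) consolidates the many small overlapping boxes: it keeps the highest-scoring box and discards the ones that overlap it heavily. P-Net is purely convolutional, which lets it process images of any size, and its output is a large number of candidate boxes. Because the faces seen during training vary in size, detection has to build a so-called "image pyramid", running the network on the image at several scales. The remaining two networks combine convolutional and fully connected layers and are used to refine and align the faces. Their structure is as follows: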

To repeat, the two figures above are taken from the paper (Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks).
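To make the NMS step described above concrete, here is a minimal pure-Python sketch of greedy non-maximum suppression. It is not the MTCNN implementation; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy NMS (hypothetical helper)."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two nearly identical boxes around one face, plus one box elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of roughly 0.82, so it is suppressed, while the third box does not overlap anything and is kept.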

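Face detection means locating the faces contained in an image. It is a relatively simple task, and it provides the basis for the face identification that follows. Face identification corresponds to FaceRecognise, which is the main subject of this article. The purpose of the Inception network here is to classify faces. Other classification architectures include ResNet, which is an entirely different structure: a very deep network with more than a hundred layers.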

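TensorFlow ships with a complete Inception network. In TensorFlow 1.x, most of the high-level APIs live in the contrib library (which was removed in TensorFlow 2.0):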

tensorflow.contrib.slim

The slim library contains many ready-made networks, which saves us the trouble of building them ourselves.

If we open the inception part of the code directly, the network structure is listed in the comments:

Conv2d_1a_3x3
Conv2d_2a_3x3
Conv2d_2b_3x3
MaxPool_3a_3x3
Conv2d_3b_1x1
Conv2d_4a_3x3
MaxPool_5a_3x3
Mixed_5b
Mixed_5c
Mixed_5d
Mixed_6a
Mixed_6b
Mixed_6c
Mixed_6d
Mixed_6e
Mixed_7a
Mixed_7b
Mixed_7c

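The first few layers are convolution and pooling layers, no different from a traditional convolutional neural network. After them come the Mixed layers. In principle each one is also a combination of several convolutional layers, but the detailed structure has been optimized so that it needs fewer parameters.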

First, let us look at how the network structure defined above is written in TensorFlow.

For the convolutional layers:

with variable_scope.variable_scope(scope, "InceptionV3", [inputs]):
    with arg_scope(
            [layers.conv2d, layers_lib.max_pool2d, layers_lib.avg_pool2d],
            stride=1,
            padding="VALID"):
        # 299 x 299 x 3
        end_point = "Conv2d_1a_3x3"
        net = layers.conv2d(inputs, depth(32), [3, 3], stride=2, scope=end_point)
        end_points[end_point] = net
        if end_point == final_endpoint:
            return net, end_points
        # 149 x 149 x 32
        end_point = "Conv2d_2a_3x3"
        net = layers.conv2d(net, depth(32), [3, 3], scope=end_point)
        end_points[end_point] = net
        if end_point == final_endpoint:
            return net, end_points
        # 147 x 147 x 32
        end_point = "Conv2d_2b_3x3"
        net = layers.conv2d(
            net, depth(64), [3, 3], padding="SAME", scope=end_point)
        end_points[end_point] = net
        if end_point == final_endpoint:
            return net, end_points
        # 147 x 147 x 64
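The size comments in the code above (299, 149, 147, 147) follow from TensorFlow's padding rules. The sketch below reproduces that arithmetic in plain Python; `conv_output_size` is a hypothetical helper written for this article, not a TensorFlow function.

```python
import math


def conv_output_size(size, kernel, stride=1, padding="VALID"):
    """Spatial output size of a convolution under TensorFlow's padding rules."""
    if padding == "SAME":
        # SAME pads the input so that only the stride shrinks the output.
        return math.ceil(size / stride)
    if padding == "VALID":
        # VALID uses no padding, so the kernel must fit entirely inside.
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# (kernel, stride, padding) for the first three layers shown above.
stem = [
    ("Conv2d_1a_3x3", 3, 2, "VALID"),
    ("Conv2d_2a_3x3", 3, 1, "VALID"),
    ("Conv2d_2b_3x3", 3, 1, "SAME"),
]

size = 299
sizes = []
for name, kernel, stride, padding in stem:
    size = conv_output_size(size, kernel, stride, padding)
    sizes.append(size)
    print(name, size)
```

The only layer that overrides the `arg_scope` defaults for both settings is the first one (`stride=2`), which is why the size drops from 299 to 149 there and only by 2 in the next layer.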

Remember how we used to build a convolutional layer by hand?

import tensorflow as tf  # TensorFlow 1.x


def conv(self, data, weigh_shape, bias_shape, activ=tf.nn.relu):
    """Define the conv layer."""
    weight = tf.get_variable("conv_weight", weigh_shape,
                             initializer=tf.random_normal_initializer())
    biases = tf.get_variable("conv_bias", bias_shape,
                             initializer=tf.constant_initializer(0.0))
    # A four-element strides list belongs to a 2-D convolution, so use conv2d.
    conv = tf.nn.conv2d(data, weight, strides=[1, 1, 1, 1],
                        padding="VALID", name="conv_data")
    # Add the bias, then apply the activation function.
    return activ(conv + biases)

Writing every layer this way is clearly tedious, so TensorFlow provides layers:

from tensorflow.contrib import layers
layers.conv2d(...)
# or
import tensorflow.contrib.slim as slim
slim.conv2d(...)

This simplifies building neural networks considerably.

An inception layer, in turn, is a small bundle of convolution and concat operations:

end_point = "Mixed_5b"
with variable_scope.variable_scope(end_point):
    with variable_scope.variable_scope("Branch_0"):
        branch_0 = layers.conv2d(
            net, depth(64), [1, 1], scope="Conv2d_0a_1x1")
    with variable_scope.variable_scope("Branch_1"):
        branch_1 = layers.conv2d(
            net, depth(48), [1, 1], scope="Conv2d_0a_1x1")
        branch_1 = layers.conv2d(
            branch_1, depth(64), [5, 5], scope="Conv2d_0b_5x5")
    with variable_scope.variable_scope("Branch_2"):
        branch_2 = layers.conv2d(
            net, depth(64), [1, 1], scope="Conv2d_0a_1x1")
        branch_2 = layers.conv2d(
            branch_2, depth(96), [3, 3], scope="Conv2d_0b_3x3")
        branch_2 = layers.conv2d(
            branch_2, depth(96), [3, 3], scope="Conv2d_0c_3x3")
    with variable_scope.variable_scope("Branch_3"):
        branch_3 = layers_lib.avg_pool2d(net, [3, 3], scope="AvgPool_0a_3x3")
        branch_3 = layers.conv2d(
            branch_3, depth(32), [1, 1], scope="Conv2d_0b_1x1")
    net = array_ops.concat([branch_0, branch_1, branch_2, branch_3], 3)
end_points[end_point] = net
if end_point == final_endpoint:
    return net, end_points
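Since the four branches are concatenated along axis 3 (the channel axis), the output depth of Mixed_5b is simply the sum of the last layer's depth in each branch. The sketch below works that out. The `make_depth` helper is an assumption: it guesses that `depth` scales a channel count by `depth_multiplier` and clamps it to at least `min_depth`, as the parameter names in the `inception_v3` signature suggest.

```python
def make_depth(depth_multiplier=1.0, min_depth=16):
    """Build a depth() helper (assumed behavior: scale, then clamp from below)."""
    def depth(d):
        return max(int(d * depth_multiplier), min_depth)
    return depth


def mixed_5b_channels(depth):
    """Output channels of Mixed_5b: the sum of each branch's final conv depth."""
    branch_channels = {
        "Branch_0": depth(64),  # 1x1 conv
        "Branch_1": depth(64),  # 1x1 conv, then 5x5 conv
        "Branch_2": depth(96),  # 1x1 conv, then two 3x3 convs
        "Branch_3": depth(32),  # average pool, then 1x1 conv
    }
    return sum(branch_channels.values())


full = mixed_5b_channels(make_depth())
half = mixed_5b_channels(make_depth(depth_multiplier=0.5))
print(full, half)
```

Concatenating on the channel axis only works if all four branches keep the same height and width, which is why every branch has to preserve the spatial size of its input.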

The network described by the Mixed_5b code is:

Because large convolution kernels are computationally expensive, they can be replaced by a deeper stack of small kernels:

Both figures are taken from the paper (http://arxiv.org/abs/1512.00567).
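The saving from replacing a large kernel with stacked small ones can be checked with a little arithmetic. The sketch below compares one 5x5 convolution with two stacked 3x3 convolutions. The channel count of 64 is an arbitrary illustrative value, biases are ignored, and both helper functions are hypothetical.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a square convolution kernel (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def receptive_field(kernels):
    """Receptive field of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernels)


channels = 64  # illustrative value, not taken from the article
single_5x5 = conv_weights(5, channels, channels)
stacked_3x3 = 2 * conv_weights(3, channels, channels)

print(single_5x5, stacked_3x3, stacked_3x3 / single_5x5)
print(receptive_field([5]), receptive_field([3, 3]))
```

Two 3x3 layers cover the same 5x5 receptive field with 18 weights per channel pair instead of 25, a reduction of 28 percent.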

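After several inception layers are stacked, the whole Inception network is complete. In fact, the function:

def inception_v3(inputs,
                 num_classes=1000,
                 is_training=True,
                 dropout_keep_prob=0.8,
                 min_depth=16,
                 depth_multiplier=1.0,
                 prediction_fn=layers_lib.softmax,
                 spatial_squeeze=True,
                 reuse=None,
                 scope="InceptionV3"):
    ....

builds the Inception model and returns it (in slim, as the logits together with the end_points dictionary), so it can be used directly in face recognition work.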