[Blog Archive] TensorFlow: An In-Depth Look at VGG and Residual Network

01-24

Preface

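I recently joined a new company, where I have started working with deep learning and TensorFlow, so things have been busy. A while ago I read the VGG and deep residual papers but never found time to write about them, so today I am going back over both papers properly.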

VGGnet

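VGGNet is the work of the Visual Geometry Group at Oxford for ILSVRC 2014. Its main contribution is showing that increasing network depth can improve final performance, up to a point. As the figure below shows, the paper improves performance by deepening the network step by step. The approach looks a little brute-force and has few clever tricks, but it clearly works, and many methods that rely on pretrained weights use a VGG model (mainly the 16- and 19-layer ones). Compared with other architectures, VGG has a very large parameter space: the final model is over 500 MB, while AlexNet is only about 200 MB and GoogLeNet is smaller still. Training a VGG model therefore usually takes longer. Fortunately, publicly available pretrained models make it easy to use; my earlier neural style post used such a pretrained model. The configurations in the paper are as follows: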

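The figure shows that from A through E, what grows is the number of convolutional layers in each convolutional group; D and E are the familiar VGG-16 and VGG-19 models. For C, the authors explain that the 1*1 convolutions act as a linear transformation followed by an extra ReLU (the channel count stays the same here, so there is no dimensionality reduction). In the final results, C does improve on B to some extent, but it does not match D. The main advantages of VGG are: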

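Fewer parameters. A stack of convolutions (three here; the paper discusses a stack of three 3*3 layers) covers the same receptive field as a single 7*7 layer while applying three non-linearities (three ReLUs). The stack needs only 3*(3^2C^2) = 27C^2 parameters versus 49C^2 for the 7*7 layer, so it has about 55% of the parameters (put differently, the 7*7 layer needs 81% more).
LRN is removed, which reduces memory consumption and computation time.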
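The parameter arithmetic above can be checked in a few lines of Python. This is only a minimal sketch: it ignores biases and assumes C input and C output channels for every layer, as in the paper's argument. The helper name `conv_weights` is made up for illustration.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    """Weights in a stack of square convolutions with `channels` in and out (biases ignored)."""
    return num_layers * (kernel_size ** 2) * channels ** 2


C = 256  # arbitrary channel count; the ratio does not depend on it
stacked_3x3 = conv_weights(3, C, num_layers=3)  # 27 * C^2
single_7x7 = conv_weights(7, C)                 # 49 * C^2

print(stacked_3x3 / single_7x7)      # fraction of the 7x7 parameter count
print(single_7x7 / stacked_3x3 - 1)  # how many more parameters the 7x7 layer needs
```

The first ratio is 27/49 and the second is 49/27 - 1, independent of the value chosen for C.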

VGG-16 implementation in tflearn

The official tflearn GitHub repository provides a VGG-16 implementation built on tflearn:

    from __future__ import division, print_function, absolute_import

    import tflearn
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.conv import conv_2d, max_pool_2d
    from tflearn.layers.estimator import regression

    # Data loading and preprocessing
    import tflearn.datasets.oxflower17 as oxflower17
    X, Y = oxflower17.load_data(one_hot=True)

    # Building "VGG Network"
    network = input_data(shape=[None, 224, 224, 3])

    network = conv_2d(network, 64, 3, activation="relu")
    network = conv_2d(network, 64, 3, activation="relu")
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 128, 3, activation="relu")
    network = conv_2d(network, 128, 3, activation="relu")
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 256, 3, activation="relu")
    network = conv_2d(network, 256, 3, activation="relu")
    network = conv_2d(network, 256, 3, activation="relu")
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 512, 3, activation="relu")
    network = conv_2d(network, 512, 3, activation="relu")
    network = conv_2d(network, 512, 3, activation="relu")
    network = max_pool_2d(network, 2, strides=2)

    network = conv_2d(network, 512, 3, activation="relu")
    network = conv_2d(network, 512, 3, activation="relu")
    network = conv_2d(network, 512, 3, activation="relu")
    network = max_pool_2d(network, 2, strides=2)

    network = fully_connected(network, 4096, activation="relu")
    network = dropout(network, 0.5)
    network = fully_connected(network, 4096, activation="relu")
    network = dropout(network, 0.5)
    network = fully_connected(network, 17, activation="softmax")

    network = regression(network, optimizer="rmsprop",
                         loss="categorical_crossentropy",
                         learning_rate=0.001)

    # Training
    model = tflearn.DNN(network, checkpoint_path="model_vgg",
                        max_checkpoints=1, tensorboard_verbose=0)
    model.fit(X, Y, n_epoch=500, shuffle=True,
              show_metric=True, batch_size=32, snapshot_step=500,
              snapshot_epoch=False, run_id="vgg_oxflowers17")
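As a rough cross-check of the layer structure and of the model size mentioned earlier, the sketch below counts the parameters of this VGG-16 layout in plain Python. It assumes 3*3 convolutions with "same" padding, 2*2 pooling with stride 2, float32 weights, and the 1000 ImageNet classes rather than the 17 flower classes used in the script. The names `STAGES` and `vgg16_summary` are made up for illustration.

```python
# Convolution widths per stage, matching the tflearn network above;
# each stage is followed by a 2x2 max-pool with stride 2.
STAGES = [[64, 64], [128, 128], [256, 256, 256], [512, 512, 512], [512, 512, 512]]


def vgg16_summary(input_size=224, in_channels=3, num_classes=1000):
    """Return (final feature-map side, total parameter count) for a VGG-16 layout."""
    size, channels, params = input_size, in_channels, 0
    for stage in STAGES:
        for out_channels in stage:
            # 3x3 kernel weights plus one bias per output channel
            params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # "same" padding keeps the size; pooling halves it
    fan_in = size * size * channels  # flattened input to the first dense layer
    for units in (4096, 4096, num_classes):
        params += fan_in * units + units
        fan_in = units
    return size, params


side, total = vgg16_summary()
print(side, total, total * 4 / 1e6)  # side, parameters, size in MB at 4 bytes each
```

Under these assumptions the count comes to roughly 138 million parameters, or about 550 MB at 4 bytes per weight, which is consistent with the 500+ MB model size quoted above. Calling `vgg16_summary(num_classes=17)` gives the count for the flower classifier in the script.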

The VGG-16 graph is shown below:

Personally, I don't find many highlights in VGG. The pretrained models are very useful, but it did not impress me the way GoogLeNet did.

Deep Residual Network

Deep Residual Network explained

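In general, the deeper a network is, the harder it is to train. Deep Residual Learning for Image Recognition proposes a residual learning framework that makes such networks much easier to train, so that models can go far deeper within an acceptable training time (152 layers, with experiments beyond 1000 layers). The method took first place in ILSVRC 2015.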

As model depth increases, the following problems appear:

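Vanishing/exploding gradients, which make training very hard to converge. This problem can be addressed with normalized initialization and intermediate normalization layers;
Degradation: when more layers are added to a model of suitable depth, accuracy drops quickly (not because of overfitting), and both training error and test error become high. The paper reports this on both CIFAR-10 and ImageNet.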

To address this performance degradation as depth increases, the authors propose the following structure for residual learning:

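Let the underlying mapping be H(x), and let the stacked nonlinear layers fit F(x) := H(x) - x instead; optimizing the residual is easier than optimizing H(x) directly. F(x) + x is easy to implement with "shortcut connections".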
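The formulation y = F(x) + x can be sketched in plain Python. This is a toy illustration, not the paper's implementation: F is assumed to be two small fully connected layers with a ReLU in between, and the helpers `relu`, `linear` and `residual_unit` are made-up names.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Multiply vector v by a square weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_unit(x, w1, w2):
    """y = F(x) + x, where F is two linear layers with a ReLU in between."""
    f = linear(relu(linear(x, w1)), w2)        # residual branch F(x)
    return [fi + xi for fi, xi in zip(f, x)]   # shortcut adds the input back


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_unit(x, zeros, zeros))  # with F(x) = 0 the unit is an identity mapping
```

With all weights at zero the branch contributes nothing and the unit returns its input unchanged, which is why an identity mapping is easy for such a unit to represent.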

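The main improvement in this paper is adding residual learning to the conventional convolutional model, using residual optimization to find mappings that are close to the optimal identity mappings.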

One of the network architectures from the paper:

Deep Residual Network implementation in tflearn

The official tflearn repository has a CIFAR-10 implementation; the code is as follows:

    from __future__ import division, print_function, absolute_import

    import tflearn

    # Residual blocks
    # 32 layers: n=5, 56 layers: n=9, 110 layers: n=18
    n = 5

    # Data loading
    from tflearn.datasets import cifar10
    (X, Y), (testX, testY) = cifar10.load_data()
    Y = tflearn.data_utils.to_categorical(Y, 10)
    testY = tflearn.data_utils.to_categorical(testY, 10)

    # Real-time data preprocessing
    img_prep = tflearn.ImagePreprocessing()
    img_prep.add_featurewise_zero_center(per_channel=True)

    # Real-time data augmentation
    img_aug = tflearn.ImageAugmentation()
    img_aug.add_random_flip_leftright()
    img_aug.add_random_crop([32, 32], padding=4)

    # Building Residual Network
    net = tflearn.input_data(shape=[None, 32, 32, 3],
                             data_preprocessing=img_prep,
                             data_augmentation=img_aug)
    net = tflearn.conv_2d(net, 16, 3, regularizer="L2", weight_decay=0.0001)
    net = tflearn.residual_block(net, n, 16)
    net = tflearn.residual_block(net, 1, 32, downsample=True)
    net = tflearn.residual_block(net, n-1, 32)
    net = tflearn.residual_block(net, 1, 64, downsample=True)
    net = tflearn.residual_block(net, n-1, 64)
    net = tflearn.batch_normalization(net)
    net = tflearn.activation(net, "relu")
    net = tflearn.global_avg_pool(net)

    # Regression
    net = tflearn.fully_connected(net, 10, activation="softmax")
    mom = tflearn.Momentum(0.1, lr_decay=0.1, decay_step=32000, staircase=True)
    net = tflearn.regression(net, optimizer=mom,
                             loss="categorical_crossentropy")

    # Training
    model = tflearn.DNN(net, checkpoint_path="model_resnet_cifar10",
                        max_checkpoints=10, tensorboard_verbose=0,
                        clip_gradients=0.)
    model.fit(X, Y, n_epoch=200, validation_set=(testX, testY),
              snapshot_epoch=False, snapshot_step=500,
              show_metric=True, batch_size=128, shuffle=True,
              run_id="resnet_cifar10")
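The comment "32 layers: n=5, 56 layers: n=9, 110 layers: n=18" in the script follows from counting the weighted layers. The sketch below does that count, assuming each residual block holds two 3*3 convolutions (as in the `residual_block` source) and counting the first convolution and the final fully connected layer once each. The function name `resnet_cifar_depth` is made up for illustration.

```python
def resnet_cifar_depth(n):
    """Weighted layers in the CIFAR-10 network above for a given n."""
    blocks = n + 1 + (n - 1) + 1 + (n - 1)  # the five residual_block calls: 3n blocks
    convs_in_blocks = 2 * blocks            # each block holds two 3x3 convolutions
    return 1 + convs_in_blocks + 1          # first conv + blocks + final dense layer


for n in (5, 9, 18):
    print(n, resnet_cifar_depth(n))
```

This reduces to 6n + 2, which reproduces the three depths listed in the comment.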

Here, residual_block implements the shortcut, and the code is very nicely written:

    def residual_block(incoming, nb_blocks, out_channels, downsample=False,
                       downsample_strides=2, activation="relu", batch_norm=True,
                       bias=True, weights_init="variance_scaling",
                       bias_init="zeros", regularizer="L2", weight_decay=0.0001,
                       trainable=True, restore=True, reuse=False, scope=None,
                       name="ResidualBlock"):
        """ Residual Block.

        A residual block as described in MSRA's Deep Residual Network paper.
        Full pre-activation architecture is used here.

        Input:
            4-D Tensor [batch, height, width, in_channels].

        Output:
            4-D Tensor [batch, new height, new width, nb_filter].

        Arguments:
            incoming: `Tensor`. Incoming 4-D Layer.
            nb_blocks: `int`. Number of layer blocks.
            out_channels: `int`. The number of convolutional filters of the
                convolution layers.
            downsample: `bool`. If True, apply downsampling using
                'downsample_strides' for strides.
            downsample_strides: `int`. The strides to use when downsampling.
            activation: `str` (name) or `function` (returning a `Tensor`).
                Activation applied to this layer (see tflearn.activations).
                Default: 'linear'.
            batch_norm: `bool`. If True, apply batch normalization.
            bias: `bool`. If True, a bias is used.
            weights_init: `str` (name) or `Tensor`. Weights initialization.
                (see tflearn.initializations) Default: 'uniform_scaling'.
            bias_init: `str` (name) or `tf.Tensor`. Bias initialization.
                (see tflearn.initializations) Default: 'zeros'.
            regularizer: `str` (name) or `Tensor`. Add a regularizer to this
                layer weights (see tflearn.regularizers). Default: None.
            weight_decay: `float`. Regularizer decay parameter. Default: 0.001.
            trainable: `bool`. If True, weights will be trainable.
            restore: `bool`. If True, this layer weights will be restored when
                loading a model.
            reuse: `bool`. If True and 'scope' is provided, this layer variables
                will be reused (shared).
            scope: `str`. Define this layer scope (optional). A scope can be
                used to share variables between layers. Note that scope will
                override name.
            name: A name for this layer (optional). Default: 'ShallowBottleneck'.

        References:
            - Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu
                Zhang, Shaoqing Ren, Jian Sun. 2015.
            - Identity Mappings in Deep Residual Networks. Kaiming He, Xiangyu
                Zhang, Shaoqing Ren, Jian Sun. 2015.

        Links:
            - [http://arxiv.org/pdf/1512.03385v1.pdf]
                (http://arxiv.org/pdf/1512.03385v1.pdf)
            - [Identity Mappings in Deep Residual Networks]
                (https://arxiv.org/pdf/1603.05027v2.pdf)

        """
        resnet = incoming
        in_channels = incoming.get_shape().as_list()[-1]

        with tf.variable_op_scope([incoming], scope, name, reuse=reuse) as scope:
            name = scope.name #TODO

            for i in range(nb_blocks):

                identity = resnet

                if not downsample:
                    downsample_strides = 1

                if batch_norm:
                    resnet = tflearn.batch_normalization(resnet)
                resnet = tflearn.activation(resnet, activation)

                resnet = conv_2d(resnet, out_channels, 3,
                                 downsample_strides, "same", "linear",
                                 bias, weights_init, bias_init,
                                 regularizer, weight_decay, trainable,
                                 restore)

                if batch_norm:
                    resnet = tflearn.batch_normalization(resnet)
                resnet = tflearn.activation(resnet, activation)

                resnet = conv_2d(resnet, out_channels, 3, 1, "same",
                                 "linear", bias, weights_init,
                                 bias_init, regularizer, weight_decay,
                                 trainable, restore)

                # Downsampling
                if downsample_strides > 1:
                    identity = tflearn.avg_pool_2d(identity, 1,
                                                   downsample_strides)

                # Projection to new dimension
                if in_channels != out_channels:
                    ch = (out_channels - in_channels)//2
                    identity = tf.pad(identity,
                                      [[0, 0], [0, 0], [0, 0], [ch, ch]])
                    in_channels = out_channels

                resnet = resnet + identity

        return resnet

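The tflearn Deep Residual Network code contains a downsample step. When I ran this code I hit an error: TensorFlow complained that the kernel size of 1 is smaller than the stride. I looked at it for a long time, and the downsampling really does need to work this way. Could it be that TensorFlow does not support a kernel smaller than the stride? I opened an issue on tflearn about this: issue-331.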
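To see why the shortcut uses a kernel of 1 with a stride of 2, it helps to spell out what the two shortcut steps in `residual_block` compute. The sketch below mimics them on plain Python lists. It is an illustration under simplifying assumptions (a single-channel map for the pooling step, a single pixel for the padding step, no padding offsets), and `subsample` and `pad_channels` are made-up helper names, not tflearn functions.

```python
def subsample(feature_map, stride=2):
    """Pooling with a 1x1 window and the given stride: keep every stride-th pixel."""
    return [row[::stride] for row in feature_map[::stride]]


def pad_channels(pixel, out_channels):
    """Zero-pad a per-pixel channel vector symmetrically, like the tf.pad call above."""
    ch = (out_channels - len(pixel)) // 2
    return [0.0] * ch + list(pixel) + [0.0] * ch


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # a 4x4 single-channel map
print(subsample(grid))              # 2x2 map built from rows 0, 2 and columns 0, 2
print(pad_channels([1.0, 2.0], 4))  # 2 channels zero-padded to 4
```

Averaging over a 1*1 window changes nothing, so the pooling step only drops pixels to halve the spatial size, and the zero-padding brings the shortcut up to the new channel count without adding parameters.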

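In a newer paper, Kaiming He proposed a new residual unit that adopts the pre-activation idea, which is the layout the residual_block code above uses according to its docstring. Compared with the original residual unit, it is easier to train and generalizes better.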
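The difference between the two layouts is where the activation sits relative to the addition. The toy sketch below works on scalars, leaves out batch normalization and convolutions entirely, and uses a residual branch fixed at zero, so it only illustrates the ordering. All names in it are made up for illustration.

```python
def relu(x):
    return max(0.0, x)


def original_unit(x, residual_fn):
    """Original layout: the ReLU sits after the addition, on the shortcut path."""
    return relu(x + residual_fn(x))


def pre_activation_unit(x, residual_fn):
    """Pre-activation layout: the activation is inside the branch; the shortcut is untouched."""
    return x + residual_fn(relu(x))


def zero_branch(_):
    return 0.0  # stand-in for a residual branch that outputs F = 0


def stack(unit, x, depth=3):
    """Apply the same kind of unit `depth` times in a row."""
    for _ in range(depth):
        x = unit(x, zero_branch)
    return x


print(stack(original_unit, -2.0))        # negative signal is clipped by the post-addition ReLU
print(stack(pre_activation_unit, -2.0))  # signal passes through the shortcuts unchanged
```

In the pre-activation layout nothing modifies the shortcut path, so the input reaches the output of the stack intact whenever the branches contribute zero.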

Summary

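I recently spent most of my time reading about CV models and studying the principles behind them, from AlexNet to deep residual networks, and I learned a lot from the experts' papers. Next, I will look on GitHub for some especially interesting related projects to play with, possibly including GAN and the like, as well as the various upgraded versions of neural style that Zhou Chang talked about at the DL meetup, and maybe some reinforcement learning frameworks and other fun things.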