VGG Study Notes

05-22

From the column 腦圖像深度搬磚之路

Original paper: [1409.1556] Very Deep Convolutional Networks for Large-Scale Image Recognition

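VGG is a convolutional neural network model proposed by the Visual Geometry Group at the University of Oxford (the network is named after the research group). The group proposed several deep ConvNet models and configurations, one of which was submitted to the 2014 ILSVRC (ImageNet Large Scale Visual Recognition Challenge). That model has 16 weight layers, so it is also known as VGG-16; it achieved 92.7% top-5 accuracy in the competition.
The paper's main contribution is to show that increasing network depth improves final performance up to a point, and it does so simply by adding layers step by step. The approach looks a little brute-force and has few clever tricks, but it works. Many pretrained-model approaches use the VGG models (mainly the 16- and 19-layer ones). Compared with other networks, VGG has a very large parameter space: the final model is over 500 MB, whereas AlexNet is only about 200 MB and GoogLeNet is smaller still, so training a VGG model usually takes longer. Even so, the released pretrained models make it easy to reuse the basic VGG structure in other domains.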
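As a rough check on the model size quoted above, the sketch below counts the parameters of VGG-16 from its layer layout (configuration D in Table 1 of the paper). The helper name `count_parameters`, the decision to count biases, and the 4-bytes-per-parameter float32 storage are my assumptions, not something the paper prescribes.

```python
# Configuration D (VGG-16) as listed in Table 1 of the paper:
# numbers are output channels of 3x3 conv layers, "M" is 2x2 max-pooling.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]  # three fully connected layers


def count_parameters(conv_cfg, fc_cfg, in_channels=3, input_size=224):
    """Count weights and biases of a VGG-style net built from 3x3 convs."""
    total = 0
    channels, size = in_channels, input_size
    for item in conv_cfg:
        if item == "M":
            size //= 2  # 2x2 max-pooling with stride 2 halves each side
        else:
            total += 3 * 3 * channels * item + item  # weights + biases
            channels = item
    features = channels * size * size  # flattened input to the first FC layer
    for width in fc_cfg:
        total += features * width + width
        features = width
    return total


n_params = count_parameters(VGG16_CONV, VGG16_FC)
size_mb = n_params * 4 / 1e6  # assumption: 4 bytes per float32 parameter
print(f"{n_params:,} parameters, about {size_mb:.0f} MB as float32")
```

Under these assumptions the count comes to roughly 138 million parameters, which is consistent with a model file of a bit over 500 MB. The first fully connected layer alone accounts for about 103 million of them.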

VGG generalises well, and variants of the VGG network have been applied successfully in many different research areas.

Paper overview

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

Section 1, second paragraph:

For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.

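Key points

Standing on AlexNet's shoulders: smaller filters and a smaller stride (in the first conv layer); training and testing on whole images and at multiple scales. The model design focuses mainly on depth:
- all other hyperparameters are fixed
- layers are added steadily
- the trick that makes this feasible: 3×3 filters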

Section 2

Section 2.1, first paragraph:

The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center).

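The network uses 3×3 kernels throughout, because 3×3 is regarded as the smallest size that can capture left/right, up/down, and center. (Why small kernels rather than large ones is covered later.)

This paragraph contains further details; see the paper.

Second paragraph skipped; third paragraph: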

as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time.

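The LRN layer from AlexNet is not used:
- it does not improve results
- it increases memory consumption and computation time

This layer is rarely used any more, so there is no need to study the LRN part closely when reading AlexNet either.

Section 2.2 summarises the configurations evaluated in the paper. Its second paragraph notes that although VGG is deeper, the small kernels keep the number of weights no larger than that of a shallower net with wider conv layers and larger receptive fields.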

Section 2.3

Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1).

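AlexNet and ZF Net use 11×11 (stride 4) and 7×7 (stride 2) conv layers respectively in the first layer, whereas VGG uses small 3×3 kernels from the very first layer, always with stride 1.

The large first-layer kernels in AlexNet and ZF Net probably have historical reasons; after all, small kernels were something researchers arrived at gradually. GPUs back then were far less powerful than today, so to save compute the intuitive choice was a large kernel with a large stride in the first few layers, shrinking the image as quickly as possible.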
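To see how much a large kernel with a large stride shrinks the image, here is a minimal sketch using the standard output-size formula floor((n + 2p - k) / s) + 1. The 224×224 input and the zero padding for the two large-kernel cases are illustrative assumptions (the real networks differ in input size and padding); the 1-pixel padding for the 3×3 case follows the VGG paper.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1


# Hypothetical 224x224 input; padding for the large kernels is an assumption.
first_layers = {
    "11x11, stride 4": dict(kernel=11, stride=4, padding=0),
    "7x7, stride 2": dict(kernel=7, stride=2, padding=0),
    "3x3, stride 1, padding 1": dict(kernel=3, stride=1, padding=1),
}
sizes = {name: conv_output_size(224, **cfg) for name, cfg in first_layers.items()}
for name, size in sizes.items():
    print(f"{name}: 224 -> {size}")
```

With these settings the 11×11 layer cuts each side to about a quarter and the 7×7 layer to about a half, while the 3×3 layer keeps the full resolution, so VGG pays for its small kernels with much larger feature maps in the early layers.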

It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7 × 7 effective receptive field.
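The receptive-field claim can be checked with a small sketch. The function `receptive_field` is a hypothetical helper of mine that applies the usual recurrence: each layer widens the field by (kernel - 1) times the product of the strides before it.

```python
def receptive_field(layers):
    """Effective receptive field of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


for n in (1, 2, 3):
    side = receptive_field([(3, 1)] * n)
    print(f"{n} stacked 3x3 conv layer(s): {side}x{side}")
```

With stride 1 throughout, every extra 3×3 layer adds 2 pixels per side, which gives 3, 5, and 7 for one, two, and three layers.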

Why VGG boldly uses small kernels

So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer?

First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.
Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by $3\left(3^{2}C^{2}\right)=27C^{2}$ weights; at the same time, a single 7 × 7 conv. layer would require $7^{2}C^{2}=49C^{2}$ parameters, i.e. 81% more.
This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
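The arithmetic in the second point is easy to reproduce. This is a small sketch with a hypothetical channel count C = 256; `conv_weights` is my own helper and ignores biases, as the paper's count does.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights (no biases) of `layers` conv layers with C input and C output channels."""
    return layers * kernel * kernel * channels * channels


C = 256  # hypothetical channel count; the ratio does not depend on it
stack = conv_weights(3, C, layers=3)  # three 3x3 layers: 27 * C^2
single = conv_weights(7, C)           # one 7x7 layer:    49 * C^2
extra = (single - stack) / stack      # 22/27, about 0.81
print(f"3x(3x3): {stack:,}  7x7: {single:,}  7x7 needs {extra:.0%} more")
```

The ratio is 49/27 whatever C is, so the single 7×7 layer always needs about 81% more weights than the three-layer stack covering the same receptive field.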

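The second paragraph discusses 1×1 convolutions; for those, the Inception paper is the better read.

The third paragraph compares against other models.

Section 3 covers many training tricks, but much of what was state of the art back then has since become common knowledge and made its way into textbooks. Experts' write-ups today summarise it well, so reading a tutorial is better than reading the paper's version.

Section 4

Section 4.1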

First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).

With network A, local response normalisation (LRN) turned out not to help.

Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E.

Across networks A–E, the deeper the network, the lower the classification error (because of the additional non-linear layers).

Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers), performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network. This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C).

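Comparing B and C: adding 1×1 conv layers improves results (C has more non-linear layers than B).

Comparing C and D: D, with 3×3 conv layers, does better than C, with 1×1 conv layers (it captures more spatial context).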
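Why a 1×1 conv adds non-linearity but no spatial context can be illustrated with a toy sketch in plain Python. A 1×1 conv followed by ReLU is just the same channel-mixing function applied to every pixel independently, so shuffling the pixel positions merely shuffles the output in the same way. The `conv1x1` helper, the random weights, and the tiny 3×3 map with 4 channels are all made up for illustration.

```python
import random


def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: list of pixels, each a list of C input-channel values
    weights: one row of C weights per output channel
    """
    return [
        [max(0.0, sum(w * v for w, v in zip(row, pixel))) for row in weights]
        for pixel in feature_map
    ]


random.seed(0)
# A flattened 3x3 feature map with 4 channels, and a 4 -> 4 channel 1x1 conv.
pixels = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(9)]
weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(4)]

order = list(range(len(pixels)))
random.shuffle(order)
shuffled = [pixels[i] for i in order]

out = conv1x1(pixels, weights)
out_shuffled = conv1x1(shuffled, weights)
# Each output pixel depends only on the input pixel at the same position.
print(out_shuffled == [out[i] for i in order])
```

A 3×3 conv would not pass this check, because each of its outputs also depends on the neighbouring pixels; that dependence on neighbours is the spatial context D has and C's extra layers lack.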

The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets.

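At 19 layers the error rate saturates and stops dropping, but the authors believed that deeper models combined with larger datasets could bring it down further. And then ResNet arrived the following year.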

We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3×3 conv. layers with a single 5×5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.

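They derived a variant of network B, call it B*, and found that a deep model with small conv filters beats a shallow one with larger filters.

Here is the network architecture figure; the parts that differ between the networks are shown in bold.

Other sections are skipped; see the paper for details.

End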
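For reference, the conv stacks of configurations A–E can also be written down as data. The sketch below follows Table 1 of the paper as I read it (A-LRN is omitted, since it is A plus one LRN layer); the `block` and `weight_layers` helpers are my own naming. It counts weight layers, i.e. conv layers plus the three fully connected layers.

```python
def block(channels, n3x3, n1x1=0):
    """One stage: some 3x3 convs, optional 1x1 convs, then 2x2 max-pooling ("M")."""
    return [(3, channels)] * n3x3 + [(1, channels)] * n1x1 + ["M"]


# Entries are (kernel_size, output_channels); layout per Table 1 of the paper.
CONFIGS = {
    "A": block(64, 1) + block(128, 1) + block(256, 2) + block(512, 2) + block(512, 2),
    "B": block(64, 2) + block(128, 2) + block(256, 2) + block(512, 2) + block(512, 2),
    "C": block(64, 2) + block(128, 2) + block(256, 2, 1) + block(512, 2, 1) + block(512, 2, 1),
    "D": block(64, 2) + block(128, 2) + block(256, 3) + block(512, 3) + block(512, 3),
    "E": block(64, 2) + block(128, 2) + block(256, 4) + block(512, 4) + block(512, 4),
}
NUM_FC_LAYERS = 3  # FC-4096, FC-4096, FC-1000 in every configuration


def weight_layers(name):
    """Number of weight layers: conv layers plus fully connected layers."""
    convs = sum(1 for layer in CONFIGS[name] if layer != "M")
    return convs + NUM_FC_LAYERS


depths = {name: weight_layers(name) for name in CONFIGS}
print(depths)
```

C and D have the same depth; they differ only in whether the third conv layer of the last three stages is 1×1 (C) or 3×3 (D), which is exactly the comparison made in Section 4.1.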