How should we understand the claim that convolutional neural networks guarantee "shift, scale, and deformation invariance"?

01-06

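In short, the core idea of convolutional networks is to combine three architectural ideas: local receptive fields, weight sharing (or weight replication), and temporal or spatial subsampling. Together they provide some degree of shift, scale, and deformation invariance.
How should this sentence be understood?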

This is only my rough understanding; if I'm wrong, feel free to call me out.

As I understand it, the shift, scale, and deformation invariance attributed to CNNs comes from the properties of convolution.

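A CNN's convolutional layer simply applies a convolution kernel to regions of a fixed size. So if the content of such a region is shifted, or changed in some other way, does the convolution result differ? My feeling is that it doesn't.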

Well, that's my humble take.

Thanks for the invite.

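When an image is translated, its representation on the feature map is translated in the same way (strictly speaking, this is translation equivariance), and this is what gives the network a certain degree of translation invariance.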
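To make the "feature map moves with the image" claim concrete, here is a minimal pure-Python sketch. The 7x7 image, the 3x3 kernel, and the helper names `correlate2d_valid` and `shift_right` are all made up for illustration. The sliding-window operation is the un-flipped one that CNN libraries usually call convolution, and the pattern is kept away from the border so that edge effects do not interfere.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def shift_right(grid):
    """Shift every row one column to the right, filling with 0 on the left."""
    return [[0] + row[:-1] for row in grid]


# Hypothetical 7x7 image: a small 2x2 pattern surrounded by zeros.
image = [[0] * 7 for _ in range(7)]
image[2][2], image[2][3] = 1, 2
image[3][2], image[3][3] = 3, 4

# Hypothetical 3x3 kernel (a Sobel-like horizontal-gradient filter).
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]

features = correlate2d_valid(image, kernel)
features_of_shifted = correlate2d_valid(shift_right(image), kernel)

# Shifting the input shifts the feature map by the same amount.
print(features_of_shifted == shift_right(features))
# The feature map is not left unchanged by the shift.
print(features_of_shifted == features)
```

The first check should print `True` and the second `False`: convolution on its own moves the response along with the input, and it takes a later step such as pooling to turn that into (partial) invariance.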

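Likewise, pooling (take max pooling as an example) takes the maximum over each local receptive field. If the image changes in scale, there is some probability that the maximum in the corresponding receptive field stays the same after the change, so the feature map stays the same. The same mechanism also adds some translation invariance.

As for shape invariance: in image recognition, what matters is not the absolute position of salient features but their relative positions. To avoid encoding too much positional information, both convolution and pooling blur local texture, which gives the network a certain degree of shape invariance.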
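The max-pooling part of this argument can be checked with a toy example. This is only a sketch: the 4x4 feature map with a single strong response and the helpers `max_pool` and `single_activation` are hypothetical. It shows that moving the peak inside one 2x2 window leaves the pooled output untouched, while moving it into the neighbouring window does not.

```python
def max_pool(feature_map, size):
    """Non-overlapping size x size max pooling (stride equals the window size)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]


def single_activation(row, col, n=4):
    """Hypothetical n x n feature map with one strong response at (row, col)."""
    grid = [[0] * n for _ in range(n)]
    grid[row][col] = 9
    return grid


original = max_pool(single_activation(0, 0), 2)
moved_inside_window = max_pool(single_activation(1, 1), 2)   # same 2x2 window
moved_across_windows = max_pool(single_activation(1, 2), 2)  # neighbouring window

print(original)              # [[9, 0], [0, 0]]
print(moved_inside_window)   # [[9, 0], [0, 0]]
print(moved_across_windows)  # [[0, 9], [0, 0]]
```

So the invariance is only "to a certain degree": it holds as long as the peak does not cross a window boundary.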

Just my two cents. If anything is wrong, I'd appreciate being corrected. Thanks.

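Disclaimer:

(1) I am still learning too. If anything here is wrong, please point it out and I will fix it.

(2) Some of this content is taken from blogs by experts in the field; look them up if you are interested.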

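The three key ideas of CNNs are local connections, shared weights, and pooling. They give the network structure a certain degree of shift invariance and deformation invariance, and they also reduce the number of parameters to train. To see where shift, scale, and deformation invariance come from, we have to start from these basic ideas, so let's look at the invariances of CNNs through these three properties.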

1. Local connections

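Convolutional neural networks have two powerful tools for reducing the number of parameters: local receptive fields and weight sharing. First, local receptive fields. It is generally believed that humans perceive the world from the local to the global, and the spatial relationships in an image are likewise local: nearby pixels are strongly correlated, while distant pixels are only weakly correlated. So each neuron does not need to perceive the whole image. It only needs to perceive a local region, and higher layers then combine the local information into global information. In other words, a local receptive field means that a neuron in a convolutional layer is connected only to a local region of the previous layer's map.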
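A quick back-of-the-envelope count shows how much the two tools named above save. All sizes here are hypothetical, picked only to make the arithmetic easy, and biases are ignored.

```python
# All sizes below are hypothetical, chosen only to make the arithmetic easy.
image_pixels = 1000 * 1000   # one 1000 x 1000 grayscale input
hidden_units = 1_000_000     # neurons in the next layer
field_pixels = 10 * 10       # each neuron sees a 10 x 10 patch

# Every neuron is connected to every pixel.
fully_connected = image_pixels * hidden_units
# Every neuron sees only its own patch, with its own private weights.
locally_connected = field_pixels * hidden_units
# Every neuron sees only its own patch, and all neurons reuse one kernel.
shared_kernel = field_pixels

print(f"fully connected:            {fully_connected:,}")
print(f"local receptive fields:     {locally_connected:,}")
print(f"local fields + shared kernel: {shared_kernel:,}")
```

Under these assumptions the weight count drops from 10^12 to 10^8 with local receptive fields alone, and to 100 per kernel once the weights are shared (a real layer would use many kernels, so multiply by the number of feature maps).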

2. Shared weights

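Weight sharing (that is, the convolution operation) reduces the number of weights and lowers the network's complexity. It can be viewed as a way of extracting features. The underlying assumption is that the statistics of one part of an image are the same as those of any other part. This means that features learned in one part can also be used in another, so the same learned features can be applied at every position in the image. An animation illustrates this (it is stored on a network drive and can be downloaded from http://pan.baidu.com/s/1b8NFo6).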

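On the left is the input to the convolution; the yellow shaded area (with the red numbers inside it) is the convolution kernel; on the right is the output feature map after convolution. For a single kernel, the 4s on the main diagonal and in the upper-right corner of the feature map show that the corresponding regions have the same statistical properties.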
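Since the animation itself is not reproduced in the post, here is a small sketch of the same kind of computation. The 5x5 input and 3x3 kernel are an assumption: they are the values used in a widely circulated convolution animation, which may or may not be the one linked above, but they do produce 4s on the main diagonal and in the upper-right corner. The helper name `correlate2d_valid` is made up for this example.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumed values; the original animation is not available in the post.
image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]

kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

feature_map = correlate2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

The same nine weights are applied at all nine positions, so wherever the input patch matches the kernel's pattern equally well, the response is the same (here, 4), regardless of where in the image that patch sits.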

3. Pooling

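After obtaining features through convolution, the next step is to use them for classification. In principle one could train a classifier, such as a softmax classifier, on all of the extracted features, but this is computationally expensive and prone to over-fitting.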

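The reason we use convolved features in the first place is that images have a "stationarity" property: a feature that is useful in one region of an image is very likely to be useful in another region as well. So, to describe a large image, we can aggregate statistics of the features at different positions, for example by computing their mean or their maximum. These are mean pooling and max pooling. Max pooling partitions the input into non-overlapping rectangular sub-regions and outputs the maximum of each sub-region.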
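As a minimal illustration of the two aggregations just described, the sketch below pools a made-up 4x4 feature map over non-overlapping 2x2 regions. The function names `pool` and `mean` and the numbers are hypothetical, not taken from any library.

```python
def pool(feature_map, size, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping size x size sub-region."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            reduce_fn([feature_map[i + a][j + b]
                       for a in range(size) for b in range(size)])
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical 4x4 feature map.
feature_map = [[1, 3, 2, 0],
               [5, 7, 4, 2],
               [0, 2, 8, 6],
               [4, 2, 6, 4]]

print(pool(feature_map, 2, max))   # [[7, 4], [4, 8]]
print(pool(feature_map, 2, mean))  # [[4.0, 2.0], [2.0, 6.0]]
```

Either way, 16 values are summarized by 4, which is where the reduction in computation and the tolerance to small shifts both come from.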

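Each feature map in a subsampling layer corresponds to exactly one feature map in the previous layer, and it aggregates all the equally sized, non-overlapping sub-regions of that map. This gives the convolutional neural network some spatial invariance, and hence a certain degree of shift and distortion invariance. Because images are locally correlated, subsampling reduces the amount of data to process while retaining the useful information.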

Max pooling is useful in vision problems for two reasons:

(1) By eliminating non-maximal values, it reduces the computation for upper layers.

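(2) It provides a form of translation invariance. To understand this invariance, imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which the input can be translated by a single pixel (up, down, left, right, upper left, lower left, upper right, lower right). If max pooling is done over a 2*2 window, 3 of these 8 possible configurations produce exactly the same output as at the convolutional layer. If the window becomes 3*3, the probability of getting exactly the same result becomes 5/8.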

So it is quite robust to changes in position, and max pooling is a flexible way of reducing the dimensionality of intermediate representations.

(Max-pooling is useful in vision for two reasons:

By eliminating non-maximal values, it reduces computation for upper layers.
It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
Since it provides additional robustness to position, max-pooling is a "smart" way of reducing the dimensionality of intermediate representations.)

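I read 夢裡水鄉's answer and learned a lot, but I didn't quite follow the two reasons for max-pooling at the end. So I searched on Baidu and found an explanation at https://www.quora.com/How-can-I-understand-this-point-about-max-pooling-in-Theano . It also turns out that the 5/8 figure for the 3*3 window is wrong. The correct answer is pasted here: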

Samir's answer is great, and confirms my suspicions that the authors are incorrect. Here's my logic as to a more complete answer for the 3x3 case.

Let's say we have a 3x3 grid with the maximum pixel in the middle. For simplicity, set e=1 and all other values to 0.

a b c

d e f

g h i

The image can be translated by 1 pixel in 8 different directions: up, down, left, right, and the diagonals. If we assume the 2x2 max pooling box contains the pixels {a, b, d, e}, then 3 of the 8 possible translations will keep the max pixel in the 2x2 box, hence the 3/8. In fact, no matter where this 2x2 pooling box is, the max value will always be in a corner of the box, so 3/8 of the translations will keep it in the box.

However, what happens when we have a 3x3 max pooling box? Well, there's different places this max value can be found relative to the 3x3 box. If it's found in the center, what Samir said applies, and any of the 8 translations will keep the max value in the same pooling box.

However, if it's found in one of the corners of the 3x3 box (where a, c, g, and i are), then 3/8 of the translations will keep it in the same box. If it's one of the edges (b, d, f, h), 5/8 of the translations will.

So if we assume that the positioning of the 3x3 box is totally random with respect to the max pixel, which it should, then the probability of the max pixel staying in the same box after translation should be P(same box | center_max) * P(center_max) + P(same box | center_corner) * P(center_corner) + P(same box | center_edge) * P(center_edge) = 1*(1/9) + (3/8)*(4/9) + (5/8)*(4/9) = 5/9.
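The counting argument in the quoted answer is easy to verify by brute force. The sketch below is my own enumeration, not code from the Quora answer; the name `stay_probability` is made up. It makes the same assumptions as the quote: the max pixel is equally likely to start at any cell of its pooling box, and each of the 8 one-pixel translations is equally likely.

```python
from fractions import Fraction

# The 8 single-pixel translations: up, down, left, right and the 4 diagonals.
SHIFTS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
          if (dr, dc) != (0, 0)]


def stay_probability(size):
    """Probability that the max pixel stays in its size x size pooling box.

    Assumes the pixel starts at a uniformly random cell of the box and is
    moved by a uniformly random one-pixel translation.
    """
    kept = 0
    for row in range(size):
        for col in range(size):
            for dr, dc in SHIFTS:
                if 0 <= row + dr < size and 0 <= col + dc < size:
                    kept += 1
    return Fraction(kept, size * size * len(SHIFTS))


print(stay_probability(2))  # 3/8
print(stay_probability(3))  # 5/9
```

The enumeration agrees with the quoted derivation: 3/8 for a 2x2 window and 5/9 (not 5/8) for a 3x3 window.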

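So it seems that this translation invariance refers to invariance within the 3*3 or 2*2 pooling window (local translation invariance). I had always assumed it meant that, for an image of a face for example, the face as a whole is translation invariant.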

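Then how do we get translation invariance for the whole face? Is the only option to include images with the face at various positions in the training set?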