Taking derivatives in neural networks: how do you handle derivatives involving vectors in a computational graph? (using the cs231n assignments as an example)

A question from implementing the backward part of Batch Normalization in cs231n Assignment 2:

The last step of the forward pass:

out = x * gamma + beta

The corresponding backward pass:

dbeta = np.sum(dout, axis = 0)

(1) Why is it done this way? Why is there a sum here? Did I overlook some structure when drawing the computational graph?

(2) In the Batch Normalization computational graph everything is an element-wise operation between matrices, and when the dimensions differ the smaller array is automatically expanded (broadcast) to match.

This is different from the matrix multiplication in Assignment 1, and the differentiation works differently too, as in question (1) above. Is there a general rule for this? Could you also recommend some good reading material?
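
To make the setup concrete, here is a minimal sketch of the shapes involved in that last step (the sizes below are made up purely for illustration):

import numpy as np

N, D = 4, 3                      # made-up batch size and feature dimension
x = np.random.randn(N, D)        # normalized input, shape (N, D)
gamma = np.random.randn(D)       # scale parameter, shape (D,)
beta = np.random.randn(D)        # shift parameter, shape (D,)

out = x * gamma + beta           # gamma and beta are broadcast over the N rows
dout = np.random.randn(N, D)     # upstream gradient, same shape as out

dbeta = np.sum(dout, axis=0)     # shape (D,): one entry per column of dout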


Thanks for the invite.

  • Recommended reading material

The reason this answer took so long is that I wanted to give the asker properly prepared reading material. I just published the translated CS231n backpropagation notes in the column 智能單元 - 知乎專欄:

CS231n course notes translation: backpropagation notes - 智能單元 - 知乎專欄

We knew how important this note is; after the first draft everyone proofread it repeatedly, and it has finally been published, so now I can answer. If the asker reads carefully, the answer to the confusion should be in that translation. Allow me to keep a little suspense here: the keyword is data shapes.

  • Answering the question

When I write code I like to comment heavily; friends occasionally tease me about it, but I stick to very careful comments, and now they come in handy. Below are my implementations of the batchnorm_forward and batchnorm_backward functions; I think the comments are detailed enough.

First, the forward-pass code in training mode:

# forward pass
# note: use staged computation, store intermediates for the backward pass

# 1: compute mini-batch mean: mu
mu = 1 / float(N) * np.sum(x, axis=0) # mu: (D,)

# compute mini-batch variance: var
# 2: x - mu
x_minus_mu = x - mu # x_minus_mu: (N, D)

# 3: square
x_minus_mu_square = x_minus_mu**2 # x_minus_mu_square: (N, D)

# 4: var
var = 1 / float(N) * np.sum(x_minus_mu_square, axis=0) # var: (D,)

# use mu and var to do the normalization
# 5: sqrt the var
sqrt_var = np.sqrt(var + eps) # sqrt_var: (D,)

# 6: invert sqrt_var
sqrt_var_invert = 1.0 / sqrt_var # sqrt_var_invert: (D,)

# 7: get normalized x
x_norm = x_minus_mu * sqrt_var_invert # x_norm: (N, D)

# compute the y: using gamma and beta
# 8: mul gamma to scale
y_unshift = gamma * x_norm # y_unshift: (N, D)

# 9: add beta to shift
y = y_unshift + beta # y: (N, D)
out = y

# compute the running mean and variance
running_mean = momentum * running_mean + (1 - momentum) * mu
running_var = momentum * running_var + (1 - momentum) * var

# store the intermediates
cache = (mu, x_minus_mu, x_minus_mu_square, var, sqrt_var,
sqrt_var_invert, x_norm, y_unshift, gamma, beta, x, bn_param)

Next, the forward-pass code in test mode:

# get the running mean and variance from the bn_param dict
running_mean = bn_param['running_mean']
running_var = bn_param['running_var']

# compute the normalized x
x_norm_test = (x - running_mean) / np.sqrt(running_var + eps)

# scale and shift
y = gamma * x_norm_test + beta
out = y

# store the results
cache = (x, x_norm_test, gamma, beta, eps)

Finally, the backward-pass code:

# get the intermediates
mu, x_minus_mu, x_minus_mu_square, var, sqrt_var, sqrt_var_invert, x_norm, y_unshift, gamma, beta, x, bn_param = cache
eps = bn_param.get('eps', 1e-5) # if no eps is given, default to 1e-5
N, D = dout.shape # upper grad flow

# backward pass
# Back to 9: x + y = z -> dz/dx (written as dx) = 1, dz/dy (written as dy) = 1
# with chain rule -> dx = 1 * upper grad, dy = 1 * upper grad
# here dout is the upper grad
dy_unshift = 1 * dout
dbeta = 1 * np.sum(dout, axis=0) # dbeta shape: (D,)
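# why the sum: beta has shape (D,) but was broadcast over all N rows of
# y_unshift in the forward pass, so every row of dout contributes to dbeta,
# and those contributions are accumulated by np.sum over axis 0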

# Back to 8: x * y = z -> dx = y, dy = x
# with chain rule -> dx = y * upper grad, dy = x * upper grad
# here dy_unshift is the upper grad
dx_norm = gamma * dy_unshift
dgamma = np.sum(x_norm * dy_unshift, axis=0)

# Back to 7: x * y = z -> dx = y, dy = x
# with chain rule -> dx = y * upper grad, dy = x * upper grad
# here dx_norm is the upper grad
dx_minus_mu = sqrt_var_invert * dx_norm # !! note: do not forget here is a dx_minus_mu
dsqrt_var_invert = np.sum(x_minus_mu * dx_norm, axis=0)

# Back to 6: y = 1 / x -> dx (dy/dx) = - 1 / x**2
# with chain rule -> dx = - 1 / x**2 * upper grad
# here dsqrt_var_invert is the upper grad
dsqrt_var = - 1.0 / sqrt_var**2 * dsqrt_var_invert

# Back to 5: y = np.sqrt(x) -> dy/dx = 1/2 * x**(-1/2)
# with chain rule -> dx = 1/2 * x**(-1/2) * upper grad
# here dsqrt_var is the upper grad
dvar = 0.5 * (var + eps)**(-0.5) * dsqrt_var

# Back to 4: y = 1/N * x -> dy/dx = 1/N
# with chain rule -> dx = 1/N * upper grad
# here dvar is the upper grad
dx_minus_mu_square = 1 / float(N) * np.ones((x_minus_mu_square.shape)) * dvar

# Back to 3: y = x**2 -> dy/dx = 2 * x
# with chain rule -> dx = 2 * x * upper grad
# here dx_minus_mu_square is the upper grad
dx_minus_mu += 2.0 * x_minus_mu * dx_minus_mu_square # !! note to use +=

# Back to 2: z = x - y -> dz/dx = 1, dz/dy = -1
# with chain rule -> dx = 1 * upper grad, dy = -1 * upper grad
# here dx_minus_mu is the upper grad
dx = 1.0 * dx_minus_mu # !! note here is a dx
dmu = - np.sum(dx_minus_mu, axis=0)
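# same broadcasting story as dbeta: mu has shape (D,) and was broadcast over
# the N rows in step 2, so its gradient is a sum over axis 0; the minus sign
# comes from d(x - mu)/dmu = -1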

# Back to 1: y = 1/N * x -> dy/dx = 1/N
# with chain rule -> dx = 1/N * upper grad
# here dmu is the upper grad
dx += 1 / float(N) * np.ones((dx_minus_mu.shape)) * dmu # !!use +=

# done!

I have commented every step of the backward pass in detail; it should help the asker.
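
If you want to convince yourself numerically that the sums are right, a quick gradient check is easy to write. Below is a minimal sketch, assuming the snippets above are wrapped into batchnorm_forward(x, gamma, beta, bn_param) and batchnorm_backward(dout, cache) with the assignment's usual return values (out, cache) and (dx, dgamma, dbeta):

import numpy as np

def numeric_grad(f, p, h=1e-5):
    # centered finite differences of the scalar function f with respect to
    # the array p, perturbing one entry at a time
    grad = np.zeros_like(p)
    it = np.nditer(p, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        i = it.multi_index
        old = p[i]
        p[i] = old + h; fp = f()
        p[i] = old - h; fm = f()
        p[i] = old
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

N, D = 4, 5
x = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)
bn_param = {'mode': 'train'}
dout = np.random.randn(N, D)   # pretend upstream gradient

out, cache = batchnorm_forward(x, gamma, beta, bn_param)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

# loss = sum(out * dout) has gradient dout with respect to out, so its
# gradients with respect to beta and gamma should match dbeta and dgamma
f = lambda: np.sum(batchnorm_forward(x, gamma, beta, bn_param)[0] * dout)
print(np.max(np.abs(dbeta - numeric_grad(f, beta))))    # should be tiny, ~1e-9
print(np.max(np.abs(dgamma - numeric_grad(f, gamma))))  # should be tiny, ~1e-9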

That concludes my answer.


Thanks for the invite. See the figure below; I will skip the earlier part of the computational graph and start from \hat{x}.

Adding gamma * \hat{x} and beta is really a broadcast operation (see the NumPy refresher): beta is added to every row of out, so during backpropagation every row of dout has to be added into dbeta. To be more precise, you can treat broadcasting itself as a node in the computational graph: first "copy" the 1*D beta m times to form a broadcast matrix, and then add that matrix to gamma * \hat{x} row by row:

You will then see that during backpropagation the broadcasting node has multiple outputs, so the gradients flowing back through it naturally have to be summed.
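
To make that concrete, here is a small sketch (my own illustration) that materializes the broadcast with np.tile and shows where the sum comes from:

import numpy as np

N, D = 5, 3
x_hat = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)

# broadcasting, as numpy does it implicitly
out_broadcast = gamma * x_hat + beta

# the same computation with the "copy" made explicit:
# tile the 1*D beta into an (N, D) matrix, then add element-wise
beta_tiled = np.tile(beta, (N, 1))
out_explicit = gamma * x_hat + beta_tiled
assert np.allclose(out_broadcast, out_explicit)

# in the explicit graph each of the N copies of beta receives one row of dout;
# collapsing the copies back into the single (D,) beta sums those rows
dout = np.random.randn(N, D)
dbeta = np.sum(dout, axis=0)   # shape (D,)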

As for your second question, broadcasting is actually not hard to handle in backpropagation; the key is to "assign responsibility". In the forward pass, work out which operations a variable a took part in and which variables b, c, d, ... it influenced through those operations; then, from those operations and the gradients of b, c, d, ..., you can recover the gradient of a.
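
In NumPy terms the bookkeeping rule is: whatever axes broadcasting added or stretched in the forward pass get summed away in the backward pass. A small sketch of a helper that does this (my own illustration, not standard library code):

import numpy as np

def unbroadcast(grad, shape):
    # reduce an upstream gradient back to `shape`, the shape the operand
    # had before numpy broadcast it in the forward pass
    # 1) sum away leading axes that broadcasting prepended
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # 2) sum over axes that were size 1 and got stretched
    for axis, size in enumerate(shape):
        if size == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

dout = np.random.randn(4, 3)
print(unbroadcast(dout, (3,)).shape)    # (3,)  same as np.sum(dout, axis=0)
print(unbroadcast(dout, (1, 3)).shape)  # (1, 3)
print(unbroadcast(dout, (4, 1)).shape)  # (4, 1)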

A few backpropagation blog posts I recommend:

1. Neural networks and deep learning

2. I also wrote a blog post about computational graphs: Using Computation Graph to Understand and Implement Backpropagation - SUNSHINEATNOON - Madness between pain and boredom.

I have not looked at cs231n for a long time, so please point out anything I got wrong. Thanks.


I had the same confusion when I did Assignment 2. The dimensions and the intuition do point to this, and working through simple special cases also shows that you need the sum, but I never found a more satisfying theoretical explanation to convince myself.

Here are two good blog posts I have read:

Understanding the backward pass through Batch Normalization Layer

What does the gradient flowing through batch normalization looks like ? (http://cthorey.github.io/backpropagation/)


I tried using the mean instead and it made essentially no difference... My own understanding is that the sum is used here because it is convenient to compute, not because of any deeper theoretical justification.

Because, logically speaking, taking the mean would seem to be the right choice. But the value of b is small, and its influence on the result is small as well, so whichever you do, the impact is minor...

I see some people saying that the correct result is the sum... that is probably in the context of gradient checking. If you look at the gradient-check code, it sums the gradient values, so to be consistent with it you naturally have to sum directly.
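
For reference, here is a tiny sketch of what such a check computes for beta in the final step out = gamma * x_hat + beta, using the common scalar loss = sum(out * dout) whose gradient with respect to out is dout (the names and sizes are illustrative):

import numpy as np

N, D = 4, 3
x_hat = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)
dout = np.random.randn(N, D)    # upstream gradient used by the check

def loss(b):
    # scalar loss whose gradient with respect to out is exactly dout
    return np.sum((gamma * x_hat + b) * dout)

# centered finite differences, one component of beta at a time
h = 1e-5
num_dbeta = np.array([
    (loss(beta + h * np.eye(D)[i]) - loss(beta - h * np.eye(D)[i])) / (2 * h)
    for i in range(D)
])

print(np.max(np.abs(num_dbeta - np.sum(dout, axis=0))))  # ~1e-10: the sum matches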

