Machine Learning Notes 14 — Programming for the Backpropagation Algorithm, and Programming Exercise 4: Neural Network Backpropagation

Recall the fminunc function introduced earlier. We write a cost function (passed in as the handle @costFunction) whose input is the parameter vector theta and whose return values are the cost jVal and the gradient gradient. We then hand this function handle, together with an initial value of theta, to the advanced optimization routine fminunc.

In a neural network, however, the parameters \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \ldots and the gradients D^{(1)}, D^{(2)}, D^{(3)}, \ldots are all matrices. To use them with fminunc, we have to unroll them into vectors.

Suppose, for example, that \Theta^{(1)} and \Theta^{(2)} are 10x11 matrices and \Theta^{(3)} is a 1x11 matrix (and likewise for D^{(1)}, D^{(2)}, D^{(3)}).

The unrolling and recovery then look like this in code:

thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)


% ====================== Example ======================
>> Theta1 = ones(3,4)
Theta1 =
   1   1   1   1
   1   1   1   1
   1   1   1   1
>> Theta2 = 2*ones(3,4)
Theta2 =
   2   2   2   2
   2   2   2   2
   2   2   2   2
>> Theta3 = 3*ones(1,4)
Theta3 =
   3   3   3   3
% Unroll into a single vector:
>> thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
% Size of thetaVec:
>> size(thetaVec)
ans =
   28    1
% Recover the three matrices:
reshape(thetaVec(1:12),3,4)
reshape(thetaVec(13:24),3,4)
reshape(thetaVec(25:28),1,4)

Exercise 1

Suppose D1 is a 10 x 6 matrix and D2 is a 1 x 11 matrix. You set:

DVec = [D1(:);D2(:)];

Which of the following would get D2 back from DVec?

A.reshape(DVec(60:71),1,11)

B.reshape(DVec(60:72),1,11)

C.reshape(DVec(61:71),1,11)

D.reshape(DVec(60:70),11,1)

Answer:C
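You can verify this directly in Octave:

D1 = ones(10, 6);                          % 60 elements
D2 = 2 * ones(1, 11);                      % 11 elements
DVec = [D1(:); D2(:)];                     % 71 x 1 vector
isequal(reshape(DVec(61:71), 1, 11), D2)   % ans = 1 (true): elements 61-71 are D2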

Putting the whole process together:

First we have the initial parameter matrices \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}. We unroll them into one long vector (call it initialTheta) and pass it to fminunc as the initial setting of the theta parameter. Inside the cost function costFunction(thetaVec), we take the incoming vector thetaVec (the vector containing all the \Theta parameters), use reshape to recover the original matrices, and then run forward propagation and backpropagation to compute the derivatives D^{(1)}, D^{(2)}, D^{(3)} and the cost J(\Theta). Finally we unroll the derivatives, in the same order, into gradientVec and return it from the function as a vector.
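A minimal sketch of this workflow, assuming the 10x11 / 10x11 / 1x11 dimensions used above and a costFunction that does the reshaping internally:

% Unroll the initial parameter matrices into one long vector.
initialTheta = [Theta1(:); Theta2(:); Theta3(:)];

% costFunction must return [jVal, gradientVec] for a given thetaVec.
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, cost] = fminunc(@costFunction, initialTheta, options);

% Inside costFunction (sketch):
%   Theta1 = reshape(thetaVec(1:110),   10, 11);
%   Theta2 = reshape(thetaVec(111:220), 10, 11);
%   Theta3 = reshape(thetaVec(221:231), 1,  11);
%   ... forward and backward propagation give J(Theta) and D1, D2, D3 ...
%   gradientVec = [D1(:); D2(:); D3(:)];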


When we run gradient descent with backpropagation, J(\Theta) may appear to decrease on every iteration even though the backpropagation code contains a subtle bug, because the algorithm is complex and small mistakes are easy to miss. A technique called gradient checking greatly reduces the chance of such an undetected bug.

To estimate the slope at a point, instead of using the analytic derivative directly we use \frac{d}{d\Theta}J(\Theta) \approx \frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}, where \epsilon is a small number. (This is essentially the definition of the derivative, written as a two-sided difference.)

Exercise 2

Let J(\theta)=\theta^3. Furthermore, let \theta=1 and \epsilon=0.01. You use the formula \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon} to approximate the derivative. What value do you get using this approximation? (When \theta=1, the true, exact derivative is \frac{d}{d\theta}J(\theta)=3.)

A.3.0000

B.3.0001

C.3.0301

D.6.0002

Answer:B

Just plug in the numbers.
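The same computation in Octave:

theta = 1;  epsilon = 0.01;
J = @(t) t.^3;
(J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)   % ans = 3.0001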

Because \Theta is a vector, the general form is: \frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1,\ldots,\Theta_j+\epsilon,\ldots,\Theta_n)-J(\Theta_1,\ldots,\Theta_j-\epsilon,\ldots,\Theta_n)}{2\epsilon}

In code, this is usually written as:

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;

We first compute the derivatives DVec with backpropagation (as described above), then use gradient checking to verify that gradApprox agrees with DVec. Once they match, we turn gradient checking off before training. Gradient checking exists only to tell us whether our implementation is correct, not to compute the derivatives during learning, because computing derivatives this way is far slower than backpropagation.
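A minimal sketch of the comparison, assuming DVec and gradApprox have already been computed as above:

% Relative difference between the numerical and backpropagation gradients
% (both reshaped to column vectors); a value around 1e-9 or smaller suggests
% the backpropagation code is correct.
diff = norm(gradApprox(:) - DVec(:)) / norm(gradApprox(:) + DVec(:))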

Exercise 3

What is the main reason that we use the backpropagation algorithm rather than the numerical gradient computation method during learning?

A.The numerical gradient computation method is much harder to implement.

B.The numerical gradient algorithm is very slow.

C.Backpropagation does not require setting the parameter EPSILON.

D.None of the above.

Answer:B

Analysis: numerical gradient computation (gradient checking) is far too slow for computing derivatives during learning, so we only use it to check whether the implementation is correct.


Random Initialization

When we use gradient descent, we need to choose initial values for the parameters.

For linear regression and logistic regression we were in the habit of initializing all parameters to zero. Does that also work for a neural network?

Suppose we have such a network and initialize all of its parameters to zero. Then the hidden activations satisfy a_1^{(2)} = a_2^{(2)}, the errors satisfy \delta_1^{(2)} = \delta_2^{(2)}, and the derivatives satisfy \frac{\partial}{\partial\Theta^{(1)}_{01}}J(\Theta) = \frac{\partial}{\partial\Theta^{(1)}_{02}}J(\Theta). After every update the corresponding parameters therefore remain equal to each other, so no matter how many iterations we run, the two hidden units keep computing exactly the same function of the input.
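A minimal sketch of the problem, using a hypothetical network with 2 inputs and 3 hidden units:

sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic function
Theta1  = zeros(3, 3);               % maps 2 inputs (plus bias) to 3 hidden units
x  = [1; 0.5; -0.3];                 % one example, with the bias term prepended
a2 = sigmoid(Theta1 * x);            % every hidden activation is sigmoid(0) = 0.5
disp(a2')                            % 0.50000  0.50000  0.50000 -- identical hidden units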

This is called the symmetry problem, and random initialization is exactly what breaks it. We initialize each weight \Theta_{ij}^{(l)} to a random value in the range [-\epsilon, \epsilon] (this \epsilon is unrelated to the one used in gradient checking).

In code:

% If Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Exercise 4

Consider this procedure for initializing the parameters of a neural network:

1.Pick a random number r = rand(1,1)*(2*INIT_EPSILON)-INIT_EPSILON;

2. Set \Theta_{ij}^{(l)} = r for all i, j, l

Does this work?

A. Yes, because the parameters are chosen randomly.

B.Yes,unless we are unlucky and get r=0(up to numerical precision).

C.Maybe,depending on the training set inputs x(i).

D.No,because this fails to break symmetry.

Answer:D

Analysis: the key point is that rand(1,1) produces a single fixed scalar rather than a random matrix, so every \Theta_{ij}^{(l)} is set to the same value. The result is the same as initializing everything to zero: the hidden units stay identical and symmetry is never broken.
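A small sketch of the difference (INIT_EPSILON is any small constant, e.g. 0.12):

INIT_EPSILON = 0.12;

% Broken: one scalar copied into every entry -- all weights are identical.
r = rand(1,1) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta1_bad = r * ones(10, 11);

% Correct: an independent random draw for every entry.
Theta1_good = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;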


Combining this with the previous chapter, let us now pull the pieces together and lay out the full procedure for training a neural network.

Preliminary step: decide on the overall network architecture.

That is, we need to decide how many input units the network has, how many hidden layers, how many units each hidden layer has, and how many output units. How do we choose these?

The number of input and output units is easy: it is fixed by the problem. For a multi-class classification problem we write each output label as a vector; with 3 classes, for example, y = \begin{bmatrix}1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\end{bmatrix}, or \begin{bmatrix}0\\0\\1\end{bmatrix}. For the hidden layers, the default rule of thumb is to use a single hidden layer, which is the most common architecture; if we do use more than one, a reasonable choice is to give every hidden layer the same number of units. As for the number of hidden units, more units generally give better results, at the cost of more computation. A minimal sketch of the one-hot output encoding is shown below; after it, we walk through how to train the network.
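This sketch assumes integer labels y in 1..K stored as a column vector, as in the programming exercise later in these notes:

% Convert integer labels (1..K) into one-hot row vectors.
num_labels = 3;
y = [2; 1; 3];                 % three hypothetical training labels
I = eye(num_labels);
Y = I(y, :);                   % row i of Y is the one-hot encoding of y(i)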

Step 1: randomly initialize the weights.

Step 2: run forward propagation to compute the hypothesis h_\Theta(x^{(i)}) for every x^{(i)}.

Step 3: compute the cost function J(\Theta).

Step 4: run backpropagation to compute the partial derivatives \frac{\partial}{\partial\Theta_{jk}^{(l)}}J(\Theta).

Concretely this is a for loop: first run one pass of forward propagation and backpropagation on (x^{(1)}, y^{(1)}), then do the same for (x^{(2)}, y^{(2)}), and so on up to (x^{(m)}, y^{(m)}), accumulating the error terms \delta^{(l)} of every layer along the way.

Step 5: use gradient checking to compare the partial derivatives computed by backpropagation with those computed numerically, and confirm that they are essentially equal; then disable gradient checking.

Step 6: finally, use gradient descent or a more advanced algorithm such as BFGS or conjugate gradient, together with the partial derivatives computed above, to minimize the cost function J(\Theta) and obtain the weights \Theta.
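A minimal sketch of Steps 1-6 put together, in the spirit of programming exercise 4. It assumes the data X, y are already loaded, that a helper randInitializeWeights(L_in, L_out) exists (as provided in the exercise) returning weights in [-epsilon, epsilon], and that nnCostFunction returns both the cost and the unrolled gradient; the layer sizes and MaxIter are placeholder values:

input_layer_size  = 400;   % example sizes; adjust to your data
hidden_layer_size = 25;
num_labels        = 10;

% Step 1: random initialization to break symmetry.
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:); initial_Theta2(:)];

% Steps 2-4 (forward propagation, cost, backpropagation) happen inside nnCostFunction.
lambda = 1;
costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                               num_labels, X, y, lambda);

% Step 5 would be a one-off call to a gradient-checking routine; disable it afterwards.

% Step 6: minimize J(Theta) with an advanced optimizer.
options = optimset('GradObj', 'on', 'MaxIter', 50);
[nn_params, cost] = fminunc(costFunc, initial_nn_params, options);

% Recover the trained weight matrices from the unrolled vector.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, (hidden_layer_size + 1));

(The exercise itself ships an optimizer called fmincg that can be dropped in place of fminunc here.)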

練習5

Suppose you are using gradient descent together with backpropagation to try to minimize J(\Theta) as a function of \Theta. Which of the following would be a useful step for verifying that the learning algorithm is running correctly?

A. Plot J(\Theta) as a function of \Theta, to make sure gradient descent is going downhill.

B. Plot J(\Theta) as a function of the number of iterations and make sure it is increasing (or at least non-decreasing) with every iteration.

C. Plot J(\Theta) as a function of the number of iterations and make sure it is decreasing (or at least non-increasing) with every iteration.

D. Plot J(\Theta) as a function of the number of iterations to make sure the parameter values are improving in classification accuracy.

Answer: C

Analysis: as before, J(\Theta) should decrease as the number of iterations increases.


Below are the review exercises for this chapter:

Exercise 1

You are training a three layer neural network and would like to use backpropagation to compute the gradient of the cost function. In the backpropagation algorithm, one of the steps is to update \Delta^{(2)}_{ij} := \Delta^{(2)}_{ij} + \delta^{(3)}_i \cdot (a^{(2)})_j for every i, j. Which of the following is a correct vectorization of this step?

A. \Delta^{(2)} := \Delta^{(2)} + \delta^{(3)} \cdot (a^{(3)})^T

B. \Delta^{(2)} := \Delta^{(2)} + \delta^{(3)} \cdot (a^{(2)})^T

C. \Delta^{(2)} := \Delta^{(2)} + (a^{(2)})^T \cdot \delta^{(2)}

D. \Delta^{(2)} := \Delta^{(2)} + (a^{(2)})^T \cdot \delta^{(3)}

Answer:B

Analysis: just substitute into the formula; the vectorized form of \delta^{(3)}_i \cdot (a^{(2)})_j is \delta^{(3)}(a^{(2)})^T, which is option B.
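A small sketch of the element-wise versus the vectorized update for a single training example (the dimensions here are hypothetical: 4 output units, 5 hidden units plus a bias):

delta3 = rand(4, 1);     % delta^(3): error terms of the output layer
a2     = rand(6, 1);     % a^(2): hidden-layer activations, bias term included
Delta2 = zeros(4, 6);    % accumulator for the layer-2 gradients

% Element-wise form: Delta2(i,j) += delta3(i) * a2(j) for every i, j.
% Vectorized form (option B):
Delta2 = Delta2 + delta3 * a2';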

Exercise 2

Suppose Theta1 is a 5x3 matrix, and Theta2 is a 4x6 matrix. You set thetaVec=[Theta1(:);Theta2(:)]. Which of the following correctly recovers Theta2?

A.reshape(thetaVec(16:39),4,6)

B.reshape(thetaVec(15:38),4,6)

C.reshape(thetaVec(16:24),4,6)

D.reshape(thetaVec(15:39),4,6)

E.reshape(thetaVec(16:39),6,4)

Answer:A

Analysis: see the earlier example of recovering matrices from an unrolled vector. Theta1 has 5x3 = 15 elements, so Theta2 occupies elements 16 through 39.

Exercise 3

Let J(\theta)=2\theta^3+2. Let \theta=1, and \epsilon=0.01. Use the formula \frac{J(\theta+\epsilon)-J(\theta-\epsilon)}{2\epsilon} to numerically compute an approximation to the derivative at \theta=1. What value do you get? (When \theta=1, the true/exact derivative is \frac{dJ(\theta)}{d\theta}=6.)

A.6

B.8

C.5.9998

D.6.0002

Answer:D

Analysis: just plug the numbers into the gradient-checking formula.
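The same computation in Octave:

theta = 1;  epsilon = 0.01;
J = @(t) 2 * t.^3 + 2;
(J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)   % ans = 6.0002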

Exercise 4

Which of the following statements are true? Check all that apply.

A.Gradient checking is useful if we are using gradient descent as our optimization algorithm. However, it serves little purpose if we are using one of the advanced optimization methods (such as in fminunc).

B.If our neural network overfits the training set, one reasonable step to take is to increase the regularization parameter λ .

C.Using gradient checking can help verify if one's implementation of backpropagation is bug-free.

D.Using a large value of λ cannot hurt the performance of your neural network; the only reason we do not set λ to be too large is to avoid numerical problems.

E.For computational efficiency, after we have performed gradient checking to verify that our backpropagation code is correct, we usually disable gradient checking before using backpropagation to train the network.

F.Computing the gradient of the cost function in a neural network has the same efficiency when we use backpropagation or when we numerically compute it using the method of gradient checking.

Answer: B, C, E

Analysis:

A. False: gradient checking is used to verify that the derivative computation is implemented correctly, not to compute derivatives; it is just as useful when training with an advanced optimizer such as fminunc as with plain gradient descent.

B. True: if the network overfits the training set, increasing the regularization parameter λ is a reasonable step.

C. True: gradient checking helps verify that the backpropagation implementation is bug-free.

D. False: a λ that is too large causes underfitting, so it certainly can hurt performance.

E. True: once gradient checking has confirmed that backpropagation is correct, we disable it before training, for efficiency.

F. False: computing the gradient numerically (gradient checking) is far less efficient than backpropagation, which is exactly why we only use it for verification.

Exercise 5

Which of the following statements are true? Check all that apply.

A.Suppose you have a three layer network with parameters  Θ^{(1)} (controlling the function mapping from the inputs to the hidden units) and  Θ^{(2)} (controlling the mapping from the hidden units to the outputs). If we set all the elements of  Θ^{(1)}to be 0, and all the elements of  Θ^{(2)} to be 1, then this suffices for symmetry breaking, since the neurons are no longer all computing the same function of the input.

B.If we are training a neural network using gradient descent, one reasonable "debugging" step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.

C.Suppose you are training a neural network using gradient descent. Depending on your random initialization, your algorithm may converge to different local optima (i.e., if you run the algorithm twice with different random initializations, gradient descent may converge to two different solutions).

D.If we initialize all the parameters of a neural network to ones instead of zeros, this will suffice for the purpose of "symmetry breaking" because the parameters are no longer symmetrically equal to zero.

E.If we are training a neural network using gradient descent, one reasonable "debugging" step to make sure it is working is to plot J(Θ) as a function of the number of iterations, and make sure it is decreasing (or at least non-increasing) after each iteration.

F.Suppose we have a correct implementation of backpropagation, and are training a neural network using gradient descent. Suppose we plot J(Θ) as a function of the number of iterations, and find that it is increasing rather than decreasing. One possible cause of this is that the learning rate α is too large.

G.Suppose that the parameter Θ^{(1)} is a square matrix (meaning the number of rows equals the number of columns). If we replace Θ^{(1)} with its transpose (Θ^{(1)})^T , then we have not changed the function that the network is computing.

H.Suppose we are using gradient descent with learning rate α . For logistic regression and linear regression, J(theta) was a convex optimization problem and thus we did not want to choose a learning rate α that is too large. For a neural network however, J(Θ) may not be convex, and thus choosing a very large value of α can only speed up convergence.

Answer: B, C, E, F

Analysis:

A. False: setting all the weights of a layer to the same constant does not break symmetry; the hidden units still compute identical functions of the input.

B. True: J(Θ) should decrease (or at least not increase) with every iteration when gradient descent is working correctly.

C. True: because J(Θ) is non-convex, different random initializations can make gradient descent converge to different local optima.

D. False: initializing every parameter to one is just as symmetric as initializing to zero, so it does not break symmetry.

E. True: the same statement as B.

F. True: a learning rate α that is too large can make J(Θ) increase from iteration to iteration even when backpropagation is implemented correctly.

G. False: a matrix is generally not equal to its transpose, so replacing Θ^{(1)} with (Θ^{(1)})^T changes the function the network computes.

H. False: for a neural network J(Θ) is non-convex, and a very large α can make the iterations diverge rather than converge faster.


Now for programming exercise 4:

The data is the same handwritten-digit set as last time. The image display functions are already provided, so we can ignore them and look at the network we need to build:

There is only one hidden layer, but quite a lot of input units, and the 10 output units correspond to the digits 0-9. The first task is to open nnCostFunction.m and fill in the cost function:

Cost function:

Backpropagation:

function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices.
%
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
% for our 2 layer neural network
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);

% You need to return the following variables correctly
J = 0;
Theta1_grad = zeros(size(Theta1));
Theta2_grad = zeros(size(Theta2));

% ================================ My code ================================
% Instructions: You should complete the code by working through the
%               following parts.
%
% Part 1: Feedforward the neural network and return the cost in the
%         variable J. After implementing Part 1, you can verify that your
%         cost function computation is correct by verifying the cost
%         computed in ex4.m

% First express the labels y as a matrix of one-hot row vectors:
I = eye(num_labels);
Y = zeros(m, num_labels);
for i = 1:m
  Y(i, :) = I(y(i), :);
end

% Next write the cost function: first compute the hypothesis h by forward
% propagation, then the regularization term, and finally combine them.
% With only three layers, follow the forward-propagation steps (see notes 13)
% to compute each layer's activations:
a1 = [ones(m, 1) X];                       % add the bias unit
z2 = a1 * Theta1';
a2 = [ones(size(z2, 1), 1) sigmoid(z2)];
z3 = a2 * Theta2';
a3 = sigmoid(z3);
h = a3;

% Regularization term (bias columns excluded):
p = sum(sum(Theta1(:, 2:end).^2, 2)) + sum(sum(Theta2(:, 2:end).^2, 2));

% Cost function:
J = sum(sum((-Y) .* log(h) - (1-Y) .* log(1-h), 2))/m + lambda * p/(2 * m);

% Part 2: Implement the backpropagation algorithm to compute the gradients
%         Theta1_grad and Theta2_grad. You should return the partial derivatives of
%         the cost function with respect to Theta1 and Theta2 in Theta1_grad and
%         Theta2_grad, respectively. After implementing Part 2, you can check
%         that your implementation is correct by running checkNNGradients
%
%         Note: The vector y passed into the function is a vector of labels
%               containing values from 1..K. You need to map this vector into a
%               binary vector of 1s and 0s to be used with the neural network
%               cost function.
%
%         Hint: We recommend implementing backpropagation using a for-loop
%               over the training examples if you are implementing it for the
%               first time.
%
% Then use backpropagation to compute the error terms of each layer (notes 13):
sigma3 = a3 - Y;
sigma2 = (sigma3 * Theta2) .* sigmoidGradient([ones(size(z2, 1), 1) z2]);
sigma2 = sigma2(:, 2:end);

% Then accumulate the gradients:
delta_1 = (sigma2' * a1);
delta_2 = (sigma3' * a2);

% Part 3: Implement regularization with the cost function and gradients.
%
%         Hint: You can implement this around the code for
%               backpropagation. That is, you can compute the gradients for
%               the regularization separately and then add them to Theta1_grad
%               and Theta2_grad from Part 2.
%
% Finally add the regularization terms:
p1 = (lambda/m) * [zeros(size(Theta1, 1), 1) Theta1(:, 2:end)];
p2 = (lambda/m) * [zeros(size(Theta2, 1), 1) Theta2(:, 2:end)];
Theta1_grad = delta_1./m + p1;
Theta2_grad = delta_2./m + p2;

% ================================ My code ================================

% -------------------------------------------------------------

% =========================================================================

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];

end


Finally we need to complete the derivative of the sigmoid function. Open sigmoidGradient.m:

function g = sigmoidGradient(z)
%SIGMOIDGRADIENT returns the gradient of the sigmoid function
%evaluated at z
%   g = SIGMOIDGRADIENT(z) computes the gradient of the sigmoid function
%   evaluated at z. This should work regardless if z is a matrix or a
%   vector. In particular, if z is a vector or matrix, you should return
%   the gradient for each element.

g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the gradient of the sigmoid function evaluated at
%               each value of z (z can be a matrix, vector or scalar).

g = sigmoid(z).*(1-sigmoid(z));

% =============================================================

end
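A quick sanity check, assuming sigmoid.m and the completed sigmoidGradient.m from the exercise are on the path (the gradient g(z)(1 - g(z)) peaks at z = 0 with value 0.25):

sigmoidGradient(0)          % ans = 0.25000
sigmoidGradient([-1 0 1])   % ans = 0.19661  0.25000  0.19661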


The gradient checking and random initialization functions are already provided, so this is all we need to write. The core of the assignment is the overall training process described in the first part of these notes. Run it:

Finally, submit and you are done:


These notes are based on Andrew Ng's Machine Learning course on Coursera.

To keep the notes from getting too long and hard to search, the series is split into several parts; the other notes are listed below for anyone interested.

Machine Learning Notes 1 — Definition of Machine Learning, Supervised and Unsupervised Learning

Machine Learning Notes 2 — Linear Models, Cost Functions and Gradient Descent

Machine Learning Notes 3 — Linear Algebra Basics

Machine Learning Notes 4 — Linear Regression with Multiple Features

Machine Learning Notes 5 — The Normal Equation

Machine Learning Notes 6 — Matlab Programming Basics

Machine Learning Notes 7 — Programming Exercise 1

Machine Learning Notes 8 — Cost Function and Gradient Descent for Logistic Regression

Machine Learning Notes 9 — Overfitting and Regularization

Machine Learning Notes 10 — Programming Exercise 2

Machine Learning Notes 11 — Neural Networks

Machine Learning Notes 12 — Programming Exercise 3

Machine Learning Notes 13 — Cost Function and Backpropagation (BP Algorithm) for Neural Networks

