Building a Deep Neural Network from Scratch: Andrew Ng Deep Learning Course 1, Week 4 Assignment Solutions (1)
The Week 4 assignment asks us to implement the following:
- design and code a multi-layer deep neural network;
- implement the forward-propagation and backward-propagation modules;
- package the pieces into a complete DNN module for training;
- use the resulting DNN to classify cat images.
Note that "from scratch" means the only third-party library allowed is numpy; tools such as TensorFlow are off limits.
Tools like TensorFlow have greatly lowered the barrier to deep learning and are now widely used: they free us from tedious low-level coding, especially gradient computation and optimization. But in my view, if you can only build neural networks with such toolkits, cannot derive the parameter updates yourself, and have never coded a model from scratch, it is hard to make real progress in deep learning, especially when a model has no open-source implementation and you have to reproduce it yourself. So after finishing this assignment I decided to write up the solutions.
1. Overall Architecture
This post implements an L-layer feed-forward neural network, where L is the number of layers and the number of neurons in each layer can be chosen freely. For example,
layers_dims = [12288, 20, 7, 5, 1]
describes a five-layer network (counting the input layer):
- 12288 is the dimension of the input layer $X$: each image is $64 \times 64 \times 3 = 12288$ values once flattened;
- layers 2 through 5 have 20, 7, 5 and 1 neurons respectively, and the output of layer 5 is the final result, the prediction of whether the image shows a cat.
Hidden layers 1 through $L-1$ use the ReLU activation and the output layer $L$ uses the sigmoid activation (in the code below, $L$ counts only the weight layers, so $L = 4$ for this architecture). The resulting structure is shown in Figure 1.
The overall procedure, shown in Figure 2, consists of three stages (five sub-steps in total); a condensed code sketch follows the list:
- initialize the weights W1, b1, W2, b2, ..., WL, bL;
- in each iteration, loop over:
  - forward propagation;
  - backward propagation to obtain the gradient-descent direction;
  - parameter update;
- prediction.
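Before filling in each piece, here is a condensed sketch of that loop. It uses the functions defined in sections 2 through 6 below, so it only becomes runnable once those are in place; section 7 wraps exactly this loop into L_layer_model.

```python
# Condensed outline of the pipeline; the functions are implemented in sections 2-6 below.
parameters = initialize_parameters_deep(layers_dims)                    # initialize W1, b1, ..., WL, bL
for i in range(num_iterations):
    AL, caches = L_model_forward(X, parameters)                         # forward propagation
    cost = compute_cost(AL, Y)                                          # cross-entropy cost
    grads = L_model_backward(AL, Y, caches)                             # backward propagation (gradients)
    parameters = update_parameters(parameters, grads, learning_rate)    # gradient-descent step
```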
2. Initialization
Suppose layer $l$ has $n^{[l]}$ neurons. Then:
- $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$; for example, the input to the first layer has dimension 12288 and the first layer has 20 neurons, so $W^{[1]}$ is initialized with shape $(20, 12288)$;
- $b^{[l]}$ has shape $(n^{[l]}, 1)$;
- because propagation is vectorized, each pass takes a matrix of m examples $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}]$ of shape $(12288, m)$, where each column of X is one example's feature vector. Consequently every layer's output is also a matrix of width m: $A^{[l]}$ has shape $(n^{[l]}, m)$.
The corresponding shapes are summarized in Figure 3.
Note that when computing $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$:
- $W^{[l]} A^{[l-1]}$ is a dot product (matrix multiplication);
- adding $b^{[l]}$ relies on broadcasting, because $b^{[l]}$ is a column vector;
as illustrated in Figure 4 and in the short numpy demo below.
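To make the dot product and the broadcasting concrete, here is a small standalone numpy check; the layer sizes follow layers_dims above, and m = 3 is an arbitrary batch size chosen just for the demo.

```python
import numpy as np

# Shapes involved in Z1 = W1 @ A_prev + b1 for the first layer.
m = 3
A_prev = np.random.randn(12288, m)    # each column is one flattened image
W1 = np.random.randn(20, 12288) * 0.01
b1 = np.zeros((20, 1))

Z1 = np.dot(W1, A_prev) + b1          # (20, 12288) @ (12288, 3) -> (20, 3); b1 broadcasts over the columns
print(Z1.shape)                       # (20, 3)
```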
The initialization code is shown below; W is initialized with small random numbers and b with zero vectors:
```python
import numpy as np

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
        ### END CODE HERE ###

        assert(parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert(parameters["b" + str(l)].shape == (layer_dims[l], 1))

    return parameters
```
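A quick usage check of the function above, printing the parameter shapes for the architecture used in this post:

```python
# Check the initialized shapes for layers_dims = [12288, 20, 7, 5, 1].
parameters = initialize_parameters_deep([12288, 20, 7, 5, 1])
for l in range(1, 5):
    print("W" + str(l), parameters["W" + str(l)].shape,
          "b" + str(l), parameters["b" + str(l)].shape)
# Expected: W1 (20, 12288) b1 (20, 1), W2 (7, 20) b2 (7, 1),
#           W3 (5, 7) b3 (5, 1),      W4 (1, 5) b4 (1, 1)
```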
3. Forward Propagation Module
Forward propagation consists of two steps:
- the linear transformation $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$;
- computing the activation $A^{[l]} = g(Z^{[l]})$:
  - layers 1 through $L-1$ use ReLU: $A^{[l]} = \mathrm{ReLU}(Z^{[l]})$;
  - the output layer uses sigmoid: $A^{[L]} = \sigma(Z^{[L]})$.
The implementation is shown below:
```python
# 1. First, implement the linear transformation
# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a python tuple containing "A", "W" and "b"; stored for computing the backward pass efficiently
    """
    ### START CODE HERE ### (≈ 1 line of code)
    Z = np.dot(W, A) + b   # dot product, with b broadcast across the columns
    ### END CODE HERE ###

    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)

    return Z, cache


# 2. Then apply the activation function on top of the linear output
# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
        ### END CODE HERE ###

    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
        ### END CODE HERE ###

    assert(A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache


# 3. Combine the pieces into the forward pass of the L-layer DNN, producing AL
# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_activation_forward() with "sigmoid" (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2        # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        ### START CODE HERE ### (≈ 2 lines of code)
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        A, cache = linear_activation_forward(A_prev, W, b, "relu")
        caches.append(cache)
        ### END CODE HERE ###

    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    ### START CODE HERE ### (≈ 2 lines of code)
    AL, cache = linear_activation_forward(A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    ### END CODE HERE ###

    assert(AL.shape == (1, X.shape[1]))

    return AL, caches
```
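A small sanity check of the forward pass on random inputs; it assumes the functions above plus the sigmoid/relu helpers from section 5.2 are already in scope.

```python
# Sanity check: 4 random "images" through the 5-layer architecture.
np.random.seed(1)
X_check = np.random.randn(12288, 4)
params_check = initialize_parameters_deep([12288, 20, 7, 5, 1])
AL, caches = L_model_forward(X_check, params_check)
print(AL.shape, len(caches))   # (1, 4) and 4 caches, one per weight layer
```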
4. Cost Function
We use the cross-entropy cost, defined as:
$$J = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log a^{[L](i)} + \big(1-y^{(i)}\big)\log\big(1-a^{[L](i)}\big)\Big]$$
Trick 1: the final output AL and the labels Y both have shape $(1, m)$, so the cross-entropy cost can be computed with element-wise products (np.multiply) followed by a sum.
The code is shown below:
```python
# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]

    # Compute loss from aL and y.
    ### START CODE HERE ### (≈ 1 lines of code)
    cost = -np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL))) / m
    ### END CODE HERE ###

    cost = np.squeeze(cost)   # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())

    return cost
```
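A tiny worked example of the formula, assuming compute_cost above is defined; the two examples and their predictions are made up for illustration.

```python
# J = -1/2 * (log(0.9) + log(0.8)) ≈ 0.1643
Y_ex  = np.array([[1, 0]])
AL_ex = np.array([[0.9, 0.2]])
print(compute_cost(AL_ex, Y_ex))
```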
5. Backward Propagation Module
Starting from layer L we propagate backwards, computing $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$ for every layer. We first implement the backward pass of the linear part, then the backward pass of the linear-activation combination, and finally chain them together into the backward module of the whole DNN.
5.1 Linear Backward
For layer $l$, the linear part is $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$. Assuming $dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}$ has already been computed, we want to obtain $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$, as shown in Figure 5.
The formulas are as follows (the derivation is a good exercise):
$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}, \qquad db^{[l]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}, \qquad dA^{[l-1]} = W^{[l]T}\, dZ^{[l]}$$
The implementation:
```python
# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    ### START CODE HERE ### (≈ 3 lines of code)
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    ### END CODE HERE ###

    assert(dA_prev.shape == A_prev.shape)
    assert(dW.shape == W.shape)
    assert(db.shape == b.shape)

    return dA_prev, dW, db
```
5.2 Linear-Activation Backward
Suppose the activation function of layer $l$ is $g$. Given $dA^{[l]}$, we have:
$$dZ^{[l]} = dA^{[l]} * g'\big(Z^{[l]}\big)$$
where $*$ denotes element-wise multiplication. So we first implement the derivatives of the activation functions:
```python
import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy

    Arguments:
    Z -- numpy array of any shape

    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    A = 1 / (1 + np.exp(-Z))
    cache = Z

    return A, cache


def relu(Z):
    """
    Implement the RELU function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- returns Z as well, stored for computing the backward pass efficiently
    """
    A = np.maximum(0, Z)
    assert(A.shape == Z.shape)
    cache = Z

    return A, cache


def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single RELU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- Z where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True)   # just converting dz to a correct object.

    # When z <= 0, you should set dz to 0 as well.
    dZ[Z <= 0] = 0
    assert(dZ.shape == Z.shape)

    return dZ


def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- Z where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)
    assert(dZ.shape == Z.shape)

    return dZ
```
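As a quick sanity check of the analytic derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$, the sketch below compares sigmoid_backward against a finite-difference estimate; it assumes the helpers above are in scope, and the test values are arbitrary.

```python
# Finite-difference check of sigmoid_backward on a few arbitrary points.
Z_chk = np.array([[0.5, -1.2, 3.0]])
dA_chk = np.ones_like(Z_chk)                 # with dA = 1, dZ equals g'(Z)
_, cache_chk = sigmoid(Z_chk)
dZ_analytic = sigmoid_backward(dA_chk, cache_chk)

eps = 1e-6
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
dZ_numeric = (sig(Z_chk + eps) - sig(Z_chk - eps)) / (2 * eps)
print(np.allclose(dZ_analytic, dZ_numeric))  # True
```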
Then, using $dZ^{[l]} = dA^{[l]} * g'(Z^{[l]})$ together with linear_backward, we compute $dA^{[l-1]}$, $dW^{[l]}$ and $db^{[l]}$; the implementation is shown below:
```python
# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###

    elif activation == "sigmoid":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###

    return dA_prev, dW, db
```
5.3 L-Model Backward
The backward pass starts at the output layer $L$ with the gradient of the cost with respect to the final activation:
$$dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}}$$
With the cross-entropy cost, the derivation gives:
$$dA^{[L]} = -\left(\frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}}\right)$$
Trick 2: since the final output AL and Y both have shape $(1, m)$, $dA^{[L]}$ can be computed with element-wise division, as shown below:
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))   # derivative of cost with respect to AL
```
Combining 5.1 and 5.2 gives the backward pass of the whole DNN, which produces the gradient-descent direction for every parameter W and b:
```python
# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e. l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients.
             Note the indexing: the n in "dAn" is one larger than the index into caches, because caches is zero-indexed.
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)            # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)    # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    ### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    ### END CODE HERE ###

    # Lth layer (SIGMOID -> LINEAR) gradients.
    # Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]"
    ### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, "sigmoid")
    ### END CODE HERE ###

    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)], grads["dW" + str(l + 1)], grads["db" + str(l + 1)]"
        ### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, "relu")   # hidden layers use ReLU
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        ### END CODE HERE ###

    return grads
```
6. Update Parameters
The parameters are then updated by gradient descent:
$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$$
where $\alpha$ is the learning rate, a hyperparameter.
The implementation is shown below:
```python
# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of L_model_backward

    Returns:
    parameters -- python dictionary containing your updated parameters
                  parameters["W" + str(l)] = ...
                  parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2   # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    ### START CODE HERE ### (≈ 3 lines of code)
    for l in range(L):
        parameters["W" + str(l + 1)] -= grads["dW" + str(l + 1)] * learning_rate
        parameters["b" + str(l + 1)] -= grads["db" + str(l + 1)] * learning_rate
    ### END CODE HERE ###

    return parameters
```
7. Build the L-layer Neural Network
In sections 1 through 6 we implemented:
- parameter initialization (creating the parameters);
- forward and backward propagation;
- parameter updates.
Now we assemble these pieces into an L-layer neural network with the structure [LINEAR -> RELU]×(L-1) -> LINEAR -> SIGMOID.
```python
import numpy as np
import matplotlib.pyplot as plt

### CONSTANTS ###
layers_dims = [12288, 20, 7, 5, 1]   # 5-layer model


# GRADED FUNCTION: L_layer_model

def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):   # lr was 0.009
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.

    Arguments:
    X -- data, numpy array of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(1)
    costs = []   # keep track of cost

    # Parameters initialization.
    ### START CODE HERE ###
    parameters = initialize_parameters_deep(layers_dims)
    ### END CODE HERE ###

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        ### START CODE HERE ### (≈ 1 line of code)
        AL, caches = L_model_forward(X, parameters)
        ### END CODE HERE ###

        # Compute cost.
        ### START CODE HERE ### (≈ 1 line of code)
        cost = compute_cost(AL, Y)
        ### END CODE HERE ###

        # Backward propagation.
        ### START CODE HERE ### (≈ 1 line of code)
        grads = L_model_backward(AL, Y, caches)
        ### END CODE HERE ###

        # Update parameters.
        ### START CODE HERE ### (≈ 1 line of code)
        parameters = update_parameters(parameters, grads, learning_rate)
        ### END CODE HERE ###

        # Print the cost every 100 training iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
```
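A typical call would look like the sketch below. The variable names train_x (shape (12288, m)) and train_y (shape (1, m)) are placeholders; loading the actual cat dataset is covered in the next post.

```python
# Hypothetical usage; train_x and train_y stand in for the flattened, normalized cat dataset.
parameters = L_layer_model(train_x, train_y, layers_dims, num_iterations=2500, print_cost=True)
```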
Our model now runs forward propagation, backward propagation and parameter updates for a configurable number of iterations, and the network structure itself is customizable.
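The "prediction" step listed in Figure 2 is not one of the graded functions above. A minimal sketch, assuming the trained parameters and the L_model_forward function from section 3, could look like this (this is my own helper, not the course-provided one; 0.5 is the usual threshold for a sigmoid output):

```python
def predict(X, parameters):
    """Run a forward pass and threshold the sigmoid output at 0.5.

    X          -- data of shape (12288, number of examples)
    parameters -- trained parameters returned by L_layer_model
    Returns a (1, number of examples) array of 0/1 predictions.
    """
    AL, _ = L_model_forward(X, parameters)
    return (AL > 0.5).astype(int)
```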
In the next post we will use this model to classify the cat images.