Building a Deep Neural Network from Scratch: Andrew Ng Deep Learning Course 1, Week 4 Assignment Solutions (1)
The Week 4 assignment asks us to implement the following:
- design and code a multi-layer deep neural network;
- implement the forward-propagation and backward-propagation modules;
- package the pieces into a complete DNN module for training;
- use the resulting DNN to classify cat images.
Note that "from scratch" means the only third-party library allowed is numpy; tools such as TensorFlow are off limits.
Tools like TensorFlow have greatly lowered the barrier to deep learning and are now widely used: they free us from tedious low-level coding, especially gradient computation and optimization. But in my view, if you can only build neural networks with such toolkits, cannot derive the parameter updates yourself, and have never coded a model from scratch, it is hard to make real progress in deep learning, especially when a model has no open-source implementation and you have to reproduce it yourself. So after finishing this assignment I decided to write up the solutions.
1. Overall Architecture
This post implements an L-layer feed-forward neural network, where L is the number of layers and the number of neurons in each layer can be chosen freely. For example,
layers_dims = [12288, 20, 7, 5, 1]
describes a five-layer network (counting the input layer):
- 12288 is the dimension of the input layer $X$: each image is $64 \times 64 \times 3 = 12288$ values once flattened;
- layers 2 through 5 have 20, 7, 5 and 1 neurons respectively, and the output of layer 5 is the final result, the prediction of whether the image shows a cat.
Hidden layers 1 through $L-1$ use the ReLU activation and the output layer $L$ uses the sigmoid activation (in the code below, $L$ counts only the weight layers, so $L = 4$ for this architecture). The resulting structure is shown in Figure 1.
The overall procedure, shown in Figure 2, consists of three stages (five sub-steps in total); a condensed code sketch follows the list:
- initialize the weights W1, b1, W2, b2, ..., WL, bL;
- in each iteration, loop over:
  - forward propagation;
  - backward propagation to obtain the gradient-descent direction;
  - parameter update;
- prediction.
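Before filling in each piece, here is a condensed sketch of that loop. It uses the functions defined in sections 2 through 6 below, so it only becomes runnable once those are in place; section 7 wraps exactly this loop into L_layer_model.

```python
# Condensed outline of the pipeline; the functions are implemented in sections 2-6 below.
parameters = initialize_parameters_deep(layers_dims)                    # initialize W1, b1, ..., WL, bL
for i in range(num_iterations):
    AL, caches = L_model_forward(X, parameters)                         # forward propagation
    cost = compute_cost(AL, Y)                                          # cross-entropy cost
    grads = L_model_backward(AL, Y, caches)                             # backward propagation (gradients)
    parameters = update_parameters(parameters, grads, learning_rate)    # gradient-descent step
```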
2. Initialization
Suppose layer $l$ has $n^{[l]}$ neurons. Then:
- $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$; for example, the input to the first layer has dimension 12288 and the first layer has 20 neurons, so $W^{[1]}$ is initialized with shape $(20, 12288)$;
- $b^{[l]}$ has shape $(n^{[l]}, 1)$;
- because propagation is vectorized, each pass takes a matrix of m examples $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}]$ of shape $(12288, m)$, where each column of X is one example's feature vector. Consequently every layer's output is also a matrix of width m: $A^{[l]}$ has shape $(n^{[l]}, m)$.
The corresponding shapes are summarized in Figure 3.
Note that when computing $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$:
- $W^{[l]} A^{[l-1]}$ is a dot product (matrix multiplication);
- adding $b^{[l]}$ relies on broadcasting, because $b^{[l]}$ is a column vector;
as illustrated in Figure 4 and in the short numpy demo below.
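To make the dot product and the broadcasting concrete, here is a small standalone numpy check; the layer sizes follow layers_dims above, and m = 3 is an arbitrary batch size chosen just for the demo.

```python
import numpy as np

# Shapes involved in Z1 = W1 @ A_prev + b1 for the first layer.
m = 3
A_prev = np.random.randn(12288, m)    # each column is one flattened image
W1 = np.random.randn(20, 12288) * 0.01
b1 = np.zeros((20, 1))

Z1 = np.dot(W1, A_prev) + b1          # (20, 12288) @ (12288, 3) -> (20, 3); b1 broadcasts over the columns
print(Z1.shape)                       # (20, 3)
```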
The initialization code is shown below; W is initialized with small random numbers and b with zero vectors:
```python
import numpy as np

# GRADED FUNCTION: initialize_parameters_deep

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)            # number of layers in the network

    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
        ### END CODE HERE ###

        assert(parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert(parameters["b" + str(l)].shape == (layer_dims[l], 1))

    return parameters
```
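A quick usage check of the function above, printing the parameter shapes for the architecture used in this post:

```python
# Check the initialized shapes for layers_dims = [12288, 20, 7, 5, 1].
parameters = initialize_parameters_deep([12288, 20, 7, 5, 1])
for l in range(1, 5):
    print("W" + str(l), parameters["W" + str(l)].shape,
          "b" + str(l), parameters["b" + str(l)].shape)
# Expected: W1 (20, 12288) b1 (20, 1), W2 (7, 20) b2 (7, 1),
#           W3 (5, 7) b3 (5, 1),      W4 (1, 5) b4 (1, 1)
```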
3. Forward Propagation Module
Forward propagation consists of two steps:
- the linear transformation $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$;
- computing the activation $A^{[l]} = g(Z^{[l]})$:
  - layers 1 through $L-1$ use ReLU: $A^{[l]} = \mathrm{ReLU}(Z^{[l]})$;
  - the output layer uses sigmoid: $A^{[L]} = \sigma(Z^{[L]})$.
The implementation is shown below:
```python
# 1. First, implement the linear transformation
# GRADED FUNCTION: linear_forward

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a python tuple containing "A", "W" and "b"; stored for computing the backward pass efficiently
    """
    ### START CODE HERE ### (≈ 1 line of code)
    Z = np.dot(W, A) + b   # dot product, with b broadcast across the columns
    ### END CODE HERE ###

    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)

    return Z, cache


# 2. Then apply the activation function on top of the linear output
# GRADED FUNCTION: linear_activation_forward

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
        ### END CODE HERE ###

    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
        ### END CODE HERE ###

    assert(A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache


# 3. Combine the pieces into the forward pass of the L-layer DNN, producing AL
# GRADED FUNCTION: L_model_forward

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_activation_forward() with "sigmoid" (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2        # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        ### START CODE HERE ### (≈ 2 lines of code)
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        A, cache = linear_activation_forward(A_prev, W, b, "relu")
        caches.append(cache)
        ### END CODE HERE ###

    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    ### START CODE HERE ### (≈ 2 lines of code)
    AL, cache = linear_activation_forward(A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    ### END CODE HERE ###

    assert(AL.shape == (1, X.shape[1]))

    return AL, caches
```
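A small sanity check of the forward pass on random inputs; it assumes the functions above plus the sigmoid/relu helpers from section 5.2 are already in scope.

```python
# Sanity check: 4 random "images" through the 5-layer architecture.
np.random.seed(1)
X_check = np.random.randn(12288, 4)
params_check = initialize_parameters_deep([12288, 20, 7, 5, 1])
AL, caches = L_model_forward(X_check, params_check)
print(AL.shape, len(caches))   # (1, 4) and 4 caches, one per weight layer
```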
4. Cost Function
We use the cross-entropy cost, defined as:
$$J = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log a^{[L](i)} + \big(1-y^{(i)}\big)\log\big(1-a^{[L](i)}\big)\Big]$$
Trick 1: the final output AL and the labels Y both have shape $(1, m)$, so the cross-entropy cost can be computed with element-wise products (np.multiply) followed by a sum.
The code is shown below:
```python
# GRADED FUNCTION: compute_cost

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]

    # Compute loss from aL and y.
    ### START CODE HERE ### (≈ 1 lines of code)
    cost = -np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL))) / m
    ### END CODE HERE ###

    cost = np.squeeze(cost)   # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())

    return cost
```
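A tiny worked example of the formula, assuming compute_cost above is defined; the two examples and their predictions are made up for illustration.

```python
# J = -1/2 * (log(0.9) + log(0.8)) ≈ 0.1643
Y_ex  = np.array([[1, 0]])
AL_ex = np.array([[0.9, 0.2]])
print(compute_cost(AL_ex, Y_ex))
```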
5. Backward Propagation Module
Starting from layer L we propagate backwards, computing $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$ for every layer. We first implement the backward pass of the linear part, then the backward pass of the linear-activation combination, and finally chain them together into the backward module of the whole DNN.
5.1 Linear Backward
For layer $l$, the linear part is $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$. Assuming $dZ^{[l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}$ has already been computed, we want to obtain $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$, as shown in Figure 5.
The formulas are as follows (the derivation is a good exercise):
$$dW^{[l]} = \frac{1}{m}\, dZ^{[l]} A^{[l-1]T}, \qquad db^{[l]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)}, \qquad dA^{[l-1]} = W^{[l]T}\, dZ^{[l]}$$
The implementation:
```python
# GRADED FUNCTION: linear_backward

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    ### START CODE HERE ### (≈ 3 lines of code)
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    ### END CODE HERE ###

    assert(dA_prev.shape == A_prev.shape)
    assert(dW.shape == W.shape)
    assert(db.shape == b.shape)

    return dA_prev, dW, db
```
5.2 Linear-Activation Backward
Suppose the activation function of layer $l$ is $g$. Given $dA^{[l]}$, we have:
$$dZ^{[l]} = dA^{[l]} * g'\big(Z^{[l]}\big)$$
where $*$ denotes element-wise multiplication. So we first implement the derivatives of the activation functions:
```python
import numpy as np


def sigmoid(Z):
    """
    Implements the sigmoid activation in numpy

    Arguments:
    Z -- numpy array of any shape

    Returns:
    A -- output of sigmoid(z), same shape as Z
    cache -- returns Z as well, useful during backpropagation
    """
    A = 1 / (1 + np.exp(-Z))
    cache = Z

    return A, cache


def relu(Z):
    """
    Implement the RELU function.

    Arguments:
    Z -- Output of the linear layer, of any shape

    Returns:
    A -- Post-activation parameter, of the same shape as Z
    cache -- returns Z as well, stored for computing the backward pass efficiently
    """
    A = np.maximum(0, Z)
    assert(A.shape == Z.shape)
    cache = Z

    return A, cache


def relu_backward(dA, cache):
    """
    Implement the backward propagation for a single RELU unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- Z where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    dZ = np.array(dA, copy=True)   # just converting dz to a correct object.

    # When z <= 0, you should set dz to 0 as well.
    dZ[Z <= 0] = 0
    assert(dZ.shape == Z.shape)

    return dZ


def sigmoid_backward(dA, cache):
    """
    Implement the backward propagation for a single SIGMOID unit.

    Arguments:
    dA -- post-activation gradient, of any shape
    cache -- Z where we store for computing backward propagation efficiently

    Returns:
    dZ -- Gradient of the cost with respect to Z
    """
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)
    assert(dZ.shape == Z.shape)

    return dZ
```
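As a quick sanity check of the analytic derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$, the sketch below compares sigmoid_backward against a finite-difference estimate; it assumes the helpers above are in scope, and the test values are arbitrary.

```python
# Finite-difference check of sigmoid_backward on a few arbitrary points.
Z_chk = np.array([[0.5, -1.2, 3.0]])
dA_chk = np.ones_like(Z_chk)                 # with dA = 1, dZ equals g'(Z)
_, cache_chk = sigmoid(Z_chk)
dZ_analytic = sigmoid_backward(dA_chk, cache_chk)

eps = 1e-6
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
dZ_numeric = (sig(Z_chk + eps) - sig(Z_chk - eps)) / (2 * eps)
print(np.allclose(dZ_analytic, dZ_numeric))  # True
```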
Then, using $dZ^{[l]} = dA^{[l]} * g'(Z^{[l]})$ together with linear_backward, we compute $dA^{[l-1]}$, $dW^{[l]}$ and $db^{[l]}$; the implementation is shown below:
```python
# GRADED FUNCTION: linear_activation_backward

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###

    elif activation == "sigmoid":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###

    return dA_prev, dW, db
```
5.3 L-Model Backward
The backward pass starts at the output layer $L$ with the gradient of the cost with respect to the final activation:
$$dA^{[L]} = \frac{\partial \mathcal{L}}{\partial A^{[L]}}$$
With the cross-entropy cost, the derivation gives:
$$dA^{[L]} = -\left(\frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}}\right)$$
Trick 2: since the final output AL and Y both have shape $(1, m)$, $dA^{[L]}$ can be computed with element-wise division, as shown below:
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))   # derivative of cost with respect to AL
```
Combining 5.1 and 5.2 gives the backward pass of the whole DNN, which produces the gradient-descent direction for every parameter W and b:
```python
# GRADED FUNCTION: L_model_backward

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e. l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients.
             Note the indexing: the n in "dAn" is one larger than the index into caches, because caches is zero-indexed.
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)            # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)    # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    ### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    ### END CODE HERE ###

    # Lth layer (SIGMOID -> LINEAR) gradients.
    # Inputs: "AL, Y, caches". Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]"
    ### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, "sigmoid")
    ### END CODE HERE ###

    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches". Outputs: "grads["dA" + str(l + 1)], grads["dW" + str(l + 1)], grads["db" + str(l + 1)]"
        ### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)], current_cache, "relu")   # hidden layers use ReLU
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        ### END CODE HERE ###

    return grads
```
6. Update Parameters
The parameters are then updated by gradient descent:
$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}, \qquad b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$$
where $\alpha$ is the learning rate, a hyperparameter.
The implementation is shown below:
```python
# GRADED FUNCTION: update_parameters

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of L_model_backward

    Returns:
    parameters -- python dictionary containing your updated parameters
                  parameters["W" + str(l)] = ...
                  parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2   # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    ### START CODE HERE ### (≈ 3 lines of code)
    for l in range(L):
        parameters["W" + str(l + 1)] -= grads["dW" + str(l + 1)] * learning_rate
        parameters["b" + str(l + 1)] -= grads["db" + str(l + 1)] * learning_rate
    ### END CODE HERE ###

    return parameters
```
7. Build the L-layer Neural Network
In sections 1 through 6 we implemented:
- parameter initialization (creating the parameters);
- forward and backward propagation;
- parameter updates.
Now we assemble these pieces into an L-layer neural network with the structure [LINEAR -> RELU]×(L-1) -> LINEAR -> SIGMOID.
```python
import numpy as np
import matplotlib.pyplot as plt

### CONSTANTS ###
layers_dims = [12288, 20, 7, 5, 1]   # 5-layer model


# GRADED FUNCTION: L_layer_model

def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):   # lr was 0.009
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.

    Arguments:
    X -- data, numpy array of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(1)
    costs = []   # keep track of cost

    # Parameters initialization.
    ### START CODE HERE ###
    parameters = initialize_parameters_deep(layers_dims)
    ### END CODE HERE ###

    # Loop (gradient descent)
    for i in range(0, num_iterations):

        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        ### START CODE HERE ### (≈ 1 line of code)
        AL, caches = L_model_forward(X, parameters)
        ### END CODE HERE ###

        # Compute cost.
        ### START CODE HERE ### (≈ 1 line of code)
        cost = compute_cost(AL, Y)
        ### END CODE HERE ###

        # Backward propagation.
        ### START CODE HERE ### (≈ 1 line of code)
        grads = L_model_backward(AL, Y, caches)
        ### END CODE HERE ###

        # Update parameters.
        ### START CODE HERE ### (≈ 1 line of code)
        parameters = update_parameters(parameters, grads, learning_rate)
        ### END CODE HERE ###

        # Print the cost every 100 training iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
```
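A typical call would look like the sketch below. The variable names train_x (shape (12288, m)) and train_y (shape (1, m)) are placeholders; loading the actual cat dataset is covered in the next post.

```python
# Hypothetical usage; train_x and train_y stand in for the flattened, normalized cat dataset.
parameters = L_layer_model(train_x, train_y, layers_dims, num_iterations=2500, print_cost=True)
```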
Our model now runs forward propagation, backward propagation and parameter updates for a configurable number of iterations, and the network structure itself is customizable.
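The "prediction" step listed in Figure 2 is not one of the graded functions above. A minimal sketch, assuming the trained parameters and the L_model_forward function from section 3, could look like this (this is my own helper, not the course-provided one; 0.5 is the usual threshold for a sigmoid output):

```python
def predict(X, parameters):
    """Run a forward pass and threshold the sigmoid output at 0.5.

    X          -- data of shape (12288, number of examples)
    parameters -- trained parameters returned by L_layer_model
    Returns a (1, number of examples) array of 0/1 predictions.
    """
    AL, _ = L_model_forward(X, parameters)
    return (AL > 0.5).astype(int)
```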
In the next post we will use this model to classify the cat images.