Deep Learning Basics 7: A Python Implementation of a Multi-Layer Neural Network

01-24

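Articles in this column are revised on an ongoing basis. If an article is not finished yet, please be patient; the usual pace is one update per day.

This article is mainly a set of study notes on Andrew Ng's Coursera course series "Deep Learning". This installment covers the Week 4 programming assignment of Course 1.

First published in the Zhihu column "Deep Learning + Natural Language Processing (NLP)".

The column focuses on organizing the foundational knowledge of fields related to deep learning.

The main text follows:

The previous articles (Deep Learning Basics 3-6) walked step by step, from simple to more involved, through Python implementations of neural network models.

The fourth article solved a logistic regression problem with a neural network, which in essence is a network with 1 layer. The fifth article discussed how to build a 2-layer network. This article, the last one on building basic neural networks, discusses the approach to building an L-layer neural network and its Python implementation.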

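Here is the structure of the article:

1. What is our goal
2. Program design
3. Building the neural network model

3.1 Initialization
3.2 Forward propagation
3.3 Computing the cost
3.4 Backward propagation
3.5 Updating the parameters

4. Putting the model together
5. Testing on data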

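1. What is our goal

Start from the problem we want to solve: given an image, we want a program to decide whether the image shows a cat.

The program that makes this decision is what we set out to build: an L-layer neural network.

How do we obtain this network? By training on a large number of labeled images (supervised learning). That is, we feed in the pixels of every image (the features) together with whether each image is a cat (the labels), and optimize the model parameters.

Cat recognition is only an example here. The network built in this article can be trained for other tasks as well, as long as the features and labels of that task are fed into it for training.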

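2. Program design

Before we start building the neural network model, let us first think about how to design the program.

From the previous articles we know that a neural network model works as follows:

1. Initialize the model parameters
2. Forward propagation
3. Compute the cost
4. Backward propagation
5. Update the parameters
6. Repeat steps 2-5 until the parameters are satisfactory

Let us first write pseudocode for this workflow: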

parameters = initialize_parameters()
for i in iterations:
    forward_propagation()
    compute_cost()
    backward_propagation()
    update_parameters()

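Now we can start writing code.

3. Building the neural network model

First, the packages we need:

numpy is used for scientific computing.
h5py is used to work with HDF5 files.
matplotlib.pyplot is used for plotting.
testCases provides test cases for checking that our functions behave correctly. Here "import *" means importing every public name from the module.
dnn_utils_v2 provides helper functions such as sigmoid and relu.
%matplotlib inline is a Jupyter Notebook command that renders matplotlib figures inside the page instead of opening a separate window.
plt.rcParams sets some default plotting parameters.
"autoreload 2" means that if we modify the modules we have imported, IPython reloads them automatically. See the autoreload documentation listed in the references.
np.random.seed(1) fixes the seed of the random number generator so that the code produces the same random numbers for everyone who runs it (reproducible results).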
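The effect of fixing the seed can be checked with a minimal sketch. The array shape below is an arbitrary choice for illustration, not something taken from the course code:

```python
import numpy as np

# Seeding the generator makes the "random" draws repeatable.
np.random.seed(1)
first = np.random.randn(2, 3)

# Resetting to the same seed replays exactly the same sequence.
np.random.seed(1)
second = np.random.randn(2, 3)

print(np.array_equal(first, second))  # True
```

Without the second call to np.random.seed(1), the two arrays would in general differ, which is why the notebook sets the seed once at the top.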

import numpy as np
import h5py
import matplotlib.pyplot as plt
from testCases_v2 import *
from dnn_utils_v2 import sigmoid, sigmoid_backward, relu, relu_backward

%matplotlib inline
plt.rcParams["figure.figsize"] = (5.0, 4.0)  # set default size of plots
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

%load_ext autoreload
%autoreload 2

np.random.seed(1)

References:

A pitfall of "from module import *" in Python

How to run %matplotlib inline correctly in Python

autoreload - IPython 6.2.0 documentation

What np.random.seed(0) does: it makes random data predictable. - a821235837's column - CSDN blog

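3.1 Initialization

The first thing we do is initialize the parameters of the neural network.

The parameters here are all the W and b. Each layer's parameters are stored as matrices.

So in essence we need to decide the dimensions and the elements of each matrix.

Elements first: so that the neurons do not all learn the same thing, we fill each layer's W matrix with small random numbers (the code scales standard normal draws by 0.01). b is initialized to all zeros.

Now the dimensions: for the parameter $W^{[l]}$ of layer $l$, the number of rows is the number of neurons in that layer and the number of columns is the number of neurons in the previous layer.

As for b, thanks to NumPy's broadcasting rules, b is a column vector with the same number of rows as W and 1 column.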
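The shape rules and the role of broadcasting can be checked with a small sketch. The layer sizes and the number of examples below are made-up values for illustration only:

```python
import numpy as np

layer_dims = [5, 4, 3]   # hypothetical sizes: 5 inputs, 4 hidden units, 3 output units
m = 7                    # hypothetical number of examples

A0 = np.random.randn(layer_dims[0], m)                      # input, shape (5, 7)
W1 = np.random.randn(layer_dims[1], layer_dims[0]) * 0.01   # shape (4, 5)
b1 = np.zeros((layer_dims[1], 1))                           # shape (4, 1)

# b1 has a single column; NumPy broadcasts it across all 7 example columns.
Z1 = np.dot(W1, A0) + b1

print(W1.shape, b1.shape, Z1.shape)  # (4, 5) (4, 1) (4, 7)
```

Because b1 is stored as one column, the same parameters work for any number of examples m.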

Code:

def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network

    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)  # number of entries in layer_dims (the input layer is included)

    for l in range(1, L):
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

        assert(parameters["W" + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters["b" + str(l)].shape == (layer_dims[l], 1))

    return parameters

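3.2 Forward propagation

Looking at the figure first, forward propagation can be described as follows: repeat the linear step followed by relu $L-1$ times, then apply the linear step followed by sigmoid once.

We first define the function that computes the linear step, linear_forward().

Consider the inputs and outputs of this function:

The linear function

We treat the input of layer $l$ as the output of the previous layer's activation function, written $A^{[l-1]}$, and treat the input X of the first layer as $A^{[0]}$.

The inputs that linear_forward() needs are then the previous layer's output A and the parameters W and b.

Its outputs are $Z=WA+b$ and cache. The cache stores this layer's A, W and b for use in backward propagation.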

def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.

    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)

    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a python tuple containing "A", "W" and "b"; stored for computing the backward pass efficiently
    """
    Z = np.dot(W, A) + b

    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)

    return Z, cache

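Note that cache is stored as a tuple here; see "Python in Machine Learning (1): An Introduction to Python's Basic Data Types, Containers, Functions and Classes".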
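A small worked example may make the linear step and the tuple cache more concrete. The function body is repeated so the snippet runs on its own, and the numbers are arbitrary illustration values:

```python
import numpy as np

def linear_forward(A, W, b):
    """Same computation as in the article: Z = W A + b, plus a tuple cache."""
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    return Z, cache

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])     # 2 features, 2 examples (one example per column)
W = np.array([[1.0, -1.0]])    # 1 unit in this layer, 2 inputs
b = np.array([[0.5]])

Z, cache = linear_forward(A, W, b)
print(Z)  # [[-1.5 -1.5]]

# A tuple can be unpacked back into its parts during the backward pass.
A_cached, W_cached, b_cached = cache
```

For the first example, $1 \cdot 1 + (-1) \cdot 3 + 0.5 = -1.5$, which matches the first entry of Z.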

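The linear-activation function

With the linear function in place, we can go on to define the linear-activation function.

Recall that before we started building the network we imported a set of functions for computing activations: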

from dnn_utils_v2 import sigmoid, sigmoid_backward, relu, relu_backward

Here sigmoid and relu compute the output of the activation function.
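The source of dnn_utils_v2 is not shown in this article. The following is only a sketch of what sigmoid and relu might look like, assuming (from how they are called later) that each returns the activation together with a cache holding Z:

```python
import numpy as np

def sigmoid(Z):
    """Assumed behaviour: return A = 1 / (1 + exp(-Z)) and keep Z as the cache."""
    A = 1.0 / (1.0 + np.exp(-Z))
    cache = Z
    return A, cache

def relu(Z):
    """Assumed behaviour: return A = max(0, Z) element-wise and keep Z as the cache."""
    A = np.maximum(0, Z)
    cache = Z
    return A, cache

Z = np.array([[-1.0, 0.0, 2.0]])
A_sig, _ = sigmoid(Z)
A_relu, _ = relu(Z)
print(A_relu)  # [[0. 0. 2.]]
```

The actual helper module may differ in details; only the (A, cache) return shape is implied by the code in this article.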

Now we can write pseudocode for the linear-activation function.

def linear_activation_forward():
    if "sigmoid":
        linear_forward()
        sigmoid()
    elif "relu":
        linear_forward()
        relu()

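This is how data propagates forward through one layer of the network: first apply the linear step to the previous layer's output, then compute the activation according to whether it is sigmoid or relu.

The complete implementation follows: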

def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer

    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)

    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

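The above implements forward propagation through one layer. Next comes forward propagation through an L-layer network:

Here we only need to run the linear-activation function (relu) in a loop $L-1$ times and then run the linear-activation function (sigmoid) once. The code is as follows: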

def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()

    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_relu_forward() (there are L-1 of them, indexed from 0 to L-2)
                the cache of linear_sigmoid_forward() (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network

    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters["W"+str(l)], parameters["b"+str(l)], "relu")
        caches.append(cache)

    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters["W"+str(L)], parameters["b"+str(L)], "sigmoid")
    caches.append(cache)

    assert(AL.shape == (1, X.shape[1]))

    return AL, caches

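The number of layers can be read off as half the size of parameters (each layer contributes one W and one b), and caches.append(cache) stores the new cache in caches on every pass.

There is a programming technique here: caches = [] first defines a list, and the list method .append() then adds new items to it. As noted earlier, each cache is a tuple, and the elements of a list can be of any type.

So when we want a variable that stores data of different types, we generally create a list.

This completes our target function: L_model_forward(X, parameters) -> (AL, caches)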
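To see the layer-count trick and the output shape in isolation, here is a compact, self-contained sketch of the same [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID pass. The helper name forward and the layer sizes are hypothetical, and caches are omitted for brevity:

```python
import numpy as np

def forward(X, parameters):
    """Illustrative forward pass without caches (not the article's full function)."""
    L = len(parameters) // 2          # one W and one b per layer
    A = X
    for l in range(1, L):             # layers 1 .. L-1 use relu
        Z = np.dot(parameters["W" + str(l)], A) + parameters["b" + str(l)]
        A = np.maximum(0, Z)
    ZL = np.dot(parameters["W" + str(L)], A) + parameters["b" + str(L)]
    return 1.0 / (1.0 + np.exp(-ZL))  # layer L uses sigmoid

rng = np.random.RandomState(0)
layer_dims = [4, 3, 2, 1]             # hypothetical: 4 inputs, two hidden layers, 1 output
parameters = {}
for l in range(1, len(layer_dims)):
    parameters["W" + str(l)] = rng.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

X = rng.randn(4, 5)                   # 5 examples
AL = forward(X, parameters)
print(len(parameters) // 2, AL.shape)  # 3 (1, 5)
```

Each column of AL is a probability between 0 and 1 for one example.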

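3.3 Computing the cost

The purpose of computing the cost is to confirm that the network keeps improving from one iteration to the next.

First, the formula:

$-\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)$

Implementing the cost function amounts to evaluating the formula above: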

def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).

    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]

    # Compute loss from aL and y.
    cost = -(1/m) * (np.dot(Y, np.log(AL).T) + np.dot((1-Y), np.log(1-AL).T))

    cost = np.squeeze(cost)  # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())

    return cost
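As a quick sanity check, the dot-product form used in compute_cost can be compared with a direct element-wise evaluation of the formula. The predictions and labels below are made-up values:

```python
import numpy as np

AL = np.array([[0.8, 0.9, 0.4]])   # hypothetical predicted probabilities
Y = np.array([[1, 1, 0]])          # hypothetical labels
m = Y.shape[1]

# Dot-product form, as in compute_cost.
cost_dot = -(1 / m) * (np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T))
cost_dot = np.squeeze(cost_dot)

# Element-wise form, read straight off the formula.
cost_sum = -np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))

print(np.isclose(cost_dot, cost_sum))  # True
print(round(float(cost_sum), 4))       # 0.2798
```

Here the cost is $-\frac{1}{3}(\log 0.8 + \log 0.9 + \log 0.6)$, since the third example has label 0 and contributes $\log(1 - 0.4)$.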

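The cost function also serves another purpose: it gives us a "direction" for optimization, namely the direction that lowers the cost.

This is done by taking the partial derivatives of the cost function with respect to the parameters of the linear part of the network.

That is exactly the function we build next: backward propagation.

3.4 Backward propagation

The last part of building the network is backward propagation. Start with the flow chart:

Our goal in backward propagation is to obtain $dW$ and $db$.

Once we have $dZ$, these are easy to obtain, because Z is a linear function of W and b: $Z=WA+b$

Therefore:

$dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$

$db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$

$dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$

The benefit of writing the targets in this form is that $dW$ and $db$ are both functions of $dZ$ and $A$.

In other words, the only inputs we need are $dZ$ and the values cached during the forward pass (A from the previous layer, plus W for computing $dA^{[l-1]}$).

Here is the code: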

def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)

    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = (1/m) * np.dot(dZ, A_prev.T)
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db
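One way to gain confidence in these formulas is a finite-difference check. The sketch below uses a toy cost that is linear in Z, $J = \frac{1}{m}\sum G \odot Z$, so that the per-example gradient dZ equals the fixed matrix G. The toy cost, the matrix G and the sizes are assumptions made purely for this check:

```python
import numpy as np

def linear_backward(dZ, cache):
    """Same formulas as in the article, without the shape assertions."""
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = (1 / m) * np.dot(dZ, A_prev.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

rng = np.random.RandomState(0)
A_prev = rng.randn(3, 4)   # 3 units in the previous layer, 4 examples
W = rng.randn(2, 3)
b = rng.randn(2, 1)
G = rng.randn(2, 4)        # plays the role of dZ for the toy cost

def toy_cost(W_):
    Z = np.dot(W_, A_prev) + b
    return np.sum(G * Z) / A_prev.shape[1]

dA_prev, dW, db = linear_backward(G, (A_prev, W, b))

# Central finite difference on a single entry of W.
eps = 1e-6
W_plus = W.copy()
W_plus[0, 1] += eps
W_minus = W.copy()
W_minus[0, 1] -= eps
numeric = (toy_cost(W_plus) - toy_cost(W_minus)) / (2 * eps)

print(np.isclose(numeric, dW[0, 1], atol=1e-6))  # True
```

The same idea extends to db; the check is deliberately limited to dW to keep the sketch short.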

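With this function, all we need is $dZ$ to carry out the actual backward propagation.

At this point we again need the functions from the library mentioned at the beginning: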

from dnn_utils_v2 import sigmoid, sigmoid_backward, relu, relu_backward

Here sigmoid_backward and relu_backward make it very convenient to obtain $dZ$.
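Since the helper module itself is not listed in this article, here is a hedged sketch of what the two backward helpers could look like. It assumes the activation cache is simply Z and applies the chain rule $dZ = dA \cdot g'(Z)$:

```python
import numpy as np

def sigmoid_backward(dA, cache):
    """Assumed behaviour: dZ = dA * s * (1 - s), where s = sigmoid(Z)."""
    Z = cache
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward(dA, cache):
    """Assumed behaviour: pass dA through where Z > 0, zero elsewhere."""
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.ones_like(Z)
print(relu_backward(dA, Z))  # [[0. 0. 1.]]
```

The real dnn_utils_v2 functions may be written differently; the sketch only illustrates the derivative each one has to apply.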

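Next we build the backward propagation function. As with forward propagation, we need to distinguish which activation function is used: "relu" or "sigmoid".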

def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.

    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"

    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache

    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)

    return dA_prev, dW, db

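Building on this function, we go on to complete the backward propagation function for the L-layer network:

Step 1: compute $dA^{[L]}$, that is, the partial derivative of $-\frac{1}{m} \sum\limits_{i = 1}^{m} \left( y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \right)$ with respect to $a^{[L]}$.
Step 2: use linear_activation_backward() to propagate the derivative back one layer at a time.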
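For reference, the first step can be written out for a single example. Differentiating the per-example loss with respect to the output activation gives the expression that the code below uses to initialize dAL. The factor $\frac{1}{m}$ does not appear here because linear_backward applies it when forming $dW$ and $db$:

```latex
\mathcal{L}(a, y) = -\left( y \log a + (1 - y) \log (1 - a) \right)

\frac{\partial \mathcal{L}}{\partial a}
  = -\left( \frac{y}{a} - \frac{1 - y}{1 - a} \right)
```

With $a = a^{[L](i)}$ and $y = y^{(i)}$, this is applied element-wise to AL and Y.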

def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])

    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL

    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: AL, Y, caches.
    # Outputs: grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)]
    # Note: as written, grads["dA" + str(k)] holds the gradient with respect to the activations of layer k-1.
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, activation="sigmoid")

    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: grads["dA" + str(l + 2)], caches.
        # Outputs: grads["dA" + str(l + 1)], grads["dW" + str(l + 1)], grads["db" + str(l + 1)]
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l+2)], current_cache, activation="relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

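3.5 Updating the parameters

Now it is time to update the parameters. Having obtained $dW$ and $db$, we multiply them by the learning rate $\alpha$ and subtract the result from the current parameters.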

def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of L_model_backward

    Returns:
    parameters -- python dictionary containing your updated parameters
                  parameters["W" + str(l)] = ...
                  parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2  # number of layers in the neural network

    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]

    return parameters
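A one-layer numeric example shows the update rule $W := W - \alpha \, dW$, $b := b - \alpha \, db$ in action. The parameter values, gradients and learning rate are arbitrary illustration values:

```python
import numpy as np

parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, -0.2]]), "db1": np.array([[1.0]])}
learning_rate = 0.1

L = len(parameters) // 2   # one W and one b per layer
for l in range(1, L + 1):
    parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
    parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

print(parameters["W1"])  # [[0.99 2.02]]
print(parameters["b1"])  # [[0.4]]
```

A positive gradient entry lowers the corresponding parameter and a negative one raises it, which is what moving against the gradient means.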

4. Putting the model together

The last step is to put the model together.

First, look at the functions we have already written:

def initialize_parameters_deep(layer_dims):
    ...
    return parameters

def L_model_forward(X, parameters):
    ...
    return AL, caches

def compute_cost(AL, Y):
    ...
    return cost

def L_model_backward(AL, Y, caches):
    ...
    return grads

def update_parameters(parameters, grads, learning_rate):
    ...
    return parameters

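These functions also correspond to the order in which we built the network:

Initialize the parameters
Forward propagation
Compute the cost
Backward propagation
Update the parameters

Code: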

def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):  # lr was 0.009
    """
    Implements a L-layer neural network: [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID.

    Arguments:
    X -- data, numpy array of shape (num_px * num_px * 3, number of examples)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
    layers_dims -- list containing the input size and each layer size, of length (number of layers + 1).
    learning_rate -- learning rate of the gradient descent update rule
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, it prints the cost every 100 steps

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(1)
    costs = []  # keep track of cost

    # Parameters initialization.
    parameters = initialize_parameters_deep(layers_dims)

    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation: [LINEAR -> RELU]*(L-1) -> LINEAR -> SIGMOID.
        AL, caches = L_model_forward(X, parameters)

        # Compute cost.
        cost = compute_cost(AL, Y)

        # Backward propagation.
        grads = L_model_backward(AL, Y, caches)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print and record the cost every 100 iterations
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
            costs.append(cost)

    # plot the cost
    plt.plot(np.squeeze(costs))
    plt.ylabel("cost")
    plt.xlabel("iterations (per hundreds)")
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters

By iterating repeatedly we train the model parameters and end up with a model that performs well.

5. Testing on data

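train_x_orig, train_y, test_x_orig, test_y, classes = load_data()  # load the data
# train_x, test_x and layers_dims are not defined in this snippet; they are assumed to be
# prepared beforehand from the loaded data (model inputs and the list of layer sizes).
parameters = L_layer_model(train_x, train_y, layers_dims, num_iterations=2500, print_cost=True)  # train the model parameters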

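We can see that the model's cost keeps decreasing as training proceeds.

pred_train = predict(train_x, train_y, parameters)
> Accuracy: 0.985645933014

The accuracy on the training set reaches 0.9856, which is very high.

pred_test = predict(test_x, test_y, parameters)
> Accuracy: 0.8

The accuracy on the test set also reaches 0.8.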
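The predict function is not listed in this article. As a rough idea of what the accuracy figure means, the sketch below thresholds output probabilities at 0.5 and compares them with the labels. The helper name predict_from_probabilities, the threshold and the sample values are all hypothetical:

```python
import numpy as np

def predict_from_probabilities(AL, Y, threshold=0.5):
    """Hypothetical helper: turn probabilities into 0/1 predictions and score them."""
    predictions = (AL > threshold).astype(int)
    accuracy = float(np.mean(predictions == Y))
    return predictions, accuracy

AL = np.array([[0.9, 0.2, 0.6, 0.4, 0.7]])   # made-up network outputs
Y = np.array([[1, 0, 0, 0, 1]])              # made-up labels

predictions, accuracy = predict_from_probabilities(AL, Y)
print(accuracy)  # 0.8
```

In a full pipeline, AL would come from running L_model_forward on the data with the trained parameters.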

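OK, with that our goal is achieved: we built an L-layer neural network, trained it on cat images, and the trained model reaches 80% accuracy on the test set.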