Coursera deeplearning.ai Notes: Shallow Neural Network

03-02

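This week is a drill on shallow neural networks, covering three big things: forward and backward propagation, activation functions, and random initialization. A single neuron in a neural network is structured almost exactly like logistic regression, so the course uses logistic regression as the lead-in example and walks through its forward and backward propagation.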

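In the abstract, every neuron boils down to $z=W^{T}X+b$, $a=\sigma(z)$: it takes the outputs of the previous layer as input, applies a linear operation, passes the result through an activation function, and hands its output to the neurons connected downstream.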
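To make the single-neuron formula concrete, here is a minimal NumPy sketch. The function name `neuron_forward`, the column-per-example layout of `X`, and the toy numbers are assumptions made for illustration, not something taken from the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(w, b, X):
    """One neuron: z = w^T X + b, then a = sigmoid(z).

    w: (n_x, 1) weights, b: scalar bias, X: (n_x, m), one example per column.
    """
    z = np.dot(w.T, X) + b   # linear step, shape (1, m)
    a = sigmoid(z)           # activation, shape (1, m)
    return a

# Toy input (made up): 2 features, 3 examples.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
w = np.zeros((2, 1))         # with zero weights and bias, z = 0 for every example
a = neuron_forward(w, 0.0, X)
```

With zero weights and zero bias every activation is `sigmoid(0) = 0.5`, which is a quick sanity check on the shapes and the formula.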

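Backpropagation, though, is what Andrew calls the most complicated and hardest-to-follow mathematical derivation in neural networks. My take: deriving it by hand will certainly make your hand sore, but understanding it is not that hard. Explanations of backpropagation are all over the web, and most of them use the chain rule to explain the forward and backward passes.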

Here is one link, picked offhand:

一文弄懂神經網路中的反向傳播法--BackPropagation (roughly, "Understanding Backpropagation in Neural Networks in One Article") - Charlotte77 - 博客園 (cnblogs)

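But here I want to write down something of my own. I had already studied neural networks in Andrew's Machine Learning course, but I always understood backpropagation as "the error is passed backwards to adjust each parameter". It is not quite that simple; that was my shallow view at the time.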

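In fact, attentive readers (I am not one) can see that prediction, i.e. forward propagation, is really the evaluation of a composite function nested layer upon layer upon layer, with each layer carrying its own parameters $W, b$. Backpropagation exists to optimize those $W, b$: we take the partial derivatives of the loss function with respect to $W, b$ and use them to move toward a parameter combination where the gradient of the loss is zero (in practice, close to zero), i.e. where the gap between predicted and actual values is as small as we can make it. To get those partial derivatives, the function has to be peeled open one layer at a time. So backpropagation is essentially differentiation of a composite function, except that instead of expanding all the way down as in a typical calculus exercise, you only expand down to the layer that holds the $W, b$ in question.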
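As a worked instance of this layer-by-layer peeling, here is the chain rule applied to a network with one tanh hidden layer and a sigmoid output, for a single example $(x, y)$ with cross-entropy loss. The superscript notation $W^{[1]}, W^{[2]}$ and the shorthand $dz = \partial\mathcal{L}/\partial z$ are assumptions borrowed from common course conventions; they do not appear in the text above.

$$
\begin{aligned}
z^{[1]} &= W^{[1]}x + b^{[1]}, & a^{[1]} &= \tanh(z^{[1]}),\\
z^{[2]} &= W^{[2]}a^{[1]} + b^{[2]}, & a^{[2]} &= \sigma(z^{[2]}),\\
\mathcal{L} &= -\bigl(y\log a^{[2]} + (1-y)\log(1-a^{[2]})\bigr).
\end{aligned}
$$

Peeling from the outside in, each gradient reuses the one computed just before it:

$$
\begin{aligned}
dz^{[2]} &= \frac{\partial\mathcal{L}}{\partial a^{[2]}}\,\sigma'(z^{[2]}) = a^{[2]} - y,\\
dW^{[2]} &= dz^{[2]}\,a^{[1]T}, \qquad db^{[2]} = dz^{[2]},\\
dz^{[1]} &= W^{[2]T}dz^{[2]} \odot \bigl(1 - (a^{[1]})^{2}\bigr),\\
dW^{[1]} &= dz^{[1]}\,x^{T}, \qquad db^{[1]} = dz^{[1]}.
\end{aligned}
$$

Here $\odot$ is the element-wise product and $1 - (a^{[1]})^{2}$ is the derivative of tanh. Averaging these over $m$ examples gives the six lines used in the assignment's `backward_propagation`.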

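You might ask why we cannot just take the derivative directly and need backpropagation instead. Because the computer is dumb: it cannot derive formulas by itself. There are four common ways to compute derivatives on a computer: manual analytic differentiation, numerical differentiation, symbolic differentiation, and automatic differentiation. Backpropagation is closely tied to the last one (it is essentially reverse-mode automatic differentiation); read up on it if you are interested.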
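To illustrate two of the four approaches side by side, the sketch below compares a hand-derived (analytic) gradient with a numerical central-difference estimate on a one-weight sigmoid neuron. The toy loss, the fixed example `x = 2, y = 1`, and all function names are assumptions chosen for the demo.

```python
import numpy as np

X_VAL, Y_VAL = 2.0, 1.0  # one fixed training example (made up)

def loss(w):
    """Squared error of a one-weight sigmoid neuron on the fixed example."""
    a = 1.0 / (1.0 + np.exp(-w * X_VAL))
    return 0.5 * (a - Y_VAL) ** 2

def analytic_grad(w):
    """Chain rule by hand: dL/dw = (a - y) * a * (1 - a) * x."""
    a = 1.0 / (1.0 + np.exp(-w * X_VAL))
    return (a - Y_VAL) * a * (1.0 - a) * X_VAL

def numeric_grad(func, w, eps=1e-5):
    """Central-difference approximation of d func / d w."""
    return (func(w + eps) - func(w - eps)) / (2.0 * eps)

w0 = 0.3
g_analytic = analytic_grad(w0)
g_numeric = numeric_grad(loss, w0)
```

The two values should agree to many decimal places. The numerical version needs two loss evaluations per parameter, which is why it is normally used only to check a backpropagation implementation rather than to train with.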

OK, after that, there are two more important things.

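Look familiar? Ring a bell? The second important thing: activation functions. There are two questions here. First, why use a nonlinear activation function? From the bionics angle, it is like our own neurons: a neuron has to receive a certain amount of stimulation before it fires and passes a signal on to the next one. Like the stories told about other bio-inspired algorithms, though, this description is a bit of a sales pitch, and it does not matter if you have never heard it.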

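Andrew's explanation, as I took it, is roughly "without a nonlinear activation function to squash things, the output grows or shrinks without bound just like linear regression", which the week-one lectures also touched on; in other words, it keeps the forward pass from pushing too hard. (More precisely, without a nonlinearity the stacked layers collapse into a single linear map, so adding layers buys nothing.) However!!! Precisely because of that fear of pushing too hard, neural networks stuck with sigmoid for many years, which led to... backpropagation that pushes too softly: once the network gets deep, the gradients shrink layer by layer and backpropagation stops working.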
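One standard way to see why the nonlinearity matters is to check numerically that two layers with no activation are exactly one linear layer. The sketch below does that with random matrices; the shapes, the seed, and the variable names are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))                      # 3 features, 5 examples
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with the identity "activation".
two_layer_linear = W2 @ (W1 @ X + b1) + b2

# The same function written as ONE linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ X + b

# With tanh in between, the network is no longer that single linear map.
two_layer_tanh = W2 @ np.tanh(W1 @ X + b1) + b2

collapses = np.allclose(two_layer_linear, one_layer)
tanh_differs = not np.allclose(two_layer_tanh, one_layer)
```

`collapses` comes out `True` no matter how many linear layers are stacked, while the tanh version is a genuinely different function.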

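Second question: how to choose. Summing up the course: "Unless your output layer is doing 0/1 classification, never use sigmoid; use tanh instead. The default is ReLU, and leaky ReLU is fine too."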
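For reference, here is a small sketch of the four activation functions mentioned and their derivatives. The leaky ReLU slope of `0.01` is a common default assumed here, and treating the ReLU derivative at exactly zero as `0` is a convention of this sketch, not something the notes specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2    # peaks at 1.0 when z = 0

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)    # 1 for positive inputs, 0 otherwise

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def d_leaky_relu(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

z = np.array([-10.0, 0.0, 10.0])
sig_slopes = d_sigmoid(z)           # tiny at both ends: sigmoid saturates
relu_slopes = d_relu(z)
```

The sigmoid slope never exceeds 0.25 and is nearly zero for large `|z|`, which is the "pushes too softly" problem described above; ReLU keeps a slope of 1 for every positive input.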

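The third important thing: initialization. 1. Never initialize all the weights to zero (the biases can be zero). 2. Try to initialize the weights to small values, because for small inputs the slope of the activation function (sigmoid or tanh) is large, so signals pass well in both the forward and backward direction. Sometimes larger values are considered too, but that is for a later course.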
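The two initialization rules can be checked with a small experiment: train the same one-hidden-layer network (tanh hidden layer, sigmoid output) from an all-zero start and from a small random start, then look at the hidden-layer weights. The helper `train`, the toy dataset, the step count, and the learning rate are all assumptions made up for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, W2, X, Y, steps=50, lr=0.5):
    """A few gradient descent steps; biases start at zero."""
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((W2.shape[0], 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)                     # forward pass
        A2 = sigmoid(W2 @ A1 + b2)
        dZ2 = A2 - Y                                  # backward pass
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        W1, b1 = W1 - lr * dW1, b1 - lr * db1         # update
        W2, b2 = W2 - lr * dW2, b2 - lr * db2
    return W1, W2

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 20))
Y = (X[0:1] * X[1:2] > 0).astype(float)               # toy labels, made up

W1_zero, _ = train(np.zeros((4, 2)), np.zeros((1, 4)), X, Y)
W1_rand, _ = train(rng.standard_normal((4, 2)) * 0.01,
                   rng.standard_normal((1, 4)) * 0.01, X, Y)

zero_rows_identical = np.allclose(W1_zero, W1_zero[0])
rand_rows_identical = np.allclose(W1_rand, W1_rand[0])
```

From the zero start every hidden unit receives the same gradient (here exactly zero), so the rows of `W1` stay identical and the four hidden units behave like one. The small random start breaks that symmetry.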

----------------Assignment----------------

This assignment is actually pretty easy: just type in the formulas as given, start to finish.

import numpy as np


def sigmoid(z):
    # Not defined in the snippets themselves (presumably provided by the assignment);
    # added here so the listing runs on its own.
    return 1 / (1 + np.exp(-z))


# X and Y are the dataset arrays loaded earlier in the assignment notebook.
### START CODE HERE ### (≈ 3 lines of code)
shape_X = X.shape
shape_Y = Y.shape
m = shape_X[1]  # training set size
### END CODE HERE ###


def layer_sizes(X, Y):
    ### START CODE HERE ### (≈ 3 lines of code)
    n_x = X.shape[0]  # size of input layer
    n_h = 4
    n_y = Y.shape[0]  # size of output layer
    ### END CODE HERE ###
    return (n_x, n_h, n_y)


def initialize_parameters(n_x, n_h, n_y):
    ### START CODE HERE ### (≈ 4 lines of code)
    W1 = np.random.randn(n_h, n_x) * 0.01  # small random weights break symmetry
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    ### END CODE HERE ###
    parameters = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
    return parameters


def forward_propagation(X, parameters):
    ### START CODE HERE ### (≈ 4 lines of code)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    ### END CODE HERE ###
    # Implement Forward Propagation to calculate A2 (probabilities)
    ### START CODE HERE ### (≈ 4 lines of code)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    ### END CODE HERE ###
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache


def compute_cost(A2, Y, parameters):
    m = Y.shape[1]  # number of examples
    ### START CODE HERE ### (≈ 2 lines of code)
    logprobs = np.dot(np.log(A2), Y.T) + np.dot(np.log(1 - A2), (1 - Y).T)
    cost = -logprobs / m
    ### END CODE HERE ###
    cost = float(np.squeeze(cost))  # turn the 1x1 array into a plain float
    return cost


def backward_propagation(parameters, cache, X, Y):
    m = X.shape[1]
    # First, retrieve W1 and W2 from the dictionary "parameters".
    ### START CODE HERE ### (≈ 2 lines of code)
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    ### END CODE HERE ###
    # Retrieve also A1 and A2 from dictionary "cache".
    ### START CODE HERE ### (≈ 2 lines of code)
    A1 = cache["A1"]
    A2 = cache["A2"]
    ### END CODE HERE ###
    # Backward propagation: calculate dW1, db1, dW2, db2.
    ### START CODE HERE ### (≈ 6 lines of code, corresponding to 6 equations on slide above)
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))  # 1 - A1^2 is the tanh derivative
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    ### END CODE HERE ###
    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
    return grads


def update_parameters(parameters, grads, learning_rate=1.2):
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    # Retrieve each gradient from the dictionary "grads"
    ### START CODE HERE ### (≈ 4 lines of code)
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]
    ### END CODE HERE ###
    # Update rule for each parameter
    ### START CODE HERE ### (≈ 4 lines of code)
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    ### END CODE HERE ###
    parameters = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
    return parameters


def nn_model(X, Y, n_h, num_iterations=10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]

    # Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    ### START CODE HERE ### (≈ 5 lines of code)
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    ### END CODE HERE ###

    # Loop (gradient descent)
    for i in range(0, num_iterations):
        ### START CODE HERE ### (≈ 4 lines of code)
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads)
        ### END CODE HERE ###

        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters


def predict(parameters, X):
    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    ### START CODE HERE ### (≈ 2 lines of code)
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)
    ### END CODE HERE ###
    return predictions