Coursera deeplearning.ai Notes — Regularizing your neural network

The second part of Week 1 covers regularization. Most Chinese write-ups translate the term as "rules" (規則化), but I personally prefer to think of it as a constraint: it constrains the values the weights can take, preventing the weights from growing so large that the model performs extremely well on the training data at the expense of its performance on unseen data.

The recipe itself is simple: add a term $\frac{\lambda}{2m}\left\|w\right\|_{2}^{2}$ to the loss function. In general only $W$ is penalized and not $b$, because $W$ has far more parameters than $b$; you can include $b$ if you like, it just makes little difference. The quantity $\left\|w\right\|_{2}^{2}$ sounds intimidating when you call it the L2 norm, but it is just the squared Euclidean distance: square every entry of $w$, add them up, and multiply by $\frac{\lambda}{2m}$. Some people use L1 regularization instead, but Andrew mostly sticks with L2. L1 regularization makes $W$ sparse (lots of zeros), which can compress the model, but he does not find that very useful. $\lambda$ is a hyperparameter that has to be tuned. One more thing: lambda is a reserved word in Python, which is why the assignment code calls it lambd.
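Written out for the three-layer network used in the assignment, the L2-regularized cost (the quantity compute_cost_with_regularization below returns) is:

$$J_{regularized} = \underbrace{-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log a^{[3](i)} + (1-y^{(i)})\log\left(1-a^{[3](i)}\right)\right)}_{\text{cross-entropy cost}} + \underbrace{\frac{\lambda}{2m}\sum_{l}\sum_{k}\sum_{j}\left(W^{[l]}_{k,j}\right)^{2}}_{\text{L2 regularization cost}}$$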

OK, this is where the concept of "weight decay" comes in; the derivation is given in the lecture. During backpropagation the newly added $\frac{\lambda}{2m}\left\|w\right\|_{2}^{2}$ term also has to be differentiated, and after a bit of basic algebra you find that on every update $W$ gets multiplied by a factor slightly smaller than one (the factor highlighted in the red box on the slide). This slows the growth of $w$ and weakens the model's ability to fit the training data.
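Spelling that out (with $\alpha$ as the learning rate):

$$dW^{[l]} = (\text{term from backprop}) + \frac{\lambda}{m}W^{[l]}, \qquad W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} = \left(1 - \frac{\alpha\lambda}{m}\right)W^{[l]} - \alpha\,(\text{term from backprop})$$

The multiplier $1 - \frac{\alpha\lambda}{m} < 1$ is the number in that red box, and it is exactly why L2 regularization is also called weight decay.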

Alright, as usual Andrew also explains why this works. With tanh as the activation function: as $\lambda$ goes up, $W$ goes down, which pushes the pre-activations toward the region near zero where tanh is roughly linear and therefore not very expressive, so the network cannot realize anything too complex.
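In symbols: $\tanh(z) \approx z$ when $z$ is close to zero, so with small $W^{[l]}$ (and therefore small $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$) every layer behaves like a linear map, and a composition of linear maps is still linear, no matter how deep the network is.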

Tips: plot the gradient-descent progress (the cost including the regularization term) to check that it is really decreasing. Also keep in mind that a smaller network architecture is in itself a form of regularization.
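A minimal sketch of that plotting tip, assuming the regularized cost has been recorded into a list called costs during training (the function and variable names here are my own, for illustration only):

import matplotlib.pyplot as plt

def plot_costs(costs, learning_rate):
    # costs: the regularized cost J recorded every so many iterations during training
    plt.plot(costs)
    plt.ylabel('cost (including the L2 term)')
    plt.xlabel('iterations (per hundreds)')
    plt.title('Learning rate = ' + str(learning_rate))
    plt.show()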

Next up is the other regularization workhorse: dropout. When I first watched the video I thought the network connections were literally being broken; here is a plainer way to put it.

When the next layer is about to receive the mapping from the previous layer, a random subset of the previous layer's neurons is knocked out (their values are set to zero), and the surviving activations are first divided by keep_prob to scale them back up before being passed on.
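That is exactly the four-step "inverted dropout" pattern implemented in the assignment code further down; a minimal self-contained numpy sketch of it for one layer (A1 here is a dummy activation matrix, just for illustration):

import numpy as np

keep_prob = 0.8
A1 = np.random.randn(4, 5)                     # pretend activations of some hidden layer

D1 = np.random.rand(A1.shape[0], A1.shape[1])  # Step 1: random matrix with entries in [0, 1)
D1 = (D1 < keep_prob)                          # Step 2: boolean mask, True with probability keep_prob
A1 = A1 * D1                                   # Step 3: shut down the masked neurons
A1 = A1 / keep_prob                            # Step 4: scale up so the expected activation is unchanged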

OK, as for why dropout works: the principle is simple to state, and if you want the hard version you can always Baidu it, but as far as I know there is still no fully rigorous theory for it (at least within the limits of my current knowledge), so I just go with Andrew's intuitive explanations. The first: dropout effectively trains a simpler model on the data, weakening its ability to fit the training set and improving generalization. The second: because any given neuron is sometimes present and sometimes not, backpropagation ends up spreading the weights more evenly across that layer's incoming connections instead of relying heavily on any single one, and weights spread evenly over many nodes reach the activation more reliably. (You can work through a small example on scratch paper.)

Here are the important points from the lecture:

  1. Turn dropout off when making predictions.
  2. If the model is not overfitting, there is no need to consider dropout at all; computer vision is the main field that relies on it almost by default, because the image inputs are so large that you essentially never have enough data.
  3. With dropout the cost function is no longer well defined, so plotting it to check that gradient descent is working breaks down; I would turn dropout off for that check.
  4. The keep probability is usually high for the first layer, lower for the densely connected middle layers, and higher again toward the end of the network.
  5. There are other regularization methods, e.g. early stopping, but I still prefer the lambda (L2) approach.
  6. You can get new training data by flipping, zooming, and distorting your images (see the sketch right after this list).
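A minimal sketch of point 6, assuming an image stored as a numpy array img of shape (height, width, channels); this uses plain numpy only and is not part of the course code:

import numpy as np

img = np.random.rand(64, 64, 3)      # placeholder image, for illustration only

flipped_lr = np.fliplr(img)          # horizontal flip (mirror left-right)
flipped_ud = np.flipud(img)          # vertical flip
# A crude "zoom": crop the central region, then repeat pixels back up to the original size
crop = img[8:56, 8:56, :]
zoomed = np.repeat(np.repeat(crop, 2, axis=0), 2, axis=1)[:64, :64, :]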

----------------Assignment----------------

In this assignment, the "What you should remember" summaries at the end of parts 2 and 3 in particular are worth mulling over.

What you should remember about dropout:

  • Dropout is a regularization technique.
  • You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is a value other than 0.5.
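A quick numeric check of that expected-value argument (my own illustration, not part of the assignment):

import numpy as np

np.random.seed(0)
keep_prob = 0.5
a = 3.0                                          # some activation value

# The average of mask * a / keep_prob over many dropout draws is close to a itself
masks = (np.random.rand(100000) < keep_prob)
print((masks * a / keep_prob).mean())            # prints approximately 3.0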

import numpy as np
# relu, sigmoid and compute_cost used below are the helper functions shipped with the assignment.

def compute_cost_with_regularization(A3, Y, parameters, lambd):
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]

    cross_entropy_cost = compute_cost(A3, Y)  # the cross-entropy part of the cost

    ### START CODE HERE ### (approx. 1 line)
    L2_regularization_cost = lambd * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3))) / (2 * m)
    ### END CODE HERE ###

    cost = cross_entropy_cost + L2_regularization_cost
    return cost


def backward_propagation_with_regularization(X, Y, cache, lambd):
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1./m * np.dot(dZ3, A2.T) + lambd * W3 / m
    ### END CODE HERE ###
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1./m * np.dot(dZ2, A1.T) + lambd * W2 / m
    ### END CODE HERE ###
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1./m * np.dot(dZ1, X.T) + lambd * W1 / m
    ### END CODE HERE ###
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients


def forward_propagation_with_dropout(X, parameters, keep_prob=0.5):
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)  # Steps 1-4 below correspond to the Steps 1-4 described above.
    D1 = np.random.rand(A1.shape[0], A1.shape[1])  # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = (D1 < keep_prob)                          # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold)
    A1 = A1 * D1                                   # Step 3: shut down some neurons of A1
    A1 = A1 / keep_prob                            # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])  # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = (D2 < keep_prob)                          # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    A2 = A2 * D2                                   # Step 3: shut down some neurons of A2
    A2 = A2 / keep_prob                            # Step 4: scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    return A3, cache


def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    """
    Implements the backward propagation of our baseline model to which we added dropout.

    Arguments:
    X -- input dataset, of shape (2, number of examples)
    Y -- "true" labels vector, of shape (output size, number of examples)
    cache -- cache output from forward_propagation_with_dropout()
    keep_prob -- probability of keeping a neuron active during drop-out, scalar

    Returns:
    gradients -- A dictionary with the gradients with respect to each parameter, activation and pre-activation variables
    """
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (approx. 2 lines of code)
    dA2 = dA2 * D2         # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 / keep_prob  # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T)
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (approx. 2 lines of code)
    dA1 = dA1 * D1         # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 / keep_prob  # Step 2: Scale the value of neurons that haven't been shut down
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
