Deep Learning Basic Concepts 5: Quiz Questions and Detailed Solutions on Deep Learning Fundamentals

This column is edited and updated on an ongoing basis. If the article you are reading looks unfinished, the author is most likely still updating it or did not finish the previous update; please be patient. The normal pace is one article per day.

This article is part of my notes on Andrew Ng's Coursera specialization "Deep Learning" (DeepLearning); this installment covers the Week 3 quiz. It was first published in the Zhihu column "Deep Learning + Natural Language Processing (NLP)".

The goal of this series is to clarify some of the basic concepts of deep learning.

The main text follows:

====================================================================

Getting straight to the point, here are the quiz questions and solutions:

1. Which of the following are true? (Check all that apply.)

  1. X is a matrix in which each column is one training example.
  2. a^{[2]} denotes the activation vector of the 2nd layer.
  3. a^{[2]}_4 is the activation output by the 4th neuron of the 2nd layer
  4. a^{[2](12)} denotes the activation vector of the 2nd layer for the 12th training example.
  5. a^{[2](12)} denotes activation vector of the 12th layer on the 2nd training example.
  6. X is a matrix in which each row is one training example.
  7. a^{[2]}_4 is the activation output of the 2nd layer for the 4th training example

Question 1 is a multiple-select question: pick the correct statements. The superscript in square brackets [ ] indicates the layer, the number in the parenthesized superscript ( ) indicates which training example, and the subscript indicates which neuron within that layer.

So the answers are 1, 2, 3 and 4.
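
As a quick illustration of the notation (the sizes below, 5 input features, 12 examples and a 5-unit layer 2, are made-up assumptions, not from the quiz):

import numpy as np

n_x, m = 5, 12                  # assumed: 5 input features, 12 training examples
X = np.random.randn(n_x, m)     # each COLUMN of X is one training example
x_12 = X[:, 11]                 # the 12th training example (0-based index 11)

A2 = np.random.randn(5, m)      # layer-2 activations for all m examples (layer 2 assumed to have 5 units)
a2_12 = A2[:, 11]               # a^{[2](12)}: layer-2 activation vector for example 12
a2 = A2[:, 0]                   # a^{[2]}: layer-2 activation vector for a single example
a2_4 = a2[3]                    # a^{[2]}_4: activation of the 4th neuron of layer 2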

2. The tanh activation usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer. True/False?

  • True
  • False

True/False question: the advantage of the tanh activation over sigmoid for hidden units is that its output mean is close to zero, which centers the data better for the next layer. The answer is True.
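
A quick empirical check makes the point (not part of the quiz; the sample size is arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.random.randn(100000)      # zero-mean inputs
print(np.tanh(z).mean())         # close to 0
print(sigmoid(z).mean())         # close to 0.5, so the next layer sees off-center data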

3. Which of these is a correct vectorized implementation of forward propagation for layer l, where 1 ≤ l ≤ L?

  • Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]},  A^{[l]} = g^{[l]}(Z^{[l]})
  • Z^{[l]} = W^{[l]} A^{[l]} + b^{[l]},  A^{[l+1]} = g^{[l]}(Z^{[l]})
  • Z^{[l]} = W^{[l-1]} A^{[l]} + b^{[l-1]},  A^{[l]} = g^{[l]}(Z^{[l]})
  • Z^{[l]} = W^{[l]} A^{[l]} + b^{[l]},  A^{[l+1]} = g^{[l+1]}(Z^{[l]})

This question asks for the correct vectorized forward-propagation formula for layer l. Since l is the current layer, its input activations come from the previous layer, so the superscript on A must be l-1. The correct answer is the first option (a).
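
As a minimal sketch of this step (the layer sizes, the ReLU choice for g^{[l]} and the 0.01 scaling are illustrative assumptions, not from the quiz):

import numpy as np

def relu(z):
    return np.maximum(0, z)

m = 10                               # number of training examples (assumed)
A_prev = np.random.randn(3, m)       # A^{[l-1]}: activations of the previous layer
W = np.random.randn(4, 3) * 0.01     # W^{[l]}: (units in layer l, units in layer l-1)
b = np.zeros((4, 1))                 # b^{[l]}: one bias per unit, broadcast across examples

Z = W @ A_prev + b                   # Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, shape (4, m)
A = relu(Z)                          # A^{[l]} = g^{[l]}(Z^{[l]})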

4. You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?

  • ReLU
  • Leaky ReLU
  • sigmoid
  • tanh

For a binary classification output like this (essentially logistic regression at the output layer), sigmoid is the most suitable activation, since it produces a value in (0, 1) that can be read as a probability. The answer is (c).
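
A minimal sketch of such an output layer (the layer sizes and the 0.5 threshold are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A1 = np.random.randn(4, 10)          # hidden activations for 10 examples (assumed sizes)
W2 = np.random.randn(1, 4) * 0.01
b2 = np.zeros((1, 1))

A2 = sigmoid(W2 @ A1 + b2)           # values in (0, 1): P(y = 1 | x), i.e. "cucumber"
y_hat = (A2 > 0.5).astype(int)       # threshold at 0.5 to get the class label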

5. Consider the following code:

A = np.random.randn(4, 3)
B = np.sum(A, axis = 1, keepdims = True)

What will be B.shape? (If you're not sure, feel free to run this in python to find out).

  • (4, )
  • (4, 1)
  • (, 3)
  • (1, 3)

Reading the code: axis = 1 means summing along each row (across the columns), and keepdims = True keeps the result as a 2-D array of shape (4, 1) instead of collapsing it to (4,). So the answer is (b).
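
You can verify both shapes directly, mirroring the quiz snippet:

import numpy as np

A = np.random.randn(4, 3)
B = np.sum(A, axis=1, keepdims=True)
C = np.sum(A, axis=1)                 # same sums, but the summed axis is dropped

print(B.shape)   # (4, 1)
print(C.shape)   # (4,)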

6. Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

  • Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.
  2. Each neuron in the first hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have "broken symmetry".
  3. Each neuron in the first hidden layer will compute the same thing, but neurons in different layers will compute different things, thus we have accomplished "symmetry breaking" as described in lecture.
  4. The first hidden layer's neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.

This is about parameter initialization. If we initialize all parameters to zero, every neuron in a hidden layer computes the same output and receives the same gradient, so the neurons stay identical no matter how many gradient-descent iterations we run; symmetry is never broken. The answer is (a).
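
A small sketch of why this happens (hypothetical sizes, one forward pass written out by hand):

import numpy as np

n_x, n_h, m = 3, 4, 5
X = np.random.randn(n_x, m)

W1 = np.zeros((n_h, n_x))            # zero initialization
b1 = np.zeros((n_h, 1))

Z1 = W1 @ X + b1                     # every row of Z1 is identical (all zeros here)
A1 = np.tanh(Z1)
print(np.allclose(A1, A1[0]))        # True: all hidden units compute the same thing

Because every hidden unit also receives an identical gradient, the rows of W1 remain copies of each other after every update, which is exactly what option (a) describes.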

7. Logistic regression's weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to "break symmetry", True/False?

  • True
  • False

In logistic regression, initializing the parameters to zero does not hurt the result: there is no hidden layer, so there is no symmetry to break, and backpropagation on the (non-zero) inputs immediately moves the weights away from zero. So the answer is False.
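
A quick sketch showing that zero-initialized logistic regression still gets a useful gradient (illustrative data, one hand-written gradient step):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, m = 3, 8
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

w = np.zeros((n_x, 1))               # zero initialization
b = 0.0

A = sigmoid(w.T @ X + b)             # all predictions start at 0.5
dw = X @ (A - Y).T / m               # gradient of the cross-entropy cost w.r.t. w
print(np.any(dw != 0))               # True (almost surely): the first update already changes w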

8. You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?

  • This will cause the inputs of the tanh to also be very large, thus causing gradients to be close to zero. The optimization algorithm will thus become slow.
  • It doesn't matter. So long as you initialize the weights randomly gradient descent is not affected by whether the weights are large or small.
  • This will cause the inputs of the tanh to also be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning.
  • This will cause the inputs of the tanh to also be very large, causing the units to be "highly activated" and thus speed up learning compared to if the weights had to start from small values.

If we use tanh as the activation and initialize the weights to very large values, the pre-activations are also very large, in the region where tanh is flat. Its derivative there is close to zero, so the gradients are tiny and the optimization algorithm becomes very slow. The answer is (a).
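
To see the saturation numerically (illustrative values only):

import numpy as np

z_small = np.array([0.1, 0.5, 1.0])
z_large = np.array([10.0, 100.0, 1000.0])   # what the *1000 initialization produces

grad = lambda z: 1.0 - np.tanh(z) ** 2      # derivative of tanh
print(grad(z_small))   # reasonably sized gradients
print(grad(z_large))   # essentially zero: learning stalls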

9. Consider the following 1 hidden layer neural network (the figure shows 2 input features, a hidden layer of 4 units, and 1 output unit):

Which of the following statements are True? (Check all that apply).

  1. W[1] will have shape (2, 4)
  2. b[1] will have shape (4, 1)
  3. W[1] will have shape (4, 2)
  4. b[1] will have shape (2, 1)
  5. W[2] will have shape (1, 4)
  6. b[2] will have shape (4, 1)
  7. W[2] will have shape (4, 1)
  8. b[2] will have shape (1, 1)

On parameter dimensions, the rule of thumb is: W^{[l]} has as many rows as the number of units in layer l (here 4 for the hidden layer) and as many columns as the number of units in the previous layer (here 2 inputs); b^{[l]} always has the same number of rows as W^{[l]} and a single column. Applying the same rule to W^{[2]} and b^{[2]} gives shapes (1, 4) and (1, 1). The answers are 2, 3, 5 and 8.
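
A small sketch of initializing parameters with these sizes (the 0.01 scaling is a common choice, assumed here rather than stated in the quiz):

import numpy as np

n_x, n_h, n_y = 2, 4, 1                   # input, hidden, output sizes from the figure

W1 = np.random.randn(n_h, n_x) * 0.01     # (4, 2)
b1 = np.zeros((n_h, 1))                   # (4, 1)
W2 = np.random.randn(n_y, n_h) * 0.01     # (1, 4)
b2 = np.zeros((n_y, 1))                   # (1, 1)

print(W1.shape, b1.shape, W2.shape, b2.shape)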

10. In the same network as the previous question, what are the dimensions of Z[1] and A[1]?

  • Z[1] and A[1] are (4,m)
  • Z[1] and A[1] are (1,4)
  • Z[1] and A[1] are (4,2)
  • Z[1] and A[1] are (4,1)

For the network in the figure above, what are the dimensions of Z^{[1]} and A^{[1]}? The input X stacks all training examples, one per column, so there are m columns. After the hidden layer, Z^{[1]} and A^{[1]} have one row per hidden unit (4 rows) and still one column per example (m columns), so both are (4, m). The answer is (a).
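
Continuing the sketch from question 9 (m = 7 examples and the tanh hidden activation are arbitrary assumptions):

import numpy as np

n_x, n_h, m = 2, 4, 7
X = np.random.randn(n_x, m)               # (2, m): one column per example

W1 = np.random.randn(n_h, n_x) * 0.01
b1 = np.zeros((n_h, 1))

Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
print(Z1.shape, A1.shape)                 # (4, 7) and (4, 7), i.e. (4, m)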

