Machine Learning Notes 8: The Cost Function and Gradient Descent for the Logistic Regression Model

01-30

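You may remember the classification problems in supervised learning that we mentioned back in Notes 1.

Typical classification questions are: is an incoming email spam? Is an online transaction fraudulent? Is a tumor benign or malignant? Such yes-or-no problems usually have only two possible outcomes, so the result can be written as:

$y \in \{0, 1\}$

The result is either 0 or 1. By convention, 0 is the negative class (for example, a benign tumor) and 1 is the positive class (for example, a malignant tumor).

Let us use the tumor example:

In this figure we can see how tumor size relates to whether the tumor is malignant.

Suppose we reuse the linear fitting from before. Following Notes 2, we set up the hypothesis function $h_\theta(x) = \theta_0 + \theta_1 x$ , which gives us a straight line.

Next we turn the fitted line into predictions. Since the label is either 0 or 1, we take the midpoint 0.5 as the threshold: a tumor size with $h_\theta(x) \geq 0.5$ is predicted malignant, and one with $h_\theta(x) < 0.5$ is predicted benign.

In this example the predictions are clearly correct:

But suppose we add one more point:

Then the fitted $h_\theta(x)$ is no longer the same: the line visibly tilts further toward the lower right than before, and thresholding at 0.5 no longer works:

So applying linear regression to a classification problem is clearly not a good approach.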
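The effect described above can be reproduced with a few lines of plain Python. This is only a sketch: the tumor sizes and labels are made-up illustrative values (the original figures are not available), and `fit_line`, `classify`, and `count_errors` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def classify(theta, x, threshold=0.5):
    """Predict 1 when the fitted line is at or above the threshold."""
    theta0, theta1 = theta
    return 1 if theta0 + theta1 * x >= threshold else 0


def count_errors(theta, xs, ys):
    """Number of training examples the thresholded line gets wrong."""
    return sum(classify(theta, x) != y for x, y in zip(xs, ys))


# Hypothetical data: tumor sizes with labels 0 (benign) and 1 (malignant).
sizes = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
errors_before = count_errors(fit_line(sizes, labels), sizes, labels)

# Add one very large malignant tumor: the line flattens and the 0.5 crossing shifts.
sizes_with_outlier = sizes + [30]
labels_with_outlier = labels + [1]
errors_after = count_errors(
    fit_line(sizes_with_outlier, labels_with_outlier),
    sizes_with_outlier,
    labels_with_outlier,
)

print(errors_before, errors_after)  # prints: 0 1
```

The extra point is labeled correctly by common sense, yet it drags the fitted line far enough that the tumor of size 5 falls on the wrong side of the threshold.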

Exercise 1
Which of the following statements is true?
A. If linear regression doesn't work on a classification task as in the previous example shown in the video, applying feature scaling may help.
B. If the training set satisfies $0\leq y^{(i)}\leq 1$ for every training example $(x^{(i)},y^{(i)})$ , then linear regression's prediction will also satisfy $0\leq h_\theta(x)\leq 1$ for all values of $x$ .
C. If there is a feature $x$ that perfectly predicts $y$ , i.e. if $y=1$ when $x\geq c$ and $y=0$ whenever $x<c$ (for some constant $c$ ), then linear regression will obtain zero classification error.
D. None of the above statements are true.
Answer: D

Analysis: A. Feature scaling only helps gradient descent converge faster when there are several features; it does not change which line is fitted, so it cannot fix this problem.
B. Labels that lie in $[0,1]$ do not guarantee that the fitted line stays in $[0,1]$ ; it can output values below 0 or above 1.
C. The extra point added in the example above is a counterexample.

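So naturally we want the hypothesis $h_\theta(x)$ to stay within the interval $[0,1]$ :

$0 \leq h_\theta(x) \leq 1$

What kind of function meets this requirement?

The usual approach is to wrap the earlier linear expression in a function $g$ that limits the range of $h_\theta(x)$ :

$h_\theta(x)=g(\theta^{T}x)$ , where $g(z)=\frac{1}{1+e^{-z}}$

Its graph looks like this:

This function is called the sigmoid function (S-shaped function) or the logistic function:

$h_\theta(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$ (we interpret the output as the probability that $y=1$ )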
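As a minimal sketch, the sigmoid and the hypothesis can be written in plain Python as follows. The function names are illustrative, and the two-branch form of `sigmoid` is just one common way to avoid overflow for very negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) is at most 1
    return ez / (1.0 + ez)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include the bias term x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


print(sigmoid(0.0))  # prints: 0.5
```

With this definition, `hypothesis(theta, x)` is read as the estimated probability that $y=1$ for the input `x`.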

Exercise 2
Suppose we want to predict, from data $x$ about a tumor, whether it is malignant ( $y=1$ ) or benign ( $y=0$ ). Our logistic regression classifier outputs, for a specific tumor, $h_\theta(x)=P(y=1|x;\theta)=0.7$ , so we estimate that there is a 70% chance of this tumor being malignant. What should be our estimate for $P(y=0|x;\theta)$ , the probability the tumor is benign?
A. $P(y=0|x;\theta)=0.3$
B. $P(y=0|x;\theta)=0.7$
C. $P(y=0|x;\theta)=0.7^2$
D. $P(y=0|x;\theta)=0.3\times 0.7$
Answer: A
Analysis: $P(y=0|x;\theta)=1-P(y=1|x;\theta)=1-0.7=0.3$

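Next, let us talk about the decision boundary: what the boundary is, and what lies on each side of it.

From the graph of $g(z)$ above we can see that:

$g(z)\geq 0.5$ when $z\geq 0$ , and $g(z)<0.5$ when $z<0$

So $z=0$ is the tipping point. Because our output can only be 1 or 0, we predict:

$y=1$ if $h_\theta(x)\geq 0.5$ , and $y=0$ if $h_\theta(x)<0.5$

When the input is $z=\theta^Tx$ , this means:

$y=1$ if $\theta^Tx\geq 0$ , and $y=0$ if $\theta^Tx<0$

The relationship between input and output can therefore be summarized as:

$\theta^Tx\geq 0 \Rightarrow h_\theta(x)\geq 0.5 \Rightarrow y=1$ , and $\theta^Tx<0 \Rightarrow h_\theta(x)<0.5 \Rightarrow y=0$

Let us look at a slightly more complex example:

Suppose our training set looks like this, with the hypothesis shown on the right of the figure:

The input is $\theta^Tx=\theta_0+\theta_1x_1+\theta_2x_2$ . We do not yet know how to fit the parameters $\theta_i$ (that comes later), so assume the three parameters have already been fitted: $\theta_0=3$ , $\theta_1=1$ , $\theta_2=1$ .

Which side outputs $y=0$ and which side outputs $y=1$ ?

Go back to the decision rule from before:

The side where the input is at least 0 outputs 1, that is, $y=1$ when $3+x_1+x_2\geq 0$ .

Drawing $3+x_1+x_2=0$ on the figure, we can see that the red crosses are $y=1$ and the blue circles are $y=0$ .

The line $3+x_1+x_2=0$ is our decision boundary.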
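The decision rule can be checked numerically with a short sketch. It uses the parameter values quoted in the text; the test points are hypothetical, and `predict` is an illustrative name. Note that the sigmoid itself is not needed for the prediction, because $h_\theta(x)\geq 0.5$ exactly when $\theta^Tx\geq 0$ .

```python
def predict(theta, features):
    """Return 1 if theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    x = [1.0] + list(features)  # prepend the bias term x_0 = 1
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0


theta = [3.0, 1.0, 1.0]  # theta_0, theta_1, theta_2 as given in the text

print(predict(theta, (0.0, 0.0)))    # z = 3, prints: 1
print(predict(theta, (-2.0, -2.0)))  # z = -1, prints: 0
print(predict(theta, (-1.5, -1.5)))  # z = 0, on the boundary, prints: 1
```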

Exercise 3
Consider logistic regression with two features $x_1$ and $x_2$ . Suppose $\theta_0=5$ , $\theta_1=-1$ , $\theta_2=0$ , so that $h_\theta(x)=g(5-x_1)$ . Which of these shows the decision boundary of $h_\theta(x)$ ?
A.

B.

C.

D.

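Answer: A
Analysis: $y=1$ when $5-x_1\geq 0$ , that is, when $x_1\leq 5$ .

Let us look at an even more complex one:

Suppose the hypothesis is $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)$ and we already know its five parameters: $\theta_0=-1$ , $\theta_1=0$ , $\theta_2=0$ , $\theta_3=1$ , $\theta_4=1$ .

Then our decision boundary is $-1+x_1^2+x_2^2=0$ , the unit circle. Points outside the circle are predicted as $y=1$ , and points inside it as $y=0$ .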
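A small sketch of the circular boundary, assuming the five parameters correspond to the features $1, x_1, x_2, x_1^2, x_2^2$ in that order. The names `map_features` and `predict_circle` and the test points are illustrative.

```python
def map_features(x1, x2):
    """Feature vector [1, x1, x2, x1^2, x2^2] matching theta_0 .. theta_4."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2]


def predict_circle(theta, x1, x2):
    """Predict 1 when theta^T f(x) >= 0, i.e. on or outside the boundary."""
    z = sum(t * f for t, f in zip(theta, map_features(x1, x2)))
    return 1 if z >= 0 else 0


theta = [-1.0, 0.0, 0.0, 1.0, 1.0]

print(predict_circle(theta, 0.0, 0.0))  # inside the unit circle, prints: 0
print(predict_circle(theta, 2.0, 0.0))  # outside the unit circle, prints: 1
```

The model is still linear in its parameters; only the features are nonlinear, which is what lets the boundary curve.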

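So how are the cost function and gradient descent expressed for the logistic regression model?

For the linear regression model, our cost function is: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^{2}$

We define the term inside the sum as: $Cost(h_\theta(x^{(i)}),y^{(i)})=\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^{2}$

This function certainly works for linear regression, but does it work equally well for logistic regression?

Recall from Notes 4 that the cost $J(\theta)$ should decrease as the number of iterations grows, until it converges. For gradient descent to be sure of reaching the global minimum, the graph of $J(\theta)$ should be convex (bowl-shaped). Here, however, the hypothesis $h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$ is a fairly complicated nonlinear function, and plugging it into the squared-error cost makes $J(\theta)$ non-convex:

What does that mean? In the left plot, $J(\theta)$ has ripples, so gradient descent can easily get stuck at the bottom of one ripple (a local minimum) instead of reaching the global minimum.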
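Non-convexity can be demonstrated numerically. A convex function never rises above the chord between two of its points, so a single violation of $J(\frac{a+b}{2})\leq\frac{J(a)+J(b)}{2}$ is enough to show that the function is not convex. The sketch below assumes one hypothetical training example ( $x=1$ , $y=0$ ) and a single parameter, so that $h_\theta(x)=g(\theta)$ .

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_cost(theta, x=1.0, y=0.0):
    """Squared-error cost of one example when h_theta(x) = g(theta * x)."""
    return 0.5 * (sigmoid(theta * x) - y) ** 2


a, b = 0.0, 10.0
midpoint_value = squared_cost((a + b) / 2)
chord_value = (squared_cost(a) + squared_cost(b)) / 2

# For a convex function this comparison would be False.
print(midpoint_value > chord_value)  # prints: True
```

The curve flattens out once the sigmoid saturates, which is the kind of shape that can trap or stall gradient descent.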

Exercise 4
Consider minimizing a cost function $J(\theta)$ . Which one of these functions is convex?
A.

B.

C.

D.

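Answer: B

So we can no longer use the previous cost function; we need a new one that does not produce non-convexity.

We therefore define $Cost(h_\theta(x),y)$ as:

$Cost(h_\theta(x),y)=-\log(h_\theta(x))$ if $y=1$ , and $Cost(h_\theta(x),y)=-\log(1-h_\theta(x))$ if $y=0$

When $y=1$ , its graph looks like this:

When $y=0$ , its graph looks like this:

So our logistic regression cost function is:

$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})$

From the graphs above we can read off the following properties:

$Cost(h_\theta(x),y)=0$ if $h_\theta(x)=y$ ; $Cost(h_\theta(x),y)\to\infty$ if $y=0$ and $h_\theta(x)\to 1$ ; $Cost(h_\theta(x),y)\to\infty$ if $y=1$ and $h_\theta(x)\to 0$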

Exercise 5
In logistic regression, the cost function for our hypothesis outputting (predicting) $h_\theta(x)$ on a training example that has label $y\in\{0,1\}$ is:

$Cost(h_\theta(x),y)=-\log(h_\theta(x))$ if $y=1$ , and $Cost(h_\theta(x),y)=-\log(1-h_\theta(x))$ if $y=0$

Which of the following are true? Check all that apply.
A. If $h_\theta(x)=y$ , then $Cost(h_\theta(x),y)=0$ (for $y=0$ and $y=1$ ).
B. If $y=0$ , then $Cost(h_\theta(x),y)\to\infty$ as $h_\theta(x)\to 1$ .
C. If $y=0$ , then $Cost(h_\theta(x),y)\to\infty$ as $h_\theta(x)\to 0$ .
D. Regardless of whether $y=0$ or $y=1$ , if $h_\theta(x)=0.5$ , then $Cost(h_\theta(x),y)>0$ .
Answer: A, B, D
Analysis: The answers follow from the two graphs above. C is false because for $y=0$ the cost goes to 0, not infinity, as $h_\theta(x)\to 0$ .

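Because our target value $y$ is always either 0 or 1, we can rewrite $Cost(h_\theta(x),y)$ as a single expression:

$Cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$

So our logistic regression cost function can be written as:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$

This cost function can be derived from the principle of maximum likelihood estimation in statistics.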
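For readers who want the missing step, here is a brief sketch of that derivation. It assumes the training examples are independent, and $L(\theta)$ is notation introduced here for the likelihood; it does not appear elsewhere in these notes.

```latex
\begin{align*}
P(y \mid x;\theta) &= h_\theta(x)^{y}\,\bigl(1-h_\theta(x)\bigr)^{1-y}, \qquad y \in \{0,1\} \\
L(\theta) &= \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}\,\bigl(1-h_\theta(x^{(i)})\bigr)^{1-y^{(i)}} \\
\log L(\theta) &= \sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr] \\
J(\theta) &= -\frac{1}{m}\,\log L(\theta)
\end{align*}
```

Maximizing the likelihood is therefore the same as minimizing $J(\theta)$ .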

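In vectorized form, this can be written as:

$h=g(X\theta)$ , $J(\theta)=\frac{1}{m}\left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right)$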
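The cost can be computed directly from its definition. The following is a minimal sketch in plain Python with a hypothetical four-example data set; each row of `X` is assumed to start with the bias term $x_0=1$ .

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[y*log(h) + (1-y)*log(1-h)]."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m


# Hypothetical data: one feature plus the bias column.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost([0.0, 0.0], X, y))   # every h is 0.5, so J equals log(2)
print(cost([-5.0, 2.0], X, y))  # a better fit gives a lower cost
```

This sketch does not guard against $h$ rounding to exactly 0 or 1 for extreme inputs, where `math.log` would fail; practical code usually clips $h$ slightly away from both ends.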

Exercise 6
Suppose you are running gradient descent to fit a logistic regression model with parameter $\theta\in\mathbb{R}^{n+1}$ . Which of the following is a reasonable way to make sure the learning rate $\alpha$ is set properly and that gradient descent is running correctly?
A. Plot $J(\theta) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^{2}$ as a function of the number of iterations (i.e. the horizontal axis is the iteration number) and make sure $J(\theta)$ is decreasing on every iteration.
B. Plot $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]$ as a function of the number of iterations (i.e. the horizontal axis is the iteration number) and make sure $J(\theta)$ is decreasing on every iteration.
C. Plot $J(\theta)$ as a function of $\theta$ and make sure it is decreasing on every iteration.
D. Plot $J(\theta)$ as a function of $\theta$ and make sure it is convex.
Answer: B

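So how do we find our parameters $\theta_i$ ? Naturally, with gradient descent:

Repeat { $\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$ } (simultaneously update all $\theta_j$ )

After substituting the derivative, the update $\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$ simplifies to:

$\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

In vectorized form:

$\theta := \theta - \frac{\alpha}{m}X^{T}(g(X\theta)-y)$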
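The update rule translates into a short loop. The sketch below uses plain Python and a small hypothetical data set that is deliberately not perfectly separable, so the parameters stay finite; the learning rate and iteration count are arbitrary illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(theta, X, y):
    """(1/m) * sum over examples of (h_theta(x) - y) * x, one entry per theta_j."""
    m = len(y)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for j, v in enumerate(xi):
            grad[j] += error * v / m
    return grad


def gradient_descent(X, y, alpha=0.1, iterations=5000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = gradient(theta, X, y)
        # Simultaneous update: every theta_j is computed from the same old theta.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data: bias column plus one feature.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
y = [0, 0, 1, 0, 1, 1]

theta = gradient_descent(X, y)
predictions = [
    1 if sigmoid(sum(t * v for t, v in zip(theta, xi))) >= 0.5 else 0 for xi in X
]
print(theta, predictions)
```

The update has the same form as the one for linear regression; the only difference is that $h_\theta(x)$ is now the sigmoid of $\theta^Tx$ rather than $\theta^Tx$ itself.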

Exercise 7
One iteration of gradient descent simultaneously performs these updates:
$\theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}$
$\theta_{1} := \theta_{1} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_1^{(i)}$
……
$\theta_{n} := \theta_{n} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_n^{(i)}$
We would like a vectorized implementation of the form $\theta := \theta-\alpha\delta$ (for some vector $\delta\in\mathbb{R}^{n+1}$ ).
What should the vectorized implementation be?
A. $\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^{m}[(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}]$
B. $\theta := \theta - \alpha\frac{1}{m}[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})]\cdot x^{(i)}$
C. $\theta := \theta - \alpha\frac{1}{m}x^{(i)}[\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})]$
D. All of the above are correct implementations.
Answer: A

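The feature scaling that sped up gradient descent for linear regression applies equally to logistic regression.

Optimization algorithms:

We can use some advanced optimization algorithms to minimize the cost function faster than plain gradient descent, which helps with large machine learning problems.

What is gradient descent?

It is simply the process of minimizing the cost function $J(\theta)$ .

So when we write code that takes $\theta$ as input, it usually returns $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ (for $j=0,1,…,n$ ), which are then plugged into $\theta_{j} := \theta_{j} - \alpha\frac{\partial}{\partial\theta_{j}}J(\theta)$ to update $\theta$ repeatedly.

Besides gradient descent, there are more advanced and more complex algorithms that use the same two quantities, $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$ , to minimize the cost, for example conjugate gradient, BFGS and L-BFGS.

The advantage of these three algorithms is that we do not need to pick the learning rate $\alpha$ by hand, and they often converge much faster than gradient descent. The drawback is that they are far more complex than gradient descent.

We will not dig into how they work here; knowing how to use them is enough. Here is an example:

Suppose our cost function $J(\theta)$ is as shown above. By inspection we can already see where its minimum lies,

and the parameters we end up with are: $\theta_0=5,\theta_1=5$

Then we create a function costFunction in Matlab whose input is $\theta$ and whose outputs are the cost $J(\theta)$ and the gradient (that is, the partial derivatives with respect to the two $\theta$ values):

Then we can call the advanced function fminunc:

Result:

Here fminunc tries to find the minimum of a function of several variables, starting from an initial estimate; this is generally referred to as unconstrained nonlinear optimization.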
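The MATLAB screenshots for this example are not available, so the following Python sketch only mirrors the structure described above: a cost function that returns both $J(\theta)$ and its gradient, handed to a minimizer. Two things are assumptions. First, the quadratic $J(\theta)=(\theta_0-5)^2+(\theta_1-5)^2$ is chosen only because it has the stated minimum at $\theta_0=5,\theta_1=5$ . Second, the `minimize` helper is a hypothetical stand-in that runs plain gradient descent; it is not fminunc and does not use BFGS or any of the advanced methods.

```python
def cost_function(theta):
    """Return (J, gradient) for the assumed cost J = (theta0 - 5)^2 + (theta1 - 5)^2."""
    j_val = (theta[0] - 5.0) ** 2 + (theta[1] - 5.0) ** 2
    gradient = [2.0 * (theta[0] - 5.0), 2.0 * (theta[1] - 5.0)]
    return j_val, gradient


def minimize(fun, initial_theta, alpha=0.1, max_iter=1000, tol=1e-12):
    """Plain gradient descent, used here only as a simple stand-in for fminunc."""
    theta = list(initial_theta)
    for _ in range(max_iter):
        _, grad = fun(theta)
        if sum(g * g for g in grad) < tol:  # stop once the gradient is tiny
            break
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta, fun(theta)[0]


opt_theta, final_cost = minimize(cost_function, [0.0, 0.0])
print(opt_theta, final_cost)
```

The design point carried over from the text is the interface: the optimizer only ever needs a function that maps $\theta$ to the pair (cost, gradient), so the same cost function could be passed to a more sophisticated optimizer without changes.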

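General usage:

    x = fminunc(fun,x0) % try to find a local minimum of the function starting near x0; x0 can be a scalar, vector, or matrix
    x = fminunc(fun,x0,options) % minimize using the settings in the structure options, which can be set with optimset
    x = fminunc(problem) % find the minimum for problem, a structure described in Input Arguments

    [x,fval] = fminunc(...) % also return the value of the objective function fun at the solution x
    [x,fval,exitflag] = fminunc(...) % also return a value exitflag that describes the exit condition
    [x,fval,exitflag,output] = fminunc(...) % also return a structure output with information about the optimization
    [x,fval,exitflag,output,grad] = fminunc(...) % also return the gradient of the function at the solution x, stored in grad
    [x,fval,exitflag,output,grad,hessian] = fminunc(...) % also return the Hessian of the function at the solution x, stored in hessian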

Exercise 8
Suppose you want to use an advanced optimization algorithm to minimize the cost function for logistic regression with parameters $\theta_0$ and $\theta_1$ . You write the following code:

What should CODE#1 and CODE#2 above compute?
A. CODE#1 and CODE#2 should compute $J(\theta)$ .
B. CODE#1 should be $\theta_1$ and CODE#2 should be $\theta_2$ .
C. CODE#1 should compute $\frac{1}{m}\sum_{i=1}^{m}[(h_\theta(x^{(i)})-y^{(i)})\cdot x_0^{(i)}]$ $(=\frac{\partial}{\partial\theta_0}J(\theta))$ and CODE#2 should compute $\frac{1}{m}\sum_{i=1}^{m}[(h_\theta(x^{(i)})-y^{(i)})\cdot x_1^{(i)}]$ $(=\frac{\partial}{\partial\theta_1}J(\theta))$ .
D. None of the above.
Answer: C

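Multiclass Classification: One-vs-all

The left plot is the binary classification we discussed earlier, and the right plot is multiclass classification.

Our approach is to split the three-class problem into three binary classification problems, each separating one class from the other two, and fit one hypothesis $h_\theta^{(i)}(x)=P(y=i|x;\theta)$ for each class $i=1,2,3$ . For a new input $x$ , we pick the class $i$ whose $h_\theta^{(i)}(x)$ is largest: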
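The prediction step of one-vs-all can be sketched as follows: evaluate every class's binary classifier and choose the class with the highest estimated probability. The three parameter vectors are hand-picked hypothetical values for illustration only, not the result of training, and `predict_one_vs_all` is an illustrative name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_all(thetas, x):
    """thetas maps each class label to the parameters of that class's classifier.

    x must include the bias term x_0 = 1. Returns (best label, all scores).
    """
    scores = {
        label: sigmoid(sum(t * v for t, v in zip(theta, x)))
        for label, theta in thetas.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical parameters: one binary classifier per class.
thetas = {
    1: [3.0, -1.0, -1.0],  # confident when x1 + x2 is small
    2: [-3.0, 1.0, 0.0],   # confident when x1 is large
    3: [-3.0, 0.0, 1.0],   # confident when x2 is large
}

print(predict_one_vs_all(thetas, [1.0, 0.0, 0.0])[0])  # prints: 1
print(predict_one_vs_all(thetas, [1.0, 6.0, 0.0])[0])  # prints: 2
print(predict_one_vs_all(thetas, [1.0, 0.0, 6.0])[0])  # prints: 3
```

Training is unchanged from the binary case: each classifier is fitted separately after relabeling its own class as 1 and every other class as 0.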

Exercise 9
Suppose you have a multi-class classification problem with $k$ classes (so $y\in\{1,2,…,k\}$ ). Using the 1-vs-all method, how many different logistic regression classifiers will you end up training?
A. $k-1$
B. $k$
C. $k+1$
D. Approximately $\log_2(k)$
Answer: B

Here are some questions on the material above:

1. Suppose that you have trained a logistic regression classifier, and it outputs on a new example $x$ a prediction $h_\theta(x)=0.2$ . This means (check all that apply):
A. Our estimate for $P(y=1|x;\theta)$ is 0.8.
B. Our estimate for $P(y=1|x;\theta)$ is 0.2.
C. Our estimate for $P(y=0|x;\theta)$ is 0.8.
D. Our estimate for $P(y=0|x;\theta)$ is 0.2.
Answer: B, C
Analysis: The output is the probability that $y=1$ , so $P(y=0|x;\theta)=1-0.2=0.8$ .
2. Suppose you have the following training set, and fit a logistic regression classifier $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$ .

Which of the following are true? Check all that apply.
A. $J(\theta)$ will be a convex function, so gradient descent should converge to the global minimum.
B. Adding polynomial features (e.g., instead using $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_1x_2+\theta_5x_2^2)$ ) could increase how well we can fit the training data.
C. The positive and negative examples cannot be separated using a straight line. So, gradient descent will fail to converge.
D. Because the positive and negative examples cannot be separated using a straight line, linear regression will perform as well as logistic regression on this data.
E. Adding polynomial features (e.g., instead using $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_1x_2+\theta_5x_2^2)$ ) would increase $J(\theta)$ because we are now summing over more terms.
F. If we train gradient descent for enough iterations, for some examples $x^{(i)}$ in the training set it is possible to obtain $h_\theta(x^{(i)})>1$ .
G. At the optimal value of $\theta$ (e.g., found by fminunc), we will have $J(\theta)\geq 0$ .
Answer: A, B, G
Analysis: A. The cost function $J(\theta)$ is convex, so gradient descent can converge to the global minimum. True.
B. The extra features give the model a more flexible boundary, so it can fit the training data better. True.
C. Data that cannot be separated by a straight line does not stop gradient descent from converging. False.
D. Linear regression does not perform as well as logistic regression here. False.
E. The sum in $J(\theta)$ runs over training examples, not features, so adding polynomial terms does not add terms to the sum. False.
F. $h_\theta(x^{(i)})$ always lies in the range $(0,1)$ . False.
G. $J(\theta)\geq 0$ holds at the optimum, because every term $-\log(h_\theta(x))$ or $-\log(1-h_\theta(x))$ is non-negative. True.
3. For logistic regression, the gradient is given by $\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}[(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}]$ . Which of these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$ ? Check all that apply.
A. $\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^{m}(\frac{1}{1+e^{-\theta^Tx^{(i)}}}-y^{(i)})\cdot x^{(i)}$
B. $\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(\theta^Tx-y^{(i)})\cdot x_j^{(i)}$ (simultaneously update for all $j$ )
C. $\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}$
D. $\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^{m}(\theta^{T}x-y^{(i)})\cdot x^{(i)}$
E. $\theta_{j} := \theta_{j} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})\cdot x^{(i)}$
Answer: A, C
Analysis: This follows from the definitions above. A and C are the vectorized update with $h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$ ; B and D use the linear regression hypothesis $\theta^Tx$ ; E updates the scalar $\theta_j$ with the whole vector $x^{(i)}$ .
4. Which of the following statements are true? Check all that apply.
A. Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification).
B. The one-vs-all technique allows you to use logistic regression for problems in which each $y^{(i)}$ comes from a fixed, discrete set of values.
C. The cost function $J(\theta)$ for logistic regression trained with $m\geq 1$ examples is always greater than or equal to zero.
D. For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc.)
E. The sigmoid function $g(z)=\frac{1}{1+e^{-z}}$ is never greater than one ( $>1$ ).
F. Linear regression always works well for classification if you classify by using a threshold on the prediction made by linear regression.
Answer: B, C, E
Analysis:
A. Three classes need three classifiers, as in the earlier exercise. False.
B. This is exactly what one-vs-all classification is for. True.
C. Here $m$ is the number of training examples, and the cost of each example is non-negative, so $J(\theta)\geq 0$ . True.
D. The logistic regression cost function is convex, so gradient descent does not get stuck in local minima; the advanced algorithms are preferred because they are often faster and need no hand-picked $\alpha$ . False.
E. The sigmoid function is never greater than 1. True.
F. Linear regression is not suited to classification problems. False.
5. Suppose you train a logistic classifier $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$ . Suppose $\theta_0=-6$ , $\theta_1=0$ , $\theta_2=1$ . Which of the following figures represents the decision boundary found by your classifier?
A.