Machine Learning Notes 9 —— Overfitting and Regularization

The problem of over-fitting:

Let's use the house-price prediction example we know from linear regression. Suppose our hypothesis functions are, respectively, the following:

The hypothesis on the left is $\theta_0+\theta_1x$. You can see that it fits the data rather poorly: as the size of the house grows, the predicted price keeps rising by the same amount, which does not match the real data. We call this problem under-fitting, or we say the algorithm has high bias.

The hypothesis in the middle is $\theta_0+\theta_1x+\theta_2x^2$. Adding a quadratic term to the previous model gives a curve that fits the data very well.

The one on the right is an extreme case. The hypothesis $\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3+\theta_4x^4$ passes through all five training examples, so it looks like an excellent fit, but such a contorted curve does not follow the real trend of house prices, so it is not a good model. We call this problem over-fitting, or we say the algorithm has high variance.

Overfitting means the model goes to great lengths to fit the training data but fails to generalize to new examples.

The same thing happens in logistic regression:

Exercise 1

Consider the medical diagnosis problem of classifying tumors as malignant or benign. If a hypothesis $h_\theta(x)$ has overfit the training set, it means that:

A. It makes accurate predictions for examples in the training set and generalizes well to make accurate predictions on new, previously unseen examples.

B. It does not make accurate predictions for examples in the training set, but it does generalize well to make accurate predictions on new, previously unseen examples.

C. It makes accurate predictions for examples in the training set, but it does not generalize well to make accurate predictions on new, previously unseen examples.

D. It does not make accurate predictions for examples in the training set and does not generalize well to make accurate predictions on new, previously unseen examples.

Answer:C

Analysis: an overfit hypothesis fits the training data perfectly but cannot generalize to new data.

So how should we deal with the overfitting problem?

For one- or two-dimensional data like the example above, we can plot the hypothesis and use the plot to judge whether it overfits. But when we have a great many variables, such a plot becomes hard to draw. In general we have two options:

The first option is to reduce the number of features. We can manually decide which features are more important and should be kept, and which matter less and can be discarded. We can also use a model selection algorithm, which automatically chooses which features to keep (we will study this later). This approach does solve overfitting, but its drawback is that it throws away some of the information in the original data.

The second option is regularization, which we discuss next. We keep all of the features but shrink the magnitude of the parameters $\theta_j$ (for example, driving $\theta_4\approx0$).

Take the example above:

The problem comes precisely from the extra terms $\theta_3x^3+\theta_4x^4$. Suppose we add $1000\,\theta_3^2+1000\,\theta_4^2$ to the cost function; minimizing it then forces $\theta_3\approx0$ and $\theta_4\approx0$, so the last two terms are effectively gone and their overfitting effect is removed.

More generally, regularization adds a penalty on the parameters $\theta_j$ to counteract the overfitting caused by such polynomial terms.

The regularized cost function for linear regression can therefore be written as:

$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$

Because we do not know which polynomial term is responsible for the overfitting, we penalize all of the parameters, $\sum_{j=1}^{n}\theta_j^2$, which smooths the whole hypothesis.
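As a small illustration, here is a minimal NumPy sketch of this regularized cost (the function name regularized_cost and its argument layout are my own, not from the course code). Following the convention used in the lectures, the bias term $\theta_0$ is left out of the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta) -- illustrative sketch.

    X: (m, n+1) design matrix whose first column is all ones,
    y: (m,) targets, theta: (n+1,) parameters, lam: regularization strength lambda.
    """
    m = len(y)
    residual = X @ theta - y                           # h_theta(x^(i)) - y^(i) for every example
    fit_term = (residual @ residual) / (2 * m)         # (1/2m) * sum of squared errors
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # (lambda/2m) * sum_{j>=1} theta_j^2
    return fit_term + reg_term
```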

Exercise 2

In regularized linear regression, we choose $\theta$ to minimize:

$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^{n}\theta_j^2\right]$

What if $\lambda$ is set to an extremely large value (perhaps too large for our problem, say $\lambda=10^{10}$)?

A. Algorithm works fine; setting $\lambda$ to be very large can't hurt it.

B. Algorithm fails to eliminate over-fitting.

C. Algorithm results in under-fitting (fails to fit even the training set).

D. Gradient descent will fail to converge.

Answer:C

Analysis: here $\lambda=10^{10}$ is clearly far too large, so the term $\lambda\sum_{j=1}^{n}\theta_j^2$ dominates the cost, the parameters are all pushed toward zero, and the result tends toward a horizontal straight line, i.e. underfitting. We therefore need to choose a suitable regularization parameter $\lambda$.

Gradient descent for regularized linear regression:

Repeat {

$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$

$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]$  ($j=1,2,3,\dots,n$)

}

The update for $\theta_j$ can be simplified to:

$\theta_j:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$
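As an illustration, a minimal NumPy sketch of one such update (the function and variable names are mine, not from the course code); it implements the two update rules above and leaves $\theta_0$ out of the regularization:

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update for linear regression -- illustrative sketch.

    X: (m, n+1) design matrix with a leading column of ones, y: (m,) targets,
    alpha: learning rate, lam: regularization parameter lambda.
    """
    m = len(y)
    error = X @ theta - y              # h_theta(x^(i)) - y^(i)
    grad = (X.T @ error) / m           # unregularized gradient for every theta_j
    grad[1:] += (lam / m) * theta[1:]  # add (lambda/m) * theta_j for j = 1..n, but not for theta_0
    return theta - alpha * grad        # simultaneous update of all parameters
```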

Exercise 3

Suppose you are doing gradient descent on a training set of $m>0$ examples, using a fairly small learning rate $\alpha>0$ and some regularization parameter $\lambda>0$. Consider the update rule:

$\theta_j:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$

Which of the following statements about the term $\left(1-\alpha\frac{\lambda}{m}\right)$ must be true?

A. $1-\alpha\frac{\lambda}{m}>1$

B. $1-\alpha\frac{\lambda}{m}=1$

C. $1-\alpha\frac{\lambda}{m}<1$

D. None of the above.

Answer:C

Analysis: the learning rate $\alpha$, the regularization parameter $\lambda$ and the number of training examples $m$ are all positive, so $\alpha\frac{\lambda}{m}$ is also positive and $1-\alpha\frac{\lambda}{m}<1$.

Compare this with the earlier, unregularized gradient descent update for linear regression: the second term is exactly the same; the only difference is that $\theta_j$ is now multiplied by the factor $\left(1-\alpha\frac{\lambda}{m}\right)$, so each iteration first shrinks $\theta_j$ slightly and then applies the usual gradient step.
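For concreteness, here is a quick worked instance of that shrinkage factor with made-up values of $\alpha$, $\lambda$ and $m$ (chosen purely for illustration):

```latex
% Illustrative values only: alpha = 0.01, lambda = 10, m = 100
1-\alpha\frac{\lambda}{m} \;=\; 1-\frac{0.01\times 10}{100} \;=\; 1-0.001 \;=\; 0.999
```

So on each iteration $\theta_j$ is first multiplied by a number just below 1 (here 0.999) before the ordinary gradient term is subtracted.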

What about regularizing the normal equation (Notes 5)?

Normal equation: $\theta=\left(X^{T}X\right)^{-1}X^{T}y$

Regularized normal equation:

$\theta=\left(X^{T}X+\lambda\begin{bmatrix}0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1\end{bmatrix}\right)^{-1}X^{T}y$

where the added matrix is $(n+1)\times(n+1)$, with $0$ in the top-left corner (so that $\theta_0$ is not penalized) and $1$ everywhere else on the diagonal.

The detailed derivation is not expanded here; it is essentially the same as the proof in Notes 5, just with this extra matrix added.
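A minimal NumPy sketch of this closed-form solution (the function name and argument layout are mine, not from the course code):

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Closed-form regularized linear regression -- illustrative sketch.

    Solves (X^T X + lambda * L) theta = X^T y, where L is the (n+1) x (n+1) identity
    matrix with its top-left entry zeroed so that theta_0 is not penalized.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    # Solving the linear system is numerically preferable to forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```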

Regularization for logistic regression:

The regularized cost function for logistic regression is:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$

Just as in linear regression, the added term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ is exactly what counteracts overfitting.
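A minimal NumPy sketch of this cost (the names are my own, not course code); again $\theta_0$ is excluded from the penalty:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_regularized(theta, X, y, lam):
    """Regularized logistic regression cost -- illustrative sketch.

    X: (m, n+1) design matrix with a leading column of ones, y: (m,) labels in {0, 1}.
    """
    m = len(y)
    h = sigmoid(X @ theta)                            # h_theta(x^(i)) for every example
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return cross_entropy + penalty
```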

Regularized gradient descent for logistic regression:

Repeat {

$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}$

$\theta_j:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]$  ($j=1,2,3,\dots,n$)

}
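The update has exactly the same shape as in linear regression; only the hypothesis $h_\theta$ changes to the sigmoid. Here is a self-contained NumPy sketch of the corresponding gradient (names are mine, not from the course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_regularized(theta, X, y, lam):
    """Gradient of the regularized logistic regression cost -- illustrative sketch."""
    m = len(y)
    error = sigmoid(X @ theta) - y     # only the hypothesis differs from linear regression
    grad = (X.T @ error) / m
    grad[1:] += (lam / m) * theta[1:]  # regularize theta_1..theta_n but not theta_0
    return grad

# One iteration would then be: theta = theta - alpha * logistic_gradient_regularized(theta, X, y, lam)
```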

Exercise 4

When using regularized logistic regression, which of these is the best way to monitor whether gradient descent is working correctly?

A. Plot $-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$ as a function of the number of iterations and make sure it's decreasing.

B. Plot $-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]-\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ as a function of the number of iterations and make sure it's decreasing.

C. Plot $-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ as a function of the number of iterations and make sure it's decreasing.

D. Plot $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ as a function of the number of iterations and make sure it's decreasing.

Answer:C
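In other words, we should monitor the full objective that gradient descent is actually minimizing, including the $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ term. A minimal matplotlib sketch (assuming a list cost_history of recorded cost values; the name is my own):

```python
import matplotlib.pyplot as plt

def plot_convergence(cost_history):
    """Plot the full regularized cost J(theta) recorded after each iteration.

    cost_history: list of J(theta) values, one per gradient descent iteration;
    the curve should be decreasing if gradient descent is working correctly.
    """
    plt.plot(range(1, len(cost_history) + 1), cost_history)
    plt.xlabel("iteration")
    plt.ylabel("J(theta), including the regularization term")
    plt.show()
```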

The following are quiz questions on the material above:

Question 1

You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.

A.Introducing regularization to the model always results in equal or better performance on examples not in the training set.

B.Adding many new features to the model makes it more likely to overfit the training set.

C.Adding a new feature to the model always results in equal or better performance on examples not in the training set.

D.Introducing regularization to the model always results in equal or better performance on the training set.

Answer:B

Analysis:

A, D: regularization is introduced to deal with overfitting; it does not always give equal or better performance, whether on new examples or on the training set (an overly large λ can even hurt the training fit), so the word "always" makes both statements false.

B: adding many new features lets the hypothesis fit the training data ever more closely, including points it previously could not fit, and that is exactly overfitting, so this statement is true.

C: adding a new feature does not always improve performance on examples outside the training set; on the contrary, it can lead to overfitting, so this is false.

Question 2

Suppose you ran logistic regression twice, once with λ=0 and once with λ=1. One of the times, you got parameters $\theta=\begin{bmatrix}26.29\\65.41\end{bmatrix}$, and the other time you got $\theta=\begin{bmatrix}1.37\\0.51\end{bmatrix}$. However, you forgot which value of λ corresponds to which value of θ. Which one do you think corresponds to λ=1?

A. $\theta=\begin{bmatrix}26.29\\65.41\end{bmatrix}$

B. $\theta=\begin{bmatrix}1.37\\0.51\end{bmatrix}$

Answer:B

Analysis: λ=0 means no regularization and λ=1 means regularized; regularization shrinks the parameters $\theta_j$, so the smaller parameters correspond to λ=1, i.e. B.

Question 3

Which of the following statements about regularization are true? Check all that apply.

A. Because logistic regression outputs values $0\le h_\theta(x)\le 1$, its range of output values can only be "shrunk" slightly by regularization anyway, so regularization is generally not helpful for it.

B.Using a very large value of λ cannot hurt the performance of your hypothesis; the only reason we do not set λ to be too large is to avoid numerical problems.

C.Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when λ=0 ).

D.Using too large a value of λ can cause your hypothesis to overfit the data; this can be avoided by reducing λ .

Answer:C

Analysis:

A: claiming that regularization is not helpful for logistic regression is wrong; it is just as useful there.

B, D: an overly large λ is harmful because it causes underfitting, not overfitting, so both statements are wrong.

Question 4

In which one of the following figures do you think the hypothesis has overfit the training set?

A.

B.

C.

D.

Answer:A

Analysis:

A overfits.

B and C fit just right.

D underfits.

Question 5

In which one of the following figures do you think the hypothesis has underfit the training set?

A.

B.

C.

D.

Answer:A


These notes are compiled from Andrew Ng's Machine Learning course on Coursera.

To keep the notes from becoming bloated and hard to search, I have split them into several parts; interested readers can check out the other notes:

Machine Learning Notes 1 —— Definition of Machine Learning, Supervised and Unsupervised Learning

Machine Learning Notes 2 —— Linear Models, Cost Function and Gradient Descent

Machine Learning Notes 3 —— Linear Algebra Basics

Machine Learning Notes 4 —— Linear Regression with Multiple Features

Machine Learning Notes 5 —— The Normal Equation

Machine Learning Notes 6 —— Matlab Programming Basics

Machine Learning Notes 7 —— Programming Assignment 1

Machine Learning Notes 8 —— Cost Function and Gradient Descent for Logistic Regression


