
Translation-Invariant Regularized Linear Regression

Note: I previously wrote a series of machine learning posts discussing properties of some popular machine learning models that are hard to find in papers and textbooks, but that I consider important. Since they were hosted on a blog outside the Great Firewall, they are hard to access from within China. A friend suggested I move them to Zhihu, so here is this first test-the-waters post. The main text follows.

In this post, I will discuss a seldom-documented aspect of, or trick for, one of the simplest models: linear regression. I will show how to make the solution of ridge regression translation invariant and what the bias term really means. Some people might find this obvious; however, it matters in practice, and when we later move on to non-linear models, the trick will no longer be obvious.

For a recap, regularized linear regression (ridge regression) fits a linear model y = \mathbf{w}^T\mathbf{x} to a d \times n data matrix \mathbf{X} (one sample per column) and a 1 \times n row vector of targets \mathbf{y}, by minimizing the regularized least-squares objective

Q = \|\mathbf{y} - \mathbf{w}^T\mathbf{X}\|^2 + \lambda\|\mathbf{w}\|^2 \qquad (1)

The solution is simply \mathbf{w} = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T.
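In Matlab, this closed-form solution is essentially a one-liner. Here is a minimal sketch (the function name ridgeNoBias is mine; X and y are assumed to be d-by-n and 1-by-n as above):

function w = ridgeNoBias(X, y, lambda)
% Plain ridge regression with no bias term: w = (X X^T + lambda I)^{-1} X y^T
d = size(X, 1);
w = (X*X' + lambda*eye(d)) \ (X*y');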

Actually, we want to fit a linear model with a bias term, y = \mathbf{w}^T\mathbf{x} + w_0. Most textbooks will tell you not to worry about the bias, since we can always augment the variables with an extra dimension, \tilde{\mathbf{x}} = [1, \mathbf{x}^T]^T and \tilde{\mathbf{w}} = [w_0, \mathbf{w}^T]^T. What the textbooks miss is that the bias term w_0 actually matters once a regularizer is involved. Let's take a closer look at w_0. We can equivalently rewrite (1) as

Q = \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + w_0\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 + \lambda_0 w_0^2
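To see where \lambda_0 comes from, here is a short expansion using the augmented variables defined above, with \tilde{\mathbf{X}} = [\mathbf{1}, \mathbf{X}^T]^T denoting the column-wise augmented data:

\|\mathbf{y} - \tilde{\mathbf{w}}^T\tilde{\mathbf{X}}\|^2 + \lambda\|\tilde{\mathbf{w}}\|^2 = \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + w_0\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 + \lambda w_0^2

since \tilde{\mathbf{w}}^T\tilde{\mathbf{X}} = w_0\mathbf{1}^T + \mathbf{w}^T\mathbf{X} and \|\tilde{\mathbf{w}}\|^2 = w_0^2 + \|\mathbf{w}\|^2.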

When augmented variables are used, what (1) really does is set \lambda_0 = \lambda. However, it actually makes little sense to use a non-zero \lambda_0. To see why, let's look at the solution for w_0: we solve

\frac{\partial Q}{\partial w_0} = 2n w_0 - 2(\mathbf{y} - \mathbf{w}^T\mathbf{X})\mathbf{1} + 2\lambda_0 w_0 = 0

and get

\begin{array}{rcl} w_0 &=& \frac{1}{n+\lambda_0}(\mathbf{y} - \mathbf{w}^T\mathbf{X})\mathbf{1} \\ &=& \bar{y} - \mathbf{w}^T\bar{\mathbf{x}} \end{array}

Here we have defined a pseudo version of the usual average notation:

\begin{array}{rcl} \bar{y} &=& \frac{1}{n+\lambda_0}\,\mathbf{y}\mathbf{1} \\ \bar{\mathbf{x}} &=& \frac{1}{n+\lambda_0}\,\mathbf{X}\mathbf{1} \end{array}
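Written out element-wise, with y_1, \dots, y_n denoting the individual targets, the first of these reads

\bar{y} = \frac{1}{n+\lambda_0}\,\mathbf{y}\mathbf{1} = \frac{y_1 + \cdots + y_n + \lambda_0 \cdot 0}{n + \lambda_0}

that is, the ordinary average of the n observed targets together with \lambda_0 extra targets fixed at 0; \bar{\mathbf{x}} treats the inputs in the same way.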

This result means that, before we see any data, we assume there are \lambda_0 pseudo samples sitting at the origin \mathbf{0}. Why would one make such an assumption? In my opinion, there is absolutely no reason to. So the conclusion is: never regularize the bias.

If we want our solution to be invariant with respect to translation, we should set \lambda_0 = 0, which means we should minimize

Q = \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + w_0\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2

Substituting w_0 = \bar{y} - \mathbf{w}^T\bar{\mathbf{x}} (note that with \lambda_0 = 0 the pseudo-averages above become the ordinary sample means), we have

\begin{array}{rcl} Q &=& \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + (\bar{y} - \mathbf{w}^T\bar{\mathbf{x}})\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 \\ &=& \|(\mathbf{y} - \bar{y}\mathbf{1}^T) - \mathbf{w}^T(\mathbf{X} - \bar{\mathbf{x}}\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 \\ &=& \|\mathbf{y}_c - \mathbf{w}^T\mathbf{X}_c\|^2 + \lambda\|\mathbf{w}\|^2 \end{array}

where

\begin{array}{rcl} \mathbf{y}_c &=& \mathbf{y} - \bar{y}\mathbf{1}^T \\ \mathbf{X}_c &=& \mathbf{X} - \bar{\mathbf{x}}\mathbf{1}^T \end{array}

are the centered targets and inputs.

The solution for \mathbf{w} is then

\mathbf{w} = (\mathbf{X}_c\mathbf{X}_c^T + \lambda\mathbf{I})^{-1}\mathbf{X}_c\mathbf{y}_c^T

This means that we first center the data \mathbf{X} and \mathbf{y}, and then run ridge regression on the centered data. Because the data are always centered first, the resulting \mathbf{w} is translation invariant.

Here is some Matlab code to show the idea.

function [w, w0] = linReg(X, y, lambda)
% X: d-by-n inputs, y: 1-by-n targets, lambda: regularization weight
d = size(X, 1);
xbar = mean(X, 2);                     % mean of the inputs (d-by-1)
ybar = mean(y, 2);                     % mean of the targets
X = bsxfun(@minus, X, xbar);           % center the inputs
y = bsxfun(@minus, y, ybar);           % center the targets
w = (X*X' + lambda*eye(d)) \ (X*y');   % ridge solution on the centered data
w0 = ybar - dot(w, xbar);              % recover the bias term
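As a quick sanity check, here is a small synthetic experiment (the data, weights, and translation below are made up purely for illustration) showing that translating the inputs leaves \mathbf{w} unchanged and only shifts the bias:

% Synthetic check of translation invariance (illustrative numbers only).
rng(0);
d = 3; n = 200; lambda = 0.1;
X = randn(d, n);
y = [1; -2; 0.5]' * X + 4 + 0.1 * randn(1, n);

[w1, w01] = linReg(X, y, lambda);
t = [10; -5; 3];                          % an arbitrary translation of the inputs
[w2, w02] = linReg(bsxfun(@plus, X, t), y, lambda);

disp(norm(w1 - w2));                      % ~0: the weights do not change
disp(abs((w02 + dot(w2, t)) - w01));      % ~0: only the bias absorbs the shift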
