


In this post, I will discuss a seldom-documented trick for one of the simplest models: linear regression. I will show how to make the solution of ridge regression translation invariant, and what the bias term really means. Some people might find this obvious, but it matters in practice, and the same trick is much less obvious later when we deal with non-linear models.

As a recap, regularized linear regression fits a linear model $y = \mathbf{w}^T\mathbf{x}$ by minimizing a regularized least-squares objective

$$Q=\|\mathbf{y}-\mathbf{w}^T\mathbf{X}\|^2+\lambda\|\mathbf{w}\|^2 \qquad (1)$$

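The solution is simply $\mathbf{w}=(\mathbf{X}\mathbf{X}^T+\lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T$. Here the $n$ samples are the columns of the $d\times n$ matrix $\mathbf{X}$, and $\mathbf{y}$ is the $1\times n$ row vector of targets.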
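
For readers who want the intermediate step, the closed form follows from setting the gradient of (1) to zero. This is only a short sketch, assuming the convention used in this post (samples as columns of $\mathbf{X}$, $\mathbf{y}$ a row vector) and $\lambda>0$ so that the matrix is invertible:

```latex
\begin{aligned}
\frac{\partial Q}{\partial \mathbf{w}}
  &= -2\mathbf{X}(\mathbf{y}-\mathbf{w}^T\mathbf{X})^T+2\lambda\mathbf{w}=0\\
\Rightarrow\quad (\mathbf{X}\mathbf{X}^T+\lambda\mathbf{I})\,\mathbf{w}
  &= \mathbf{X}\mathbf{y}^T\\
\Rightarrow\quad \mathbf{w}
  &= (\mathbf{X}\mathbf{X}^T+\lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T
\end{aligned}
```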

In practice, we want to fit a linear model with a bias term, $y = \mathbf{w}^T\mathbf{x}+w_0$. Most textbooks will tell you that we don't have to worry about the bias, since we can always augment the variables with an extra dimension: $\tilde{\mathbf{x}}=[1,\mathbf{x}^T]^T$ and $\tilde{\mathbf{w}}=[w_0,\mathbf{w}^T]^T$. What the textbooks miss is that the bias term $w_0$ actually matters once a regularizer is involved. Let's take a closer look at $w_0$. We can rewrite (1) with the bias written out explicitly and given its own regularization weight $\lambda_0$, where $\mathbf{1}$ is the all-ones column vector of length $n$:

$$Q=\|\mathbf{y}-\mathbf{w}^T\mathbf{X}-w_0\mathbf{1}^T\|^2+\lambda\|\mathbf{w}\|^2+\lambda_0 w_0^2$$

When augmented variables are used, what (1) really does is set $\lambda_0=\lambda$. However, it makes little sense to use a non-zero $\lambda_0$. To see why, let's look at the solution for $w_0$. We solve

$$\frac{\partial Q}{\partial w_0}=2nw_0-2(\mathbf{y}-\mathbf{w}^T\mathbf{X})\mathbf{1}+2\lambda_0w_0=0$$

and get

$$\begin{array}{rcl} w_0&=&\frac{1}{n+\lambda_0}(\mathbf{y}-\mathbf{w}^T\mathbf{X})\mathbf{1}\\ &=&\bar{y}-\mathbf{w}^T\bar{\mathbf{x}} \end{array}$$

Here we define pseudo-averages of the targets and the inputs:

$$\begin{array}{rcl} \bar{y}&=&\frac{1}{n+\lambda_0}\mathbf{y}\mathbf{1}\\ \bar{\mathbf{x}}&=&\frac{1}{n+\lambda_0}\mathbf{X}\mathbf{1} \end{array}$$

This result means that, before seeing the data, we assume there are $\lambda_0$ pseudo-samples sitting at the origin $\mathbf{0}$. Why would one make such an assumption? In my opinion, there is absolutely no reason to. So the conclusion is: never regularize the bias.
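
To make the pseudo-sample reading concrete, here is a small worked rewrite of the averages above. It assumes that $\lambda_0$ is a non-negative integer, so that it can be read as a count; $\mathbf{x}_i$ and $y_i$ denote the $i$-th column of $\mathbf{X}$ and the $i$-th entry of $\mathbf{y}$.

```latex
\bar{\mathbf{x}}
  =\frac{1}{n+\lambda_0}\sum_{i=1}^{n}\mathbf{x}_i
  =\frac{\sum_{i=1}^{n}\mathbf{x}_i+\lambda_0\cdot\mathbf{0}}{n+\lambda_0},
\qquad
\bar{y}
  =\frac{1}{n+\lambda_0}\sum_{i=1}^{n}y_i
  =\frac{\sum_{i=1}^{n}y_i+\lambda_0\cdot 0}{n+\lambda_0}
```

These are exactly the sample means of a data set with $n+\lambda_0$ points, where the extra $\lambda_0$ points have $\mathbf{x}=\mathbf{0}$ and $y=0$. For instance, with $n=3$ targets $(2,4,6)$ and $\lambda_0=1$, we get $\bar{y}=12/4=3$ rather than the true mean $4$.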

If we want our solution to be invariant w.r.t. translation, we should set $\lambda_0=0$, which means we should minimize

$$Q=\|\mathbf{y}-\mathbf{w}^T\mathbf{X}-w_0\mathbf{1}^T\|^2+\lambda\|\mathbf{w}\|^2$$

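Substituting $w_0=\bar{y}-\mathbf{w}^T\bar{\mathbf{x}}$ (with $\lambda_0=0$, $\bar{y}$ and $\bar{\mathbf{x}}$ are now the ordinary sample means), we have

$$\begin{array}{rcl} Q&=&\|\mathbf{y}-(\mathbf{w}^T\mathbf{X}+(\bar{y}-\mathbf{w}^T\bar{\mathbf{x}})\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2\\ &=&\|(\mathbf{y}-\bar{y}\mathbf{1}^T)-\mathbf{w}^T(\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T)\|^2+\lambda\|\mathbf{w}\|^2\\ &=&\|\mathbf{y}_c-\mathbf{w}^T\mathbf{X}_c\|^2+\lambda\|\mathbf{w}\|^2 \end{array}$$

where $\mathbf{y}_c=\mathbf{y}-\bar{y}\mathbf{1}^T$ and $\mathbf{X}_c=\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T$ denote the centered data.

The solution for $\mathbf{w}$ then is

$$\mathbf{w}=(\mathbf{X}_c\mathbf{X}_c^T+\lambda \mathbf{I})^{-1}\mathbf{X}_c\mathbf{y}_c^T$$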

This solution means that we first center the data $\mathbf{X}$ and $\mathbf{y}$, and then regress on the centered data. The bias is recovered afterwards as $w_0=\bar{y}-\mathbf{w}^T\bar{\mathbf{x}}$. This gives a translation-invariant solution for $\mathbf{w}$, since the data are always centered first.
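
The invariance can also be checked by hand. In the sketch below, the shift vector $\mathbf{t}$, the target offset $s$, and the primed symbols are notation introduced only for this check. The centered data are written $\mathbf{X}_c=\mathbf{X}-\bar{\mathbf{x}}\mathbf{1}^T$ and $\mathbf{y}_c=\mathbf{y}-\bar{y}\mathbf{1}^T$, with $\bar{\mathbf{x}}$ and $\bar{y}$ the ordinary sample means.

```latex
\begin{aligned}
\mathbf{X}'&=\mathbf{X}+\mathbf{t}\mathbf{1}^T,\quad \mathbf{y}'=\mathbf{y}+s\mathbf{1}^T
  &&\Rightarrow\quad \bar{\mathbf{x}}'=\bar{\mathbf{x}}+\mathbf{t},\quad \bar{y}'=\bar{y}+s\\
\mathbf{X}'_c&=\mathbf{X}'-\bar{\mathbf{x}}'\mathbf{1}^T=\mathbf{X}_c,\quad
\mathbf{y}'_c=\mathbf{y}'-\bar{y}'\mathbf{1}^T=\mathbf{y}_c
  &&\Rightarrow\quad \mathbf{w}'=\mathbf{w}\\
w_0'&=\bar{y}'-\mathbf{w}^T\bar{\mathbf{x}}'=w_0+s-\mathbf{w}^T\mathbf{t}
\end{aligned}
```

So a translation of the data leaves $\mathbf{w}$ untouched and is absorbed entirely by the bias.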

Here is some Matlab code to show the idea.

    function [w, w0] = linReg(X, y, lambda)
    % X: d x n (samples as columns), y: 1 x n
    d = size(X,1);
    xbar = mean(X,2);
    ybar = mean(y,2);
    X = bsxfun(@minus,X,xbar);  % center inputs
    y = bsxfun(@minus,y,ybar);  % center targets
    w = (X*X'+lambda*eye(d))\(X*y');
    w0 = ybar-dot(w,xbar);
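
As a quick sanity check of the claim, the following script compares the centered fit with the augmented-variable fit (where the bias is regularized with $\lambda_0=\lambda$) on a toy data set and on a translated copy of it. The data, the shift values, and the helper names `center`, `ridge`, `fitCentered` and `fitAugmented` are illustrative assumptions, not part of `linReg`. It is only a sketch, and it should run in MATLAB or GNU Octave.

```matlab
% Toy data (illustrative): d = 2 features, n = 5 samples stored as columns.
X = [1 2 3 4 5; 2 1 4 3 6];
y = [1.1 1.9 3.2 3.8 5.3];
lambda = 0.5;

% Helpers (hypothetical names, introduced only for this check).
center = @(A) bsxfun(@minus, A, mean(A, 2));                      % subtract row means
ridge  = @(X, y, lambda) (X*X' + lambda*eye(size(X,1))) \ (X*y'); % closed-form ridge
fitCentered  = @(X, y, lambda) ridge(center(X), center(y), lambda);
fitAugmented = @(X, y, lambda) ridge([ones(1, size(X,2)); X], y, lambda);

% Translate the inputs and the targets.
Xs = bsxfun(@plus, X, [10; -7]);
ys = y + 100;

% Centered fit (bias not regularized): w should not change.
w1 = fitCentered(X, y, lambda);
w2 = fitCentered(Xs, ys, lambda);
fprintf('centered fit, change in w:  %g\n', norm(w1 - w2));

% Augmented fit (bias regularized): entries 2:end hold w, which does change.
a1 = fitAugmented(X, y, lambda);
a2 = fitAugmented(Xs, ys, lambda);
fprintf('augmented fit, change in w: %g\n', norm(a1(2:end) - a2(2:end)));
```

The first number should be zero up to rounding error, while the second should not be. That difference is the practical consequence of regularizing the bias.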


