
Translation-Invariant Regularized Linear Regression

Note: I previously wrote a series of machine learning posts discussing properties of some popular machine learning models that are hard to find in papers and textbooks, but that I consider important. Since they were hosted on a blog outside the Great Firewall, they are hard to access from within China. A friend suggested I move them to Zhihu, so here is this first test-the-waters post. The main text follows.

In this post, I will discuss a seldom-documented aspect of, or trick for, one of the simplest models: linear regression. I will show how to make the solution of ridge regression translation invariant and what the bias term really means. Some people might find this obvious; however, it matters in practice, and when we later move on to non-linear models, the trick will no longer be obvious.

For a recap, regularized linear regression (ridge regression) fits a linear model y = \mathbf{w}^T\mathbf{x} to a d \times n data matrix \mathbf{X} (one sample per column) and a 1 \times n row vector of targets \mathbf{y}, by minimizing the regularized least-squares objective

Q = \|\mathbf{y} - \mathbf{w}^T\mathbf{X}\|^2 + \lambda\|\mathbf{w}\|^2 \qquad (1)

The solution is simply \mathbf{w} = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T.
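In Matlab, this closed-form solution is essentially a one-liner. Here is a minimal sketch (the function name ridgeNoBias is mine; X and y are assumed to be d-by-n and 1-by-n as above):

function w = ridgeNoBias(X, y, lambda)
% Plain ridge regression with no bias term: w = (X X^T + lambda I)^{-1} X y^T
d = size(X, 1);
w = (X*X' + lambda*eye(d)) \ (X*y');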

Actually, we want to fit a linear model with a bias term, y = \mathbf{w}^T\mathbf{x} + w_0. Most textbooks will tell you not to worry about the bias, since we can always augment the variables with an extra dimension, \tilde{\mathbf{x}} = [1, \mathbf{x}^T]^T and \tilde{\mathbf{w}} = [w_0, \mathbf{w}^T]^T. What the textbooks miss is that the bias term w_0 actually matters once a regularizer is involved. Let's take a closer look at w_0. We can equivalently rewrite (1) as

Q = \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + w_0\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 + \lambda_0 w_0^2
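To see where \lambda_0 comes from, here is a short expansion using the augmented variables defined above, with \tilde{\mathbf{X}} = [\mathbf{1}, \mathbf{X}^T]^T denoting the column-wise augmented data:

\|\mathbf{y} - \tilde{\mathbf{w}}^T\tilde{\mathbf{X}}\|^2 + \lambda\|\tilde{\mathbf{w}}\|^2 = \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + w_0\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 + \lambda w_0^2

since \tilde{\mathbf{w}}^T\tilde{\mathbf{X}} = w_0\mathbf{1}^T + \mathbf{w}^T\mathbf{X} and \|\tilde{\mathbf{w}}\|^2 = w_0^2 + \|\mathbf{w}\|^2.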

When augmented variables are used, what (1) really does is set \lambda_0 = \lambda. However, it actually makes little sense to use a non-zero \lambda_0. To see why, let's look at the solution for w_0: we solve

\frac{\partial Q}{\partial w_0} = 2n w_0 - 2(\mathbf{y} - \mathbf{w}^T\mathbf{X})\mathbf{1} + 2\lambda_0 w_0 = 0

and get

\begin{array}{rcl} w_0 &=& \frac{1}{n+\lambda_0}(\mathbf{y} - \mathbf{w}^T\mathbf{X})\mathbf{1} \\ &=& \bar{y} - \mathbf{w}^T\bar{\mathbf{x}} \end{array}

Here we have defined a pseudo version of the usual average notation:

\begin{array}{rcl} \bar{y} &=& \frac{1}{n+\lambda_0}\,\mathbf{y}\mathbf{1} \\ \bar{\mathbf{x}} &=& \frac{1}{n+\lambda_0}\,\mathbf{X}\mathbf{1} \end{array}
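Written out element-wise, with y_1, \dots, y_n denoting the individual targets, the first of these reads

\bar{y} = \frac{1}{n+\lambda_0}\,\mathbf{y}\mathbf{1} = \frac{y_1 + \cdots + y_n + \lambda_0 \cdot 0}{n + \lambda_0}

that is, the ordinary average of the n observed targets together with \lambda_0 extra targets fixed at 0; \bar{\mathbf{x}} treats the inputs in the same way.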

This result means that, before we see any data, we assume there are \lambda_0 pseudo samples sitting at the origin \mathbf{0}. Why would one make such an assumption? In my opinion, there is absolutely no reason to. So the conclusion is: never regularize the bias.

If we want our solution to be invariant with respect to translation, we should set \lambda_0 = 0, which means we should minimize

Q = \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + w_0\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2

Substituting w_0 = \bar{y} - \mathbf{w}^T\bar{\mathbf{x}} (note that with \lambda_0 = 0 the pseudo-averages above become the ordinary sample means), we have

\begin{array}{rcl} Q &=& \|\mathbf{y} - (\mathbf{w}^T\mathbf{X} + (\bar{y} - \mathbf{w}^T\bar{\mathbf{x}})\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 \\ &=& \|(\mathbf{y} - \bar{y}\mathbf{1}^T) - \mathbf{w}^T(\mathbf{X} - \bar{\mathbf{x}}\mathbf{1}^T)\|^2 + \lambda\|\mathbf{w}\|^2 \\ &=& \|\mathbf{y}_c - \mathbf{w}^T\mathbf{X}_c\|^2 + \lambda\|\mathbf{w}\|^2 \end{array}

where

\begin{array}{rcl} \mathbf{y}_c &=& \mathbf{y} - \bar{y}\mathbf{1}^T \\ \mathbf{X}_c &=& \mathbf{X} - \bar{\mathbf{x}}\mathbf{1}^T \end{array}

are the centered targets and inputs.

The solution for \mathbf{w} is then

\mathbf{w} = (\mathbf{X}_c\mathbf{X}_c^T + \lambda\mathbf{I})^{-1}\mathbf{X}_c\mathbf{y}_c^T

This means that we first center the data \mathbf{X} and \mathbf{y}, and then run ridge regression on the centered data. Because the data are always centered first, the resulting \mathbf{w} is translation invariant.

Here is some Matlab code to show the idea.

function [w, w0] = linReg(X, y, lambda)
% X: d-by-n inputs, y: 1-by-n targets, lambda: regularization weight
d = size(X, 1);
xbar = mean(X, 2);                     % mean of the inputs (d-by-1)
ybar = mean(y, 2);                     % mean of the targets
X = bsxfun(@minus, X, xbar);           % center the inputs
y = bsxfun(@minus, y, ybar);           % center the targets
w = (X*X' + lambda*eye(d)) \ (X*y');   % ridge solution on the centered data
w0 = ybar - dot(w, xbar);              % recover the bias term
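As a quick sanity check, here is a small synthetic experiment (the data, weights, and translation below are made up purely for illustration) showing that translating the inputs leaves \mathbf{w} unchanged and only shifts the bias:

% Synthetic check of translation invariance (illustrative numbers only).
rng(0);
d = 3; n = 200; lambda = 0.1;
X = randn(d, n);
y = [1; -2; 0.5]' * X + 4 + 0.1 * randn(1, n);

[w1, w01] = linReg(X, y, lambda);
t = [10; -5; 3];                          % an arbitrary translation of the inputs
[w2, w02] = linReg(bsxfun(@plus, X, t), y, lambda);

disp(norm(w1 - w2));                      % ~0: the weights do not change
disp(abs((w02 + dot(w2, t)) - w01));      % ~0: only the bias absorbs the shift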
