[PRML學習筆記] CHA 4 Linear Models for Classification

04-05

Least Square容易受到outliers的影響。

文中介紹了一個叫Fishers Linear discriminant的模型，其實是將data project在其他平面上使得histograms的overlapping減少，獲得更好的classify效果

引入了一個概念：Fisher criterion，組間方差和組內方差的比值 $J(w)=frac{(m_2-m_1)^2}{s_1^2+s_2^2}$ ，m1和m2使兩個class里的data的均值。目標是最大化 $J(w)$ ，最終得到的結果是 $w propto S_W^{-1}(m_2-m_1)$ 即為Fishers linear discriminant。

strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension.

如果對target value的形式作一點調整，那麼可以得出least square 和 fisher solution 是等價的。

Fisher』s discriminant for multiple classes

假設是：input space的維數D大於類數K。

引入 $D>1$ 個新的feature $y_k=w_k^Tx$ 。用這個features計算組間和組內方差，同時我們的目標函數應該具備的條件是，組間方差 $s_B$ 越大，組內 $s_W$ 方差越小，函數值越大。這樣的函數有很多，比如 $J(mathbf{W})=Trleft{s_W^{-1}s_B ight}$

two classes的時候組間方差 $S_B$ 是 $(mathbf{m}_2-mathbf{m}_1)(mathbf{m}_2-mathbf{m}_1)^T$ ，multi-class時組間方差是 $sum_{k=1}^KN_k(mathbf{m}_k-mathbf{m})(mathbf{m}_k-mathbf{m})^T$ ，不過用K=2代入，兩者差距為 $frac{2N_1N_2}{N_1+N_2}$ ，不受w影響。
另外，組間方差的計算要注意，兩個相乘向量的維度是 $D imes1$ 和 $1 imes D$ ，所以得出的結果維度是 $D imes D$ ，但rank是1。

Perceptron Algorithm

在模式識別歷史上非常重要。用於two-class模型。先把input vector x進行非線性變換，成為特徵向量 $phi(x)$ ，然後用於構建線性模型 $y(x)=f(w^Tphi(x))$ ，其中 $egin{equation} f(a)=left{ egin{aligned} +1, ageq 0 \ -1, a<0 \ end{aligned} ight. end{equation}$ 。target $t_n$ 表示所屬類別，用1和-1表示。Perceptron Algo有自己適用的error function叫perceptron criterion： $E_P(mathbf w)=-sum_{nin cal M} mathbf w^Tphi_nt_n$ ，這般設置就可以使錯誤歸類的Ep值為正數，選取w最小化Ep，其中 $cal M$ 是所有錯誤歸類的集合。然後使用SGD尋找w。有一個Perceptron convergence theorem，如果training data是線性可分的，那麼這個演算法一定會找到solution，但如果不是線性可分的話，演算法不會收斂。

缺點：

訓練演算法複雜
不輸出概率計算結果，（只有分類結果）
對 $K>2$ 支持不佳
最大的限制在於：其是linear combination of fixed basis function

Probabilistic Generative Models（可以計算出概率的）

基本方法就是用貝葉斯法則，即根據 $p(x|C_k)$ 和prior $p(C_k)$ 得出 $p(C_k|x)$ 。

---------------------------------------------------------------------------------------------------------------------------

Activation Function：

logistic sigmoid function（for two class）

$p(C_1|x)=frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1)+p(x|C_2)p(C_2)}=frac{1}{1+exp(-a)}=sigma (a)$ （1）

$a=lnfrac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$

softmax function（for multiclass）

當對 $K>2$ 時， $p(C_k|x)=frac{p(x|C_k)p(C_k)}{sum _jp(x|C_j)p(C_j)}$ ， $a_k=ln p(x|C_k)p(C_k)$ ，就是softmax function。

---------------------------------------------------------------------------------------------------------------------------

Continuous inputs

K = 2

假設：class-conditional densities are Gaussian，all classes share same cov

即 $p(x|C_k)=frac{1}{(2pi)^{D/2}}frac{1}{left | Sigma ight |^{1/2}}expleft{ -frac{1}{2}(x-mu_k)^TSigma^{-1}(x-mu_k) ight}$ （2）

可以化簡（1）為 $p(C_1|x)=sigma(mathbf w^Tx+w_0)$ ，其中

$mathbf w =Sigma^{-1}(m{mu}_1-mmu_2)$

$w_0 = -frac{1}{2}mmu_1^TSigma^{-1}mmu_1+frac{1}{2}mmu_2^TSigma^{-1}mmu_2+lnfrac{p(C_1)}{p(C_2)}$

證明：（2）代入 $a=lnfrac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$ 得 $a=frac{1}{2}left[(x-mmu_2)^TSigma^{-1}(x-mmu_2)-(x-mmu_1)^TSigma^{-1}(x-mmu_1) ight]+lnfrac{p(C_1)}{p(C_2)}$
拆開後
$egin{align*}& a=x^TSigma^{-1}x-x^TSigma^{-1}mmu_2-mmu_2^TSigma^{-1}x+mmu_2^TSigma^{-1}mmu_2+x^TSigma^{-1}x\& +x^TSigma^{-1}mmu_1+mmu_1^TSigma^{-1}x-mmu_1^TSigma^{-1}mmu_1+lnfrac{p(C_1)}{p(C_2)}\ & =mathbf w^Tx+w_0end{align*}$

若將前述假設改為，classes have different cov那麼，就得到quadratic discriminant model

$K>2$

$egin{align*}& a_k=mathbf w_k^Tx+w_{k0}\& mathbf w_k = Sigma^{-1}mmu_k \& w_{k0}=-frac{1}{2}mmu_k^TSigma^{-1}mmu_k+ln p(C_k)end{align*}$

用極大似然法求參數以及prior $p(C_k)$

計算結果：

$p(C_1)$ =class 1s fraction of total sample
$mmu_1$ = mean of x that belongs to class 1
$mSigma$ = weighted average of the covariance matrices associated with each of the two classes separately. 即 $mSigma = frac{N_1}{N}(frac{1}{N_1}sum_{nin C_1}(x_n-mmu_1)(x_n-mmu_1)^T)+frac{N_2}{N}(frac{1}{N_2}sum_{nin C_1}(x_n-mmu_2)(x_n-mmu_2)^T)$

Discrete inputs

class-conditional distribution： $p(x|C_k)=prod_{i=1}^Dmu_{ki}^{x_i}(1-mu_{ki})^{1-x_i}$

Exponential Family

不管continuous還是dicrete，其實都是更general的exponential family的class-conditional density

---------------------------------------------------------------------------------------------------------------------------

Probabilistic Discriminative Models

Two class

先用非線性方程 $phi (x)$ 作為新特徵向量，如此之後， $phi(x)$ 是linearly separable，但x不是。

用cross entropy error function即logarithm of the likelihood：

$E(mathbf w)=-ln p(mathbf t| mathbf w)=-sum_{n=1}^{N} left{ t_n ln y_n + (1-t_n)ln (1-y_n) ight}$

順便求一個gradient： $abla E(mathbf w)=sum_{n=1}^N(y_n-t_n)phi_n$

介紹了一個新演算法叫Newton-Raphson，就是用Hessian matrix（second order gradient）。

$mathbf w^{(new)} = mathbf w^{(old)}-mathbf H^{-1} abla E(mathbf w)$ （3）

用NR演算法求linear regression的least square error function只需要一步，因為該方程是quadratic。
求Logistic regression model的cross entropy error function的話，沒有closed form solution，所以需要iterative的方法：Iterative reweighted least squares

已經算了gradient，再算 $mathbf H = abla abla E(mathbf w)=sum_{n=1}^Ny_n(1-y_n)phi_nphi_n^T=mPhi^Tmathbf RmPhi$ ，其中R是一個diagonal matrix with elements $R_{nn}=y_n(1-y_n)$ ，就是方差矩陣吧。 $mPhi$ 是特徵向量 $phi$ 合起來的矩陣， $mPhi$ 的行是 $phi_n^T$
代入（3） $egin{align*}& mathbf w^{(new)} = mathbf w^{(old)}-(mPhi^Tmathbf RmPhi)^{-1}mPhi^T(mathbf{y-t}) \& =(mPhi^Tmathbf RmPhi)^{-1}mPhi^Tmathbf Rmathbf zend{align*}$ ，其中 $mathbf z=mPhimathbf w^{old}-mathbf R^{-1}(mathbf{y-t})$
$mathbf z$ 的每一個元素 $z_i$ 可以看作是 $phi_i$ 在當前weighting $mathbf w^{old}$ 處作線性approximation得到的target value。 $mathbf w^Tphi_i =phi_i^Tmathbf w^{old}+frac{da_n}{dy_n}igg|_{mathbf w^{old}}= phi_i^Tmathbf w^{old}-frac{y_i-t_i}{y_i(1-y_i)}=z_i$
因為，R是要用w計算的，所以需要循環

Multi Class

還是用cross entropy error function， $E(mathbf w_1,...mathbf w_K)=-ln p(mathbf T|mathbf w_1,...mathbf w_K)=-sum_{n=1}^{N}sum_{k=1}^{K}t_{nk}ln y_{nk}$

Probit regression

用累積密度函數上的某個level作為分類boundary.

Canonical link functions

link function就是activation function的反函數
若target variable的conditional distribution（Gaussian）是exponential family，且選取的activation function屬於canonical link function（例如sigmoid，softmax），那麼該模型的error function相對於w的gradient的形式都是『error』 $y_n-t_n$ 和features向量 $phi_n$ 的乘積這樣的簡單形式。

---------------------------------------------------------------------------------------------------------------------------

Laplace Approximation（很像Copula）

原變數 $z$ 的分布 $p(z)=frac{1}{Z}f(z)$ ， $Z=int f(z)dz$ 不是正態分布，取原變數分布的眾數 $z_0$ 為中心，近似建立一個正態分布。用眾數大概是因為，眾數是概率密度函數的最大值，一階導為0。令A表示眾數所在點的二階導（負的），那麼近似的分布就是 $q(z)=left(frac{A}{2pi} ight)^{1/2}expleft{-frac{A}{2}(z-z_0)^2 ight}$

Model Comparison: penalize model complexity using BIC(Bayesian Information Criterion)

用近似分布 $q(z)$ 計算Z的近似值。 $Z=int f(mathbf z)dmathbf z=f(mathbf z_0)frac{(2pi)^{M/2}}{left|mathbf A ight|^{1/2}}$
這篇文章講的比較易懂：frequentist希望找到parameter $heta$ 使得 $p(cal D| heta)$ 達到最大值，但這樣往往造成overfit。bayesian方法是用model class（a set of distribution of $heta$ ）

In Bayesian evidence, if a model m1 is very complex, then m1 can generate many possible data sets. As a result, it is unlikely to generate the particular data set D at random. If the model m2 is too simple, the likelihood will be small, suggesting that it is unlikely to be a good fit. So, roughly speaking, Bayesian evidence has a regularization in itself allowing us to choose the model which is just right.

3. 但是， $p(D|m)$ 很難算，用Laplace approximation可以協助，BIC更加簡化了Laplace approximation的結果。

---------------------------------------------------------------------------------------------------------------------------

維數問題

看到後面突然發現維數一定要搞清楚，文中的黑體小寫字母表示列向量，大寫字母表示矩陣，所以input data matrix $ilde{X}$ 的nth 行是 $x_n^T$ 。

對於input space dimension為D的模型，x的維度是 $D imes1$ ，所以平均值 $mathbf{m}_1$ 的維度也是 $D imes1$ 。Fishers Linear Discriminant

Class conditional density and Activation Function

這倆是本章比較多篇幅的概念，順序都是從具體（gaussian或binomial，sigmoid或softmax）到一般（exponential family, canonical link function）。