Day 9: Study notes on "An Introduction to Statistical Learning"

05-15

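Chapter 10: Unsupervised Learning

The goal is no longer to predict a response from explanatory variables. Instead, we want to:

--represent as much of the information in high-dimensional data as possible in a low-dimensional form (PCA)

--discover subgroups (clusters) within the sample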

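10.1 The challenge of unsupervised learning

10.2 Principal components analysis

10.2.1 What are principal components?

10.2.2 Another interpretation of principal components

10.2.3 More on PCA

10.2.4 Other uses for principal components

10.3 Clustering methods

10.3.1 K-means clustering

10.3.2 Hierarchical clustering

10.3.3 Practical issues in clustering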

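10.1 The challenge of unsupervised learning

There is no simple, well-defined goal (such as predicting y); the methods are used mostly as part of exploratory data analysis.

There is no universally accepted standard for judging the results: without a response y there is nothing to check predictions against, so techniques such as cross-validation cannot be applied.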

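10.2 Principal components analysis

10.2.1 What are principal components?

Given an n*p data set, we want m components (m < p) that capture as much of the information in the p features as possible. Each component is a linear combination of the p features. When p=2, each component defines a direction (a line) in feature space, for example Z1=0.8*pop+0.5*ad (coefficients rounded). "As much information as possible" means that the projections of the sample points onto the component have the largest possible variance (more variation means more information). The loadings are constrained so that their squares sum to 1; otherwise the variance could be made arbitrarily large. More generally, the loading vector $\phi_{1}$ is chosen so that the projections of all sample points onto the subspace it spans have maximal variance.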

[Figure: the case p=2]

[Figure: the case p=3]

Key concepts:

the first principal component: Z1

loadings of the first principal component: $\phi_{1}$

scores of the first principal component: $z_{11}$, $z_{21}$, ..., $z_{n1}$

the second principal component: among all linear combinations that are uncorrelated with Z1, the one whose direction gives the projected sample points the largest variance.
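
As a rough illustration of these definitions, the sketch below computes the first principal component of the built-in USArrests data by hand. It assumes that the loading vector can be taken as the leading eigenvector of the covariance matrix of the standardized data (a standard fact that these notes do not state); the names X, phi1, z1 and other are only illustrative.

```r
# Standardize the data: mean 0 and standard deviation 1 for every feature
X <- scale(USArrests)

# Assumption: the loading vector phi1 is the leading eigenvector of cov(X)
eig <- eigen(cov(X))
phi1 <- eig$vectors[, 1]

# Scores z_11, ..., z_n1: projections of the observations onto phi1
z1 <- X %*% phi1
var_z1 <- as.numeric(var(z1))

# A different unit-length direction (the first feature alone) for comparison
other <- c(1, 0, 0, 0)
var_other <- as.numeric(var(X %*% other))

# prcomp reports the same variance for the first component (sdev is its square root)
pr <- prcomp(USArrests, scale. = TRUE)
c(by_hand = var_z1, prcomp = pr$sdev[1]^2, other_direction = var_other)
```

The variance of the hand-computed scores should agree with prcomp and exceed the variance along the comparison direction.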

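10.2.2 Another interpretation of principal components

We look for the line (component) that lies as close as possible to all the sample points; being close means that it summarizes the sample well. Each component captures part of the information in the sample (the first principal component captures the most, and each later one captures less), so we can add up the contributions of the first M principal components to approximate the original (centered) data, that is:

$x_{ij} \approx \sum_{m=1}^{M} z_{im}\phi_{jm}$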
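
The approximation of the data by the first M principal components can be checked numerically. The sketch below rebuilds the standardized USArrests data from its first M components and reports the total squared reconstruction error, using the scores and loadings returned by prcomp; the helper name reconstruct is hypothetical.

```r
X <- scale(USArrests)
pr <- prcomp(X)      # X is already centered and scaled
Z <- pr$x            # scores, n x p
Phi <- pr$rotation   # loadings, p x p

# Hypothetical helper: approximate x_ij by the sum over m = 1..M of z_im * phi_jm
reconstruct <- function(M) {
  Z[, 1:M, drop = FALSE] %*% t(Phi[, 1:M, drop = FALSE])
}

# Total squared error of the approximation for M = 1, ..., p
errors <- sapply(1:ncol(X), function(M) sum((X - reconstruct(M))^2))
errors
```

The error shrinks as M grows and is numerically zero once all p components are used.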

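10.2.3 More on PCA

--Do the variables need to be centered to mean 0 and scaled to standard deviation 1?

Centering to mean 0 is always needed. Scaling to standard deviation 1 is needed when the p features are measured in different units; when they share the same units, scaling is not required.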
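
A small sketch of why scaling matters when the units differ, using the built-in USArrests data (the same data as the PCA code in these notes). The object names pr_raw and pr_scaled are illustrative.

```r
# The feature variances differ widely because the features use different scales
apply(USArrests, 2, var)

pr_raw <- prcomp(USArrests)                    # centered only
pr_scaled <- prcomp(USArrests, scale. = TRUE)  # centered and scaled

# Without scaling, the first loading vector is dominated by Assault
round(pr_raw$rotation[, 1], 3)

# With scaling, the weight is spread across the features
round(pr_scaled$rotation[, 1], 3)
```

Without scaling, the feature with the largest variance drives the first component regardless of how informative it is.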

In the biplot, the left and bottom axes refer to the scores, and the top and right axes refer to the loadings.

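--Are the principal components and the loadings unique?

A principal component corresponds to a line, and that line is unique. A loading is a vector, and the vector describing a given line is not unique: it is determined only up to a sign flip.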
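
The sign ambiguity can be seen directly. This is a minimal sketch on the standardized USArrests data; z_plus and z_minus are illustrative names.

```r
X <- scale(USArrests)
phi1 <- prcomp(X)$rotation[, 1]

# phi1 and -phi1 describe the same line through the origin
z_plus <- as.numeric(X %*% phi1)
z_minus <- as.numeric(X %*% (-phi1))

# The scores flip sign, but their variance is unchanged
c(var_plus = var(z_plus), var_minus = var(z_minus))
```

This is why two programs may report loadings with opposite signs for the same data.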

--How much of the variation (information) in the sample does each principal component explain?

This is measured by the PVE (proportion of variance explained).
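
For reference, with centered data the PVE of the m-th principal component can be written with the scores and loadings introduced earlier. This is the usual textbook definition, given here as a reminder rather than derived:

```latex
\mathrm{PVE}_m
  = \frac{\sum_{i=1}^{n} z_{im}^{2}}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^{2}},
\qquad
z_{im} = \sum_{j=1}^{p} \phi_{jm}\, x_{ij}
```

The numerator is the variance captured by component m and the denominator is the total variance in the data (both up to the same factor 1/n), so the PVEs of all components sum to 1.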

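--How do we decide how many principal components to use?

If PCA is used to reduce the dimension before supervised learning, M can be chosen by cross-validation.

If it is used purely for unsupervised learning, M is judged roughly from a plot of the PVE (a scree plot).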
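
Reading M off a PVE plot is subjective. One common rule of thumb, which is an assumption here and not something these notes prescribe, is to keep the smallest M whose cumulative PVE passes a chosen threshold. A minimal sketch with an arbitrary threshold of 0.8:

```r
pr <- prcomp(USArrests, scale. = TRUE)
pve <- pr$sdev^2 / sum(pr$sdev^2)

# Assumption: keep enough components to explain at least 80% of the variance
threshold <- 0.8
M <- which(cumsum(pve) >= threshold)[1]

round(cumsum(pve), 3)
M
```

The threshold is a judgment call; looking for the elbow in the PVE plot remains the approach described above.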

10.2.4 Other uses for principal components

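10.3 Clustering methods

10.3.1 K-means clustering

step1: choose the number of clusters K (a subjective choice)

step2: randomly assign each observation a number from 1 to K, which serves as its initial cluster

step3: compute the centroid of each cluster, i.e. the vector of feature means over the observations in that cluster

step4: reassign each observation to the cluster whose centroid is closest

step5: repeat steps 3 and 4 until no observation changes cluster

Because the algorithm only finds a local optimum, it should be run several times from different random starts, keeping the solution with the smallest total within-cluster variation.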
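
The steps above can be sketched in a few lines of R. This is an illustrative implementation, not the algorithm used by stats::kmeans; the function name simple_kmeans and the handling of empty clusters are assumptions made for the sketch.

```r
# Illustrative K-means following steps 1-5 (a sketch, not stats::kmeans)
simple_kmeans <- function(x, K, max_iter = 100) {
  n <- nrow(x)
  # Step 2: random initial labels; rep_len keeps every cluster non-empty at the start
  cluster <- sample(rep_len(1:K, n))
  for (iter in seq_len(max_iter)) {
    # Step 3: centroid = vector of feature means within each cluster
    centroids <- do.call(rbind, lapply(1:K, function(k) {
      members <- x[cluster == k, , drop = FALSE]
      # Assumption: a cluster that becomes empty is re-seeded at a random observation
      if (nrow(members) == 0) x[sample(n, 1), ] else colMeans(members)
    }))
    # Step 4: squared Euclidean distance from every observation to every centroid
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, centroids[k, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Step 5: stop when no observation changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centroids)
}

# Two tight, far-apart groups of five points each
set.seed(1)
x <- rbind(matrix(0, 5, 2), matrix(100, 5, 2))
fit <- simple_kmeans(x, K = 2)
fit$cluster
```

On real data the result depends on the random start, which is why kmeans is normally called with nstart greater than 1.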

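10.3.2 Hierarchical clustering

step1: treat each of the n observations as its own cluster, compute the similarity (Euclidean distance or correlation) between every pair of clusters, and fuse the most similar pair.

step2: for the remaining n-1 clusters, compute the similarity between every pair and again fuse the most similar pair. The similarity between a fused cluster and another cluster is usually computed with complete linkage (compute the similarity to every member of the fused cluster and take the lowest one, i.e. the largest dissimilarity) or average linkage (take the mean). Repeat until all observations belong to a single cluster.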
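
To make the two linkage rules concrete, the sketch below works through a toy example of three observations on a line and compares the hand-computed values with the fusion heights reported by hclust. The data and the variable names are made up for illustration.

```r
# Three observations on a line: 0, 1 and 5
x <- matrix(c(0, 1, 5), ncol = 1)
d <- as.matrix(dist(x))   # pairwise Euclidean distances

# Step 1 fuses observations 1 and 2 (distance 1, the smallest)
fused <- c(1, 2)
other <- 3

# Dissimilarity between the fused cluster {1, 2} and observation 3
complete_link <- max(d[fused, other])   # largest pairwise distance
average_link <- mean(d[fused, other])   # mean pairwise distance
c(complete = complete_link, average = average_link)

# hclust records the same values as fusion heights
h_complete <- hclust(dist(x), method = "complete")$height
h_average <- hclust(dist(x), method = "average")$height
```

Complete linkage uses the farthest pair of points, so it fuses the last two clusters at a greater height than average linkage does.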

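In the dendrogram, the scale on the left measures dissimilarity: the higher up, the more dissimilar. Note that observations 9 and 2 are exactly as dissimilar as observations 9 and 7, with a dissimilarity of 1.8 in both cases, because 9 joins the cluster that already contains both 2 and 7 in a single fusion at that height.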

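--Should similarity be measured by Euclidean distance or by correlation?

In the accompanying figure, observations 1 and 2 are more strongly correlated, while observations 1 and 3 are closer in Euclidean distance.

When we group shoppers by the contents of their baskets, Euclidean distance puts shoppers with similar spending levels (amounts) in the same cluster, whereas correlation puts shoppers who buy similar items in the same cluster. If the goal is to identify shoppers' preferences, define shopper types, and recommend related products, correlation is the better similarity measure.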
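
The shopper example can be made concrete with a toy basket matrix. The numbers below are invented purely for illustration; rows are shoppers and columns are quantities of three products.

```r
# Hypothetical baskets: rows are shoppers, columns are quantities of three products
baskets <- rbind(
  A = c(1, 2, 3),     # small spender, prefers product 3
  B = c(10, 20, 30),  # big spender, same preference pattern as A
  C = c(3, 2, 1)      # small spender, opposite preference pattern
)

euclid <- as.matrix(dist(baskets))   # Euclidean distance between shoppers
corr_dis <- 1 - cor(t(baskets))      # correlation-based dissimilarity between shoppers

round(euclid, 2)
round(corr_dis, 2)
```

By Euclidean distance A is closest to C (similar volume), while by correlation A is closest to B (same preference pattern).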

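--Do the variables need to be standardized?

Scale each feature to standard deviation 1 when the p features are measured in different units; when they share the same units, scaling is not required.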

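10.3.3 Practical issues in clustering

K-means clustering: choose K

hierarchical clustering: choose the linkage, choose the dissimilarity measure, and choose where to cut the dendrogram (i.e. the number of clusters)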
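
Both choices can be explored in code. The sketch below, on simulated two-group data, tabulates the total within-cluster sum of squares for several values of K (the "elbow" heuristic, which is an assumption here and not something these notes prescribe) and cuts a dendrogram either by number of clusters or by height.

```r
set.seed(2)
# Simulated data with two shifted groups
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# K-means: total within-cluster sum of squares for K = 1, ..., 6
# (K = 1 is the total sum of squares around the overall mean)
wss <- c(sum(scale(x, scale = FALSE)^2),
         sapply(2:6, function(k) kmeans(x, k, nstart = 20)$tot.withinss))
round(wss, 1)   # look for the K after which the decrease levels off

# Hierarchical clustering: cut the tree by number of clusters or by height
hc <- hclust(dist(x), method = "complete")
by_k <- cutree(hc, k = 2)
by_h <- cutree(hc, h = mean(tail(hc$height, 2)))  # between the last two fusions
table(by_k, by_h)
```

Cutting between the last two fusion heights yields the same two clusters as asking for k = 2 directly.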

R code:

#PCA

dim(USArrests)

#scale first

pca=prcomp(USArrests,scale=TRUE);pca

summary(pca)

plot(pca)

#loadings of all principal components

pca$rotation

#scores of all principal components

head(pca$x)

#scale=0 makes the arrows in the biplot scaled to represent the loadings

biplot(pca,scale=0)

#use the PVE to decide how many components to keep

pve=(pca$sdev^2)/sum(pca$sdev^2);pve

plot(pve,xlab="principal component",ylab="proportion of variance explained",ylim=c(0,1),type="b")

plot(cumsum(pve),xlab="principal component",ylab="cumulative pve",ylim=c(0,1),type="b")

#k-means Clustering

set.seed(2)

x=matrix(rnorm(50*2),ncol=2)

x[1:25,1]=x[1:25,1]+3

x[1:25,2]=x[1:25,2]-4

plot(x)

#with nstart=1 we only get a local optimum (a local minimum of the within-cluster sum of squares), so use nstart=20 random starts and keep the best result

km=kmeans(x,3,nstart=20);km

km$cluster

#note the arguments: plot x first, then color the points by the cluster assignments in km

plot(x,col=(km$cluster+1),main="k-means clustering result with k=3",pch=20,cex=2)

#hierarchical Clustering

#choose the dissimilarity measure (dist, or as.dist for a custom one) and the linkage between a fused cluster and the other clusters (complete or average)

hier1=hclust(dist(x),method="complete")

hier2=hclust(dist(x),method="average")

par(mfrow=c(1,2))

plot(hier1,main="complete linkage")

plot(hier2,main="average linkage")

#choose how many clusters you want by cutting the tree with cutree(), as in the NCI60 example below

#if we want to scale the sample

xsc=scale(x)

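plot(hclust(dist(xsc),method="complete"))

#to use correlation as the similarity measure: cor(x) gives correlations between columns, but we want correlations between rows (observations), so use cor(t(x))

#with only 2 features the correlation between any two observations is always +1 or -1, so use data with 3 features here

x3=matrix(rnorm(30*3),ncol=3)

dd=as.dist(1-cor(t(x3)))

plot(hclust(dd,method="complete"),main="complete linkage with correlation-based distance")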

#example

library(ISLR)

nci.labs=NCI60$labs

nci.data=NCI60$data

dim(nci.data)

nci.labs[1:4]

table(nci.labs)

#PCA

pca=prcomp(nci.data,scale=TRUE)

summary(pca)

plot(pca)

pve=pca$sdev^2/sum(pca$sdev^2)

par(mfrow=c(1,2))

plot(pve,main="pve",type="o")

plot(cumsum(pve),main="cumsum-pve",type="o")

Cols=function(vec){

cols=rainbow(length(unique(vec)))

return(cols[as.numeric(as.factor(vec))])

}

par(mfrow=c(1,2))

plot(pca$x[,1:2],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z2")

plot(pca$x[,c(1,3)],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z3")

#clustering

xsc=scale(nci.data)

hc=hclust(dist(xsc),method="complete")

hc.cluster=cutree(hc,4)

table(hc.cluster,nci.labs)

plot(hc,labels=nci.labs)

#instead of choosing the number of clusters, we can choose the dissimilarity (height) at which to cut the tree

abline(h=139,col="blue")

#compare kmeans and hc

set.seed(2)

km.cluster=kmeans(xsc,4,nstart=20)$cluster

hc.cluster=cutree(hclust(dist(xsc),method="complete"),4)

table(km.cluster,hc.cluster)