Dimensionality Reduction——LDA線性判別分析實現篇

01-23

緊接著上文Dimensionality Reduction--LDA線性判別分析原理篇 - 知乎專欄

LDA降維演算法：

從上文中我們可以總結出，LDA降維一般分為5個步驟：

計算數據集中每個類別樣本的均值向量。
通過均值向量，計算類間散度矩陣 $S_B$ 和類內散度矩陣 $S_W$
對 $S_W^{-1}S_BW=lambda W$ 進行特徵值求解，求出 $S_W^{-1}S_B$ 的特徵向量和特徵值。
對特徵向量按照特徵值的大小降序排列，並選擇前K個特徵向量組成投影矩陣W。
通過D*K維的特徵值矩陣將樣本點投影到新的子空間中，Y=X*W。

LDA step by step：

首先我們選擇鳶尾花的數據集來進行實驗，下載地址：UCI Machine Learning Repository: Iris Data Set。

step1：處理數據

import pandas as pd# u may download data from (https://archive.ics.uci.edu/ml/datasets/Iris).df = pd.read_csv("iris.data", header=None)feature_dict = {i:label for i,label in zip( range(4), ("sepal length in cm", "sepal width in cm", "petal length in cm", "petal width in cm", ))}df.columns = [l for i,l in sorted(feature_dict.items())] + ["class label"]df.dropna(how="all", inplace=True) # to drop the empty line at file-enddf.tail()

可以看到這裡面的類別還是文本形式，我們一般會將其處理為數字。

from sklearn.preprocessing import LabelEncoderX = df[[0,1,2,3]].valuesy = df["class label"].valuesenc = LabelEncoder()label_encoder = enc.fit(y)y = label_encoder.transform(y) + 1label_dict = {1: "Setosa", 2: "Versicolor", 3:"Virginica"}

為了更直觀的感受數據的分布，由於特徵是4維的，我們也畫不出4維空間的圖，因此可以將每個特徵的分布情況都畫出來。

from matplotlib import pyplot as pltimport numpy as npimport mathfig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,6))for ax,cnt in zip(axes.ravel(), range(4)): # set bin sizes min_b = math.floor(np.min(X[:,cnt])) max_b = math.ceil(np.max(X[:,cnt])) bins = np.linspace(min_b, max_b, 25) # plottling the histograms for lab,col in zip(range(1,4), ("blue", "red", "green")): ax.hist(X[y==lab, cnt], color=col, label="class %s" %label_dict[lab], bins=bins, alpha=0.5,) ylims = ax.get_ylim() # plot annotation leg = ax.legend(loc="upper right", fancybox=True, fontsize=8) leg.get_frame().set_alpha(0.5) ax.set_ylim([0, max(ylims)+2]) ax.set_xlabel(feature_dict[cnt]) ax.set_title("Iris histogram #%s" %str(cnt+1)) # hide axis ticks ax.tick_params(axis="both", which="both", bottom="off", top="off", labelbottom="on", left="off", right="off", labelleft="on") # remove axis spines ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["bottom"].set_visible(False) ax.spines["left"].set_visible(False) axes[0][0].set_ylabel("count")axes[1][0].set_ylabel("count")fig.tight_layout() plt.show()

從上圖的結果中可以看出，花瓣的長度和寬度看起來在不同的類別有較強的區分性，因為這2個特徵的直方圖中，不同的類別的樣本較為分散。

因此在實際應用中，我們對特徵進行降維，除了使用類似於LDA的特徵投影方法（或者叫extraction），特徵選擇（selection）也是一種較好的方式。像上圖這種低緯度的數據集，看一眼直方圖我們就可以做出一定的判斷。

step 2：計算D維特徵樣本的均值向量

np.set_printoptions(precision=4)mean_vectors = []for cl in range(1,4): mean_vectors.append(np.mean(X[y==cl], axis=0)) print("Mean Vector class %s: %s " %(cl, mean_vectors[cl-1]))

這部分比較簡單就不細說了。

step 3 ：計算散度矩陣

# 計算類內散度矩陣：SwS_W = np.zeros((4,4))for cl,mv in zip(range(1,4), mean_vectors): class_sc_mat = np.zeros((4,4)) # scatter matrix for every class for row in X[y == cl]: row, mv = row.reshape(4,1), mv.reshape(4,1) # make column vectors class_sc_mat += (row-mv).dot((row-mv).T) S_W += class_sc_mat # sum class scatter matricesprint("within-class Scatter Matrix: ", S_W)

within-class Scatter Matrix：

[[ 38.9562, 13.683 , 24.614 , 5.6556],[ 13.683 , 17.035 , 8.12 , 4.9132],[ 24.614 , 8.12 , 27.22 , 6.2536],[ 5.6556, 4.9132, 6.2536, 6.1756]]

# 計算類間三度矩陣:Sboverall_mean = np.mean(X, axis=0)S_B = np.zeros((4,4))for i,mean_vec in enumerate(mean_vectors): n = X[y==i+1,:].shape[0] mean_vec = mean_vec.reshape(4,1) # make column vector overall_mean = overall_mean.reshape(4,1) # make column vector S_B += n * (mean_vec - overall_mean).dot((mean_vec - overall_mean).T)print("between-class Scatter Matrix: ", S_B)

between-class Scatter Matrix:

[[ 63.2121, -19.534 , 165.1647, 71.3631],[ -19.534 , 10.9776, -56.0552, -22.4924],[ 165.1647, -56.0552, 436.6437, 186.9081],[ 71.3631, -22.4924, 186.9081, 80.6041]

step 4：求解 $S_W^{-1}S_B$ 的特徵值問題

eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))for i in range(len(eig_vals)): eigvec_sc = eig_vecs[:,i].reshape(4,1) print(" Eigenvector {}: {}".format(i+1, eigvec_sc.real)) print("Eigenvalue {:}: {:.2e}".format(i+1, eig_vals[i].real))

step 5：選擇新的特徵空間

# Make a list of (eigenvalue, eigenvector) tupleseig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]# Sort the (eigenvalue, eigenvector) tuples from high to loweig_pairs = sorted(eig_pairs, key=lambda k: k[0], reverse=True)# Visually confirm that the list is correctly sorted by decreasing eigenvaluesprint("Eigenvalues in decreasing order: ")for i in eig_pairs: print (i[0], i[1])

先將特徵向量按照特徵值的大小降序排列，線代中告訴我我們，矩陣乘法可以看做一種線性變換，而特徵向量和特徵值代表了變換後的方向以及該方向上的縮放比例，因此特徵值越大，說明這個方向在變換中越顯著，也就是信息量最大。因此我們需要拋棄的是特徵值較小的方向，因此我們只需要選取前topk個特徵值對應的特徵向量，就得到了映射矩陣W。

從上面的特徵值可以看到有2個特徵值非常接近0，這2個值之所以接近0，一是代表了他們不包含信息量，第二是因為浮點運算的精確度問題。

實際上這2分特徵值應該就是0，因為在LDA中，如果有C類，線性判別式最多只有C-1個，因此對於之前3類的數據集，最多只有2個特徵值。

# 我們通過特徵值的比例來體現方差的分布：print("Variance explained: ")eigv_sum = sum(eig_vals)for i,j in enumerate(eig_pairs): print("eigenvalue {0:}: {1:.2%}".format(i+1, (j[0]/eigv_sum).real))

# 選擇K個特徵向量作為映射矩陣，這裡選了前2個有信息的。W = np.hstack((eig_pairs[0][1].reshape(4,1), eig_pairs[1][1].reshape(4,1)))print("Matrix W: ", W.real)

step 6：將樣本投影到新的空間

X_lda = X.dot(W)assert X_lda.shape == (150,2), "The matrix is not 150x2 dimensional."

到此為止，我們已經得到了LDA降維後的結果，降維後為2維特徵，可以使用matpoltlib進行可視化：

from matplotlib import pyplot as pltdef plot_step_lda(): ax = plt.subplot(111) for label,marker,color in zip( range(1,4),("^", "s", "o"),("blue", "red", "green")): plt.scatter(x=X_lda[:,0].real[y == label], y=X_lda[:,1].real[y == label], marker=marker, color=color, alpha=0.5, label=label_dict[label] ) plt.xlabel("LD1") plt.ylabel("LD2") leg = plt.legend(loc="upper right", fancybox=True) leg.get_frame().set_alpha(0.5) plt.title("LDA: Iris projection onto the first 2 linear discriminants") # hide axis ticks plt.tick_params(axis="both", which="both", bottom="off", top="off", labelbottom="on", left="off", right="off", labelleft="on") # remove axis spines ax.spines["top"].set_visible(False) ax.spines["right"].set_visible(False) ax.spines["bottom"].set_visible(False) ax.spines["left"].set_visible(False) plt.grid() plt.tight_layout plt.show()plot_step_lda()

從上面這個散點圖中，我們可以看出，第一個線性判別LD1可以很好地將類別區分開來，但是第二個線性判別LD2，由於不含有充分的信息（可以從之前的特徵向量中看出）。並沒有很好的區分性。

LDA的教程到此結束，相關代碼已經放在t2396156/zhihuResourse

之後還會有PCA等相關內容，敬請期待。

參考網頁：

Linear Discriminant Analysis