PCA Dimensionality Reduction in Python

02-04

Feature Dimensionality Reduction with PCA

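Feature dimensionality reduction is another application of unsupervised learning. In image recognition, training samples often have very high feature dimensionality (pixel matrices), which makes the data hard to visualize and makes model training slow and costly. Dimensionality reduction reconstructs effective low-dimensional feature vectors and also makes visualization possible. Principal Component Analysis (PCA) is the most widely used technique for this.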

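Put simply, PCA is a feature reconstruction technique: it maps the original feature space onto a new one whose axes are mutually orthogonal, and keeps the low-dimensional features that retain as much discriminative information as possible. The features are not standardized before PCA in this example, because all 64 pixel intensities share the same scale and PCA reconstructs new features from them. (PCA is sensitive to scale, so features measured in different units are generally standardized first.) The rest of this article uses handwritten digit recognition (the digits dataset built into sklearn) as the example.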
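Before the digits example, the mapping itself can be sketched in plain NumPy. This is a minimal illustration, not how sklearn is implemented internally: the helper name `pca_fit_transform` and the random toy data are assumptions made for this sketch. It centers the data, takes the orthogonal directions from a singular value decomposition, and projects onto the leading ones.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X onto its leading principal components (illustrative helper)."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance of the data along each kept direction
    explained_variance = singular_values[:n_components] ** 2 / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

# Hypothetical toy data: 200 samples with 5 correlated features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Z, components, variances = pca_fit_transform(X_demo, n_components=2)

print(Z.shape)                                            # (200, 2)
print(np.allclose(components @ components.T, np.eye(2)))  # True: directions are orthonormal
```

The projected columns of `Z` are uncorrelated with each other, which is what "mutually orthogonal" means for the new feature space.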

Getting the training and test samples

# Import the data loader
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
# 64 dimensions, 1797 samples
>>> digits.data.shape
(1797, 64)
>>> digits.target.shape
(1797,)
# Split the data, holding out 25% for testing
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=33)

Training an SVM directly on the original features

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import LinearSVC
# Standardize the data
>>> ss = StandardScaler()
>>> X_dire_train = ss.fit_transform(X_train)
>>> X_dire_train.shape
(1347, 64)
>>> X_dire_test = ss.transform(X_test)
# Initialize the linear support vector classifier
>>> lsvc = LinearSVC()
# Train the model
>>> lsvc.fit(X_dire_train, y_train)
# Predict
>>> y_dire_predict = lsvc.predict(X_dire_test)

Evaluating the SVM trained directly on the original features

# Assess accuracy with the model's built-in score function
>>> print('The Accuracy of Linear SVC is', lsvc.score(X_dire_test, y_test))
# Import classification_report for a more detailed analysis of the predictions
>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, y_dire_predict, target_names=digits.target_names.astype(str)))

Output:

The Accuracy of Linear SVC is 0.953333333333
             precision    recall  f1-score   support

          0       0.92      1.00      0.96        35
          1       0.96      0.98      0.97        54
          2       0.98      1.00      0.99        44
          3       0.93      0.93      0.93        46
          4       0.97      1.00      0.99        35
          5       0.94      0.94      0.94        48
          6       0.96      0.98      0.97        51
          7       0.92      1.00      0.96        35
          8       0.98      0.84      0.91        58
          9       0.95      0.91      0.93        44

avg / total       0.95      0.95      0.95       450

Reducing feature dimensionality with PCA

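>>> from sklearn.decomposition import PCA
>>> estimator = PCA(n_components=20)  # initialize: compress 64 dimensions to 20
# Use the training features to determine (fit) the directions of the 20 orthogonal dimensions, then transform the training features
>>> pca_X_train = estimator.fit_transform(X_train)
>>> pca_X_train.shape
(1347, 20)  # dimensionality reduced from 64 to 20
# Transform the test features along the same 20 orthogonal directions
>>> pca_X_test = estimator.transform(X_test)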
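The choice of 20 components can be sanity-checked by looking at how much variance the leading components retain. The following is a rough sketch rather than part of the original walkthrough: it fits PCA on the full digits data (an assumption made for brevity; the article fits on the training split only) and reads the `explained_variance_ratio_` attribute of the fitted estimator.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Fit with all 64 components to inspect the full variance spectrum
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Share of the total variance kept by the first 20 components
print(f"Variance retained by 20 components: {cumulative[19]:.3f}")

# Smallest number of components that keeps at least 95% of the variance
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% of the variance: {n_for_95}")
```

If the retained share is too low for the task, `n_components` can be raised; sklearn's `PCA` also accepts a float between 0 and 1 for `n_components` to select the count from a target variance share.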

Classification with a linear support vector machine

>>> from sklearn.svm import LinearSVC
>>> pca_svc = LinearSVC()
>>> pca_svc.fit(pca_X_train, y_train)  # train the model
>>> y_predict = pca_svc.predict(pca_X_test)  # predict

Evaluating model performance

>>> import numpy as np
>>> from sklearn.metrics import classification_report
# Built-in score function
>>> print('The Accuracy of Linear SVC after PCA is', pca_svc.score(pca_X_test, y_test))
# Detailed report of precision, recall and F1 score
>>> print(classification_report(y_test, y_predict, target_names=np.arange(10).astype(str)))

Output:

The Accuracy of Linear SVC after PCA is 0.935555555556
             precision    recall  f1-score   support

          0       0.94      0.97      0.96        35
          1       0.88      0.93      0.90        54
          2       1.00      0.98      0.99        44
          3       0.91      0.89      0.90        46
          4       1.00      0.94      0.97        35
          5       0.94      0.92      0.93        48
          6       0.96      0.98      0.97        51
          7       0.95      1.00      0.97        35
          8       0.90      0.90      0.90        58
          9       0.93      0.89      0.91        44

avg / total       0.94      0.94      0.94       450

Summary

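Compared with training directly on the original, unreduced samples, the PCA-reduced data loses a little prediction accuracy (about 0.02, from 0.953 to 0.936). Dimensionality reduction removes a great deal of feature redundancy and noise, but it also discards some useful pattern information. In exchange, compressing 64 dimensions down to 20 saves model training time and makes the model easier to train, which is a worthwhile trade-off for high-dimensional samples.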
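The accuracy-versus-cost trade-off can be checked with a small side-by-side run. This is a sketch under stated assumptions: the use of `make_pipeline`, the `random_state=0` on the classifier, and the dictionary labels are choices made for this illustration, and the measured fit times depend on the machine and the scikit-learn version. On a dataset as small as digits the time difference may be negligible.

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)

# The two setups compared in the article: standardized raw pixels vs. 20 PCA components
models = {
    "64 standardized features": make_pipeline(StandardScaler(), LinearSVC(random_state=0)),
    "20 PCA components": make_pipeline(PCA(n_components=20), LinearSVC(random_state=0)),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = model.score(X_test, y_test)
    results[name] = (accuracy, elapsed)
    print(f"{name}: accuracy={accuracy:.3f}, fit time={elapsed:.3f}s")
```

Wrapping each setup in a pipeline keeps the scaler or PCA fitted on the training split only, so the test split is transformed with the same parameters, as in the walkthrough above.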

Author: Genius, columnist for Python愛好者社區. Please do not repost.
Source: PCA Dimensionality Reduction in Python
