scikit-learn in Practice

02-12

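The previous post, "Machine Learning with Python (5): scikit-learn", briefly introduced the basics of scikit-learn. This post works through a public dataset with scikit-learn. It also uses numpy, pandas, and matplotlib; see the earlier posts for those.

Loading the Data

First, the data has to be loaded into memory before we can work with it. Scikit-Learn uses NumPy arrays in its implementation, so we will use NumPy to load the *.csv file. Let's download the Pima Indians diabetes dataset from the UCI Machine Learning Repository. The dataset has nine columns: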

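Number of pregnancies
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skinfold thickness (mm)
2-hour serum insulin (μU/ml)
Body mass index (weight in kg / (height in m)^2)
Diabetes pedigree function
Age
Label: whether the person has diabetes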

    import numpy as np
    import urllib.request

    # url with dataset
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
    # download the file
    raw_data = urllib.request.urlopen(url)
    # load the CSV file as a numpy matrix
    dataset = np.loadtxt(raw_data, delimiter=",")
    # separate the data from the target attributes
    X = dataset[:, 0:8]
    y = dataset[:, 8]
    print("size:", dataset.size)

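X holds the feature vectors and y is the target variable.

Data Standardization

Most gradient-based methods (which underlie a large share of machine learning algorithms) are sensitive to the scale of the data, so before running an algorithm we should standardize or normalize the features. Standardization rescales each feature so that it has zero mean and unit variance. Normalization turns quantities with units into dimensionless ones on a common scale; min-max normalization, for example, maps each feature into the range 0 to 1. Scikit-Learn already provides functions for this. In the code below, preprocessing.scale standardizes each feature (column), while preprocessing.normalize rescales each sample (row) to unit norm, which is not the same as mapping values into the range 0 to 1.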

    from sklearn import preprocessing

    # standardize the data attributes
    standardized_X = preprocessing.scale(X)
    # normalize the data attributes
    normalized_X = preprocessing.normalize(X)
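To make the difference between these operations concrete, here is a minimal sketch on a tiny made-up array (the values in X_demo are illustrative assumptions, not taken from the diabetes dataset). It compares preprocessing.scale, preprocessing.normalize, and MinMaxScaler, the last of which is the one that maps each feature into the range 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Toy data: 3 samples (rows) x 2 features (columns); values are illustrative only.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

# scale(): per-column standardization to zero mean and unit variance.
standardized = preprocessing.scale(X_demo)

# normalize(): per-row rescaling so that each sample has unit L2 norm.
normalized = preprocessing.normalize(X_demo)

# MinMaxScaler: per-column mapping onto the [0, 1] range.
min_max = preprocessing.MinMaxScaler().fit_transform(X_demo)

print(standardized.mean(axis=0))            # column means after scale()
print(standardized.std(axis=0))             # column standard deviations after scale()
print(np.linalg.norm(normalized, axis=1))   # row norms after normalize()
print(min_max.min(axis=0), min_max.max(axis=0))
```

Which one to use depends on the algorithm: column-wise scaling is the usual choice for gradient-based and distance-based models, while row-wise normalization is more common when only the direction of each sample vector matters.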

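Feature Selection

Without a doubt, the most important part of solving a problem is the ability to select features properly, or even to create them. These are called feature selection and feature engineering. Feature engineering is a fairly creative process that often relies on intuition and domain knowledge, but for feature selection there are already many algorithms that can be used directly. Tree-based algorithms, for example, can compute how informative each feature is.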

    from sklearn import metrics
    from sklearn.ensemble import ExtraTreesClassifier

    model = ExtraTreesClassifier()
    model.fit(X, y)
    # display the relative importance of each attribute
    print(model.feature_importances_)

Output:

    [ 0.11193263  0.26076795  0.10153987  0.08278266  0.07190955  0.12292174  0.11527441  0.13287119]

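The other methods are all based on an efficient search over subsets of features to find the best subset, meaning the one on which the fitted model has the best quality. Recursive Feature Elimination (RFE) is one of these search algorithms, and Scikit-Learn provides it as well.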

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    # create the RFE model and select 3 attributes
    # (recent scikit-learn releases require n_features_to_select as a keyword)
    rfe = RFE(model, n_features_to_select=3)
    rfe = rfe.fit(X, y)
    # summarize the selection of the attributes
    print(rfe.support_)
    print(rfe.ranking_)

Output:

    [ True False False False False  True  True False]
    [1 2 3 5 6 1 1 4]

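Developing the Algorithms

As I said, Scikit-Learn implements all the basic machine learning algorithms. Let's take a look at some of them.

Logistic Regression

Logistic regression is mostly used for classification problems (binary classification), but it also works for multiclass classification (through the so-called one-vs-rest method). An advantage of this algorithm is that for every object it outputs a probability for each class.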

    from sklearn import metrics
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

Output:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
                 precision    recall  f1-score   support

            0.0       0.79      0.90      0.84       500
            1.0       0.74      0.55      0.63       268

    avg / total       0.77      0.77      0.77       768

    [[448  52]
     [121 147]]
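The per-class probabilities mentioned above are available through predict_proba. The following is a small sketch on synthetic stand-in data generated with make_classification (an assumption made so that the example runs without downloading anything; X_demo and y_demo are not the diabetes data, and max_iter=1000 is an arbitrary choice).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# One row per sample, one column per class, ordered as in model.classes_.
probabilities = model.predict_proba(X_demo[:5])
print(model.classes_)
print(probabilities)
```

Each row sums to 1, and predict returns the class whose column holds the largest probability.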

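Accuracy is defined as the ratio of the number of samples the classifier labels correctly to the total number of samples in the given test set.

Precision is the proportion of correctly retrieved items (TP) among all items actually retrieved (TP + FP).

Recall is the proportion of correctly retrieved items (TP) among all items that should have been retrieved (TP + FN).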

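F1-score

The F1-score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). Recall reflects how well the classification model identifies positive samples: the higher the recall, the more of the true positives it finds. Precision reflects how well the model keeps negative samples out of its positive predictions: the higher the precision, the fewer negatives are mistaken for positives. The F1-score combines the two, and a higher F1-score indicates a more robust classifier.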
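As a worked check of these definitions, the sketch below recomputes the metrics for class 1.0 by hand from the logistic regression confusion matrix printed earlier ([[448 52] [121 147]], where rows are true classes and columns are predicted classes). The variable names are my own.

```python
# Confusion matrix from the logistic regression output:
# rows are true classes, columns are predicted classes.
tn, fp = 448, 52    # true class 0.0: predicted 0.0, predicted 1.0
fn, tp = 121, 147   # true class 1.0: predicted 0.0, predicted 1.0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.77 precision=0.74 recall=0.55 f1=0.63
```

The precision, recall, and F1 values agree with the 1.0 row of the classification report.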

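Naive Bayes

Naive Bayes is also one of the best-known machine learning algorithms. Its main task is to recover the density of the data distribution of the training samples. It often performs well on multiclass classification problems.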

    from sklearn import metrics
    from sklearn.naive_bayes import GaussianNB

    model = GaussianNB()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

Output:

    GaussianNB(priors=None)
                 precision    recall  f1-score   support

            0.0       0.80      0.84      0.82       500
            1.0       0.68      0.62      0.64       268

    avg / total       0.76      0.76      0.76       768

    [[421  79]
     [103 165]]

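k-Nearest Neighbors

The kNN (k-nearest neighbors) method is often used as one part of a more complex classification algorithm. For example, we can use its estimate as a feature of an object. Sometimes a simple kNN performs very well on well-chosen features. When its parameters (mainly the distance metric) are set properly, the algorithm can also perform very well on regression problems.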

    from sklearn import metrics
    from sklearn.neighbors import KNeighborsClassifier

    # fit a k-nearest neighbor model to the data
    model = KNeighborsClassifier()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

Output:

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_jobs=1, n_neighbors=5, p=2,
               weights='uniform')
                 precision    recall  f1-score   support

            0.0       0.83      0.88      0.85       500
            1.0       0.75      0.65      0.70       268

    avg / total       0.80      0.80      0.80       768

    [[442  58]
     [ 93 175]]

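Decision Trees

Classification and regression trees (CART) are often used for problems in which the objects have categorical features, and they apply to both regression and classification. Decision trees are well suited to multiclass classification.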

    from sklearn import metrics
    from sklearn.tree import DecisionTreeClassifier

    # fit a CART model to the data
    model = DecisionTreeClassifier()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

Output:

    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, presort=False, random_state=None,
                splitter='best')
                 precision    recall  f1-score   support

            0.0       1.00      1.00      1.00       500
            1.0       1.00      1.00      1.00       268

    avg / total       1.00      1.00      1.00       768

    [[500   0]
     [  0 268]]

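Support Vector Machines

SVM (support vector machine) is one of the most popular machine learning algorithms and is used mainly for classification. Like logistic regression, SVM can handle multiclass classification with the help of the one-vs-rest method.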

    from sklearn import metrics
    from sklearn.svm import SVC

    # fit a SVM model to the data
    model = SVC()
    model.fit(X, y)
    print(model)
    # make predictions
    expected = y
    predicted = model.predict(X)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

Output:

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
                 precision    recall  f1-score   support

            0.0       1.00      1.00      1.00       500
            1.0       1.00      1.00      1.00       268

    avg / total       1.00      1.00      1.00       768

    [[500   0]
     [  0 268]]
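The perfect scores reported for the decision tree and the SVM come from predicting on the same data the models were fitted on, so they say little about how the models would do on unseen data. A common check is to hold out part of the data. The sketch below does this with train_test_split on synthetic stand-in data (make_classification with some label noise; the data, the 25% test size, and the random seeds are all assumptions for illustration, not the post's setup).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with label noise (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, flip_y=0.1, random_state=0)

# Hold out 25% of the samples; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on the data used for fitting versus accuracy on held-out data.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

A large gap between the two numbers suggests the model has memorized the training samples rather than learned something that generalizes.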

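How to Optimize Algorithm Parameters

One of the hardest steps in building an effective algorithm is choosing the right parameters. Experience usually makes this easier, but either way we have to search. Fortunately, Scikit-Learn provides many functions to help with this.

As an example, let's look at choosing the regularization parameter, where several candidate values are tried one after another: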

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # prepare a range of alpha values to test
    alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
    # create and fit a ridge regression model, testing each alpha
    model = Ridge()
    grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
    grid.fit(X, y)
    print(grid)
    # summarize the results of the grid search
    print(grid.best_score_)
    print(grid.best_estimator_.alpha)

Output:

    GridSearchCV(cv=None, error_score='raise',
           estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
           fit_params={}, iid=True, n_jobs=1,
           param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
             1.00000e-04, 0.00000e+00])},
           pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
           scoring=None, verbose=0)
    0.279617559313
    1.0

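Sometimes it is more efficient to draw parameter values at random from a given range, estimate the quality of the algorithm for each of them, and then pick the best: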

    import numpy as np
    from scipy.stats import uniform as sp_rand
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import RandomizedSearchCV

    # prepare a uniform distribution to sample for the alpha parameter
    param_grid = {'alpha': sp_rand()}
    # create and fit a ridge regression model, testing random alpha values
    model = Ridge()
    rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
    rsearch.fit(X, y)
    print(rsearch)
    # summarize the results of the random parameter search
    print(rsearch.best_score_)
    print(rsearch.best_estimator_.alpha)

Output:

    RandomizedSearchCV(cv=None, error_score='raise',
              estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
              fit_params={}, iid=True, n_iter=100, n_jobs=1,
              param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x10efc1438>},
              pre_dispatch='2*n_jobs', random_state=None, refit=True,
              return_train_score=True, scoring=None, verbose=0)
    0.279617531252
    0.998565254036
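The uniform distribution used above only samples alpha between 0 and 1. When a regularization strength may span several orders of magnitude, one option is to sample on a log scale instead. The sketch below does this with scipy.stats.loguniform; the range 1e-4 to 10, the n_iter and cv values, and the synthetic make_regression data are all assumptions chosen for illustration, not settings from this post.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in regression data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0)

# Sample alpha log-uniformly, so each decade between 1e-4 and 10 is equally likely.
param_distributions = {"alpha": loguniform(1e-4, 1e1)}

search = RandomizedSearchCV(
    Ridge(),
    param_distributions=param_distributions,
    n_iter=20,        # number of sampled alpha values
    cv=5,             # 5-fold cross-validation for each sampled value
    random_state=0,   # makes the sampled values reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_["alpha"])
print(search.best_score_)
```

Fixing random_state makes the search repeatable, which helps when comparing it against a grid search over the same range.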

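That completes our walk through the whole process of using Scikit-Learn. In the next post I will cover feature engineering.

The code used in this post is available here: source code

References

Machine Learning: F1-Score, recall, precision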

Introduction to Machine Learning with Python and Scikit-Learn