Practical Guide | Use Cases for Five Machine Learning Methods in Scikit-Learn (Python Code)

01-26

WeChat public account: 全球人工智慧 (Global Artificial Intelligence)

Article source: datadw. Editor: 徐征

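Machine Learning with the scikit-learn Library

For newcomers who want to start working with machine learning algorithms but are afraid to dive in, figuring out how to get started quickly can be a real struggle.

Among people working in data science, the most commonly used tools are R and Python. Each has its pros and cons, but Python comes out somewhat ahead in many respects, largely because the scikit-learn library implements a wide range of machine learning algorithms.

Loading the Data

We assume the input is a feature matrix or a CSV file. First, the data has to be loaded into memory. scikit-learn works with NumPy arrays, so we use NumPy to load the CSV file. The data below is downloaded from the UCI Machine Learning Repository.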

import numpy as np

from urllib.request import urlopen

# url with dataset

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

# download the file (in Python 3, urlopen lives in urllib.request)

raw_data = urlopen(url)

# load the CSV file as a numpy matrix

dataset = np.loadtxt(raw_data, delimiter=",")

# separate the data from the target attributes:
# columns 0-7 hold the eight features, column 8 holds the class label

X = dataset[:,0:8]

y = dataset[:,8]

We will use this dataset as the running example, with the feature matrix as X and the target variable as y.
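To make the column slicing concrete without needing a network connection, here is a minimal sketch that parses a few made-up CSV rows from memory. The values and the names `csv_text`, `toy`, `toy_X` and `toy_y` are hypothetical and only mimic the layout assumed above: eight feature columns followed by one label column.

```python
import io
import numpy as np

# Three made-up rows: 8 feature columns followed by 1 class-label column.
csv_text = """1,2,3,4,5,6,7,8,0
2,3,4,5,6,7,8,9,1
3,4,5,6,7,8,9,10,1"""

# np.loadtxt accepts any file-like object, not only a downloaded file.
toy = np.loadtxt(io.StringIO(csv_text), delimiter=",")

toy_X = toy[:, :-1]   # every column except the last one: the features
toy_y = toy[:, -1]    # the last column: the target

print(toy_X.shape)
print(toy_y)
```

Slicing with `:-1` and `-1` avoids hard-coding the number of feature columns, which can help when the same script is reused on a different CSV file.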

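Data Normalization

Most gradient-based methods in machine learning are sensitive to the scale of the data, so before running an algorithm we should normalize or standardize the features. Normalization here rescales each sample to unit norm (for non-negative data, every value then falls in the 0-1 range), while standardization gives each feature zero mean and unit variance. scikit-learn provides functions for both: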

from sklearn import preprocessing

# normalize the data attributes

normalized_X = preprocessing.normalize(X)

# standardize the data attributes

standardized_X = preprocessing.scale(X)
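As a rough illustration of how these preprocessing functions differ, the sketch below applies them to a tiny made-up array (`demo` is a hypothetical name). `MinMaxScaler` is not used in this article; it is included here only as an assumption about what a reader might reach for when a strict 0-1 range per feature is wanted.

```python
import numpy as np
from sklearn import preprocessing

# Two features on very different scales (made-up values).
demo = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 900.0]])

# normalize: each ROW (sample) is rescaled to unit L2 norm.
normalized = preprocessing.normalize(demo)

# scale: each COLUMN (feature) is shifted to zero mean and unit variance.
standardized = preprocessing.scale(demo)

# MinMaxScaler: each COLUMN is mapped linearly onto the range [0, 1].
minmax = preprocessing.MinMaxScaler().fit_transform(demo)

print(np.linalg.norm(normalized, axis=1))
print(standardized.mean(axis=0), standardized.std(axis=0))
print(minmax.min(axis=0), minmax.max(axis=0))
```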

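Feature Selection

When solving a real problem, the ability to choose or construct the right features is especially important. This is called feature selection or feature engineering. Feature selection is a creative process that relies largely on intuition and domain knowledge, and there are also many ready-made algorithms for it.

The tree-based algorithm below estimates how informative each feature is: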

from sklearn import metrics

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()

model.fit(X, y)

# display the relative importance of each attribute

print(model.feature_importances_)
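The raw importance array is easier to read once it is sorted. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo` and `forest` are hypothetical names, and `make_classification` stands in for the real dataset) that ranks features from most to least important.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 6 features, of which only 2 carry signal.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# The importances sum to 1; sort the feature indices in descending order.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print("feature %d: %.3f" % (idx, importances[idx]))
```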

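Using the Algorithms

scikit-learn implements most of the basic machine learning algorithms. Let's take a quick look at them.

1. Logistic Regression

Many problems can be reduced to binary classification, which is where logistic regression is most often used. An advantage of this algorithm is that it outputs the probability that a sample belongs to each class.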

from sklearn import metrics

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X, y)

print(model)

# make predictions

expected = y

predicted = model.predict(X)

# summarize the fit of the model

print(metrics.classification_report(expected, predicted))

print(metrics.confusion_matrix(expected, predicted))

Result:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
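The class probabilities mentioned above are available through `predict_proba`. Here is a small sketch on synthetic data (the names `X_demo`, `y_demo` and `clf` are hypothetical, and `max_iter=1000` is simply a generous setting chosen for this illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data as a stand-in.
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_demo[:5])
labels = clf.predict(X_demo[:5])

print(clf.classes_)   # column order of the probability matrix
print(proba)
print(labels)
```

The predicted label is the class with the highest probability, so `predict` and `predict_proba` tell a consistent story; the probabilities add information about how confident the model is.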

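2. Naive Bayes

This is another well-known machine learning algorithm. Its task is to recover the distribution density of the training data, and it works well for multi-class classification.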

from sklearn import metrics

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

model.fit(X, y)

print(model)

# make predictions

expected = y

predicted = model.predict(X)

# summarize the fit of the model

print(metrics.classification_report(expected, predicted))

print(metrics.confusion_matrix(expected, predicted))

Result:

GaussianNB()

             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]
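To see what "recovering the distribution" looks like in practice, the sketch below fits `GaussianNB` on synthetic data (hypothetical names `X_demo`, `y_demo`, `nb`) and inspects the fitted class priors and per-class feature means, which are part of the Gaussian density the model estimates for each class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data as a stand-in for the real dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

nb = GaussianNB().fit(X_demo, y_demo)

# class_prior_: estimated probability of each class.
# theta_: mean of every feature within each class, shape (n_classes, n_features).
print(nb.class_prior_)
print(nb.theta_)

# The fitted means match the per-class sample means computed by hand.
manual_mean_class0 = X_demo[y_demo == 0].mean(axis=0)
print(manual_mean_class0)
```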

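3. k-Nearest Neighbors

The k-nearest neighbors (kNN) algorithm is often used as one component of a larger classification approach. For example, it can be used to evaluate features, which makes it useful in feature selection.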

from sklearn import metrics

from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data

model = KNeighborsClassifier()

model.fit(X, y)

print(model)

# make predictions

expected = y

predicted = model.predict(X)

# summarize the fit of the model

print(metrics.classification_report(expected, predicted))

print(metrics.confusion_matrix(expected, predicted))

Result:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           n_neighbors=5, p=2, weights='uniform')

             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
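The main knob of kNN is the number of neighbors. A rough sketch, assuming synthetic data and hypothetical names (`X_demo`, `y_demo`, `scores`), of how one might compare a few values of `n_neighbors` with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     random_state=0)

# Mean 5-fold cross-validated accuracy for several neighborhood sizes.
scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()
    print(k, round(scores[k], 3))
```

The same loop can be run on different subsets of columns, which is one simple way to use kNN for judging how useful a group of features is.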

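4. Decision Trees

Classification and Regression Trees (CART) are commonly used for classification or regression problems whose features carry categorical information, and the method is well suited to multi-class classification.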

from sklearn import metrics

from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data

model = DecisionTreeClassifier()

model.fit(X, y)

print(model)

# make predictions

expected = y

predicted = model.predict(X)

# summarize the fit of the model

print(metrics.classification_report(expected, predicted))

print(metrics.confusion_matrix(expected, predicted))

Result:

DecisionTreeClassifier(compute_importances=None, criterion='gini',
            max_depth=None, max_features=None, min_density=None,
            min_samples_leaf=1, min_samples_split=2, random_state=None,
            splitter='best')

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
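A perfect score like this is expected when an unrestricted tree is evaluated on the same data it was trained on, so it says little about how the model generalizes. The sketch below, on synthetic data with some label noise (all names are hypothetical), holds out part of the data to get a more honest estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.1 randomly reassigns about 10% of the labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     flip_y=0.1, random_state=0)

# Keep 30% of the samples aside for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)   # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)      # accuracy on held-out data

print(train_acc, test_acc)
```

The gap between the two numbers is a quick indicator of overfitting; limiting `max_depth` or raising `min_samples_leaf` are common ways to narrow it.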

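5. Support Vector Machines

SVM is a very popular machine learning algorithm, used mainly for classification. As with logistic regression, it can handle multi-class classification with a one-vs-rest approach.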

from sklearn import metrics

from sklearn.svm import SVC

# fit a SVM model to the data

model = SVC()

model.fit(X, y)

print(model)

# make predictions

expected = y

predicted = model.predict(X)

# summarize the fit of the model

print(metrics.classification_report(expected, predicted))

print(metrics.confusion_matrix(expected, predicted))

Result:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
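The one-vs-rest idea can be made explicit with `OneVsRestClassifier`, which trains one binary SVM per class. This is only a sketch on synthetic three-class data with hypothetical names; note that `SVC` used on its own handles multi-class input with a one-vs-one scheme internally, so the wrapper is what makes the strategy one-vs-rest here.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with three classes.
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     n_classes=3, random_state=0)

# One binary SVC is fitted per class (that class versus all the others).
ovr = OneVsRestClassifier(SVC()).fit(X_demo, y_demo)
predictions = ovr.predict(X_demo)

print(len(ovr.estimators_))   # number of underlying binary classifiers
print(predictions[:5])
```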

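Beyond classification and regression algorithms, scikit-learn also offers more complex algorithms such as clustering, and implements ensemble techniques such as Bagging and Boosting.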
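As a brief illustration of the ensemble techniques just mentioned, the sketch below compares a bagging ensemble with a boosting ensemble on synthetic data. The choice of `BaggingClassifier` and `AdaBoostClassifier`, the number of estimators, and all variable names are assumptions made for this example, not something the article prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

ensembles = (
    ("bagging", BaggingClassifier(n_estimators=25, random_state=0)),
    ("boosting", AdaBoostClassifier(n_estimators=25, random_state=0)),
)

# Mean 5-fold cross-validated accuracy for each ensemble.
results = {}
for name, model in ensembles:
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(name, round(results[name], 3))
```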

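How to Optimize Algorithm Parameters

A harder task is building an effective procedure for choosing the right parameters; we need a search method to determine them. scikit-learn provides functions for this.

The example below selects a regularization parameter: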

import numpy as np

from sklearn.linear_model import Ridge

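# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV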

# prepare a range of alpha values to test

alphas = np.array([1,0.1,0.01,0.001,0.0001,0])

# create and fit a ridge regression model, testing each alpha

model = Ridge()

grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))

grid.fit(X, y)

print(grid)

# summarize the results of the grid search

print(grid.best_score_)

print(grid.best_estimator_.alpha)

Result:

GridSearchCV(cv=None,
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, solver='auto', tol=0.001),
       estimator__alpha=1.0, estimator__copy_X=True,
       estimator__fit_intercept=True, estimator__max_iter=None,
       estimator__normalize=False, estimator__solver='auto',
       estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,
       n_jobs=1,
       param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
       1.00000e-04, 0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

0.282118955686

1.0
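Besides the single best score, the fitted search object keeps the score of every candidate in `cv_results_`. The following sketch uses synthetic regression data (`make_regression` and all variable names are assumptions for illustration) to list the mean cross-validated score per alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

alphas = np.array([1, 0.1, 0.01, 0.001])
search = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5)
search.fit(X_demo, y_demo)

# One entry per candidate alpha: the mean score over the 5 folds.
mean_scores = search.cv_results_["mean_test_score"]
for alpha, score in zip(search.cv_results_["param_alpha"], mean_scores):
    print(alpha, round(score, 4))

print(search.best_params_)
```

For a regressor such as `Ridge`, the default score is the R² coefficient, so higher is better and `best_score_` is the largest of the listed means.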

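Sometimes it is effective to sample parameter values at random from a given range, evaluate the algorithm with each sampled value, and then pick the best one.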

import numpy as np

from scipy.stats import uniform as sp_rand

from sklearn.linear_model import Ridge

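# RandomizedSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import RandomizedSearchCV

# prepare a uniform distribution to sample for the alpha parameter

param_grid = {'alpha': sp_rand()}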

# create and fit a ridge regression model, testing random alpha values

model = Ridge()

rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)

rsearch.fit(X, y)

print(rsearch)

# summarize the results of the random parameter search

print(rsearch.best_score_)

print(rsearch.best_estimator_.alpha)

Result:

RandomizedSearchCV(cv=None,
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
          normalize=False, solver='auto', tol=0.001),
          estimator__alpha=1.0, estimator__copy_X=True,
          estimator__fit_intercept=True, estimator__max_iter=None,
          estimator__normalize=False, estimator__solver='auto',
          estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,
          n_jobs=1,
          param_distributions={'alpha':
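Regularization strengths often span several orders of magnitude, so a log-uniform distribution can be a reasonable alternative to a plain uniform one. This is a sketch under stated assumptions: it uses `scipy.stats.loguniform` (available in recent SciPy versions), synthetic regression data, and hypothetical variable names; the range 1e-4 to 10 is chosen arbitrarily for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data as a stand-in.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

# Sample 20 alpha values log-uniformly between 1e-4 and 10.
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-4, 1e1)},
    n_iter=20, cv=5, random_state=0)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
print(best_alpha)
print(search.best_score_)
```

Fixing `random_state` makes the sampled candidates reproducible, which helps when comparing runs.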

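Summary

We have walked through the overall workflow of using the scikit-learn library. Hopefully this summary helps beginners settle in and learn, step by step, how to tackle concrete machine learning problems.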
