[Machine Learning with Sklearn] Part 1: Introducing Scikit-Learn

04-15

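Overview

Main goal: introduce the core concepts of machine learning and show how to use the sklearn package.

- What machine learning is

- How data is represented in sklearn

- An introduction to the sklearn API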

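About Scikit-Learn

sklearn implements many well-known machine learning algorithms in Python and exposes them through a clean, mature API. Hundreds of people around the world have contributed code to sklearn, and it is widely used in both industry and academia.

sklearn is built on NumPy and SciPy, two libraries that are strong at array processing and scientific computing. Note that sklearn is not designed for very large datasets, although some work has been done in that direction.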

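What is machine learning?

In this part we start exploring the basic principles of machine learning. Machine learning builds a model by tuning parameters to fit known data, then uses that model to make predictions on new data. As a subfield of artificial intelligence, it makes computers behave more intelligently by giving them some ability to generalize.

Here we look at two very simple examples. The first is classification: the figure shows a set of two-dimensional data points, with color indicating the class. A classification algorithm can separate the two groups of colored points: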

%matplotlib inline

# set seaborn plot defaults.
# This can be safely commented out
import seaborn; seaborn.set()

# eliminate warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

# Import the example plot from the figures directory
from fig_code import plot_sgd_separator
plot_sgd_separator()

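This may look like a trivial task, but it captures a simple and powerful idea. By drawing the separating line we have built a model, and that model generalizes to new data: given a new point, it predicts whether the point should be labeled red or blue.

If you want to see the source code, you can use the %load command.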

# %load fig_code/sgd_separator.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_blobs


def plot_sgd_separator():
    # we create 50 separable points
    X, Y = make_blobs(n_samples=50, centers=2,
                      random_state=0, cluster_std=0.60)

    # fit the model (max_iter replaces the n_iter argument of older releases)
    clf = SGDClassifier(loss="hinge", alpha=0.01,
                        max_iter=200, fit_intercept=True)
    clf.fit(X, Y)

    # plot the line, the points, and the nearest vectors to the plane
    xx = np.linspace(-1, 5, 10)
    yy = np.linspace(-1, 5, 10)

    X1, X2 = np.meshgrid(xx, yy)
    Z = np.empty(X1.shape)
    for (i, j), val in np.ndenumerate(X1):
        x1 = val
        x2 = X2[i, j]
        p = clf.decision_function(np.array([[x1, x2]]))
        Z[i, j] = p[0]
    levels = [-1.0, 0.0, 1.0]
    linestyles = ['dashed', 'solid', 'dashed']
    colors = 'k'

    ax = plt.axes()
    ax.contour(X1, X2, Z, levels, colors=colors, linestyles=linestyles)
    ax.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)

    ax.axis('tight')


if __name__ == '__main__':
    plot_sgd_separator()
    plt.show()
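To make the idea of generalizing to new data concrete, here is a minimal sketch that fits the same kind of classifier and then asks it to label points it has never seen. The two query points and the `random_state=0` passed to the classifier are assumptions chosen for illustration; they are not part of the original figure code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# Same synthetic data as in the figure: 50 points in two separable blobs
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)

# random_state is fixed only so the sketch is reproducible (an assumption)
clf = SGDClassifier(loss="hinge", alpha=0.01, max_iter=200, random_state=0)
clf.fit(X, y)

# Two hypothetical new points that were not part of the training data
new_points = np.array([[1.0, 4.5], [2.0, 1.0]])
predicted = clf.predict(new_points)

print("training accuracy:", clf.score(X, y))
print("predicted classes:", predicted)
```

The call to `predict` is where the model is applied to unseen data: it returns one class label per row of `new_points`.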

Next, we look at a regression example: the best-fit line through a set of data.

from fig_code import plot_linear_regression
plot_linear_regression()

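Again we see a model fitted to data, and this model also generalizes to new data. It learns from the training data and then predicts results for the test set: given an x value, it predicts the y value.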
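The source of `plot_linear_regression` is not shown here, so the following is only a rough sketch of the same idea using `LinearRegression`, not the figure's actual code. The synthetic data (a noisy line with slope 2 and intercept 1) and the query values in `X_new` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: y is roughly 2*x + 1 plus a little noise
rng = np.random.RandomState(0)
x = 10 * rng.rand(30)
y = 2 * x + 1 + 0.5 * rng.randn(30)

# sklearn expects a 2-D feature array of shape [n_samples, n_features]
X = x[:, np.newaxis]

model = LinearRegression()
model.fit(X, y)

# Predict y for x values the model has not seen
X_new = np.array([[0.0], [5.0], [10.0]])
y_new = model.predict(X_new)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("predictions:", y_new)
```

The fitted slope and intercept should land close to the values used to generate the data, which is what lets the model predict y for new x values.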

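Representing data in Scikit-learn

Machine learning builds models from data, so we start by discussing how data must be represented for the computer to work with it. We will walk through a few examples and visualize them with matplotlib.

Most machine learning algorithms in sklearn expect data stored in a two-dimensional array or matrix. Some use NumPy arrays, others use scipy.sparse matrices. The array has shape [n_samples, n_features]: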

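n_samples: the number of samples. Each sample is one item to process; it can be a document, an image, an audio clip, a video, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits.
n_features: the number of features, that is, the number of distinct traits used to describe each sample quantitatively. Feature values are generally real numbers, but in some cases they may be Boolean or discrete.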
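As a small illustration of the [n_samples, n_features] layout, the sketch below builds a tiny array by hand. The numbers are made-up values, not taken from any dataset.

```python
import numpy as np

# Three hypothetical samples (rows), each described by two features (columns)
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
])

n_samples, n_features = X.shape
print(n_samples, n_features)

# One row is one sample; one column is one feature across all samples
first_sample = X[0]
first_feature = X[:, 0]
print(first_sample)
print(first_feature)
```

Rows index samples and columns index features, so `X[0]` is everything known about the first sample and `X[:, 0]` is the first feature measured across all samples.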

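The number of features must be fixed in advance. It can be very large (for example, millions of features), with most feature values being zero for any given sample. In that situation scipy.sparse matrices are useful, because they use far less memory than NumPy arrays.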
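To give a feel for the memory difference, the sketch below builds a mostly-zero matrix in CSR format and compares its storage with what a dense float64 array of the same shape would need. The matrix size and the number of nonzero entries are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse

# A hypothetical feature matrix: 1,000 samples x 10,000 features, almost all zeros
rng = np.random.RandomState(0)
n_samples, n_features = 1000, 10000
rows = rng.randint(0, n_samples, size=5000)
cols = rng.randint(0, n_features, size=5000)
vals = rng.rand(5000)

# CSR format stores only the nonzero entries plus their positions
X_sparse = sparse.csr_matrix((vals, (rows, cols)), shape=(n_samples, n_features))

# Bytes used by the three arrays that make up the CSR representation
sparse_bytes = (X_sparse.data.nbytes + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

# Bytes a dense float64 array of the same shape would need (computed, not allocated)
dense_bytes = n_samples * n_features * np.dtype(np.float64).itemsize

print("sparse:", sparse_bytes, "bytes")
print("dense: ", dense_bytes, "bytes")
```

The dense size is computed rather than allocated, so the sketch stays cheap to run even though the dense array would take about 80 MB.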

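A simple example: the Iris dataset

In this example we look at the iris data stored in sklearn. The dataset contains measurements of three different iris species. Here are pictures of the three: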

from IPython.display import Image, display

display(Image(filename='images/iris_setosa.jpg'))
print("Iris Setosa")

display(Image(filename='images/iris_versicolor.jpg'))
print("Iris Versicolor")

display(Image(filename='images/iris_virginica.jpg'))
print("Iris Virginica")

Iris Setosa

Iris Versicolor

Iris Virginica

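Question:

If we want an algorithm that can recognize the different iris species, what data do we need?

We need a two-dimensional array of shape [n_samples x n_features].

What does n_samples refer to?
What does n_features refer to?

Every sample must have a fixed number of features, and each feature must be some quantitative measurement of the sample.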

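Loading the iris dataset from sklearn

sklearn includes a copy of the iris dataset covering these species. It consists of the following:

Features in the dataset:

sepal length (in cm)
sepal width (in cm)
petal length (in cm)
petal width (in cm)

Target classes to predict:

Iris Setosa
Iris Versicolor
Iris Virginica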

sklearn ships a copy of the iris CSV file along with a helper function that loads it into NumPy arrays:

from sklearn.datasets import load_iris
iris = load_iris()

iris.keys()

['target_names', 'data', 'target', 'DESCR', 'feature_names']

n_samples, n_features = iris.data.shape
print((n_samples, n_features))
print(iris.data[0])

(150, 4)
[ 5.1 3.5 1.4 0.2]

print(iris.data.shape)
print(iris.target.shape)

(150, 4)
(150,)

print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
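The integer labels in `iris.target` index into `iris.target_names`. As a short sketch of how the two relate, the snippet below counts the samples per class and maps a few labels back to species names. The variable names `counts` and `species` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Number of samples for each class label (0, 1, 2)
counts = np.bincount(iris.target)

# Translate every integer label back into its species name
species = iris.target_names[iris.target]

for name, count in zip(iris.target_names, counts):
    print(name, count)
print(species[0], species[50], species[100])
```

Indexing `target_names` with the whole `target` array is plain NumPy fancy indexing, so `species` has one name per sample.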

This dataset is four-dimensional, but a scatter plot lets us look at two dimensions at a time:

import numpy as np
import matplotlib.pyplot as plt

x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
            c=iris.target, cmap=plt.get_cmap('RdYlBu', 3))
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);

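Exercise

Change x_index and y_index in the code above and find the combination of features that best separates the three classes.

This exercise is a preview of dimensionality reduction, which we will see later.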

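Other available data

Datasets come in three forms:

Packaged data: these small datasets are bundled with sklearn at install time and can be loaded with the sklearn.datasets.load_* functions.
Downloadable data: these larger datasets can be downloaded from the web with the sklearn.datasets.fetch_* functions that sklearn provides.
Generated data: these datasets are produced from a model, based on a random seed, with the sklearn.datasets.make_* functions that sklearn provides.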
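The three families can be sketched as follows. `load_digits` and `make_blobs` are used here simply as representative examples; the `make_blobs` arguments are arbitrary choices for illustration. The fetch_* example is left as a comment because it needs network access the first time it runs.

```python
from sklearn.datasets import load_digits, make_blobs

# Packaged data: ships with sklearn, no download needed
digits = load_digits()
print(digits.data.shape)

# Generated data: drawn from a simple model, reproducible via random_state
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)
print(X.shape, y.shape)

# Downloadable data follows the same pattern, for example fetch_20newsgroups(),
# but it needs network access on first use, so it is not called here.
```

All three families produce data in the same [n_samples, n_features] layout, so the rest of a workflow does not depend on where the data came from.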

You can browse these functions with IPython's tab completion. After importing the datasets submodule from sklearn, type datasets.load_ + TAB, datasets.fetch_ + TAB, or datasets.make_ + TAB to list the available functions.

from sklearn import datasets

# Type datasets.fetch_<TAB> or datasets.load_<TAB> in IPython to see all possibilities

# datasets.fetch_
# datasets.load_
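Outside an interactive IPython session, roughly the same listing can be produced with Python's built-in `dir`. This is only a sketch of an alternative to tab completion; the `by_prefix` dictionary is a name introduced here for illustration.

```python
from sklearn import datasets

# Group the public names in sklearn.datasets by their prefix
by_prefix = {}
for prefix in ("load_", "fetch_", "make_"):
    by_prefix[prefix] = [name for name in dir(datasets)
                         if name.startswith(prefix)]

for prefix, names in by_prefix.items():
    print(prefix, len(names), "functions, for example", names[:3])
```

The exact list depends on the installed sklearn version, so the counts printed here will vary.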

In the next chapter, we will use these datasets to learn the basic methods of machine learning.
