[Machine Learning with Sklearn] Part 4 - Supervised Learning In-Depth: Random Forests

04-29

Earlier we saw a powerful classifier: the support vector machine.

Here we'll look at another powerful algorithm, a non-parametric method called the random forest.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
import seaborn as sns; sns.set()

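Random Forests: Decision Trees

Random forests are built on decision trees and are an example of ensemble learning. For this reason, we'll start with decision trees.

Decision trees classify objects in a very intuitive way: you ask a series of questions, each with a yes or no answer.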

import fig_code
fig_code.plot_example_decision_tree()

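The binary splitting shown above makes classification very efficient. The key to the technique is deciding which questions to ask about the features.

This is where training comes in: while fitting a decision tree classifier, the algorithm examines each feature and decides which question (or "split") is the most informative.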
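To make "most informative" concrete, here is a minimal sketch of how a single split could be chosen by minimizing Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The helper names gini_impurity and best_split and the toy data are hypothetical and exist only for this illustration; the library's actual implementation is far more optimized.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a label array: 1 - sum_k p_k^2 (0 means a pure node)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, weighted impurity) of the best yes/no question."""
    best = (None, None, np.inf)
    n_samples, n_features = X.shape
    for j in range(n_features):
        values = np.unique(X[:, j])
        # candidate thresholds: midpoints between consecutive distinct values
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            # impurity of the two children, weighted by their sizes
            score = (len(left) * gini_impurity(left)
                     + len(right) * gini_impurity(right)) / n_samples
            if score < best[2]:
                best = (j, t, score)
    return best

# hypothetical toy data: feature 0 is uninformative, feature 1 separates the classes
X_toy = np.array([[0.0, 0.1], [1.0, 0.2], [0.0, 0.8], [1.0, 0.9]])
y_toy = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_split(X_toy, y_toy)
print(feature, threshold, impurity)
```

A full tree simply repeats this search on each of the two resulting subsets until the nodes are pure or a stopping rule such as max_depth is reached.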

Creating a Decision Tree

Here's an example of a decision tree classifier in scikit-learn. We'll start by defining some two-dimensional labeled data.

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

Next we import some visualization helpers:

from fig_code import visualize_tree, plot_tree_interactive

Now, using IPython's interact, we can watch how the decision tree splits the data:

plot_tree_interactive(X, y);

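Notice that with each increase in depth, every node is split in two, except for nodes that already contain only a single class. The result is a very fast non-parametric classifier that can be quite useful in practice.

Question: do you see any problems with this?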
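For readers without the interactive widget, the same effect can be seen numerically. The following is a small sketch, assuming a scikit-learn version that provides get_depth and get_n_leaves on fitted trees; the variable names X_demo, y_demo, and leaves_by_depth are hypothetical, and the data is regenerated so that the snippet runs on its own.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# same kind of data as above, regenerated so the sketch is self-contained
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

leaves_by_depth = {}
train_score_by_depth = {}
for depth in (1, 2, 3, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_demo, y_demo)
    leaves_by_depth[depth] = tree.get_n_leaves()
    train_score_by_depth[depth] = tree.score(X_demo, y_demo)
    # requested depth, actual depth, number of leaves, accuracy on the training data
    print(depth, tree.get_depth(), tree.get_n_leaves(), train_score_by_depth[depth])
```

Each extra level can at most double the number of leaves, and the unrestricted tree keeps splitting until it classifies every training point correctly.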

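Decision Trees and Over-fitting

One issue with decision trees is that they over-fit very easily: they are flexible enough to learn the noise in the data rather than the signal. Let's look at two trees built on two different subsets of the same dataset: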

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

plt.figure()
visualize_tree(clf, X[:200], y[:200], boundaries=False)
plt.figure()
visualize_tree(clf, X[-200:], y[-200:], boundaries=False)

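The details of the two classifications are completely different! This is a clear sign of over-fitting: when you predict the label of a new point, the result reflects the noise in the model more than the signal.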
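Over-fitting can also be checked numerically by comparing accuracy on the training data with accuracy on held-out data. This is a minimal sketch under assumed settings: the train/test split and the names X_demo, y_demo, train_acc, and test_acc are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# same kind of data as above, regenerated so the sketch is self-contained
X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# an unrestricted tree keeps splitting until every training point is classified correctly
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

A gap between the two numbers indicates that part of what the tree learned was specific to the training sample.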

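Ensembles of Estimators: Random Forests

One possible way to address over-fitting is to use an ensemble method: a meta-estimator made up of many individual estimators, which averages out the over-fitting of each one. Compared with the individual estimators it contains, the ensemble's estimates turn out to be more robust and more accurate.

One of the most common ensemble methods is the random forest, in which the ensemble is made up of many decision trees.

Let's first look at the following decision tree example to build some intuition: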

def fit_randomized_tree(random_state=0):
    X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=2.0)
    clf = DecisionTreeClassifier(max_depth=15)

    # fit the tree on a random subset of 250 of the 300 points
    rng = np.random.RandomState(random_state)
    i = np.arange(len(y))
    rng.shuffle(i)
    visualize_tree(clf, X[i[:250]], y[i[:250]], boundaries=False,
                   xlim=(X[:, 0].min(), X[:, 0].max()),
                   ylim=(X[:, 1].min(), X[:, 1].max()))

# IPython.html.widgets has moved to the ipywidgets package
from ipywidgets import interact
# a (min, max) tuple produces a slider in ipywidgets
interact(fit_randomized_tree, random_state=(0, 100));

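Although the details of the model change as the sample changes, the larger characteristics stay the same. The random forest classifier does something similar, but combines the decisions of all the trees to arrive at a final answer: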

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
visualize_tree(clf, X, y, boundaries=False);

By averaging over 100 randomized models, the random forest ends up with an overall model that fits the data much better.
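To show the idea of combining many over-fit trees, here is a simplified sketch of bagging by hand: each tree is fit on a bootstrap sample and the ensemble predicts by majority vote. It is not how RandomForestClassifier is implemented. A real random forest also considers a random subset of features at each split, and scikit-learn averages the trees' predicted probabilities instead of counting hard votes. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.RandomState(0)
n_trees = 100
all_votes = np.empty((n_trees, len(X_test)), dtype=int)
for k in range(n_trees):
    # bootstrap sample: draw len(X_train) training points with replacement
    idx = rng.randint(0, len(X_train), len(X_train))
    tree = DecisionTreeClassifier(random_state=k).fit(X_train[idx], y_train[idx])
    all_votes[k] = tree.predict(X_test)

# majority vote: count the votes per class for every test point, take the winner
n_classes = len(np.unique(y_train))
vote_counts = np.apply_along_axis(np.bincount, 0, all_votes, minlength=n_classes)
ensemble_pred = vote_counts.argmax(axis=0)
ensemble_acc = np.mean(ensemble_pred == y_test)
print("bagged trees accuracy:", ensemble_acc)
```

Because every tree sees a slightly different sample, their individual mistakes differ, and the vote tends to cancel out the noise that any single tree has memorized.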

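Example: Moving to Regression

Above we used random forests for classification. They also work well for regression, that is, for continuous rather than categorical targets. The estimator to use is sklearn.ensemble.RandomForestRegressor.

Let's take a quick look at how to use it: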

from sklearn.ensemble import RandomForestRegressor

x = 10 * np.random.rand(100)

def model(x, sigma=0.3):
    fast_oscillation = np.sin(5 * x)
    slow_oscillation = np.sin(0.5 * x)
    noise = sigma * np.random.randn(len(x))

    return slow_oscillation + fast_oscillation + noise

y = model(x)
plt.errorbar(x, y, 0.3, fmt='o');

xfit = np.linspace(0, 10, 1000)
yfit = RandomForestRegressor(100).fit(x[:, None], y).predict(xfit[:, None])
ytrue = model(xfit, 0)

plt.errorbar(x, y, 0.3, fmt='o')
plt.plot(xfit, yfit, '-r');
plt.plot(xfit, ytrue, '-k', alpha=0.5);

As you can see, the random forest is flexible enough to fit the multi-period data even though we never specified a multi-period model.
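The quality of the fit can also be expressed as a number. The sketch below, assuming a fixed random seed and a hypothetical helper noiseless_model, compares the forest's root-mean-square error against the noise-free curve with that of a trivial baseline that always predicts the mean of the training targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)  # fixed seed so the sketch is reproducible

def noiseless_model(x):
    # the same slow + fast oscillation as above, without the noise term
    return np.sin(0.5 * x) + np.sin(5 * x)

x_train = 10 * rng.rand(100)
y_train = noiseless_model(x_train) + 0.3 * rng.randn(100)

x_grid = np.linspace(0, 10, 1000)
forest = RandomForestRegressor(n_estimators=100, random_state=0)
y_forest = forest.fit(x_train[:, None], y_train).predict(x_grid[:, None])

# error of the forest and of a constant prediction, both against the true curve
rmse_forest = np.sqrt(np.mean((y_forest - noiseless_model(x_grid)) ** 2))
rmse_mean = np.sqrt(np.mean((y_train.mean() - noiseless_model(x_grid)) ** 2))
print("RMSE, random forest:      ", rmse_forest)
print("RMSE, constant prediction:", rmse_mean)
```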

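Example: Random Forests for Classifying Digits

We previously saw the hand-written digits dataset. Let's use it here to test the accuracy of random forests.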

from sklearn.datasets import load_digits

digits = load_digits()
digits.keys()

['images', 'data', 'target_names', 'DESCR', 'target']

X = digits.data
y = digits.target
print(X.shape)
print(y.shape)

(1797, 64)
(1797,)

First, let's take a look at these hand-written digits:

# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary)

    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

We can quickly classify the digits using a decision tree:

from sklearn.model_selection import train_test_split
from sklearn import metrics

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=11)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)

We can check the accuracy of this classifier:

metrics.accuracy_score(ytest, ypred)

0.84222222222222221

The confusion matrix gives a more intuitive picture:

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
plt.imshow(metrics.confusion_matrix(ytest, ypred),
           interpolation='nearest', cmap=plt.cm.binary)
plt.grid(False)
plt.colorbar()
plt.xlabel("predicted label")
plt.ylabel("true label");

Exercise

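Repeat the task above with sklearn.ensemble.RandomForestClassifier. How do the max_depth, max_features, and n_estimators parameters affect the results?
Then try the sklearn.svm.SVC classifier, adjusting kernel, C, and gamma.
For each model, try a few sets of parameters and check the F1 score (sklearn.metrics.f1_score).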
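As a possible starting point for the exercise, the sketch below loops over a few random forest settings and records the F1 score of each. The parameter grid, the average="macro" choice for the multi-class F1 score, and the scores dictionary are assumptions made for illustration; they are not a recommended configuration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

digits = load_digits()
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                random_state=0)

scores = {}
for n_estimators in (10, 100):
    for max_depth in (5, None):  # None means the trees are not depth-limited
        clf = RandomForestClassifier(n_estimators=n_estimators,
                                     max_depth=max_depth, random_state=0)
        ypred = clf.fit(Xtrain, ytrain).predict(Xtest)
        # macro average: unweighted mean of the per-digit F1 scores
        scores[(n_estimators, max_depth)] = f1_score(ytest, ypred, average="macro")

# print the settings from worst to best
for params, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(params, round(score, 3))
```

The same loop can be reused for sklearn.svm.SVC by swapping the estimator and iterating over kernel, C, and gamma instead.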

Jupyter implementation