Python 3 Machine Learning Cookbook, Study Notes 9: Classification Algorithms

05-06

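Example 1: Evaluating cars based on their attributes

Let's see how classification can be applied to a real-world problem. We will use a dataset that describes cars by several attributes, such as number of doors, trunk size, and maintenance cost, to determine the quality of each car. The goal is to assign each car to one of four quality classes: unacceptable, acceptable, good, and very good.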

- buying: one of vhigh, high, med, low;

- maint: one of vhigh, high, med, low;

- doors: 2, 3, 4, 5, and so on;

- persons: 2, 4, and so on;

- lug_boot: one of small, med, big;

- safety: one of low, med, high.

Sample record: vhigh,vhigh,2,2,small,low,unacc
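To make the record layout concrete, here is a minimal sketch that splits the sample record into named fields. The column names follow the attribute list above; the name `class` for the final label column is an assumption made for illustration.

```python
import csv
import io

# Attribute names from the list above; "class" is an assumed name for the label column.
COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

sample = "vhigh,vhigh,2,2,small,low,unacc\n"

# csv.reader splits on commas and drops the trailing newline.
row = next(csv.reader(io.StringIO(sample)))
record = dict(zip(COLUMNS, row))

features = row[:-1]  # the six car attributes, still as strings
label = row[-1]      # the quality class

print(record)
print(features, label)
```

Every field, including doors and persons, is read as a string at this point.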

Procedure

Import the required packages

    import numpy as np
    from sklearn import preprocessing
    from sklearn.ensemble import RandomForestClassifier
    import matplotlib.pyplot as plt

Read the data

    input_file = 'car.data.txt'

    # Reading the data
    X = []
    with open(input_file, 'r') as f:
        for line in f:
            line = line.strip()  # drop the trailing newline
            if not line:
                continue  # skip blank lines
            X.append(line.split(','))

    X = np.array(X)


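Each line contains a comma-separated list of words. We therefore parse the input file, strip the trailing newline from each line, split the line on commas, and append the resulting list to the main data. Because scikit-learn estimators only work with numerical data, these string attributes must be converted into a numerical form before training.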

Convert strings to numbers

    # Convert string data to numerical data
    label_encoder = []
    X_encoded = np.empty(X.shape)
    for i in range(X.shape[1]):
        # One encoder per column, so each attribute keeps its own mapping
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])

    X = X_encoded[:, :-1].astype(int)
    y = X_encoded[:, -1].astype(int)

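Because each attribute takes only a limited number of values, a label encoder can turn those values into numbers. Each attribute needs its own label encoder. For example, lug_boot can take 3 distinct values, so it needs an encoder that knows how to encode those 3 values. The last value on each line is the class, which we assign to the variable y.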
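As a small illustration of what one such encoder does, the sketch below fits a `LabelEncoder` on a handful of lug_boot values. The toy array is made up for the example and is not read from the data file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy column of lug_boot values (illustrative only).
lug_boot = np.array(["small", "med", "big", "med", "small"])

encoder = LabelEncoder()
codes = encoder.fit_transform(lug_boot)

# classes_ holds the distinct values in sorted order;
# the code of a value is its index in classes_.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings.
print(encoder.inverse_transform(codes))
```

Note that the codes follow alphabetical order (big, med, small), not the natural small < med < big order.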

Build and train the classifier

    params = {'n_estimators': 200, 'max_depth': 8, 'random_state': 7}
    classifier = RandomForestClassifier(**params)
    classifier.fit(X, y)

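You can change the values of n_estimators and max_depth and observe how they affect the classifier's accuracy. Later on we will tackle parameter selection in a more systematic way.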

Cross-validation

    # Cross validation
    # (sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection)
    from sklearn.model_selection import cross_val_score

    accuracy = cross_val_score(classifier, X, y, scoring='accuracy', cv=3)
    print("Accuracy of the classifier: " + str(round(100 * accuracy.mean(), 2)) + "%")

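Once the classifier is trained, we need to know how well it performs. We use three-fold cross-validation to estimate its accuracy: the data is split into 3 groups, and in each round two groups train the classifier while the remaining group validates it.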
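The rotation of folds can be seen with a minimal sketch on nine toy samples. It uses plain `KFold` to keep the index pattern easy to read; the toy array `X_toy` is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine toy samples (illustrative only, not the car data).
X_toy = np.arange(9).reshape(9, 1)

kfold = KFold(n_splits=3)
splits = list(kfold.split(X_toy))

# Each round trains on two thirds of the data and validates on the remaining third.
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

When `cross_val_score` receives an integer `cv` and a classifier, scikit-learn uses a stratified variant of this scheme, which also tries to keep the class proportions similar in every fold.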

The main purpose of building a classifier is to use it on isolated, previously unseen data. Let's classify a single data point:

    # Testing encoding on single data instance
    input_data = ['vhigh', 'vhigh', '2', '2', 'small', 'low']
    input_data_encoded = [-1] * len(input_data)
    for i, item in enumerate(input_data):
        # Reuse the encoder fitted on column i during training
        input_data_encoded[i] = int(label_encoder[i].transform([item])[0])

    input_data_encoded = np.array(input_data_encoded).reshape(1, -1)

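The first step is to convert the data to numerical form. We must reuse the label encoders that were fitted when training the classifier, so that the encoding stays consistent. If the input data point contains an unknown value, the label encoder raises an exception because it has no code for that value. For example, if you change the first value in the list from vhigh to abcd, the encoder cannot encode it. This effectively acts as an error check on whether the input data point is valid.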
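The error-check behavior can be sketched as follows. The helper `encode_or_none` is hypothetical and introduced only for this example; it relies on `LabelEncoder.transform` raising `ValueError` for a label it has not seen.

```python
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on the known values of the buying attribute.
encoder = LabelEncoder()
encoder.fit(["vhigh", "high", "med", "low"])


def encode_or_none(encoder, value):
    """Hypothetical helper: return the integer code, or None for an unseen value."""
    try:
        return int(encoder.transform([value])[0])
    except ValueError:
        # LabelEncoder raises ValueError for labels it was not fitted on.
        return None


print(encode_or_none(encoder, "vhigh"))
print(encode_or_none(encoder, "abcd"))
```

Returning None is just one way to surface the problem; letting the exception propagate is equally reasonable when invalid input should stop the program.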

Predict the output class

    # Predict and print output for a particular datapoint
    output_class = classifier.predict(input_data_encoded)
    print("Output class:", label_encoder[-1].inverse_transform(output_class)[0])

Output

    Accuracy of the classifier: 78.19%
    Output class: unacc

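Generating validation curves

We built a classifier with a random forest above, but we did not really know how to choose its parameters. This section deals with two of them: n_estimators and max_depth. They are called hyperparameters, and the classifier's performance depends heavily on them. It would be ideal to see how performance changes as a hyperparameter changes, and that is exactly what validation curves are for. These curves help us understand how each hyperparameter affects the training score. Essentially, we vary only the hyperparameter of interest and keep the others fixed. Below we visualize how changing a hyperparameter affects the training score.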
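The idea of varying one hyperparameter while holding the others fixed can be written out by hand. The sketch below is an illustration rather than the implementation of `validation_curve`: it uses synthetic data from `make_classification` as a stand-in for the car dataset, and the chosen depths and sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; not the car dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=7)

depths = [2, 4, 8]
train_means, val_means = [], []
for depth in depths:
    # Only max_depth changes; n_estimators and random_state stay fixed.
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=7)
    result = cross_validate(model, X_demo, y_demo, cv=3, return_train_score=True)
    train_means.append(result["train_score"].mean())
    val_means.append(result["test_score"].mean())

for depth, tr, va in zip(depths, train_means, val_means):
    print(f"max_depth={depth}: train={tr:.3f} validation={va:.3f}")
```

Comparing the training and validation means side by side is what makes the curve useful: a training score that keeps rising while the validation score stalls suggests the extra capacity is being spent on memorizing the training folds.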

Validation curve

    # Validation curves
    # (sklearn.learning_curve was removed in scikit-learn 0.20; use model_selection)
    from sklearn.model_selection import validation_curve

    classifier = RandomForestClassifier(max_depth=4, random_state=7)
    parameter_grid = np.linspace(25, 200, 8).astype(int)
    train_scores, validation_scores = validation_curve(
        classifier, X, y, param_name="n_estimators", param_range=parameter_grid, cv=5)
    print("\n##### VALIDATION CURVES #####")
    print("\nParam: n_estimators\nTraining scores:\n", train_scores)
    print("\nParam: n_estimators\nValidation scores:\n", validation_scores)

Output of running the code

    ##### VALIDATION CURVES #####

    Param: n_estimators
    Training scores:
    [[0.80680174 0.80824891 0.80752533 0.80463097 0.81358382]
     [0.79522431 0.80535456 0.81041968 0.8089725  0.81069364]
     [0.80101302 0.80680174 0.81114327 0.81476122 0.8150289 ]
     [0.8024602  0.80535456 0.81186686 0.80752533 0.80346821]
     [0.80028944 0.80463097 0.81114327 0.80824891 0.81069364]
     [0.80390738 0.80535456 0.81041968 0.80969609 0.81647399]
     [0.80390738 0.80463097 0.81114327 0.81476122 0.81719653]
     [0.80390738 0.80607815 0.81114327 0.81403763 0.81647399]]

    Param: n_estimators
    Validation scores:
    [[0.71098266 0.76589595 0.72543353 0.76300578 0.75290698]
     [0.71098266 0.75433526 0.71965318 0.75722543 0.74127907]
     [0.71098266 0.72254335 0.71965318 0.75722543 0.74418605]
     [0.71098266 0.71387283 0.71965318 0.75722543 0.72674419]
     [0.71098266 0.74277457 0.71965318 0.75722543 0.74127907]
     [0.71098266 0.74277457 0.71965318 0.75722543 0.74127907]
     [0.71098266 0.74566474 0.71965318 0.75722543 0.74418605]
     [0.71098266 0.75144509 0.71965318 0.75722543 0.74127907]]

Plot the data

    # Plot the curve
    plt.figure()
    plt.plot(parameter_grid, 100 * np.average(train_scores, axis=1), color='black')
    plt.title('Training curve')
    plt.xlabel('Number of estimators')
    plt.ylabel('Accuracy')
    plt.show()

Validating the max_depth parameter works the same way

    classifier = RandomForestClassifier(n_estimators=20, random_state=7)
    parameter_grid = np.linspace(2, 10, 5).astype(int)
    # Store the result in validation_scores so the print below shows this run
    train_scores, validation_scores = validation_curve(
        classifier, X, y, param_name="max_depth", param_range=parameter_grid, cv=5)
    print("\nParam: max_depth\nTraining scores:\n", train_scores)
    print("\nParam: max_depth\nValidation scores:\n", validation_scores)

    # Plot the curve
    plt.figure()
    plt.plot(parameter_grid, 100 * np.average(train_scores, axis=1), color='black')
    plt.title('Validation curve')
    plt.xlabel('Maximum depth of the tree')
    plt.ylabel('Accuracy')
    plt.show()

Output

    Param: max_depth
    Training scores:
    [[0.71852388 0.70043415 0.70043415 0.70043415 0.69942197]
     [0.80607815 0.80535456 0.80752533 0.79450072 0.81069364]
     [0.90665702 0.91027496 0.92836469 0.89797395 0.90679191]
     [0.97467438 0.96743849 0.96888567 0.97829233 0.96820809]
     [0.99421129 0.99710564 0.99782923 0.99855282 0.99277457]]

    Param: max_depth
    Validation scores:
    [[0.71098266 0.76589595 0.72543353 0.76300578 0.75290698]
     [0.71098266 0.75433526 0.71965318 0.75722543 0.74127907]
     [0.71098266 0.72254335 0.71965318 0.75722543 0.74418605]
     [0.71098266 0.71387283 0.71965318 0.75722543 0.72674419]
     [0.71098266 0.74277457 0.71965318 0.75722543 0.74127907]
     [0.71098266 0.74277457 0.71965318 0.75722543 0.74127907]
     [0.71098266 0.74566474 0.71965318 0.75722543 0.74418605]
     [0.71098266 0.75144509 0.71965318 0.75722543 0.74127907]]

Note that the validation scores recorded here have 8 rows and are identical to the n_estimators run, while max_depth was tested with only 5 values. They came from a run in which the max_depth result was stored in valid_scores while the stale validation_scores variable was printed, so they do not reflect the effect of max_depth.

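Generating learning curves

Learning curves help us understand how the size of the training dataset affects a machine learning model. This is very useful when computing power is limited. Below we vary the training set size and plot the learning curve.

We want to measure the model's performance with training set sizes of 200, 500, 800, and 1100. Setting the cv parameter of learning_curve to 5 means five-fold cross-validation is used.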

    # Learning curves
    # (sklearn.learning_curve was removed in scikit-learn 0.20; use model_selection)
    from sklearn.model_selection import learning_curve

    classifier = RandomForestClassifier(random_state=7)
    parameter_grid = np.array([200, 500, 800, 1100])
    train_sizes, train_scores, validation_scores = learning_curve(
        classifier, X, y, train_sizes=parameter_grid, cv=5)
    print("\n##### LEARNING CURVES #####")
    print("\nTraining scores:\n", train_scores)
    print("\nValidation scores:\n", validation_scores)

    # Plot the curve
    plt.figure()
    plt.plot(parameter_grid, 100 * np.average(train_scores, axis=1), color='black')
    plt.title('Learning curve')
    plt.xlabel('Number of training samples')
    plt.ylabel('Accuracy')
    plt.show()

Output

    ##### LEARNING CURVES #####

    Training scores:
    [[1.         1.         1.         1.         1.        ]
     [1.         1.         0.998      0.998      0.998     ]
     [0.99875    0.9975     0.99875    0.99875    0.99875   ]
     [0.99818182 0.99545455 0.99909091 0.99818182 0.99818182]]

    Validation scores:
    [[0.69942197 0.69942197 0.69942197 0.69942197 0.70348837]
     [0.74855491 0.65028902 0.76878613 0.76589595 0.70348837]
     [0.70520231 0.78612717 0.52312139 0.76878613 0.77034884]
     [0.65028902 0.75433526 0.65317919 0.75433526 0.76744186]]

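Although smaller training sets appear to give higher training accuracy, they overfit easily. Larger training sets, on the other hand, consume more resources. Choosing the training set size is therefore a trade-off that has to take the available computing power into account.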
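One way to read the numbers above is to average each row and compare training accuracy with validation accuracy. The sketch below does this with NumPy on the scores copied from the learning-curve output; interpreting the difference as a sign of overfitting is an informal rule of thumb, not a formal test.

```python
import numpy as np

# Scores copied from the learning-curve output above (one row per training size).
train_sizes = np.array([200, 500, 800, 1100])
train_scores = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.998, 0.998, 0.998],
    [0.99875, 0.9975, 0.99875, 0.99875, 0.99875],
    [0.99818182, 0.99545455, 0.99909091, 0.99818182, 0.99818182],
])
validation_scores = np.array([
    [0.69942197, 0.69942197, 0.69942197, 0.69942197, 0.70348837],
    [0.74855491, 0.65028902, 0.76878613, 0.76589595, 0.70348837],
    [0.70520231, 0.78612717, 0.52312139, 0.76878613, 0.77034884],
    [0.65028902, 0.75433526, 0.65317919, 0.75433526, 0.76744186],
])

# Average over the five folds, then measure the train/validation gap.
train_mean = train_scores.mean(axis=1)
val_mean = validation_scores.mean(axis=1)
gap = train_mean - val_mean

for size, tr, va, g in zip(train_sizes, train_mean, val_mean, gap):
    print(f"n={size}: train={tr:.3f} validation={va:.3f} gap={g:.3f}")
```

Training accuracy stays close to 1.0 at every size while the mean validation accuracy remains far lower, so the gap is large throughout. That pattern is consistent with the overfitting mentioned above.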