Python 3 Machine Learning Cookbook - Chapter 5: Building a Recommendation Engine 21

04-27

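Building a machine learning pipeline

The scikit-learn library includes a way to build machine learning pipelines. You specify the processing steps, and it builds a composite object that passes the data through the whole pipeline. The pipeline can include steps such as preprocessing, feature selection, supervised learning, and unsupervised learning. This section builds a pipeline that takes feature vectors as input, selects the k best features, and classifies the data with a random forest classifier.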
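To make the idea of a composite object concrete, here is a minimal sketch comparing manual chaining with a pipeline. The dataset size, the value k=5, the number of trees, and the random seeds are assumptions chosen only for illustration; they are not taken from the steps that follow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

# Illustrative data; sizes and seeds are assumptions for this sketch
X, y = make_classification(n_samples=200, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# Manual chaining: fit the selector, transform the data, fit the classifier
manual_selector = SelectKBest(f_regression, k=5)
X_reduced = manual_selector.fit_transform(X, y)
manual_clf = RandomForestClassifier(n_estimators=20, random_state=0)
manual_clf.fit(X_reduced, y)
manual_pred = manual_clf.predict(manual_selector.transform(X))

# The same two steps wrapped in a single Pipeline object
pipe = Pipeline([
    ("selector", SelectKBest(f_regression, k=5)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe_pred = pipe.fit(X, y).predict(X)

# With identical settings and seeds, both routes should agree
same = np.array_equal(manual_pred, pipe_pred)
print("Identical predictions:", same)
```

The pipeline version needs one fit call and one predict call, and the selection step is applied to new data automatically.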

Step-by-step code

Import the required libraries

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

Generate some sample data

# Generate sample data
X, y = make_classification(
    n_informative=4, n_features=20, n_redundant=0, random_state=5)

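This call generates 100 data points (the default number of samples), each with a 20-dimensional feature vector. The value 20 is also the default for n_features; you can change the dimensionality of the feature vectors by changing that parameter.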
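As a quick check of how these parameters shape the data, the sketch below generates one dataset with the settings used here and one wider dataset. The values 250 and 30 are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as in the walkthrough; n_samples is left at its default
X_default, y_default = make_classification(
    n_informative=4, n_features=20, n_redundant=0, random_state=5)

# A wider dataset; 250 samples and 30 features are arbitrary choices
X_wide, y_wide = make_classification(
    n_samples=250, n_features=30, n_informative=4, n_redundant=0,
    random_state=5)

print(X_default.shape)        # (100, 20)
print(X_wide.shape)           # (250, 30)
print(np.unique(y_default))   # the two class labels, 0 and 1
```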

The first step of the pipeline selects the k best features before the data points are processed any further. Here k is set to 10:

# Feature selector
selector_k_best = SelectKBest(f_regression, k=10)
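To see what the selector does on its own, outside a pipeline, the following sketch fits it directly on the sample data and inspects the result. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_classification(
    n_informative=4, n_features=20, n_redundant=0, random_state=5)

# Fit the selector and keep only the 10 highest-scoring columns
selector = SelectKBest(f_regression, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)

# One score per input feature, and the indices of the columns that were kept
scores = selector.scores_
kept = selector.get_support(indices=True)
print("Kept feature indices:", kept)
```

The scores_ attribute holds one score per original feature, so it can be used to check why a given column was kept or dropped.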

The second step of the pipeline classifies the data with a random forest classifier:

# Random forest classifier
classifier = RandomForestClassifier(n_estimators=50, max_depth=4)

Now we can create the pipeline. The Pipeline class lets us build it from predefined objects:

# Build the machine learning pipeline
pipeline_classifier = Pipeline([('selector', selector_k_best), ('rf', classifier)])

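Each step in the pipeline is given a name. In the line above, the feature selector is named selector and the random forest classifier is named rf. You are free to choose any other names.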
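The step names can be inspected through the named_steps attribute. As a point of comparison, the sketch below also uses make_pipeline, a scikit-learn helper not used in this walkthrough, which generates step names from the lowercased class names instead of taking them from you.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline, make_pipeline

# Explicit names, as in the walkthrough
named = Pipeline([
    ("selector", SelectKBest(f_regression, k=10)),
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=4)),
])
named_keys = list(named.named_steps)
print(named_keys)  # ['selector', 'rf']

# Automatically generated names
auto = make_pipeline(
    SelectKBest(f_regression, k=10),
    RandomForestClassifier(n_estimators=50, max_depth=4),
)
auto_keys = list(auto.named_steps)
print(auto_keys)  # ['selectkbest', 'randomforestclassifier']
```

Short explicit names keep later parameter keys compact, which is one reason to prefer them over the generated ones.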

The parameters of each step can also be updated afterwards, using the names assigned in the previous step. For example, to set k to 6 in the feature selector and n_estimators to 25 in the random forest classifier, use the code below. Each key has the form step name, double underscore, parameter name.

# Set parameters using the step names assigned earlier:
# k=6 in the feature selector and n_estimators=25 in
# the random forest classifier
pipeline_classifier.set_params(selector__k=6, rf__n_estimators=25)
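A small sketch of how one might confirm that the update took effect, using get_params, and what happens when a parameter name is misspelled. The misspelled key no_such_parameter is hypothetical and exists only to trigger the error.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("selector", SelectKBest(f_regression, k=10)),
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=4)),
])

# Update nested parameters through the step names
pipe.set_params(selector__k=6, rf__n_estimators=25)

# get_params exposes the same "<step>__<parameter>" keys
params = pipe.get_params()
print(params["selector__k"], params["rf__n_estimators"])  # 6 25

# The underlying step objects are updated too
print(pipe.named_steps["selector"].k)  # 6

# A key that does not match a real parameter is rejected
error_raised = False
try:
    pipe.set_params(rf__no_such_parameter=1)  # hypothetical, invalid name
except ValueError as exc:
    error_raised = True
    print("Rejected:", type(exc).__name__)
```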

Next, train the classifier, predict the output for the training data, evaluate the classifier's performance, and print the features chosen by the selector:

# Train the classifier
pipeline_classifier.fit(X, y)

# Predict the output
prediction = pipeline_classifier.predict(X)
print("Predictions:", prediction)

# Print the score (accuracy on the training data)
print("Score:", pipeline_classifier.score(X, y))

# Print the features chosen by the selector
features_status = pipeline_classifier.named_steps['selector'].get_support()
selected_features = []
for count, item in enumerate(features_status):
    if item:
        selected_features.append(count)
print("Selected features (0-indexed):",
      ', '.join([str(x) for x in selected_features]))

Output

Predictions: [1 1 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 1 1 0 1]
Score: 0.94
Selected features (0-indexed): 0, 5, 9, 10, 11, 15
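The score above is computed on the same data the pipeline was trained on, so it tends to be optimistic. As a hedged sketch of one common alternative, the code below holds out part of the data for testing. The 25% split and the random_state values are assumptions, and the scores it prints will not necessarily match the output above. It also shows that get_support(indices=True) returns the selected column indices directly, without the loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(
    n_informative=4, n_features=20, n_redundant=0, random_state=5)

# Hold out 25% of the points; split size and seed are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=5)

pipe = Pipeline([
    ("selector", SelectKBest(f_regression, k=6)),
    ("rf", RandomForestClassifier(n_estimators=25, max_depth=4,
                                  random_state=5)),
])
pipe.fit(X_train, y_train)

# Accuracy on seen versus unseen data
train_score = pipe.score(X_train, y_train)
test_score = pipe.score(X_test, y_test)
print("Train score:", train_score)
print("Test score:", test_score)

# Indices of the selected features, without an explicit loop
selected = pipe.named_steps["selector"].get_support(indices=True)
print("Selected features (0-indexed):", ", ".join(str(i) for i in selected))
```

Because the selector sits inside the pipeline, it is fitted on the training split only, so the held-out points do not influence which features are chosen.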