決策樹演算法簡介及Spark MLlib調用

02-04

決策樹

演算法介紹：

決策樹以及其集成演算法是機器學習分類和回歸問題中非常流行的演算法。因其易解釋性、可處理類別特徵、易擴展到多分類問題、不需特徵縮放等性質被廣泛使用。樹集成演算法如隨機森林以及boosting演算法幾乎是解決分類和回歸問題中表現最優的演算法。

決策樹是一個貪心演算法遞歸地將特徵空間劃分為兩個部分，在同一個葉子節點的數據最後會擁有同樣的標籤。每次劃分通過貪心的以獲得最大信息增益為目的，從可選擇的分裂方式中選擇最佳的分裂節點。節點不純度有節點所含類別的同質性來衡量。工具提供為分類提供兩種不純度衡量（基尼不純度和熵），為回歸提供一種不純度衡量（方差）。

spark.ml支持二分類、多分類以及回歸的決策樹演算法，適用於連續特徵以及類別特徵。另外，對於分類問題，工具可以返回屬於每種類別的概率（類別條件概率），對於回歸問題工具可以返回預測在偏置樣本上的方差。

參數：

checkpointInterval:

類型：整數型。

含義：設置檢查點間隔（>=1），或不設置檢查點（-1）。

featuresCol:

類型：字元串型。

含義：特徵列名。

impurity:

類型：字元串型。

含義：計算信息增益的準則（不區分大小寫）。

labelCol:

類型：字元串型。

含義：標籤列名。

maxBins:

類型：整數型。

含義：連續特徵離散化的最大數量，以及選擇每個節點分裂特徵的方式。

maxDepth:

類型：整數型。

含義：樹的最大深度（>=0）。

minInfoGain:

類型：雙精度型。

含義：分裂節點時所需最小信息增益。

minInstancesPerNode:

類型：整數型。

含義：分裂後自節點最少包含的實例數量。

predictionCol:

類型：字元串型。

含義：預測結果列名。

probabilityCol:

類型：字元串型。

含義：類別條件概率預測結果列名。

rawPredictionCol:

類型：字元串型。

含義：原始預測。

seed:

類型：長整型。

含義：隨機種子。

thresholds:

類型：雙精度數組型。

含義：多分類預測的閥值，以調整預測結果在各個類別的概率。

示例：

下面的例子導入LibSVM格式數據，並將之劃分為訓練數據和測試數據。使用第一部分數據進行訓練，剩下數據來測試。訓練之前我們使用了兩種數據預處理方法來對特徵進行轉換，並且添加了元數據到DataFrame。

Scala:

import org.apache.spark.ml.Pipelinenimport org.apache.spark.ml.classification.DecisionTreeClassificationModelnimport org.apache.spark.ml.classification.DecisionTreeClassifiernimport org.apache.spark.ml.evaluation.MulticlassClassificationEvaluatornimport org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}nn// Load the data stored in LIBSVM format as a DataFrame.nval data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")nn// Index labels, adding metadata to the label column.n// Fit on whole dataset to include all labels in index.nval labelIndexer = new StringIndexer()n .setInputCol("label")n .setOutputCol("indexedLabel")n .fit(data)n// Automatically identify categorical features, and index them.nval featureIndexer = new VectorIndexer()n .setInputCol("features")n .setOutputCol("indexedFeatures")n .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.n .fit(data)nn// Split the data into training and test sets (30% held out for testing).nval Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))nn// Train a DecisionTree model.nval dt = new DecisionTreeClassifier()n .setLabelCol("indexedLabel")n .setFeaturesCol("indexedFeatures")nn// Convert indexed labels back to original labels.nval labelConverter = new IndexToString()n .setInputCol("prediction")n .setOutputCol("predictedLabel")n .setLabels(labelIndexer.labels)nn// Chain indexers and tree in a Pipeline.nval pipeline = new Pipeline()n .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))nn// Train model. This also runs the indexers.nval model = pipeline.fit(trainingData)nn// Make predictions.nval predictions = model.transform(testData)nn// Select example rows to display.npredictions.select("predictedLabel", "label", "features").show(5)nn// Select (prediction, true label) and compute test error.nval evaluator = new MulticlassClassificationEvaluator()n .setLabelCol("indexedLabel")n .setPredictionCol("prediction")n .setMetricName("accuracy")nval accuracy = evaluator.evaluate(predictions)nprintln("Test Error = " + (1.0 - accuracy))nnval treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]nprintln("Learned classification tree model:n" + treeModel.toDebugString)n

Java:

import org.apache.spark.ml.Pipeline;nimport org.apache.spark.ml.PipelineModel;nimport org.apache.spark.ml.PipelineStage;nimport org.apache.spark.ml.classification.DecisionTreeClassifier;nimport org.apache.spark.ml.classification.DecisionTreeClassificationModel;nimport org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;nimport org.apache.spark.ml.feature.*;nimport org.apache.spark.sql.Dataset;nimport org.apache.spark.sql.Row;nimport org.apache.spark.sql.SparkSession;nn// Load the data stored in LIBSVM format as a DataFrame.nDataset<Row> data = sparkn .read()n .format("libsvm")n .load("data/mllib/sample_libsvm_data.txt");nn// Index labels, adding metadata to the label column.n// Fit on whole dataset to include all labels in index.nStringIndexerModel labelIndexer = new StringIndexer()n .setInputCol("label")n .setOutputCol("indexedLabel")n .fit(data);nn// Automatically identify categorical features, and index them.nVectorIndexerModel featureIndexer = new VectorIndexer()n .setInputCol("features")n .setOutputCol("indexedFeatures")n .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.n .fit(data);nn// Split the data into training and test sets (30% held out for testing).nDataset<Row>[] splits = data.randomSplit(new double[]{0.7, 0.3});nDataset<Row> trainingData = splits[0];nDataset<Row> testData = splits[1];nn// Train a DecisionTree model.nDecisionTreeClassifier dt = new DecisionTreeClassifier()n .setLabelCol("indexedLabel")n .setFeaturesCol("indexedFeatures");nn// Convert indexed labels back to original labels.nIndexToString labelConverter = new IndexToString()n .setInputCol("prediction")n .setOutputCol("predictedLabel")n .setLabels(labelIndexer.labels());nn// Chain indexers and tree in a Pipeline.nPipeline pipeline = new Pipeline()n .setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter});nn// Train model. This also runs the indexers.nPipelineModel model = pipeline.fit(trainingData);nn// Make predictions.nDataset<Row> predictions = model.transform(testData);nn// Select example rows to display.npredictions.select("predictedLabel", "label", "features").show(5);nn// Select (prediction, true label) and compute test error.nMulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()n .setLabelCol("indexedLabel")n .setPredictionCol("prediction")n .setMetricName("accuracy");ndouble accuracy = evaluator.evaluate(predictions);nSystem.out.println("Test Error = " + (1.0 - accuracy));nnDecisionTreeClassificationModel treeModel =n (DecisionTreeClassificationModel) (model.stages()[2]);nSystem.out.println("Learned classification tree model:n" + treeModel.toDebugString());n

Python：

from pyspark.ml import Pipelinenfrom pyspark.ml.classification import DecisionTreeClassifiernfrom pyspark.ml.feature import StringIndexer, VectorIndexernfrom pyspark.ml.evaluation import MulticlassClassificationEvaluatornn# Load the data stored in LIBSVM format as a DataFrame.ndata = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")nn# Index labels, adding metadata to the label column.n# Fit on whole dataset to include all labels in index.nlabelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)n# Automatically identify categorical features, and index them.n# We specify maxCategories so features with > 4 distinct values are treated as continuous.nfeatureIndexer =n VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)nn# Split the data into training and test sets (30% held out for testing)n(trainingData, testData) = data.randomSplit([0.7, 0.3])nn# Train a DecisionTree model.ndt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")nn# Chain indexers and tree in a Pipelinenpipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])nn# Train model. This also runs the indexers.nmodel = pipeline.fit(trainingData)nn# Make predictions.npredictions = model.transform(testData)nn# Select example rows to display.npredictions.select("prediction", "indexedLabel", "features").show(5)nn# Select (prediction, true label) and compute test errornevaluator = MulticlassClassificationEvaluator(n labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")naccuracy = evaluator.evaluate(predictions)nprint("Test Error = %g " % (1.0 - accuracy))nntreeModel = model.stages[2]n# summary onlynprint(treeModel)n