Spark MLlib Machine Learning Library (1): Decision Tree and Random Forest Case Study

1 Decision tree predicts forest vegetation

1.1 Covtype data set

Dataset download address: https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset

This data set records forest cover types for plots of land in Colorado, USA. Each sample contains several features describing a plot, including elevation, slope, distance to water, shading conditions, and soil type, together with the known forest cover type for that plot.

Naturally, we parse the data into a DataFrame, because DataFrame is Spark's abstraction for tabular data. It has a defined schema, including column names and column types.

package com.yyds

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions._
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.ml.{Model, Pipeline, PipelineModel, Transformer}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, TrainValidationSplitModel}
import scala.util.Random // needed for Random.nextLong() used below


object DecisionTreeTest {

  Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {

    // Build the SparkSession instance via the builder pattern
    val spark: SparkSession = {
      SparkSession
        .builder()
        .appName(this.getClass.getSimpleName.stripSuffix("$"))
        .master("local[1]")
        .config("spark.sql.shuffle.partitions", "3")
        .getOrCreate()
    }


    // Use Spark's built-in CSV reader
    val dataWithHeader = spark.read
                              .option("inferSchema", "true") // infer column types
                              .option("header", "true")      // parse the header row
                              .csv("D:\\kaggle\\covtype\\covtype.csv")

    // Check the column names: some features are indeed numeric,
    // but "Wilderness_Area" is different: it spans 4 columns, each of which is either 0 or 1.
    // Wilderness area is really a categorical feature, not a numeric one; it is one-hot encoded.
    // Likewise, Soil_Type (40 columns) is one-hot encoded.
    dataWithHeader.printSchema()
    dataWithHeader.show(10)

  }

}

An explanation of one-hot encoding

One-hot encoding: a categorical feature with N distinct values becomes N numeric features, each taking the value 0 or 1. Of these N features, exactly one takes the value 1 and all the others take the value 0.

For example, the categorical feature "weather" might take the values "cloudy", "rainy", or "sunny".
Under one-hot encoding it becomes 3 numeric features:
cloudy is represented as 1,0,0,
rainy as 0,1,0,
sunny as 0,0,1.


This is not, however, the only way to encode a categorical feature numerically.
Another possible encoding assigns a different number to each possible value of the categorical feature, for example cloudy = 1.0, rainy = 2.0, and so on. The target "Cover_Type" is itself a categorical value, encoded as 1 to 7.


Be careful when treating a categorical feature as numeric during encoding. Categorical values originally have no ordering, but once encoded as numbers they appear to have one. If the encoded feature is treated as numeric, the algorithm will to some extent assume that rainy is greater than cloudy, and twice as large, which can lead to unreasonable results.
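If a categorical column arrived as strings instead of pre-encoded 0/1 columns, Spark ML could produce the one-hot encoding itself. Below is a minimal sketch using the hypothetical "weather" feature from the example above; it is for illustration only (covtype already ships one-hot encoded) and assumes a SparkSession named spark is in scope:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import spark.implicits._ // for toDF on a local Seq

// Hypothetical categorical column, used only to illustrate the encoding
val weatherDF = Seq("cloudy", "rainy", "sunny", "rainy").toDF("weather")

// Step 1: map each string category to a numeric index (0.0, 1.0, 2.0, ...)
val indexer = new StringIndexer()
  .setInputCol("weather")
  .setOutputCol("weatherIndex")

// Step 2: turn the index into a 0/1 vector; dropLast=false keeps all N positions
val encoder = new OneHotEncoder()
  .setInputCols(Array("weatherIndex"))
  .setOutputCols(Array("weatherVec"))
  .setDropLast(false)

val indexed = indexer.fit(weatherDF).transform(weatherDF)
encoder.fit(indexed).transform(indexed).show(truncate = false)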

1.2 Building a decision tree model

1.2.1 Combining the raw features into a feature vector

    // Split into training and test sets
    val Array(trainData, testData) = dataWithHeader.randomSplit(Array(0.9, 0.1))
    trainData.cache()
    testData.cache()

    // The input DataFrame has many columns, each holding one feature that can be used to predict the target column.
    // Spark MLlib requires all of the inputs to be assembled into a single column whose values are vectors.
    // VectorAssembler performs this conversion.
    val inputCols: Array[String] = trainData.columns.filter(_ != "Cover_Type")
    val assembler: VectorAssembler = new VectorAssembler()
      .setInputCols(inputCols) // every column except the target becomes an input feature, so the resulting DataFrame gains a new "featureVector" column
      .setOutputCol("featureVector")

    val assembledTrainData: DataFrame = assembler.transform(trainData)

    assembledTrainData.select(
      col("featureVector")
    ).show(10, truncate = false)
+----------------------------------------------------------------------------------------------------+
|featureVector                                                                                       |
+----------------------------------------------------------------------------------------------------+
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1859.0,18.0,12.0,67.0,11.0,90.0,211.0,215.0,139.0,792.0,1.0,1.0])  |
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1860.0,18.0,13.0,95.0,15.0,90.0,210.0,213.0,138.0,780.0,1.0,1.0])  |
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1861.0,35.0,14.0,60.0,11.0,85.0,218.0,209.0,124.0,832.0,1.0,1.0])  |
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1866.0,23.0,14.0,85.0,16.0,108.0,212.0,210.0,133.0,819.0,1.0,1.0]) |
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1867.0,20.0,15.0,108.0,19.0,120.0,208.0,206.0,132.0,808.0,1.0,1.0])|
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1868.0,27.0,16.0,67.0,17.0,95.0,212.0,204.0,125.0,859.0,1.0,1.0])  |
|(54,[0,1,2,3,4,5,6,7,8,9,13,18],[1871.0,22.0,22.0,60.0,12.0,85.0,200.0,187.0,115.0,792.0,1.0,1.0])  |
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1871.0,36.0,19.0,134.0,26.0,120.0,215.0,194.0,107.0,797.0,1.0,1.0])|
|(54,[0,1,2,3,4,5,6,7,8,9,13,15],[1871.0,37.0,19.0,120.0,29.0,90.0,216.0,195.0,107.0,759.0,1.0,1.0]) |
|(54,[0,1,2,3,4,5,6,7,8,9,13,18],[1872.0,12.0,27.0,85.0,25.0,60.0,182.0,174.0,118.0,577.0,1.0,1.0])  |
+----------------------------------------------------------------------------------------------------+
  • The output does not look like a plain list of numbers because it shows the raw representation of the vector, an instance of SparseVector, which saves storage space. Since most of the 54 values are 0, only the non-zero values and their indices are stored (see the short sketch after this list).

  • VectorAssembler is an example of a Transformer in the current Spark MLlib Pipeline API. It transforms one DataFrame into another and can be combined with other Transformers into a pipeline. Later we will chain these transformations together into an actual Pipeline.
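As a quick illustration of the sparse representation (a standalone sketch, independent of the pipeline code):

import org.apache.spark.ml.linalg.Vectors

// (54,[0,1,13,15],[1859.0,18.0,1.0,1.0]) means: length 54, the listed indices hold
// the listed values, and every other position is 0.
val sparse = Vectors.sparse(54, Array(0, 1, 13, 15), Array(1859.0, 18.0, 1.0, 1.0))
println(sparse.toArray.mkString("[", ", ", "]")) // dense view: zeros filled in everywhere else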

1.2.2 Building a decision tree

    val classifier = new DecisionTreeClassifier()
                          .setSeed(Random.nextLong())      // random seed
                          .setLabelCol("Cover_Type")       // target column
                          .setFeaturesCol("featureVector") // assembled feature column
                          .setPredictionCol("prediction")  // name of the prediction column

    // DecisionTreeClassificationModel is itself a Transformer:
    // it turns a DataFrame containing feature vectors into a DataFrame containing feature vectors plus predictions.
    val model: DecisionTreeClassificationModel = classifier.fit(assembledTrainData)

    println(model.toDebugString)
DecisionTreeClassificationModel: uid=dtc_54cb31909b32, depth=5, numNodes=51, numClasses=8, numFeatures=54
  If (feature 0 <= 3048.5)
   If (feature 0 <= 2559.5)
    If (feature 10 <= 0.5)
     If (feature 0 <= 2459.5)
      If (feature 3 <= 15.0)
       Predict: 4.0
      Else (feature 3 > 15.0)
       Predict: 3.0
     Else (feature 0 > 2459.5)
      If (feature 17 <= 0.5)
       Predict: 2.0
      Else (feature 17 > 0.5)
       Predict: 3.0
    Else (feature 10 > 0.5)
     If (feature 9 <= 5129.0)
      Predict: 2.0
     Else (feature 9 > 5129.0)
      If (feature 5 <= 569.5)
       Predict: 2.0
      Else (feature 5 > 569.5)
       Predict: 5.0
       ......

From this printed representation we can see some of the model's tree structure. It consists of a series of nested decisions over the features, each comparing a feature value to a threshold.

While building a decision tree, the algorithm can also assess the importance of the input features, that is, estimate how much each input feature contributes to making correct predictions. This information is easy to obtain from the model.

// Pair each column name with its importance (higher is better) and print them sorted from most to least important.
// Elevation appears to be by far the most important feature; most of the other features contribute almost nothing to predicting the cover type!
model.featureImportances.toArray.zip(inputCols).sorted.reverse.foreach(println)
(0.8066003452907752,Elevation)
(0.04178573786315329,Horizontal_Distance_To_Hydrology)
(0.03280245080822316,Wilderness_Area1)
(0.030257284101934206,Soil_Type4)
(0.02562302586398405,Hillshade_Noon)
(0.023493741973492223,Soil_Type2)
(0.016910986928613186,Soil_Type32)
(0.011741228151910562,Wilderness_Area3)
(0.005884894981433861,Soil_Type23)
(0.0027811902118641293,Hillshade_9am)
(0.0021191138246161745,Horizontal_Distance_To_Roadways)
(0.0,Wilderness_Area4)
(0.0,Wilderness_Area2)
(0.0,Vertical_Distance_To_Hydrology)
(0.0,Soil_Type9)
......

1.2.3 Prediction

    // Compare the model's predictions with the true cover types
    val predictions = model.transform(assembledTrainData)

    predictions
      .select("Cover_Type", "prediction", "probability")
      .show(10, truncate = false)

(output omitted: a table showing the Cover_Type, prediction, and probability columns)

  • The output also contains a probability column, which gives the model's estimate of how likely each possible outcome is to be correct.

  • Although there are only 7 possible outcomes, the probability vector actually holds 8 values. The values at indices 1 through 7 give the probabilities of outcomes 1 through 7 respectively. Index 0 also holds a value, but it is always 0.0 and can be ignored, because 0 is not a valid outcome at all.

  • The decision tree classifier has several hyperparameters that can be tuned; this code uses their default values.

1.2.4 Evaluating the model

    // Evaluate training quality
    val evaluator = new MulticlassClassificationEvaluator()
                      .setLabelCol("Cover_Type")
                      .setPredictionCol("prediction")


    println("Accuracy: " + evaluator.setMetricName("accuracy").evaluate(predictions))
    println("F1 score: " + evaluator.setMetricName("f1").evaluate(predictions))
Accuracy: 0.7007016671765066
F1 score: 0.6810479157002327
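MulticlassClassificationEvaluator supports a few more metric names besides accuracy and f1; for example (a small sketch reusing the evaluator defined above):

    // Weighted precision/recall average the per-class values, weighted by class frequency
    println("Weighted precision: " + evaluator.setMetricName("weightedPrecision").evaluate(predictions))
    println("Weighted recall: " + evaluator.setMetricName("weightedRecall").evaluate(predictions))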

Confusion matrix

A single accuracy number gives a good summary of how good a classifier's output is, but sometimes a confusion matrix is more useful.

The confusion matrix is an N×N table, where N is the number of possible target values. Since our target has 7 classes, it is a 7×7 matrix. Each row represents the true class of the data, and each column, in order, a predicted class. The entry in row i and column j counts the samples that actually belong to class i but were predicted as class j. Correct predictions therefore land on the diagonal, while off-diagonal entries represent incorrect predictions.

    // Confusion matrix: Spark provides code for computing it; unfortunately,
    // that code lives in the older, RDD-based MLlib API.
    val predictionRDD = predictions
                      .select("prediction", "Cover_Type")
                      .as[(Double, Double)] // convert to a Dataset; requires import spark.implicits._
                      .rdd                  // convert to an RDD

    val multiclassMetrics = new MulticlassMetrics(predictionRDD)
    println("Confusion matrix:")
    println(multiclassMetrics.confusionMatrix)
Confusion matrix:
130028.0  55161.0   187.0    0.0    0.0  0.0  5175.0   
50732.0   196315.0  7163.0   53.0   0.0  0.0  762.0    
0.0       2600.0    29030.0  600.0  0.0  0.0  0.0      
0.0       0.0       1487.0   967.0  0.0  0.0  0.0      
12.0      7743.0    755.0    0.0    0.0  0.0  0.0      
0.0       3478.0    11812.0  387.0  0.0  0.0  0.0      
7923.0    193.0     60.0     0.0    0.0  0.0  10275.0  

Large counts on the diagonal are good, but the classifier clearly makes some mistakes. For example, it never predicted class 5 for any sample.
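To make that observation concrete, the same MulticlassMetrics instance can report per-class precision and recall (a small sketch; for class 5, which is never predicted, both values should come out as 0.0):

    // Per-class precision and recall derived from the confusion-matrix counts
    multiclassMetrics.labels.foreach { label =>
      println(f"class $label%1.0f: precision=${multiclassMetrics.precision(label)}%.3f, " +
              f"recall=${multiclassMetrics.recall(label)}%.3f")
    }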

    // Of course, a confusion matrix can also be computed with generic DataFrame API operations,
    // without relying on a specialized method.
    // Pivot
    // The pivot operation is straightforward:
    //   1. Group by the column that is not being transposed, here Cover_Type;
    //   2. Pivot with the pivot function; an optional second argument lists exactly which values to use;
    //   3. Aggregate a numeric field.

    val confusionMatrix = predictions
      .groupBy("Cover_Type")
      .pivot("prediction", (1 to 7)) // a pivot is an aggregation that transposes the distinct values of one (or more) grouping columns into separate columns
      .count()
      .na.fill(0.0)   // replace null with 0
      .orderBy("Cover_Type")

    confusionMatrix.show()
+----------+------+------+-----+---+---+---+-----+
|Cover_Type|     1|     2|    3|  4|  5|  6|    7|
+----------+------+------+-----+---+---+---+-----+
|         1|130028| 55161|  187|  0|  0|  0| 5175|
|         2| 50732|196315| 7163| 53|  0|  0|  762|
|         3|     0|  2600|29030|600|  0|  0|    0|
|         4|     0|     0| 1487|967|  0|  0|    0|
|         5|    12|  7743|  755|  0|  0|  0|    0|
|         6|     0|  3478|11812|387|  0|  0|    0|
|         7|  7923|   193|   60|  0|  0|  0|10275|
+----------+------+------+-----+---+---+---+-----+

With the default hyperparameters the model reaches about 70% accuracy. Accuracy can be improved by trying other hyperparameter values when building the decision tree.

1.3 Hyperparameters of decision trees

The important hyperparameters of a decision tree are listed below (a short sketch of setting them explicitly follows the list):

  • Maximum depth

    • The maximum depth simply limits the number of levels in the decision tree: the maximum
      number of chained decisions the classifier can make to classify a sample. Limiting the number of decisions helps avoid overfitting the training data.
  • Maximum number of bins

    • The decision tree algorithm generates candidate decision rules at each level, rules like "weight >= 100" or "weight >= 500".

    • Decisions always take the same form: for numeric features, feature >= value; for categorical features, feature in (value1, value2, ...). So the set of decision rules to try is really a set of values that can be plugged into the rule.

    • Spark MLlib's implementation calls these sets of candidate rules "bins" (the maxBins parameter). More bins take more processing time, but may yield better decision rules.

  • Impurity measure

    • Good rules split the training data's target values into relatively homogeneous, or "pure", subsets.

    • Choosing the best rule means minimizing the impurity of the two subsets it produces.

    • Two impurity measures are commonly used: Gini impurity (Spark's default) and entropy.

  • Minimum information gain

    • Helps avoid overfitting.
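These hyperparameters map directly onto setters of DecisionTreeClassifier. A minimal sketch (the values are purely illustrative, not tuned):

    // Same classifier as before, but with the hyperparameters set explicitly
    val tunedClassifier = new DecisionTreeClassifier()
      .setSeed(Random.nextLong())
      .setLabelCol("Cover_Type")
      .setFeaturesCol("featureVector")
      .setPredictionCol("prediction")
      .setMaxDepth(10)        // maximum depth of the tree
      .setMaxBins(100)        // maximum number of bins for candidate split rules
      .setImpurity("entropy") // impurity measure: "gini" (the default) or "entropy"
      .setMinInfoGain(0.01)   // minimum information gain required for a split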

1.4 Decision tree hyperparameter tuning

Which impurity measure leads to a more accurate decision tree? What is an appropriate maximum depth or number of bins? We can let Spark try many combinations of these values.

First we need to build a pipeline that encapsulates the same two steps as above: create the VectorAssembler and the DecisionTreeClassifier, then chain the two stages together to get a single Pipeline object that represents the two operations as one.

    val newAssembler = new VectorAssembler()
                         .setInputCols(inputCols)
                         .setOutputCol("featureVector")

    // Leave the hyperparameters unset here for now
    val newClassifier = new DecisionTreeClassifier()
                        .setSeed(Random.nextLong())
                        .setLabelCol("Cover_Type")
                        .setFeaturesCol("featureVector")
                        .setPredictionCol("prediction")

    // Combine into a Pipeline
    val pipeline = new Pipeline().setStages(Array(newAssembler, newClassifier))

    // Use ParamGridBuilder, built into the Spark ML API, to test combinations of hyperparameters
    val paramGrid = new ParamGridBuilder() // 4 hyperparameters with 2 values each: 16 combinations in total, so 16 models will be trained and evaluated
                      .addGrid(newClassifier.impurity, Seq("gini", "entropy"))
                      .addGrid(newClassifier.maxDepth, Seq(1, 20))
                      .addGrid(newClassifier.maxBins, Seq(40, 300))
                      .addGrid(newClassifier.minInfoGain, Seq(0.0, 0.05))
                      .build()

    // Evaluation metric: accuracy
    val multiclassEval = new MulticlassClassificationEvaluator()
                              .setLabelCol("Cover_Type")
                              .setPredictionCol("prediction")
                              .setMetricName("accuracy")

    // A CrossValidator could run full k-fold cross-validation here, but that costs k times as much
    // and adds little value on large datasets, so TrainValidationSplit is sufficient.
    val validator = new TrainValidationSplit()
                        .setSeed(Random.nextLong())
                        .setEstimator(pipeline)           // the pipeline
                        .setEvaluator(multiclassEval)     // the evaluator
                        .setEstimatorParamMaps(paramGrid) // the hyperparameter combinations
                        .setTrainRatio(0.9)               // TrainValidationSplit actually splits the training data into 90% / 10% subsets

    val validatorModel: TrainValidationSplitModel = validator.fit(trainData)

    // The validator's result contains the best model it found.
    val bestModel = validatorModel.bestModel

    // Print the best model's parameters:
    // manually extract the DecisionTreeClassificationModel instance from the resulting PipelineModel, then extract its parameters.
    println(bestModel.asInstanceOf[PipelineModel].stages.last.extractParamMap)
{
    
    
	dtc_1d4212c56614-cacheNodeIds: false,
	dtc_1d4212c56614-checkpointInterval: 10,
	dtc_1d4212c56614-featuresCol: featureVector,
	dtc_1d4212c56614-impurity: entropy,
	dtc_1d4212c56614-labelCol: Cover_Type,
	dtc_1d4212c56614-leafCol: ,
	dtc_1d4212c56614-maxBins: 40,
	dtc_1d4212c56614-maxDepth: 20,
	dtc_1d4212c56614-maxMemoryInMB: 256,
	dtc_1d4212c56614-minInfoGain: 0.0,
	dtc_1d4212c56614-minInstancesPerNode: 1,
	dtc_1d4212c56614-minWeightFractionPerNode: 0.0,
	dtc_1d4212c56614-predictionCol: prediction,
	dtc_1d4212c56614-probabilityCol: probability,
	dtc_1d4212c56614-rawPredictionCol: rawPrediction,
	dtc_1d4212c56614-seed: 2458929424685097192
}

This output tells us a lot about the fitted model:

  • Entropy turned out to be the more effective impurity measure.

  • A maximum depth of 20 beats a depth of 1, as we would expect.

  • It may be surprising that the best model was fit with only 40 bins, but this probably means that 40 bins is "good enough" rather than that 40 bins is "better" than 300.

  • Finally, a minInfoGain of 0 beat the non-zero minimum, which may mean the model is more prone to underfitting than to overfitting.

The hyperparameter combinations and their evaluation metrics can be retrieved with getEstimatorParamMaps and validationMetrics respectively.

    // getEstimatorParamMaps and validationMetrics return the hyperparameter combinations and the corresponding evaluation results.
    // We can pair each combination of hyperparameters with its metric.

    val paramsAndMetrics = validatorModel.validationMetrics.zip(validatorModel.getEstimatorParamMaps).sortBy(-_._1)

    paramsAndMetrics.foreach {
      case (metric, params) =>
        println(metric)
        println(params)
        println()
    }
0.9083158925519863
{
    
    
	dtc_5c5081d572b6-impurity: entropy,
	dtc_5c5081d572b6-maxBins: 40,
	dtc_5c5081d572b6-maxDepth: 20,
	dtc_5c5081d572b6-minInfoGain: 0.0
}
......
    // What accuracy did this model achieve on the validation set? And what accuracy does it reach on the test set?
    println("Best accuracy on the validation set: " + validatorModel.validationMetrics.max)
    println("Accuracy on the test set: " + multiclassEval.evaluate(bestModel.transform(testData)))
Best accuracy on the validation set: 0.9083158925519863
Accuracy on the test set: 0.9134603776838838
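As noted in the comments above, a CrossValidator could be used instead of TrainValidationSplit if the extra cost of full k-fold cross-validation is acceptable. A sketch of that swap (the numFolds value is illustrative):

    import org.apache.spark.ml.tuning.CrossValidator

    // Drop-in alternative to TrainValidationSplit: every hyperparameter combination is
    // evaluated on k different train/validation splits, at roughly k times the cost.
    val crossValidator = new CrossValidator()
      .setSeed(Random.nextLong())
      .setEstimator(pipeline)
      .setEvaluator(multiclassEval)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // val cvModel = crossValidator.fit(trainData) // then use cvModel.bestModel as before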

2 Random Forest

  • At each level of the decision tree, the algorithm does not consider every possible decision rule; if it did, its running time would be unimaginable. For a categorical feature with N values there are 2^N - 2 possible decision rules (every subset except the empty set and the full set), so even a moderately large N would produce billions of candidate rules.

  • The tree-building process also injects some randomness into rule selection: only a few randomly chosen features are considered at a time, and only a random subset of the training data is used. This trades a little accuracy for a large speedup, and it also means the algorithm builds a different tree every time.

  • But there should be not just one tree, but many, each giving a reasonable, independent, and distinct estimate of the correct target value. The collective average prediction of these trees should be closer to the correct answer than any single tree's prediction. It is precisely the randomness in the construction process that gives each tree its independence, and that is the key to the random decision forest.

  • The prediction of a random decision forest is simply a weighted average of the predictions of all the decision trees.

    • For categorical targets, this is the class with the most votes, or the class with the highest probability after averaging the trees' probability estimates.
    • Like decision trees, random decision forests also support regression, where the forest's prediction is the average of each tree's prediction.
package com.yyds


import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.{Model, Pipeline, PipelineModel, Transformer}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit, TrainValidationSplitModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.classification.RandomForestClassificationModel
import scala.util.Random


object ForestModelTest {

  Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {

    // Build the SparkSession instance via the builder pattern
    val spark: SparkSession = {
      SparkSession
        .builder()
        .appName(this.getClass.getSimpleName.stripSuffix("$"))
        .master("local[1]")
        .config("spark.sql.shuffle.partitions", "3")
        .getOrCreate()
    }


    // Use Spark's built-in CSV reader
    val dataWithHeader: DataFrame = spark.read
      .option("inferSchema", "true") // infer column types
      .option("header", "true")      // parse the header row
      .csv("D:\\kaggle\\covtype\\covtype.csv")


    // Split into training and test sets
    val Array(trainData, testData) = dataWithHeader.randomSplit(Array(0.9, 0.1))
    trainData.cache()
    testData.cache()

    // Input feature columns
    val inputCols: Array[String] = trainData.columns.filter(_ != "Cover_Type")


    val newAssembler = new VectorAssembler()
      .setInputCols(inputCols)
      .setOutputCol("featureVector")


    // Random forest classifier
    val newClassifier = new RandomForestClassifier()
      .setSeed(Random.nextLong())
      .setLabelCol("Cover_Type")
      .setFeaturesCol("featureVector")
      .setPredictionCol("prediction")


    // Combine into a Pipeline
    val pipeline = new Pipeline().setStages(Array(newAssembler, newClassifier))

    // Use ParamGridBuilder, built into the Spark ML API, to test combinations of hyperparameters
    val paramGrid = new ParamGridBuilder() // 4 hyperparameters with 2 values each: 16 combinations in total, so 16 models will be trained and evaluated
      .addGrid(newClassifier.impurity, Seq("gini", "entropy"))
      .addGrid(newClassifier.maxDepth, Seq(1, 20))
      .addGrid(newClassifier.maxBins, Seq(40, 300))
      .addGrid(newClassifier.numTrees, Seq(10, 20)) // number of decision trees to build
      .build()

    // Evaluation metric: accuracy
    val multiclassEval = new MulticlassClassificationEvaluator()
      .setLabelCol("Cover_Type")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    // A CrossValidator could run full k-fold cross-validation here, but that costs k times as much
    // and adds little value on large datasets, so TrainValidationSplit is sufficient.
    val validator = new TrainValidationSplit()
      .setSeed(Random.nextLong())
      .setEstimator(pipeline)           // the pipeline
      .setEvaluator(multiclassEval)     // the evaluator
      .setEstimatorParamMaps(paramGrid) // the hyperparameter combinations
      .setTrainRatio(0.9)               // TrainValidationSplit actually splits the training data into 90% / 10% subsets

    val validatorModel: TrainValidationSplitModel = validator.fit(trainData)

    // The validator's result contains the best model it found.
    val bestModel = validatorModel.bestModel

    // Print the best model's parameters:
    // manually extract the RandomForestClassificationModel instance from the resulting PipelineModel, then extract its parameters.
    println(bestModel.asInstanceOf[PipelineModel].stages.last.extractParamMap)

    // The random forest classifier has an extra hyperparameter: the number of decision trees to build.
    // Like maxBins, up to some point a larger value should give better results; the price is that
    // building many trees takes many times longer than building one.
    val forestModel = bestModel.asInstanceOf[PipelineModel].stages.last.asInstanceOf[RandomForestClassificationModel]

    // Our picture of feature importance becomes more accurate
    println("Feature importances:")
    forestModel.featureImportances.toArray.zip(inputCols).sorted.reverse.foreach(println)


    // What accuracy does this model achieve on the validation set? And what accuracy does it reach on the test set?
    println("Best accuracy on the validation set: " + validatorModel.validationMetrics.max)
    println("Accuracy on the test set: " + multiclassEval.evaluate(bestModel.transform(testData)))


    // Prediction
    // The resulting "best model" is really the whole pipeline, including how the input is transformed
    // for the model as well as the model used for prediction.
    // It accepts a new DataFrame as input; the only difference from the DataFrame we started with is
    // the missing "Cover_Type" column.
    bestModel
      .transform(testData.drop("Cover_Type"))
      .select("prediction")
      .show(10)

  }

}
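Since the forest's prediction averages the votes of its member trees, it can be instructive to inspect the individual trees of the fitted model. A small sketch, assuming the forestModel value from the code above is in scope:

    // Each member of the ensemble is itself a DecisionTreeClassificationModel;
    // treeWeights holds the weight each tree receives when predictions are averaged.
    println(s"Number of trees: ${forestModel.getNumTrees}")
    forestModel.trees.zip(forestModel.treeWeights).take(3).foreach { case (tree, weight) =>
      println(s"${tree.uid}: weight=$weight, depth=${tree.depth}, nodes=${tree.numNodes}")
    }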

Origin: blog.csdn.net/qq_44665283/article/details/132306525