Understanding XGBoost in Depth: A Distributed Implementation

Source: WeChat official account "Coggle Data Science"

EDITORIAL

This article focuses on the Scala implementation of XGBoost on the Spark platform and walks through feature extraction, transformation and selection, XGBoost model training, Pipelines, and model selection step by step.

A brief review of XGBoost

XGBoost (Extreme Gradient Boosting) was proposed by Dr. Tianqi Chen of the University of Washington and began as a research project of the Distributed (Deep) Machine Learning Community (DMLC) group. It later became well known in industry after its strong performance in the Higgs Boson Machine Learning Challenge, and it is now widely used in data science applications. Major Internet companies such as Tencent and Alibaba have applied XGBoost to their businesses, and in data science competitions it has become a go-to weapon for winning contestants. XGBoost has achieved good results on problems such as recommendation, search ranking, user behavior prediction, click-through rate prediction, and product classification. Although neural networks (especially deep neural networks) have become increasingly popular in recent years, XGBoost still has unique advantages in scenarios with limited training samples, short training time, and little tuning expertise. Compared with deep neural networks, XGBoost handles tabular data better and is more interpretable; it also has the advantages of easier parameter tuning and insensitivity to the scaling of the input data.

Compared with other gradient boosting implementations, XGBoost includes many optimizations that significantly improve both training speed and accuracy. Its main features are as follows.

1) Adds a regularization term to the objective function to control model complexity and prevent overfitting (the regularized objective and its second-order expansion are sketched after this list).

2) Applies a second-order Taylor expansion to the objective function, using both first-order and second-order derivatives.

3) Implements a parallelizable approximate histogram algorithm.

4) Implements shrinkage and column subsampling (borrowing from GBDT and random forests).

5) Implements a fast histogram algorithm with loss-guided tree construction (borrowing from LightGBM).

6) Implements the weighted quantile sketch, an approximate quantile algorithm for weighted data.

7) Handles missing values by automatically learning the default split direction for samples with missing values.

8) Pre-sorts the data and stores it in blocks, which facilitates parallel computation.

9) Uses cache-aware access and out-of-core block computation to improve data access and computation efficiency.

10) Implements distributed computing based on Rabit and integrates with mainstream big-data platforms.

11) Besides CART base classifiers, XGBoost also supports linear classifiers and the LambdaMART ranking algorithm.

12) Implements DART, which introduces the Dropout technique.
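
For reference, items 1) and 2) correspond to the regularized objective used by XGBoost. In the notation of the XGBoost paper, it can be sketched as:

\[
\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}
\]

where T is the number of leaves of a tree and w its leaf weights. When adding the t-th tree f_t, the objective is approximated by a second-order Taylor expansion:

\[
\mathcal{L}^{(t)} \approx \sum_{i} \Big[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \Big] + \Omega(f_t)
\]

where g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction.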

A growing number of developers are contributing to the XGBoost open-source community. XGBoost provides packages for multiple languages, such as Python, Scala, and Java. Python users can integrate XGBoost with scikit-learn to build machine learning applications more efficiently. In addition, XGBoost has been integrated into mainstream big-data platforms such as Spark and Flink.

Distributed XGBoost

We may rarely or never use the distributed version of XGBoost in competitions, but with the explosive growth of data in industry, single-machine mode can no longer meet users' needs. XGBoost therefore released a distributed version, which is an important reason for its popularity. This article focuses on the Spark-based implementation of XGBoost and walks you step by step through feature extraction, transformation and selection with Spark, as well as XGBoost model training, Pipelines, and model selection.

1. Implementation on the Spark platform

Spark is a versatile and efficient big-data processing engine; it is an in-memory parallel computing framework for large-scale data. Because Spark computes in memory, it can support near-real-time big-data computation, a large efficiency improvement over traditional Hadoop MapReduce. Spark also has a rich ecosystem: with Spark at its core, it covers Spark SQL for structured data query and analysis, the distributed machine learning library MLlib, the parallel graph computation framework GraphX, the fault-tolerant stream computing framework Spark Streaming, and so on. Because Spark is widely used in industry and has a huge user base, XGBoost released XGBoost4J-Spark to support the Spark platform.

1.1 Spark architecture

As shown in Figure 1, Spark mainly consists of the following components.

  • Client: the client that submits Spark jobs.
  • Driver: accepts the Spark job request and starts the SparkContext.
  • SparkContext: the context of the entire application; it controls the application's life cycle.
  • ClusterManager: the cluster manager, which allocates resources to the application; there are several kinds, such as Spark's own Standalone, Mesos, and YARN.
  • Worker: any cluster node that can run application code; it runs one or more Executors.
  • Executor: a process started on a Worker node for an application; it runs tasks and is responsible for storing data in memory or on disk. Each application has its own Executors.

As can be seen from Figure 1, the Spark job submission process is as follows. First, the Client submits the application; the Driver receives the request and starts the SparkContext. The SparkContext connects to the ClusterManager, which is responsible for allocating resources to the application. Spark obtains Executors on the cluster nodes to execute tasks, and the Executors perform the computation and store the data. Spark sends the application code to the Executors, and finally the SparkContext assigns tasks to the Executors for execution.

Figure 1 Spark architecture

In a Spark application, the entire execution flow is logically converted into a DAG (Directed Acyclic Graph) of RDDs (Resilient Distributed Datasets). The RDD is Spark's basic unit of operation and is described in detail later. Tasks converted into DAG form are scheduled and distributed for distributed execution. Figure 2 shows an example of Spark's overall execution flow as a DAG.

Figure 2 Spark's overall DAG execution flow

In Figure 2, Transformations are one type of RDD operation, including map, flatMap, filter, etc. Such operations are lazily executed: converting one RDD into another is not performed immediately; the operation is only recorded, and the computation actually starts when an Actions-type operation is encountered. Actions return a result or write RDD data to a storage system, and they are what triggers Spark to start computing. After an Action operator triggers execution, a record of all the RDDs it depends on is generated, and Spark cuts the task into different stages according to the dependencies between RDDs; the scheduler then schedules the RDD tasks for computation. In Figure 2, A through E represent different RDDs, and the squares inside an RDD represent its partitions. Spark first reads data from HDFS into memory, forming RDD A and RDD C. RDD A is transformed into RDD B; RDD C is transformed into RDD D via a map operation; RDD B and RDD E are joined into RDD F. Connecting RDD B and RDD E into RDD F involves a Shuffle. Finally, RDD F is output and saved to HDFS via the saveAsSequenceFile function.
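
As a rough illustration of such a flow, the following minimal Scala sketch reads two files from HDFS, applies map transformations, joins the results, and writes them back with saveAsSequenceFile. The paths and the comma-separated field layout are made up for the example, and sc is assumed to be an existing SparkContext.

// read two datasets from HDFS into RDDs (placeholder paths)
val logs  = sc.textFile("hdfs:///tmp/logs.txt")
val users = sc.textFile("hdfs:///tmp/users.txt")

// transformations are only recorded here; nothing is computed yet
val logPairs  = logs.map { line => val f = line.split(","); (f(0), f(1)) }
val userPairs = users.map { line => val f = line.split(","); (f(0), f(1)) }

// the join causes a shuffle when the job actually runs
val joined = logPairs.join(userPairs).map { case (k, (a, b)) => (k, a + "," + b) }

// the action triggers the whole computation and saves the result to HDFS
joined.saveAsSequenceFile("hdfs:///tmp/joined_output")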

1.2 RDD

Spark introduced the concept of the RDD. An RDD is an abstraction of distributed memory: a fault-tolerant, parallel data structure and Spark's most fundamental data structure. All computations are based on it, and Spark's upper-layer algorithms are designed through operations on RDDs.

As a data structure, an RDD is essentially a read-only, partitioned collection of records. Logically it can be thought of as a distributed array whose elements can be of any data type. An RDD may contain multiple partitions, each of which is a subset of the dataset. RDDs can depend on one another, and the dependencies formed by RDD operations are what Spark schedules to execute the whole program.

RDDs support two kinds of operators: transformations and actions.

1. Transformations

Transformation operations are lazily executed: converting one RDD into another is not done immediately; the operation is merely recorded, and the computation actually starts only when an Actions-type operation is encountered. Transformations include map, flatMap, mapPartitions and many more. The commonly used transformations are described below.

  • map: applies a user-defined function to each element of the original RDD to produce a new RDD. Each element of the original RDD corresponds to exactly one element of the new RDD.
  • flatMap: similar to map; elements of the original RDD are transformed by the function into new elements, and the resulting collections for each element are merged into a single collection.
  • mapPartitions: obtains an iterator for each partition and applies the function to the whole partition's iterator (i.e., all elements of the partition).
  • union: combines two RDDs without deduplication, keeping all elements. The precondition is that the two RDDs have the same element type.
  • filter: filters elements; the function is applied to each element, and elements for which the return value is True are kept.
  • sample: samples the elements of an RDD, obtaining a subset of all elements.
  • cache: caches RDD elements from disk into memory; equivalent to persist(MEMORY_ONLY).
  • persist: caches RDD data; where the data is cached is determined by the StorageLevel parameter, e.g., DISK_ONLY means cache on disk only, and MEMORY_AND_DISK means cache in both memory and on disk.
  • groupBy: generates a key for each RDD element with the given function, then groups the elements by key.
  • reduceByKey: reduces the multiple values corresponding to each key with a user-defined function.
  • join: equivalent to a join in SQL; joins two RDDs using the key as the join condition and returns the result.

2. Actions

Action operations return a result or write RDD data to a storage system, and they are what triggers Spark to start computing. Actions include foreach, collect and so on. The commonly used actions are introduced below (a short sketch combining transformations and actions follows the list).

  • foreach: applies a user-defined function to each element of the RDD; returns Unit.
  • collect: returns the distributed RDD to the driver as a Scala Array.
  • count: returns the number of elements in the RDD.
  • saveAsTextFile: saves the data as text files under the specified HDFS directory.
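
A minimal sketch of how a few of these transformations and actions fit together (a toy word count; the input path is a placeholder and sc is assumed to be an existing SparkContext):

// load a text file in which each line is a single word
val words = sc.textFile("hdfs:///tmp/words.txt")

// transformations: drop empty lines, build (word, 1) pairs, sum the counts per word
val counts = words.filter(_.nonEmpty)
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.cache()   // cache the result, since several actions below reuse it

// actions: nothing above is executed until one of these is called
val total = counts.count()      // number of distinct words
val local = counts.collect()    // bring the result back to the driver as an Array
counts.saveAsTextFile("hdfs:///tmp/word_counts")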

DataSet is a distributed collection of data; it is an interface added in Spark 1.6 that combines the advantages of RDDs with the optimizations of the Spark SQL execution engine. DataFrame is a distributed dataset organized into named columns and can be thought of as a table in a relational database, but a DataFrame can be constructed from many sources, such as structured data files, Hive tables, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. Only a few common APIs are listed below (see the related documentation for the full API); a short example follows the list.

  • select(cols: Column*): selects the columns that satisfy the expressions and returns a new DataFrame, where cols is a list of column names or expressions.
  • filter(condition: Column): filters rows by the given condition.
  • count(): returns the number of rows in the DataFrame.
  • describe(cols: String*): computes statistics for numeric columns, including count, mean, standard deviation, minimum, and maximum.
  • groupBy(cols: Column*): groups by the specified columns so that the data can be aggregated with aggregation functions.
  • join(right: Dataset[_]): joins with another DataFrame.
  • withColumn(colName: String, col: Column): adds a column or replaces the existing column with the same name, returning a new DataFrame.
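
A small sketch of these calls on a toy DataFrame (the column names are made up, and spark is assumed to be an existing SparkSession):

import org.apache.spark.sql.functions._

// a toy DataFrame with three columns
val toyDF = spark.createDataFrame(Seq(
  (1, "a", 10.0),
  (2, "b", 20.0),
  (3, "a", 30.0)
)).toDF("id", "category", "value")

toyDF.select("id", "value")                        // keep only two columns
toyDF.filter(col("value") > 15.0)                  // keep rows with value > 15
toyDF.groupBy("category").agg(avg("value"))        // average value per category
toyDF.withColumn("value2", col("value") * 2)       // add a derived column
toyDF.describe("value").show()                     // basic statistics of a numeric column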

1.3 XGBoost4J-Spark

As Spark has become widely used in industry and accumulated a large user base, more and more enterprises are building data platforms with Spark at their core to support data mining, analytical computing, interactive real-time queries and the like, and XGBoost4J-Spark came into being. This section describes how to do machine learning with Spark and how to apply XGBoost4J-Spark well within Spark's machine learning pipeline.

XGBoost4J-Spark is implemented in jvm-packages, so to call XGBoost4J-Spark you only need to add the following dependency to the project's pom.xml file:

<dependency>
  <groupId>ml.dmlc</groupId>
  <artifactId>xgboost4j-spark</artifactId>
  <version>0.7</version>
</dependency>

Figure 3 shows how XGBoost4J-Spark is applied in a Spark machine learning pipeline. Spark first loads the data as an RDD, DataFrame, or DataSet. If it is loaded as a DataFrame/DataSet, it can be further processed with Spark SQL, for example to drop certain columns. Feature engineering is then done with the Spark MLlib library, which provides many methods for the user to choose from; this is a very important step in the machine learning process, because good features can determine the upper bound of machine learning performance. After feature engineering, the generated training set can be fed into XGBoost4J-Spark for training; during this process the parameters can be tuned with Spark MLlib to obtain the optimal model. The trained model is then applied to the prediction set to obtain the final prediction results. To avoid retraining, the trained model can be saved and loaded directly when needed. In addition, after training, XGBoost4J-Spark can rank feature importance. Finally, the resulting data products are delivered to the related business.

Figure 3 XGBoost4J-Spark training flowchart

XGBoost4J-Spark 0.70 and later allow users to use both the low-level and the high-level memory abstractions in Spark, i.e., RDD and DataFrame/DataSet, while lower versions (0.60) only support the RDD interface. A DataFrame/DataSet can be thought of as a database table: it contains not only the data but also the table structure; it is structured data. Users can easily operate on it with the DataFrame/DataSet API that Spark provides, and can also process it with user-defined functions (UDFs); for example, the desired feature columns can easily be selected through the select function to form a new DataFrame/DataSet. The following example stores structured data in a JSON file, parses it into a DataFrame through Spark's API, and trains an XGBoost model with two lines of Scala code.

val df = spark.read.json("data.json")
// call the XGBoost API to train on a DataFrame training set
val xgboostModel = XGBoost.trainWithDataFrame(
      df, paramMap, numRound, nWorkers, useExternalMemory)

The code above is the XGBoost4J-Spark 0.7x implementation; in XGBoost4J-Spark 0.8x and above, part of the API has changed, and the training code is as follows:

val xgbClassifier = new XGBoostClassifier(paramMap).
                    setFeaturesCol("features").
                    setLabelCol("label")
val xgbClassificationModel = xgbClassifier.fit(df)

The following examples briefly introduce some common XGBoost4J-Spark APIs; for the rest, refer to the official documentation.
First, load the dataset, which can be read through Spark, e.g., loaded from an external file or via Spark SQL.
Then, set the model parameters, which can be adjusted according to the specific problem and data distribution:

val paramMap = Map(
    "eta" -> 0.1f,
    "num_class" -> 3,
    "max_depth" -> 3,
    "objective" -> "multi:softmax")

The invocation of model training is not repeated here; the meanings of the training function's parameters are as follows.

  • trainingData: the training set RDD.
  • params: the model training parameters.
  • round: the number of model iterations.
  • nWorkers: the number of XGBoost worker nodes used for training; if set to 0, XGBoost uses the number of partitions of the training set RDD as nWorkers.
  • obj: a user-defined objective function; the default is Null.
  • eval: a user-defined evaluation function; the default is Null.
  • useExternalMemory: whether to use external memory caching; if set to True, it can reduce the RAM cost of running XGBoost.
  • missing: the value in the dataset to be treated as missing (note that XGBoost treats this value as a missing value and sets it to null before training).

After training, the model can be saved to a file for later prediction. The model is saved as a Hadoop file and stored on HDFS. In version 0.7x this can be done with saveModelAsHadoopFile; an example call is as follows:

xgboostModel.saveModelAsHadoopFile("/tmp/bst.model")  

In 0.8x and above, this can be done directly with the save function, as follows:

xgboostModel.write.overwrite().save("/tmp/bst.model")  

A previously trained model file can be loaded directly by XGBoost for use. The 0.7x code is as follows:

val model = XGBoost.loadModelFromHadoopFile("/tmp/bst.model")

For 0.8x and above, the code is as follows:

val model = XGBoostClassificationModel.load("/tmp/bst.model")

The above is for a classification model; for a regression model it would be:

val model = XGBoostRegressionModel.load("/tmp/bst.model")

Pass the prediction set into the trained model to make predictions. In 0.7x, prediction on RDD-typed data is done as follows:

val predicts = model.predict(test)  

In 0.8x and above, DataSet-typed data can be predicted directly, as follows:

val predicts = model.transform(test)  

A model trained on Spark can also be downloaded to a local machine and loaded by a local XGBoost (Python, Java or Scala) for prediction. This makes it possible to train the model in a distributed way on massive training samples to improve its accuracy, while calling the distributed-trained model from a single machine to improve prediction speed.
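
One way to do this in 0.8x is to export the model's underlying native booster and reload it with the single-machine Scala API. This is only a sketch: it assumes the model object from the earlier examples and uses placeholder local paths.

import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

// export the native booster of the Spark-trained model to a local file
xgbClassificationModel.nativeBooster.saveModel("/tmp/local_bst.model")

// load it with the single-machine XGBoost API and predict on a local DMatrix
val localBooster = XGBoost.loadModel("/tmp/local_bst.model")
val preds = localBooster.predict(new DMatrix("/tmp/local_test.libsvm"))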

Users can not only operate on the dataset with the DataFrame/DataSet API, but can also process features with Spark MLlib. MLlib is a machine learning library built on top of Spark that provides general-purpose learning algorithms and utility classes. Features can easily be extracted and transformed with MLlib. MLlib also provides a rich set of algorithms, including classification, regression, clustering, collaborative filtering, and dimensionality reduction; users can combine these algorithms with XGBoost according to the application scenario. In addition, MLlib provides model selection tools with which users can automatically search for the best model parameters over an API-defined parameter grid.

Feature extraction, transformation and selection

Before the training set is fed into XGBoost4J-Spark for training, the features can first be processed with MLlib, including feature extraction, transformation and selection. This is a very important step before model training, but it is not mandatory; users can choose according to their application scenario.

MLlib mainly provides the following three feature extraction methods (a brief TF-IDF sketch follows the list).

  • TF-IDF: term frequency-inverse document frequency, a common text preprocessing step. The importance of a word increases proportionally with the number of times it appears in a document, but decreases with how often it appears across the corpus.
  • Word2Vec: maps each word in a document to a single fixed-length vector.
  • CountVectorizer: produces vectors of counts indicating how many times each word appears in a document.
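
As a small illustration, TF-IDF features can be built by chaining MLlib's Tokenizer, HashingTF and IDF (a minimal sketch on a toy DataFrame; spark is assumed to be an existing SparkSession):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentences = spark.createDataFrame(Seq(
  (0, "spark mllib makes feature engineering easy"),
  (1, "xgboost works well on tabular data")
)).toDF("id", "text")

// split each sentence into words
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(sentences)

// term frequencies via feature hashing
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1000)
val featurized = hashingTF.transform(wordsData)

// rescale term frequencies by inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(featurized).transform(featurized)
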
Feature transformation plays an important role in Spark machine learning pipelines and is widely used in all kinds of machine learning scenarios. MLlib provides many feature transformation methods; only the commonly used ones are described here.

(1) StringIndexer

StringIndexer encodes a string column of labels into a column of label indices. The index values lie in [0, numLabels), ordered by label frequency. As shown in Table 1, category is the original column and categoryIndex is the column produced by StringIndexer encoding: a is the most frequent (encoded as 0.0), followed by c (encoded as 1.0) and b (encoded as 2.0).

Table 1 StringIndexer encoding

The calling code is very simple; only the following two statements are needed:

val indexer = new StringIndexer()
              .setInputCol("category")
              .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)

(2) OneHotEncoder

OneHotEncoder maps a column of label indices to a column of binary vectors with at most one active value, and it can convert the index column generated by StringIndexer above into a vector column. OneHotEncoder is mainly used for categorical features such as gender, nationality, and so on. Categorical features cannot be fed to machine learning models directly: even after a string has been converted to a numeric feature by StringIndexer, models usually assume by default that the data is continuous and ordered, whereas categorical codes are not ordered and each number merely represents a category.

OneHotEncoder can be used together with StringIndexer, as follows:

val indexer = new StringIndexer()
              .setInputCol("category")
              .setOutputCol("categoryIndex")
              .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
              .setInputCol("categoryIndex")
              .setOutputCol("categoryVec")

val encoded = encoder.transform(indexed)

(3) Normalizer

Normalizer normalizes each input row's vector, converting multi-row vector input into a unified form. The parameter p (default 2) specifies which p-norm is used for the normalization. This normalization can standardize the input data and improve the behavior of later models.

val normalizer = new Normalizer()
                .setInputCol("features")
                .setOutputCol("normFeatures")
                .setP(1.0)

val l1NormData = normalizer.transform(dataFrame)

(4) StandardScaler

StandardScaler processes Vector data, standardizing each feature so that it has unit standard deviation and/or zero mean. It has the following parameters:

1) withStd: defaults to true; scales the data to unit standard deviation.

2) withMean: defaults to false; centers the data with the mean before scaling. This produces a dense output, so it is not suitable for sparse input.

val scaler = new StandardScaler()
            .setInputCol("features")
            .setOutputCol("scaledFeatures")
            .setWithStd(true)
            .setWithMean(false)

// compute summary statistics by fitting the StandardScaler
val scalerModel = scaler.fit(dataFrame)

// standardize the features
val scaledData = scalerModel.transform(dataFrame)

(5) MinMaxScaler

MinMaxScaler transforms a Vector column by rescaling each feature to a specified range, usually [0, 1]. It has the following two parameters.

1) min: defaults to 0.0; the lower bound of every feature after the transformation.

2) max: defaults to 1.0; the upper bound of every feature after the transformation.

val scaler = new MinMaxScaler()
            .setInputCol("features")
            .setOutputCol("scaledFeatures")

// compute summary statistics and generate a MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to the [min, max] range
val scaledData = scalerModel.transform(dataFrame)

(6) SQLTransformer

SQLTransformer implements feature transformations defined by SQL statements, such as "SELECT ... FROM __THIS__ ...", where "__THIS__" represents the underlying table of the input dataset.

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df)

(7) VectorAssembler

VectorAssembler combines a given list of columns into a single vector column. It can combine raw features and features produced by other transformers into one feature vector, in order to train machine learning models such as logistic regression and decision trees.

val assembler = new VectorAssembler()
               .setInputCols(Array("hour", "mobile", "userFeatures"))
               .setOutputCol("features")

val output = assembler.transform(dataset)

In addition to the methods described above, MLlib offers other feature transformation methods, such as Bucketizer for feature bucketing and PCA for dimensionality reduction, which are not introduced one by one here. Interested readers can consult the related documentation and choose the appropriate transformation method for their scenario.

Feature selection means eliminating redundant or irrelevant features in order to reduce the number of features, improve model accuracy, and reduce running time. MLlib provides the following feature selection methods (a brief ChiSqSelector sketch follows the list).

  • VectorSlicer: outputs a new feature vector from an existing feature vector; the new feature vector is a subset of the original one. It is useful for extracting features from a vector column.
  • RFormula: selects columns specified by an R model formula.
  • ChiSqSelector: chi-squared feature selection, applicable to categorical data.
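
For example, ChiSqSelector can keep the features most correlated with the label column (a minimal sketch; the column names follow the earlier examples and the number of features to keep is arbitrary):

import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setNumTopFeatures(2)            // keep the 2 features with the highest chi-squared statistic
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)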

XGBoost model training

Before XGBoost model training, the dataset goes through MLlib feature extraction, transformation and selection, which can make the features more representative, reduce the noise seen by the model, and improve model accuracy. In addition, selecting the truly relevant features simplifies the model and helps in understanding how the data was generated. The following example shows how to combine the MLlib feature extraction, transformation and selection described above with XGBoost, using the iris dataset. The concrete implementation for 0.8x versions is as follows:

import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassificationModel, XGBoostClassifier, XGBoostRegressionModel, XGBoostRegressor}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// read the dataset and create a DataFrame
val schema = new StructType(Array(
  StructField("sepal length", DoubleType, true),
  StructField("sepal width", DoubleType, true),
  StructField("petal length", DoubleType, true),
  StructField("petal width", DoubleType, true),
  StructField("class", StringType, true)))
val df = spark.read.schema(schema).csv("{HDFS PATH}/iris.txt")

// define a StringIndexer that converts the string column "class" into the numeric column "label"
val indexer = new StringIndexer()
  .setInputCol("class")
  .setOutputCol("label")

// apply the transformation and drop the original "class" column
val labelTransformed = indexer.fit(df).transform(df).drop("class")

// assemble the features with VectorAssembler into a "features" column
val vectorAssembler = new VectorAssembler().
  setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
  setOutputCol("features")
val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "label")

// define the training parameters
val paramMap = Map(
    "eta" -> 0.1f,
    "num_class" -> 3,
    "max_depth" -> 3,
    "objective" -> "multi:softmax",
    "num_round" -> 10,
    "num_workers" -> 1)

// train the model
val xgbClassifier = new XGBoostClassifier(paramMap).setFeaturesCol("features").setLabelCol("label")
val xgbClassificationModel = xgbClassifier.fit(xgbInput)

Pipelines

MLlib Pipelines are mainly inspired by the scikit-learn project. They are a higher-level, DataFrame-based API that makes it easier to combine multiple algorithms into a single pipeline or workflow, so that users can build complex machine learning workflow applications more easily. A Pipeline can integrate multiple tasks, such as feature transformation, model training, and parameter setting. Here are a few important concepts.

  • DataFrame: compared with an RDD, a DataFrame also contains schema information and can be thought of as a database table.
  • Transformer: a Transformer can be seen as an algorithm that converts one DataFrame into another DataFrame. For example, a trained model can be seen as a Transformer that converts a prediction-set DataFrame into a DataFrame containing prediction results.
  • Estimator: an algorithm that can be fit on a DataFrame to produce a Transformer; it operates on DataFrame data and generates a Transformer.
  • Pipeline: connects multiple Transformers and Estimators to form a machine learning workflow.
  • Parameter: the parameters set on Transformers and Estimators.

A Pipeline is a sequence of multiple stages, each of which is a Transformer or an Estimator. These stages are executed in order: when the input DataFrame enters the Pipeline, the data is transformed at each stage according to the corresponding rules. At a Transformer stage, the transform() method is called on the DataFrame. At an Estimator stage, the fit() method is called on the DataFrame to produce a Transformer, and then that Transformer's transform() method is called.

MLlib allows users to form a complete Pipeline of feature extraction/transformation/selection, model training, and prediction. XGBoost can also be embedded into a Spark machine learning workflow as a Pipeline stage. The following example shows how to combine Spark's feature-processing Transformers with XGBoost to build a Pipeline. The 0.8x code is as follows:

import ml.dmlc.xgboost4j.scala.spark.{TrackerConf, XGBoostClassificationModel, XGBoostClassifier, XGBoostRegressionModel, XGBoostRegressor}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.ml.Pipeline

// read the dataset and create a DataFrame
val schema = new StructType(Array(
  StructField("sepal length", DoubleType, true),
  StructField("sepal width", DoubleType, true),
  StructField("petal length", DoubleType, true),
  StructField("petal width", DoubleType, true),
  StructField("class", StringType, true)))
val df = spark.read.schema(schema).csv("{HDFS PATH}/iris.txt")

// define a StringIndexer that converts the string column "class" into the numeric column "label"
val indexer = new StringIndexer().
   setInputCol("class").
   setOutputCol("label")

// assemble the features with VectorAssembler into a "features" column
val vectorAssembler = new VectorAssembler().
  setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
  setOutputCol("features")

// define the training parameters
val paramMap = Map(
    "eta" -> 0.1f,
    "num_class" -> 3,
    "max_depth" -> 3,
    "objective" -> "multi:softmax",
    "num_round" -> 10,
    "num_workers" -> 1)

// define the model
val xgbClassifier = new XGBoostClassifier(paramMap).setFeaturesCol("features").setLabelCol("label")

// build the pipeline
val pipeline = new Pipeline().setStages(Array(indexer, vectorAssembler, xgbClassifier))
val model = pipeline.fit(df)

// predict
val predict = model.transform(df)

Model selection

Model selection is a very important task in machine learning: finding, from the data, the optimal model and parameters for a specific problem, also known as hyperparameter tuning. Model selection can be done on a single Estimator (such as logistic regression), or on an entire Pipeline containing multiple algorithms or other steps. Users can tune the whole Pipeline at once rather than tuning each element of the Pipeline separately. MLlib supports two model selection tools: CrossValidator and TrainValidationSplit.

(1) CrossValidator

CrossValidator performs cross-validation: the dataset is split into k folds that serve in turn as training and test sets. For example, with k set to 3, CrossValidator generates three pairs of datasets, each using 2/3 of the data as the training set and 1/3 as the test set. CrossValidator evaluates the model by averaging the evaluation metric over the three trained models. After the optimal parameters are determined, CrossValidator refits the entire dataset with those parameters to obtain the final model.
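
A minimal sketch of tuning the XGBoostClassifier from the earlier examples with CrossValidator (k = 3; the parameter grid is only illustrative):

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val cvParamGrid = new ParamGridBuilder().
  addGrid(xgbClassifier.maxDepth, Array(5, 6)).
  addGrid(xgbClassifier.eta, Array(0.1, 0.4)).
  build()

val cv = new CrossValidator().
  setEstimator(xgbClassifier).
  setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label")).
  setEstimatorParamMaps(cvParamGrid).
  setNumFolds(3)                   // k = 3 folds

val cvModel = cv.fit(xgbInput)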

(2) Train-Validation Split

In addition to CrossValidator, MLlib also provides Train-Validation Split for hyperparameter tuning. Unlike CrossValidator, Train-Validation Split validates only once rather than k times. Its computational cost is lower than CrossValidator's, but when the training dataset is not large enough, the results are less reliable. Train-Validation Split splits the dataset into two parts according to the trainRatio parameter. For example, with trainRatio = 0.75, TrainValidationSplit uses 75% of the data for training and 25% for validation.

Determining the optimal parameters through model selection is one of the key steps to getting the most out of an XGBoost model. Tuning parameters by hand is a time-consuming and tedious process. Recent versions of XGBoost4J-Spark can be tuned with MLlib's model selection tools, which greatly improves the efficiency of parameter tuning in the machine learning process. The following example illustrates how to use MLlib's model selection tools to tune XGBoost parameters. The 0.8x code is as follows:

import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.TrainValidationSplit

// create the xgbClassifier
val xgbClassifier = new XGBoostClassifier(paramMap).setFeaturesCol("features").setLabelCol("label")

// set the range of each parameter to tune
val paramGrid = new ParamGridBuilder().
       addGrid(xgbClassifier.maxDepth, Array(5, 6)).
       addGrid(xgbClassifier.eta, Array(0.1, 0.4)).
       build()

// build a TrainValidationSplit with trainRatio = 0.8, i.e. 80% of the data for training and 20% for validation
val tv = new TrainValidationSplit().
       setEstimator(xgbClassifier).
       setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("label")).
       setEstimatorParamMaps(paramGrid).
       setTrainRatio(0.8)
val model = tv.fit(xgbInput)

The example above uses MLlib's Train-Validation Split and a MulticlassClassificationEvaluator to tune the two XGBoost parameters eta and maxDepth, and selects the model that performs best under the evaluator's metric as the best model.

With XGBoost4J-Spark, users can build a more efficient Spark-based data processing pipeline. Such a pipeline can make full use of the DataFrame/DataSet API to process structured data while also having XGBoost as a powerful machine learning model. Furthermore, XGBoost4J-Spark connects XGBoost seamlessly with Spark MLlib, making feature extraction/transformation/selection and parameter tuning easier than before.

Closing remarks

This article reviewed the Scala implementation of XGBoost on the Spark platform and gave a brief introduction to and demonstration of MLlib. Believe me, sooner or later you will use distributed XGBoost.



Source: blog.csdn.net/hellozhxy/article/details/105288993