8.4 Classification and Regression

First, the logistic regression classifier

Logistic regression is a classical classification method in statistical learning and belongs to the family of log-linear models. The dependent variable in logistic regression can be binary or multi-class.

Task description: we use the iris data set as an example for the analysis (iris download: http://dblab.xmu.edu.cn/blog/wp-content/uploads/2017/03/iris.txt).

We take the iris features in the iris data set as the data source. The data set contains 150 records divided into 3 classes, with 50 records per class, and each record contains 4 attributes. Iris is a very commonly used training and test set for classification in data mining. For ease of understanding, here we mainly use the last two attributes (petal length and width) for the classification.

1. Using binomial logistic regression to solve a binary classification problem

First, we take two of the three classes of data and perform a binary classification analysis with binomial logistic regression.

(1) Import the required packages

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector,Vectors}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.{Pipeline,PipelineModel}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer,HashingTF, Tokenizer}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.functions

(2) Read the data and perform a brief analysis

scala> import spark.implicits._
import spark.implicits._
 
scala> case class Iris(features: org.apache.spark.ml.linalg.Vector, label: String)
defined class Iris
 
scala> val data = spark.sparkContext.textFile("file:///usr/local/spark/iris.txt").map(_.split(",")).map(p => Iris(Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble), p(4).toString())).toDF()
data: org.apache.spark.sql.DataFrame = [features: vector, label: string]

scala> data.show()
+-----------------+-----------+
|         features|      label|
+-----------------+-----------+
|[5.1,3.5,1.4,0.2]|Iris-setosa|
|[4.9,3.0,1.4,0.2]|Iris-setosa|
|[4.7,3.2,1.3,0.2]|Iris-setosa|
|[4.6,3.1,1.5,0.2]|Iris-setosa|
|[5.0,3.6,1.4,0.2]|Iris-setosa|
|[5.4,3.9,1.7,0.4]|Iris-setosa|
|[4.6,3.4,1.4,0.3]|Iris-setosa|
|[5.0,3.4,1.5,0.2]|Iris-setosa|
|[4.4,2.9,1.4,0.2]|Iris-setosa|
|[4.9,3.1,1.5,0.1]|Iris-setosa|
|[5.4,3.7,1.5,0.2]|Iris-setosa|
|[4.8,3.4,1.6,0.2]|Iris-setosa|
|[4.8,3.0,1.4,0.1]|Iris-setosa|
|[4.3,3.0,1.1,0.1]|Iris-setosa|
|[5.8,4.0,1.2,0.2]|Iris-setosa|
|[5.7,4.4,1.5,0.4]|Iris-setosa|
|[5.4,3.9,1.3,0.4]|Iris-setosa|
|[5.1,3.5,1.4,0.3]|Iris-setosa|
|[5.7,3.8,1.7,0.3]|Iris-setosa|
|[5.1,3.8,1.5,0.3]|Iris-setosa|
+-----------------+-----------+
only showing top 20 rows

Because we are dealing with a binary classification problem, we do not need all three classes of data, so we select two of the classes.

First, register the DataFrame we just obtained as a temporary table named iris; once the table is registered, we can query the data with SQL statements.

scala> data.createOrReplaceTempView("iris")
 
scala> val df = spark.sql("select * from iris where label != 'Iris-setosa'")
df: org.apache.spark.sql.DataFrame = [features: vector, label: string]
 
scala> df.map(t => t(1)+":"+t(0)).collect().foreach(println)
Iris-versicolor:[7.0,3.2,4.7,1.4]
Iris-versicolor:[6.4,3.2,4.5,1.5]
Iris-versicolor:[6.9,3.1,4.9,1.5]
……
……

(3) Build the ML pipeline

a. Obtain the label column and the feature column respectively, index them, and rename them.

scala> val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
labelIndexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e53e67411169
scala> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(df)
featureIndexer: org.apache.spark.ml.feature.VectorIndexerModel = vecIdx_53b988077b38

b. Next, randomly split the data into a training set and a test set, with 70% of the data in the training set.

scala> val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: string]
testData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: string]

c. Then, set the parameters of the logistic regression. Here we use setter methods throughout; a ParamMap can also be used (see the Spark MLlib official documentation for details). Here we set the maximum number of iterations to 10, the regularization parameter to 0.3, and so on.

scala> val lr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_692899496c23
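As a minimal sketch of the ParamMap alternative mentioned above (assuming the lr estimator just defined and the lrPipeline constructed in step e below), the same hyper-parameters could be supplied at fit time instead of through setters:

import org.apache.spark.ml.param.ParamMap

// Build the same hyper-parameters as a ParamMap instead of setter calls
val paramMap = ParamMap(lr.maxIter -> 10)
  .put(lr.regParam -> 0.3, lr.elasticNetParam -> 0.8)

// A ParamMap passed to fit() overrides the estimator's current settings,
// e.g. lrPipeline.fit(trainingData, paramMap)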

d. Here we set up a labelConverter, whose purpose is to convert the predicted numeric categories back to the original string labels.

scala> val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
labelConverter: org.apache.spark.ml.feature.IndexToString = idxToStr_c204eafabf57

e. Construct the pipeline, set its stages, and then call fit() to train the model.

scala> val lrPipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, lr, labelConverter))
lrPipeline: org.apache.spark.ml.Pipeline = pipeline_eb1b201af1e0
 
scala> val lrPipelineModel = lrPipeline.fit(trainingData)
lrPipelineModel: org.apache.spark.ml.PipelineModel = pipeline_eb1b201af1e0

f. A pipeline is essentially an Estimator. When the pipeline's fit() is called, it produces a PipelineModel, which is essentially a Transformer. We can then call the PipelineModel's transform() to generate a new DataFrame of predictions, i.e., use the trained model to make predictions on the test set for verification.

scala> val lrPredictions = lrPipelineModel.transform(testData)
lrPredictions: org.apache.spark.sql.DataFrame = [features: vector, label: string ... 6 more fields]

g. Finally, we can output the predictions: select() chooses which columns to output, collect() retrieves all rows, and foreach prints each row. The printed values are, in order: the true label and the features of the row, the predicted probabilities of belonging to the different categories, and the predicted label.

scala> lrPredictions.select("predictedLabel", "label", "features", "probability").collect().foreach { case Row(predictedLabel: String, label: String, features: Vector, prob: Vector) => println(s"($label, $features) --> prob=$prob, predictedLabel=$predictedLabel")}
(Iris-virginica, [4.9,2.5,4.5,1.7]) --> prob=[0.4796551461409372,0.5203448538590628], predictedLabel=Iris-virginica
(Iris-versicolor, [5.1,2.5,3.0,1.1]) --> prob=[0.5892626391059901,0.41073736089401], predictedLabel=Iris-versicolor
(Iris-versicolor, [5.5,2.3,4.0,1.3]) --> prob=[0.5577310241453046,0.4422689758546954], predictedLabel=Iris-versicolor

(4) Evaluate the model

Create a MulticlassClassificationEvaluator instance, set the column names of the true labels and the predicted labels via its setter methods, and then compute the prediction accuracy and the error rate.

scala> val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
evaluator: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_a80353e4211d
 
scala> val lrAccuracy = evaluator.evaluate(lrPredictions)
lrAccuracy: Double = 1.0
 
scala> println("Test Error = " + (1.0 - lrAccuracy))
Test Error = 0.0

We can then obtain the trained logistic regression model. As mentioned above, the fitted model is a PipelineModel, so we can retrieve the logistic regression model from its stages, as follows:

scala> val lrModel = lrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
lrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_692899496c23
 
scala> println("Coefficients: " + lrModel.coefficients+"Intercept: "+lrModel.intercept+"numClasses: "+lrModel.numClasses+"numFeatures: "+lrModel.numFeatures)
Coefficients: [-0.0396171957643483,0.0,0.0,0.07240315639651046]Intercept: -0.23127346342015379numClasses: 2numFeatures: 4
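Since BinaryLogisticRegressionSummary was imported above, we can also look at the summary produced during training. A minimal sketch, assuming the Spark 2.x summary API:

// Retrieve the summary produced during training (Spark 2.x API)
val trainingSummary = lrModel.summary

// For a binary model the training summary can be viewed as a BinaryLogisticRegressionSummary
val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]

// Area under the ROC curve on the training data
println("areaUnderROC: " + binarySummary.areaUnderROC)

// Objective (loss) value at each iteration
trainingSummary.objectiveHistory.foreach(println)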

2. Using multinomial logistic regression to solve a binary classification problem

3. Using multinomial logistic regression to solve a multi-class classification problem
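A minimal sketch of the multi-class case, reusing the three-class data DataFrame and the same pipeline pattern as in the binary example above; setFamily("multinomial") is the standard spark.ml way to request multinomial (softmax) regression, and the variable names here are illustrative:

// Index labels and features over the full three-class data set
val mLabelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
val mFeatureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(data)
val Array(mTrain, mTest) = data.randomSplit(Array(0.7, 0.3))

// Multinomial (softmax) logistic regression
val mlr = new LogisticRegression()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

// Convert predicted indices back to string labels
val mLabelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(mLabelIndexer.labels)

val mPipeline = new Pipeline().setStages(Array(mLabelIndexer, mFeatureIndexer, mlr, mLabelConverter))
val mPredictions = mPipeline.fit(mTrain).transform(mTest)

// Evaluate multi-class accuracy with the same evaluator as before
println("Accuracy = " + evaluator.evaluate(mPredictions))

For the binary case (point 2 above), the same sketch applies with the two-class df DataFrame in place of data.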

 

Second, the decision tree classifier

A decision tree is a basic method for classification and regression; here we introduce the decision tree used for classification. A decision tree model has a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class. During learning, a decision tree model is built from the training data according to the principle of minimizing a loss function; during prediction, new data are classified with the decision tree model.

Decision tree learning typically comprises three steps: feature selection, decision tree generation, and tree pruning.

1. Feature Selection

Feature selection means selecting features that have the ability to classify the training data, which improves the efficiency of decision tree learning. The criterion for feature selection is usually information gain (or the information gain ratio, the Gini index, etc.): compute the information gain for each feature, compare the values, and select the feature with the largest information gain (largest information gain ratio, or smallest Gini index).
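As a small illustration of this criterion (a hand-rolled toy sketch, not part of the Spark ML API), information gain can be computed as g(D, A) = H(D) - H(D|A), where H is the Shannon entropy:

// Toy illustration of information gain, g(D, A) = H(D) - H(D|A)
def entropy(labels: Seq[String]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values
    .map(g => g.size / n)
    .map(p => -p * math.log(p) / math.log(2))
    .sum
}

// rows are (featureValue, label) pairs
def infoGain(rows: Seq[(String, String)]): Double = {
  val n = rows.size.toDouble
  val hD = entropy(rows.map(_._2))                       // H(D)
  val hDA = rows.groupBy(_._1).values                    // split D by the feature's values
    .map(g => (g.size / n) * entropy(g.map(_._2)))
    .sum                                                 // H(D|A)
  hD - hDA
}

// A feature that perfectly separates the labels has gain equal to H(D) = 1.0
println(infoGain(Seq(("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no"))))  // 1.0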

 

2. Decision Tree Generation

Starting from the root node, compute the information gain of all possible features for that node, select the feature with the largest information gain as the node's splitting feature, and create child nodes for each value of that feature; then apply the same method recursively to the child nodes to build the decision tree, until the information gain of every feature is very small or no more features can be selected. To obtain a final decision tree, a stopping condition is needed to terminate the growing process. The minimal conditions are usually: all records under the node belong to the same class, or all records have the same attribute values. These two conditions are necessary conditions for stopping the growth of the decision tree and are the weakest ones. In practice it is usually desirable to stop growing the tree earlier, for example by defining a minimum number of records that a leaf node must contain, in order to prevent over-fitting caused by excessive growth.
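To tie this back to Spark, a minimal sketch of a decision tree classifier built with spark.ml, reusing the indexers, the train/test split, and the evaluator from the logistic regression example above (only the estimator in the pipeline changes):

import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}

// Decision tree estimator on the indexed label and feature columns
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

val dtPipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
val dtPipelineModel = dtPipeline.fit(trainingData)
val dtPredictions = dtPipelineModel.transform(testData)

// Evaluate with the same MulticlassClassificationEvaluator as before
val dtAccuracy = evaluator.evaluate(dtPredictions)
println("Test Error = " + (1.0 - dtAccuracy))

// Inspect the learned tree structure
val dtModel = dtPipelineModel.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println("Learned classification tree model:\n" + dtModel.toDebugString)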
