PySpark machine learning: model evaluation (using the pyspark.ml.evaluation package)

PySpark: v3.2.1

This post introduces the pyspark.ml.evaluation package and shows how to use it.

1 Overview

The pyspark.ml.evaluation package provides the following evaluator classes.

| Class | Description |
| --- | --- |
| Evaluator | Base class for evaluators. Its _evaluate method is not implemented in this class, and all the other evaluators listed here inherit from its subclass JavaEvaluator. |
| BinaryClassificationEvaluator | Evaluator for binary classification models |
| RegressionEvaluator | Evaluator for regression models |
| MulticlassClassificationEvaluator | Evaluator for multiclass classification models |
| MultilabelClassificationEvaluator | Evaluator for multi-label classification models |
| ClusteringEvaluator | Evaluator for clustering models |
| RankingEvaluator | Evaluator for ranking (learning-to-rank) models |

2 Use cases

This part shows how to evaluate the performance of different types of models; it does not focus on how the models are trained. The data needed for training is generated with sklearn.datasets.

2.1 Classification model evaluators
2.1.1 Binary classification

Here we first introduce the parameters in the BinaryClassificationEvaluator class. The main parameters are as follows:

| Parameter | Description |
| --- | --- |
| rawPredictionCol | Name of the column that holds the raw predictions (default: rawPrediction). |
| labelCol | Name of the column that holds the true label (default: label). |
| metricName | Evaluation metric. Supported values: areaUnderROC (default) and areaUnderPR. |
| weightCol | Name of the column that holds the per-sample weights. Optional. |
| numBins | Number of bins used to down-sample the ROC and PR curves when the area is computed. A value of 0 disables down-sampling (default: 1000). |
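
To make the areaUnderROC metric in the table more concrete, here is a minimal pure-Python sketch of the textbook pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This is not Spark's implementation (Spark integrates the ROC curve, optionally down-sampled to numBins points, so its result on large data can differ slightly); the function name and the toy data are assumptions made for illustration.

```python
def area_under_roc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores (e.g. predicted probability of class 1) and true labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

auc = area_under_roc(scores, labels)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```

A value of 1.0 means every positive is scored above every negative, while 0.5 corresponds to random ranking.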

Examples are as follows:

import os
from pyspark.sql import SparkSession
from pyspark.ml.classification import *
from sklearn.datasets import *
from pyspark.ml.evaluation import *
from pyspark.ml.linalg import Vectors

os.environ['SPARK_HOME'] ='/Users/sherry/documents/spark/spark-3.2.1-bin-hadoop3.2'
spark=SparkSession.builder.appName('ml').getOrCreate()

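# Build a binary classification dataset
X, y = make_classification(n_samples=500, n_features=5, n_redundant=1,
                           n_informative=4, n_classes=2)
data_2C = [[Vectors.dense(x_item), int(y_item)] for x_item, y_item in zip(X, y)]
# Build the DataFrame
data_2C = spark.createDataFrame(data_2C, ['features', 'label'])
# Train the model
LR = LogisticRegression(featuresCol='features', labelCol='label',
                        predictionCol='prediction')
LR_model_2C = LR.fit(data_2C)
# transform() appends the rawPrediction, probability and prediction columns
data_2C = LR_model_2C.transform(data_2C)
# Evaluate the model
# Every BinaryClassificationEvaluator parameter has a default, so the defaults are used here
evaluator = BinaryClassificationEvaluator()
roc = evaluator.evaluate(data_2C)  # areaUnderROC, the default metric
print(roc)  # e.g. 0.87; the exact value varies because the data is randomly generated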

2.1.2 Multiclass classification

The parameters of the MulticlassClassificationEvaluator class are largely the same as those of BinaryClassificationEvaluator, except that it reads the predicted class from predictionCol rather than rawPredictionCol and has no numBins parameter. Its additional parameters are:

| Parameter | Description |
| --- | --- |
| metricLabel | The class for which the per-label metrics are computed: truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel. Must be greater than or equal to 0 (default: 0). |
| beta | The β value used when computing weightedFMeasure and fMeasureByLabel. Must be greater than 0 (default: 1). |
| probabilityCol | Name of the column that holds the predicted class conditional probabilities. |
| eps | Clipping threshold used by logLoss. Log-loss is undefined for p = 0 or p = 1, so probabilities are clipped to max(eps, min(1 - eps, p)). Must be in the range (0, 0.5) (default: 1e-15). |

The metricName parameter accepts the following values: f1 (default), accuracy, weightedPrecision, weightedRecall, weightedTruePositiveRate, weightedFalsePositiveRate, weightedFMeasure, truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel, logLoss, hammingLoss.
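
The per-label metrics treat the class selected by metricLabel as "positive" and every other class as "negative". The following pure-Python sketch computes them from the standard confusion-matrix definitions; it only illustrates the formulas and is not Spark's code. The function name and the toy labels are assumptions, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def by_label_metrics(labels, predictions, metric_label, beta=1.0):
    """One-vs-rest metrics for a single class, keyed by the Spark metric names."""
    tp = fp = fn = tn = 0
    for y, p in zip(labels, predictions):
        if p == metric_label and y == metric_label:
            tp += 1  # predicted the class, and it was the class
        elif p == metric_label:
            fp += 1  # predicted the class, but it was another one
        elif y == metric_label:
            fn += 1  # missed an example of the class
        else:
            tn += 1  # neither predicted nor true

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    beta_sq = beta * beta
    denom = beta_sq * precision + recall
    f_measure = (1 + beta_sq) * precision * recall / denom if denom else 0.0

    return {
        "truePositiveRateByLabel": recall,  # TPR is the same quantity as recall
        "falsePositiveRateByLabel": fpr,
        "precisionByLabel": precision,
        "recallByLabel": recall,
        "fMeasureByLabel": f_measure,
    }


# Hypothetical three-class toy data.
labels = [0, 0, 1, 1, 1, 2, 2, 2]
predictions = [0, 1, 1, 1, 2, 2, 2, 0]

metrics = by_label_metrics(labels, predictions, metric_label=1)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With beta = 1 the F-measure is the usual harmonic mean of precision and recall; a beta above 1 weights recall more heavily, and a beta below 1 weights precision more heavily.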

Examples are as follows:

# Reuses the imports and the spark session from the previous example
# Build a three-class dataset
X, y = make_classification(n_samples=500, n_features=4, n_classes=3,
                           n_informative=4, n_redundant=0)
data_3C = [[Vectors.dense(x_item), int(y_item)] for x_item, y_item in zip(X, y)]
# Build the DataFrame
data_3C = spark.createDataFrame(data_3C, ['features', 'label'])
# Train the model
LR = LogisticRegression(featuresCol='features', labelCol='label',
                        predictionCol='prediction')
LR_model_3C = LR.fit(data_3C)
data_3C = LR_model_3C.transform(data_3C)
# Evaluate the model: true positive rate for class 1
evaluator = MulticlassClassificationEvaluator(metricLabel=1.0,
                                              metricName='truePositiveRateByLabel')
tpr_label = evaluator.evaluate(data_3C)
print(tpr_label)  # e.g. 0.721; the exact value varies because the data is randomly generated

2.1.3 Multi-label classification

For background on multi-label classification, see this blog post: https://blog.csdn.net/yeshang_lady/article/details/128837245 . The MultilabelClassificationEvaluator class is still experimental, and pyspark.ml.classification has no class that can train a multi-label classifier, so at this stage Spark can only evaluate multi-label results produced elsewhere.
The metricName parameter of this class accepts: subsetAccuracy, accuracy, hammingLoss, precision, recall, f1Measure (default), precisionByLabel, recallByLabel, f1MeasureByLabel, microPrecision, microRecall, microF1Measure.
It is used as follows:

import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MultilabelClassificationEvaluator
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from skmultilearn.problem_transform import BinaryRelevance  # problem-transformation approach

X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=5, n_labels=2,
                                      allow_unlabeled=False)
classifier = BinaryRelevance(DecisionTreeClassifier(max_depth=5))
classifier.fit(X, y)
y_pred = classifier.predict(X).toarray()

# MultilabelClassificationEvaluator expects, for each row, an array of label
# indices (as doubles) in both the label and prediction columns, so the 0/1
# indicator matrices y and y_pred have to be converted.
y = np.where(y == 1)            # (row indices, label indices) of the true labels
y_pred = np.where(y_pred == 1)  # (row indices, label indices) of the predictions
data = [([], []) for _ in range(X.shape[0])]
for idx, val in enumerate(y[0]):
    data[val][0].append(float(y[1][idx]))
for idx, val in enumerate(y_pred[0]):
    data[val][1].append(float(y_pred[1][idx]))

spark = SparkSession.builder.appName('ml').getOrCreate()
data = spark.createDataFrame(data, ['label', 'prediction'])

evaluator = MultilabelClassificationEvaluator(metricName='accuracy')
print(evaluator.evaluate(data))  # e.g. 0.726; the exact value varies because the data is randomly generated
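
The multi-label metrics compare the set of true labels with the set of predicted labels for each row. The pure-Python sketch below follows the set-based definitions of subsetAccuracy, accuracy and hammingLoss as I understand them from the Spark MLlib documentation; it is an illustration, not Spark's code. The function name and the toy rows are assumptions, and scoring a row whose two sets are both empty as 0.0 is a choice made for this sketch.

```python
def multilabel_metrics(rows, num_labels):
    """Set-based multi-label metrics over (label_list, prediction_list) rows."""
    n = len(rows)
    exact_matches = 0
    jaccard_sum = 0.0
    mismatches = 0
    for label, prediction in rows:
        true_set, pred_set = set(label), set(prediction)
        union = true_set | pred_set
        # subsetAccuracy: the predicted set must equal the true set exactly
        exact_matches += int(true_set == pred_set)
        # accuracy: |intersection| / |union| for each row, then averaged
        jaccard_sum += len(true_set & pred_set) / len(union) if union else 0.0
        # hammingLoss: labels that appear in exactly one of the two sets
        mismatches += len(true_set ^ pred_set)
    return {
        "subsetAccuracy": exact_matches / n,
        "accuracy": jaccard_sum / n,
        "hammingLoss": mismatches / (n * num_labels),
    }


# Hypothetical rows in the same (label, prediction) layout used for the DataFrame.
rows = [
    ([0.0, 1.0], [0.0, 1.0]),  # exact match
    ([0.0, 2.0], [0.0]),       # one label missed
    ([1.0], [1.0, 2.0]),       # one extra label predicted
    ([2.0], [0.0]),            # completely wrong
]

result = multilabel_metrics(rows, num_labels=3)
for name, value in result.items():
    print(name, round(value, 3))
```

Here accuracy gives partial credit for overlapping label sets, whereas subsetAccuracy only rewards rows that are predicted exactly.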

Origin blog.csdn.net/yeshang_lady/article/details/127856065