PySpark machine learning: model evaluation (using the pyspark.ml.evaluation package)

PySpark: v3.2.1

This post introduces the pyspark.ml.evaluation package.

1 Overview

The pyspark.ml.evaluation package provides the following evaluator classes.

Class	Description
Evaluator	Base class for evaluators. Its _evaluate method is abstract (not implemented), and every other evaluator class inherits from its subclass JavaEvaluator.
BinaryClassificationEvaluator	Evaluator for binary classification models
RegressionEvaluator	Evaluator for regression models
MulticlassClassificationEvaluator	Evaluator for multiclass classification models
MultilabelClassificationEvaluator	Evaluator for multi-label classification models
ClusteringEvaluator	Evaluator for clustering models
RankingEvaluator	Evaluator for ranking (learning-to-rank) models
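
To make the role of the abstract _evaluate method concrete, here is a simplified pure-Python analogue of the pattern. It is only a sketch, not Spark's source: the class names SketchEvaluator and SketchAccuracyEvaluator and the method name is_larger_better are hypothetical stand-ins, and the "dataset" is just a list of (label, prediction) pairs.

```python
from abc import ABC, abstractmethod


class SketchEvaluator(ABC):
    """Hypothetical stand-in for the base class: evaluate() delegates to _evaluate()."""

    def evaluate(self, rows):
        return self._evaluate(rows)

    @abstractmethod
    def _evaluate(self, rows):
        """Concrete evaluators implement the metric here."""

    def is_larger_better(self):
        # True for metrics such as accuracy; an error metric would return False.
        return True


class SketchAccuracyEvaluator(SketchEvaluator):
    """Hypothetical subclass: fraction of rows whose prediction equals the label."""

    def _evaluate(self, rows):
        rows = list(rows)
        correct = sum(1 for label, prediction in rows if label == prediction)
        return correct / len(rows)


# Toy (label, prediction) pairs, invented for illustration.
rows = [(0, 0), (1, 1), (1, 0), (0, 0)]
accuracy = SketchAccuracyEvaluator().evaluate(rows)
print(accuracy)
```

The base class cannot be instantiated on its own, which mirrors why the concrete evaluators in the table are the ones used in practice.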

2 Use cases

This section shows how to evaluate the performance of different types of models; it does not focus on model training. The training data is generated with sklearn.datasets.

2.1 Classification Model Evaluator

2.1.1 Binary classification

We start with the parameters of the BinaryClassificationEvaluator class. The main ones are:

Parameter	Description
rawPredictionCol	Name of the column that holds the raw prediction.
labelCol	Name of the column that holds the actual label value.
metricName	The evaluation metric. Accepted values for this evaluator are areaUnderROC (default) and areaUnderPR.
weightCol	Name of the column that holds the weight of each sample. Optional.
numBins	Number of bins used to down-sample the ROC and PR curves when computing the area. A value of 0 disables down-sampling.

Examples are as follows:

import os
from pyspark.sql import SparkSession
from pyspark.ml.classification import *
from sklearn.datasets import *
from pyspark.ml.evaluation import *
from pyspark.ml.linalg import Vectors

os.environ['SPARK_HOME'] = '/Users/sherry/documents/spark/spark-3.2.1-bin-hadoop3.2'
spark = SparkSession.builder.appName('ml').getOrCreate()

# Build a binary classification dataset
X, y = make_classification(n_samples=500, n_features=5, n_redundant=1,
                           n_informative=4, n_classes=2)
data_2C = [[Vectors.dense(x_item), int(y_item)] for x_item, y_item in zip(X, y)]
# Build the DataFrame
data_2C = spark.createDataFrame(data_2C, ['features', 'label'])
# Train the model
LR = LogisticRegression(featuresCol='features', labelCol='label',
                        predictionCol='prediction')
LR_model_2C = LR.fit(data_2C)
# transform() adds the rawPrediction, probability and prediction columns
data_2C = LR_model_2C.transform(data_2C)
# Evaluate the model
# Every BinaryClassificationEvaluator parameter has a default value, so the defaults are used here
evaluator = BinaryClassificationEvaluator()
roc = evaluator.evaluate(data_2C)
print(roc)  # e.g. 0.87; the data is random, so the value varies between runs
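
To give a feel for what the areaUnderROC number means, the sketch below computes it in plain Python as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name area_under_roc and the toy scores are assumptions made for illustration. Spark derives the value from the ROC curve itself (optionally down-sampled via numBins), so its result on real data may differ slightly from this pairwise formulation.

```python
def area_under_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores and labels, invented for illustration.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = area_under_roc(scores, labels)
print(auc)
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.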

2.1.2 Multiclass classification

The parameters of the MulticlassClassificationEvaluator class are largely the same as those of BinaryClassificationEvaluator, except that there is no numBins parameter. Its additional parameters are the following:

Parameter	Description
metricLabel	The class for which the metric is computed when metricName is one of the by-label metrics: truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel. The value must be greater than or equal to 0.
beta	The $\beta$ value used when computing weightedFMeasure and fMeasureByLabel. It must be greater than 0, and the default value is 1.
probabilityCol	Name of the column that holds the predicted conditional class probabilities.
eps	Clipping threshold used for logLoss. Log-loss is undefined for p=0 or p=1, so probabilities are clipped to max(eps, min(1 - eps, p)). The value must be in the range (0, 0.5).

The accepted values of the metricName parameter are: f1 (default), accuracy, weightedPrecision, weightedRecall, weightedTruePositiveRate, weightedFalsePositiveRate, weightedFMeasure, truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel, logLoss, hammingLoss.
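
Of these metrics, logLoss is the one that reads the probability column and the eps parameter. The plain-Python sketch below shows the idea: take the predicted probability of the true class, clip it into [eps, 1 - eps], and average the negative logarithms. The function name log_loss and the toy data are assumptions for illustration, and sample weights are ignored here.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-probability of the true class, with clipping by eps."""
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 0.5)")
    total = 0.0
    for label, class_probs in zip(labels, probabilities):
        p = class_probs[label]
        # Clip so that log(0) can never occur.
        p = max(eps, min(1.0 - eps, p))
        total += -math.log(p)
    return total / len(labels)


# Toy data, invented for illustration: three samples, three classes.
labels = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [1.0, 0.0, 0.0],  # the true class (1) was given probability 0
]
loss = log_loss(labels, probabilities)
print(loss)
```

Without the clipping step, the third sample would produce an infinite loss; with it, the sample contributes a large but finite penalty.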

Examples are as follows:

# Build a three-class dataset
X, y = make_classification(n_samples=500, n_features=4, n_classes=3,
                           n_informative=4, n_redundant=0)
data_3C = [[Vectors.dense(x_item), int(y_item)] for x_item, y_item in zip(X, y)]
# Build the DataFrame
data_3C = spark.createDataFrame(data_3C, ['features', 'label'])
# Train the model
LR = LogisticRegression(featuresCol='features', labelCol='label',
                        predictionCol='prediction')
LR_model_3C = LR.fit(data_3C)
data_3C = LR_model_3C.transform(data_3C)
# Evaluate the model: true positive rate for class 1
evaluator = MulticlassClassificationEvaluator(metricLabel=1,
                                              metricName='truePositiveRateByLabel')
tpr_label = evaluator.evaluate(data_3C)
print(tpr_label)  # e.g. 0.721; the data is random, so the value varies between runs
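
As a rough illustration of how metricLabel and beta enter the by-label metrics, the sketch below counts true/false positives and negatives for one chosen class and derives the usual rates from them. The function name by_label_metrics and the toy labels are assumptions for illustration. Returning 0.0 when a denominator is zero is a choice made in this sketch and is not checked against Spark's behavior, and sample weights are ignored.

```python
def by_label_metrics(labels, predictions, metric_label, beta=1.0):
    """One-vs-rest metrics for a single class, computed from raw counts."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == metric_label and p == metric_label)
    fp = sum(1 for y, p in pairs if y != metric_label and p == metric_label)
    fn = sum(1 for y, p in pairs if y == metric_label and p != metric_label)
    tn = sum(1 for y, p in pairs if y != metric_label and p != metric_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # same as the true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0

    beta_sq = beta * beta
    denom = beta_sq * precision + recall
    f_measure = (1 + beta_sq) * precision * recall / denom if denom else 0.0
    return {
        "truePositiveRate": recall,
        "falsePositiveRate": fpr,
        "precision": precision,
        "recall": recall,
        "fMeasure": f_measure,
    }


# Toy labels and predictions, invented for illustration.
labels = [0, 1, 1, 2, 1, 0, 2, 1]
predictions = [0, 1, 2, 2, 1, 1, 2, 0]
metrics = by_label_metrics(labels, predictions, metric_label=1)
print(metrics)
```

With beta above 1 the F-measure leans toward recall, and with beta below 1 it leans toward precision.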

2.1.3 Multi-label classification

For background on multi-label classification, see the blog post: https://blog.csdn.net/yeshang_lady/article/details/128837245 . The MultilabelClassificationEvaluator class is still experimental, and pyspark.ml.classification has no class that can train a multi-label classifier, so at this stage Spark can only evaluate multi-label results produced elsewhere.
The accepted values of the metricName parameter in this class are: subsetAccuracy, accuracy, hammingLoss, precision, recall, f1Measure, precisionByLabel, recallByLabel, f1MeasureByLabel, microPrecision, microRecall, microF1Measure.
Its specific usage is as follows:

import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MultilabelClassificationEvaluator
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from skmultilearn.problem_transform import BinaryRelevance  # problem-transformation approach

X, y = make_multilabel_classification(n_samples=1000, n_features=20,
                                      n_classes=5, n_labels=2,
                                      allow_unlabeled=False)
classifier = BinaryRelevance(DecisionTreeClassifier(max_depth=5))
classifier.fit(X, y)
y_pred = classifier.predict(X).toarray()

# MultilabelClassificationEvaluator expects, for each row, an array of label
# indices (as doubles), so the 0/1 indicator matrices must be converted.
# np.where returns (row indices, column indices) of the entries equal to 1.
label_rows, label_cols = np.where(y == 1)
pred_rows, pred_cols = np.where(y_pred == 1)
data = [([], []) for _ in range(X.shape[0])]
for row, col in zip(label_rows, label_cols):
    data[row][0].append(float(col))
for row, col in zip(pred_rows, pred_cols):
    data[row][1].append(float(col))

spark = SparkSession.builder.appName('ml').getOrCreate()
data = spark.createDataFrame(data, ['label', 'prediction'])

evaluator = MultilabelClassificationEvaluator(metricName='accuracy')
print(evaluator.evaluate(data))  # e.g. 0.726; the data is random, so the value varies between runs
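
The conversion step and the accuracy metric can also be illustrated without Spark. The sketch below turns a 0/1 indicator matrix into per-row lists of label indices, then computes the multi-label accuracy as the mean of |labels ∩ predictions| / |labels ∪ predictions| over all rows, along with subsetAccuracy as the share of exact matches. The function names and the toy matrices are assumptions for illustration, and scoring a row with an empty union as 0 is a choice made in this sketch only.

```python
def indicator_to_labels(matrix):
    """Convert rows of 0/1 flags into lists of label indices (as floats)."""
    return [[float(j) for j, flag in enumerate(row) if flag == 1] for row in matrix]


def multilabel_accuracy(label_lists, prediction_lists):
    """Mean of intersection size over union size across all rows."""
    total = 0.0
    for labels, preds in zip(label_lists, prediction_lists):
        label_set, pred_set = set(labels), set(preds)
        union = label_set | pred_set
        # Assumption of this sketch: an empty union contributes 0.
        total += len(label_set & pred_set) / len(union) if union else 0.0
    return total / len(label_lists)


def subset_accuracy(label_lists, prediction_lists):
    """Fraction of rows whose predicted label set matches exactly."""
    exact = sum(1 for labels, preds in zip(label_lists, prediction_lists)
                if set(labels) == set(preds))
    return exact / len(label_lists)


# Toy indicator matrices, invented for illustration.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_hat = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
label_lists = indicator_to_labels(y_true)
prediction_lists = indicator_to_labels(y_hat)

acc = multilabel_accuracy(label_lists, prediction_lists)
subset_acc = subset_accuracy(label_lists, prediction_lists)
print(label_lists, prediction_lists, acc, subset_acc)
```

subsetAccuracy is the stricter of the two, because a row only counts when every label is predicted correctly and nothing extra is predicted.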