PySpark: v3.2.1
This post introduces the pyspark.ml.evaluation package and shows how to use it.
1 Overview
The pyspark.ml.evaluation package provides the following evaluator classes.
Class | Description |
---|---|
Evaluator | Base class for evaluators. Its _evaluate method is not implemented; all the other evaluators inherit from JavaEvaluator, a subclass of this class. |
BinaryClassificationEvaluator | Evaluator for binary classification models |
RegressionEvaluator | Evaluator for regression models |
MulticlassClassificationEvaluator | Evaluator for multiclass classification models |
MultilabelClassificationEvaluator | Evaluator for multi-label classification models |
ClusteringEvaluator | Evaluator for clustering models |
RankingEvaluator | Evaluator for ranking (learning-to-rank) models |
2 Use cases
This part shows how to evaluate the performance of different types of models; model training itself is not the focus. The training data is generated with sklearn.datasets.
2.1 Classification Model Evaluator
2.1.1 Binary classification
We start with the parameters of the BinaryClassificationEvaluator class. The main ones are:
Parameter | Description |
---|---|
rawPredictionCol | Name of the column that holds the raw prediction (scores or confidences). |
labelCol | Name of the column that holds the actual label value. |
metricName | The evaluation metric. Supported metrics for this evaluator are areaUnderROC (default) and areaUnderPR. |
weightCol | Name of the column that holds the per-sample weights. Optional. |
numBins | Number of bins used to down-sample the ROC and PR curves when computing the area (default 1000). A value of 0 disables down-sampling. |
Examples are as follows:
import os
from pyspark.sql import SparkSession
from pyspark.ml.classification import *
from sklearn.datasets import *
from pyspark.ml.evaluation import *
from pyspark.ml.linalg import Vectors
os.environ['SPARK_HOME'] ='/Users/sherry/documents/spark/spark-3.2.1-bin-hadoop3.2'
spark=SparkSession.builder.appName('ml').getOrCreate()
# Build a binary classification dataset
X,y=make_classification(n_samples=500,n_features=5,n_redundant=1,
                        n_informative=4,n_classes=2)
data_2C=[[Vectors.dense(x_item),int(y_item)] for x_item,y_item in zip(X,y)]
# Build the DataFrame
data_2C=spark.createDataFrame(data_2C,['features','label'])
# Train the model
LR=LogisticRegression(featuresCol='features',labelCol='label',
                      predictionCol='prediction')
LR_model_2C=LR.fit(data_2C)
data_2C=LR_model_2C.transform(data_2C)
# Evaluate the model
# Every BinaryClassificationEvaluator parameter has a default value, so the defaults are used here
evaluator=BinaryClassificationEvaluator()
roc=evaluator.evaluate(data_2C)
print(roc) #0.87
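To make the areaUnderROC number more concrete, here is a minimal pure-Python sketch of the quantity it measures: the trapezoidal area under the curve of true positive rate against false positive rate as the score threshold is lowered. The function name `area_under_roc` and the toy scores are made up for illustration. Spark builds the curve from binned thresholds (see numBins), so its result may differ slightly from this exact computation on large datasets.

```python
def area_under_roc(scores, labels):
    """Trapezoidal area under the ROC curve for binary labels (1 = positive)."""
    # Sort rows by descending score; each distinct score acts as one threshold.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        # Consume every row that shares this score so ties form a single ROC point.
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        # Add the trapezoid between the previous ROC point and this one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


# Hypothetical scores for the positive class and the matching true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(area_under_roc(scores, labels))  # 8 of the 9 positive/negative pairs are ranked correctly
```

A value of 1.0 means every positive row is scored above every negative row, and 0.5 corresponds to scores that carry no ranking information.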
2.1.2 Multiclass classification
The parameters of the MulticlassClassificationEvaluator class are largely the same as those of BinaryClassificationEvaluator, except that there is no numBins parameter. Its additional parameters are:
Parameter | Description |
---|---|
metricLabel | The class for which the metric is computed when metricName is one of the per-label metrics: truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel. The value must be greater than or equal to 0. |
beta | The β value used when computing weightedFMeasure and fMeasureByLabel. It must be greater than 0; the default is 1. |
probabilityCol | Name of the column that holds the predicted conditional class probabilities. |
eps | Clipping bound used when computing logLoss, which is undefined for p = 0 or p = 1: probabilities are clipped to max(eps, min(1 - eps, p)). |
The supported values for the metricName parameter are: f1, accuracy, weightedPrecision, weightedRecall, weightedTruePositiveRate, weightedFalsePositiveRate, weightedFMeasure, truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel, logLoss, hammingLoss.
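As a rough illustration of what beta and eps control, the sketch below writes out the standard F-measure and a clipped log loss in plain Python. The helper names `f_beta` and `log_loss` are hypothetical, the sketch ignores sample weights, and it assumes log loss is the mean negative log of the probability assigned to the true class; it is not Spark's implementation.

```python
import math


def f_beta(precision, recall, beta=1.0):
    """F-measure: beta > 1 favours recall, beta < 1 favours precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def log_loss(true_class_probs, eps=1e-15):
    """Mean negative log of the probability given to the true class, with clipping."""
    total = 0.0
    for p in true_class_probs:
        # Clip so that log() stays finite when p is exactly 0 or 1.
        p = max(eps, min(1 - eps, p))
        total -= math.log(p)
    return total / len(true_class_probs)


# With precision 0.5 and recall 1.0, a larger beta rewards the high recall more.
print(f_beta(0.5, 1.0, beta=1.0))
print(f_beta(0.5, 1.0, beta=2.0))

# Without clipping, the 0.0 entry would make the loss infinite.
print(log_loss([1.0, 0.0, 0.5]))
```

Because of the clipping, a single confidently wrong prediction yields a large but finite penalty instead of breaking the metric.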
Examples are as follows:
# Build a three-class dataset
X,y=make_classification(n_samples=500,n_features=4,n_classes=3,n_informative=4,n_redundant=0)
data_2C=[[Vectors.dense(x_item),int(y_item)] for x_item,y_item in zip(X,y)]
# Build the DataFrame
data_2C=spark.createDataFrame(data_2C,['features','label'])
# Train the model
LR=LogisticRegression(featuresCol='features',labelCol='label',
                      predictionCol='prediction')
LR_model_2C=LR.fit(data_2C)
data_2C=LR_model_2C.transform(data_2C)
# Evaluate the model: true positive rate for class 1
evaluator=MulticlassClassificationEvaluator(metricLabel=1,
                                            metricName='truePositiveRateByLabel')
tpr_label=evaluator.evaluate(data_2C)
print(tpr_label) #0.721
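The per-label metrics can be checked by hand on a small example. The sketch below, with hypothetical helper names and toy data, computes the true positive rate and false positive rate for one chosen class, which is the role metricLabel plays in the evaluator. It ignores sample weights and is only meant to show the definitions.

```python
def true_positive_rate_by_label(labels, predictions, metric_label):
    """Share of rows whose true class is metric_label that were predicted as metric_label."""
    preds_for_label = [p for y, p in zip(labels, predictions) if y == metric_label]
    if not preds_for_label:
        return 0.0
    return sum(1 for p in preds_for_label if p == metric_label) / len(preds_for_label)


def false_positive_rate_by_label(labels, predictions, metric_label):
    """Share of rows whose true class is NOT metric_label that were predicted as metric_label."""
    preds_for_others = [p for y, p in zip(labels, predictions) if y != metric_label]
    if not preds_for_others:
        return 0.0
    return sum(1 for p in preds_for_others if p == metric_label) / len(preds_for_others)


# Toy three-class data (made up for illustration).
labels      = [0, 1, 1, 2, 1, 0, 2, 1]
predictions = [0, 1, 2, 2, 1, 1, 2, 0]
print(true_positive_rate_by_label(labels, predictions, 1))   # 2 of the 4 class-1 rows are found
print(false_positive_rate_by_label(labels, predictions, 1))  # 1 of the 4 other rows is mislabelled as 1
```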
2.1.3 Multi-label classification
For background on multi-label classification, see this blog post: https://blog.csdn.net/yeshang_lady/article/details/128837245 . The MultilabelClassificationEvaluator class is still experimental, and pyspark.ml.classification has no class that can train a multi-label classifier, so at this stage it can only be used to evaluate the results of a multi-label classification produced elsewhere.
The supported values for the metricName parameter of this class are: subsetAccuracy, accuracy, hammingLoss, precision, recall, f1Measure, precisionByLabel, recallByLabel, f1MeasureByLabel, microPrecision, microRecall, microF1Measure.
Its specific usage is as follows:
import os
from pyspark.sql import SparkSession
from pyspark.ml.classification import *
from sklearn.datasets import *
from pyspark.ml.evaluation import *
from pyspark.ml.linalg import Vectors
import pandas as pd
import numpy as np
from skmultilearn.problem_transform import BinaryRelevance  # problem-transformation approach
from sklearn.tree import DecisionTreeClassifier
X,y=make_multilabel_classification(n_samples=1000,n_features=20,
                                   n_classes=5,n_labels=2,
                                   allow_unlabeled=False)
classifier=BinaryRelevance(DecisionTreeClassifier(max_depth=5))
classifier.fit(X,y)
y_pred=classifier.predict(X).toarray()
# MultilabelClassificationEvaluator expects each row to hold arrays of label indices,
# so y and y_pred are converted from 0/1 indicator matrices into that form
y=np.where(y==1)  # (row indices, label indices) of every active label
y_pred=np.where(y_pred==1)
data=[([],[]) for _ in range(X.shape[0])]
for idx,val in enumerate(y[0]):
    data[val][0].append(float(y[1][idx]))
for idx,val in enumerate(y_pred[0]):
    data[val][1].append(float(y_pred[1][idx]))
spark=SparkSession.builder.appName('ml').getOrCreate()
data=spark.createDataFrame(data,['label','prediction'])
evaluator=MultilabelClassificationEvaluator(metricName='accuracy')
evaluator.evaluate(data) #0.726
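The reshaping step and the accuracy metric used above can be illustrated without Spark. The sketch below converts small 0/1 indicator matrices into per-row lists of label indices, then computes multi-label accuracy as the mean over rows of |labels ∩ predictions| / |labels ∪ predictions|. The helper names and the toy matrices are hypothetical, and treating a row with no labels and no predictions as contributing 0 is an assumption of this sketch (the example above uses allow_unlabeled=False, so that case does not arise there).

```python
def indicator_to_label_lists(matrix):
    """Turn each 0/1 indicator row into a list of the active label indices, as floats."""
    return [[float(j) for j, flag in enumerate(row) if flag == 1] for row in matrix]


def multilabel_accuracy(label_lists, prediction_lists):
    """Mean over rows of |labels ∩ predictions| / |labels ∪ predictions|."""
    total = 0.0
    for row_labels, row_preds in zip(label_lists, prediction_lists):
        true_set, pred_set = set(row_labels), set(row_preds)
        union = true_set | pred_set
        if union:  # assumption: an empty row pair contributes 0
            total += len(true_set & pred_set) / len(union)
    return total / len(label_lists)


# Toy indicator matrices: 3 samples, 3 possible labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_hat  = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]

label_lists = indicator_to_label_lists(y_true)
pred_lists = indicator_to_label_lists(y_hat)
print(label_lists)  # [[0.0, 2.0], [1.0], [0.0, 1.0]]
print(pred_lists)   # [[0.0], [1.0], [1.0, 2.0]]
print(multilabel_accuracy(label_lists, pred_lists))  # (1/2 + 1 + 1/3) / 3
```

Pairing `label_lists` and `pred_lists` row by row gives the same (label, prediction) shape that is passed to `spark.createDataFrame` in the example above.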