SparkML end-to-end workflow: LogisticRegression
2.1 Load the data
2.2 Create the transformers and the estimator
    One-hot encode BIRTH_PLACE
    Create an estimator
    VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type
2.3 Create a pipeline and fit the model
    Split the data into training and test sets with randomSplit, then fit the model
2.4 Evaluate the model
2.5 Save the model and the pipeline
    Load the model
# 2.1 Load the data
import pyspark.sql.types as typ

labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.IntegerType()),
    ('DIABETES_GEST', typ.IntegerType()),
    ('HYP_TENS_PRE', typ.IntegerType()),
    ('HYP_TENS_GEST', typ.IntegerType()),
    ('PREV_BIRTH_PRETERM', typ.IntegerType())
]
schema = typ.StructType([
    typ.StructField(e[0], e[1], False) for e in labels
])
births = spark.read.csv('births_transformed.csv.gz',
                        header=True,
                        schema=schema)
# Here we specify the DataFrame's schema, limiting the dataset to 17 columns
Create transformers
import pyspark.ml.feature as ft

# Cast BIRTH_PLACE to an integer so it can be one-hot encoded
births = births.withColumn('BIRTH_PLACE_INT',
                           births['BIRTH_PLACE'].cast(typ.IntegerType()))
# One-hot encode BIRTH_PLACE
encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT',
outputCol='BIRTH_PLACE_VEC')
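The encoder's output is worth pausing on: Spark's OneHotEncoder drops the last category by default (dropLast=True), so a column with N distinct indices becomes a vector of length N-1, with the last category represented as all zeros. A pure-Python sketch of that semantics (an illustration, not the Spark API itself):

```python
# Sketch of Spark's one-hot encoding semantics (not the Spark implementation)
def one_hot(index, num_categories, drop_last=True):
    # With drop_last (Spark's default), the last category maps to all zeros
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

print(one_hot(0, 3))         # [1.0, 0.0]
print(one_hot(2, 3))         # [0.0, 0.0] -- last category dropped
print(one_hot(1, 3, False))  # [0.0, 1.0, 0.0]
```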
# Create a single column that assembles all the features together
featuresCreator = ft.VectorAssembler(
inputCols=[col[0] for col in labels[2:]] + [encoder.getOutputCol()],
outputCol='features'
)
# VectorAssembler(inputCols=None, outputCol=None, handleInvalid='error')
# A feature transformer that merges multiple columns into a single vector column.
# VectorAssembler accepts the following input column types: all numeric types,
# boolean type, and vector type. In each row, the values of the input columns
# are concatenated into a vector in the specified order.
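To make the inputCols expression concrete, here is a pure-Python sketch with a shortened, hypothetical labels list, showing why the slice starts at labels[2:] (it skips the label column and the raw BIRTH_PLACE string, whose encoded vector is appended instead):

```python
# Hypothetical shortened labels list; types omitted since only names matter here
labels = [
    ('INFANT_ALIVE_AT_REPORT', None),  # label column, excluded from features
    ('BIRTH_PLACE', None),             # raw string, replaced by its encoded vector
    ('MOTHER_AGE_YEARS', None),
    ('FATHER_COMBINED_AGE', None),
]

# The same expression used in featuresCreator above
input_cols = [col[0] for col in labels[2:]] + ['BIRTH_PLACE_VEC']
print(input_cols)  # ['MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE', 'BIRTH_PLACE_VEC']
```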
Create an estimator
In this example we will (once again) use the Logistic Regression model.
import pyspark.ml.classification as cl
Once loaded, let's create the model.
logistic = cl.LogisticRegression(
    maxIter=10,
    regParam=0.01,
    labelCol='INFANT_ALIVE_AT_REPORT')
# featuresCol is left at its default, 'features', which matches
# featuresCreator's outputCol
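Under the hood, logistic regression maps a linear score to a probability with the sigmoid function; regParam=0.01 adds regularization on the weights (with elasticNetParam at its default of 0.0, that penalty is pure L2). A minimal sketch of the sigmoid:

```python
import math

# Logistic regression converts a linear score z = w·x + b into a probability
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5: a zero score means maximum uncertainty
```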
Create a pipeline
All that is left now is to create a Pipeline and fit the model. First, let's load the Pipeline from the package.
# 2.3 Create a pipeline and fit the model
# Earlier we created the transformers and the estimator; now we can chain them
# together with a pipeline and fit the model conveniently.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[
encoder,
featuresCreator,
logistic
])
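What Pipeline.fit does can be sketched in plain Python (this illustrates the contract only, not Spark's implementation): each stage is fitted first if it is an estimator, and the resulting transformer is applied to the data before the next stage sees it.

```python
# Pure-Python sketch of the Pipeline contract (not Spark's implementation)
class Shift:                       # stands in for a transformer
    def transform(self, data):
        return [x + 1 for x in data]

class MeanCenterer:                # stands in for an estimator
    class Model:
        def __init__(self, mean):
            self.mean = mean
        def transform(self, data):
            return [x - self.mean for x in data]
    def fit(self, data):
        return MeanCenterer.Model(sum(data) / len(data))

def fit_pipeline(stages, data):
    fitted = []
    for stage in stages:
        # Estimators are fitted first; the fitted result is a transformer
        if hasattr(stage, 'fit'):
            stage = stage.fit(data)
        data = stage.transform(data)
        fitted.append(stage)
    return fitted  # analogous to a PipelineModel

fitted_stages = fit_pipeline([Shift(), MeanCenterer()], [0, 1, 2])
```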
Fit the model
Conveniently, the DataFrame API has the .randomSplit(...) method.
# Split the data into a training set (70%) and a test set (30%)
birth_train, birth_test = births.randomSplit([0.7, 0.3], seed=123)
# Fit the pipeline on the training set and transform the test set
model = pipeline.fit(birth_train)
test_model = model.transform(birth_test)
# Here's what the test_model looks like
test_model.take(1)
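Spark's .randomSplit(...) assigns each row independently (Bernoulli-style sampling), so the 70/30 fractions are approximate, and the same seed reproduces the same split. A pure-Python sketch of that behavior (an illustration, not Spark's implementation):

```python
import random

def random_split(rows, weights, seed):
    # Normalize the weights into cumulative bounds in [0, 1]
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    # Each row is assigned independently, so split sizes are approximate
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, bound in enumerate(bounds):
            if r <= bound:
                parts[i].append(row)
                break
    return parts

train, test = random_split(list(range(100)), [0.7, 0.3], seed=123)
print(len(train), len(test))  # roughly 70/30; exact counts vary with the seed
```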
Model performance
Obviously, we would now like to test how well our model did.
# 2.4 Evaluate the model
# Earlier we split the data in two, fitted the pipeline on the training set,
# and transformed the test set.
# Now we can use the test-set results to evaluate how well the model fits.
# Evaluate model performance
import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(
rawPredictionCol='probability',
labelCol='INFANT_ALIVE_AT_REPORT'
)
print(evaluator.evaluate(test_model, {evaluator.metricName:'areaUnderROC'}))
print(evaluator.evaluate(test_model, {evaluator.metricName:'areaUnderPR'}))
0.7187355793173213
0.6819691176245866
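The areaUnderROC metric above has a useful interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counted as half). A brute-force sketch on toy data (the scores here are hypothetical, not the model's output):

```python
# Brute-force AUC-ROC via its rank interpretation (toy illustration)
def area_under_roc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive-vs-negative pairs ranked correctly; ties score 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```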
Saving the model
PySpark allows you to save the Pipeline definition for later use.
# 2.5 Save the model; save the pipeline
# PySpark can save not only the trained model but also the pipeline structure,
# including the definitions of all its transformers and estimators
# Save the pipeline
pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline'
pipeline.write().overwrite().save(pipelinePath)
# Reload the pipeline and run it again
loadedPipeline = Pipeline.load(pipelinePath)
loadedPipeline.fit(birth_train).transform(birth_test).take(1)
[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', ...]
# Save the fitted model
from pyspark.ml import PipelineModel
modelPath = './infant_oneHotEncoder_LogisticPipelineModel'
model.write().overwrite().save(modelPath)
# Load the model
loadedPipelineModel = PipelineModel.load(modelPath)
test_reloadedModel = loadedPipelineModel.transform(birth_test)
test_reloadedModel.take(1)
[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=12, ...]
So, you can load it up later and use it straight away to .fit(...) and predict.
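The distinction above, that a saved Pipeline holds only the stage definitions while a saved PipelineModel also holds the learned parameters, can be sketched with plain Python serialization (an illustration only; the stage name, hyperparameters, and coefficient values below are hypothetical, and this is not Spark's on-disk format):

```python
import json

# A 'pipeline definition' holds only hyperparameters...
pipeline_def = {'stage': 'LogisticRegression', 'maxIter': 10, 'regParam': 0.01}

# ...while a 'fitted model' additionally holds learned parameters
fitted_model = dict(pipeline_def, coefficients=[0.4, -1.2], intercept=0.1)

# Round-trip the fitted model through serialization
reloaded = json.loads(json.dumps(fitted_model))
print(reloaded['coefficients'])  # learned weights survive the round trip
```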