SparkML End-to-End Walkthrough: LogisticRegression (Part 1)

Contents
2.1 Load the data
2.2 Create transformers and an estimator
One-hot encode BIRTH_PLACE
Create an estimator
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type
2.3 Create a pipeline and fit the model
Fit the model; use randomSplit to create training and test sets
2.4 Evaluate the model
2.5 Save the model and the pipeline
Load the model back

# 2.1 Load the data
import pyspark.sql.types as typ
labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.IntegerType()),
    ('DIABETES_GEST', typ.IntegerType()),
    ('HYP_TENS_PRE', typ.IntegerType()),
    ('HYP_TENS_GEST', typ.IntegerType()),
    ('PREV_BIRTH_PRETERM', typ.IntegerType())
]
schema = typ.StructType([
    typ.StructField(e[0], e[1], False) for e in labels
])
births = spark.read.csv('births_transformed.csv.gz', 
                        header=True, 
                        schema=schema)
# Here we specify the DataFrame schema up front, restricting the dataset to these 17 columns.
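A note on setup: the snippet above assumes an active SparkSession bound to the name spark, as provided by the pyspark shell and most notebooks. If you run this as a standalone script, a minimal sketch for creating one yourself could look like this (the app name is an arbitrary placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('infant_lr_walkthrough') \
    .getOrCreate()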
# 2.2 Create transformers and an estimator
import pyspark.ml.feature as ft

# BIRTH_PLACE is read in as a string; cast it to an integer so it can be encoded
births = births.withColumn('BIRTH_PLACE_INT',
                           births['BIRTH_PLACE'].cast(typ.IntegerType()))

Having done this, we can now create our first Transformer and one-hot encode the birth place.

encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE_INT',
                           outputCol='BIRTH_PLACE_VEC')

# Create a single column that assembles all the features together.
# labels[2:] skips the label column and the raw BIRTH_PLACE string;
# the encoded vector is appended instead.
featuresCreator = ft.VectorAssembler(
    inputCols=[col[0] for col in labels[2:]] + [encoder.getOutputCol()],
    outputCol='features'
)
# VectorAssembler(inputCols=None, outputCol=None, handleInvalid='error')
# A feature transformer that merges multiple columns into a single vector column.
# VectorAssembler accepts the following input column types: all numeric types,
# boolean type, and vector type. In each row, the values of the input columns
# are concatenated into a vector in the specified order.
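To make the "concatenated in the specified order" behavior concrete, here is a minimal, self-contained sketch on toy data (the toy column names are made up for illustration):

from pyspark.ml.feature import VectorAssembler

toy = spark.createDataFrame(
    [(1, 0.0, 3.5), (2, 1.0, 7.2)],
    ['id', 'flag', 'score'])

assembled = VectorAssembler(
    inputCols=['flag', 'score'],
    outputCol='features').transform(toy)

assembled.show(truncate=False)
# The features column holds [0.0,3.5] and [1.0,7.2]:
# the input values, concatenated in the order given by inputCols.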
# Create an estimator
In this example we will (once again) use the Logistic Regression model.

import pyspark.ml.classification as cl

logistic = cl.LogisticRegression(maxIter=10,
                                 regParam=0.01,
                                 featuresCol=featuresCreator.getOutputCol(),
                                 labelCol='INFANT_ALIVE_AT_REPORT')
# Note: featuresCol defaults to 'features', which is exactly what the
# VectorAssembler above writes to, so the parameter could be omitted here.
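If you want to double-check what the estimator is configured with, the Params API can print every parameter with its documentation and current value; a one-liner sketch:

print(logistic.explainParams())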
# 2.3 Create a pipeline and fit the model
All that is left now is to create a Pipeline and fit the model. We have already created the transformers and the estimator; the pipeline chains them together so the model can be fitted conveniently. First, let's load the Pipeline from the package.

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
        encoder,
        featuresCreator,
        logistic
    ])

# Fit the model: use randomSplit to split off training and test sets
Conveniently, the DataFrame API has the .randomSplit(...) method.

birth_train, birth_test = births.randomSplit([0.7, 0.3], seed=123)

model = pipeline.fit(birth_train)
test_model = model.transform(birth_test)

Here's what the test_model looks like.

test_model.take(1)
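Beyond .take(1), a quick way to eyeball the predictions is to select just the label and the columns the model appended; a small sketch, assuming the test_model above:

test_model.select('INFANT_ALIVE_AT_REPORT',
                  'probability',
                  'prediction').show(5, truncate=False)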

# 2.4 Evaluate the model
Obviously, we would now like to test how well our model did. Earlier we split the data in two and used the pipeline to fit the training set and transform the test set; now we can use the test-set results to evaluate the model's performance.

import pyspark.ml.evaluation as ev

evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability',
    labelCol='INFANT_ALIVE_AT_REPORT'
)

print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderPR'}))

0.7187355793173213
0.6819691176245866
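BinaryClassificationEvaluator only reports areaUnderROC and areaUnderPR. If you also want accuracy, a sketch of one common approach is to score the hard 'prediction' column with MulticlassClassificationEvaluator, which handles the binary case as well (import repeated for self-containment):

import pyspark.ml.evaluation as ev

acc_evaluator = ev.MulticlassClassificationEvaluator(
    predictionCol='prediction',
    labelCol='INFANT_ALIVE_AT_REPORT',
    metricName='accuracy')

print(acc_evaluator.evaluate(test_model))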
# 2.5 Save the model and the pipeline
PySpark allows you to save not only the trained model for later use, but also the Pipeline definition with all its transformers and estimators.

# Save the pipeline
pipelinePath = './infant_oneHotEncoder_Logistic_Pipeline'
pipeline.write().overwrite().save(pipelinePath)

# Reload the pipeline and run it again
loadedPipeline = Pipeline.load(pipelinePath)
loadedPipeline.fit(birth_train).transform(birth_test).take(1)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', ...]

# Save the fitted model
from pyspark.ml import PipelineModel

modelPath = './infant_oneHotEncoder_LogisticPipelineModel'
model.write().overwrite().save(modelPath)

# Load the model back
loadedPipelineModel = PipelineModel.load(modelPath)
test_reloadedModel = loadedPipelineModel.transform(birth_test)
test_reloadedModel.take(1)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE='1', MOTHER_AGE_YEARS=12, ...]

So, you can load it up later and use it straight away to .fit(...) and predict.
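One more thing worth knowing: a loaded Pipeline still needs .fit(...), while a loaded PipelineModel is already fitted and can .transform(...) right away. Its fitted stages are also accessible directly; a small sketch, assuming the objects above:

# The last stage of the fitted pipeline is the LogisticRegressionModel itself
lr_model = loadedPipelineModel.stages[-1]

print(lr_model.intercept)
print(lr_model.coefficients)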
Reposted from blog.csdn.net/wj1298250240/article/details/103947611