Spark ML hands-on end-to-end workflow: LogisticRegression (Part 3)

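The earlier example used K-Fold cross-validation for hyperparameter tuning. K-Fold cross-validation is often very time-consuming, because every parameter combination is trained once per fold. A train-validation split (loosely, "1-Fold" cross-validation: the dataset is split once, by ratio, into a training set and a validation set) trains each combination only once and can shorten the tuning time considerably, at the cost of a noisier estimate of model quality.
Reference: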
https://www.jianshu.com/p/20456b512fa7
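
To make the time saving concrete, here is a minimal sketch that counts how many model fits each strategy needs. The grid size of 4 and the fold count of 3 are hypothetical values chosen for illustration; they are not taken from the earlier parts of this series. The extra fit accounts for retraining the best parameter combination on the full training data after the search.

```python
def count_model_fits(grid_size: int, num_folds: int) -> int:
    """Fits during the search (one per parameter combination per fold),
    plus one final refit of the best combination on all training data."""
    return grid_size * num_folds + 1


# Hypothetical grid: e.g. 2 candidate values for each of 2 hyperparameters.
grid_size = 4

kfold_fits = count_model_fits(grid_size, num_folds=3)  # K-Fold with K = 3
tvs_fits = count_model_fits(grid_size, num_folds=1)    # single train-validation split

print("K-Fold fits:", kfold_fits)
print("Train-validation split fits:", tvs_fits)
```

The gap widens with larger grids and more folds, which is why a single split is attractive on large datasets.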

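# Module aliases used below. The objects encoder, featuresCreator, grid,
# evaluator, births_train and births_test are assumed to come from the
# earlier parts of this series.
import pyspark.ml.feature as ft
import pyspark.ml.classification as cl
import pyspark.ml.tuning as tune
from pyspark.ml import Pipeline

# Use ChiSqSelector to keep the top 5 features, reducing model complexity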

selector = ft.ChiSqSelector(
    numTopFeatures=5, 
    featuresCol=featuresCreator.getOutputCol(), 
    outputCol='selectedFeatures',
    labelCol='INFANT_ALIVE_AT_REPORT'
)
# Create the transformers, the estimator, and the pipeline

logistic = cl.LogisticRegression(
    labelCol='INFANT_ALIVE_AT_REPORT',
    featuresCol='selectedFeatures'
)

pipeline = Pipeline(stages=[encoder, featuresCreator, selector])
data_transformer = pipeline.fit(births_train)


tvs = tune.TrainValidationSplit(
    estimator=logistic, 
    estimatorParamMaps=grid, 
    evaluator=evaluator
)
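
Because the call above does not set `trainRatio`, `TrainValidationSplit` falls back to its default of 0.75, so roughly three quarters of the rows are used for training and the rest for validation. The following is a plain-Python sketch of that idea, not Spark's implementation; the function name `train_validation_split` and the seed are made up for illustration.

```python
import random


def train_validation_split(rows, train_ratio=0.75, seed=42):
    """Assign each row to train or validation using one uniform random draw."""
    rng = random.Random(seed)
    train, validation = [], []
    for row in rows:
        # A draw below train_ratio sends the row to the training set.
        if rng.random() < train_ratio:
            train.append(row)
        else:
            validation.append(row)
    return train, validation


rows = list(range(1000))  # stand-in for the rows of a DataFrame
train, validation = train_validation_split(rows)
print(len(train), len(validation))
```

The split sizes are only approximately 75/25 because each row is assigned independently at random.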


# Train the model: fit every parameter combination on the transformed training data
tvsModel = tvs.fit(data_transformer.transform(births_train))

# Transform the held-out test set and score it with the best model
data_test = data_transformer.transform(births_test)
results = tvsModel.transform(data_test)

print(evaluator.evaluate(results, 
     {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(results, 
     {evaluator.metricName: 'areaUnderPR'}))

0.6111344483529891
0.5735913338089571
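
For reading the first number: the area under the ROC curve can be interpreted as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below computes that pairwise interpretation on made-up toy data; it is not how Spark's evaluator computes the metric internally, and the labels and scores are invented for illustration.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = area_under_roc(labels, scores)
print(auc)
```

By this reading, the areaUnderROC of about 0.61 reported above is better than random ranking but still far from a strong classifier.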
Reposted from blog.csdn.net/wj1298250240/article/details/103947847