Data download:
The data is the Boston housing price prediction dataset from Kaggle: https://www.kaggle.com/c/boston-housing/data
It can also be downloaded here: https://github.com/ffzs/dataset/tree/master/boston
Data preparation:
Feature descriptions:
CRIM - per-capita crime rate by town.
ZN - proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS - proportion of non-retail business acres per town.
CHAS - Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise).
NOX - nitric oxides concentration (parts per 10 million).
RM - average number of rooms per dwelling.
AGE - proportion of owner-occupied units built prior to 1940.
DIS - weighted distances to five Boston employment centres.
RAD - index of accessibility to radial highways.
TAX - full-value property-tax rate per $10,000.
PTRATIO - pupil-teacher ratio by town.
BLACK - 1000(Bk - 0.63)², where Bk is the proportion of Black residents by town.
LSTAT - % lower status of the population.
MEDV - median value of owner-occupied homes in $1000s. This is the target variable.
The dataset comes in two parts, train and test. The test part has no medv values, so they are filled with a placeholder of 22.77, and then train and test are concatenated to make model evaluation convenient.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('learn_regression').master('local[1]').getOrCreate()
# load the data
df_train = spark.read.csv('file:///home/ffzs/python-projects/learn_spark/boston/train.csv', header=True, inferSchema=True, encoding='utf-8')
df_test = spark.read.csv('file:///home/ffzs/python-projects/learn_spark/boston/test.csv', header=True, inferSchema=True, encoding='utf-8')
# merge the two tables
from pyspark.sql.functions import lit
df_test = df_test.withColumn('medv', lit(22.77))
df0 = df_train.union(df_test).sort('ID')
df0.show(3)
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
| ID| crim| zn|indus|chas| nox| rm| age| dis|rad|tax|ptratio| black|lstat| medv|
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
| 1|0.00632|18.0| 2.31| 0|0.538|6.575|65.2| 4.09| 1|296| 15.3| 396.9| 4.98| 24.0|
| 2|0.02731| 0.0| 7.07| 0|0.469|6.421|78.9|4.9671| 2|242| 17.8| 396.9| 9.14| 21.6|
| 3|0.02729| 0.0| 7.07| 0|0.469|7.185|61.1|4.9671| 2|242| 17.8|392.83| 4.03|22.77|
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+-----+
Use VectorAssembler from the feature module to combine the feature columns into a single vector, then split the data 7:3 into a training set and a test set:
from pyspark.ml.feature import VectorAssembler
def feature_converter(df):
    # assemble all columns between ID and medv into one 'features' vector
    vecAss = VectorAssembler(inputCols=df.columns[1:-1], outputCol='features')
    df_va = vecAss.transform(df)
    return df_va
train_data, test_data = feature_converter(df0).select(['features', 'medv']).randomSplit([7.0, 3.0], 101)
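What VectorAssembler does can be sketched in plain Python: each row's chosen input columns are packed into one feature vector (the real transformer emits a Spark Vector column named by outputCol). A toy sketch using two rows from the table above:

```python
# Toy sketch of VectorAssembler: concatenate the chosen input columns
# of each row into a single feature vector.
rows = [
    {'ID': 1, 'crim': 0.00632, 'zn': 18.0, 'medv': 24.0},
    {'ID': 2, 'crim': 0.02731, 'zn': 0.0,  'medv': 21.6},
]
input_cols = ['crim', 'zn']   # stands in for df0.columns[1:-1]
assembled = [dict(r, features=[r[c] for c in input_cols]) for r in rows]
print(assembled[0]['features'])  # [0.00632, 18.0]
```

Note that `randomSplit([7.0, 3.0], 101)` normalizes the weights to fractions, so this is the same as splitting with [0.7, 0.3].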
Decision Tree Regression
pyspark.ml.regression.DecisionTreeRegressor(featuresCol='features', labelCol='label', predictionCol='prediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='variance', seed=None, varianceCol=None)
Selected parameters:
fit(dataset, params=None) method
impurity: criterion for information-gain calculation; the only supported option is variance
maxBins: maximum number of bins for discretizing continuous features; must be >= 2 and >= the number of categories of any categorical feature
maxDepth: maximum tree depth
minInfoGain: minimum information gain required to split a node
minInstancesPerNode: minimum number of instances each child must have after a split
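For regression trees, the variance impurity means a split's information gain is the parent node's label variance minus the size-weighted variances of the two children; minInfoGain is a threshold on this quantity. A toy sketch:

```python
# Variance impurity for regression trees: information gain of a split is
# parent variance minus the size-weighted variances of the children.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def info_gain(parent, left, right):
    n = len(parent)
    return (variance(parent)
            - len(left) / n * variance(left)
            - len(right) / n * variance(right))

print(info_gain([1.0, 1.0, 5.0, 5.0], [1.0, 1.0], [5.0, 5.0]))  # 4.0
```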
Train the model:
from pyspark.ml.regression import DecisionTreeRegressor
dt = DecisionTreeRegressor(maxDepth=5, varianceCol="variance", labelCol='medv')
dt_model = dt.fit(train_data)
dt_model.featureImportances
SparseVector(13, {0: 0.0503, 2: 0.011, 4: 0.0622, 5: 0.1441, 6: 0.1852, 7: 0.0262, 8: 0.0022, 9: 0.0886, 10: 0.0142, 12: 0.4159})
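featureImportances is a SparseVector indexed over the 13 assembled input columns, so pairing it with df0.columns[1:-1] makes it readable. A sketch using the values printed above (copied by hand):

```python
# SparseVector importances printed above, as an index -> value dict
importances = {0: 0.0503, 2: 0.011, 4: 0.0622, 5: 0.1441, 6: 0.1852,
               7: 0.0262, 8: 0.0022, 9: 0.0886, 10: 0.0142, 12: 0.4159}
# the assembler's input columns, in order (df0.columns[1:-1])
cols = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
        'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']
ranked = sorted(((cols[i], v) for i, v in importances.items()),
                key=lambda kv: kv[1], reverse=True)
print(ranked[:3])  # lstat, age and rm carry most of the signal
```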
Test:
result = dt_model.transform(test_data)
result.show(3)
+--------------------+-----+------------------+------------------+
| features| medv| prediction| variance|
+--------------------+-----+------------------+------------------+
|[0.03237,0.0,2.18...| 33.4| 34.12833333333334|29.509013888888756|
|[0.08829,12.5,7.8...| 22.9|21.195135135135136| 4.446162819576342|
|[0.14455,12.5,7.8...|22.77|22.425999999999995|0.5578440000003866|
+--------------------+-----+------------------+------------------+
only showing top 3 rows
Model evaluation:
from pyspark.ml.evaluation import RegressionEvaluator
dt_evaluator = RegressionEvaluator(labelCol='medv', metricName="rmse", predictionCol='prediction')
rmse = dt_evaluator.evaluate(result)
print('RMSE on the test data: {}'.format(rmse))
RMSE on the test data: 6.555920141221407
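The rmse metric is simply the square root of the mean squared residual. A pure-Python check of the definition, using the three rows displayed above:

```python
import math

def rmse(labels, preds):
    # root mean squared error: sqrt of the mean of squared residuals
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds))
                     / len(labels))

# the three rows shown above (medv vs. prediction)
labels = [33.4, 22.9, 22.77]
preds = [34.12833333333334, 21.195135135135136, 22.425999999999995]
print(round(rmse(labels, preds), 4))
```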
Gradient-Boosted Tree Regression
pyspark.ml.regression.GBTRegressor(featuresCol='features', labelCol='label', predictionCol='prediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, subsamplingRate=1.0, checkpointInterval=10, lossType='squared', maxIter=20, stepSize=0.1, seed=None, impurity='variance')
Selected parameters:
fit(dataset, params=None) method
lossType: loss function GBT minimizes; options: squared, absolute
maxIter: maximum number of iterations
stepSize: step size (learning rate) for each optimization iteration
subsamplingRate: fraction of the training data used to learn each tree, in (0, 1]
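The GBT prediction is an additive ensemble: each iteration fits a weak learner to the current residuals and adds its output scaled by stepSize. A toy sketch under squared loss, using a constant (the residual mean) as a stand-in for the regression tree Spark actually fits:

```python
# Toy gradient boosting under squared loss: each round fits a "learner"
# to the residuals and adds it, scaled by step_size. Real GBT fits a
# regression tree per round; here the learner is just the residual mean.
def gbt_fit_predict(y, max_iter=10, step_size=0.1):
    pred = [0.0] * len(y)
    for _ in range(max_iter):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        c = sum(resid) / len(resid)   # best constant under squared loss
        pred = [pi + step_size * c for pi in pred]
    return pred

print(gbt_fit_predict([10.0, 10.0], max_iter=100))  # approaches [10.0, 10.0]
```

This also shows why maxIter and stepSize trade off: a smaller step size needs more iterations to reach the same fit.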
Train the model:
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(maxIter=10, labelCol='medv', maxDepth=3)
gbt_model = gbt.fit(train_data)
Test:
result = gbt_model.transform(test_data)
result.show(3)
+--------------------+-----+------------------+
| features| medv| prediction|
+--------------------+-----+------------------+
|[0.03237,0.0,2.18...| 33.4| 31.98716729056085|
|[0.08829,12.5,7.8...| 22.9|22.254258637918248|
|[0.14455,12.5,7.8...|22.77|20.066468254729102|
+--------------------+-----+------------------+
only showing top 3 rows
Model evaluation:
from pyspark.ml.evaluation import RegressionEvaluator
gbt_evaluator = RegressionEvaluator(labelCol='medv', metricName="rmse", predictionCol='prediction')
rmse = gbt_evaluator.evaluate(result)
print('RMSE on the test data: {}'.format(rmse))
RMSE on the test data: 5.624145397622545
Random Forest Regression
pyspark.ml.regression.RandomForestRegressor(featuresCol='features', labelCol='label', predictionCol='prediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='variance', subsamplingRate=1.0, seed=None, numTrees=20, featureSubsetStrategy='auto')
Selected parameters:
fit(dataset, params=None) method
featureSubsetStrategy: number of features to consider for splits at each tree node; options: auto, all, onethird, sqrt, log2, a fraction in (0.0, 1.0], or an integer in [1, n]
impurity: criterion for information-gain calculation; the only supported option is variance
maxBins: maximum number of bins for discretizing continuous features
maxDepth: maximum tree depth
minInfoGain: minimum information gain required to split a node
minInstancesPerNode: minimum number of instances each child must have after a split
numTrees: number of trees to train
subsamplingRate: fraction of the training data used to learn each tree
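For this dataset's 13 features, the named strategies work out as below (Spark rounds fractional counts up; for regression, 'auto' behaves like 'onethird' when numTrees > 1 — treat that mapping as an assumption to verify against your Spark version's docs):

```python
import math

# Features considered per split for each strategy, with n = 13 features.
# Assumes Spark's ceil-rounding of fractional counts.
n = 13
per_split = {
    'all': n,
    'sqrt': math.ceil(math.sqrt(n)),   # ceil(3.61) = 4
    'log2': math.ceil(math.log2(n)),   # ceil(3.70) = 4
    'onethird': math.ceil(n / 3),      # ceil(4.33) = 5
}
print(per_split)
```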
Train the model:
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(numTrees=10, maxDepth=5, seed=101, labelCol='medv')
rf_model = rf.fit(train_data)
Test:
result = rf_model.transform(test_data)
result.show(3)
+--------------------+-----+------------------+
| features| medv| prediction|
+--------------------+-----+------------------+
|[0.03237,0.0,2.18...| 33.4| 30.12804440796982|
|[0.08829,12.5,7.8...| 22.9|21.338106353716338|
|[0.14455,12.5,7.8...|22.77|19.764914032872827|
+--------------------+-----+------------------+
only showing top 3 rows
Model evaluation:
from pyspark.ml.evaluation import RegressionEvaluator
rf_evaluator = RegressionEvaluator(labelCol='medv', metricName="rmse", predictionCol='prediction')
rmse = rf_evaluator.evaluate(result)
print('RMSE on the test data: {}'.format(rmse))
RMSE on the test data: 5.268739233773331
Linear Regression
pyspark.ml.regression.LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-06, fitIntercept=True, standardization=True, solver='auto', weightCol=None, aggregationDepth=2, loss='squaredError', epsilon=1.35)
The learning objective is to minimize the specified loss function with regularization. Two loss functions are supported:
- squaredError (a.k.a. squared loss)
- huber (a hybrid of squared error for relatively small errors and absolute error for relatively large ones; the scale parameter is estimated from the training data)
Several kinds of regularization are supported:
- none: ordinary least squares (OLS)
- L2: ridge regression
- L1: lasso regression
- L1 + L2: elastic net
Note: fitting with huber loss only supports none and L2 regularization.
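The elastic-net penalty combines the two norms as regParam * (α·‖w‖₁ + (1 − α)/2·‖w‖₂²), where α is elasticNetParam; α = 0 gives pure L2 (ridge) and α = 1 pure L1 (lasso). A sketch of the penalty term for the settings used below (regParam=0.3, elasticNetParam=0.8):

```python
# Elastic-net penalty: reg_param * (alpha*||w||_1 + (1-alpha)/2 * ||w||_2^2)
def elastic_net_penalty(w, reg_param, alpha):
    l1 = sum(abs(x) for x in w)
    l2_sq = sum(x * x for x in w)
    return reg_param * (alpha * l1 + (1 - alpha) / 2.0 * l2_sq)

# toy weight vector, settings matching the example below
print(round(elastic_net_penalty([1.0, -2.0], reg_param=0.3, alpha=0.8), 2))  # 0.87
```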
Selected parameters:
aggregationDepth: depth for tree aggregation, >= 2
elasticNetParam: ElasticNet mixing parameter in [0, 1]; alpha = 0 is an L2 penalty, alpha = 1 is L1
fit(dataset, params=None) method
fitIntercept: whether to fit an intercept term
maxIter: maximum number of iterations
regParam: regularization parameter, >= 0
solver: optimization algorithm; if unset or empty, "auto" is used
standardization: whether to standardize the features before fitting the model
Summary attributes:
coefficientStandardErrors: standard errors of the coefficients and intercept
devianceResiduals: weighted residuals
explainedVariance: explained-variance regression score, explainedVariance = 1 − variance(y − ŷ) / variance(y)
meanAbsoluteError: mean absolute error
meanSquaredError: mean squared error
numInstances: number of instances predicted
pValues: two-sided p-values for the coefficients and intercept; only available with the "normal" solver
predictions: the DataFrame of predictions returned by the model's transform method
r2: R-squared
residuals: residuals
rootMeanSquaredError: root mean squared error
tValues: t-statistics
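The explainedVariance formula above is easy to verify directly. A pure-Python sketch of the same computation:

```python
# explainedVariance = 1 - variance(y - y_hat) / variance(y)
def explained_variance(labels, preds):
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    resid = [y - p for y, p in zip(labels, preds)]
    return 1 - variance(resid) / variance(labels)

print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```

A perfect fit scores 1.0; predicting the label mean everywhere scores 0.0.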
Train the model:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(maxIter=10, elasticNetParam=0.8, regParam=0.3, labelCol='medv')
lr_model = lr.fit(train_data)
# training metrics
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
RMSE: 5.457496
r2: 0.432071
Test:
result = lr_model.transform(test_data)
result.show(3)
+--------------------+-----+------------------+
| features| medv| prediction|
+--------------------+-----+------------------+
|[0.03237,0.0,2.18...| 33.4|27.066314856077966|
|[0.08829,12.5,7.8...| 22.9|23.721352298735898|
|[0.14455,12.5,7.8...|22.77|21.388248900632398|
+--------------------+-----+------------------+
only showing top 3 rows
Model evaluation:
from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(labelCol='medv', metricName="r2", predictionCol='prediction')
r2 = lr_evaluator.evaluate(result)
print('R-squared (r2): {:.3}'.format(r2))
R-squared (r2): 0.469
test_evaluation = lr_model.evaluate(test_data)
print('RMSE:{:.3}'.format(test_evaluation.rootMeanSquaredError))
print('r2:{:.3}'.format(test_evaluation.r2))
RMSE:5.7
r2:0.469
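The r2 reported in both places is 1 − SS_res / SS_tot, so the evaluator and the model's evaluate() agree. A pure-Python check of the definition:

```python
# r2 = 1 - SS_res / SS_tot, i.e. fraction of label variance explained
def r2_score(labels, preds):
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```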