PySpark: fitting a model fails with 'Field "features" does not exist'

When doing machine learning with PySpark, every model object takes a featuresCol parameter (default "features") naming its input feature column. That column must be a single column in which all of the data's X features are packed together, i.e. a 'vector'.
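
For example, fitting an estimator such as LogisticRegression (an illustrative choice; any pyspark.ml estimator behaves the same way, and trainingData is a hypothetical DataFrame):

from pyspark.ml.classification import LogisticRegression

# featuresCol defaults to "features"; fit() looks this column up in the DataFrame
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(trainingData)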

If the DataFrame passed to fit() has no such vector column, the call fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/spark/spark-2.4.4/python/pyspark/ml/base.py", line 132, in fit
    return self._fit(dataset)
  File "/data/spark/spark-2.4.4/python/pyspark/ml/wrapper.py", line 295, in _fit
    java_model = self._fit_java(dataset)
  File "/data/spark/spark-2.4.4/python/pyspark/ml/wrapper.py", line 292, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/data/spark/spark-2.4.4/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/data/spark/spark-2.4.4/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist.'

On Stack Overflow, desertnaut answered:

Spark dataframes are not used like that in Spark ML; all your features need to be vectors in a single column, usually named features. Here is how you can do it using the 5 rows you have provided as an example:

spark.version
# u'2.2.0'

from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

# your sample data:
temp_df = spark.createDataFrame([
    Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0),
    Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0),
    Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0),
    Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0),
    Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)])

# in Spark < 3.0, Row sorts its fields alphabetically, so the label V5409 ends up last;
# pack the first ten values into a dense vector and keep the last one as the label
trainingData = temp_df.rdd.map(lambda x: (Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
trainingData.show()
# +--------------------+-----+ 
# |            features|label|
# +--------------------+-----+
# |[-0.104,0.005,-0....|    0| 
# |[-0.137,0.001,-0....|    0|
# |[-0.155,-0.006,-0...|    0|
# |[-0.108,0.005,-0....|    0|
# |[-0.139,0.003,-0....|    0|
# +--------------------+-----+

In other words, all of the input features have to be packed into one 'vector' column. The method used above is:

from pyspark.ml.linalg import Vectors
trainingData = temp_df.rdd.map(lambda x: (Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
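
The quoted answer builds the vector through the RDD API. The same "features" column can also be produced while staying in the DataFrame API with pyspark.ml.feature.VectorAssembler; a sketch under the column layout above (my addition, not part of the quoted answer):

from pyspark.ml.feature import VectorAssembler

# every column except the label V5409 goes into the "features" vector
feature_cols = [c for c in temp_df.columns if c != "V5409"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
trainingData = assembler.transform(temp_df).withColumnRenamed("V5409", "label")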

Also on Stack Overflow, a 100+ upvoted answer says:

Personally I would go with Python UDF and wouldn't bother with anything else:
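
A minimal sketch of that Python UDF idea (a reconstruction, assuming the features already sit in a single array<double> column, hypothetically named raw):

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# hypothetical input: a DataFrame df with an array<double> column "raw"
as_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df = df.withColumn("features", as_vector("raw"))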

Reposted from blog.csdn.net/authorized_keys/article/details/102747280