Py4JJavaError when using numpy inside pyspark.sql.functions.udf

Background: in PySpark, udf wraps a Python function so it can be applied column-wise, e.g. when adding a new column to a DataFrame.

Cause of the error: a UDF cannot return a numpy type. Here np.floor returns a numpy.float64 rather than a plain Python number, and Spark's JVM-side unpickler cannot reconstruct numpy objects, so the job fails with the PickleException shown below.
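A quick check in plain Python makes the problem visible (a minimal sketch, using the same hour value 8 as the row below):

import numpy as np

x = np.floor(8 / 6)   # what the lambda returns for hour=8
print(type(x))        # <class 'numpy.float64'> -- a numpy scalar, not something Spark can unpickle
print(type(int(x)))   # <class 'int'> -- a plain Python int, safe for a UDF to return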

For example:

df.head()

Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', userId='30', hour=8)

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# Bucket the time into 6-hour groups; np.floor returns a numpy.float64, which triggers the error
get_6hour = udf(lambda x: np.floor(x/6), IntegerType())
df.withColumn('6hour', get_6hour(df.hour)).head()

Py4JJavaError: An error occurred while calling o2099.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 207.0 failed 1 times, most recent failure: Lost task 0.0 in stage 207.0 (TID 9384, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)……

Solution: cast the UDF's return value to a built-in Python int before returning it:

get_6hour = udf(lambda x: int(np.floor(x/6)), IntegerType())
df.withColumn('6hour', get_6hour(df.hour)).head()

Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', userId='30', hour=8, 6hour=1)
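The same idea works for any numpy scalar: converting it to a native type (int(...), float(...), or the numpy .item() method) before returning is enough. As a side note, for a simple transformation like this, Spark's built-in floor can do the bucketing without a Python UDF, so no numpy value ever has to be pickled (a minimal sketch):

from pyspark.sql import functions as F

# Built-in floor runs on the JVM side; no Python UDF, no numpy, no pickling
df.withColumn('6hour', F.floor(df.hour / 6).cast('int')).head()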

If this solved your problem, please like and follow~


Origin blog.csdn.net/weixin_45281949/article/details/104324158