Using a UDF to run filter in Spark

Sometimes we need to run a filter operation on a DataFrame. The following statement raises an error:

new_user_rdd = user_rdd.filter(lambda x: begin <= datetime.strptime(x['finish_time'], '%Y-%m-%d %H:%M:%S') <= end)

TypeError: condition should be string or Column

This fails because DataFrame.filter expects a SQL expression string or a Column, not a plain Python lambda. One solution is to wrap the predicate in a UDF returning BooleanType, which produces a Column that filter accepts:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from datetime import datetime

begin = datetime.strptime('2017-10-01 00:00:00', '%Y-%m-%d %H:%M:%S')
end = datetime.strptime('2017-12-31 23:59:59', '%Y-%m-%d %H:%M:%S')

new_user_rdd = user_rdd.filter(
    udf(lambda target: begin <= datetime.strptime(target, '%Y-%m-%d %H:%M:%S') <= end,
        BooleanType())(user_rdd['finish_time']))
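The function wrapped by the UDF is ordinary Python, so the date-range logic can be sanity-checked on its own without a Spark session. A minimal standalone sketch (the sample timestamps here are made up for illustration):

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'
begin = datetime.strptime('2017-10-01 00:00:00', FMT)
end = datetime.strptime('2017-12-31 23:59:59', FMT)

def in_range(target):
    # Same predicate the UDF wraps: parse the string, then test the bounds.
    return begin <= datetime.strptime(target, FMT) <= end

print(in_range('2017-11-15 12:00:00'))  # inside the range -> True
print(in_range('2018-01-01 00:00:00'))  # past the end     -> False
```

As an aside, because 'YYYY-MM-DD HH:MM:SS' strings sort in the same order as the timestamps they encode, a UDF-free alternative (not from the original post) is to compare the column against literal strings, e.g. `user_rdd.filter((user_rdd['finish_time'] >= '2017-10-01 00:00:00') & (user_rdd['finish_time'] <= '2017-12-31 23:59:59'))`, which lets Spark optimize the filter instead of calling into Python per row.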


Reposted from blog.csdn.net/iqqiqqiqqiqq/article/details/78960216