Sometimes we need to filter rows with filter, but a statement like the following raises an error:
new_user_rdd = user_rdd.filter(lambda x: begin <= datetime.strptime(x['finish_time'], '%Y-%m-%d %H:%M:%S') <= end)
TypeError: condition should be string or Column
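The error shows that user_rdd is actually a DataFrame despite its name: DataFrame.filter accepts only a Column or a SQL expression string, not a Python function. One workaround is to wrap the predicate in a UDF so that it yields a Column: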
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from datetime import datetime
begin = datetime.strptime('2017-10-01 00:00:00', '%Y-%m-%d %H:%M:%S')
end = datetime.strptime('2017-12-31 23:59:59', '%Y-%m-%d %H:%M:%S')
# Wrap the Python predicate in a UDF so that filter receives a Column
new_user_rdd = user_rdd.filter(
    udf(lambda target: begin <= datetime.strptime(target, '%Y-%m-%d %H:%M:%S') <= end,
        BooleanType())(user_rdd['finish_time']))
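The lambda above raises an exception inside the UDF if finish_time is null or does not match the expected format. As a minimal sketch, the predicate can be pulled out into a named function that handles those cases and can be checked without Spark. The name in_range and the choice to treat null or malformed values as "not in range" are assumptions for illustration, not part of the original solution.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'
begin = datetime.strptime('2017-10-01 00:00:00', FMT)
end = datetime.strptime('2017-12-31 23:59:59', FMT)


def in_range(target):
    """Return True if target parses as a timestamp within [begin, end].

    Assumption: null or malformed values are treated as out of range.
    """
    if target is None:
        return False
    try:
        return begin <= datetime.strptime(target, FMT) <= end
    except ValueError:
        # The string does not match FMT
        return False


# Wrapping for Spark (requires pyspark; user_rdd is the DataFrame from above):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import BooleanType
#   new_user_rdd = user_rdd.filter(udf(in_range, BooleanType())(user_rdd['finish_time']))
```

If finish_time is a string column that always uses this fixed, zero-padded format, a built-in expression such as user_rdd['finish_time'].between('2017-10-01 00:00:00', '2017-12-31 23:59:59') would likely avoid the UDF altogether, since string order then matches chronological order.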