Using a UDF to filter in PySpark

Sometimes we need to filter a DataFrame by a computed condition, but passing a plain Python lambda to filter, as below, raises an error:

new_user_rdd = user_rdd.filter(lambda x: begin <= datetime.strptime(x['finish_time'], '%Y-%m-%d %H:%M:%S') <= end)

TypeError: condition should be string or Column
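
The error occurs because DataFrame.filter (despite the RDD-flavored variable name here) accepts only a SQL expression string or a Column; plain Python functions are valid only on RDDs. For illustration, assuming user_rdd is a DataFrame whose finish_time column is a string, both of these forms are accepted:

user_rdd.filter("finish_time >= '2017-10-01 00:00:00'")            # SQL expression string
user_rdd.filter(user_rdd['finish_time'] >= '2017-10-01 00:00:00')  # Column expression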

A workaround is to wrap the predicate in a udf, which turns the call into a Column expression:

from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

begin = datetime.strptime('2017-10-01 00:00:00', '%Y-%m-%d %H:%M:%S')
end = datetime.strptime('2017-12-31 23:59:59', '%Y-%m-%d %H:%M:%S')

# Wrap the Python predicate in a UDF; calling it on a column yields a Boolean Column
in_range = udf(
    lambda target: begin <= datetime.strptime(target, '%Y-%m-%d %H:%M:%S') <= end,
    BooleanType(),
)

new_user_rdd = user_rdd.filter(in_range(user_rdd['finish_time']))
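
Note that a Python UDF forces every value through the Python interpreter and is opaque to Spark's optimizer. For this particular case, since '%Y-%m-%d %H:%M:%S' strings sort lexicographically in chronological order, the same filter can be written with a built-in Column operation instead (a sketch, assuming finish_time is stored as a string in that format):

from pyspark.sql.functions import col

# between is inclusive on both ends, matching begin <= t <= end above
new_user_rdd = user_rdd.filter(
    col('finish_time').between('2017-10-01 00:00:00', '2017-12-31 23:59:59')
)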
