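Requirements
For this DataFrame we group by id and apply the following rule to each user id:
- If the user has both clicks and impressions (display events), keep only the clicks.
- If the user has only impressions, keep all of them.
The desired output is the table printed in the final step of this post.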
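Before the Spark version, the rule above can be sketched in plain Python. This is only an illustration under assumptions: the function name `keep_clicks_or_displays` and the sample ids `u1`/`u2` are hypothetical and do not appear in the original code.

```python
from collections import defaultdict


def keep_clicks_or_displays(records):
    """Apply the per-user rule to a list of {'id', 'event', 'hour'} dicts.

    Hypothetical helper for illustration only; not part of the Spark code.
    """
    # Group the records by user id, preserving first-seen order.
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec['id']].append(rec)

    kept = []
    for uid, recs in by_user.items():
        clicks = [r for r in recs if r['event'] == 'click']
        # Any click at all: keep only the clicks. Otherwise keep everything.
        kept.extend(clicks if clicks else recs)
    return kept


sample = [
    {'id': 'u1', 'event': 'display', 'hour': '8'},
    {'id': 'u1', 'event': 'click', 'hour': '9'},
    {'id': 'u2', 'event': 'display', 'hour': '8'},
    {'id': 'u2', 'event': 'display', 'hour': '9'},
]
result = keep_clicks_or_displays(sample)
print(result)
```

Here `u1` keeps only its click, while `u2`, which never clicked, keeps both impressions.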
Building the mock data
from pyspark.sql import SparkSession

# `ss` is assumed to be a SparkSession; the original post does not show how it was created
ss = SparkSession.builder.getOrCreate()

di = [
    {'id': 'a1', 'event': 'display', 'hour': '8'},
    {'id': 'a1', 'event': 'click', 'hour': '9'},
    {'id': 'a1', 'event': 'display', 'hour': '11'},
    {'id': 'a1', 'event': 'display', 'hour': '12'},
    {'id': 'a1', 'event': 'click', 'hour': '13'},
    {'id': 'a2', 'event': 'display', 'hour': '8'},
    {'id': 'a2', 'event': 'display', 'hour': '9'},
    {'id': 'a2', 'event': 'display', 'hour': '10'},
    {'id': 'a2', 'event': 'display', 'hour': '11'},
    {'id': 'a3', 'event': 'display', 'hour': '8'},
    {'id': 'a3', 'event': 'click', 'hour': '9'},
    {'id': 'a3', 'event': 'display', 'hour': '18'},
]
df = ss.createDataFrame(di)
df.show()
+-------+----+---+
|  event|hour| id|
+-------+----+---+
|display|   8| a1|
|  click|   9| a1|
|display|  11| a1|
|display|  12| a1|
|  click|  13| a1|
|display|   8| a2|
|display|   9| a2|
|display|  10| a2|
|display|  11| a2|
|display|   8| a3|
|  click|   9| a3|
|display|  18| a3|
+-------+----+---+
Grouped processing
from pyspark.sql import functions as fn

# For a single user: if there are clicks, keep only the clicks; otherwise keep every impression
def row_advert(row):
    uid = row[0]
    events, hours = row[1], row[2]
    resEvent, resHour = [], []
    for i in range(len(events)):
        if events[i] == 'click':
            resEvent.append(events[i])
            resHour.append(hours[i])
    # No click found: fall back to all of the user's events
    if len(resEvent) == 0:
        resEvent = events
        resHour = hours
    tups = (
        uid,
        ','.join(str(i) for i in resEvent),  # return comma-separated strings so Spark can later split one row into many
        ','.join(str(i) for i in resHour)
    )
    return tups

dfgp = df.groupBy('id').agg(
    fn.collect_list('event').alias('event'),
    fn.collect_list('hour').alias('hour')
).rdd.map(row_advert).toDF(schema=['id', 'event', 'hour'])
dfgp.show(truncate=False)
+---+-------------------------------+---------+
|id |event                          |hour     |
+---+-------------------------------+---------+
|a3 |click                          |9        |
|a2 |display,display,display,display|8,9,10,11|
|a1 |click,click                    |9,13     |
+---+-------------------------------+---------+
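Expanding one row into many
After the grouped processing above, each row needs to be expanded into multiple rows.
from pyspark.sql.types import StructField, LongType

# Recursively flatten nested lists/tuples (a Row is a tuple) into a flat stream of values
def flat(l):
    for k in l:
        if not isinstance(k, (list, tuple)):
            yield k
        else:
            yield from flat(k)

# Add a consecutive, auto-incrementing id column (tmpid) to df, used later as a join key
def mkdf_tojoin(df, ss):
    schema = df.schema.add(StructField("tmpid", LongType()))
    rdd = df.rdd.zipWithIndex()
    rdd = rdd.map(lambda x: list(flat(x)))
    df = ss.createDataFrame(rdd, schema)
    return df

dfE = dfgp.withColumn('event', fn.explode(fn.split(dfgp.event, ','))).select(['id', 'event'])
dfE.show()
# Attach the auto-increment id: the rows may come back in a different order, and without it the join could fail to match the rows that belong together
dfE = mkdf_tojoin(dfE, ss)
dfE.show()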
+---+-------+
| id|  event|
+---+-------+
| a3|  click|
| a2|display|
| a2|display|
| a2|display|
| a2|display|
| a1|  click|
| a1|  click|
+---+-------+
+---+-------+-----+
| id|  event|tmpid|
+---+-------+-----+
| a3|  click|    0|
| a2|display|    1|
| a2|display|    2|
| a2|display|    3|
| a2|display|    4|
| a1|  click|    5|
| a1|  click|    6|
+---+-------+-----+
dfH = dfgp.withColumn('hour', fn.explode(fn.split(dfgp.hour, ','))).select(['id', 'hour'])
dfH.show()
dfH = mkdf_tojoin(dfH, ss)
dfH.show()
+---+----+
| id|hour|
+---+----+
| a3|   9|
| a2|   8|
| a2|   9|
| a2|  10|
| a2|  11|
| a1|   9|
| a1|  13|
+---+----+
+---+----+-----+
| id|hour|tmpid|
+---+----+-----+
| a3|   9|    0|
| a2|   8|    1|
| a2|   9|    2|
| a2|  10|    3|
| a2|  11|    4|
| a1|   9|    5|
| a1|  13|    6|
+---+----+-----+
Joining the results
dfres = dfE.join(dfH, on=['id', 'tmpid'], how='left').drop('tmpid')
dfres.show()
+---+-------+----+
| id|  event|hour|
+---+-------+----+
| a2|display|   9|
| a1|  click|   9|
| a2|display|  11|
| a3|  click|   9|
| a2|display|   8|
| a2|display|  10|
| a1|  click|  13|
+---+-------+----+
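The join on tmpid is needed only because event and hour are exploded separately. As a possible alternative, the two comma-separated strings can be split and zipped together so each event stays next to its hour. The sketch below shows the idea in plain Python; `explode_pairs` is a hypothetical helper, not part of the original code. In Spark 2.4 and later, `pyspark.sql.functions.arrays_zip` followed by `explode` looks like the natural counterpart, though that has not been checked against this post's pipeline.

```python
def explode_pairs(uid, events_csv, hours_csv):
    """Split two aligned comma-separated strings into (id, event, hour) rows.

    Hypothetical helper for illustration only.
    """
    events = events_csv.split(',')
    hours = hours_csv.split(',')
    if len(events) != len(hours):
        raise ValueError('event and hour lists must have the same length')
    # zip keeps the i-th event next to the i-th hour, so no join key is needed.
    return [(uid, event, hour) for event, hour in zip(events, hours)]


# Rows as printed by dfgp.show(truncate=False) above.
grouped = [
    ('a3', 'click', '9'),
    ('a2', 'display,display,display,display', '8,9,10,11'),
    ('a1', 'click,click', '9,13'),
]
rows = [row for group in grouped for row in explode_pairs(*group)]
for row in rows:
    print(row)
```

This yields the same seven (id, event, hour) rows as the joined table, without relying on two separately exploded DataFrames having matching row order.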
That is all for these notes. If anything is still unclear, the following articles may help: