PySpark: converting back to a DataFrame after groupBy processing

Problem statement

For the data frame constructed below, we need to group by id; for each user id:

  • If the user has both click and display events, keep only the click rows
  • If the user has only display events, keep all the display rows

After processing, the ideal result keeps only the click rows for a1 and a3 and all of the display rows for a2 (this matches the final output at the end of the post).
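To spell the rule out, here is a minimal plain-Python sketch of the per-user logic; the helper name keep_rows is just for illustration and does not appear in the post:

def keep_rows(rows):
    # rows: list of (event, hour) tuples for one user
    clicks = [r for r in rows if r[0] == 'click']
    return clicks if clicks else rows

print(keep_rows([('display', '8'), ('click', '9')]))    # [('click', '9')]
print(keep_rows([('display', '8'), ('display', '9')]))  # [('display', '8'), ('display', '9')]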

Constructing sample data

# Imports and SparkSession setup are assumed here; the original post does not show them
from pyspark.sql import SparkSession
import pyspark.sql.functions as fn
from pyspark.sql.types import StructField, LongType

ss = SparkSession.builder.appName('groupby-demo').getOrCreate()

di = [{'id': 'a1', 'event': 'display', 'hour': '8'},
      {'id': 'a1', 'event': 'click', 'hour': '9'},
      {'id': 'a1', 'event': 'display', 'hour': '11'},
      {'id': 'a1', 'event': 'display', 'hour': '12'},
      {'id': 'a1', 'event': 'click', 'hour': '13'},
      {'id': 'a2', 'event': 'display', 'hour': '8'},
      {'id': 'a2', 'event': 'display', 'hour': '9'},
      {'id': 'a2', 'event': 'display', 'hour': '10'},
      {'id': 'a2', 'event': 'display', 'hour': '11'},
      {'id': 'a3', 'event': 'display', 'hour': '8'},
      {'id': 'a3', 'event': 'click', 'hour': '9'},
      {'id': 'a3', 'event': 'display', 'hour': '18'}]
df = ss.createDataFrame(di)
df.show()
+-------+----+---+
|  event|hour| id|
+-------+----+---+
|display|   8| a1|
|  click|   9| a1|
|display|  11| a1|
|display|  12| a1|
|  click|  13| a1|
|display|   8| a2|
|display|   9| a2|
|display|  10| a2|
|display|  11| a2|
|display|   8| a3|
|  click|   9| a3|
|display|  18| a3|
+-------+----+---+
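A small aside: because the DataFrame is built from a list of dicts, the columns come out in alphabetical order (event, hour, id), as the output above shows. If you prefer the id, event, hour order instead, a select can reorder the columns (an optional tweak, not in the original post):

df = ss.createDataFrame(di).select('id', 'event', 'hour')
df.show()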

Group-by processing

# For a single user: if there are any clicks, keep only the clicks; otherwise keep all displays
def row_advert(row):
    uid = row[0]
    events, hours = row[1], row[2]
    resEvent, resHour = [], []

    for i in range(len(events)):
        if events[i] in ['click']:
            resEvent.append(events[i])
            resHour.append(hours[i])
    if len(resEvent) == 0:
        resEvent = events
        resHour = hours

    tups = (
        uid,
        ','.join(str(i) for i in resEvent),  # return comma-separated strings so Spark can later explode one row into many
        ','.join(str(i) for i in resHour)
    )
    return tups

dfgp = df.groupBy('id').agg(
    fn.collect_list('event').alias('event'),
    fn.collect_list('hour').alias('hour')
).rdd.map(row_advert).toDF(schema=['id', 'event', 'hour'])

dfgp.show(truncate=False) 
+---+-------------------------------+---------+
|id |event                          |hour     |
+---+-------------------------------+---------+
|a3 |click                          |9        |
|a2 |display,display,display,display|8,9,10,11|
|a1 |click,click                    |9,13     |
+---+-------------------------------+---------+
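If the chained call above feels dense, it can help to look at the intermediate aggregation before the rdd.map step. This inspection snippet is not in the original post; note that both collect_list arrays are built from the same rows within the same aggregation, which is what lets row_advert pair events[i] with hours[i]:

# Inspect the aggregated DataFrame before row_advert is applied:
# each id gets an array of its events and an array of its hours
df.groupBy('id').agg(
    fn.collect_list('event').alias('event'),
    fn.collect_list('hour').alias('hour')
).show(truncate=False)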

Exploding one row into multiple rows

From the grouped result above, each row now has to be exploded back into multiple rows.

# Add a consecutive auto-increment id column to a DataFrame, to be used as a join key
def flat(l):
    # zipWithIndex yields (Row, index) tuples; flatten each one into a single flat list
    for k in l:
        if not isinstance(k, (list, tuple)):
            yield k
        else:
            yield from flat(k)

def mkdf_tojoin(df, ss):
    schema = df.schema.add(StructField("tmpid", LongType()))
    rdd = df.rdd.zipWithIndex()
    rdd = rdd.map(lambda x: list(flat(x)))
    df = ss.createDataFrame(rdd, schema)
    return df

dfE = dfgp.withColumn('event', fn.explode(fn.split(dfgp.event, ','))).select(['id', 'event'])
dfE.show()

# Attach the auto-increment id as a join key: joining on id alone would match every event row
# with every hour row of the same user, so tmpid pins each event to the hour at the same position
# (this relies on dfE and dfH being produced from dfgp in the same row order)
dfE = mkdf_tojoin(dfE, ss)
dfE.show()
+---+-------+
| id|  event|
+---+-------+
| a3|  click|
| a2|display|
| a2|display|
| a2|display|
| a2|display|
| a1|  click|
| a1|  click|
+---+-------+

+---+-------+-----+
| id|  event|tmpid|
+---+-------+-----+
| a3|  click|    0|
| a2|display|    1|
| a2|display|    2|
| a2|display|    3|
| a2|display|    4|
| a1|  click|    5|
| a1|  click|    6|
+---+-------+-----+
dfH = dfgp.withColumn('hour', fn.explode(fn.split(dfgp.hour, ','))).select(['id', 'hour'])
dfH.show()

dfH = mkdf_tojoin(dfH, ss)
dfH.show()
+---+----+
| id|hour|
+---+----+
| a3|   9|
| a2|   8|
| a2|   9|
| a2|  10|
| a2|  11|
| a1|   9|
| a1|  13|
+---+----+

+---+----+-----+
| id|hour|tmpid|
+---+----+-----+
| a3|   9|    0|
| a2|   8|    1|
| a2|   9|    2|
| a2|  10|    3|
| a2|  11|    4|
| a1|   9|    5|
| a1|  13|    6|
+---+----+-----+
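Before the join, it is worth noting why the explode had to be done in two separate DataFrames: chaining two explodes on the same DataFrame cross-joins the tokens (for a2 that would be 4 x 4 = 16 rows), which is exactly what the tmpid join key avoids. A quick demonstration, not in the original post:

# Chaining two explodes multiplies the rows: every event token gets paired with every hour token
dfgp.withColumn('event', fn.explode(fn.split('event', ','))) \
    .withColumn('hour', fn.explode(fn.split('hour', ','))) \
    .groupBy('id').count().show()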

Joining the results

dfres = dfE.join(dfH, on=['id', 'tmpid'], how='left').drop('tmpid')
dfres.show()
+---+-------+----+
| id|  event|hour|
+---+-------+----+
| a2|display|   9|
| a1|  click|   9|
| a2|display|  11|
| a3|  click|   9|
| a2|display|   8|
| a2|display|  10|
| a1|  click|  13|
+---+-------+----+
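For reference, here are two possible shortcuts that the original post does not use: posexplode returns each element's position, which can replace the zipWithIndex join key, and a window function over id can express the keep-clicks-if-any rule without leaving the DataFrame API at all. A rough sketch of both, assuming the same df, dfgp, fn and Spark session as above:

from pyspark.sql import Window

# (1) posexplode: join on (id, position) instead of building tmpid with zipWithIndex
dfE2 = dfgp.select('id', fn.posexplode(fn.split('event', ',')).alias('tmpid', 'event'))
dfH2 = dfgp.select('id', fn.posexplode(fn.split('hour', ',')).alias('tmpid', 'hour'))
dfE2.join(dfH2, on=['id', 'tmpid'], how='inner').drop('tmpid').show()

# (2) window function on the original df: flag users who have at least one click,
#     then keep only clicks for those users and all rows for the rest
w = Window.partitionBy('id')
has_click = fn.max((fn.col('event') == 'click').cast('int')).over(w)
df.withColumn('has_click', has_click) \
  .where((fn.col('has_click') == 0) | (fn.col('event') == 'click')) \
  .drop('has_click') \
  .show()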

Just making a note of this; if anything is still unclear, the following articles may help:

https://blog.csdn.net/qq_42363032/article/details/118542608

https://blog.csdn.net/qq_42363032/article/details/118298108
