PySpark: process a DataFrame by group, then convert it back to a DataFrame (one row to multiple rows)

Problem description

In Spark, grouping a DataFrame produces a "horizontal" layout, with each group collapsed into a single row; see: https://blog.csdn.net/qq_42363032/article/details/118298108

Spark grouping example (screenshot omitted). Note: no aggregation is applied in this grouping.
Requirement: turn this grouped DataFrame back into a normal "vertical" DataFrame, with one record per row.

Implementation

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
from pyspark.sql.types import StructField, LongType

ss = SparkSession.builder.appName('toolsAPI').getOrCreate()
sc = ss.sparkContext
sc.setLogLevel('ERROR')
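
The rest of the post works with a DataFrame named source_data that has at least the columns alpos_id, date_time and ecpm. The real data comes from my project, so here is only a minimal, hypothetical stand-in with made-up values:

# Hypothetical sample data standing in for the real source_data
source_data = ss.createDataFrame(
    [
        ('12_887404713', '2021-05-01', 52.4),
        ('12_887404713', '2021-05-02', 34.054054),
        ('12_887404713', '2021-05-03', 82.333333),
        ('13_000000001', '2021-05-01', 10.0),
        ('13_000000001', '2021-05-02', 20.0),
    ],
    ['alpos_id', 'date_time', 'ecpm'],
)
source_data.show()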

Adding per-group processing logic after groupby

Taking my demo as an example, the per-group logic simply staggers two lists against each other, for example:

A = ['2021-01-01', '2021-01-02', '2021-01-03']

B = [1, 2, 3]

Now we want a list C, i.e. B shifted forward by one position, with the last position (which has no "tomorrow") marked by a sentinel value:

C = [2, 3, -1000]
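
In plain Python, this shift is just a slice plus the sentinel, which is exactly what the loop in shifttime below does:

B = [1, 2, 3]
C = B[1:] + [-1000]    # -> [2, 3, -1000]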

Implementation:

# Group by id; collect_list gathers each group's values into an array
da_gb = source_data.groupby('alpos_id').agg(
        fn.collect_list('date_time').alias('date_time'),
        fn.collect_list('ecpm').alias('ecpm')
    )


Convert to an RDD and process it row by row; each row corresponds to one group.

def shifttime(data):
    # data is a dict of the form {alpos_id: [date_time_list, ecpm_list]}
    ids = list(data.keys())[0]
    values = data.get(ids)
    date_s, ecpm_s = values[0], values[1]

    # Stagger the data: each position gets the next day's ecpm
    ecpm_tomorrow = []
    for i in range(1, len(ecpm_s)):
        ecpm_tomorrow.append(ecpm_s[i])
    ecpm_tomorrow.append(-1000)    # the last day has no "tomorrow", so mark it with a sentinel

    # ','.join turns each list into a plain comma-separated string, which is easier to handle later
    return (ids, ','.join(str(i) for i in date_s), ','.join(str(i) for i in ecpm_s), ','.join(str(i) for i in ecpm_tomorrow))
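
A quick local sanity check on a single, hypothetical group:

print(shifttime({'12_887404713': [['2021-01-01', '2021-01-02'], [52.4, 34.05]]}))
# ('12_887404713', '2021-01-01,2021-01-02', '52.4,34.05', '34.05,-1000')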

# Convert to an RDD and process it row by row; each row is one group
daecrdds = da_gb.rdd.map(lambda data: {data.alpos_id: [data.date_time, data.ecpm]})
rddsnew = daecrdds.map(shifttime)


resd = rddsnew.collect()[1:2]    # take a single group as an example (a one-element list of tuples)
print(resd)
[('12_887404713', 
'2021-05-01,2021-05-02,2021-05-03,2021-05-04,2021-05-05,2021-05-06,2021-05-07,2021-05-08,2021-05-09,2021-05-12,2021-05-13,2021-05-14,2021-05-15,2021-05-16,2021-05-18,2021-05-19,2021-05-20,2021-05-21,2021-05-22,2021-05-23,2021-05-24,2021-05-29,2021-05-30,2021-05-31,2021-06-01,2021-06-03,2021-06-04,2021-06-06,2021-06-07,2021-06-08,2021-06-09,2021-06-10,2021-06-11,2021-06-12,2021-06-13,2021-06-14,2021-06-19,2021-06-20,2021-06-21,2021-06-22,2021-06-23,2021-06-24,2021-06-26,2021-06-27,2021-06-28,2021-06-29,2021-06-30,2021-07-01,2021-07-02,2021-07-03,2021-07-04', 
'52.4,34.054054,82.333333,52.711864,87.419355,35.714286,45.357143,16.666667,16.153846,82.666667,390.0,162.307692,19.655172,12.727273,20.0,48.75,26.25,20.909091,205.0,50.0,10.0,69.230769,10.0,12.857143,10.0,51.25,258.75,15.483871,65.0,70.0,541.428571,60.0,20.0,32.0,95.0,2.857143,41.428571,76.666667,30.0,68.75,111.333333,10.0,44.736842,50.0,15.0,52.857143,20.0,35.0,34.285714,34.166667,31.111111', 
'34.054054,82.333333,52.711864,87.419355,35.714286,45.357143,16.666667,16.153846,82.666667,390.0,162.307692,19.655172,12.727273,20.0,48.75,26.25,20.909091,205.0,50.0,10.0,69.230769,10.0,12.857143,10.0,51.25,258.75,15.483871,65.0,70.0,541.428571,60.0,20.0,32.0,95.0,2.857143,41.428571,76.666667,30.0,68.75,111.333333,10.0,44.736842,50.0,15.0,52.857143,20.0,35.0,34.285714,34.166667,31.111111,-1000'
)]
dares = ss.createDataFrame(resd, ['ids', 'date_time_array', 'ecpm_array', 'ecpm_tomorrow_array'])
dares.show()
dares.printSchema()
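
Since shifttime joined every field into a comma-separated string, all four columns are plain strings at this point; printSchema should show something like:

root
 |-- ids: string (nullable = true)
 |-- date_time_array: string (nullable = true)
 |-- ecpm_array: string (nullable = true)
 |-- ecpm_tomorrow_array: string (nullable = true)

That is why the following steps split on ',' before exploding.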


Splitting one row into multiple rows

# Split the comma-separated dates and explode them back into one row per date.
# Use a new variable so dares itself stays unchanged for the multi-column section below.
dares_exploded = dares.withColumn('date_time', fn.explode(fn.split(dares.date_time_array, ',')))
dares_exploded.show()


Splitting multiple columns into multiple rows

The above only splits a single column. If the other columns are exploded on the same DataFrame, every combination is produced, giving pairs such as 2021-05-01 52.4, 2021-05-01 34.05, when the result we actually want is 2021-05-01 52.4, 2021-05-02 34.05; the short sketch below shows where this cross product comes from.
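
A minimal sketch of that naive approach, assuming the dares DataFrame from above:

# Naive attempt: exploding both array columns on the same DataFrame.
# Each explode multiplies the rows, so n dates and n ecpm values
# produce n * n rows per id instead of n aligned rows.
wrong = (dares
         .withColumn('date_time', fn.explode(fn.split(dares.date_time_array, ',')))
         .withColumn('ecpm', fn.explode(fn.split(dares.ecpm_array, ','))))
wrong.show()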

Idea: explode each column separately, add a consecutive auto-increment id to each exploded DataFrame, and then join them back together on that id:

# Flatten nested tuples/lists, e.g. (Row(col1, col2), idx) -> [col1, col2, idx]
def flat(l):
    for k in l:
        if not isinstance(k, (list, tuple)):
            yield k
        else:
            yield from flat(k)

# Add a consecutive auto-increment id column to a df, used for joining.
# zipWithIndex gives a dense 0..n-1 index (unlike fn.monotonically_increasing_id,
# which is increasing but not consecutive), so rows from the separately exploded
# DataFrames can be matched one-to-one.
def mkdf_tojoin(df):
    schema = df.schema.add(StructField("tmpid", LongType()))
    rdd = df.rdd.zipWithIndex()
    rdd = rdd.map(lambda x: list(flat(x)))
    df = ss.createDataFrame(rdd, schema)
    return df
# Explode each column separately; the pieces are joined back together below
dares_date_time = dares.withColumn('date_time', fn.explode(fn.split(dares.date_time_array, ','))).select(['ids', 'date_time'])
dares_ecpm = dares.withColumn('ecpm', fn.explode(fn.split(dares.ecpm_array, ','))).select(['ids', 'ecpm'])
dares_ecpm_tomorrow = dares.withColumn('ecpm_tomorrow', fn.explode(fn.split(dares.ecpm_tomorrow_array, ','))).select(['ids', 'ecpm_tomorrow'])

# Add the consecutive auto-increment id to each piece, used for the join
dares_date_time = mkdf_tojoin(dares_date_time)
dares_ecpm = mkdf_tojoin(dares_ecpm)
dares_ecpm_tomorrow = mkdf_tojoin(dares_ecpm_tomorrow)

source_data = dares_date_time.join(dares_ecpm, on=['ids', 'tmpid'], how='left')\
    .join(dares_ecpm_tomorrow, on=['ids', 'tmpid'], how='left')


source_data = source_data.orderBy('tmpid')
source_data.show(30)

print(dares_date_time.count())
print(source_data.count())
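
As a side note, on Spark 2.4+ the same aligned result can be produced without the temporary id and the joins, by zipping the split arrays element-wise and exploding once. A minimal sketch, assuming the dares DataFrame from above (the exploded values come back as strings, so the numeric columns are cast at the end):

# Alternative (Spark 2.4+): zip the split arrays element-wise, explode once,
# then pull the aligned fields out of the resulting struct
alt = (dares
       .withColumn('dt_arr', fn.split('date_time_array', ','))
       .withColumn('ec_arr', fn.split('ecpm_array', ','))
       .withColumn('et_arr', fn.split('ecpm_tomorrow_array', ','))
       .withColumn('z', fn.explode(fn.arrays_zip('dt_arr', 'ec_arr', 'et_arr')))
       .select('ids',
               fn.col('z.dt_arr').alias('date_time'),
               fn.col('z.ec_arr').cast('double').alias('ecpm'),
               fn.col('z.et_arr').cast('double').alias('ecpm_tomorrow')))
alt.show()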





For more on merging and splitting PySpark DataFrame columns and converting one row to multiple rows, see:

https://blog.csdn.net/qq_42886289/article/details/97003898
