Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 1752 because the size after growing exceeds size limitation 2147483632

The error is as follows:

Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 9384 because the size after growing exceeds size limitation 2147483632


In other words, the BufferHolder cannot be grown by another 9384 bytes because the resulting buffer would exceed 2147483632 bytes — just under Integer.MAX_VALUE, roughly 2 GB, the largest byte array Spark can allocate for a single row.

Reference link:

https://docs.microsoft.com/zh-cn/azure/databricks/kb/sql/cannot-grow-bufferholder-exceeds-size

Locating the error:

My requirement: for each unique key, keep all click records if that key has any clicks, and all exposure records if it has none. I implemented this with collect_list. After the groupBy, each row of the aggregated DataFrame holds every value collected for its key, so the row vectors passed from collect_list into the UDF become extremely long, and a single row ends up exceeding the buffer size limit.

The problematic code is as follows:

from pyspark.sql import functions as fn

# data_origin_columns1 and row_dataID_druid_ad_behavior are defined elsewhere in the job
schema_getdataCols = ['newid'] + data_origin_columns1
# One collect_list per column: every output row carries ALL values for its
# newid, so a hot key produces a single gigantic row.
df_HDFS_gp = df_HDFS_A.groupBy('newid').agg(
    fn.collect_list('suuid').alias('suuid'),
    fn.collect_list('aid').alias('aid'),
    fn.collect_list('slotid').alias('slotid'),
    fn.collect_list('adfrom').alias('adfrom'),
    fn.collect_list('appkey').alias('appkey'),
    fn.collect_list('appname').alias('appname'),
    fn.collect_list('battery').alias('battery'),
    fn.collect_list('brand').alias('brand'),
    fn.collect_list('channel').alias('channel'),
    fn.collect_list('hardware').alias('hardware'),
    fn.collect_list('product').alias('product'),
    fn.collect_list('screensize').alias('screensize'),
    fn.collect_list('manufacturer').alias('manufacturer'),
    fn.collect_list('model').alias('model'),
    fn.collect_list('nettype').alias('nettype'),
    fn.collect_list('operator').alias('operator'),
    fn.collect_list('os').alias('os'),
    fn.collect_list('city').alias('city'),
    fn.collect_list('actname').alias('actname'),
).rdd.map(row_dataID_druid_ad_behavior).toDF(schema=schema_getdataCols)
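
Before restructuring anything, it helps to confirm that a few keys really do accumulate huge lists. A quick diagnostic sketch (not from the original post; assumes the same df_HDFS_A and fn alias as above):

# Rows per key: the largest groups are the ones whose collect_list
# arrays push a single row past the ~2 GB buffer limit.
group_sizes = df_HDFS_A.groupBy('newid').count()
group_sizes.orderBy(fn.desc('count')).show(10, truncate=False)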

Approaches tried (neither fixed the error):

  • Increasing the serializer buffer size:
    • .config('spark.kryoserializer.buffer.max', 5120) \
      (this tunes Kryo's serialization buffer, not the ~2 GB per-row limit that BufferHolder enforces)
  • Splitting df into multiple DataFrames for subsequent processing — see the sketch after this list
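
A minimal sketch of the splitting idea (the bucket column is hypothetical, not from the original code). It lowers per-job memory pressure, but all rows for one newid still land in the same group, so an oversized key stays oversized — which is why it did not fix this error:

# Hash each key into one of N buckets and process each bucket separately.
# (pmod pattern keeps the bucket non-negative for negative hash values)
N = 4
df_bucketed = df_HDFS_A.withColumn('bucket', (fn.hash('newid') % N + N) % N)
for b in range(N):
    df_part = df_bucketed.filter(fn.col('bucket') == b)
    # ... run the groupBy / collect_list pipeline on df_part ...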

Final solution:

My thinking had ossified: whenever grouping came up, I reached straight for groupBy. But the requirement — keep all clicks for keys that have clicks, keep all exposures for keys that don't — needs no collect_list at all. The key code is as follows:

'''Assemble the triples and split out the labels'''
# First select the rows that have clicks
df_have_click = df.filter(df['actname'] == 'ckads')

# Collect the triple ids that have clicks
click_ids = df_have_click.select('newid').collect()
click_ids = [i[0] for i in click_ids]
click_ids = list(set(click_ids))

# Select the exposure rows, excluding triple ids that already have clicks
df_have_display = df.filter(df['actname'] == 'exads').filter(~df['newid'].isin(click_ids))

# Concatenate the two parts
df_HDFS_res = df_have_click.unionAll(df_have_display)

Problem solved.
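
A possible refinement beyond the original post: collecting click_ids to the driver and filtering with isin works, but can strain the driver when there are many distinct keys. A left anti join expresses the same exclusion entirely on the cluster (a sketch, assuming the same df):

# Keep exposure rows whose newid never appears among the click rows,
# as a distributed join instead of a driver-side id list.
click_keys = df_have_click.select('newid').distinct()
df_have_display = df.filter(df['actname'] == 'exads') \
    .join(click_keys, on='newid', how='left_anti')
df_HDFS_res = df_have_click.unionAll(df_have_display)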

Origin: https://blog.csdn.net/qq_42363032/article/details/123306111