Comparing dropDuplicates() + count with count(distinct) in PySpark

Industry news · 2023-09-30 00:40:56

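Recently, while computing statistics over roughly half a year of data, I first used countDistinct("id"). It was slow and prone to data skew. The groupBy also has a huge number of keys, on the order of hundreds of millions, so doing the intermediate computation with mapPartitions was not an option: the keys alone would be enough to trigger an OOM error.

With no other choice, I still had to brute-force the result with a count, but I wanted a faster way to do it. After some searching, I found that df.dropDuplicates(cols_).groupBy("").agg(count("id")) gives the same result as df.groupBy("").agg(countDistinct("id")), and is sometimes slightly faster, so I am noting it down here. For the two to match, cols_ has to contain the groupBy keys together with the id column.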
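Why the two forms agree can be checked without a cluster. The sketch below is a plain-Python model of the two aggregations, not Spark code: the sample rows and both function names are hypothetical, and None stands in for a SQL NULL. It assumes the usual SQL behaviour that both count(col) and countDistinct(col) skip nulls, and that deduplication is done on the grouping key plus the counted column.

```python
from collections import defaultdict

# Hypothetical sample rows: (first_cate_cd, user_id); None models a SQL NULL.
rows = [
    ("A", 1), ("A", 1), ("A", 2),
    ("B", 3), ("B", None), ("B", 3),
    ("C", None),
]

def count_distinct_per_key(rows):
    """Model of groupBy(key).agg(countDistinct(id)): distinct non-null ids per key."""
    seen = defaultdict(set)
    for key, uid in rows:
        seen[key]                    # create the group even if every id is null
        if uid is not None:
            seen[key].add(uid)
    return {key: len(ids) for key, ids in seen.items()}

def dedupe_then_count_per_key(rows):
    """Model of dropDuplicates([key, id]).groupBy(key).agg(count(id))."""
    unique_rows = set(rows)          # drop duplicate (key, id) pairs
    counts = defaultdict(int)
    for key, uid in unique_rows:
        counts[key] += 0             # create the group even if every id is null
        if uid is not None:          # count(col) skips nulls
            counts[key] += 1
    return dict(counts)

a = count_distinct_per_key(rows)
b = dedupe_then_count_per_key(rows)
print(a == b)
```

In this model, deduplicating on the id alone (without the grouping key) would merge rows that belong to different groups and break the equivalence, which is why the column list passed to dropDuplicates matters.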

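How it works:

1 distinct: returns a DataFrame with no duplicate rows

Returns the distinct Row records of the current DataFrame. The result is the same as calling dropDuplicates() (described next) without specifying any columns.
Example: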

df.distinct()

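2 dropDuplicates: deduplicate by the specified columns

Deduplicates by the specified columns, similar to select distinct a, b. Unlike that query, passing a column subset to dropDuplicates keeps the remaining columns of the rows that are retained.
Example: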

train.select('Age','Gender').dropDuplicates().show()
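The difference between deduplicating whole rows, deduplicating on a column subset, and selecting first can be sketched in plain Python. This is a toy model rather than Spark code: the rows, the extra City column, and the helper names drop_duplicates and select are all hypothetical. The model keeps the first row it sees for each key, whereas Spark makes no promise about which duplicate survives.

```python
# Hypothetical rows; "City" is an extra column added only for illustration.
rows = [
    {"Age": 30, "Gender": "F", "City": "X"},
    {"Age": 30, "Gender": "F", "City": "Y"},
    {"Age": 30, "Gender": "F", "City": "X"},   # exact duplicate of the first row
    {"Age": 41, "Gender": "M", "City": "Z"},
]

def drop_duplicates(rows, subset=None):
    """Model of DataFrame.dropDuplicates: keep one row per distinct key.

    With subset=None the key is the whole row, which matches distinct().
    This model keeps the first row seen for each key.
    """
    seen = set()
    kept = []
    for row in rows:
        cols = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def select(rows, cols):
    """Model of DataFrame.select: keep only the named columns."""
    return [{c: row[c] for c in cols} for row in rows]

whole_row = drop_duplicates(rows)                             # like df.distinct()
by_subset = drop_duplicates(rows, ["Age", "Gender"])          # like df.dropDuplicates(["Age", "Gender"])
projected = drop_duplicates(select(rows, ["Age", "Gender"]))  # like df.select(...).dropDuplicates()

print(len(whole_row), len(by_subset), len(projected))
```

Both by_subset and projected end up with one row per (Age, Gender) pair, but only by_subset still carries the City column.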

3 Application example

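from pyspark.sql.functions import col, count, countDistinct

# 1 countDistinct(): distinct users per (window_type, first_cate_cd)
cate1 = action_info.groupBy("window_type", "first_cate_cd") \
            .agg(countDistinct("user_id").alias("user_cnt"))
cate1.show()

# 2 dropDuplicates(): deduplicate (user_id, first_cate_cd, shop_id) first,
# then a plain count per (first_cate_cd, shop_id) gives the distinct user count
view_shop = action_info.select("user_id", "first_cate_cd", "shop_id") \
     .dropDuplicates().repartition(10000, col("user_id"), col("shop_id")) \
     .groupBy("first_cate_cd", "shop_id") \
     .agg(count("user_id").alias("user_cnt"))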

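With small data volumes no difference in speed is visible, and once the data reaches a certain scale the two approaches still run at roughly the same speed.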

Reposted from blog.csdn.net/eylier/article/details/128719565
