PySpark: comparing dropDuplicates() + count() with countDistinct()

Recently, while computing statistics over half a year of data, I first used countDistinct("id"), which was very slow and prone to data skew. Because the groupBy involved hundreds of millions of keys, using mapPartitions for intermediate aggregation was also out of the question; holding that many keys in memory alone would be enough to trigger an OOM error.

With no better option, I still had to compute the counts the hard way, but I wanted a faster approach. After a lot of searching, I found that df.dropDuplicates(cols_).groupBy(...).agg(count("id")) gives the same result as df.groupBy(...).agg(countDistinct("id")), and is sometimes slightly faster to compute, so I am recording it here.
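As a minimal sketch of the two equivalent formulations (df, group_col and id are placeholder names, not from the original data; note that the column list passed to dropDuplicates() must contain both the grouping columns and the counted column for the results to match):

from pyspark.sql.functions import count, countDistinct

# Variant 1: countDistinct directly
res1 = df.groupBy("group_col").agg(countDistinct("id").alias("id_cnt"))

# Variant 2: deduplicate (group_col, id) pairs first, then a plain count
res2 = df.dropDuplicates(["group_col", "id"]) \
         .groupBy("group_col") \
         .agg(count("id").alias("id_cnt"))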

Principle:

1 distinct: Returns a DataFrame that does not contain duplicate records

Returns the unique Row records in the current DataFrame. When no fields are specified, this method gives the same result as the dropDuplicates() method described below.
Example:

df.distinct()
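A small illustration with toy data invented for this example (spark is assumed to be an existing SparkSession):

# Toy DataFrame for illustration only.
df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "tag"])
df.distinct().show()        # the duplicated (1, "a") row appears only once
df.dropDuplicates().show()  # identical result when no columns are specified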

2 dropDuplicates: removes duplicates based on specified fields

Removes duplicates based on the specified fields, similar to a SELECT DISTINCT a, b operation in SQL.
Example:

train.select('Age','Gender').dropDuplicates().show()
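To deduplicate on a subset of columns while keeping all columns of each row, the column list can be passed directly to dropDuplicates(). A sketch reusing the train DataFrame from the example above; note that which row survives per key is not deterministic:

# Keeps one arbitrary row for each (Age, Gender) combination.
train.dropDuplicates(["Age", "Gender"]).show()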

3 Application examples

from pyspark.sql.functions import col, count, countDistinct

# 1 countDistinct()
# Distinct users per (window_type, first_cate_cd), computed with countDistinct.
cate1 = action_info.groupBy("window_type", "first_cate_cd") \
    .agg(countDistinct("user_id").alias("user_cnt"))
cate1.show()

# 2 dropDuplicates() + count()
# Deduplicate (user_id, first_cate_cd, shop_id) rows first, then a plain count.
view_shop = action_info.select("user_id", "first_cate_cd", "shop_id") \
    .dropDuplicates() \
    .repartition(10000, col("user_id"), col("shop_id")) \
    .groupBy("first_cate_cd", "shop_id") \
    .agg(count("user_id").alias("user_cnt"))
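As a sanity check (a sketch reusing the same action_info DataFrame and the imports above), both formulations can be computed over an identical grouping and compared; any mismatch would show up in the joined result:

# Distinct users per first_cate_cd, computed both ways.
a = action_info.groupBy("first_cate_cd") \
    .agg(countDistinct("user_id").alias("cnt_a"))
b = action_info.dropDuplicates(["first_cate_cd", "user_id"]) \
    .groupBy("first_cate_cd") \
    .agg(count("user_id").alias("cnt_b"))
# Expect an empty result: the two counts should agree for every category.
a.join(b, "first_cate_cd").filter(col("cnt_a") != col("cnt_b")).show()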

When the amount of data is small, there is no difference in computing speed; even after the data volume reaches a certain level, the difference is not large, though the dropDuplicates() + count() route is sometimes slightly faster, as noted above.

Origin blog.csdn.net/eylier/article/details/128719565