The difference between reduceByKey and groupByKey

(1) When reduceByKey is used, Spark can combine values that share a key within each partition before any data is moved. The figure below illustrates what happens in reduceByKey: pairs with the same key are combined on the same machine (by the lambda function passed to reduceByKey) before the shuffle. The same lambda function is then called again on each node to reduce the pre-combined values into a final result. The whole process is as follows:
[Figure: reduceByKey — values with the same key are combined within each partition before the shuffle]
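The two-step behavior above can be sketched in pure Python. This is a simulation of the semantics, not Spark's actual implementation: the lists of pairs stand in for partitions, and the merge step stands in for the shuffle.

```python
def reduce_by_key(partitions, func):
    """Simulate Spark's reduceByKey: combine values per partition
    (map-side combine) before merging across partitions."""
    # Step 1: map-side combine — reduce within each partition first,
    # so at most one pair per key per partition needs to be shuffled.
    combined = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = func(local[k], v) if k in local else v
        combined.append(local)
    # Step 2: "shuffle" — merge the already-reduced per-partition results.
    result = {}
    for local in combined:
        for k, v in local.items():
            result[k] = func(result[k], v) if k in result else v
    return result

partitions = [[("a", 1), ("b", 1), ("a", 1)],
              [("a", 1), ("b", 1)]]
print(reduce_by_key(partitions, lambda x, y: x + y))  # {'a': 3, 'b': 2}
```

Note that only 4 pre-combined pairs (two keys per partition) cross the "network" here, rather than all 5 raw pairs.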

(2) When groupByKey is used, Spark receives no combining function, so it must first shuffle every key-value pair across the cluster. The consequence is heavy traffic between cluster nodes, resulting in transmission delays. The whole process is as follows:
[Figure: groupByKey — every key-value pair is shuffled across the network before grouping]
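For contrast, here is the same kind of pure-Python sketch for groupByKey. Because no combining function is supplied, every raw pair must cross the simulated "network" before the values can be gathered:

```python
def group_by_key(partitions):
    """Simulate groupByKey: with no combining function available,
    every (key, value) pair must be shuffled before grouping."""
    # Full shuffle: all raw pairs move across the network.
    shuffled = [pair for part in partitions for pair in part]
    groups = {}
    for k, v in shuffled:
        groups.setdefault(k, []).append(v)
    return groups

partitions = [[("a", 1), ("b", 1), ("a", 1)],
              [("a", 1), ("b", 1)]]
print(group_by_key(partitions))  # {'a': [1, 1, 1], 'b': [1, 1]}
```

All 5 raw pairs are shuffled here, versus the 4 pre-combined pairs in the reduceByKey case; on real data the gap grows with the number of duplicate keys per partition.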

Therefore, reduceByKey is better than groupByKey when performing aggregations over large datasets.
In addition, when what you actually need is a per-key aggregate rather than the grouped values themselves, the following functions should also be preferred over groupByKey:

(1) combineByKey combines the values for each key, and the type of the combined result may differ from the type of the input values.
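A sketch of that type change, again as a pure-Python simulation rather than Spark's API: the classic per-key average, where the input values are plain numbers but the combiner is a (sum, count) tuple.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Simulate combineByKey: per-partition combine with a combiner
    type that may differ from the input value type."""
    per_part = []
    for part in partitions:
        local = {}
        for k, v in part:
            # First value for a key creates a combiner; later values merge in.
            local[k] = merge_value(local[k], v) if k in local else create_combiner(v)
        per_part.append(local)
    result = {}
    for local in per_part:
        for k, c in local.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

# Input values are ints; combiners are (sum, count) tuples.
scores = [[("a", 90), ("b", 80)], [("a", 70)]]
sums = combine_by_key(scores,
                      lambda v: (v, 1),                               # createCombiner
                      lambda c, v: (c[0] + v, c[1] + 1),              # mergeValue
                      lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]))  # mergeCombiners
averages = {k: s / n for k, (s, n) in sums.items()}
print(averages)  # {'a': 80.0, 'b': 80.0}
```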

(2) foldByKey merges the values of each key using an associative function together with a neutral "zero value".
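A minimal sketch of foldByKey's semantics in the same pure-Python style: it behaves like reduceByKey, except each key's fold starts from the supplied zero value (which must be neutral for the function, e.g. 0 for addition).

```python
def fold_by_key(partitions, zero, func):
    """Simulate foldByKey: per-partition combine like reduceByKey,
    but each key's fold starts from a neutral zero value."""
    result = {}
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = func(local.get(k, zero), v)   # fold within the partition
        for k, v in local.items():
            result[k] = func(result.get(k, zero), v)  # merge partition results
    return result

pairs = [[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]
print(fold_by_key(pairs, 0, lambda x, y: x + y))  # {'a': 6, 'b': 4}
```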


Origin blog.csdn.net/weixin_43614067/article/details/106924926