The difference between the Spark operators reduceByKey and groupByKey


Supplement: What is the difference between reduceByKey and groupByKey?
[A basic idea for optimizing code]
(1) With reduceByKey, Spark can combine the values that share a key within each partition before moving any data.
The figure below shows what happens in reduceByKey.
Note how pairs with the same key on the same machine are combined (by the lambda function passed to reduceByKey) before the data is moved.
The lambda function is then called again on each partition to reduce all values into a final result.
The whole process is as follows:
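
The steps above can be sketched in plain Python. This is only a simplified model of the idea, not Spark's implementation: the helper reduce_by_key, the sample partitions, and the records_moved counter are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from operator import add

def reduce_by_key(partitions, func):
    """Toy model of reduceByKey: combine locally, shuffle, then reduce again."""
    # Step 1: map-side combine. Each partition merges its own values per key.
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Step 2: shuffle. At most one record per key per partition is moved.
    shuffled = defaultdict(list)
    records_moved = 0
    for local in combined:
        for key, value in local.items():
            shuffled[key].append(value)
            records_moved += 1

    # Step 3: reduce side. The same function merges the partial results.
    result = {}
    for key, values in shuffled.items():
        acc = values[0]
        for v in values[1:]:
            acc = func(acc, v)
        result[key] = acc
    return result, records_moved

# Three hypothetical partitions holding nine (word, 1) pairs in total.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1)],
]
counts, moved = reduce_by_key(partitions, add)
print(counts, moved)
```

In this model only six partial results cross the simulated network, although the input holds nine pairs.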



(2) With groupByKey, since it does not take a function, Spark cannot combine anything beforehand and has to move all key-value pairs first.

As a result, the overhead between cluster nodes is large, which causes transmission delays.

The whole process is as follows:
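
For comparison, here is a plain-Python sketch of the groupByKey flow under the same simplifying assumptions. The helper group_by_key and the sample data are hypothetical and only model the behavior described above; they are not Spark code.

```python
from collections import defaultdict

def group_by_key(partitions):
    """Toy model of groupByKey: every pair is shuffled, nothing is combined first."""
    shuffled = defaultdict(list)
    records_moved = 0
    for part in partitions:
        for key, value in part:
            # No function is available, so each pair is moved as-is.
            shuffled[key].append(value)
            records_moved += 1
    return dict(shuffled), records_moved

# Three hypothetical partitions holding nine (word, 1) pairs in total.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1)],
]
groups, moved = group_by_key(partitions)

# Any aggregation can only happen after the shuffle.
counts = {key: sum(values) for key, values in groups.items()}
print(groups, counts, moved)
```

Here all nine pairs are moved before any summing takes place, which is the extra transfer cost the text refers to.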



Therefore, reduceByKey performs better than groupByKey for computations on large datasets.
In addition, when the goal is to aggregate values by key, the following functions should also be preferred over groupByKey:
  (1) combineByKey combines the values of each key, and can be used when the type of the combined result differs from the type of the input values.
  (2) foldByKey merges all the values of each key using an associative function and a "zero value".
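
The two functions in the list above can be modelled in plain Python as follows. This is a sketch under assumptions: combine_by_key and fold_by_key are hypothetical helpers that imitate the semantics of the Spark operators on in-memory lists, and the score data is made up for illustration.

```python
from operator import add

def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Toy model of combineByKey: the combiner type may differ from the value type."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                # Fold another value into this partition's combiner.
                local[key] = merge_value(local[key], value)
            else:
                # First value for this key in this partition.
                local[key] = create_combiner(value)
        partials.append(local)

    # Merge the per-partition combiners into one result per key.
    result = {}
    for local in partials:
        for key, comb in local.items():
            result[key] = merge_combiners(result[key], comb) if key in result else comb
    return result

def fold_by_key(partitions, zero_value, func):
    """Toy model of foldByKey: one function plus a neutral zero value."""
    return combine_by_key(partitions, lambda v: func(zero_value, v), func, func)

# Hypothetical (subject, score) pairs spread over two partitions.
partitions = [
    [("math", 90), ("art", 70)],
    [("math", 80), ("art", 90), ("math", 70)],
]

# combineByKey: int values go in, (sum, count) tuples come out.
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {key: total / n for key, (total, n) in sum_count.items()}

# foldByKey: sum the scores per key, starting from the zero value 0.
totals = fold_by_key(partitions, 0, add)
print(sum_count, averages, totals)
```

The average example shows why combineByKey is useful when the result type (a sum and a count) is not the same as the input type (a single score).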




















































