The difference between groupByKey and reduceByKey:

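Both operations trigger a shuffle. groupByKey does no combining before the shuffle: every key-value pair is sent across the network as it is. reduceByKey first combines the values of each key inside every partition (a map-side combine) and only then shuffles the partial results, which reduces the shuffle I/O, so it is usually more efficient.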
Example:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}

object GroupyKeyAndReduceByKeyDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val config = new SparkConf().setAppName("GroupyKeyAndReduceByKeyDemo").setMaster("local")
    val sc = new SparkContext(config)
    val arr = Array("val config", "val arr")
    val socketDS = sc.parallelize(arr).flatMap(_.split(" ")).map((_, 1))
    // Both groupByKey and reduceByKey trigger a shuffle.
    // groupByKey shuffles every pair unchanged, with no combining before the shuffle;
    // reduceByKey combines the values per key inside each partition first, which reduces shuffle I/O.
    socketDS.groupByKey().map(tuple => (tuple._1, tuple._2.sum)).foreach(x => {
      println(x._1 + " " + x._2)
    })
    println("----------------------")
    socketDS.reduceByKey(_ + _).foreach(x => {
      println(x._1 + " " + x._2)
    })
    sc.stop()
  }
}
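
The map-side combine described above can be illustrated without Spark. The following is only a minimal sketch in plain Scala, not Spark's actual implementation: the object name MapSideCombineSketch, its helper functions, and the two-partition data layout are all hypothetical and chosen for illustration. It counts how many records would leave the partitions in each case.

```scala
object MapSideCombineSketch {
  type Pair = (String, Int)

  // groupByKey-style: every pair leaves its partition unchanged.
  def shuffledWithoutCombine(partitions: Seq[Seq[Pair]]): Seq[Pair] =
    partitions.flatten

  // reduceByKey-style: values of the same key are merged inside each
  // partition first, so at most one pair per key leaves a partition.
  def shuffledWithCombine(partitions: Seq[Seq[Pair]]): Seq[Pair] =
    partitions.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq
    }

  // Reduce side: merge whatever arrived into the final per-key totals.
  def reduceSide(records: Seq[Pair]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // Hypothetical layout: two partitions with repeated keys.
    val partitions: Seq[Seq[Pair]] = Seq(
      Seq(("val", 1), ("config", 1), ("val", 1)),
      Seq(("val", 1), ("arr", 1), ("arr", 1))
    )
    val plain = shuffledWithoutCombine(partitions)
    val combined = shuffledWithCombine(partitions)
    println("records shuffled without combine: " + plain.size)
    println("records shuffled with combine: " + combined.size)
    println("same totals: " + (reduceSide(plain) == reduceSide(combined)))
  }
}
```

With this layout, six records are shuffled without the combine and four with it, while the final totals are identical. The saving grows with the number of repeated keys per partition; if no key repeats inside a partition, the combine step saves nothing.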

