Spark's map, mapPartitions, and mapPartitionsWithIndex operators: differences and usage

map

  • Applies the given function to each element of the RDD, one element at a time, and returns a new RDD made up of the results.

Function signature

def map[U: ClassTag](f: T => U): RDD[U]

Code example

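import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
val sc = new SparkContext(conf)
// Five elements spread over two partitions
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5), 2)
// The function is applied to each element individually
val newRDD: RDD[Int] = rdd.map(_ * 2)
newRDD.collect().foreach(println)
sc.stop()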

mapPartitions

  • Applies the given function once per partition: the function receives an iterator over all the elements of one partition and returns an iterator over the results.

Function signature

def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Code example

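import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
// The function is called once per partition; datas iterates over that partition's elements
val newRDD: RDD[Int] = rdd.mapPartitions(datas => datas.map(_ * 2))
// collect() brings the results back to the driver so they are printed there
newRDD.collect().foreach(println)
sc.stop()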

mapPartitionsWithIndex

  • Works like mapPartitions, but the function also receives the index of the partition it is processing.

Function signature

def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Code example

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8), 4)
// Double the elements of the second partition (index 1); leave the other partitions unchanged
val newRDD: RDD[Int] = rdd.mapPartitionsWithIndex { (index, datas) =>
  index match {
    case 1 => datas.map(_ * 2)
    case _ => datas
  }
}
newRDD.collect().foreach(println)
sc.stop()
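
To check which elements the partition index refers to, each element can be tagged with the index of the partition that holds it. The following is a minimal sketch, not part of the original example: the names PartitionLayoutSketch and tagWithPartition are made up for illustration, and the RDD is the same eight-element, four-partition RDD used above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionLayoutSketch {
  // Pairs every element with the index of the partition that holds it.
  // (Hypothetical helper, introduced only for this illustration.)
  def tagWithPartition(sc: SparkContext): Array[(Int, Int)] = {
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8), 4)
    rdd.mapPartitionsWithIndex((index, datas) => datas.map(x => (index, x))).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PartitionLayoutSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try tagWithPartition(sc).foreach(println)
    finally sc.stop()
  }
}
```

With this input the eight elements are split evenly, so the pairs come out as (0,1), (0,2), (1,3), (1,4), (2,5), (2,6), (3,7), (3,8). Partition 1 therefore holds 3 and 4, which are the only values the `case 1` branch doubles.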

Differences between the three

  • map calls its function once per element, so it handles one element at a time.
  • mapPartitions calls its function once per partition and passes it an iterator over that partition's elements. The data of a partition is not released until the whole partition has been processed, so a function that holds on to a large partition (for example by materializing the iterator into a list) can cause an out-of-memory (OOM) error.
  • mapPartitionsWithIndex also processes one partition at a time, exactly like mapPartitions, but its function additionally receives the partition's index in the original RDD. Use it when only certain partitions should be processed.
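
The per-element versus per-partition distinction can be made visible by counting how often each function is invoked. This is a rough sketch under a few assumptions: the names CallCountSketch and countCalls are invented here, the job runs in local mode, and no task is retried (accumulator updates made inside transformations can be applied more than once if a task is re-executed).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CallCountSketch {
  // Returns (invocations of the map function, invocations of the mapPartitions function).
  // (Hypothetical helper, introduced only for this illustration.)
  def countCalls(sc: SparkContext): (Long, Long) = {
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8), 4)
    val mapCalls = sc.longAccumulator("mapCalls")
    val partitionCalls = sc.longAccumulator("partitionCalls")

    // Invoked once for every element
    rdd.map { x => mapCalls.add(1L); x * 2 }.collect()
    // Invoked once for every partition; the inner iterator map still visits each element
    rdd.mapPartitions { datas => partitionCalls.add(1L); datas.map(_ * 2) }.collect()

    (mapCalls.value.longValue, partitionCalls.value.longValue)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CallCountSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countCalls(sc))
    finally sc.stop()
  }
}
```

For eight elements in four partitions, the map function runs eight times and the mapPartitions function runs four times, so the sketch prints (8,4).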

When to use each

  • mapPartitions is suitable when plenty of memory is available, or when processing needs an expensive resource such as a database connection: the connection can be opened once per partition instead of once per element, which improves efficiency.
  • map is suitable when memory is limited.
  • mapPartitionsWithIndex fits the same cases as mapPartitions and is more convenient when only specific partitions need to be handled.
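
The database case might look roughly like the sketch below. FakeConnection is a hypothetical stand-in for a real client and does no I/O; ConnectionPerPartitionSketch and enrich are likewise invented names. The point is only that the connection is created once per partition rather than once per element.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for an expensive resource such as a database connection.
class FakeConnection {
  def lookup(id: Int): String = s"user-$id"
  def close(): Unit = ()
}

object ConnectionPerPartitionSketch {
  // Looks up every id, opening one connection per partition.
  def enrich(sc: SparkContext): Array[String] = {
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    rdd.mapPartitions { datas =>
      // Created inside the function, so it is built where the partition is processed
      val conn = new FakeConnection()
      // Iterators are lazy: materialize the results before the connection is closed
      try datas.map(conn.lookup).toList.iterator
      finally conn.close()
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ConnectionPerPartitionSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try enrich(sc).foreach(println)
    finally sc.stop()
  }
}
```

Calling toList holds the results of a whole partition in memory at once, which is the memory cost that makes map the safer choice when memory is tight.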

Origin blog.csdn.net/FlatTiger/article/details/115041713