[Spark] The difference between map and mapPartitions: a code example


1. Simple example

1. Map example

First, let's take a small example: put a few numbers in a collection and watch how they pass through map.
The code is as follows (example):

import org.apache.spark.{SparkConf, SparkContext}

object MapDemo {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("test")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 1)
    // num appears only once and in order, so it could also be written as _
    val value = rdd.map(num => {
      println(">>>>" + num)
      num
    })
    val value2 = value.map(num => {
      println("-----" + num)
      num
    })
    value2.collect()
    sc.stop()
  }
}

Output:

>>>>1
-----1
>>>>2
-----2
>>>>3
-----3
>>>>4
-----4

As shown above, with a single partition each element passes through the whole chain of map calls before the next element is processed.
Now set it to two partitions:

val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

The following output shows that the two partitions run in parallel, while within each partition the elements are still processed one after another, in order.

>>>>1
>>>>3
-----3
-----1
>>>>4
>>>>2
-----4
-----2

2. mapPartitions example

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("test")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    // the function receives an iterator over all the data of one partition
    val value = rdd.mapPartitions(iter => {
      println(">>>>")
      iter
    })
    value.collect()
    sc.stop()
  }
}

The input parameter of mapPartitions is a little different from that of map: map receives a single value (num), while mapPartitions receives an iterator representing all the data of one partition. With 2 partitions, for example, it prints only twice.

>>>>
>>>>

So mapPartitions executes one partition at a time; once all the data in a partition has been processed, that partition's data can be released. The two operators look quite similar, and note that the partitions themselves execute in no fixed order. If you only want the data of one particular partition, you can use mapPartitionsWithIndex, whose function receives the partition index together with the iterator, as in the sketch below.
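For illustration, here is a minimal sketch of picking out one partition with mapPartitionsWithIndex (the object name and the choice of partition 1 are just for this example):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionIndexDemo {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("test")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    // keep only the data of partition 1; every other partition returns an empty iterator
    val value = rdd.mapPartitionsWithIndex((index, iter) => {
      if (index == 1) iter else Iterator.empty
    })
    value.collect().foreach(println) // for this split, expected to print 3 and 4
    sc.stop()
  }
}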
But what if you want to get the maximum value within each partition? mapPartitions has an inherent advantage in implementing this:

import org.apache.spark.{SparkConf, SparkContext}

object MaxPerPartitionDemo {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("test")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    // take the maximum of each partition and return it as a one-element iterator
    val value = rdd.mapPartitions(iter => {
      List(iter.max).iterator
    })
    value.collect().foreach(println)
    sc.stop()
  }
}
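Assuming makeRDD splits List(1, 2, 3, 4) evenly into partitions (1, 2) and (3, 4), this should print:

2
4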

Summary

  • From the data-processing angle: the map operator processes the data of a partition one element at a time, similar to a serial operation, while the mapPartitions operator processes a whole partition as a batch.
  • From the functional angle: the map operator transforms each element of the data source one-to-one; it can neither remove nor add elements.
    The mapPartitions operator receives an iterator and must return an iterator; the number of elements is not required to stay the same, so it can add or remove data (see the sketch after this list).
  • From the performance angle: the map operator, working element by element, performs relatively poorly, while the mapPartitions operator, working batch-like, performs better.
    However, mapPartitions holds a whole partition's data in memory for a long time, which may exhaust memory and cause an out-of-memory error. When memory is limited, map is the safer choice.
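As a minimal sketch of the point above that mapPartitions can change the number of elements (the filter-and-duplicate logic is just an illustration):

import org.apache.spark.{SparkConf, SparkContext}

object ChangeCountDemo {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setMaster("local[*]").setAppName("test")
    val sc = new SparkContext(sparkconf)
    val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
    // drop the odd numbers and emit each remaining number twice:
    // the returned iterator need not have the same size as the input iterator
    val value = rdd.mapPartitions(iter => {
      iter.filter(_ % 2 == 0).flatMap(num => List(num, num))
    })
    value.collect().foreach(println) // expected: 2 2 4 4
    sc.stop()
  }
}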

Origin blog.csdn.net/qq_30285985/article/details/110719384