Difference between foreachPartition and mapPartitions in Spark



There are two types of operations in Spark: Transformations and Actions. The differences are as follows:




Transformation: represents a step in the computation. It returns an RDD[T], so transformations can be chained, and execution is lazily deferred.

Action: represents a concrete behavior. Its return value is not an RDD; it may be an object, a numeric value, or Unit (no return value). An action immediately triggers execution of the job.




The official documentation lists the Transformation methods as follows:

````
map
filter
flatMap
mapPartitions
mapPartitionsWithIndex
sample
union
intersection
distinct
groupByKey
reduceByKey
aggregateByKey
sortByKey
join
cogroup
cartesian
pipe
coalesce
repartition
repartitionAndSortWithinPartitions

````



The official documentation lists the Action methods as follows:
````
reduce
collect
count
first
take
takeSample
takeOrdered
saveAsTextFile
saveAsSequenceFile
saveAsObjectFile
countByKey
foreach

````


In day-to-day development, commonly used methods such as count, collect, and saveAsTextFile are all actions: their results are either empty (Unit), a value, or an object. Others such as map and filter return values of RDD type. So a simple way to distinguish the two kinds of operations is to check whether the return value is of type RDD[T].
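This "check the return type" rule can be illustrated with an analogy to ordinary Scala collections, no Spark required: chainable methods hand back a new collection (like a Transformation returning an RDD), while terminal methods hand back a plain value (like an Action). The object and method names here are purely illustrative.

````scala
object ReturnTypeDemo {
  def demo(): (List[Int], Int) = {
    val data = List(1, 2, 3, 4, 5)
    // map/filter return a new List -> chainable, analogous to Spark Transformations
    val transformed = data.map(_ * 2).filter(_ > 4)
    // size returns a plain Int -> terminal, analogous to a Spark Action
    val n = transformed.size
    (transformed, n)
  }

  def main(args: Array[String]): Unit = println(demo())
}
````
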


Back to the topic: the difference between foreachPartition and mapPartitions. Careful readers may notice that foreachPartition does not appear in the lists above. The likely reason is that the official documentation only lists the most commonly used methods, but that does not affect our use of it. First, using the criterion above, let's determine which kind of operation foreachPartition is. Its API in the official documentation is as follows:
````
public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)

Applies a function f to each partition of this RDD.

Parameters:
f - (undocumented)
````


Since the return value above is void, foreachPartition is an action, while mapPartitions appears in the Transformation list and is therefore a transformation. They also differ in their application scenarios: mapPartitions returns an RDD, on which further operations can be chained, whereas foreachPartition has no return value and, being an action, is typically used at the end of a program, for example to write data out to a storage system such as MySQL, Elasticsearch, or HBase.


Of course, data can also be written out inside a Transformation, but an action must still trigger it: Transformations are executed lazily, and if no action method triggers the job, the Transformation never runs. This needs attention.
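This lazy-evaluation behavior can be demonstrated without a Spark cluster using a plain Scala Iterator, whose map is also lazy: the function passed to map only runs once a terminal operation consumes the iterator. This is an analogy sketch, not Spark code; the names are illustrative.

````scala
object LazyDemo {
  def demo(): (Int, Int, Int) = {
    var executed = 0
    // like a Transformation: nothing runs yet, the function is only recorded
    val it = Iterator(1, 2, 3).map { x => executed += 1; x * 2 }
    val callsBeforeAction = executed // still 0: map is lazy
    val total = it.sum               // like an Action: consuming the iterator triggers execution
    (callsBeforeAction, executed, total)
  }

  def main(args: Array[String]): Unit = println(demo())
}
````
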



A foreachPartition example:

````scala
    val sparkConf = new SparkConf()
    sparkConf.setAppName("spark demo example")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)

    rdd.foreachPartition(partition => {
      // Do not call partition.size here: it would consume the iterator,
      // leaving no data for the foreach below, because an iterator
      // can only be traversed once
      partition.foreach(line => {
        // save(line): write the record to storage
      })
    })

    sc.stop()
````


An example of mapPartitions:
````scala
    val sparkConf = new SparkConf()
    sparkConf.setAppName("spark demo example")
    val sc = new SparkContext(sparkConf)
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)

    val result = rdd.mapPartitions(partition => {
      // Use map here, not foreach: foreach returns Unit,
      // while mapPartitions must return a new iterator
      partition.map(line => {
        // save(line): write the record to storage
        line
      })
    })

    result.count() // an action on the transformed RDD is required to trigger execution
    sc.stop()
````



Finally, note that when working with an iterator, we cannot print its size before looping over it. Calling size consumes the iterator, so the subsequent foreach will find no data. Remember: an Iterator can only be traversed once.
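This pitfall can be reproduced with a plain Scala Iterator, independent of Spark. The names below are illustrative:

````scala
object IteratorOnce {
  def demo(): (Int, Int) = {
    val partition = Iterator(1, 2, 3)
    val n = partition.size          // consumes every element of the iterator
    val leftover = partition.toList // the iterator is now exhausted: empty
    (n, leftover.size)
  }

  def main(args: Array[String]): Unit = println(demo())
}
````

Running demo() yields (3, 0): size reports three elements, but afterwards the iterator has nothing left to yield.
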




Reference documentation:

http://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/rdd/RDD.html

https://spark.apache.org/docs/2.1.0/rdd-programming-guide.html

If you have any questions, you can scan the QR code and follow the WeChat public account 我是攻城师 (woshigcs), and leave a message there for consultation. Technical debt must not be owed, and neither must health debt. On the road of seeking the Tao, may we walk together.
