There are two types of operations in Spark: Transformations and Actions. The differences are as follows:
Transformation: a transformation describes a step of our computation and returns an RDD[T]. Transformations can be chained together and are lazily evaluated (delayed triggering).
Action: an action produces a concrete result. Its return value is not an RDD; it can be an object, a numeric value, or Unit (no return value). An action immediately triggers execution of the job.
The set of Transformation methods in the official documentation is as follows:
```` map filter flatMap mapPartitions mapPartitionsWithIndex sample union intersection distinct groupByKey reduceByKey aggregateByKey sortByKey join cogroup cartesian pipe coalesce repartition repartitionAndSortWithinPartitions ````
The set of Action methods in the official documentation is as follows:
```` reduce collect count first take takeSample takeOrdered saveAsTextFile saveAsSequenceFile saveAsObjectFile countByKey foreach ````
Relating this to daily development: commonly used methods such as count, collect, and saveAsTextFile all belong to the action type, and their result is either empty, a value, or an object. Others such as map and filter return RDDs. So, as a simple rule of thumb, you can distinguish the two by whether the return value is of type RDD[T].
Now back to the topic: the difference between foreachPartition and mapPartitions. Careful readers may notice that foreachPartition does not appear in the method lists above. The reason is probably that the official documentation only lists the most commonly used methods, but this does not affect our usage. First, let's determine which kind of operation foreachPartition is according to the distinction above. The API of this method in the official documentation is as follows:
````
public void foreachPartition(scala.Function1<scala.collection.Iterator<T>,scala.runtime.BoxedUnit> f)

Applies a function f to each partition of this RDD.

Parameters:
f - (undocumented)
````
Since the return value is void, foreachPartition must be an action, while mapPartitions appears in the Transformation list above, so it is a transformation. Their application scenarios also differ: mapPartitions returns a new RDD on which you can continue to apply other operations, while foreachPartition has no return value and, being an action, is generally used at the end of a program, for example to write data out to a storage system such as MySQL, Elasticsearch, or HBase.
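The map-versus-foreach contrast at the heart of this difference can be seen with plain Scala collections, using `grouped` to stand in for partitions. This is only an analogy sketch, not Spark code:

````scala
// Plain-Scala analogy: grouped(2) stands in for RDD partitions.
val data = Seq(1, 2, 3, 4, 5)

// "mapPartitions"-style: map produces a value per element,
// so the results can be collected and chained further.
val mapped = data.grouped(2).flatMap(partition => partition.map(_ * 10)).toList
println(mapped) // List(10, 20, 30, 40, 50)

// "foreachPartition"-style: foreach returns Unit,
// so it is only useful for side effects, such as writing to a sink.
val sink = scala.collection.mutable.ArrayBuffer[Int]()
data.grouped(2).foreach(partition => partition.foreach(sink += _))
println(sink.toList) // List(1, 2, 3, 4, 5)
````

The same shape holds in Spark: mapPartitions hands you an iterator per partition and expects an iterator back, while foreachPartition hands you an iterator purely for its side effects.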
Of course, data can also be written out inside a Transformation, but an action is still required to trigger it, because transformations are evaluated lazily: if no action triggers the job, the transformation will never execute. This needs attention.
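Spark's lazy evaluation can be illustrated with Scala's own lazy Iterator. This is a plain-Scala analogy, not Spark code: the function passed to `map` runs only when a terminal operation (playing the role of an action) consumes the iterator.

````scala
// Plain-Scala analogy: Iterator.map is lazy, like a Spark transformation.
var sideEffects = 0

val it = Iterator(1, 2, 3).map { x =>
  sideEffects += 1 // pretend this writes a record out
  x * 2
}

// Nothing has executed yet -- the "transformation" is only a recipe.
println(sideEffects) // 0

// A terminal operation (like a Spark action) forces evaluation.
val total = it.sum
println(sideEffects) // 3
println(total)       // 12
````

If no terminal operation were ever called, the side effect would never happen, which is exactly why writes buried in a Spark transformation silently do nothing without an action.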
A foreachPartition example:
````scala
val sparkConf = new SparkConf()
sparkConf.setAppName("spark demo example") // configure before creating the context
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)
rdd.foreachPartition(partition => {
  // Do not call partition.size here: an iterator can only be traversed once,
  // so doing so would leave no data for the foreach below.
  partition.foreach(line => {
    // save(line) -- write the record to storage
  })
})
sc.stop()
````
An example of mapPartitions:
````scala
val sparkConf = new SparkConf()
sparkConf.setAppName("spark demo example") // configure before creating the context
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 3)
val result = rdd.mapPartitions(partition => {
  // Only map can be used here, not foreach, because foreach has no return value
  partition.map(line => {
    // save(line) -- write the record to storage
    line
  })
})
result.count() // an action on the result is required to trigger execution
sc.stop()
````
Finally, note that when working with an iterator, we cannot print its size before the loop: once the size method is called, the iterator is consumed, and the subsequent foreach will find it empty. Remember that an Iterator can only be traversed once.
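This single-pass behavior is easy to reproduce with a plain Scala Iterator, independent of Spark:

````scala
// An Iterator can only be traversed once.
val it = Iterator(1, 2, 3)

// size consumes the iterator entirely.
val n = it.size
println(n) // 3

// A second traversal finds nothing: the buffer stays empty.
val collected = scala.collection.mutable.ArrayBuffer[Int]()
it.foreach(collected += _)
println(collected.isEmpty) // true
````

If you need both the size and the elements, materialize the iterator first (for example with toList) and work on the resulting collection instead.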
Reference documentation:
http://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/rdd/RDD.html
https://spark.apache.org/docs/2.1.0/rdd-programming-guide.html
If you have any questions, you can scan the QR code and follow the WeChat public account woshigcs, then leave a message in the background for consultation. Technical debt cannot be owed, and neither can health debt. On the road of seeking the Tao, we walk together.