2023_Spark_Experiment 12: Use of Spark advanced operators

Master the use of Spark advanced operators in code
Similarity analysis
All three functions (map, mapPartitions, mapPartitionsWithIndex) are transformation operators, so they are evaluated lazily.
Difference analysis
The map function processes data one element at a time: the function you pass to map receives a single element (plus whatever other parameters you need to pass).
The mapPartitions function processes the data of a whole partition together. In other words, its input is an iterator over all the data in one partition; the function can process those elements one by one and then return all the results as another iterator. It can also be combined with yield for cleaner code.
mapPartitions is a variant of map, and both process partitions in parallel.
The main difference between the two is the granularity of the call: the function passed to map is applied to each element of the RDD, while the function passed to mapPartitions is applied to each partition.
mapPartitionsWithIndex is not much different from mapPartitions: mapPartitions is essentially mapPartitionsWithIndex with the partition index parameter hidden, while mapPartitionsWithIndex hands that index to your function.
Suppose an RDD has 10 elements divided into 3 partitions. If you use map, its input function is called 10 times; if you use mapPartitions, its input function is called only 3 times, once per partition, as the sketch below illustrates.
mapPartitionsWithIndex additionally lets you work with the partition index.
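
A minimal sketch (assuming a SparkContext named sc, as in the examples below) contrasting how often the two input functions are invoked for 10 elements in 3 partitions:

// 10 elements spread over 3 partitions
val data = sc.parallelize(1 to 10, numSlices = 3)

// map: the supplied function runs once per element (10 times here)
val doubled = data.map(x => x * 2)

// mapPartitions: the supplied function runs once per partition (3 times here),
// receiving an Iterator over that partition's elements and returning another Iterator
val doubledByPartition = data.mapPartitions(iter => iter.map(_ * 2))

doubled.collect()
doubledByPartition.collect()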
1、mapPartitionsWithIndex
 

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsWithIndexDemo {

  // Tag every element with the index of the partition it lives in
  def getPartInfo: (Int, Iterator[Int]) => Iterator[String] = (index: Int, iter: Iterator[Int]) => {
    iter.map(x => "[ PartId " + index + ", elems: " + x + " ]")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("RddMapPartitionsWithIndexDemo")
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 9, 6, 7, 8), numSlices = 3)
    val rdd2 = rdd1.mapPartitionsWithIndex(getPartInfo)
    rdd2.collect().foreach(println)

    sc.stop()
  }
}
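
Running this should print something like the following (parallelize slices a local collection into contiguous chunks, so the exact grouping assumes that behavior):

[ PartId 0, elems: 1 ]
[ PartId 0, elems: 2 ]
[ PartId 0, elems: 3 ]
[ PartId 1, elems: 4 ]
[ PartId 1, elems: 5 ]
[ PartId 1, elems: 9 ]
[ PartId 2, elems: 6 ]
[ PartId 2, elems: 7 ]
[ PartId 2, elems: 8 ]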

2、aggregate
First we create an RDD

scala> val rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24


scala> rdd1.collect
warning: there was one feature warning; re-run with -feature for details
res24: Array[Int] = Array(1, 2, 3, 4, 5)

This RDD has only 1 partition, containing 5 elements: 1, 2, 3, 4, 5.
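
If you want to verify that, you can check the partition count directly (the default number of partitions depends on the master and parallelism settings, so it may differ on your machine; it should be 1 for the walk-through below to match):

scala> rdd1.partitions.length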
 
Then let's apply the aggregate method.
 
Before using aggregate, we should first define two functions to be used as input parameters for aggregate.
scala> 
// Entering paste mode (ctrl-D to finish)
def pfun1(p1: Int, p2: Int): Int = {
p1 * p2
}

// Exiting paste mode, now interpreting.
pfun1: (p1: Int, p2: Int)Int

scala>

 
scala> 
// Entering paste mode (ctrl-D to finish)
def pfun2(p3: Int, p4: Int): Int = {
p3 + p4
}

// Exiting paste mode, now interpreting.
pfun2: (p3: Int, p4: Int)Int

scala>
That is the second function, pfun2; it simply adds its two arguments.
 
Then we can finally start applying our aggregate method.
scala> rdd1.aggregate(3)(pfun1, pfun2)
res25: Int = 363

scala>
The output is 363! How is this result calculated?
 
First, our zeroValue, the initial value, is 3. From the introduction above we know that pfun1 is applied first. Because our RDD has only one partition, there is only a single pfun1 pass over the data in the whole operation. Its calculation proceeds as follows.

First, the initial value 3 is used as pfun1's parameter p1, and the first value in the RDD, 1, as parameter p2, giving 3 * 1 = 3. That result 3 is then passed in as p1 and the second value in the RDD, 2, as p2, so the second result is 3 * 2 = 6. Continuing in the same way, once pfun1 has processed the whole partition the result is 3 * 1 * 2 * 3 * 4 * 5 = 360. The application of pfun1 is essentially a running fold over the partition's elements.
After the first parameter function pfun1 of the aggregate method has finished, we have the value 360, and the second parameter function pfun2 is executed next.

The execution of pfun2 is similar to that of pfun1: the zeroValue is again passed in as the first argument. Here zeroValue, which is 3, is passed in as p3, and pfun1's result 360 as p4, giving 3 + 360 = 363. Because pfun1 produced only one partition result, the whole aggregate computation is now finished and the final value is 363.
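
The same computation can be written out as a plain Scala fold (a sketch of the single-partition case just described, reusing the pfun1 and pfun2 defined above):

// seqOp: fold the zero value 3 through the partition's elements with pfun1
val partitionResult = List(1, 2, 3, 4, 5).foldLeft(3)(pfun1)   // 3 * 1 * 2 * 3 * 4 * 5 = 360
// combOp: fold the zero value 3 through the per-partition results with pfun2
val finalResult = List(partitionResult).foldLeft(3)(pfun2)     // 3 + 360 = 363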

import org.apache.spark.{SparkConf, SparkContext}

object RddAggregateDemo {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("RddDemos")
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List("12", "23", "345", ""), numSlices = 2)
    // The code below is equivalent to
    // rdd1.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y)
    val result = rdd1.aggregate("")(func1, func2)
    println(result)
    sc.stop()
  }

  // (x, y) => math.min(x.length, y.length).toString
  def func1: (String, String) => String = (x: String, y: String) => {
    println("<x: " + x + ", x.len: " + x.length + ">, <y: " + y + ", y.len: " + y.length + ">")
    val ret = math.min(x.length, y.length).toString
    println("func1 ret: " + ret)
    ret
  }

  // (x, y) => x + y
  def func2: (String, String) => String = (x: String, y: String) => {
    println("======== " + (x + y))
    x + y
  }
}
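
For reference: with two partitions, partition 0 holds "12" and "23" and partition 1 holds "345" and "" (assuming parallelize's usual contiguous split). func1 reduces partition 0 to "1" and partition 1 to "0", and func2 then concatenates the partition results onto the empty zero value, so the program typically prints "10" or "01"; which of the two you get depends on the order in which the partition results reach the driver.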

3、aggregateByKey
Create an RDD from a Scala collection by parallelizing it:
scala> val pairRdd = sc.parallelize(List(("cat",2),("cat",5),("mouse",4),("cat",12),("dog",12),("mouse",2)),2)
This pairRdd has two partitions. One partition stores:
("cat",2),("cat",5),("mouse",4)
The other partition stores:
("cat",12),("dog",12),("mouse",2)
Then execute the following statement:
scala> pairRdd.aggregateByKey(100)(math.max(_, _), _ + _).collect
Result:
res0: Array[(String,Int)] = Array((dog,100),(cat,200),(mouse,200))
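
Before walking through the principle, you can check how the elements were actually split across the two partitions (a quick sketch that reuses the mapPartitionsWithIndex idea from section 1; the split quoted above assumes parallelize's contiguous slicing):

scala> pairRdd.mapPartitionsWithIndex((idx, it) => it.map(kv => "part " + idx + ": " + kv)).collect.foreach(println)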
The following is a detailed explanation of the principle of executing the above statement:
aggregateByKey means: aggregate according to key
Step 1: Within each partition, group the data that share the same key
Partition 1
("cat",(2,5)),("mouse",4)
Partition 2
("cat",12),("dog",12),("mouse",2)
Step 2: Find the local maximum value
The first function passed in, math.max(_, _), is applied inside each partition; its job is to find the maximum value of each key within that partition.
Pay special attention to the 100 in aggregateByKey(100)(math.max(_, _), _ + _): it is an initial value.
When finding the maximum within each partition, the 100 is included among the values of every key, so the partitions look like this:
Partition 1
("cat",(2,5,100)),("mouse",(4,100))
After taking the maximum this becomes:
("cat",100),("mouse",100)
Partition 2
("cat",(12,100)),("dog",(12,100)),("mouse",(2,100))
After taking the maximum this becomes:
("cat",100),("dog",100),("mouse",100)
Step 3: Overall aggregation
The per-partition results are then combined with the second function, _ + _; at this stage the 100 no longer participates.
The final result is:
(dog,100),(cat,200),(mouse,200)
Values with the same key in the PairRDD are aggregated, and a neutral initial value is used during the aggregation. As with the aggregate function, the type of aggregateByKey's return value does not have to match the type of the values in the RDD. Because aggregateByKey aggregates the values of each key, it still returns a PairRDD whose elements are each key together with its aggregated value, whereas aggregate directly returns a single non-RDD result; the sketch below shows that difference.
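
A minimal sketch of that return-type difference, using the pairRdd created above:

// aggregate collapses the whole RDD into a single value on the driver
val total: Int = pairRdd.map(_._2).aggregate(0)(_ + _, _ + _)

// aggregateByKey keeps one aggregated value per key and still returns a pair RDD
val perKey: org.apache.spark.rdd.RDD[(String, Int)] = pairRdd.aggregateByKey(0)(_ + _, _ + _)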


import org.apache.spark.{SparkConf, SparkContext}

object RddAggregateByKeyDemo {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("RddAggregateByKeyDemo")
    val sc = new SparkContext(conf)
    val pairRDD = sc.parallelize(List(("cat", 2), ("cat", 5), ("mouse", 4), ("cat", 12), ("dog", 12), ("mouse", 2)), numSlices = 2)
    // func1 finds the per-partition maximum of each key (with 100 as the initial value),
    // func2 then sums the per-partition maxima for each key
    val rdd = pairRDD.aggregateByKey(100)(func1, func2)
    val resultArray = rdd.collect
    resultArray.foreach(println)
    sc.stop()
  }

  // Optional helper: tag each element with its partition id (usable with mapPartitionsWithIndex to inspect the split)
  def showPartInfo(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
    iter.map(x => "[partID:" + index + ", val: " + x + "]")
  }

  // (x, y) => math.max(x, y)
  def func1: (Int, Int) => Int = (x: Int, y: Int) => {
    println("<x: " + x + ", y: " + y + ">")
    val ret = math.max(x, y)
    println("func1 max: " + ret)
    ret
  }

  // (x, y) => x + y
  def func2: (Int, Int) => Int = (x: Int, y: Int) => {
    println("======== func2 x: " + x + ", y: " + y)
    println("======== func2 ret ===== " + (x + y))
    x + y
  }
}
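
With the (func1, func2) combination this program should print the same result as the shell example above: (dog,100), (cat,200), (mouse,200), although the order of the pairs in the collected array is not guaranteed.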


Origin blog.csdn.net/pblh123/article/details/133084778