Must Read | Spark's repartitioning and sorting

Talking about big data

Yesterday we covered how to use mapPartitions. As everyone should know, mapPartitions applies a map operation to an entire partition at a time. By default, the partitions of a PairRDD read from HDFS follow the physical HDFS blocks; if the files cannot be split, the number of partitions equals the number of HDFS files. We can also pass a HashPartitioner to the partitionBy operator to repartition the RDD, so that records whose keys have the same hashCode (more precisely, the same hashCode modulo the number of partitions) land in the same partition.
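
As a minimal sketch of the repartitioning described above, the snippet below calls partitionBy with a HashPartitioner and then uses glom() to see which keys ended up together. The object name PartitionBySketch, the helper keysPerPartition, and the sample pairs are made up for illustration, and a local master is assumed.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical helper for illustration; not part of Spark.
object PartitionBySketch {

  // Returns, for each partition, the set of keys stored in it.
  def keysPerPartition(sc: SparkContext, numPartitions: Int): Array[Set[String]] = {
    // Sample data (an assumption): a few pairs with repeated keys.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))
    pairs
      .partitionBy(new HashPartitioner(numPartitions)) // shuffle records by key hash
      .keys
      .glom()          // turn each partition into one Array
      .map(_.toSet)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitionBy-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      keysPerPartition(sc, 3).zipWithIndex.foreach { case (keys, i) =>
        println(s"partition $i: ${keys.mkString(", ")}")
      }
    } finally sc.stop()
  }
}
```

Each key should show up in exactly one partition, however many records carry it.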

Spark 1.2 introduced a very useful operator, repartitionAndSortWithinPartitions. It pushes the sort into Spark's shuffle machinery. If it is followed by mapPartitions, that operator then works on partitions that are already sorted by key, which is a bit like MapReduce. Unlike groupByKey, the data is not loaded into memory all at once; records are read from disk one at a time through an iterator, which keeps memory pressure to a minimum.
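
The pattern above could look roughly like the following sketch. After repartitionAndSortWithinPartitions, equal keys sit next to each other inside a partition, so mapPartitions can sum the values of each key while holding only one record at a time. The names SortedStreamSketch and sumRuns and the sample data are assumptions for illustration, and the default ascending Ordering[Int] is used.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical names for illustration; not part of Spark.
object SortedStreamSketch {

  // Collapses each run of adjacent equal keys into (key, sum of values).
  // It relies on the input being sorted by key and never buffers a whole group.
  def sumRuns(records: Iterator[(Int, Int)]): Iterator[(Int, Int)] = {
    val buffered = records.buffered
    new Iterator[(Int, Int)] {
      def hasNext: Boolean = buffered.hasNext
      def next(): (Int, Int) = {
        val key = buffered.head._1
        var sum = 0
        // Consume records only while they still belong to the current key.
        while (buffered.hasNext && buffered.head._1 == key) sum += buffered.next()._2
        (key, sum)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("sorted-stream-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq((3, 1), (1, 10), (3, 2), (2, 5), (1, 20)))
      val sums = pairs
        .repartitionAndSortWithinPartitions(new HashPartitioner(2)) // sort happens in the shuffle
        .mapPartitions(sumRuns)                                     // stream over the sorted records
      sums.collect().foreach(println)
    } finally sc.stop()
  }
}
```

The same shape works for any per-key computation that can be done in a single pass over sorted records.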

repartitionAndSortWithinPartitions can also be used for secondary sorting.
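
One common way to do secondary sorting with this operator, sketched here under assumptions, is to put both fields into a composite key, partition on the first field only, and let the tuple ordering sort on both. PrimaryKeyPartitioner, SecondarySortSketch, sortEvents, and the (user, timestamp, payload) record shape are all hypothetical names chosen for this illustration.

```scala
import org.apache.spark.{Partitioner, SparkContext}

// Hypothetical partitioner: routes a (primary, secondary) key by its primary part only,
// so all records of one primary key meet in the same partition.
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = key match {
    case (primary, _) => Math.floorMod(primary.hashCode, numPartitions)
    case other        => throw new IllegalArgumentException(s"Unexpected key: $other")
  }
}

object SecondarySortSketch {

  // Returns the events sorted by (user, timestamp) within each partition.
  def sortEvents(sc: SparkContext,
                 events: Seq[(String, Long, String)]): Array[((String, Long), String)] = {
    sc.parallelize(events)
      .map { case (user, ts, payload) => ((user, ts), payload) } // composite key
      // The implicit Ordering for (String, Long) compares user first, then timestamp.
      .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(2))
      .collect()
  }
}
```

From spark-shell this could be called as SecondarySortSketch.sortEvents(sc, Seq(("u1", 3L, "c"), ("u1", 1L, "a"))); the records of each user then come out in timestamp order.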

Here is a simple example.


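import org.apache.spark.Partitioner

// Routes each Int key to a partition by its hash code modulo the partition count.
class KeyBasePartitioner(partitions: Int) extends Partitioner {

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    Math.abs(k.hashCode() % numPartitions)
  }
}

// Only needed before Spark 1.3; harmless on later versions.
import org.apache.spark.SparkContext._

// Run in spark-shell, where sc is predefined.
// Word count over the README, then swap each pair to (count, word) so the count is the key.
val countToWord = sc.textFile("file:///opt/hadoop/spark-2.3.1/README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).map(each => (each._2, each._1))

// Descending order on the Int keys, picked up implicitly by the sort below.
implicit val descendingOrdering: Ordering[Int] = new Ordering[Int] {
  override def compare(a: Int, b: Int): Int = b.compareTo(a)
}

// Sort by key, using the implicit Ordering defined above.
countToWord.repartitionAndSortWithinPartitions(new KeyBasePartitioner(3)).saveAsTextFile("file:///opt/output/")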

In the output, you can see that each partition is sorted by key, in descending order.


mdhdeMacBook-Pro-3:output mdh$ pwd
/opt/output
mdhdeMacBook-Pro-3:output mdh$ ls
_SUCCESS        part-00000      part-00001      part-00002
mdhdeMacBook-Pro-3:output mdh$ head -n 10 part-00000 
(24,the)
(12,for)
(9,##)
(9,and)
(6,is)
(6,in)
(3,general)
(3,documentation)
(3,example)
(3,how)
mdhdeMacBook-Pro-3:output mdh$ head -n 10 part-00001
(16,Spark)
(7,can)
(7,run)
(7,on)
(4,build)
(4,Please)
(4,with)
(4,also)
(4,if)
(4,including)
mdhdeMacBook-Pro-3:output mdh$ head -n 10 part-00002
(47,)
(17,to)
(8,a)
(5,using)
(5,of)
(2,Python)
(2,locally)
(2,This)
(2,Hive)
(2,SparkPi)
mdhdeMacBook-Pro-3:output mdh$
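
The listing can be cross-checked without Spark. The sketch below copies the leading keys of each part file from the output above, re-applies the same modulo rule as the partitioner in the example, and confirms that every key sits in the expected partition and that each partition is in descending order. OutputCheck, partitionOf, and isDescending are hypothetical names used only here.

```scala
object OutputCheck {

  // Same rule as getPartition in the example above.
  def partitionOf(key: Int, numPartitions: Int = 3): Int =
    Math.abs(key.hashCode() % numPartitions)

  // First ten keys of each part file, copied from the listing.
  val part0: Seq[Int] = Seq(24, 12, 9, 9, 6, 6, 3, 3, 3, 3)
  val part1: Seq[Int] = Seq(16, 7, 7, 7, 4, 4, 4, 4, 4, 4)
  val part2: Seq[Int] = Seq(47, 17, 8, 5, 5, 2, 2, 2, 2, 2)

  // True when no key is followed by a larger one.
  def isDescending(keys: Seq[Int]): Boolean =
    keys.zip(keys.drop(1)).forall { case (a, b) => a >= b }

  def main(args: Array[String]): Unit = {
    Seq(part0, part1, part2).zipWithIndex.foreach { case (keys, idx) =>
      assert(keys.forall(k => partitionOf(k) == idx), s"wrong partition for part $idx")
      assert(isDescending(keys), s"part $idx is not in descending order")
    }
    println("all checks passed")
  }
}
```

Keys divisible by 3 are in part-00000, keys with remainder 1 in part-00001, and keys with remainder 2 in part-00002, which is what a modulo-3 partitioner should produce.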

The above is just a simple usage. An example that combines secondary sorting with mapPartitions efficiently will be posted to the author's Planet community in the next couple of days.
【Finish】

Origin blog.51cto.com/15127544/2664759