Must Read | Spark's repartitioning and sorting
Talking about big data
Yesterday's post covered tips for using mapPartitions. As everyone should know, mapPartitions applies a map operation to an entire partition at a time. By default, the partitions of a PairRDD read from HDFS follow the physical HDFS blocks; if the files are not splittable, the number of partitions equals the number of HDFS files. We can also pass a HashPartitioner to the partitionBy operator to repartition the RDD, so that records whose keys have the same hash code fall into the same partition.
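As a rough illustration of the partitionBy behaviour described above, here is a minimal sketch. The object name PartitionByDemo, the helper keysPerPartition, the sample data, and the local[2] master are all assumptions made for this example, not part of the original post.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionByDemo {
  // Returns, for each partition index, the set of keys that ended up there.
  // Sample data is hypothetical and only serves to show key placement.
  def keysPerPartition(sc: SparkContext, numPartitions: Int): Array[Set[String]] = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))
    pairs
      .partitionBy(new HashPartitioner(numPartitions)) // shuffle by key hash
      .glom()                                          // one array per partition
      .map(records => records.map(_._1).toSet)
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to run the sketch on one machine.
    val conf = new SparkConf().setAppName("partitionBy-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      keysPerPartition(sc, 3).zipWithIndex.foreach { case (keys, idx) =>
        println(s"partition $idx holds keys: ${keys.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Every occurrence of a given key should show up in exactly one partition, because HashPartitioner places a record by its key's hash code modulo the number of partitions.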
Spark 1.2 introduced a very useful operator, repartitionAndSortWithinPartitions. This operator pushes the sort into Spark's shuffle. If it is followed by the mapPartitions operator, that operator works on partitions that have already been sorted by key, which is a bit like MapReduce. Unlike groupByKey, the data is not loaded into memory all at once; instead, records are read one at a time through an iterator. This approach keeps memory pressure low.
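To make the iterator-based pattern more concrete, the following is a minimal sketch of repartitionAndSortWithinPartitions followed by mapPartitions. It sums the values of each key by walking the sorted iterator once, so only one running total is held per partition at any moment. The names SortedPartitionsDemo, sumRuns and sumPerKey, the sample data, and the choice of HashPartitioner are assumptions for illustration only.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortedPartitionsDemo {
  // Sums the values of each run of equal keys in a key-sorted iterator.
  // Records are consumed lazily; no per-key collection is built in memory.
  def sumRuns(it: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val buffered = it.buffered
    new Iterator[(String, Int)] {
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, Int) = {
        val (key, first) = buffered.next()
        var total = first
        // Keep consuming while the next record has the same key.
        while (buffered.hasNext && buffered.head._1 == key) {
          total += buffered.next()._2
        }
        (key, total)
      }
    }
  }

  // Shuffle and sort by key in one step, then aggregate each sorted partition.
  def sumPerKey(pairs: RDD[(String, Int)], numPartitions: Int): RDD[(String, Int)] =
    pairs
      .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
      .mapPartitions(it => sumRuns(it), preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to run the sketch on one machine.
    val conf = new SparkConf().setAppName("sorted-partitions-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 4), ("b", 5)))
      sumPerKey(pairs, 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The sketch relies on the fact that equal keys are adjacent within a sorted partition, which is what lets sumRuns finish one key before it touches the next.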
repartitionAndSortWithinPartitions can also be used for secondary sorting.
Here is a simple example.
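import org.apache.spark.Partitioner

// Custom partitioner: sends each Int key to partition |key % numPartitions|.
class KeyBasePartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    Math.abs(k.hashCode() % numPartitions)
  }
}

import org.apache.spark.SparkContext._

// Word count over the README, then swap each pair to (count, word) so that the count becomes the key.
val wordsByCount = sc.textFile("file:///opt/hadoop/spark-2.3.1/README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).map(each => (each._2, each._1))

// Implicit ordering picked up by repartitionAndSortWithinPartitions: Int keys in descending order.
implicit val descendingIntOrdering: Ordering[Int] = new Ordering[Int] {
  override def compare(a: Int, b: Int): Int = b.compareTo(a)
}

// Sort by key within each partition, using the implicit ordering above.
wordsByCount.repartitionAndSortWithinPartitions(new KeyBasePartitioner(3)).saveAsTextFile("file:///opt/output/")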
In the result, you can see that each partition holds the keys with the same remainder modulo 3, sorted in descending order.
mdhdeMacBook-Pro-3:output mdh$ pwd
/opt/output
mdhdeMacBook-Pro-3:output mdh$ ls
_SUCCESS part-00000 part-00001 part-00002
mdhdeMacBook-Pro-3:output mdh$ head -n 10 part-00000
(24,the)
(12,for)
(9,##)
(9,and)
(6,is)
(6,in)
(3,general)
(3,documentation)
(3,example)
(3,how)
mdhdeMacBook-Pro-3:output mdh$ head -n 10 part-00001
(16,Spark)
(7,can)
(7,run)
(7,on)
(4,build)
(4,Please)
(4,with)
(4,also)
(4,if)
(4,including)
mdhdeMacBook-Pro-3:output mdh$ head -n 10 part-00002
(47,)
(17,to)
(8,a)
(5,using)
(5,of)
(2,Python)
(2,locally)
(2,This)
(2,Hive)
(2,SparkPi)
mdhdeMacBook-Pro-3:output mdh$
The above is just a simple usage. An example that combines secondary sorting efficiently with mapPartitions will be posted by Langjian (浪尖) to the Knowledge Planet (知识星球) group in the next couple of days.
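In the meantime, one common way to approach secondary sorting with repartitionAndSortWithinPartitions can be sketched as follows. This is not the author's promised example, only a minimal illustration under stated assumptions: the key is a composite (user, timestamp) tuple, the hypothetical PrimaryKeyPartitioner routes records by the user part only, and the default tuple ordering sorts by user first and timestamp second. All names and sample data here are made up for the sketch.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical partitioner: routes a composite (primary, secondary) key
// by its primary part only, so all records of one primary key share a partition.
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (primary, _) =>
      val mod = primary.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod // keep the index non-negative
  }
}

object SecondarySortDemo {
  // Input records are (user, timestamp, action); output is keyed by (user, timestamp).
  // The implicit Ordering[(String, Long)] sorts by user, then by timestamp.
  def secondarySort(events: RDD[(String, Long, String)],
                    numPartitions: Int): RDD[((String, Long), String)] =
    events
      .map { case (user, ts, action) => ((user, ts), action) }
      .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(numPartitions))

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to run the sketch on one machine.
    val conf = new SparkConf().setAppName("secondary-sort-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val events = sc.parallelize(Seq(
        ("u1", 3L, "logout"), ("u2", 1L, "login"), ("u1", 1L, "login"), ("u1", 2L, "click")))
      secondarySort(events, 2).glom().collect().zipWithIndex.foreach { case (records, idx) =>
        println(s"partition $idx: ${records.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Because each user's records arrive in one partition already ordered by timestamp, a following mapPartitions can process them in a single pass without collecting the group in memory.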
【Finish】