【spark】Partition

RDDs are Resilient Distributed Datasets. An RDD is usually large, so it is divided into multiple partitions that are stored on different nodes.

So what are the benefits of partitioning?

Partitioning reduces communication overhead between nodes, and a well-chosen partitioning can greatly speed up program execution.

Let's look at an example

First we need to understand one concept: a partition is not the same thing as a block.

Blocking means dividing all of the data into many blocks purely for storage; as shown in Figure b, the resulting blocks may each contain data from the same range of keys.

Partitioning, as shown in Figure a, splits the data out by range, so that records sharing the same primary key end up in the same partition; in the figure, records are joined on that shared primary key.

Data that has already been partitioned this way only needs to be joined partition against partition, following the same primary key.

If the join is performed on data that has not been partitioned, additional connection operations are required, and records with the same key get joined across different nodes, which greatly increases the communication overhead.

Operations such as join, groupBy, and filter can benefit greatly from partitioning.
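For the join case, here is a minimal sketch (the data and the partition count of 3 are purely illustrative) of co-partitioning two key-value RDDs with the same partitioner before joining them, so that matching keys already sit on the same partition:

import org.apache.spark.HashPartitioner

// Both pair RDDs are hash-partitioned with the same partitioner, so records that
// share a key land in the same partition and the join needs far less shuffling.
val left = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")))
val right = sc.parallelize(List((1, "x"), (2, "y"), (3, "z")))
val partitioner = new HashPartitioner(3)
val leftPart = left.partitionBy(partitioner)
val rightPart = right.partitionBy(partitioner)
val joined = leftPart.join(rightPart) // keys 1, 2 and 3 are joined partition by partition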

Partitioning principle

The basic principle of RDD partitioning is to make the number of partitions as close as possible to the number of CPU cores in the cluster; creating more partitions than the cluster can run in parallel does not increase execution speed.

For example, if our cluster has 10 cores and we create only 5 partitions, each partition is handled by one core and the remaining 5 cores sit idle.

If we create 20 partitions instead, each core processes one partition at a time and the remaining 10 partitions have to queue up and wait.
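As a quick check (a sketch, assuming local mode), sc.defaultParallelism reports the parallelism that this principle refers to:

import org.apache.spark.{SparkConf, SparkContext}

// local[*] uses all cores of the local machine, so defaultParallelism equals the core count.
val conf = new SparkConf().setAppName("CoresExample").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sc.defaultParallelism) // e.g. 10 on a 10-core machine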

Default number of partitions

For the different Spark deployment modes (local mode, standalone mode, YARN mode, Mesos mode), you can configure the default number of partitions by setting the spark.default.parallelism parameter.

Of course, the default value of this parameter differs between deployment modes:

Local mode: the default is the number of CPU cores of the local machine; if local[N] is set, the default is N. local[*] is generally used to take all available cores.

Standalone / YARN mode: the default is the larger of 2 and the total number of CPU cores across the cluster.

Mesos mode: the default number of partitions is 8.
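To set this parameter explicitly when building the SparkConf, a minimal sketch (the value 8 is purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Override the default number of partitions used by operations that do not specify one.
val conf = new SparkConf()
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)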

How to Manually Set Up Partitions

1. When creating an RDD: manually specify the number of partitions when calling the textFile or parallelize method.

  Syntax: sc.parallelize(data, partitionNum) and sc.textFile(path, partitionNum)

// sc.parallelize(data, partitionNum)
val list = List("Hadoop", "Spark", "Hive")
val rdd1 = sc.parallelize(list, 2)  // set two partitions
val rdd2 = sc.parallelize(list)     // no partition count given; defaults to spark.default.parallelism

// sc.textFile(path, partitionNum)
val rdd3 = sc.textFile("file://<local file path>", 2)  // set two partitions
val rdd4 = sc.textFile("file://<local file path>")     // no partition count given; defaults to min(2, spark.default.parallelism)
val rdd5 = sc.textFile("hdfs://<HDFS file path>")      // no partition count given; defaults to the number of HDFS blocks
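The resulting partition counts can be verified with getNumPartitions (a quick check, continuing the snippet above):

println(rdd1.getNumPartitions) // 2, as requested
println(rdd2.getNumPartitions) // whatever spark.default.parallelism resolved to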

2. When a new RDD is obtained through a transformation: call the repartition method.

  Syntax: val newRdd = oldRdd.repartition(1)

val list = List("Hadoop", "Spark", "Hive")
val rdd1 = sc.parallelize(list, 2)   // set two partitions
val newRdd1 = rdd1.repartition(3)    // repartition into three partitions
println(newRdd1.partitions.size)     // check the number of partitions: prints 3

Partitioners

We need to understand two rules when using partitioners:

(1) Only key-value RDDs have a partitioner; for a non-key-value RDD, the partitioner is None.

(2) The partition IDs of an RDD range from 0 to numPartitions - 1; the partitioner determines which partition each key belongs to.
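Rule (1) is easy to check in the shell (a small sketch; the data is arbitrary):

import org.apache.spark.HashPartitioner

val plain = sc.parallelize(List(1, 2, 3))              // not a key-value RDD
val pairs = sc.parallelize(List((1, "a"), (2, "b")))
  .partitionBy(new HashPartitioner(2))                 // key-value RDD with an explicit partitioner
println(plain.partitioner) // None
println(pairs.partitioner) // Some(org.apache.spark.HashPartitioner@...)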

Spark provides two built-in partitioning strategies: HashPartitioner and RangePartitioner.

1. HashPartitioner

Principle:

  For a given key, compute its hashCode and take the remainder of dividing it by the number of partitions. If the remainder is negative, add the number of partitions to it; the resulting value is the partition ID the key belongs to.
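The rule above amounts to a few lines of Scala (an illustrative sketch, not Spark's actual source):

// Keeps the result in the range 0 to numPartitions - 1 even for negative hashCodes.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val rawMod = key.hashCode % numPartitions
  if (rawMod < 0) rawMod + numPartitions else rawMod
}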

Syntax:

  rdd.partitionBy(new HashPartitioner(n))

Example:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HashPartitionerExample")
    val sc = new SparkContext(conf)
    val list = List((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)) // note: must be (k, v) pairs
    val rdd = sc.parallelize(list)
    val partitioned = rdd.partitionBy(new HashPartitioner(3)) // hash-partition into 3 partitions
  }
}

As another illustration of how HashPartitioner assigns keys: with 3 partitions, key 1 goes to partition 1 % 3 = 1, key 2 to partition 2, and key 3 to partition 3 % 3 = 0 (an integer's hashCode is the integer itself).

Note: HashPartitioner is in fact the partitioning method Spark uses by default.

2. RangePartitioner

Principle:

  Partition boundaries are determined from the range of key values and the number of partitions; every key falling in a given range is assigned to the corresponding partition, so each partition holds a contiguous range of keys.

Syntax:

  rdd.partitionBy(new RangePartitioner(n, rdd))

Example:

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RangePartitionerExample")
    val sc = new SparkContext(conf)
    val list = List((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2))
    val rdd = sc.parallelize(list)
    val pairRdd = rdd.partitionBy(new RangePartitioner(3, rdd)) // split the key range into three partitions
  }
}
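To see which partition each key actually landed in, the contents can be dumped with mapPartitionsWithIndex (an inspection sketch; the exact boundaries come from RangePartitioner's sampling, but with keys 1 to 3 and three partitions one would expect key 1 in partition 0, key 2 in partition 1 and key 3 in partition 2):

// Emit (partitionId, (key, value)) for every record and print the result.
pairRdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(kv => (idx, kv))
}.collect().foreach(println)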

3. User-defined partitioner

If neither of the two built-in partitioners meets your requirements, you can define a partitioner class yourself.

  Spark provides the corresponding interface; we just need to extend the Partitioner abstract class.

abstract class Partitioner extends Serializable {
  def numPartitions: Int            // returns the number of partitions to create
  def getPartition(key: Any): Int   // maps an input key to its partition ID, which must be in the range 0 to numPartitions - 1
}

  Once the class is defined, it is passed to the partitionBy() method.

Example:

Consider a case where we need to partition records by the last digit of the key. The built-in partitioners cannot express this, so we define our own partitioner class.

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

class UDPartitioner(numParts: Int) extends Partitioner {
  // override the number of partitions
  override def numPartitions: Int = numParts
  // override getPartition to compute a partition ID from the key
  override def getPartition(key: Any): Int = {
    key.toString.toInt % 10 // the last digit of the key, obtained by taking the remainder modulo 10
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UDPartitionerExample")
    val sc = new SparkContext(conf)
    // simulate data spread over 5 partitions
    val data1 = sc.parallelize(1 to 10, 5)
    // the RDD must be key-value so that the partitioner can be applied by key
    val data2 = data1.map((_, 1)) // placeholder syntax, equivalent to data1.map(x => (x, 1))
    // repartition into 10 partitions by last digit and write each partition to its own file
    data2.partitionBy(new UDPartitioner(10)).saveAsTextFile("file:///usr/local/output")
  }
}

In addition, we can override the equals() and hashCode() methods in the custom partitioner so that Spark can correctly test whether two RDDs are partitioned in the same way.
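A brief sketch of what those overrides could look like inside the UDPartitioner class above (two instances with the same partition count compare as equal):

// Lets Spark recognize equivalent partitioners and avoid re-shuffling data
// that is already partitioned the same way.
override def equals(other: Any): Boolean = other match {
  case p: UDPartitioner => p.numPartitions == numPartitions
  case _ => false
}
override def hashCode: Int = numPartitions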
