Research on Spark RDD Parallelism and the Partition Algorithm Source Code

0 Preface

1 RDD parallelism and partition

1.1 Concept explanation

By default, Spark splits a job into multiple tasks and sends them to Executor nodes for parallel computation. The number of tasks that can run in parallel at the same time is called the degree of parallelism, and it can be specified when constructing the RDD. Note that the degree of parallelism refers to the number of tasks executing concurrently, not the total number of tasks the job is split into, so don't confuse the two.

1.2 Parallelism and partition algorithm when reading data from memory

1.2.1 Parallelism algorithm when reading data from memory

Source code of makeRDD

  def makeRDD[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    parallelize(seq, numSlices)
  }

From the source code of makeRDD we can see that it simply calls parallelize(seq, numSlices) underneath, so makeRDD is a wrapper around parallelize. makeRDD takes two parameters: the first is the data source (a sequence) created in memory, and the second is numSlices, the degree of parallelism (the number of slices) you pass in. Its default value is numSlices: Int = defaultParallelism, the default degree of parallelism.

By default, the number of partitions of an RDD equals the degree of parallelism, and setting the degree of parallelism means setting the number of partitions. However, the number of partitions is not always the actual degree of parallelism; for example, when resources are insufficient, the two are not equal. The degree of parallelism can be modified through the second parameter of makeRDD. So what is the default degree of parallelism, and what are the core rules of data partitioning?
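Before tracing the source, here is a minimal sketch of passing the parallelism explicitly (assuming an existing SparkContext named sc; the variable names are only for illustration):

val rddExplicit: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2) // 2 partitions, set explicitly
val rddDefault: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))     // falls back to defaultParallelism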

Let's trace through the source code:

(1) Click into makeRDD: the default value of numSlices is defaultParallelism.

(2) Continue stepping into defaultParallelism. It turns out to be an abstract method, and an abstract method must have an implementing class. In IDEA, press Ctrl + Alt + B on the method to list its implementations.

(3) Pick the right implementation from the search results.

(4) Keep following the calls until the core implementation is reached.

Tip: in IDEA, Ctrl + Alt + B jumps to the implementations of a method, and Ctrl + F searches for the method inside a class.

The final core source code is as follows:

  override def defaultParallelism(): Int =
    scheduler.conf.getInt("spark.default.parallelism", totalCores)

From this source code we can see that if the configuration parameter "spark.default.parallelism" is not set, the fallback value totalCores is used, where totalCores is the total number of CPU cores available to the current environment (a sketch after the list below shows how to set this parameter explicitly). What determines the number of available cores? It is the master parameter set in the following code:

val conf = new SparkConf().setAppName("word count").setMaster("local") 

The relationship is as follows: the parameter configured in setMaster() determines the number of cores available to the current environment, whether that environment is Windows, Linux, or anything else:

  • local means only one core is used
  • local[*] means all cores of the machine are used
  • local[3] means 3 cores are used; any number of cores can be specified this way
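As a hedged sketch (the value 4 is only an example), the default parallelism can also be pinned explicitly through the spark.default.parallelism setting, so that it no longer falls back to totalCores:

val conf = new SparkConf()
  .setAppName("word count")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "4") // example value; used instead of totalCores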

A complete sample program is as follows:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object testMakeRDD {

    def main(args: Array[String]): Unit = {

        //TODO Spark creates an RDD from an in-memory collection
        //By default the number of RDD partitions equals the degree of parallelism,
        //and setting the parallelism sets the partition count; however, the partition
        //count is not necessarily the actual parallelism (e.g. when resources are insufficient)
        val conf = new SparkConf().setAppName("word count").setMaster("local[*]") //use all cores of the machine
        //Create the Spark context
        val sc = new SparkContext(conf)
        //1. First parameter of makeRDD: the data source
        //2. Second parameter of makeRDD: the parallelism (number of partitions)
        val rdd: RDD[Int] = sc.makeRDD(List(1,2,3,4))
        //What is the default number of partitions when the second parameter is omitted?
//        println(rdd.collect().mkString(","))
        //Save the RDD's data to files, one file per partition
        rdd.saveAsTextFile("output")
        sc.stop()
    }

}

The execution result: because the local Windows environment in this test has eight cores, 8 part files are generated in the output directory.
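Instead of writing files to check the partition count, a quicker way (a small sketch using the standard RDD API) is to ask the RDD directly:

val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
println(rdd.getNumPartitions)   // e.g. 8 on an 8-core machine with local[*]
println(rdd.partitions.length)  // same value, via the partitions array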


1.2.2 Partition algorithm when reading data from memory

The partition algorithm determines which partition each piece of data is finally assigned to for computation. Each partition produces one output file.

1 Test

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object testFenQu {

    def main(args: Array[String]): Unit = {

        //TODO Spark creates an RDD from an in-memory collection

        val conf = new SparkConf().setAppName("word count").setMaster("local[*]")
        //Create the Spark context
        val sc = new SparkContext(conf)
//        val rdd1: RDD[Int] = sc.makeRDD(List(1,2,3,4),2)
//        rdd1.saveAsTextFile("output")

//        val rdd2: RDD[Int] = sc.makeRDD(List(1,2,3,4),4)
//        rdd2.saveAsTextFile("output")

//        val rdd3: RDD[Int] = sc.makeRDD(List(1,2,3,4),3)
//        rdd3.saveAsTextFile("output")

        val rdd4: RDD[Int] = sc.makeRDD(List(1,2,3,4,5),3)
        rdd4.saveAsTextFile("output")
        sc.stop()
    }
}

Take rdd4 in the code above as an example; the results are as follows:

The code requests 3 partitions, so 3 files are generated:

Open the file part-00000: it contains only 1.
Open the file part-00001: it contains 2 and 3.
Open the file part-00002: it contains 4 and 5.

Why is the data split this way?

2 View source code

(1) Click into makeRDD to view its source code.

(2) Click into parallelize(seq, numSlices) to go one level deeper.

(3) parallelize constructs a ParallelCollectionRDD.

(4) The getPartitions method of ParallelCollectionRDD calls ParallelCollectionRDD.slice, which does the actual partitioning.

The final data partition code is as follows:

private object ParallelCollectionRDD {
  /**
   * Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range
   * collections specially, encoding the slices as other Ranges to minimize memory cost. This makes
   * it efficient to run Spark over RDDs representing large sets of numbers. And if the collection
   * is an inclusive Range, we use inclusive range for the last slice.
   */
  def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    if (numSlices < 1) {
      throw new IllegalArgumentException("Positive number of partitions required")
    }
    // Sequences need to be sliced at the same set of index positions for operations
    // like RDD.zip() to behave as expected
    def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
      (0 until numSlices).iterator.map { i =>
        val start = ((i * length) / numSlices).toInt
        val end = (((i + 1) * length) / numSlices).toInt
        (start, end)
      }
    }
    seq match {
      case r: Range =>
        positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
          // If the range is inclusive, use inclusive range for the last slice
          if (r.isInclusive && index == numSlices - 1) {
            new Range.Inclusive(r.start + start * r.step, r.end, r.step)
          }
          else {
            new Range(r.start + start * r.step, r.start + end * r.step, r.step)
          }
        }.toSeq.asInstanceOf[Seq[Seq[T]]]
      case nr: NumericRange[_] =>
        // For ranges of Long, Double, BigInteger, etc
        val slices = new ArrayBuffer[Seq[T]](numSlices)
        var r = nr
        for ((start, end) <- positions(nr.length, numSlices)) {
          val sliceSize = end - start
          slices += r.take(sliceSize).asInstanceOf[Seq[T]]
          r = r.drop(sliceSize)
        }
        slices
      case _ =>
        val array = seq.toArray // To prevent O(n^2) operations for List etc
        positions(array.length, numSlices).map { case (start, end) =>
            array.slice(start, end).toSeq
        }.toSeq
    }
  }
}

Analyzing the code above, data created from an in-memory collection (a List in our example) ends up in the case _ branch. That branch calls the positions method, passing in the length of the array and the number of slices:

def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end) // returns each slice's offsets: left-closed, right-open [start, end)
  }
}


The case _ branch then uses these (start, end) offsets to slice the underlying array:

case _ =>
  val array = seq.toArray // To prevent O(n^2) operations for List etc
  positions(array.length, numSlices).map { case (start, end) =>
    array.slice(start, end).toSeq
  }.toSeq

The code works as follows: array.slice(start, end) takes the elements from index start (inclusive) to index end (exclusive); note that until is also left-closed and right-open.

For example, for List(1, 2, 3, 4, 5) with numSlices = 3, the length is 5 and positions(5, 3) produces:

  • i = 0: start = 0, end = 5 / 3 = 1, so indices [0, 1) => {1}
  • i = 1: start = 5 / 3 = 1, end = 10 / 3 = 3, so indices [1, 3) => {2, 3}
  • i = 2: start = 10 / 3 = 3, end = 15 / 3 = 5, so indices [3, 5) => {4, 5}

This matches the contents of part-00000, part-00001 and part-00002 above.

Summary: when the in-memory data can be divided evenly by the number of slices, it is distributed evenly; when it cannot, the elements are distributed according to the integer-division offsets computed above.
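To make the arithmetic concrete, here is a small self-contained sketch (not Spark code, just a re-implementation of the positions logic shown above) that prints how List(1, 2, 3, 4, 5) is split into 3 slices:

object SliceDemo {
  // Re-implementation of the positions logic from ParallelCollectionRDD.slice (sketch only)
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5)
    positions(data.length, 3).zipWithIndex.foreach { case ((start, end), index) =>
      // prints: partition 0 -> List(1), partition 1 -> List(2, 3), partition 2 -> List(4, 5)
      println(s"partition $index -> ${data.slice(start, end)}")
    }
  }
}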

1.3 Data partition algorithm when reading files

When reading data from files, the data is sliced and partitioned according to Hadoop's file-reading rules, and the slicing rules differ somewhat from the rules used to read the data itself. The core source code (from Hadoop's FileInputFormat, which Spark relies on when reading text files) is as follows:

public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {

    long totalSize = 0;                           // compute total size
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDirectory()) {
        throw new IOException("Not a file: " + file.getPath());
      }
      totalSize += file.getLen();
    }

    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
      FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

    ...

    for (FileStatus file: files) {

        ...

        if (isSplitable(fs, path)) {
          long blockSize = file.getBlockSize();
          long splitSize = computeSplitSize(goalSize, minSize, blockSize);

          ...

  }

  protected long computeSplitSize(long goalSize, long minSize,
                                  long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }
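A hedged worked example of the formula above, with made-up numbers: suppose a single 7-byte input file, minPartitions = 2, a 32 MB block size and minSize = 1. Then goalSize = 7 / 2 = 3 and splitSize = max(1, min(3, 32 MB)) = 3, so Hadoop cuts the file into splits of 3, 3 and 1 bytes, i.e. 3 partitions even though 2 were requested. A small sketch of the same arithmetic:

// Sketch of the split-size arithmetic only; the numbers are illustrative, not from a real job
object SplitSizeDemo {
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  def main(args: Array[String]): Unit = {
    val totalSize = 7L   // hypothetical file of 7 bytes
    val numSplits = 2    // minPartitions passed to textFile
    val goalSize = totalSize / (if (numSplits == 0) 1 else numSplits)                        // 3
    val splitSize = computeSplitSize(goalSize, minSize = 1L, blockSize = 32L * 1024 * 1024)  // 3
    println(s"goalSize = $goalSize, splitSize = $splitSize")
  }
}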


Partition test

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object partition03_file {

    def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkCoreTest1")
        val sc: SparkContext = new SparkContext(conf)

        //1) Default number of partitions: min(number of available cores, 2)
        //val rdd: RDD[String] = sc.textFile("input")

        //2) Input: numbers 1-4, one per line; output: 0=>{1, 2} 1=>{3} 2=>{4} 3=>{empty}
        //val rdd: RDD[String] = sc.textFile("input/3.txt",3)

        //3) Input: numbers 1-4 on a single line; output: 0=>{1234} 1=>{empty} 2=>{empty} 3=>{empty}
        val rdd: RDD[String] = sc.textFile("input/4.txt",3)

        rdd.saveAsTextFile("output")

        sc.stop()
    }
}
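A hedged note based on the getSplits logic above: the second argument of textFile is minPartitions, a minimum rather than an exact partition count, so Hadoop's split computation may produce more partitions than requested; that is why case 2) above asked for 3 partitions but yields 4 output files. A quick way to verify, as a sketch:

val rdd: RDD[String] = sc.textFile("input/3.txt", 3) // 3 is minPartitions, not an exact count
println(rdd.getNumPartitions)                        // may print 4, depending on the file's size in bytes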



Origin: blog.csdn.net/godlovedaniel/article/details/108155531