How Spark calculates the number of partitions

The number of partitions determines the degree of parallelism, so its importance in performance tuning is self-evident.
1. If the number of partitions is too small, parallelism is insufficient and cluster resources may sit idle, never doing justice to the whole cluster. In addition, each partition becomes larger, which makes data skew more likely.
2. If the number of partitions is too large, the time spent on task scheduling may exceed the actual task execution time.

-------------------------------------------------- Determining the number of partitions --------------------------------------------------

If no partition count is specified, the number of partitions defaults to defaultMinPartitions:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
In other words, this default is at most 2; the official explanation is that this guarantees performance when processing small files.
This case mostly applies to creating an RDD from a Scala collection, e.g. sc.parallelize(List(1,2,3,6)).
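As a minimal sketch of the rule above (assuming only the `math.min(defaultParallelism, 2)` definition quoted from the Spark source; the Python function name is mine):

```python
def default_min_partitions(default_parallelism: int) -> int:
    """Mirror of Spark's defaultMinPartitions: math.min(defaultParallelism, 2)."""
    return min(default_parallelism, 2)

print(default_min_partitions(8))  # even a busy cluster defaults to 2
print(default_min_partitions(1))  # a single-core local run defaults to 1
```

Note that this is a default *minimum*: a larger cluster still only guarantees 2 partitions unless you ask for more.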

First, Hadoop data sources

In general, the number of partitions can be expressed by the following formula:
partitions = max(sc.defaultParallelism, total_data_size / data_block_size)
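The formula can be sketched directly in code (the function name is mine, and rounding the block count up with `ceil` is my assumption, since a trailing partial block still gets its own partition):

```python
import math

def estimated_partitions(default_parallelism: int,
                         total_data_size: int,
                         data_block_size: int) -> int:
    """partitions = max(sc.defaultParallelism, total_data_size / data_block_size)."""
    return max(default_parallelism, math.ceil(total_data_size / data_block_size))

GB = 1024 ** 3
MB = 1024 ** 2
print(estimated_partitions(4, 1 * GB, 128 * MB))    # 8 blocks outweigh parallelism 4 -> 8
print(estimated_partitions(16, 100 * MB, 128 * MB)) # tiny file: parallelism 16 wins -> 16
```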

Spark supports all Hadoop I/O formats because it uses the same Hadoop InputFormat API as Hadoop's own and third-party formats. Accordingly, by default Spark's input partitioning is consistent with Hadoop/MapReduce input splits.

Typically, Spark creates one partition for each HDFS block (note: if a single line is particularly long, longer than a block, the final number of blocks can be smaller than the number of partitions). If further splitting is needed, Spark splits on line boundaries.

Note that when the input file is compressed, the number of partitions also depends on whether the compression format is splittable.
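The splittability point can be illustrated with a rough rule-of-thumb table (the codec classifications below are my own summary, not from the original text; e.g. LZO is only splittable when indexed, and Snappy *inside* container formats like Parquet is splittable even though a raw Snappy file is not):

```python
# Illustrative only: rough splittability of a few common codecs.
SPLITTABLE = {
    "bzip2": True,    # block-oriented, splittable
    "lzo": True,      # splittable when an index file is present
    "gzip": False,    # stream codec: the whole file becomes one split
    "snappy": False,  # a raw .snappy file is not splittable
}

def partitions_for_compressed_file(codec: str, num_blocks: int) -> int:
    """A non-splittable file yields a single partition regardless of block count."""
    return num_blocks if SPLITTABLE.get(codec, False) else 1

print(partitions_for_compressed_file("gzip", 10))   # 10-block gzip file -> 1 partition
print(partitions_for_compressed_file("bzip2", 10))  # bzip2 keeps all 10
```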

The following code fragments, taken from Hadoop MapReduce (FileInputFormat), show how the split size is obtained:

/**
 * Get the lower bound on split size imposed by the format.
 * @return the number of bytes of the minimal split for this format
 */
protected long getFormatMinSplitSize() {
  return 1;
}

/**
 * Get the minimum split size
 * @param job the job
 * @return the minimum number of bytes that can be in a split
 */
public static long getMinSplitSize(JobContext job) {
  return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
}

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

// Computed when splits are generated (FileInputFormat.getSplits):
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
long blockSize = file.getBlockSize();
long splitSize = computeSplitSize(blockSize, minSize, maxSize);

Partitions = totalSize / splitSize
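Putting the fragments above together, a minimal Python sketch of the same arithmetic (function and variable names are mine; the logic mirrors computeSplitSize quoted above):

```python
def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    # Mirrors Hadoop's FileInputFormat.computeSplitSize:
    # Math.max(minSize, Math.min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

MB = 1024 ** 2
# With the defaults (minSize = 1, maxSize = Long.MAX_VALUE),
# splitSize equals the block size, so a 1 GiB file on 128 MiB blocks -> 8 partitions.
split = compute_split_size(128 * MB, 1, 2 ** 63 - 1)
partitions = (1024 * MB) // split
print(split // MB, partitions)

# Raising the minimum split size merges blocks into fewer, larger partitions.
bigger = compute_split_size(128 * MB, 256 * MB, 2 ** 63 - 1)
print(bigger // MB)
```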

Second, non-Hadoop data sources

1. When an RDD is generated from another RDD, the number of partitions is inherited from the parent RDD through Spark's common transformation operations.
2. When an RDD is generated, the number of partitions may also depend on the specific key-based transformation operations involved.
Cases 1 and 2 are described in more detail in the linked article: Understanding Spark Partitioning.
3. In several cases the final size of the partitions is determined by the connector:
For S3, it is determined by the property fs.s3n.block.size or fs.s3.block.size.
For Cassandra, it is determined by the property spark.cassandra.input.split.size_in_mb.
For Mongo, it is determined by the property spark.mongodb.input.partitionerOptions.partitionSizeMB.
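For illustration, connector properties like these are typically passed as Spark configuration at submission time. A hypothetical invocation (only the two --conf property names come from the text above; the value 64 and the jar name are placeholders):

```shell
spark-submit \
  --conf spark.cassandra.input.split.size_in_mb=64 \
  --conf spark.mongodb.input.partitionerOptions.partitionSizeMB=64 \
  your-job.jar
```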

Third, setting an appropriate number of partitions

Usually, set between 100 and 10K partitions, depending on data volume and cluster size. Empirical upper and lower bounds:
Upper bound: set the number of partitions to 2-3 times the total number of available cores in the cluster.
Lower bound: the scheduling delay a task incurs should be less than the time the task takes to execute. If a task's execution time is even shorter than its scheduling delay, the partitions hold too little data and the application spends more of its time on task scheduling, which is unreasonable.
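These two rules of thumb can be sketched as a simple sanity check (the function and its inputs are illustrative; in practice you would read per-task execution time and scheduler delay off the Spark UI, as the note below suggests):

```python
def partition_count_reasonable(partitions: int,
                               total_cores: int,
                               total_work_ms: float,
                               sched_delay_ms: float) -> bool:
    """Check a candidate partition count against the two empirical bounds."""
    per_task_ms = total_work_ms / partitions
    upper_ok = partitions <= 3 * total_cores   # upper bound: 2-3x available cores
    lower_ok = per_task_ms > sched_delay_ms    # lower bound: task outlives its scheduling
    return upper_ok and lower_ok

# 100 cores, a stage with 600 s of total compute, ~100 ms scheduling delay per task:
print(partition_count_reasonable(200, 100, 600_000, 100))  # within both bounds
print(partition_count_reasonable(400, 100, 600_000, 100))  # exceeds 3x the core count
```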

Note: you can use the Spark UI to inspect task scheduling and execution times under different partition counts, and use that to determine the optimal number of partitions.


Origin blog.csdn.net/weixin_43878293/article/details/89928046