Increase Spark parallelism

1 Spark Streaming increases task concurrency
Q: In Spark Streaming, what are the methods to increase task concurrency?
A: s1 Increase the number of cores, i.e. the number of task threads per executor (--executor-cores)
      s2 Use repartition to increase the number of RDD partitions
      s3 Streaming + Kafka, Direct mode: increase the number of Kafka partitions
      s4 Streaming + Kafka, Receiver mode: increase the number of Receivers
      s5 Pass the second (numPartitions) parameter to reduceByKey and reduceByKeyAndWindow

1.1 Analysis

s1 & s2:
When an RDD is computed, each partition starts one task, so the number of RDD partitions determines the total number of tasks.
The number of compute nodes (Executors) you request and the number of cores on each node determine how many tasks can execute in parallel at the same time.
eg:
If an RDD has 100 partitions, 100 tasks are generated during the computation. With a resource configuration of 10 compute nodes, 2 cores each, 20 tasks can run in parallel at the same time, so computing this RDD takes 5 rounds.
With the same resources, 101 tasks would need 6 rounds; in the last round only one task runs while the remaining cores sit idle.
With the same resources, if your RDD has only 2 partitions, only 2 tasks run at the same time and the remaining 18 cores sit idle, wasting resources.
This is how Spark tuning increases task parallelism by increasing the number of RDD partitions.
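
As a rough illustration of s1 and s2, the sketch below requests 10 executors with 2 cores each at submit time and repartitions a small-partition RDD to 20 so every core gets work; the input path and all counts are illustrative, not a prescribed setup.

```scala
// s1: more task threads per executor is set at submit time, e.g.
//   spark-submit --num-executors 10 --executor-cores 2 ...
// s2: repartition the RDD so the number of tasks matches the available cores.
import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RepartitionExample").getOrCreate()
    val sc = spark.sparkContext

    // Source RDD created with very few partitions (here a minimum of 2),
    // so a stage over it would run at most a couple of tasks at a time.
    val raw = sc.textFile("hdfs:///data/input", minPartitions = 2) // illustrative path

    // Repartition to 20 so that 10 executors * 2 cores can all work in parallel.
    val repartitioned = raw.repartition(20)

    println(s"partitions after repartition: ${repartitioned.getNumPartitions}")
    spark.stop()
  }
}
```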

s5:
If the number of parallel tasks used in any stage of the computation is not high enough, the cluster resources cannot be fully utilized. For example, for distributed reduce operations such as reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is determined by the spark.default.parallelism parameter. You can pass in the second parameter in operations such as reduceByKey to manually specify the parallelism of the operation, or you can adjust the global spark.default.parallelism parameter.
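
As a hedged sketch of s5, the snippet below passes an explicit numPartitions as the second argument to reduceByKey and reduceByKeyAndWindow, and also sets the global spark.default.parallelism; the socket source, batch interval, window lengths, and the value 150 are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReduceParallelism {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReduceParallelism")
      .set("spark.default.parallelism", "150") // global default for shuffle operations
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Second argument = numPartitions, i.e. the parallelism of this reduce.
    val counts = pairs.reduceByKey(_ + _, 150)
    val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10), 150)

    counts.print()
    windowed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```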

1.2 Can increasing the number of Kafka partitions increase Spark's parallelism in processing data?

s4:
In the Receiver mode, Spark's partitions are not related to Kafka's partitions, so increasing the number of partitions per topic only increases the number of threads a single Receiver uses to consume the topic; it does not increase Spark's parallelism in processing the data. In this mode, however, each Receiver reads independently as its own input stream, so the parallelism of Spark tasks can be increased by increasing the number of Receivers.
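
A minimal sketch of s4 (assuming the receiver-based API from spark-streaming-kafka-0-8): several createStream calls each start one Receiver, and the resulting streams are unioned into a single DStream; the ZooKeeper address, group id, topic name, and receiver count are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MultiReceiver")
    val ssc = new StreamingContext(conf, Seconds(5))

    val numReceivers = 4
    // Each createStream call launches one Receiver; union them into one DStream.
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    }
    val unioned = ssc.union(streams)

    unioned.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```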

s3:
In the Direct mode, Kafka partitions and RDD partitions correspond one to one, so Kafka data is read in parallel and increasing the number of Kafka partitions directly increases the number of Spark tasks. This one-to-one mapping is also easier to understand and optimize.
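
A minimal sketch of s3 (assuming the spark-streaming-kafka-0-10 direct API): each Kafka partition of the subscribed topic becomes one RDD partition per batch; the broker address, group id, and topic name are illustrative.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-host:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )

    // Number of RDD partitions per batch == number of Kafka partitions of "my-topic".
    stream.foreachRDD(rdd => println(s"partitions: ${rdd.getNumPartitions}"))
    ssc.start()
    ssc.awaitTermination()
  }
}
```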

Level of Parallelism. This setting specifies the default number of partitions used by reduce-type (shuffle) operations. It is usually indispensable in real projects and is generally chosen according to the size of the input data and the memory of each executor. Set the level of parallelism on individual operations or set the spark.default.parallelism property to change the parallelism level; generally speaking, each CPU core can be allocated 2~3 tasks.

The parallelism is the number of tasks in each stage of a Spark job, and it therefore also represents the degree of parallelism of the job at each stage.
What happens if the parallelism is not adjusted and is too low?
Suppose we have allocated plenty of resources to our Spark job in the spark-submit script: 50 executors, each with 10G of memory and 3 CPU cores, which basically reaches the resource limit of the cluster or YARN queue.
Now suppose the number of tasks is not set, or is set far too low, say 100 tasks. With 50 executors and 3 CPU cores each, any stage of your application has 150 CPU cores available to run tasks in parallel. But with only 100 tasks, spread evenly, each executor gets 2 tasks, so only 100 tasks run at the same time and each executor runs only 2 tasks in parallel; the remaining CPU core on every executor is wasted.

A reasonable degree of parallelism should be large enough to make full and sensible use of your cluster resources. In the example above, the cluster has 150 CPU cores in total and can run 150 tasks in parallel, so you should set the parallelism of your application to at least 150 in order to use the cluster fully and let 150 tasks execute in parallel. Raising the task count to 150 also reduces the amount of data each task has to process: if 150G of data must be processed in total, 100 tasks means each task computes 1.5G, whereas 150 tasks running in parallel means each task only has to handle about 1G.


1. The number of tasks should be at least equal to the total number of CPU cores in the Spark application (ideally, for example, with 150 CPU cores in total, allocate 150 tasks so they run together and finish at roughly the same time).

2. The official recommendation is to set the number of tasks to 2~3 times the total number of CPU cores in the Spark application; for example, with 150 CPU cores, set the number of tasks to roughly 300~500.
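
To tie this back to the 50-executor example, the sketch below shows how the submit-time flags and the parallelism setting could line up; all numbers, class names, and paths are illustrative, not a prescribed configuration.

```scala
// Submit-time flags matching the example above (50 executors * 3 cores = 150 cores),
// with the parallelism set to roughly 3x the core count:
//
//   spark-submit \
//     --num-executors 50 \
//     --executor-memory 10G \
//     --executor-cores 3 \
//     --conf spark.default.parallelism=450 \
//     --class com.example.MyJob myjob.jar
//
// The same property can also be set in code before the context is created:
import org.apache.spark.{SparkConf, SparkContext}

object SubmitConfig {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SubmitConfig")
      .set("spark.default.parallelism", "450") // ~3x the 150 available cores
    val sc = new SparkContext(conf)
    println(s"default parallelism: ${sc.defaultParallelism}")
    sc.stop()
  }
}
```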
