Spark Tuning Overview

It is divided into several parts:
development tuning, resource tuning, data skew tuning, and shuffle tuning


Development tuning:

This part mainly covers RDD lineage design, rational use of operators, and optimization of special operations.

Avoid creating duplicate RDDs and reuse the same RDD as much as possible (including the case where one RDD's data contains another's). For RDDs that are used multiple times, persist them to memory (serialized) or to disk (serialized).
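A minimal sketch of this rule (the file path and names below are hypothetical): the same RDD is built once, persisted in serialized form, and then reused by two actions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ReuseAndPersist {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReuseAndPersist"))

    // Build the RDD once instead of calling sc.textFile twice on the same path.
    val logs = sc.textFile("hdfs://nn:8020/logs/access.log") // hypothetical path

    // The RDD is used by two actions below, so persist it; MEMORY_AND_DISK_SER
    // keeps the data serialized in memory and spills to disk when memory is tight.
    logs.persist(StorageLevel.MEMORY_AND_DISK_SER)

    val errorCount = logs.filter(_.contains("ERROR")).count()
    val totalCount = logs.count()
    println(s"error ratio = ${errorCount.toDouble / totalCount}")

    sc.stop()
  }
}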
Try to avoid using shuffle operators

During the shuffle process, each node first writes records with the same key to local disk files, and other nodes then pull those records over the network. When all the records with a given key are pulled to one node for aggregation, that node may have to process so many keys that memory runs out and data spills to disk files. The shuffle process can therefore involve a large amount of disk file IO and network data transfer, and disk IO and network transfer are the main reasons shuffle performs poorly.

Use high-performance operators
  • Use shuffle operations with map-side pre-aggregation
    Map-side pre-aggregation means performing an aggregation on the same key locally on each node, similar to the local combiner in MapReduce. After map-side pre-aggregation, each node holds at most one record per key, because records with the same key have already been aggregated locally. When other nodes then pull that key, the amount of data to transfer is greatly reduced, which lowers disk IO and network transmission overhead.

  • Use reduceByKey/aggregateByKey instead of groupByKey
    reduceByKey and aggregateByKey use a user-supplied function to pre-aggregate identical keys locally on each node, whereas groupByKey performs no pre-aggregation, so the full data set is distributed and transmitted between the nodes of the cluster and performance is comparatively poor (see the sketch after this list).

  • Use mapPartitions instead of the plain map
    With mapPartitions-class operators, one function call processes all the data of a partition rather than a single record, so performance is usually higher. However, mapPartitions can sometimes cause OOM (out of memory) problems: because one call handles an entire partition, if memory is insufficient the many objects created cannot be reclaimed by garbage collection in time, and an OOM exception may occur. Use this kind of operator with care.

  • Use foreachPartitions instead of foreach
    The principle is the same as "use mapPartitions instead of map": one function call processes all the data of a partition rather than a single record. In practice this helps performance noticeably. For example, when writing all the data in an RDD to MySQL, a plain foreach writes one record at a time and may create a database connection for every call, and frequently creating and destroying connections is very expensive. With foreachPartitions, each partition needs only one database connection followed by a batch insert, which performs much better.

  • Use repartitionAndSortWithinPartitions instead of repartition and sort operations
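A sketch of two of the substitutions above, assuming an existing SparkContext sc, an RDD[String] named words, and a hypothetical MySQL table word_count with hypothetical connection details (the RDD API method is foreachPartition):

import java.sql.DriverManager

val pairs = words.map(w => (w, 1))

// Preferred: reduceByKey pre-aggregates identical keys on the map side before the shuffle.
val counts = pairs.reduceByKey(_ + _)

// Avoid: groupByKey ships every (word, 1) record across the network and only then sums.
// val counts = pairs.groupByKey().mapValues(_.sum)

// Preferred: foreachPartition opens one connection per partition and writes in one batch.
counts.foreachPartition { records =>
  val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass") // hypothetical
  val stmt = conn.prepareStatement("INSERT INTO word_count (word, cnt) VALUES (?, ?)")
  records.foreach { case (word, cnt) =>
    stmt.setString(1, word)
    stmt.setInt(2, cnt)
    stmt.addBatch()
  }
  stmt.executeBatch()
  conn.close()
}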

Use Kryo to optimize serialization; its performance is significantly higher than the default Java serialization.
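A minimal sketch of switching to Kryo (MyRecord is a hypothetical class used only to show class registration):

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, name: String) // hypothetical custom class

val conf = new SparkConf()
  .setAppName("KryoExample")
  // Replace the default Java serializer with Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering custom classes avoids writing their full class names into the serialized data.
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)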
Optimize data structures
  • Try to use strings instead of objects, use primitive types (such as Int, Long) instead of strings, and use arrays instead of collection types, so as to reduce memory usage as much as possible, thereby reducing GC frequency and improving performance.
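A small illustration of the idea, with hypothetical record fields; the second form holds the same information with far fewer objects:

// Heavier on memory: one object per record plus a boxed value inside a collection.
case class UserVisit(userId: String, pageId: String, count: java.lang.Integer)
val visitsAsObjects: List[UserVisit] = List(UserVisit("u1", "p1", 3), UserVisit("u2", "p2", 5))

// Lighter on memory: encode each record as a delimited string and keep the counts
// in a primitive array, which reduces object count and GC pressure.
val visitsAsStrings: Array[String] = Array("u1,p1", "u2,p2")
val visitCounts: Array[Int] = Array(3, 5)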

Resource tuning

The key point at runtime is how Spark divides stages at shuffle operators.
If a shuffle-class operator (such as reduceByKey or join) appears in our code, a stage boundary is drawn at that operator: roughly speaking, the code before the shuffle operator falls into one stage, and the code from the shuffle operator onwards falls into the next stage. When the later stage starts, each of its tasks pulls the keys it is responsible for from the nodes where the previous stage's tasks ran, and then applies the operator function we wrote (for example, the function passed to reduceByKey()) to all the identical keys it pulled, performing the aggregation. This process is the shuffle.

Therefore, the memory of the Executor is mainly divided into three parts:
the first one is to let the task execute the code we wrote, which by default accounts for 20% of the total memory of the Executor;
the second is used by tasks, after pulling the output of the previous stage's tasks through the shuffle process, for aggregation and similar operations; by default it accounts for 20% of the total memory of the Executor;
the third block is used for RDD persistence, which accounts for 60% of the total memory of the Executor by default.

The execution speed of the task is directly related to the number of CPU cores of each Executor process. A CPU core can only execute one thread at a time. The multiple tasks assigned to each Executor process are run concurrently by multiple threads in the form of one thread per task. If the number of CPU cores is sufficient and the number of tasks allocated is reasonable, then in general, these task threads can be executed quickly and efficiently.

After understanding the basic principles of Spark job operation, it is easy to understand the parameters related to resources.

  • num-executors
    sets how many Executor processes the Spark job will use to execute. Recommendation: about 50~100 Executor processes

  • executor-memory
    sets the memory of each Executor process. Recommendation: 4G~8G per Executor is usually appropriate, and num-executors multiplied by executor-memory must not exceed the maximum memory of the resource queue; it is generally best to stay within 1/3~1/2 of the queue's total memory.

  • executor-cores
    sets the number of CPU cores for each Executor process. Recommendation: It is more appropriate to set the number of CPU cores of Executor to 2~4

  • driver-memory
    If the collect operator is used to pull all of an RDD's data back to the Driver for processing, the Driver's memory must be large enough, otherwise an OOM (out of memory) error will occur.

  • spark.default.parallelism

== This parameter sets the default number of tasks for each stage. It is extremely important; if it is not set, it may directly hurt the performance of your Spark job. ==
A reasonable setting is 500~1000 tasks per Spark job. If this parameter is not set, Spark derives the task count from the number of blocks in the underlying HDFS files, one task per HDFS block by default, which is usually far too small (for example, a few dozen tasks). If the number of tasks is too small, the Executor resources configured earlier are wasted: no matter how many Executor processes you have and how much memory and how many CPU cores they get, with only 1 or 10 tasks most of the Executor processes will have no task to run at all. The Spark documentation therefore recommends setting this parameter to 2~3 times num-executors * executor-cores. For example, if the Executors have 300 CPU cores in total, setting about 1000 tasks allows the cluster's resources to be fully used.

  • spark.storage.memoryFraction
    sets the fraction of Executor memory used for RDD persistence; the default is 0.6. When the job performs many RDD persistence operations this value can be raised appropriately, and when it performs few it can be lowered.

  • spark.shuffle.memoryFraction
    sets the fraction of Executor memory that a task can use for aggregation after pulling the output of the previous stage's tasks during shuffle; the default is 0.2. In other words, an Executor by default reserves only 20% of its memory for this. If an aggregation uses more than that limit, the excess data is spilled to disk files, which severely degrades performance.

Below is an example spark-submit command for reference; adjust it according to your own situation:
./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \

Data skew tuning

The principle of data skew:
When a shuffle is performed, records with the same key on every node must be pulled to a single task on one node for processing, for example for key-based aggregation or join. If the data volume of one key is especially large, data skew occurs. For example, if most keys have 10 records but an individual key has 1 million records, most tasks will be given only 10 records and finish in a second, while the individual task given 1 million records may run for an hour or two. The overall progress of a Spark job is therefore determined by its slowest task.

How to locate the code that causes data skew and check the data distribution:

Data skew only happens during shuffle. Commonly used operators that may trigger a shuffle include distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition, etc. When data skew occurs, it may well be caused by one of these operators in your code.
On the Spark Web UI we can examine how much data each task in the current stage was allocated, and from that determine whether uneven data allocation across tasks is the cause of the skew.

  1. If the data is skewed by the group by and join statements in Spark SQL, query the key distribution of the tables used in SQL.
  2. If the data skew is caused by a shuffle operator on a Spark RDD, you can add code to the Spark job to inspect the key distribution, for example RDD.countByKey(), then collect/take the per-key counts to the driver and print them to see how the keys are distributed.
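A sketch of the sampling approach in point 2, assuming a pair RDD named pairRdd:

// Sample 10% of the data so the driver-side result stays small, count each key,
// and print the ten most frequent keys.
val sampled = pairRdd.sample(withReplacement = false, fraction = 0.1)
val keyCounts = sampled.countByKey() // a Map[K, Long] collected to the driver
keyCounts.toSeq.sortBy(-_._2).take(10).foreach { case (key, cnt) =>
  println(s"key=$key count=$cnt")
}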

Improve the parallelism of shuffle operations (preferred)

Increasing the number of shuffle read tasks lets keys that were originally assigned to one task be spread across multiple tasks, so each task processes less data than before. For example, if there are originally 5 keys, each with 10 records, and all 5 keys are assigned to one task, that task processes 50 records; after increasing the number of shuffle read tasks, each task may be assigned a single key and process only 10 records, so each task's execution time naturally shortens.

This solution usually cannot completely eliminate data skew. In extreme cases, for example when a single key has 1 million records, no matter how much the task count is increased, that key's 1 million records will still end up in one task, so the skew still occurs.
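A sketch of raising the parallelism for a single shuffle operator; the RDD name and the number 1000 are illustrative:

// Most shuffle operators accept an explicit number of partitions, which becomes
// the number of shuffle read tasks for that operation.
val aggregated = pairRdd.reduceByKey(_ + _, 1000)

// Alternatively, set a cluster-wide default via spark.default.parallelism
// (see the resource tuning section above).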


Two-stage aggregation (local aggregation + global aggregation)

The core idea of this scheme is to aggregate in two stages. The first stage is local aggregation: attach a random prefix to each key, say a random number below 10, so that identical keys become distinct. For example, (hello, 1) (hello, 1) (hello, 1) (hello, 1) becomes (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then run an aggregation such as reduceByKey on the prefixed data to perform local aggregation, giving (1_hello, 2) (2_hello, 2). Next, strip the prefix from each key to get (hello, 2) (hello, 2), and run the aggregation again globally to obtain the final result, (hello, 4). ==This scheme only applies to aggregation-class shuffle operations==.
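A sketch of two-stage aggregation on a hypothetical pair RDD named pairRdd holding (word, count) records:

import scala.util.Random

// Stage 1: local aggregation under a random prefix 0..9, so one hot key is spread
// over up to 10 prefixed keys and therefore up to 10 different tasks.
val prefixed = pairRdd.map { case (key, value) =>
  (s"${Random.nextInt(10)}_$key", value)
}
val partialAgg = prefixed.reduceByKey(_ + _)

// Stage 2: strip the prefix and aggregate globally; each original key now has at
// most 10 partially aggregated records left.
val globalAgg = partialAgg
  .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }
  .reduceByKey(_ + _)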


Convert reduce join to map join (broadcast + map)

An ordinary join goes through the shuffle process: records with the same key are pulled into one shuffle read task and then joined, which is a reduce join. However, if one RDD is small enough, you can broadcast the full data of the small RDD and perform the join inside a map-class operator (a map join); no shuffle happens, so no data skew can occur.
This scheme has limited applicability, because it only works when joining one large table with one small table.
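A sketch of the broadcast + map approach, assuming an existing SparkContext sc, a large pair RDD bigRdd, and a pair RDD smallRdd small enough to fit in each Executor's memory:

// Collect the small RDD to the driver and broadcast it to every Executor.
val smallMap = smallRdd.collectAsMap()
val smallBroadcast = sc.broadcast(smallMap)

// Join on the map side: look each key up in the broadcast map; no shuffle happens,
// so no data skew can occur. Keys missing from the small side are dropped (inner join).
val joined = bigRdd.flatMap { case (key, bigValue) =>
  smallBroadcast.value.get(key).map(smallValue => (key, (bigValue, smallValue)))
}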


Sampling skewed keys and splitting join operations

For data skew caused by a join where only a few keys are skewed, you can split those few keys out into a separate RDD, add random prefixes to them, split the matching data on the other side into n copies, and then join. The data for those keys is no longer concentrated in a few tasks but spread across many tasks (a sketch follows the note below).

==If too many keys cause the skew, for example thousands of keys, this approach is not suitable.==
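A sketch of the sample-and-split idea, assuming two pair RDDs rddA (skewed on a small number of keys) and rddB; the sampling fraction, the number of skewed keys taken, and n are illustrative:

import scala.util.Random

// 1. Sample rddA and pick the most frequent key(s) as the skewed keys.
val skewedKeys = rddA.sample(withReplacement = false, fraction = 0.1)
  .countByKey()
  .toSeq.sortBy(-_._2)
  .take(1)
  .map(_._1)
  .toSet

// 2. Split both RDDs into a skewed part and a normal part.
val skewedA = rddA.filter { case (k, _) => skewedKeys.contains(k) }
val normalA = rddA.filter { case (k, _) => !skewedKeys.contains(k) }
val skewedB = rddB.filter { case (k, _) => skewedKeys.contains(k) }
val normalB = rddB.filter { case (k, _) => !skewedKeys.contains(k) }

// 3. Prefix the skewed keys of rddA with a random number in [0, n), expand the
//    matching part of rddB n times with every prefix, and join; the hot key is
//    now spread over n tasks.
val n = 10
val prefixedSkewedA = skewedA.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
val expandedSkewedB = skewedB.flatMap { case (k, v) =>
  (0 until n).map(i => (s"${i}_$k", v))
}
val skewedJoined = prefixedSkewedA.join(expandedSkewedB)
  .map { case (prefixedKey, vs) => (prefixedKey.split("_", 2)(1), vs) }

// 4. Join the normal parts as usual and union the two results.
val result = skewedJoined.union(normalA.join(normalB))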


Join using random prefixes and RDD expansion

By appending a random prefix, originally identical keys become different keys, so the processed "different keys" can be spread across multiple tasks instead of one task handling a large number of identical keys. The difference from the previous scheme is that the previous one only treats the data of a few skewed keys specially, and since only that small portion of data needs to be expanded, the extra memory usage is modest. This scheme, by contrast, targets the case where a large number of keys are skewed and cannot be split out for separate handling, so the entire RDD has to be expanded, which demands a lot of memory.
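A sketch of this scheme on two hypothetical pair RDDs rddA (heavily skewed) and rddB; it differs from the previous one in that the whole of rddB is expanded n times, which is what makes the memory cost high:

import scala.util.Random

val n = 10

// Prefix every key of the skewed RDD with a random number in [0, n).
val prefixedA = rddA.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }

// Expand the entire other RDD n times, once per possible prefix.
val expandedB = rddB.flatMap { case (k, v) =>
  (0 until n).map(i => (s"${i}_$k", v))
}

// Join on the prefixed keys, then strip the prefix from the result.
val joined = prefixedA.join(expandedB)
  .map { case (prefixedKey, vs) => (prefixedKey.split("_", 2)(1), vs) }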


Shuffle tuning

Tuning overview

The performance of most Spark jobs is mainly consumed in the shuffle link, because this link includes a large number of operations such as disk IO, serialization, and network data transmission. Therefore, it is necessary to tune the shuffle process if the performance of the job is to be improved.

ShuffleManager development overview

The main component responsible for the execution, calculation and processing of the shuffle process is the ShuffleManager, that is, the shuffle manager. With the development of Spark versions, ShuffleManager is also iterating and becoming more and more advanced.

Before Spark 1.2, the default shuffle engine was HashShuffleManager. This ShuffleManager had a very serious drawback: it produced a large number of intermediate disk files, and the resulting heavy disk IO hurt performance.

Therefore, in versions after Spark 1.2, the default ShuffleManager was changed to SortShuffleManager. Compared with HashShuffleManager, SortShuffleManager has clear improvements: each task still produces multiple temporary disk files during the shuffle operation, but in the end all of them are merged into a single disk file, so each task leaves only one disk file. When the shuffle read tasks of the next stage pull their data, they only need to read their portion of each disk file according to the index.

How HashShuffleManager works

Unoptimized HashShuffleManager

In the shuffle write stage, after a stage finishes its computation, the data processed by each task is "classified" by key so that the next stage can run its shuffle-class operators: records with the same key are written, via a memory buffer, to the same disk file, and each disk file belongs to exactly one task of the downstream stage.

Shuffle read is usually the first thing a stage does: each of its tasks pulls, over the network, the records with its keys from the disk files written by the previous stage's tasks on every node, and then performs aggregation, join, or similar operations on them.

Optimized HashShuffleManager

spark.shuffle.consolidateFiles defaults to false; setting it to true enables the optimization mechanism. During shuffle write, tasks no longer create one disk file per downstream task. ==Instead, the concept of a shuffleFileGroup appears==: each shuffleFileGroup corresponds to a batch of disk files, and the number of disk files equals the number of tasks in the downstream stage. An Executor can run as many tasks in parallel as it has CPU cores, and each task in the first parallel batch creates a shuffleFileGroup and writes its data into the corresponding disk files.
When the Executor's CPU cores finish one batch of tasks and start the next, the new batch reuses the existing shuffleFileGroups, including the disk files in them.

How SortShuffleManager works

In this mode, data is first written into an in-memory data structure; which structure is chosen depends on the shuffle operator. For aggregation-class shuffle operators such as reduceByKey, a Map is used and aggregation happens while records are written into it; for ordinary shuffle operators such as join, an Array is used and records are written in directly. Each time a record is added to the in-memory structure, Spark checks whether a threshold has been reached; if so, it attempts to spill the contents of the structure to disk and then clears it.
Before spilling to a disk file, the data in the in-memory structure is sorted by key and then written out to disk files in batches,
producing multiple temporary files. Finally, all the earlier temporary disk files are merged (the merge process): the data in all of them is read and then written, in order, into a single final disk file.

Because disk files are merged, SortShuffleManager greatly reduces the number of files. For example, suppose the first stage has 50 tasks spread over 10 Executors, each Executor running 5 tasks, and the second stage has 100 tasks. Since each task ultimately produces only one disk file, there are only 5 disk files per Executor and only 50 disk files across all Executors.


Bypass Operation Mechanism
The trigger conditions for the bypass operating mechanism of SortShuffleManager are as follows:

The number of shuffle read tasks is less than the value of the spark.shuffle.sort.bypassMergeThreshold parameter (200 by default).
The operator is not an aggregation-class shuffle operator (such as reduceByKey).
In this case, each task creates a temporary disk file for every downstream task, hashes each record by its key, and writes the record to the disk file corresponding to that hash value. As with the other mechanisms, records are first written to a memory buffer and spilled to the disk file once the buffer is full. Finally, all the temporary disk files are merged into one disk file, and a single index file is created.
The disk-writing part of this process is in fact identical to the unoptimized HashShuffleManager, since an equally startling number of disk files are created; the difference is that they are merged into a single file at the end. The small number of final disk files therefore makes shuffle read perform better than with the unoptimized HashShuffleManager.

The difference between this mechanism and the normal SortShuffleManager operating mechanism is that: first, the disk writing mechanism is different; second, no sorting is performed. That is to say, the biggest advantage of enabling this mechanism is that in the process of shuffle write, data sorting operation is not required, which saves this part of the performance overhead.


Shuffle related parameter tuning
  • spark.shuffle.file.buffer
    defaults to 32k; the size of the shuffle write task's buffer. Increasing it appropriately (for example to 64k) can reduce the number of times data is spilled to disk files during shuffle write.

  • spark.reducer.maxSizeInFlight
    defaults to 48M, the buffer size of the shuffle read task

  • spark.shuffle.io.maxRetries
    defaults to 3; the number of times a shuffle read task retries pulling data from shuffle write tasks. For very large data volumes it can be increased, which can greatly improve stability.

  • spark.shuffle.io.retryWait
    defaults to 5s; the wait interval between retries when pulling data. Increasing it (for example to 60s) improves stability.

  • spark.shuffle.memoryFraction
    defaults to 0.2, the proportion of memory allocated to the shuffle read task for aggregation operations in the Executor memory

  • spark.shuffle.manager
    sets the type of ShuffleManager; options are hash, sort, and tungsten-sort. If the business logic needs sorted data, use sort; if sorting is not required, the bypass mechanism can be used to avoid the cost of sorting.

  • spark.shuffle.sort.bypassMergeThreshold
    defaults to 200. When ShuffleManager is SortShuffleManager, if the number of shuffle read tasks is less than this threshold (default is 200), the sorting operation will not be performed during shuffle write.

  • spark.shuffle.consolidateFiles
    defaults to false. If set to true, the consolidate mechanism is enabled and the output files of shuffle write are merged on a large scale.
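A sketch of applying several of the parameters above through SparkConf; the values are illustrative, not a recommendation for every job:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ShuffleTuningExample")
  .set("spark.shuffle.file.buffer", "64k")      // default 32k
  .set("spark.reducer.maxSizeInFlight", "96m")  // default 48m
  .set("spark.shuffle.io.maxRetries", "10")     // default 3
  .set("spark.shuffle.io.retryWait", "60s")     // default 5s
  .set("spark.shuffle.memoryFraction", "0.3")   // default 0.2 on older Spark versions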


Most of the advice above comes from the Meituan-Dianping tech blog ( https://tech.meituan.com/spark-tuning-basic.html ). In my own practice I found that development tuning, resource tuning, and data skew tuning have obvious effects, while the results of shuffle tuning are not obvious. So pay attention to these priorities when tuning.
