One article on Spark performance optimization is all you need

1. Spark performance optimization: development tuning

1. Avoid creating duplicate RDDs.
For the same piece of data, create only one RDD; do not create multiple RDDs to represent the same data. Otherwise, the Spark job will repeatedly compute the multiple RDDs that represent the same data, which increases the job's performance overhead.
2. Reuse the same RDD as much as possible.
If the data of several RDDs overlaps, or one contains another, try to reuse a single RDD. This keeps the number of RDDs as small as possible and, in turn, reduces the number of operator executions.
3. Persist (or checkpoint) RDDs that are used multiple times.
Every time you run an operator on an RDD, Spark recomputes that RDD from the source and then applies your operator to it. Persist any RDD that is used multiple times: Spark saves the RDD's data to memory or disk according to the persistence strategy you choose, and every subsequent operation on that RDD reads the persisted data directly from memory or disk instead of recomputing it.
Among the persistence strategies, MEMORY_ONLY gives the best performance by default, but only if memory is large enough to hold all of the RDD's data; it also avoids the overhead of serialization and deserialization.
If the RDD holds so much data that it would overflow the JVM heap (OOM), try MEMORY_ONLY_SER instead: this level serializes the RDD's data before keeping it in memory, so each partition is just a byte array, which greatly reduces the number of objects and the memory footprint. Compared with MEMORY_ONLY, it adds serialization and deserialization overhead. If the data is still too large and would overflow memory, use MEMORY_AND_DISK_SER: this strategy first tries to cache the data in memory and writes it to disk only when it does not fit.
The syntax is RDD.persist(StorageLevel.MEMORY_ONLY_SER).
The serialized persistence levels can be optimized further by using the Kryo serialization library, which gives faster serialization and a smaller memory footprint; remember, though, that if the RDD's elements are custom types, they must be registered with Kryo in advance. If you need high reliability even when persisted RDD data may be lost, you can also checkpoint the RDD.
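As an illustration, here is a minimal sketch of persisting and checkpointing a reused RDD; the HDFS paths and the storage level chosen are assumptions for the example, not values from the article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical reliable location

    val rdd = sc.textFile("hdfs:///tmp/input")        // hypothetical input path

    // Serialized in-memory persistence: each partition is kept as a byte array.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    // Mark the RDD for checkpointing; it is written out on the first action below.
    rdd.checkpoint()

    // The first action materializes the cache (and the checkpoint).
    val total = rdd.count()
    // The second action reads the persisted data instead of re-reading HDFS.
    val nonEmpty = rdd.filter(_.nonEmpty).count()

    println(s"total=$total, nonEmpty=$nonEmpty")
    sc.stop()
  }
}
```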
4. Try to avoid using shuffle operators.
The most expensive part of running a Spark job is the shuffle. During development, avoid shuffle operators such as reduceByKey, join, distinct, and repartition whenever possible, and prefer non-shuffle operators of the map class. A Spark job with no shuffle operations, or with fewer of them, has a much lower performance overhead.
5. Use high-performance operators
1) Use reduceByKey rather than groupByKey where possible.
Whenever possible, use reduceByKey instead of groupByKey: reduceByKey pre-aggregates records with the same key locally on each node using the user-supplied function, whereas groupByKey does no pre-aggregation and ships the full data set across the nodes of the cluster, so its performance is considerably worse (a combined sketch follows this list).
2) Use mapPartitions instead of ordinary map.
A mapPartitions operator processes all the data of a partition in a single function call, instead of one record per call, so its performance is usually higher.
3) Use foreachPartition instead of foreach.
Likewise, one function call processes all the data of a partition rather than a single record, which also helps performance (for example, one database connection per partition instead of per record).
4) Run coalesce after filter.
After a filter operator removes a large fraction of an RDD's data, it is usually worth calling the coalesce operator to manually reduce the number of partitions and compact the remaining data into fewer partitions, so that fewer tasks can process all of it.
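The sketch below pulls the four suggestions above together; the input path, the frequency threshold, and the commented-out connection helper are hypothetical and only illustrate where each operator would be used.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OperatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OperatorExample"))
    val pairs = sc.textFile("hdfs:///tmp/words")      // hypothetical input path
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    // 1) reduceByKey pre-aggregates identical keys on each node before the shuffle,
    //    unlike groupByKey, which ships every record across the network.
    val counts = pairs.reduceByKey(_ + _)

    // 2) mapPartitions handles a whole partition per function call, so per-call
    //    setup cost (parsers, formatters, connections) is paid once per partition.
    val formatted = counts.mapPartitions { iter =>
      iter.map { case (word, n) => s"$word\t$n" }
    }
    formatted.take(5).foreach(println)

    // 4) After a selective filter, shrink the partition count so a few tasks
    //    handle the remaining data instead of many nearly empty tasks.
    val frequent = counts.filter { case (_, n) => n > 100 }.coalesce(10)

    // 3) foreachPartition lets you open one external connection per partition
    //    instead of one per record.
    frequent.foreachPartition { iter =>
      // val conn = createConnection()               // hypothetical external sink
      iter.foreach { case (word, n) => println(s"$word -> $n") }
      // conn.close()
    }
    sc.stop()
  }
}
```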
6. Broadcast large variables
When an operator function uses an external variable, Spark by default sends a copy of that variable to every task over the network, so each task holds its own copy. If the variable is large, the many copies add network-transfer overhead and occupy so much memory in each node's Executor that frequent GC is triggered, which badly hurts performance. It is therefore recommended to use Spark's broadcast feature to broadcast the variable, so that only one copy resides in each Executor's memory and is shared by the tasks running there. This reduces the network-transfer overhead, the Executor memory usage, and the GC frequency.
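A minimal sketch of the broadcast pattern, assuming a hypothetical driver-side dictionary that is looked up inside a map:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastExample"))

    // A driver-side variable that could be large in practice (e.g. a 100 MB map).
    // Without broadcasting, every task would receive its own copy over the network.
    val dict: Map[String, String] = Map("cn" -> "China", "us" -> "United States")
    val dictBc = sc.broadcast(dict)

    val codes = sc.parallelize(Seq("cn", "us", "jp"))
    // Each Executor keeps a single shared copy of the broadcast value.
    val names = codes.map(code => dictBc.value.getOrElse(code, "unknown"))
    names.collect().foreach(println)

    dictBc.unpersist()   // release the broadcast copies when no longer needed
    sc.stop()
  }
}
```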
7. Use Kryo to optimize serialization performance
To use the Kryo serialization mechanism, first set the serializer on the SparkConf: new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). When using Kryo, the classes to be serialized should be registered in advance to obtain the best performance; if they are not registered, Kryo has to store each type's fully qualified class name with every record, which wastes a lot of memory.
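A minimal sketch of enabling Kryo and registering custom types in advance; MyKey and MyValue are hypothetical user-defined classes used only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

case class MyKey(id: Int)
case class MyValue(name: String, score: Double)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registration lets Kryo write a small class id instead of the fully
      // qualified class name with every serialized record.
      .registerKryoClasses(Array(classOf[MyKey], classOf[MyValue]))
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(Seq(MyKey(1) -> MyValue("a", 1.0), MyKey(2) -> MyValue("b", 2.0)))
    println(rdd.count())
    sc.stop()
  }
}
```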
8. Optimize data structure
1) Prefer arrays and strings over collections; in other words, use arrays in preference to collections such as ArrayList, LinkedList, and HashMap.
2) Avoid multi-level nested object structures.
3) In scenarios where it can be avoided, use int instead of String.
In Spark code, especially inside operator functions, try to use strings instead of objects, primitive types (int, long) instead of strings, and arrays instead of collection types. This reduces memory usage as much as possible, which lowers GC frequency and improves performance.

2. Spark performance optimization: resource tuning

1. num-executors
This parameter sets the total number of Executor processes used to run the Spark job. When the Driver requests resources from the YARN cluster manager, YARN starts the corresponding number of Executor processes on the cluster's worker nodes, as far as possible according to your setting. If you do not set it, only a small number of Executor processes is started by default, and the Spark job runs very slowly.
It is generally recommended to set around 50 to 100 Executor processes, adjusted as appropriate. With too few, cluster resources are not fully used; with too many, most queues may be unable to provide sufficient resources.
2. executor-memory
This parameter sets the memory of each Executor process. The executor memory size often directly determines Spark job performance. A setting of 4-8 GB per Executor process is recommended; the exact value depends on the resource queue of your department.
3. executor-cores
This parameter is used to set the number of CPU cores for each Executor process. This parameter determines the ability of each Executor process to execute task threads in parallel.
A setting of 2 to 4 CPU cores per Executor is usually appropriate. It also depends on the resource queue of your department: check the queue's maximum CPU-core limit and, given the number of Executors you have set, work out how many cores each Executor process can be allocated.
4. driver-memory
This parameter is used to set the memory of the Driver process.
Generally speaking, the Driver's memory is either left unset or set to about 1 GB, which should be enough. However, if the job uses the collect operator to pull RDD data to the Driver for processing, the Driver's memory must be large enough, otherwise it will run out of memory.
5. spark.default.parallelism
This parameter sets the default number of tasks per stage. It is extremely important; if it is not set, it can directly hurt your Spark job's performance. A default of 500 to 1000 tasks is recommended. If it is not set, Spark derives the number of tasks from the number of underlying HDFS blocks, by default one task per HDFS block, which is usually too few: many Executor processes then have no tasks to execute, and resources are wasted.
6. spark.storage.memoryFraction
This parameter sets the fraction of Executor memory used for persisted RDD data. The default is 0.6, meaning 60% of the Executor's memory can be used to hold persisted RDD data.
If the Spark job performs many RDD persistence operations, this value can be increased appropriately so that the persisted data fits in memory. This avoids the situation where memory cannot cache all the data and it has to be written to disk, which reduces performance.
7. spark.shuffle.memoryFraction
This parameter sets the fraction of Executor memory that a task can use for aggregation operations after pulling the output of tasks from the previous stage during a shuffle. The default is 0.2.
If the Spark job has few RDD persistence operations but many shuffle operations, it is recommended to lower the persistence memory fraction and raise the shuffle memory fraction, to avoid running out of memory when shuffle data is large, which forces spills to disk and reduces performance.
In addition, if the job runs slowly because of frequent GC, which means the memory available for the task's user code is insufficient, it is also recommended to lower this value.
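As a rough illustration only, the parameters of this section could be combined as follows inside a driver program; the concrete values are assumptions, and in practice most of them are passed on the spark-submit command line (--num-executors, --executor-memory, --executor-cores, --driver-memory) rather than set in code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ResourceTuningExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ResourceTuningExample")
      .set("spark.executor.instances", "50")        // equivalent of --num-executors
      .set("spark.executor.memory", "6g")           // equivalent of --executor-memory
      .set("spark.executor.cores", "3")             // equivalent of --executor-cores
      .set("spark.default.parallelism", "500")      // default task count per stage
      .set("spark.storage.memoryFraction", "0.6")   // RDD cache share (pre-1.6 memory model)
      .set("spark.shuffle.memoryFraction", "0.2")   // shuffle aggregation share (pre-1.6)
    // spark.driver.memory normally has to be given to spark-submit directly, because
    // the driver JVM is already running by the time this code executes.
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
```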

3. Spark performance optimization: data skew tuning

1. Filter out the few keys that cause the skew.
If you find that only a few keys cause the skew and that they matter little to the computation itself, this scheme is a very good fit.
Principle: after the keys that cause the data skew are filtered out, they no longer take part in the computation, so naturally they cannot produce skew.
Advantages: simple to implement and very effective; it can completely avoid data skew.
Disadvantages: the applicable scenarios are limited. In most cases there are many keys causing the skew, not just a few.
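A minimal sketch of this approach, assuming a pair RDD named pairs of type RDD[(String, Long)] and a handful of hot keys (placeholder values here) identified beforehand:

```scala
// Hot keys found by inspection or sampling, e.g. pairs.sample(false, 0.1).countByKey();
// the literal values below are placeholders.
val skewedKeys = Set("", "N/A")

// Drop the skewed keys before the shuffle so they never reach a single overloaded task.
val cleaned = pairs.filter { case (k, _) => !skewedKeys.contains(k) }
val counts  = cleaned.reduceByKey(_ + _)
```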
2. Increase the parallelism of shuffle operations.
This is usually the solution to try first, because it is the simplest way to deal with data skew.
Principle: increasing the number of shuffle read tasks lets the multiple keys originally assigned to a single task be spread across several tasks, so each task processes less data than before.
Advantages: relatively simple to implement, and it can effectively mitigate and reduce the impact of data skew.
Disadvantages: it only relieves the data skew rather than eliminating it, so its effect is limited.
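A minimal sketch, again assuming a skewed pair RDD named pairs; the partition count of 1000 is only an illustrative value.

```scala
// Most shuffle operators accept an explicit number of partitions (i.e. shuffle read tasks).
val counts = pairs.reduceByKey(_ + _, 1000)
// The same idea can be applied cluster-wide through spark.default.parallelism for the RDD API.
```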
3. Two-stage aggregation (local aggregation + global aggregation).
This is best suited to aggregation-type shuffle operators such as reduceByKey on RDDs, or GROUP BY aggregations in Spark SQL.
Principle: by adding random prefixes, one key becomes several different keys, so the data originally processed by a single task is spread across multiple tasks for partial (local) aggregation, which solves the problem of a single task handling too much data. The random prefixes are then removed and a global aggregation is performed to produce the final result.
Advantages: very effective for data skew caused by aggregation-type shuffle operations. It usually eliminates the skew, or at least greatly relieves it, and can improve Spark job performance severalfold.
Disadvantages: it only applies to aggregation-type shuffle operations, so its scope is relatively narrow. For join-type shuffle operations, other solutions are needed.
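A minimal sketch of two-stage aggregation, assuming a skewed pair RDD pairs of type RDD[(String, Long)]; the salt range of 10 is an illustrative choice.

```scala
import scala.util.Random

// Stage 1: salt each key with a random prefix so one hot key is spread over 10
// salted keys, then aggregate locally per salted key.
val salted  = pairs.map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
val partial = salted.reduceByKey(_ + _)

// Stage 2: strip the prefix and aggregate again to get the final result per original key.
val unsalted = partial.map { case (saltedKey, v) => (saltedKey.split("_", 2)(1), v) }
val result   = unsalted.reduceByKey(_ + _)
```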
4. Convert reduce join to map join.
This applies when joining RDDs, or when using join statements in Spark SQL, and one RDD or table in the join is relatively small.
Principle: an ordinary join goes through the shuffle process; the shuffle pulls records with the same key into one shuffle read task before joining them, which is a reduce join. If one RDD is relatively small, however, you can broadcast the full data of the small RDD and implement the join with a map-class operator, i.e. a map join. No shuffle occurs, and therefore no data skew occurs.
Advantages: very effective for data skew caused by join operations, because no shuffle happens at all, and therefore no skew happens at all.
Disadvantages: the applicable scenarios are limited, because this solution only fits joining a large table with a small table; after all, the small table has to be broadcast, which consumes extra memory.
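A minimal sketch of the broadcast map join, assuming a large pair RDD bigRdd, a small pair RDD smallRdd that fits in driver and executor memory, and an existing SparkContext sc:

```scala
// Pull the small side to the driver and ship one read-only copy to each Executor.
val smallMap = smallRdd.collectAsMap()
val smallBc  = sc.broadcast(smallMap)

// Join on the map side: records never move by key, so there is no shuffle and no skew.
val joined = bigRdd.flatMap { case (k, v) =>
  smallBc.value.get(k).iterator.map(w => (k, (v, w)))   // emits nothing when the key is absent (inner join)
}
```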

4. Spark performance optimization: shuffle tuning

1. spark.shuffle.file.buffer
This parameter sets the buffer size of the BufferedOutputStream used by the shuffle write task; the default is 32 KB. Data is written to this buffer before being written to the disk file, and the buffer is flushed to disk once it is full.
Suggestion: if the job has sufficient memory available, this parameter can be increased appropriately (for example to 64 KB) to reduce the number of times the buffer is spilled to the disk file during shuffle write, which also reduces disk I/O and improves performance.
2. spark.reducer.maxSizeInFlight
This parameter sets the buffer size of the shuffle read task; the default is 48 MB. This buffer determines how much data can be pulled at a time.
Recommendation: if the job has sufficient memory available, this parameter can be increased appropriately (for example to 96 MB) to reduce the number of pulls, and therefore of network transfers, improving performance.
3. spark.shuffle.io.maxRetries
Default value: 3.
When a shuffle read task pulls its data from the node where the shuffle write task ran and the pull fails because of a network problem, it retries automatically; this parameter is the maximum number of retries.
Recommendation: For those jobs that contain particularly time-consuming shuffle operations, it is recommended to increase the maximum number of retries (such as 60).
4. spark.shuffle.io.retryWait
Default value: 5s.
This parameter is the waiting interval between successive retries when pulling data; the default is 5 seconds.
Suggestion: it is recommended to increase this interval (for example to 60s) to improve the stability of shuffle operations.
5. spark.shuffle.memoryFraction
Default value: 0.2
This parameter represents the fraction of Executor memory allocated to shuffle read tasks for aggregation operations; the default is 20%.
Suggestion: this parameter was already covered under resource tuning. If memory is plentiful and persistence operations are rarely used, it is recommended to raise this fraction.
6. spark.shuffle.manager
Default value: sort
This parameter is used to set the type of ShuffleManager.
Suggestion: SortShuffleManager sorts the data by default, so if your business logic needs that sorting you can keep the default SortShuffleManager; if it does not, see the suggestion under spark.shuffle.consolidateFiles below.
7. spark.shuffle.consolidateFiles
Default value: false
This parameter only takes effect when HashShuffleManager is used. If it is set to true, the consolidate mechanism is enabled and the shuffle write output files are heavily merged; when the number of shuffle read tasks is particularly large, this can greatly reduce disk I/O overhead and improve performance.
Suggestion: if you really do not need SortShuffleManager's sorting mechanism, then besides using the bypass mechanism you can also try setting the spark.shuffle.manager parameter to hash manually, using HashShuffleManager with the consolidate mechanism enabled.
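Putting the shuffle parameters of this section together, a hedged configuration sketch might look like the following; the values are illustrative, and several of these knobs (HashShuffleManager, consolidateFiles, shuffle.memoryFraction) only exist in older 1.x versions of Spark, so check them against your version before use.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuningExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ShuffleTuningExample")
      .set("spark.shuffle.file.buffer", "64k")         // shuffle write buffer (default 32k)
      .set("spark.reducer.maxSizeInFlight", "96m")     // shuffle read buffer (default 48m)
      .set("spark.shuffle.io.maxRetries", "60")        // retries on fetch failure (default 3)
      .set("spark.shuffle.io.retryWait", "60s")        // wait between retries (default 5s)
      .set("spark.shuffle.memoryFraction", "0.3")      // shuffle aggregation share (default 0.2)
      .set("spark.shuffle.manager", "hash")            // only if sorting is truly not needed
      .set("spark.shuffle.consolidateFiles", "true")   // merge shuffle write outputs (HashShuffleManager)
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
```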

Origin blog.csdn.net/weixin_43777152/article/details/109255182