[Spark] Commonly used Spark optimization methods

Table of Contents

Optimization purpose

Optimization of Spark Core

Dynamic resource scheduling in YARN mode

Shuffle stage tuning

Use mapPartitions instead of map

Use foreachPartitions instead of foreach

Set the num-executors parameter

Set the executor-memory parameter

Set the executor-cores parameter

Pay attention to the use of collect

Use reduceByKey instead of groupByKey

Data skew

Convert text format data on HDFS to Parquet format data

Optimization of Spark SQL

Use partitioned tables

Use broadcast

Use repartitioning to optimize small files

Wide dependency and narrow dependency

Spark Shuffle

Division of Spark stages

yarn-cluster and yarn-client modes

Set spark.serializer

Spark's Executor memory allocation


Optimization purpose

The goal of Spark tuning is to complete the business workload efficiently without affecting the normal operation of other workloads. In general, achieving this goal requires making full use of the cluster's physical resources, such as CPU, memory, and disk I/O, so that at least one of them approaches its bottleneck.

Optimization of Spark Core

Dynamic resource scheduling in YARN mode

Principle:

Dynamic resource scheduling addresses wasted and unreasonably allocated resources. Based on the current load of the application, the number of Executors is increased or decreased in real time, realizing dynamic resource allocation and making the whole Spark system healthier.

Suitable scenarios: batch tasks, and especially Spark used as a resident service, where dynamic resource scheduling greatly improves resource utilization. For example, a JDBCServer process receives no JDBC requests most of the time, so releasing resources during those idle periods saves a great deal of cluster resources.

Prerequisite: the YARN external shuffle service must be enabled to use this feature.

spark.shuffle.service.enabled=true // enable the external shuffle service

spark.dynamicAllocation.enabled=true // enable dynamic resource allocation

spark.dynamicAllocation.minExecutors // minimum number of Executors

spark.dynamicAllocation.initialExecutors // initial number of Executors

spark.dynamicAllocation.maxExecutors // maximum number of Executors

spark.dynamicAllocation.executorIdleTimeout // idle timeout for ordinary Executors
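
A minimal configuration sketch in Java, assuming a YARN cluster whose NodeManagers already run the external shuffle service; the concrete numbers are illustrative only, not recommendations:

SparkConf conf = new SparkConf()
        .setAppName("DynamicAllocationExample")
        .set("spark.shuffle.service.enabled", "true")          // requires the YARN external shuffle service
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")      // illustrative value
        .set("spark.dynamicAllocation.initialExecutors", "4")  // illustrative value
        .set("spark.dynamicAllocation.maxExecutors", "20")     // illustrative value
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"); // illustrative value
JavaSparkContext jsc = new JavaSparkContext(conf);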

Shuffle stage tuning

1) Use Kryo serialization (KryoSerializer)

Spark supports the Kryo serialization mechanism. Kryo is faster than the default Java serialization and produces smaller serialized output, roughly 1/10 the size of Java serialization. After switching to Kryo, less data is transmitted over the network and the memory consumed across the cluster drops significantly.

The first step is to set the serializer on the SparkConf instance:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

The second step is to register the custom classes that you need Kryo to serialize (CategorySortKey is an example custom class):

conf.registerKryoClasses(new Class[]{CategorySortKey.class});

2) Set a reasonable degree of parallelism

Adjusting the degree of parallelism tunes the number of tasks, the amount of data processed by each task, and how well that matches the machines' processing capacity. Check CPU and memory usage: when tasks and data are not evenly distributed across the nodes but concentrated on a few of them, increasing the degree of parallelism distributes tasks and data more evenly and makes fuller use of the cluster's computing power. Generally, set the parallelism to 2-3 times the total number of CPU cores in the cluster.

Method of setting parallelism:

(1) Set the parallelism parameter on the operator that produces the shuffle; this has the highest priority

    rdd.groupByKey(10);

(2) Set the degree of parallelism via configuration in code; this has the second-highest priority

SparkConf spConf = new SparkConf().setMaster("local[4]")

          .set("spark.default.parallelism", "10");

(3) Configure it in the spark-defaults.conf file; this has the lowest priority

spark.default.parallelism=10

 

3) Use broadcast variables

Principle:

Broadcast distributes a data set to every node. When a Spark task needs that data set during execution, it looks it up locally on its node. Without Broadcast, every task that needs the data set gets the data serialized into the task itself, which not only takes time but also makes the task very large.

Usage scenarios:

  1. When every task needs the same data set during execution, the shared data set can be broadcast to each node, so that each node keeps one local copy.
  2. When a large table is joined with a small table, the small table can be broadcast to each node, converting the join into a normal (non-shuffle) operation and reducing shuffle work.

ArrayList<String> list = new ArrayList<>();

       list.add("test");

       Broadcast<ArrayList<String>> bc = javaSparkContext.broadcast(list);

       // inside a task, read the value locally with bc.value()

4) Use cache

If the same RDD is used multiple times in an application, you can cache it, that is, keep the intermediate result so it is not recomputed on every use.

Commonly used storage levels:

MEMORY_ONLY_SER

MEMORY_ONLY

MEMORY_AND_DISK

DISK_ONLY

rdd.persist(StorageLevel.MEMORY_ONLY);

rdd.unpersist();

5) Use Checkpoint

Checkpoint has two main uses in Spark. In Spark Core, checkpointing an RDD cuts off the RDD's lineage (dependency chain) and saves the RDD data to reliable storage (such as HDFS) for recovery. In Spark Streaming, checkpointing saves the DStreamGraph and related configuration so that when the Driver crashes and restarts, it can resume from where it left off (for example, jobs of previously pending batches continue to be processed after the restart).

rdd.checkpoint() does not itself trigger computation; computation starts only when an action is invoked. After that job finishes, another job is launched to compute the checkpoint of the RDD. Therefore, cache the RDD before checkpointing it, to avoid recomputing its whole dependency chain during the checkpoint job.

javaSparkContext.setCheckpointDir(pathName);

rdd.cache();

rdd.checkpoint();

6) Set spark.shuffle.memoryFraction

This parameter sets the proportion of Executor memory that a task can use for aggregation after pulling the output of the previous stage's tasks during shuffle. The default is 0.2: by default, an Executor devotes only 20% of its memory to this operation. If the memory used during shuffle aggregation exceeds that 20% limit, the excess data is spilled to disk files, which greatly reduces performance.
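
For example, the limit can be raised when an application does heavy shuffle aggregation (the 0.3 below is illustrative; this parameter belongs to the legacy memory model described later in this article):

new SparkConf().set("spark.shuffle.memoryFraction", "0.3") // default is 0.2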

7) Turn on the consolidateFiles optimization

The shuffle read process aggregates while pulling. Each shuffle read task has its own buffer, and on each round it can pull only as much data as fits in that buffer (the buffer size can be configured), which is then aggregated through an in-memory map. Once one batch has been aggregated, the next batch is pulled into the buffer and aggregated, looping until all the data has been pulled and the final result is produced.

After the consolidation mechanism is turned on, a task in the shuffle write phase no longer creates one disk file per task of the downstream stage. Instead, the notion of a shuffleFileGroup appears: each shuffleFileGroup corresponds to a batch of disk files whose number equals the number of tasks in the downstream stage. An Executor can execute as many tasks in parallel as it has CPU cores, and each task in the first parallel batch creates a shuffleFileGroup and writes its data into the corresponding disk files.

When the Executor's CPU cores finish one batch of tasks and move on to the next, the new tasks reuse the existing shuffleFileGroups, including their disk files. In other words, tasks now write their data into already-existing disk files rather than creating new ones. Consolidation therefore lets different tasks reuse the same batch of disk files, effectively merging the disk files of multiple tasks, which greatly reduces the number of disk files and improves shuffle write performance.

new SparkConf().set("spark.shuffle.consolidateFiles", "true") // default is false

Use mapPartitions instead of map

Use mapPartitions to compute the result for an entire partition with a single function call, instead of one function call per record as map does.
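
A minimal sketch of the difference, assuming the Spark 2.x Java API and a JavaRDD<String> named rdd; the function passed to mapPartitions is invoked once per partition, so any per-call setup cost is paid once per partition rather than once per record:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

JavaRDD<Integer> lengths = rdd.mapPartitions((Iterator<String> it) -> {
    // expensive setup placed here would run once per partition
    List<Integer> out = new ArrayList<>();
    while (it.hasNext()) {
        out.add(it.next().length()); // per-record work
    }
    return out.iterator();
});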

Use foreachPartitions instead of foreach

The principle is similar to "use mapPartitions instead of map": one function call processes all the data of a partition instead of one function call per record. In practice, foreachPartitions-style operators prove very helpful for performance. For example, suppose a foreach function writes all the data in an RDD to Oracle. An ordinary foreach operator writes the records one by one, and each function call may create a database connection, so connections are frequently created and destroyed and performance is very low. Using the foreachPartitions operator, which processes one partition's data at a time, only one database connection is needed per partition, followed by a batch insert, and performance is comparatively high.
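
A minimal sketch of this pattern, assuming a JavaRDD<String> named rdd and an illustrative JDBC URL and table; one connection and one batch insert per partition:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

rdd.foreachPartition(iter -> {
    // one connection per partition instead of one per record
    Connection conn = DriverManager.getConnection("jdbc:oracle:thin:@host:1521:orcl", "user", "pass");
    PreparedStatement ps = conn.prepareStatement("INSERT INTO t1 (col1) VALUES (?)");
    while (iter.hasNext()) {
        ps.setString(1, iter.next());
        ps.addBatch();           // accumulate the batch
    }
    ps.executeBatch();           // one batch insert per partition
    ps.close();
    conn.close();
});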

Set the num-executors parameter

This parameter sets the total number of Executor processes the Spark job uses. When the Driver requests resources from the YARN cluster manager, YARN starts the requested number of Executor processes on the cluster's worker nodes as far as possible. The parameter is very important: if it is not set, only a small number of Executor processes are started by default, and the Spark job runs very slowly.

If this parameter is set too low, cluster resources cannot be fully utilized; if set too high, the queue may not be able to provide sufficient resources. It is recommended to set this parameter to 1-5.

 

Set the executor-memory parameter

This parameter sets the memory of each Executor process. Its size often directly determines the performance of the Spark job, and it is also directly related to the common JVM OOM exceptions.

For data exchange business scenarios, it is recommended that this parameter be set to 512M and below.

Set the executor-cores parameter

This parameter sets the number of CPU cores of each Executor process and determines how many task threads each Executor can execute in parallel. Since each CPU core can execute only one task thread at a time, the more CPU cores an Executor process has, the faster it can finish all the task threads assigned to it.
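
The three parameters above can also be set in code; a hedged sketch whose values simply follow this section's recommendations and are not universal:

SparkConf conf = new SparkConf()
        .set("spark.executor.instances", "5")  // same as --num-executors
        .set("spark.executor.memory", "512m")  // same as --executor-memory
        .set("spark.executor.cores", "2");     // same as --executor-cores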

Pay attention to the use of collect

The collect operation sends the Executors' data to the Driver, so make sure the Driver has enough memory before calling collect, to avoid an OutOfMemory error in the Driver process. When the size of the data is uncertain, use operations such as saveAsTextFile to write the data to HDFS instead. Use collect only when the data size can be roughly estimated and the Driver memory is sufficient.
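
A hedged sketch of the safer alternatives (the output path is illustrative):

rdd.saveAsTextFile("hdfs:///user/example/output"); // write to HDFS instead of collecting
List<String> sample = rdd.take(100);               // bounded preview, unlike collect()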

Use reduceByKey instead of groupByKey

reduceByKey performs local aggregation on the map side, making the shuffle smoother, while shuffle operators such as groupByKey do no map-side aggregation. Therefore, wherever reduceByKey can be used, prefer it over groupByKey.
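
A minimal word-count sketch, assuming a JavaRDD<String> named words; reduceByKey computes partial sums on the map side before the shuffle:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b); // partial aggregation happens locally first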

Data skew

When data is skewed, even though there is no GC (Garbage Collection) pressure, task execution times become severely uneven.

  1. Redesign the keys: use finer-grained keys so that task sizes become reasonable (see the salting sketch after this list).
  2. Increase the degree of parallelism.
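
One common way to redesign the keys is salting: prepend a random prefix so that one hot key is spread across several tasks, aggregate, then strip the prefix and aggregate again. A minimal sketch, assuming a JavaPairRDD<String, Integer> named pairs and 10 salt buckets (illustrative):

import java.util.Random;
import scala.Tuple2;

Random rand = new Random();
JavaPairRDD<String, Integer> salted = pairs
        .mapToPair(t -> new Tuple2<>(rand.nextInt(10) + "_" + t._1(), t._2()));
JavaPairRDD<String, Integer> partial = salted.reduceByKey((a, b) -> a + b); // hot key spread over 10 salted keys
JavaPairRDD<String, Integer> result = partial
        .mapToPair(t -> new Tuple2<>(t._1().substring(t._1().indexOf('_') + 1), t._2()))
        .reduceByKey((a, b) -> a + b); // final aggregation on the original key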

Convert text format data on HDFS to Parquet format data

With a columnar storage layout, a query that touches only some columns needs to read only the data blocks of those columns rather than the whole table, reducing I/O overhead. Parquet also supports flexible compression options that can significantly reduce the storage footprint on disk.
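
A minimal conversion sketch, assuming Spark SQL and illustrative HDFS paths:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("TextToParquet").getOrCreate();
Dataset<Row> text = spark.read().text("hdfs:///user/example/input"); // one "value" column per line
text.write().parquet("hdfs:///user/example/output_parquet");         // columnar, compressed output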

Optimization of Spark SQL

Use partitioned tables

Before joining large tables (more than 1 GB of data) with each other, or before aggregating a large table, you can partition the large table by the join or aggregation fields when creating it. This can avoid shuffle operations and improve performance.
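
A hedged sketch of creating such a table with the DataFrame API, assuming a Dataset<Row> named df; the table and column names are illustrative:

// partition the table on the field used later for joins or aggregations
df.write()
  .partitionBy("dt")
  .mode(SaveMode.Overwrite)
  .saveAsTable("big_table_partitioned");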

Use broadcast

Broadcast the small table to every node, turning the join into a non-shuffle operation and improving task execution performance.

spark.sql.autoBroadcastJoinThreshold=10485760 // -1 disables broadcasting
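
Besides tuning the threshold, a broadcast can be requested explicitly with the broadcast() function; a hedged sketch assuming two Datasets named bigDf and smallDf joined on an illustrative "id" column:

import static org.apache.spark.sql.functions.broadcast;

Dataset<Row> joined = bigDf.join(broadcast(smallDf), "id"); // small side is shipped to every node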

Use repartitioning to optimize small files

df.repartition(5).write.mode(SaveMode.Append).saveAsTable("t1");

Wide dependency and narrow dependency

1) Wide dependency: one partition of the parent RDD corresponds to multiple partitions of the child RDD.

Examples: groupByKey, reduceByKey, sortByKey, join. Operators that produce a shuffle are generally wide dependencies.

2) Narrow dependency: one or more partitions of the parent RDD correspond to a single partition of the child RDD.

Examples: map, filter, union, co-partitioned join.

In short: a wide dependency is one-to-many; a narrow dependency is one-to-one or many-to-one.

Spark Shuffle

In Spark, shuffle happens between two stages, and the component responsible for executing, computing, and processing the shuffle is the ShuffleManager. As Spark evolved, the ShuffleManager gained two implementations, HashShuffleManager and SortShuffleManager, so Spark's shuffle comes in two flavors: hash shuffle and sort shuffle.

Before Spark 1.2, the default shuffle engine was HashShuffleManager; in Spark 1.2 and later, the default became SortShuffleManager. HashShuffleManager generates many small intermediate files. SortShuffleManager improves on this: although each task still produces several temporary disk files while performing the shuffle, in the end they are all merged into a single disk file, so each task leaves exactly one disk file. When a shuffle read task of the next stage pulls its data, it only reads its own portion of each disk file according to the index.

Division of Spark stages

Operations on RDDs fall into two types: transformations and actions. The actual job submission and execution happen only after an action. Once an action is called, all the transformations applied to the original input data are encapsulated into a job and submitted to the cluster to run. Roughly:

  1. The DAGScheduler analyzes the dependencies between RDDs, deriving the chain of transformations from the DAG.
  2. Based on the RDD dependencies found by the DAGScheduler, the job is divided into multiple stages.
  3. Each stage generates a TaskSet and submits it to the TaskScheduler, which takes over scheduling and distributes the tasks to workers for execution.

Division of stages:

Stages are divided by examining the dependency relationships in the DAG and cutting the dependency chain. The tasks within each stage can run in parallel, while the stages themselves execute in order until the whole job completes. The scheduler starts from the end of the DAG and traverses the dependency chain in reverse: when it encounters a ShuffleDependency (the class representing a wide dependency) it cuts the chain there, and when it encounters a NarrowDependency it adds the RDD to the current stage. The number of tasks in a stage is determined by the number of partitions of the RDD at the end of the stage; RDD transformations are coarse-grained computations over partitions, and the result of executing a stage is the RDD composed of those partitions.

 

yarn-cluster and yarn-client modes

The difference between yarn-cluster and yarn-client mode is really the difference in the ApplicationMaster process. In yarn-cluster mode, the Driver runs inside the AM (ApplicationMaster), which is responsible for requesting resources from YARN and supervising the job's running status; after the user submits the job, the client can be closed and the job keeps running on YARN, which is why yarn-cluster mode is not suitable for interactive jobs. In yarn-client mode, the ApplicationMaster only requests Executors from YARN, and the client communicates with the allocated containers to schedule their work, which means the client cannot go away.

Yarn-cluster is suitable for production environments; yarn-client is suitable for interaction and debugging.

Set spark.serializer

spark.serializer (default: org.apache.spark.serializer.JavaSerializer)

    It is recommended to set this to org.apache.spark.serializer.KryoSerializer, because KryoSerializer is faster than JavaSerializer. However, some objects may fail to serialize with Kryo; in that case, explicitly register the failing classes by configuring the spark.kryo.registrator parameter.

Spark's Executor memory allocation

A Spark Executor uses two types of memory:

On-heap memory: managed by the JVM

Off-heap memory: not managed by the JVM

Executor on-heap memory:

Spark's memory within an Executor is divided into three parts: execution memory, storage memory, and other memory.

Execution and storage are the big consumers of Executor memory; "other" occupies much less. In versions before spark-1.6.0, the allocation of execution and storage memory was fixed, configured by spark.shuffle.memoryFraction (the fraction of total Executor memory used for execution, default 0.2) and spark.storage.memoryFraction (the fraction used for storage, default 0.6). Because these two regions were isolated from each other before 1.6.0, the Executor's memory utilization was low, and users had to tune the two parameters to the specifics of each application to optimize Spark's memory usage. From spark-1.6.0 onward, execution memory and storage memory can borrow from each other, which improves memory utilization in Spark and reduces OOM.

  1. Execution memory is where, according to the documentation, joins and aggregates are executed. Shuffle data is also buffered in this memory first and written to disk only when it is full, which reduces I/O; in fact, the map side also executes in this memory. Its share defaults to 0.2, while the total Executor memory itself is set by --executor-memory or spark.executor.memory when the application starts.
  2. Storage memory is where broadcast, cache, and persist data is stored. Its share defaults to 0.6, set by the spark.storage.memoryFraction parameter.

Other memory is reserved for the program itself during execution; its share defaults to 0.2.

Executor off-heap memory:

Off-heap memory is disabled by default. It can be enabled with the spark.memory.offHeap.enabled parameter, and the size of the off-heap space is set with the spark.memory.offHeap.size parameter. Off-heap memory mainly stores serialized binary data.

Off-heap space allocation is relatively simple: there is no "other" region, the sizes of storage memory and execution memory are fixed, and all concurrently running tasks share them.
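
A hedged configuration sketch (the 1 GB size is illustrative):

new SparkConf()
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", "1073741824"); // off-heap size in bytes (1 GB here)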
