Spark Tuning (1)

The tuning tips below come partly from the official documentation, partly from other engineers, and partly from my own experience.

Basic concepts and principles

      First, let's clarify some basic Spark concepts and principles; otherwise any discussion of performance tuning is pointless:

  •   Each host runs N workers in parallel, each worker runs M executors in parallel, and tasks are assigned to executors for execution. A stage is a set of tasks that run in parallel; no shuffle can occur within a stage, because a shuffle acts like a fence that stops tasks from running side by side. Hitting a shuffle therefore marks a stage boundary.
  •   CPU cores: each executor can occupy one or more cores, and by observing CPU usage you can see how computing resources are actually being used. A very common kind of waste, for example, is an executor that occupies multiple cores while total CPU usage stays low (an executor does not always exploit its multi-core capacity). In that case, consider giving each executor fewer cores and deploying more executors per worker, or more workers per host, so that more things execute in parallel and CPU utilization rises. When adding executors, however, watch the memory: the more executors a machine's memory is divided among, the less each executor gets, so excessive data spill or even out-of-memory errors can occur.
  •   Parallelism and partitions. A partition is a slice of the data, and each task processes exactly one partition. If the partition count is too small, each slice is too large, causing memory pressure or leaving some executors without enough work; if too large, the slices are tiny and efficiency drops. When an action-type operation executes (such as the various reduce operations), the partition count is taken from the largest parent RDD. Parallelism refers to the default number of partitions the returned data has after a reduce-class operation on an RDD (for map-type operations the partition count is also typically taken from the larger parent RDD, but since no shuffle is involved, the parallelism parameter has no effect there). So the two concepts are closely related: both concern how the data is sliced, and in practice they act in a unified way. You can set the default number of partitions via spark.default.parallelism, and many operations accept an explicit partition-count argument to control the slicing directly.
  •   The two principles above look simple, but choosing values suited to your hardware and workload is critical; hoping for one universal configuration is unrealistic. A few examples: (1) In practice, some EMR Spark jobs ran very slowly with low CPU utilization; we reduced the cores per executor, increased the number of parallel executors, and raised the partition count, which lifted overall CPU utilization and sped up processing. (2) A job kept running out of memory; we increased the partition count (shrinking each slice) while reducing the number of parallel executors, so the same memory was shared among fewer executors and each task effectively got more memory. The job may run a bit slower, but that beats an OOM. (3) With very little data and many small files being generated, partitioning can be reduced so fewer tasks are created. When the raw input itself is small this is usually noticed; but when the data shrinks dramatically mid-pipeline, say after a filter or a reduceBy, this inefficiency is rarely spotted.
  •   Finally, keep in mind that as parameters and configuration change, the performance bottleneck moves too; do not forget this when analyzing problems. For example, as you increase the number of executors deployed per machine, performance initially improves and average CPU usage rises; but as executors per machine keep multiplying, performance degrades, because each executor gets less memory, fewer operations complete in memory, and more and more data spills to disk.
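As a minimal sketch of the partitioning knobs described above, assuming the standard RDD API (the input path and all numbers are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Default partition count used by shuffle operations
// when no explicit count is given.
val conf = new SparkConf()
  .setAppName("parallelism-sketch")
  .set("spark.default.parallelism", "64")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///input/path") // placeholder path
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// Uses spark.default.parallelism (64 partitions) for the shuffle:
val counts = pairs.reduceByKey(_ + _)

// Or control the slicing explicitly for this one operation:
val counts128 = pairs.reduceByKey(_ + _, 128)
```

Running this requires a Spark cluster (or local mode); it only illustrates where the two settings apply.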

      Here is an intuitive example. Initially, the overall CPU utilization is low:


After adjusting according to the principles above, overall CPU utilization rises noticeably:


Second, performance tuning often involves changing configuration. Spark has three common configuration channels; although some parameters can be set through more than one of them, best practice is to use a different channel in each situation:
  1. Environment variables, mainly used for environment- and hardware-related configuration;
  2. Command-line parameters, mainly for settings that change from run to run, introduced with a double dash;
  3. Explicit settings in code (for example in Scala, via a SparkConf object), typically application-level configuration that rarely changes.
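For concreteness, a sketch of the three channels (all values are illustrative):

```shell
# 1. Environment variable (conf/spark-env.sh): environment/hardware level
export SPARK_WORKER_CORES=8

# 2. Command-line parameters: vary from run to run, double-dash form
spark-submit --executor-memory 4G --total-executor-cores 32 my-app.jar

# 3. In code (application level), e.g. in Scala:
#    new SparkConf().setAppName("MyApp")
#                   .set("spark.serializer",
#                        "org.apache.spark.serializer.KryoSerializer")
```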

Take a specific configuration example: adjusting the ratio of workers and executors per slave. We often need to adjust the number of parallel executors, and there are basically two ways:

  •   1. Always run one executor per worker, but adjust how many workers run in parallel on each slave. For example, SPARK_WORKER_INSTANCES sets the number of workers per slave, but when changing it, say to 2, be sure to set SPARK_WORKER_CORES accordingly so that each worker uses half the original cores, allowing the two workers to work side by side;
  •   2. Always deploy only one worker per slave, but deploy multiple executors within that worker. This is how executor counts are adjusted in the YARN-based framework we use: a typical approach is to run a single worker per host, set spark.executor.cores to 1/N of the host's CPU cores, and set spark.executor.memory to 1/N of the memory available for computation on the host, so that the host can start N executors.
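A sketch of the two approaches, assuming a 32-core slave with roughly 40G usable for Spark (all values illustrative):

```shell
# Approach 1 (standalone mode): two workers per slave, one executor each,
# splitting the original 32 cores between them (conf/spark-env.sh):
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=16

# Approach 2 (e.g. on YARN): one worker per host, N executors inside it;
# here N = 4, so each executor gets 1/4 of the cores and memory:
spark-submit \
  --conf spark.executor.cores=8 \
  --conf spark.executor.memory=10G \
  my-app.jar
```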

      Some configurations differ between MR frameworks/tools; for example, some parameter defaults are different under YARN, which needs attention.

With these basics clear, let's go through the main points of performance tuning one by one.

RAM

Memory tuning. Java objects take two to five times the space of the raw data, sometimes even more. The best way to measure an object's memory consumption is to create an RDD, put it into the cache, and watch the change on the Storage tab of the UI; SizeEstimator can also be used to estimate it. Use the -XX:+UseCompressedOops option to compress pointers (from 8 bytes to 4 bytes). Be careful when calling APIs like collect: know exactly what you are doing when copying large blocks of data into memory. Reserve some memory for the operating system, say 20%, which also covers the OS buffer cache; if too little is reserved, you will see errors like this:

“Required executor memory (235520+23552 MB) is above the max threshold (241664 MB) of this cluster! Please increase the value of ‘yarn.scheduler.maximum-allocation-mb’.”

Or there may be no such error at all, yet problems caused by insufficient memory persist, with warnings like this:

“16/01/13 23:54:48 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory”

Sometimes not even such logs appear, and you only see executors lost for no obvious reason:

“Exception in thread “main” org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 17.0 failed 4 times, most recent failure: Lost task 12.3 in stage 17.0 (TID 1257, ip-10-184-192-56.ec2.internal): ExecutorLostFailure (executor 79 lost)”

Reduce task memory usage. In some cases reduce tasks are especially memory-hungry, for instance when a shuffle occurs: operations such as sortByKey, groupByKey, reduceByKey and join build a giant hash table in memory. One remedy is to increase the level of parallelism so that each task's input shrinks correspondingly. Also pay attention to the shuffle memory limit: sometimes total memory is sufficient but shuffle memory is not, and performance still does not improve. For join-heavy operations on large data we often set the shuffle memory limit to 50% of the executor's configured memory.
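Both remedies above can be sketched as follows; spark.shuffle.memoryFraction is the pre-Spark-1.6 knob for the shuffle memory limit, and the path and numbers are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("reduce-memory-sketch")
  // Let shuffle buffers use up to 50% of executor memory
  // (legacy setting; illustrative value).
  .set("spark.shuffle.memoryFraction", "0.5")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///input/path") // placeholder path
  .map(line => (line.split("\t")(0), 1))

// Raise parallelism so each reduce task's input is smaller:
val counts = pairs.reduceByKey(_ + _, 400)
```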

Note the size of the raw input: there are always operations that must hold certain data entirely in memory to complete, and in those cases raising the parallelism or partition count barely reduces the memory footprint. We have met poor-performance and even OOM problems that changing these two parameters could not alleviate at all; they can only be solved by adding memory, either directly (more memory per machine) or indirectly (more machines).

When choosing EC2 instance types, be clear about where the bottleneck is (testing can make this clear). For example, we faced a choice between r3.8xlarge and c3.8xlarge: their compute power is comparable, the former is about 50% more expensive, but it has five times the memory of the latter.

In addition, some RDD APIs such as cache and persist force data into memory; if you are not sure they help, do not use them.

 

CPU

Level of parallelism. Once specified, it becomes the default partition count when reduce-type operations execute. In real projects this parameter is usually essential, and it should generally be determined from the input size and each executor's memory. Set spark.default.parallelism to change the default level of parallelism; as a rule of thumb, each CPU core can be assigned 2-3 tasks.

Shared vs. exclusive CPU cores, i.e. whether the executors on a host share its CPU cores or divide them up exclusively. For example, a machine has 32 CPU cores and 50G of memory available, and we deploy two executors. One arrangement sets spark.executor.cores to 16 and spark.executor.memory to 20G; the memory limit means two executors are deployed on the machine, each using 20G of memory and each exclusively "owning" 16 cores. With the same memory settings, you could instead let the two executors "share" all 32 cores. In my tests, exclusive mode performed slightly better than shared mode.

GC tuning. Print GC information with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. Remember that by default 60% of executor memory can be used for the RDD cache, leaving only 40% as space for newly created objects; this can be changed via spark.storage.memoryFraction. If many small objects are created but cannot all be reclaimed during GC, enlarging the Eden region will certainly help. For tasks that copy data from HDFS there is a simple estimation formula: if the HDFS block size is 64MB, the task's working area holds 4 copies of the data, and decompressing a block inflates it threefold, then the estimated memory consumption is 4 * 3 * 64MB. One more problem we hit at work: the JVM enforces a GC overhead limit by default, roughly that application code must get at least 2% of the time; if vast numbers of objects are created (easy in Spark, with the code pattern of turning one RDD into the next), GC can dominate the time and you get "OutOfMemoryError: GC overhead limit exceeded". You can turn the limit off with -XX:-UseGCOverheadLimit.
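The JVM flags mentioned above are passed to executors via configuration; a sketch (the exact flag set is illustrative):

```shell
spark-submit \
  --conf spark.storage.memoryFraction=0.4 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-UseGCOverheadLimit" \
  my-app.jar
```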

Serialization and transport

Data serialization. The default, Java serialization, is what programmers know best, but its performance is poor in both speed and space. The other option is Kryo serialization, which is faster and more compact, but does not support serialization of arbitrary classes. The Spark UI shows what proportion of total time serialization costs; if that proportion is high, consider optimizing memory usage and serialization.
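Switching to Kryo and registering the classes to be serialized might look like this (MyRecord is a placeholder class):

```scala
import org.apache.spark.SparkConf

case class MyRecord(id: Long, name: String) // placeholder class

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration avoids writing full class names into the serialized output:
  .registerKryoClasses(Array(classOf[MyRecord]))
```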

Broadcasting large variables. When tasks use a large static object, broadcast it. Spark prints the serialized size; generally speaking, anything over 20KB is worth broadcasting. A common scenario is joining a large table with a small table: once the small table is broadcast, the large table's data no longer has to race back and forth between nodes; it quietly stays local and waits for the broadcast of the small table to arrive.
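A sketch of the small-table broadcast join described above (paths, field layout, and names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

// Small table: collect to the driver, broadcast once per executor.
val small: Map[String, String] =
  sc.textFile("hdfs:///small/table").map { line =>
    val Array(k, v) = line.split("\t")
    (k, v)
  }.collect().toMap
val smallBc = sc.broadcast(small)

// The large table stays put; each task looks keys up in the broadcast
// copy, so the large table is never shuffled.
val joined = sc.textFile("hdfs:///large/table").flatMap { line =>
  val Array(k, v) = line.split("\t")
  smallBc.value.get(k).map(sv => (k, v, sv))
}
```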

Data locality. Data and the code that processes it should be put together; since the code is usually smaller than the data, it is faster to ship the code around. Data locality measures how close the data is to the code processing it: PROCESS_LOCAL (same JVM), NODE_LOCAL (same node, e.g. the data on HDFS and the code on the same node), NO_PREF, RACK_LOCAL (different servers, but the same rack), ANY. Priority naturally runs from highest to lowest, but if an idle executor has no unprocessed local data, there are two options:

  • (1) wait for the busy CPUs to free up, so the data can be processed as "local" data, or
  • (2) start a task on the idle executor right away and process the data remotely.

By default Spark waits a while (spark.locality.wait), i.e. policy (1); if the CPUs stay busy, it falls back to policy (2).

References to large objects in code. Be careful when referencing large objects inside a task, because they are serialized along with the task and shipped to every node, causing performance problems. As long as serialization does not throw an exception, few people ever pay attention to which referenced objects got serialized. If a large object is genuinely needed, consider turning it into an RDD instead. Most of the time, large-object serialization happens unwittingly, or beyond what was expected, as in this piece of code from our project:

 
 
    eet.map(r => {
      println(BackfillTypeIndex)
    })

In fact, it is equivalent to this:

 
 
    eet.map(r => {
      println(this.BackfillTypeIndex)
    })

Do not underestimate this "this": sometimes serializing it carries a very large overhead.

The most direct solution to this kind of problem is:

 
 
    val dereferencedVariable = this.BackfillTypeIndex
    rdd.map(r => println(dereferencedVariable)) // "this" is not serialized

Relatedly, the @transient annotation marks variables that should not be serialized, which is very useful for keeping large objects out of the serialization trap. Also note class inheritance hierarchies: sometimes a small case class drags a whole tree along behind it.
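A minimal sketch of the @transient idiom (the class and the loader are placeholders); the lazy val is rebuilt on each executor instead of being shipped from the driver:

```scala
class JobLogic extends Serializable {
  // Excluded from the serialized closure; recomputed lazily
  // on each executor the first time it is accessed.
  @transient lazy val lookup: Map[String, Int] = loadBigTable()

  private def loadBigTable(): Map[String, Int] = Map.empty // placeholder
}
```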

File reads and writes

Optimize file storage and reads. For example, when only a few columns are needed, columnar formats such as RCFile and Parquet greatly reduce the cost of reading files. When storing files on HDFS or S3, choose the form that suits the situation, such as a format with a higher compression ratio. Also, especially when there is a great deal of shuffle, consider leaving some extra memory to the operating system as buffer cache; for instance, with 50G of total memory, allocate the JVM at most a bit over 40G.

File sharding. In the S3 example above, files are stored as shards with a partXX suffix. Using coalesce to set the number of shards, matched to the level of parallelism or an integer multiple of it, can improve read/write performance. But neither too high nor too low is good: too low fails to exploit S3's parallel read/write capacity, while too high means many small files, pre-merge passes, connection-setup costs, and reads/writes that easily exceed S3's throttling limits.

Tasks

Spark speculation. By setting spark.speculation and several related options, Spark can notice tasks that are running particularly slowly and re-execute them without waiting for them to finish; as soon as any one execution of the same task completes, the result of the fastest finisher is adopted.
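The related knobs can be sketched as follows (the non-boolean values shown are illustrative, roughly matching the defaults documented for Spark 1.x):

```shell
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.interval=100 \
  --conf spark.speculation.quantile=0.75 \
  --conf spark.speculation.multiplier=1.5 \
  my-app.jar
```

That is: every 100ms, once 75% of tasks in a stage have finished, any task running 1.5x slower than the median is speculatively re-launched.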

Reduce shuffles. Spark's computation itself is often fast, with much of the overhead going to network and disk I/O, and shuffle is the classic case. For example, with (k, v1) join (k, v2) => (k, v3), Spark optimizes this situation very well: the data to be joined sits in one partition on one node, the join completes quickly, and the result stays on the same node (the whole sequence may sit inside one stage). But if the data structures are designed as (obj1) join (obj2) => (obj3), with the join condition obj1.column1 == obj2.column1, a shuffle is usually forced, because there is no longer a shared key guaranteeing that matching data lives on the same node. When a shuffle is unavoidable, shrink the data as much as possible before it, for example by avoiding patterns like the groupByKey example. The comparison images below, from a Spark Summit 2013 talk, make the same point:


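A sketch of shrinking data before the shuffle: reduceByKey pre-aggregates within each partition before shuffling, whereas the groupByKey version ships every raw pair across the network:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Shuffles every (k, v) pair, then sums: heavy network traffic.
val slow = pairs.groupByKey().mapValues(_.sum)

// Combines map-side before the shuffle: far less traffic, same result.
val fast = pairs.reduceByKey(_ + _)
```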
Repartition. Data volumes sometimes grow and sometimes shrink during a job, so choosing a suitable partition count matters: too many partitions produce many tiny or empty tasks, while too few leave computing resources underused. When necessary, use repartition to adjust, but it is not free; the main price is a shuffle. Another common problem is wildly uneven partition sizes, usually caused by uneven hashing of the partition key values (with the default HashPartitioner); this needs fixing, for example by rewriting the hash algorithm. To find out the partition count during testing, call rdd.partitions.size.
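Rewriting the hashing can be done with a custom Partitioner; a minimal sketch, here swapping in MurmurHash3 as an illustrative alternative to the default hashCode-based scheme:

```scala
import org.apache.spark.Partitioner
import scala.util.hashing.MurmurHash3

// Deterministic partitioner with a different hash than the default
// HashPartitioner; the hashing scheme is illustrative.
class Murmur3Partitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val h = MurmurHash3.stringHash(key.toString)
    ((h % partitions) + partitions) % partitions // non-negative index
  }
}

// Usage: pairRdd.partitionBy(new Murmur3Partitioner(128))
```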

Task time distribution. Watch the Spark UI: on a stage's detail page you can see the total shuffle-write cost, GC time, the current call stack, and each task's time. If you find the task times spread too widely, some very long and some very short, the computation is unevenly distributed; re-examine the data partitioning, the key hashing, the in-task computation logic and so on. The bottleneck lies in the long-running tasks.


Reuse resources. Some resources are expensive to acquire and often quite limited, such as connections. Consider creating them when the partition is set up (for example using the mapPartitions method), so that processing each element within the partition just reuses the connection instead of establishing a new one each time.
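A sketch of per-partition connection reuse (the JDBC URL and the write logic are placeholders):

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.rdd.RDD

// One connection per partition, reused for every element in it,
// instead of one connection per element.
def saveToDb(rdd: RDD[String]): Unit =
  rdd.foreachPartition { iter =>
    val conn: Connection =
      DriverManager.getConnection("jdbc:mysql://host/db") // placeholder URL
    try {
      iter.foreach { elem =>
        // ... write elem using conn ...
        ()
      }
    } finally {
      conn.close()
    }
  }
```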

Reference documents: the official Tuning Spark guide, the official Spark Configuration documentation, the Spark Programming Guide, JVM GC tuning documents, JVM performance tuning documents, and How-to: Tune Your Apache Spark Jobs part-1 & part-2.

Note: this article is reprinted from http://www.raychase.net/3546


Origin blog.csdn.net/weixin_42177380/article/details/90711750