6. Spark - Spark Tuning

[TOC]

1. Introduction to Spark tuning

1.1 What is Spark tuning

Spark is in essence a distributed computing engine, so application performance is affected by every resource in the cluster: CPU, network bandwidth, memory, and so on. Normally, if memory is large enough, the other factors become the main influence on performance. The need for tuning arises when resources are not sufficient; at that point it becomes necessary to regulate how resources are used and to use them more efficiently. For example, if memory is too tight to hold all of the data (say, 1 billion records), memory usage has to be tuned to reduce memory consumption.

1.2 The main directions of Spark tuning

Most Spark performance optimization work targets memory usage. Typically, when the amount of data a Spark program processes is small and fits in memory, then as long as the network is normal there are generally no major performance problems. However, Spark application performance problems often appear when a large amount of data has to be processed (a data surge). If the environment cannot keep up, it may even lead to the collapse of the cluster.
Besides memory tuning, there are other means of performance optimization. For example, if a Spark job interacts with MySQL, the performance of MySQL itself must also be taken into account.

1.3 The main technical means of Spark tuning

1. Use a high-performance serialization library. The goal is to reduce serialization time and the size of the serialized data.
2. Optimize data structures. The goal is to reduce memory footprint.
3. Persist (RDD cache) and checkpoint RDDs that are used multiple times.
4. Use a serialized persistence level: MEMORY_ONLY is not serialized, MEMORY_ONLY_SER is serialized.
MEMORY_ONLY takes up more memory space than MEMORY_ONLY_SER.
Note, however, that serialization increases CPU cost, so weigh the trade-off.
5. Tune Java virtual machine garbage collection.
6. Tune shuffle: 90% of problems come from shuffle (this was a serious problem in the 1.x versions; in 2.x it has basically been optimized upstream, so in 2.x this problem can largely be ignored).

Other ways to optimize performance:
increase computation parallelism
broadcast shared data

The following sections analyze these six tuning techniques.

2. Diagnosing Spark memory usage

2.1 Memory overhead (where memory goes)

1. Each Java/Scala object consists of two parts: an object header, which takes 16 bytes and mainly holds meta-information such as a pointer to the object's class, and the object data itself. For very small objects such as an int, the header is larger than the object's own data.

2. A String object takes about 40 bytes more than its raw data, used to store the String's meta-information.
Internally, String keeps the character sequence in a char array and also stores information such as the array length. String uses UTF-16 encoding, so each character takes 2 bytes.
For example, a String containing 10 characters occupies about 2*10 + 40 bytes.

3. Collection types such as HashMap and LinkedList use linked data structures internally and wrap every element in an Entry object. An Entry not only has its own object header but also a pointer to the next Entry, which takes 8 bytes. In short, any type that internally contains many objects uses more memory: besides the element data itself, every additional object brings another object header, and that adds up to a lot of memory.

4. Collections of primitive types, such as a collection of int, internally store their elements with the wrapper class Integer.
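To see these overheads concretely, Spark ships a utility class, org.apache.spark.util.SizeEstimator, whose estimate method approximates how much heap an object occupies. A minimal sketch (the 10,000-element size and the objects compared are only illustrative):

import org.apache.spark.util.SizeEstimator

// a primitive array vs. a boxed collection holding the same 10,000 ints
val primitive = Array.fill(10000)(1)              // Array[Int], stored unboxed
val boxed = new java.util.ArrayList[Integer]()    // List<Integer>, every element boxed in an Integer object
primitive.foreach(i => boxed.add(i))

println("Array[Int]     : " + SizeEstimator.estimate(primitive) + " bytes")
println("List<Integer>  : " + SizeEstimator.estimate(boxed) + " bytes")
println("10-char String : " + SizeEstimator.estimate("abcdefghij") + " bytes")

The boxed collection should come out several times larger than the primitive array, for exactly the reasons listed above.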

2.2 Checking a Spark program's memory usage

View the running log under the driver's log directory:

less ${spark_home}/work/app-xxxxxx/0/stderr
You will see messages similar to the following:
INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 320.9 KB, free 366.0 MB)
19/07/05 05:57:47 INFO MemoryStore: Block rdd_3_1 stored as values in memory (estimated size 26.0 MB, free 339.9 MB)
19/07/05 05:57:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2212 bytes result sent to driver
19/07/05 05:57:48 INFO MemoryStore: Block rdd_3_0 stored as values in memory (estimated size 26.7 MB, free 313.2 MB)

estimated size 320.9 KB: the approximate amount of memory used by this block
free 366.0 MB: the remaining free memory

From these messages you can tell roughly how much memory a task is using.

3. Spark tuning techniques

3.1 High-performance serialization library

3.1.1 Where Spark uses serialization

As a distributed system, Spark, like other distributed systems, needs serialization, and in any distributed system serialization is a very important part. If the serialization technology is slow, or the serialized data is large, the performance of the distributed application degrades significantly. Therefore, the first step of Spark performance optimization is to optimize serialization performance.
Spark uses serialization in several places, for example during shuffle. Weighing convenience against performance, Spark chose convenience and uses Java's default serialization mechanism. As discussed before, Java serialization does not perform well: it is slow and the serialized data is large. Therefore, in production it is generally better to change the serialization mechanism Spark uses.

3.1.2 Configuring Spark to use Kryo serialization

Spark supports using Kryo for serialization. Kryo is faster than Java serialization and produces a smaller footprint, roughly 10 times smaller, but it is somewhat less convenient to use.
Configuring Spark to use Kryo:

When Spark reads its configuration, it reads the files under the conf directory; one of them, spark-defaults.conf, is used to specify some of Spark's runtime parameters.

vim spark-defaults.conf
spark.serializer        org.apache.spark.serializer.KryoSerializer

This configures Kryo. Of course, you can also set it in the Spark program through the conf object:
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

3.1.3 Kryo library optimization

(1) Increase the buffer size
If a registered custom type to be serialized is itself particularly large, for example containing more than 100 fields, the serialized object becomes too large. At this point Kryo itself needs to be tuned, because Kryo's internal buffer may not be big enough to hold such a large object.

Set the spark.kryoserializer.buffer.max parameter to a larger value.
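
For example (256m is just an illustrative value, not a recommendation):

conf.set("spark.kryoserializer.buffer.max", "256m")
or, in spark-defaults.conf:
spark.kryoserializer.buffer.max        256m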

(2) Register custom types in advance
When using Kryo, for higher performance it is best to register the classes that need to be serialized in advance, for example:

Register them on the SparkConf object:
conf.registerKryoClasses(Array(classOf[Student], classOf[Teacher]))

Note: this mainly matters for custom classes. Spark projects written in Scala usually do not involve that many custom classes, unlike Java.
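
Putting the pieces together, a minimal sketch of a Kryo-enabled SparkConf; the Student and Teacher case classes and the buffer size are assumptions used only for illustration:

import org.apache.spark.{SparkConf, SparkContext}

case class Student(id: Int, name: String)
case class Teacher(id: Int, name: String)

object KryoDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoDemo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // switch to Kryo
      .set("spark.kryoserializer.buffer.max", "256m")                        // enlarge the buffer for large objects
      .registerKryoClasses(Array(classOf[Student], classOf[Teacher]))        // register custom classes up front

    val sc = new SparkContext(conf)
    // ... build, shuffle and cache RDDs of Student/Teacher as usual ...
    sc.stop()
  }
}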

3.2 Optimizing data structures

3.2.1 Overview

Optimizing data structures is mainly about avoiding the extra memory overhead caused by language and syntax features.
Core: optimize the local data used inside operator functions and the external data those functions reference.
Goal: reduce memory consumption and footprint.

3.2.2 Specific techniques

(1) Prefer arrays and strings over collection classes.

That is, prefer array over ArrayList, LinkedList, HashMap.
Using int[] saves memory compared with List<Integer>.

As mentioned earlier, collection classes carry extra data and complex class structures, so they use more memory. The point is to keep the structure simple: as long as it meets the requirements, the simpler the better.

(2) Flatten objects into strings.

In practice, data held in a HashMap or List is often concatenated into a single String with a special format.
For example:
Map<Integer,Person> persons = new HashMap<>()
can be optimized into:
id:name,address,idCardNum,family......|id:name,address,idCardNum,family......
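
A minimal sketch of this flattening, assuming a simple Person case class and the delimiters shown above:

case class Person(name: String, address: String, idCardNum: String, family: String)

// flatten Map[Int, Person] into "id:name,address,idCardNum,family|id:..."
def flatten(persons: Map[Int, Person]): String =
  persons.map { case (id, p) =>
    s"$id:${p.name},${p.address},${p.idCardNum},${p.family}"
  }.mkString("|")

// parse a record back out when the fields are needed
def parse(encoded: String): Map[Int, Person] =
  encoded.split("\\|").map { record =>
    val Array(id, rest) = record.split(":", 2)
    val Array(name, address, idCardNum, family) = rest.split(",", 4)
    id.toInt -> Person(name, address, idCardNum, family)
  }.toMap

One long String replaces a HashMap plus many small Person objects with a single object, at the cost of parsing whenever a field is needed.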

(3) Avoid deeply nested object structures.

public class Teacher { private List<Student> students = new ArrayList<>(); }
The example above is not good, because the Teacher class nests a large number of small Student objects.
Improvement:
convert it to JSON and process strings instead:
{"teacherId":1,....,"students":[{"studentId":1,.....}]}

(4) Where the scenario allows it, use int instead of String.

Although String performs better than a List, an int takes even less memory.
For example, for database primary keys/ids, an auto-increment integer key is recommended rather than a UUID.

3.3 RDD cache

This is very simple: cache RDDs that are used multiple times in memory, to avoid recomputation when they are used again. For how to do it, see the earlier Spark Core articles.
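
A minimal sketch, assuming sc is an existing SparkContext and the input path is a placeholder:

val words = sc.textFile("hdfs:///data/input.txt")   // placeholder path
  .flatMap(_.split(" "))

words.cache()            // same as persist(StorageLevel.MEMORY_ONLY)
println(words.count())   // first action computes the RDD and fills the cache
println(words.count())   // second action reads from the cache, no recomputation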

3.4 Use serialized caching

By default, when an RDD is cached, its objects are not serialized; that is, the persistence level is MEMORY_ONLY. MEMORY_ONLY_SER is recommended instead, because the data is also serialized, and serialized data takes up less memory. For how to do it, see the earlier Spark Core articles.
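
A minimal sketch, again assuming sc is an existing SparkContext and the path is a placeholder:

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("hdfs:///data/events.txt")   // placeholder path
  .map(_.split(","))

events.persist(StorageLevel.MEMORY_ONLY_SER)   // stored as serialized bytes: less memory, a bit more CPU to read back
events.count()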

3.5 JVM tuning

3.5.1 Background

If a large amount of data is persisted when RDDs are persisted, the Java virtual machine's garbage collector may become a performance bottleneck. The JVM collects garbage periodically: it tracks all Java objects, finds the ones that are no longer used, cleans up those old objects, and makes room for new objects.
The performance overhead of garbage collection is proportional to the number of objects in memory. Also note that JVM tuning only makes sense after the other tuning work has been done, because that work saves memory and uses it more efficiently, and the benefit it brings is much greater than what JVM tuning brings. Even with a well-tuned JVM, if the upper-level application does not use memory well, tuning the JVM alone will not help much.

3.5.2 GC principles

This is mentioned here only so the reader understands the underlying principle; it is easy to find with a web search, so it is not repeated here.

3.5.3 Monitoring garbage collection

We can monitor garbage collection, including how frequently it occurs and how long each collection takes.
In the spark-submit script, add the following configuration:

--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Note: the output goes to the worker-side logs, not the driver log.

/usr/local/spark-2.1.0-bin-hadoop2.7/work/app-20190705055405-0000/0
this is the driver log

/usr/local/spark-2.1.0-bin-hadoop2.7/logs
this is the worker log
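
For context, a full spark-submit invocation with this flag might look like the sketch below (the master URL, class name and jar path are placeholders):

spark-submit \
  --master spark://master:7077 \
  --class com.example.MyApp \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/my-app.jar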

3.5.4 Tuning the Executor memory ratio

For GC tuning, the most important knob is the ratio between the memory used to cache RDDs and the memory used for objects created while operators execute. By default, Spark uses 60% of each Executor's memory to cache RDDs, leaving only 40% for objects created during task execution.
In this situation, it is quite likely that the 40% is not enough when a task creates too many or too large objects, which triggers JVM garbage collection. In extreme cases, garbage collection is triggered frequently.
Depending on the actual situation, you can enlarge the object storage space to reduce the probability of GC:

conf.set("spark.storage.memoryFraction", "0.5")
This lowers the proportion of memory used by the RDD cache to 50%.

3.6 Shuffle tuning

In Spark 1.x and earlier, when a shuffle happens, each map task partitions its output according to the number of result tasks (also called reduce tasks), so that each partition is handed to a different result task, and one file is generated per partition. When the number of map tasks is high, this produces a huge number of files, which causes performance problems.
In Spark 2.x, the output of a map task is written into a single data file, plus an index file that records where each partition's data lives inside that file. This guarantees that each task generates only one file, which reduces the IO pressure.
