The Road of Spark Tuning (Part 8): Learning SparkCore Tuning [repost]

Foreword

In big data computing, Spark has become one of the most popular computing platforms. Spark covers a wide range of computation types, including offline batch processing, SQL-style processing, streaming/real-time computation, machine learning, and graph computation, so its applications and prospects are very broad. At Meituan-Dianping, many engineers have already tried Spark in a variety of projects. For most of them (including the author), the original motivation for adopting Spark was simple: to make big data jobs run faster, with higher performance.

However, developing high-performance big data jobs with Spark is not that simple. Without reasonable tuning, a Spark job can run slowly and completely fail to show Spark's advantage as a fast big data computing engine. Therefore, to use Spark well, you must tune its performance properly.

Spark performance tuning consists of many parts; it is not a matter of adjusting a few parameters and immediately seeing better job performance. We need to analyze a Spark job comprehensively, based on its business scenario and data characteristics, and then adjust and optimize it in multiple dimensions to obtain the best performance.

Based on previous experience developing Spark jobs and accumulated practice, the author has summarized a set of performance-optimization guidelines for Spark jobs. The set is divided into several parts: development tuning, resource tuning, data-skew tuning, and shuffle tuning. Development tuning and resource tuning are basic principles that every Spark job must pay attention to and follow; they are the foundation of high-performance Spark jobs. Data-skew tuning presents a complete set of solutions for data skew in Spark jobs. Shuffle tuning is for readers with a deeper grasp of Spark's internals; it explains how to tune the shuffle process and its details.

This article is the basics guide of the Spark performance-optimization series, covering development tuning and resource tuning.

Development Tuning

Tuning Overview

The first step of Spark performance optimization is to follow some basic principles and apply them while developing the job. Development tuning means becoming familiar with the following basic Spark development principles, including: careful RDD lineage design, rational use of operators, and optimization of special operations. During development, you should keep these principles in mind at all times and apply them flexibly to your own Spark jobs according to your specific business and application scenarios.

Principle One: Avoid creating duplicate RDDs

Generally speaking, when developing a Spark job, we first create an initial RDD from a data source (such as a Hive table or an HDFS file); then we apply an operator to that RDD to get the next RDD; and so on, until we have computed the final result we need. In this process, multiple RDDs are strung together by operators (such as map, reduce, etc.), and this "RDD string" is the RDD lineage, i.e. the "RDD kinship chain".

During development we should pay attention to this: for the same piece of data, create only one RDD; do not create multiple RDDs to represent the same data.

A Spark beginner at the start of developing a job, or even an experienced engineer developing a job whose RDD lineage is extremely long, may forget that an RDD has already been created for a given piece of data and create another one for the same data. This means the Spark job repeats the computation that creates each of those RDDs representing the same data, adding unnecessary performance overhead.

A simple example

// We need to perform one map operation and then one reduce operation on the HDFS file named "hello.txt".
// In other words, two operators must be applied to the same piece of data.

// Wrong approach: creating multiple RDDs when applying multiple operators to the same data.
// Here textFile is called twice, creating two RDDs for the same HDFS file, and then an operator is applied to each RDD separately.
// In this case, Spark loads the contents of hello.txt from HDFS twice and creates two separate RDDs; the performance cost of the second load and RDD creation is plainly wasted.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd1.map(...)
val rdd2 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd2.reduce(...)

// Correct approach: when applying multiple operators to one piece of data, use only one RDD.
// This is clearly better than the previous version, because we create only one RDD for the data and apply multiple operators to that single RDD.
// Note, however, that the optimization is not finished here: two operators run on rdd1, so when the reduce executes, rdd1's data is recomputed from the source a second time, and the cost of the duplicate computation remains.
// To solve this completely, combine this with "Principle Three: Persist RDDs that are used multiple times", which guarantees that an RDD used multiple times is computed only once.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd1.map(...)
rdd1.reduce(...)

Principle Two: Reuse the same RDD as much as possible

Besides avoiding multiple RDDs for exactly the same data, we should also reuse one RDD when applying operators to data that overlaps. For example, suppose one RDD holds data in key-value format and another holds single values, and the values in the two RDDs are exactly the same. Then we can use just the key-value RDD, because it already contains the other RDD's data. For cases like this, where the data of multiple RDDs overlaps or one contains another, we should try to reuse one RDD, reducing the number of RDDs and therefore the number of operator executions as much as possible.

A simple example

// Wrong approach.

// There is an RDD in <Long, String> format, rdd1.
// Due to business needs, a map is applied to rdd1 to create rdd2, whose data is merely the value part of rdd1; in other words, rdd2 is a subset of rdd1.
JavaPairRDD<Long, String> rdd1 = ...
JavaRDD<String> rdd2 = rdd1.map(...)

// Different operators are applied to rdd1 and rdd2 respectively.
rdd1.reduceByKey(...)
rdd2.map(...)

// Correct approach.

// In the case above, the only difference between rdd1 and rdd2 is the data format; rdd2's data is entirely a subset of rdd1's, yet two RDDs were created and an operator applied to each.
// The extra map that creates rdd2 is one more operator execution, which adds performance overhead.

// In this situation the same RDD can be fully reused.
// We can use rdd1 for both the reduceByKey operation and the map operation.
// In the second map operation, just use tuple._2 of each record, i.e. the value part of rdd1.
JavaPairRDD<Long, String> rdd1 = ...
rdd1.reduceByKey(...)
rdd1.map(tuple._2...)

// Compared with the first version, the second clearly saves the cost of computing rdd2.
// But the optimization is not finished here: we still apply two operators to rdd1, so rdd1 is in fact still computed twice.
// Therefore this must be combined with "Principle Three: Persist RDDs that are used multiple times" to guarantee that an RDD used multiple times is computed only once.

Principle Three: Persist RDDs that are used multiple times

When you find yourself applying operators to one RDD multiple times in your Spark code, congratulations: you have achieved the first optimization step, namely reusing RDDs as much as possible. On this basis comes the second optimization step: guaranteeing that when multiple operators run on one RDD, the RDD itself is computed only once.

By default, Spark's behavior when multiple operators run on one RDD is this: each time an operator runs, Spark recomputes the RDD from the source and then applies your operator to it. This gives poor performance.

Therefore our advice in this case is: persist any RDD that is used multiple times. Spark will then save the RDD's data to memory or disk according to your persistence strategy. Every subsequent operator on that RDD extracts the persisted data directly from memory or disk and runs on it, instead of recomputing the RDD from the source before running.

Sample code for persisting an RDD that is used multiple times

// To persist an RDD, just call cache() or persist() on it.

// Correct approach.
// cache() means: try to persist all of the RDD's data to memory in non-serialized form.
// Now when two operators run on rdd1, rdd1 is computed from the source only once, during the first map.
// When the reduce runs, the data is fetched directly from memory for the computation; the RDD is not recomputed.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt").cache()
rdd1.map(...)
rdd1.reduce(...)

// persist() means: choose a persistence level manually and persist in the specified way.
// For example, StorageLevel.MEMORY_AND_DISK_SER means: persist to memory first when memory is sufficient, and to disk files when it is not.
// The _SER suffix means the RDD's data is saved in serialized form: each partition of the RDD is serialized into one large byte array before being persisted to memory or disk.
// Serialization reduces the memory/disk footprint of the persisted data, preventing persisted data from occupying too much memory and causing frequent GC.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt").persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd1.map(...)
rdd1.reduce(...)

For the persist() method, we can choose different persistence levels according to different business scenarios.

Spark's persistence levels

Persistence level — Meaning

MEMORY_ONLY: Store the data in memory as non-serialized Java objects. If memory is not large enough to hold all the data, some of it may not be persisted, and the next time an operator runs on this RDD, the non-persisted data must be recomputed from the source. This is the default persistence strategy; it is what the cache() method actually uses.

MEMORY_AND_DISK: Store the data in memory as non-serialized Java objects, giving memory priority. If memory cannot hold all the data, the overflow is written to disk files, and the next time an operator runs on this RDD, the persisted data in those disk files is read back and used.

MEMORY_ONLY_SER: Basically the same as MEMORY_ONLY. The only difference is that the RDD's data is serialized: each partition of the RDD is serialized into one byte array. This saves more memory, preventing persisted data from occupying too much memory and causing frequent GC.

MEMORY_AND_DISK_SER: Basically the same as MEMORY_AND_DISK. The only difference is that the RDD's data is serialized: each partition of the RDD is serialized into one byte array. This saves more memory, preventing persisted data from occupying too much memory and causing frequent GC.

DISK_ONLY: Write all the data to disk files as non-serialized Java objects.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: For any of the levels above, the _2 suffix means every piece of persisted data is duplicated, with one copy saved locally and the other sent to another node. This replica-based persistence is mainly for fault tolerance: if a node fails and the persisted data in its memory or disk is lost, subsequent computation on the RDD can still use the replica on another node. Without a replica, the lost data could only be recomputed from the source.

How to choose the most appropriate persistence strategy

  • By default, MEMORY_ONLY gives the highest performance, but only if your memory is large enough to hold all the data of the entire RDD with room to spare. With no serialization or deserialization, that part of the overhead is avoided; subsequent operators on the RDD run purely on data in memory, without reading from disk files, so performance is high; and no replica needs to be copied and transmitted to remote nodes. But note that in a real production environment the scenarios where this strategy can be used directly are still limited: if the RDD holds a lot of data (say, billions of records), persisting at this level may cause the JVM to throw an OOM (out-of-memory) exception.

  • If memory overflows at the MEMORY_ONLY level, it is recommended to try the MEMORY_ONLY_SER level. This level serializes the RDD's data before storing it in memory, so each partition is just one byte array, which greatly reduces the number of objects and the memory footprint. Its extra overhead compared with MEMORY_ONLY is mainly serialization and deserialization, but subsequent operators still run on in-memory data, so overall performance remains relatively high. As above, if the RDD's data volume is too large, an OOM exception can still occur.

  • If no pure-memory level works, it is recommended to use the MEMORY_AND_DISK_SER strategy rather than MEMORY_AND_DISK. Reaching this step means the RDD's data is large and memory cannot hold it all. Serialized data is smaller, saving both memory and disk space. This strategy still tries to cache data in memory first and writes to disk only what memory cannot hold.

  • DISK_ONLY and the _2-suffixed levels are generally not recommended. Reading and writing data entirely through disk files causes a sharp performance drop, sometimes worse than simply recomputing the whole RDD. The _2 levels must replicate all data and send it to other nodes; the data replication and network transfer incur a large performance cost, so they are not recommended unless the job requires high availability.
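To make the fallback order above concrete, here is a minimal sketch of switching levels, reusing the hello.txt file from the earlier examples; in real code you would pick one level up front rather than re-persisting like this:

import org.apache.spark.storage.StorageLevel

val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")

// First choice: pure memory, non-serialized objects (equivalent to cache()).
rdd1.persist(StorageLevel.MEMORY_ONLY)

// If MEMORY_ONLY hits OOM, release the old copy and fall back to the serialized in-memory level.
rdd1.unpersist()
rdd1.persist(StorageLevel.MEMORY_ONLY_SER)

// If even serialized data cannot fit in memory, let the overflow spill to disk.
rdd1.unpersist()
rdd1.persist(StorageLevel.MEMORY_AND_DISK_SER)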

Principle Four: Try to avoid shuffle operators

If possible, avoid using operators that cause a shuffle. When a Spark job runs, the shuffle is where the most performance is consumed. The shuffle process, simply put, pulls the records with the same key, distributed across multiple nodes of the cluster, onto the same node for aggregation, join, or similar operations. Operators such as reduceByKey and join trigger shuffle operations.

During the shuffle, each node first writes records with the same key to local disk files, and then other nodes pull those records across the network from each node's disk files. Moreover, when records with the same key are pulled onto one node for aggregation, a single node may have to process so many keys that memory cannot hold them all, spilling the overflow to disk files. So the shuffle can involve a large amount of disk file read/write IO as well as network transfers of data. Disk IO and network data transfer are the main reasons the shuffle performs poorly.

Therefore, during development, we should avoid operators that shuffle, such as reduceByKey, join, distinct, and repartition, wherever possible, and prefer non-shuffle operators of the map class. A Spark job with no shuffle operations, or with only a few, can save a great deal of performance overhead.

Sample code for a join implemented with Broadcast plus map

// A traditional join triggers a shuffle,
// because the records with the same key in the two RDDs must be pulled over the network onto one node and joined by one task.
val rdd3 = rdd1.join(rdd2)

// A Broadcast + map join does not trigger a shuffle.
// Use Broadcast to turn the smaller RDD's data into a broadcast variable.
val rdd2Data = rdd2.collect()
val rdd2DataBroadcast = sc.broadcast(rdd2Data)

// Inside the rdd1.map operator, all of rdd2's data can be obtained from rdd2DataBroadcast.
// Traverse it, and if some record in rdd2 has the same key as the current record of rdd1, the two can be joined.
// Then concatenate the current rdd1 record with the matching rdd2 record in whatever form you need (a String or a Tuple).
val rdd3 = rdd1.map(rdd2DataBroadcast...)

// Note: this approach is recommended only when rdd2's data is fairly small (say, a few hundred MB, or one or two GB),
// because a full copy of rdd2's data will reside in every Executor's memory.
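As a hedged illustration of what the elided rdd1.map(rdd2DataBroadcast...) call might look like in full, here is a minimal sketch assuming rdd1 and rdd2 are both RDD[(Long, String)] and that rdd2's keys are unique; the names and the inner-join semantics are illustrative, not from the original article:

// Collect the small RDD and broadcast it as a lookup map (assumes unique keys).
val rdd2DataBroadcast = sc.broadcast(rdd2.collect().toMap)

// Each task probes the broadcast map held once per Executor; no shuffle occurs.
// Keys absent from rdd2 are dropped, mimicking an inner join.
val rdd3 = rdd1.flatMap { case (key, value1) =>
  rdd2DataBroadcast.value.get(key).map(value2 => (key, (value1, value2)))
}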

Principle Five: Use map-side pre-aggregated shuffle operations

If the business requires a shuffle operation and no map-class operator can replace it, then try to use operators that pre-aggregate on the map side.

Map-side pre-aggregation means aggregating records with the same key locally on each node before the shuffle, similar to a local combiner in MapReduce. After map-side pre-aggregation, each node holds only one record locally per key, because all records with the same key have been merged. When other nodes pull the records for a key from all nodes, the amount of data to pull shrinks dramatically, reducing disk IO and network transfer overhead. Generally speaking, where possible, it is recommended to replace groupByKey with reduceByKey or aggregateByKey, because reduceByKey and aggregateByKey use a user-defined function to pre-aggregate the records with the same key locally on each node, whereas groupByKey performs no pre-aggregation and shuffles and distributes the full set of records between the cluster's nodes, giving relatively poor performance.

For example, the following two figures show a typical word count implemented with groupByKey and with reduceByKey, respectively. The first is a diagram of groupByKey: no local aggregation is performed, and all data is transferred between the cluster's nodes. The second is a diagram of reduceByKey: on each node, records with the same key are pre-aggregated locally before being transferred to other nodes for the global aggregation.
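Since the figures may not render here, the same comparison can be sketched in code; lines is a hypothetical RDD[String] of input text:

val wordPairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// groupByKey: every (word, 1) pair crosses the network, and the summing
// happens only after the shuffle.
val wordCountsSlow = wordPairs.groupByKey().mapValues(_.sum)

// reduceByKey: each node first sums its local pairs per word (map-side
// pre-aggregation, like a MapReduce combiner), so far less data is shuffled.
val wordCountsFast = wordPairs.reduceByKey(_ + _)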

Principle Six: Use high-performance operators

Besides the shuffle-related optimization principles, other operators have corresponding optimization principles too.

  • Use reduceByKey/aggregateByKey instead of groupByKey
    For details, see "Principle Five: Use map-side pre-aggregated shuffle operations."
  • Use mapPartitions instead of plain map
    Operators of the mapPartitions class process all the data of one partition in a single function call, rather than one record per call, so performance is relatively higher. Sometimes, however, mapPartitions causes OOM (out-of-memory) problems: because a single call must process an entire partition's data, if memory is insufficient the garbage collector cannot reclaim many objects, and an OOM exception is likely. So use this class of operators with care!
  • Use foreachPartition instead of foreach
    The principle is similar to "use mapPartitions instead of map": one function call processes all the data of one partition rather than one record at a time. In practice, foreachPartition-style operators turn out to be very helpful for performance. For example, when writing all of an RDD's data to MySQL inside the foreach function, the plain foreach operator writes record by record, and each function call may create a database connection, so connections are created and destroyed frequently and performance is very poor. With foreachPartition, which handles one partition's data at a time, only one database connection is needed per partition, followed by a batch insert, which gives relatively high performance (see the sketch after this list). In practice, for writing on the order of 10,000 records to MySQL, performance can improve by 30% or more.
  • Use coalesce after filter
    When a filter removes a large fraction of an RDD's data (say, 30% or more), it is advisable to use the coalesce operator to manually reduce the RDD's partition count, compacting the data into fewer partitions. After the filter, every partition has had much of its data removed, so if computation proceeds as before, each task processes a partition that holds relatively little data; resources are somewhat wasted, and the more tasks run, the slower things may get. Reducing the partition count with coalesce and compacting the data into fewer partitions means all partitions can be processed with fewer tasks. In some scenarios this helps performance to a degree.
  • Use repartitionAndSortWithinPartitions instead of repartition plus sort
    repartitionAndSortWithinPartitions is an operator recommended on the official Spark site: if you need to sort after repartitioning, use repartitionAndSortWithinPartitions directly. It sorts while it performs the repartitioning shuffle; doing the shuffle and the sort simultaneously is likely faster than shuffling first and sorting afterwards.
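Here is a minimal sketch of the foreachPartition-plus-batch-insert pattern described in the list above, assuming a hypothetical pairRDD of type RDD[(String, Int)]; the JDBC URL, credentials, and table name are placeholders:

import java.sql.DriverManager

pairRDD.foreachPartition { partition =>
  // One connection per partition instead of one per record.
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO word_count(word, cnt) VALUES (?, ?)")
  try {
    partition.foreach { case (word, cnt) =>
      stmt.setString(1, word)
      stmt.setInt(2, cnt)
      stmt.addBatch()
    }
    stmt.executeBatch()  // one batched insert per partition
  } finally {
    stmt.close()
    conn.close()
  }
}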

Principle Seven: Broadcast large external variables

Sometimes during development we need to use an external variable inside an operator function (especially a large one, say a collection over 100MB). In that case, Spark's broadcast feature should be used to improve performance.

When an operator function uses an external variable, by default Spark makes multiple copies of the variable and ships them over the network to the tasks, so every task holds its own copy. If the variable is large (say 100MB, or even 1GB), the network cost of transmitting many copies, and the frequent GC caused by the excessive memory they occupy in each node's Executors, can hurt performance badly.

Therefore, if the external variable in use is large, it is recommended to broadcast it with Spark's broadcast feature. A broadcast variable is guaranteed to have only one copy resident in each Executor's memory, shared by all the tasks running in that Executor. This greatly reduces the number of variable copies, cutting the network transfer cost, reducing the memory pressure on Executors, and lowering the GC frequency.

Sample code for broadcasting a large variable

// The following code uses an external variable inside an operator function.
// Nothing special is done here, so every task gets its own copy of list1.
val list1 = ...
rdd1.map(list1...)

// The following code wraps list1 into a Broadcast-typed broadcast variable.
// When the operator function uses the broadcast variable, it first checks whether the current task's Executor already holds a copy in memory.
// If so, it is used directly; if not, a copy is pulled remotely from the Driver or another Executor node and kept in the local Executor's memory.
// Each Executor's memory holds only one copy of the broadcast variable.
val list1 = ...
val list1Broadcast = sc.broadcast(list1)
rdd1.map(list1Broadcast...)
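As a hedged sketch of the elided rdd1.map(list1Broadcast...) call above, assuming rdd1 is an RDD[String] and list1 a list of strings; the filter logic is illustrative only:

// Read the shared copy via .value inside the operator; each Executor holds
// the list once, instead of one copy per task.
val filtered = rdd1.filter(record => list1Broadcast.value.contains(record))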

Principle Eight: Use Kryo to optimize serialization performance

In Spark, serialization is involved in three main places:

  • When an operator function uses an external variable, the variable is serialized for network transfer (see "Principle Seven: Broadcast large external variables").
  • When a custom type is used as an RDD's generic element type (e.g. JavaRDD<Student>, where Student is a custom type), all objects of the custom type are serialized. The custom class must therefore implement the Serializable interface.
  • When a serialized persistence level (such as MEMORY_ONLY_SER) is used, Spark serializes each partition of the RDD into one large byte array.

In all three of these places, we can use the Kryo serialization library to optimize serialization and deserialization performance. By default Spark uses Java's serialization mechanism, i.e. the ObjectOutputStream/ObjectInputStream API. But Spark also supports the Kryo serialization library, whose performance is far higher than Java serialization's; officially, Kryo is about 10x faster. The reason Spark does not use Kryo by default is that Kryo works best when all custom types to be serialized are registered, which is somewhat inconvenient for developers.

The following is sample code for using Kryo. We only need to set the serializer class and register the custom types to be serialized (such as the types of external variables used in operator functions, custom types used as RDD generic types, and so on):

// Create the SparkConf object.
val conf = new SparkConf().setMaster(...).setAppName(...)
// Set the serializer to KryoSerializer.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register the custom types to be serialized.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

Principle Nine: Optimize data structures

In Java, three kinds of types are particularly memory-hungry:

  • Objects: every Java object carries extra information such as an object header and references, so it takes up extra memory.
  • Strings: every string internally holds a character array plus extra information such as its length.
  • Collection types, such as HashMap and LinkedList: collections usually use internal classes to wrap their elements, such as Map.Entry.

Therefore Spark officially recommends that in Spark code, especially in operator functions, these three kinds of data structures be avoided where possible: prefer strings over objects, primitive types (such as Int, Long) over strings, and arrays over collection types. This reduces memory usage as much as possible, which lowers GC frequency and improves performance.

In the author's own coding practice, however, following this principle turns out not to be easy, because code maintainability must be considered at the same time. If code contains no object abstractions at all and everything is string concatenation, subsequent maintenance and modification become a huge disaster. Likewise, implementing everything with arrays instead of HashMap, LinkedList, and other collections is a great challenge to both coding difficulty and maintainability. The author therefore suggests using less memory-hungry data structures where possible and appropriate, but only on the premise that code maintainability is preserved.
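As an illustrative sketch of the trade-off (both variants are hypothetical, not from the original article), here is counting occurrences of dense integer ids with a HashMap versus a primitive array:

// Readable but memory-hungry: boxed Integer keys/values plus Map.Entry wrappers.
val countsMap = new java.util.HashMap[Integer, Integer]()

// Leaner: if the ids are dense in [0, n), a primitive Int array holds the same
// counts with no object headers and no boxing, so GC pressure is far lower.
val n = 1000000
val countsArr = new Array[Int](n)
countsArr(42) += 1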

Principle Ten: Data locality levels

PROCESS_LOCAL: process-local. Code and data are in the same process, i.e. the same Executor; the task computing the data runs in the Executor, and the data sits in that Executor's BlockManager. Best performance.

NODE_LOCAL: node-local. Code and data are on the same node. For example, the data is an HDFS block on the node and the task runs in some Executor on that node; or the data and the task are in different Executors on the same node. Data must be transferred between processes.

NO_PREF: for the task, it makes no difference where the data comes from; no location is better or worse.

RACK_LOCAL: rack-local. Data and task are on two nodes in the same rack; data must be transferred between nodes over the network.

ANY: data and task may be anywhere in the cluster, and not in the same rack. Worst performance.

spark.locality.wait, default 3s

On the Driver, before assigning the tasks of each stage of the Application, Spark computes which slice of data each task will process, i.e. which partition of the RDD. Spark's task-assignment algorithm prefers to place each task exactly on the node that holds its data, so no data needs to be transferred over the network.

However, a task may not get the chance to be assigned to the node holding its data, because that node's computing resources and capacity may be fully occupied. In such cases, Spark typically waits for a while, 3 seconds by default (not absolute; there are many situations, and it waits separately for each locality level), and when it finally cannot wait any longer, it picks a worse locality level: for example, it assigns the task to a node fairly close to the one holding the data, and computes there.

In the second case, data transfer is generally unavoidable: the task fetches the data through the BlockManager on its own node; finding no local copy, the BlockManager uses a getRemote() call, via the TransferService (the network data-transfer component), to fetch the data from the BlockManager of the node where the data resides and transfer it over the network back to the task's node.

We obviously do not want anything like the second case. The best situation is the task and the data on one node, fetching data straight from the local Executor's BlockManager: pure memory, or with a little disk IO. If data must be transferred over the network, performance will certainly drop; heavy network transfer and disk IO are performance killers.

When should this parameter be tuned?

Observe the Spark job's run logs. It is recommended to use client mode while testing, so the fairly complete logs can be seen directly on the local machine.
The logs show lines like "starting task ..." with PROCESS_LOCAL or NODE_LOCAL; observe the data-locality level of the majority of tasks.

If most tasks are PROCESS_LOCAL, there is no need to tune.
If you find that many tasks are at NODE_LOCAL or ANY, it is best to tune the data-locality wait times.
Tuning should be iterative: after each adjustment, run again and observe the logs.
Check whether the locality level of most tasks has improved, and whether the overall runtime of the Spark job has shortened.

But beware of getting things backwards: if the locality level improves yet the job's runtime increases because of the long waits, then do not tune it after all.

spark.locality.wait defaults to 3s; it can be raised to, say, 6s or 10s.

By default, the three wait times below all follow the setting above, i.e. 3s:

spark.locality.wait.process  // suggested: 60s
spark.locality.wait.node     // suggested: 30s
spark.locality.wait.rack     // suggested: 20s
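A minimal sketch of applying these waits through SparkConf; the values are the suggestions above, not universal defaults:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "10s")
  .set("spark.locality.wait.process", "60s")
  .set("spark.locality.wait.node", "30s")
  .set("spark.locality.wait.rack", "20s")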

Source: www.cnblogs.com/cjunn/p/12234198.html