Handling Spark memory overflow (OOM)

Brief introduction

OOM problems in Spark are largely confined to the following two situations:

  • memory overflow during map execution
  • memory overflow after a shuffle

"Map execution" here stands for all map-type operations, including flatMap, filter, mapPartitions and so on.
"Memory overflow after a shuffle" covers shuffle operations such as join, reduceByKey and repartition.
Below I first summarize my understanding of the Spark memory model, and then summarize the solutions and performance optimizations for the various OOM situations. If I have misunderstood anything, please point it out in the comments.

Spark memory model

Memory inside a Spark Executor is divided into three parts: execution memory, storage memory, and other memory.

  • execution memory is the execution memory. The documentation says joins and aggregations are performed in this part of memory, and shuffle data is first cached here and only written to disk once it is full, which reduces IO. In practice, the map process also runs in this memory.
  • storage memory stores broadcast variables, cache and persist data.
  • other memory is the memory reserved for the program itself during execution.

execution and storage are the memory-hungry parts of a Spark Executor; other takes relatively little memory and is not discussed here.
In versions before spark-1.6.0, the sizes of execution and storage memory were fixed, set by the following configuration parameters (each a fraction of the total Executor memory):

  • execution: spark.shuffle.memoryFraction (default 0.2)
  • storage: spark.storage.memoryFraction (default 0.6)

Because these two memory regions were isolated from each other before 1.6.0, Executor memory utilization was not high, and users had to tune these two parameters themselves according to their specific Application to optimize Spark's memory usage.

From spark-1.6.0 onward, execution memory and storage memory can borrow from each other, which improves memory utilization in Spark and also reduces OOM situations.

There are two ways to use off-heap memory. One is to pass StorageLevel.OFF_HEAP when calling persist on an rdd, which requires Tachyon to be used alongside Spark. The other is Spark's own spark.memory.offHeap.enabled configuration set to true, but this approach is not yet usable in version 1.6.0; the parameter merely exists and will only be enabled in later versions.
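A minimal sketch of the two approaches just mentioned (the application name, the size value and the sample data are made up for illustration; the OFF_HEAP storage level on 1.6.x assumes Tachyon is already configured for the cluster):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Approach 2: the built-in off-heap switch (present in 1.6.0 but only honoured in later releases).
val conf = new SparkConf()
  .setAppName("offheap-sketch")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")   // a size must be given when off-heap is enabled
val sc = new SparkContext(conf)

// Approach 1: persist an RDD with the OFF_HEAP storage level (relies on Tachyon on 1.6.x).
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)
rdd.count()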
OOM problems usually occur in the execution memory. Storage memory, once full, simply discards old data in memory, which affects performance but does not cause OOM.

Memory overflow solutions

1. The map process generates too many objects and overflows memory

The reason for this overflow is that a single map result generates a large number of objects.

For example: rdd.map(x => for (i <- 1 to 10000) yield i.toString). In this operation, each element of the rdd generates 10,000 objects, which easily leads to memory overflow. To deal with this without adding memory, reduce the size of each Task so that the objects of one Task still fit in the Executor's memory even if there are many of them. Concretely, call repartition before the map operation that generates the many objects, so the data is split into smaller partitions before being passed to the map. For example: rdd.repartition(10000).map(x => for (i <- 1 to 10000) yield i.toString).
Note that rdd.coalesce cannot be used for this: without a shuffle, that method can only decrease the number of partitions, not increase them.
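A minimal runnable sketch of this fix, with made-up input data and application name:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("repartition-before-map").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 100000)

// Split into many small partitions first, so each Task only materialises a fraction of the objects.
val heavy = rdd.repartition(10000)
  .map(x => for (i <- 1 to 10000) yield i.toString)

// rdd.coalesce(10000) would not help here: without a shuffle, coalesce can only reduce partitions.
heavy.count()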

2. Data imbalance leads to memory overflow

Data imbalance can cause not only memory overflow but also performance problems. The solution is similar to the one above: call repartition to repartition the data. This is not elaborated further here.

3. coalesce calls cause memory overflow

This is a problem I ran into recently. Because hdfs does not handle small files well, if the files produced by a Spark computation are too small, we call coalesce to merge them before storing them back into hdfs. But this can cause a problem. For example, before coalesce there were 100 files, which also means there could be 100 Tasks. Now we call coalesce(10) and only 10 files are produced in the end. Because coalesce performs no shuffle, it does not, as I had assumed, first run 100 Tasks and then merge their results into 10 files; instead only 10 Tasks run from the start. Where originally 100 files were processed separately, now each Task reads 10 files at once, so its memory usage is 10 times the original, which leads to OOM. The solution is to make the program run 100 Tasks as intended and only then merge the results into 10 files. This can be done with repartition: call repartition(10). Because this introduces a shuffle, there are two Stages before and after the shuffle, one with 100 partitions and one with 10, and the job executes the way we intended.
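A sketch of the two variants, assuming a result RDD that naturally has 100 partitions (the data and the hdfs paths are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-vs-repartition").setMaster("local[*]"))
// Stand-in for a computed result RDD with 100 partitions.
val computed = sc.parallelize(1 to 1000000, 100).map(_.toString)

// Risky: coalesce(10) has no shuffle, so the whole computation runs in only 10 Tasks,
// each handling roughly 10 partitions' worth of data (about 10x the memory per Task).
computed.coalesce(10).saveAsTextFile("hdfs:///tmp/out-coalesce")

// Safer: repartition(10) adds a shuffle; the 100-partition Stage still runs as 100 Tasks,
// and only the post-shuffle Stage with 10 partitions writes the 10 files.
computed.repartition(10).saveAsTextFile("hdfs:///tmp/out-repartition")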

4. Memory overflow after a shuffle

Memory overflow after a shuffle can, in effect, always be traced to a single shuffle file being too large. In Spark, operations such as join and reduceByKey involve a shuffle, and a shuffle needs a partitioner. For most shuffle operations in Spark, the default partitioner is HashPartitioner, whose default number of partitions is the maximum number of partitions among the parent RDDs, controlled by the parameter spark.default.parallelism (in spark-sql, spark.sql.shuffle.partitions is used instead). spark.default.parallelism only takes effect for HashPartitioner; with your own Partitioner or some other library's Partitioner, the shuffle concurrency cannot be controlled through spark.default.parallelism. If a shuffle with another partitioner runs out of memory, the number of partitions has to be increased in the partitioner's code.
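A sketch of both ways to raise the shuffle partition count for the default HashPartitioner case (the value 400, the application name and the sample data are illustrative only):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-partitions-sketch")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "400")     // default shuffle parallelism for RDD operations
  .set("spark.sql.shuffle.partitions", "400")  // the spark-sql equivalent
val sc = new SparkContext(conf)

val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, i))

// The same effect expressed per operation: pass the partitioner (or a partition count) explicitly,
// so each shuffle partition stays small enough to fit in execution memory.
val reduced = pairs.reduceByKey(new HashPartitioner(400), _ + _)
reduced.count()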

5. Uneven resource allocation in standalone mode leads to memory overflow

In standalone mode, if the --total-executor-cores and --executor-memory parameters are configured but --executor-cores is not, it can happen that every Executor gets the same amount of memory but a different number of cores. An Executor with more cores can run more Tasks at the same time, which easily causes memory overflow. The solution is to also configure --executor-cores or the spark.executor.cores parameter, so that Executor resources are allocated evenly.
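A sketch of an even resource split expressed as SparkConf settings (the specific core and memory figures are made up; the same values can be passed to spark-submit as the flags noted in the comments):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("standalone-even-resources")
  .set("spark.cores.max", "40")        // total cores for the application (--total-executor-cores)
  .set("spark.executor.cores", "4")    // cores per Executor (--executor-cores)
  .set("spark.executor.memory", "8g")  // memory per Executor (--executor-memory)
// With all three set, every Executor gets 4 cores and 8g, instead of the same 8g with varying cores.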

 

6. Sharing objects inside an RDD can reduce OOM

This case is rather special, so I record it here. We ran into a situation where rdd.flatMap(x => for (i <- 1 to 1000) yield ("key", "value")) caused OOM, but under the same conditions rdd.flatMap(x => for (i <- 1 to 1000) yield "key" + "value") had no OOM problem. The reason is that every ("key", "value") produces a new Tuple object, whereas "key" + "value", no matter how many times it appears, is only one object pointing into the constant pool. A sketch of such a test follows:
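The following is a minimal sketch of such a test (not the author's original test, which is not reproduced here), checking reference identity with eq:

val t1 = ("key", "value")
val t2 = ("key", "value")
println(t1 eq t2)   // false: each expression allocates a new Tuple2 object

val s1 = "key" + "value"
val s2 = "key" + "value"
println(s1 eq s2)   // true: literal concatenation is folded at compile time, so both references
                    // point to the single interned "keyvalue" string in the constant pool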

This example shows that two occurrences of ("key", "value") live at different locations in memory, i.e. two copies are kept, while "key" + "value", although it also appears twice, is kept as a single copy at the same address, thanks to the JVM constant pool. So, if an RDD contains a lot of duplicate data, or an Array needs to hold a large amount of duplicate data, turning the duplicate data into Strings can effectively reduce memory usage.

Optimization

This part records, for the spark-1.6.1 version, some parameters that I think improve performance when configured, as well as some code-level optimization techniques. For the parameter part, if I think the default is already optimal, it is not recorded here.

1. Use mapPartitions instead of most map operations, or of consecutive map operations

First, a few words on the difference between RDD and DataFrame. An RDD emphasizes immutability: every RDD is immutable, and calling a map-type operation on an RDD creates a new object. This leads to a problem: if many map-type operations are called on an RDD, each one produces another RDD object. This does not necessarily cause memory overflow, but it produces a lot of intermediate data and increases gc work. In addition, when an action is called on an RDD, Stage division begins, but the parts inside each Stage that could be optimized are not optimized. For example, rdd.map(_ + 1).map(_ + 1) on a numeric RDD is equivalent to rdd.map(_ + 2), but the RDD-internal process is not optimized. DataFrame is different: because a DataFrame carries type information it can be treated as mutable, and sql can be used in the program. Besides an interpreter there is also a sql optimizer, and DataFrame is no exception: it has an optimizer, Catalyst; see the reference articles at the end for details.

The RDD drawbacks mentioned above can partly be mitigated with mapPartitions. mapPartitions can replace rdd.map, rdd.filter, rdd.flatMap and similar operations at once, so for a long chain of operations, many RDD operations can be written together inside mapPartitions, avoiding a large number of intermediate rdd objects. Furthermore, inside mapPartitions mutable objects can be reused within a partition, avoiding frequent creation of new objects. The drawback of mapPartitions is reduced readability.
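A sketch of collapsing a map/filter chain into a single mapPartitions pass (the input lines and the field-extraction logic are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("mapPartitions-sketch").setMaster("local[*]"))
val lines = sc.parallelize(Seq("a,1", " b,2 ", "", "c,3"))

// The whole pipeline works on one iterator per partition, so no intermediate RDDs are created
// between the trim, the filter and the split.
val firstFields = lines.mapPartitions { iter =>
  iter.map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split(",")(0))
}
firstFields.collect().foreach(println)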

2. broadcast join versus ordinary join

In large distributed data systems, moving large amounts of data has an enormous effect on performance. Based on this idea, when two RDDs are joined and one of them is much smaller, the small RDD can be collected and turned into a broadcast variable; after that, a map over the other RDD can perform the join. This effectively avoids moving the data of the relatively large RDD.
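A minimal sketch of this map-side join, with made-up sample data (the big side is never shuffled; only the small table is shipped to each Executor once):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch").setMaster("local[*]"))

val bigRdd   = sc.parallelize(Seq((1, "order-a"), (2, "order-b"), (3, "order-c")))
val smallRdd = sc.parallelize(Seq((1, "user-x"), (2, "user-y")))

// Collect the small side to the driver and broadcast it, then "join" with a map on the big side.
val smallMap = sc.broadcast(smallRdd.collectAsMap())

val joined = bigRdd.flatMap { case (k, v) =>
  smallMap.value.get(k).map(u => (k, (v, u)))   // drop keys that have no match, like an inner join
}
joined.collect().foreach(println)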

3. Filter before the join

This is predicate pushdown, and the idea is obvious: filtering before the join rather than after reduces the amount of shuffled data. It is worth mentioning that the spark-sql optimizer already handles this, so users do not need to do it explicitly there; it only needs attention when implementing computations on rdds by hand.
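A sketch of doing the pushdown by hand on plain RDDs (the (userId, amount) and (userId, name) data and the threshold are made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("filter-before-join").setMaster("local[*]"))
val orders = sc.parallelize(Seq((1, 50), (2, 300), (3, 120)))   // (userId, amount)
val users  = sc.parallelize(Seq((1, "x"), (2, "y"), (3, "z")))  // (userId, name)

// Less efficient: rows that will be discarded are still shuffled by the join.
val slow = orders.join(users).filter { case (_, (amount, _)) => amount > 100 }

// Better: filter first, so the shuffle only moves rows that can pass the predicate.
val fast = orders.filter { case (_, amount) => amount > 100 }.join(users)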

4. partitionBy optimization

This part is described in detail in another article, "spark partitioner tips", so I will not repeat it here.

5. Using combineByKey:

This is also a classic Map-Reduce example: rdd.groupByKey().mapValues(_.sum) is less efficient than rdd.reduceByKey(_ + _).

The difference between the two processes shown in the original post's diagrams is that the upper one uses combineByKey, which reduces the amount of shuffled data, while the lower one does not. combineByKey is an API that key-value rdds provide themselves and can be used directly.
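A sketch comparing the two styles and showing combineByKey directly (the sample pairs and the per-key average computation are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("combineByKey-sketch").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))

// groupByKey ships every value across the network before summing on the reduce side.
val slowSums = pairs.groupByKey().mapValues(_.sum)

// reduceByKey (built on combineByKey) pre-aggregates within each partition,
// so only one partial sum per key and partition is shuffled.
val fastSums = pairs.reduceByKey(_ + _)

// combineByKey used directly, e.g. to compute per-key averages in one pass.
val avgs = pairs.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue (within a partition)
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners (across partitions)
).mapValues { case (sum, cnt) => sum.toDouble / cnt }
avgs.collect().foreach(println)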

6. Optimization when memory is insufficient

When memory is low, use rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) instead of rdd.cache():
rdd.cache() and rdd.persist(StorageLevel.MEMORY_ONLY) are equivalent; with rdd.cache(), data that does not fit in memory is lost and recomputed the next time it is used, whereas rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) stores the data that does not fit in memory on disk, avoiding recomputation at the cost of a little IO.
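A short sketch of the two options, with made-up sample data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc  = new SparkContext(new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000).map(_.toString)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): evicted blocks are recomputed.
rdd.cache()
rdd.unpersist()

// With MEMORY_AND_DISK_SER, blocks that do not fit in memory are serialized and spilled to disk,
// trading some IO for not having to recompute them when memory is tight.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd.count()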

7. When using hbase from spark, set spark and hbase up in the same cluster:

When using Hbase together with spark, it is best to build spark and hbase on the same cluster, or to have the spark cluster's nodes cover all the hbase nodes. Data in hbase is stored in HFiles, and a single HFile is usually fairly large. Moreover, when Spark reads Hbase data, one RDD partition does not correspond to one HFile but to one region. So when Spark reads Hbase data, a single RDD is usually fairly large, and if the two are not set up on the same cluster, data movement takes a lot of time.

Parameter optimization section

8.spark.driver.memory (default 1g)

This parameter sets the Driver's memory. In a Spark program, SparkContext and the DAGScheduler run on the Driver side, and the splitting of rdds into Stages also happens on the Driver side. If the user's program has too many steps and is split into too many Stages, this information consumes Driver memory, and the Driver memory then needs to be increased.

9.spark.rdd.compress (default false)

This parameter is useful when memory is tight and data needs to be persisted. Setting it to true makes rdd data in memory compressed when persist(StorageLevel.MEMORY_ONLY_SER) is used. It reduces memory consumption, at the cost of CPU time for decompression when the data is used.

10.spark.serializer (default org.apache.spark.serializer.JavaSerializer )

The recommended setting is org.apache.spark.serializer.KryoSerializer, because KryoSerializer is faster than JavaSerializer, but some Objects may fail to serialize. In that case the classes that fail to serialize must be registered with KryoSerializer explicitly, either through the spark.kryo.registrator configuration parameter or with code like the following:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

11.spark.memory.storageFraction (default 0.5)

This parameter sets storage memory as a fraction of the Executor's storage plus execution memory, i.e. storage / (storage + execution). Although in spark-1.6.0+ storage and execution memory can already borrow from each other, borrowing and reclaiming also costs performance, so if you know that your program needs a lot of storage, or very little, you can adjust this parameter accordingly.

12.spark.locality.wait (default 3s)

spark has four levels of execution locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY. When a task is to be executed, it first waits up to spark.locality.wait for a PROCESS_LOCAL slot; if none becomes available, it drops down to NODE level and waits spark.locality.wait again, and so on down to ANY. Whether a distributed system can work on local data has a huge effect on performance. If each RDD partition holds a lot of data and each partition takes a long time to process, it is worth increasing spark.locality.wait so Tasks have more time to wait for local data. This matters especially after persist or cache: after these two operations, reading the data cached in the local machine's memory is very efficient, but if the cached data has to be transferred across machines, efficiency drops sharply.

13.spark.speculation (default false)

In a large cluster the performance of individual nodes differs. spark.speculation makes nodes with free resources speculatively re-run Tasks that are still running elsewhere and are taking too long, so that a single slow node does not stall the whole job on one node. This parameter is best set to true. The related parameters starting with spark.speculation.× can be set together with it. One of the reference articles describes this parameter in detail.

 
