Spark (2): Memory Management

Reprinted from: http://www.cnblogs.com/tgzhu/p/5822370.html

As a computing engine built for in-memory computing, Spark makes memory management a critical module. Spark's memory falls roughly into two categories: execution and storage. Execution memory covers shuffles, joins, sorts, and aggregations; storage memory covers caching and data transfer between nodes. In Spark 1.5 and earlier, the two regions were statically sized and could not borrow from each other. Spark 1.6 reworked the memory management module: by unifying the memory space and removing that restriction, it delivers better performance. The official documentation only asks for more than 8GB of memory (by comparison, Impala recommends machines with 128GB), but the running efficiency of a Spark job mainly depends on: data size, memory consumption, and the number of cores (which determines how many tasks run concurrently).

Contents:

  • Basic knowledge
  • Spark 1.5 memory management
  • Spark 1.6 memory management

Basic knowledge:


  • On-heap memory: objects allocated in Java are managed by the Java virtual machine's garbage collector; this is known as on-heap memory. The virtual machine reclaims garbage periodically, and at certain points it performs a complete collection (full GC). During a full GC, the garbage collector scans all allocated heap memory, which implies an important fact: the impact of such a collection on a Java application is proportional to the size of the heap. An excessively large heap degrades the performance of a Java application.
  • Off-heap memory: memory objects are allocated outside the Java virtual machine's heap and are managed directly by the operating system rather than the JVM. The point is to keep the heap small and thereby reduce the impact of garbage collection on the application.
  • LRU Cache (Least Recently Used): LRU is an algorithm, or rather a principle, for deciding which objects to evict from a cache: the "least recently used" one. When the cache overflows, the object that has gone unused the longest is evicted first.
  • spark source code: https://github.com/apache/spark/releases
  • Scala plugin for IntelliJ: http://plugins.jetbrains.com/plugin/?id=1347
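The LRU eviction principle described above can be sketched in a few lines. This is a minimal illustration (not Spark's actual `MemoryStore` code), using Python's `OrderedDict` to track recency:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: on overflow, the least recently used entry is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        # Accessing a key makes it the most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            # Evict the least recently used entry (the oldest one).
            self._data.popitem(last=False)

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # overflow: "b" (least recently used) is evicted
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

Spark uses the same principle when the storage region fills up: the blocks touched least recently are the first candidates to be dropped or spilled.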

Spark 1.5 (and earlier) memory management:


  • Version 1.6 introduced a new memory management scheme, controlled by the parameter spark.memory.useLegacyMode: the default, false, selects the new scheme; true selects the old one. See the relevant source in SparkEnv.scala.
  • The constructor and the memory-sizing definitions can be found in the StaticMemoryManager.scala class.

  • From the code: if spark.testing.memory is set, its value is used as systemMaxMemory; otherwise the maximum JVM memory is used as systemMaxMemory.
  • spark.testing.memory is only used for testing and is normally unset, so systemMaxMemory can be taken as the executor's maximum available memory.
  • Execution: caches temporary data for shuffle, join, sort, and aggregation; its size is configured via spark.shuffle.memoryFraction.
  • spark.shuffle.memoryFraction: the fraction (a decimal) of executor memory usable during a shuffle. At any time, the total memory used for shuffle must not exceed this limit; the excess is spilled to disk. If spills are frequent, consider increasing this value.
  • spark.shuffle.safetyFraction: to prevent OOM, systemMaxMemory * spark.shuffle.memoryFraction cannot be fully used; a safety fraction is applied on top.
  • The final execution memory is therefore: executor maximum available memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction; with the defaults (0.2 * 0.8) this is 0.16 of the executor's maximum available memory.
  • Execution memory is shared by multiple task threads within the JVM.
  • The allocation of execution memory between tasks is dynamic: if no other tasks exist, Spark allows a single task to occupy all available execution memory.

  • Storage memory is derived the same way as execution memory. From the code, the memory usable for storage is: executor maximum available memory * spark.storage.memoryFraction * spark.storage.safetyFraction; with the defaults (0.6 * 0.9) this is 0.54 of the executor's maximum available memory.
  • Within storage, a portion is reserved for unroll, i.e., deserializing blocks; its share is controlled by spark.storage.unrollFraction (default 0.2).

  • By these fractions, storage and execution nominally claim 80% of the heap (0.6 + 0.2); the remaining 20% is reserved by the system for objects created at runtime, and this portion is not managed by Spark.
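The static sizing rules above reduce to simple arithmetic. The following sketch (assuming the default fractions quoted above, not Spark's actual source) computes each region for a given executor heap:

```python
def static_memory_regions(system_max_memory,
                          shuffle_fraction=0.2, shuffle_safety=0.8,
                          storage_fraction=0.6, storage_safety=0.9,
                          unroll_fraction=0.2):
    """Rough breakdown of the Spark 1.5 StaticMemoryManager heap regions (illustrative only)."""
    # Execution: 0.2 * 0.8 = 0.16 of the heap by default.
    execution = system_max_memory * shuffle_fraction * shuffle_safety
    # Storage: 0.6 * 0.9 = 0.54 of the heap by default.
    storage = system_max_memory * storage_fraction * storage_safety
    # Unroll space is carved out of the storage region (default 20% of it).
    unroll = storage * unroll_fraction
    # What the 0.2 + 0.6 nominal fractions leave to the system: 20% of the heap.
    other = system_max_memory - system_max_memory * (shuffle_fraction + storage_fraction)
    return {"execution": execution, "storage": storage,
            "unroll": unroll, "other": other}

regions = static_memory_regions(10 * 1024)  # a hypothetical 10 GB executor heap, in MB
print(regions)
```

For a 10 GB heap this yields roughly 1638 MB for execution and 5530 MB for storage, which matches the 0.16 / 0.54 defaults stated above.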

Summary:


  • The defect of this scheme is the static allocation of execution and storage memory: even when one side's memory is exhausted and the other's is idle, they cannot share, which wastes memory. To solve this problem, Spark 1.6 introduced a new memory management scheme, UnifiedMemoryManager.
  • The StaticMemoryManager JVM heap allocation diagram is as follows:

 

Spark1.6 memory management:


  • Starting with Spark 1.6, a new memory management scheme, the Unified Memory Manager, was introduced. Under unified memory management, the JVM heap inside a Spark executor is divided as shown in the following figure:

  • Reserved Memory: the portion we cannot use. Spark reserves it internally to hold Spark's own internal objects.
  • In Spark 1.6 the Reserved Memory size is fixed at 300MB and cannot be changed by the user. Simply put, of the memory requested for an executor, 300MB is unusable; and if the requested executor size is less than 1.5 * Reserved Memory, i.e. < 450MB, Spark reports an error.
  • User Memory: holds memory overheads not managed by Spark, such as objects created in user code.
  • Spark Memory: its size is (JVM Heap Size - Reserved Memory) * spark.memory.fraction, where spark.memory.fraction is configurable (default 0.75).
  • If spark.memory.fraction is set too small, data generated during task execution (including cached data) is likely to spill to disk frequently for lack of memory, hurting efficiency; the officially recommended default is best kept.
  • Spark Memory is further divided into two parts, Execution Memory and Storage Memory; spark.memory.storageFraction sets the split between them (default 0.5, half each).
  • Storage Memory mainly holds cached data, the temporary unroll space used when deserializing blocks, and broadcast variables stored at a cache storage level.
  • Execution Memory is the memory used while Spark tasks execute (for example, sorting during a shuffle requires a lot of memory).
  • To improve memory utilization, Spark applies the following rules to Storage Memory and Execution Memory:
    1. When one side is idle and the other is short of memory, the side short of memory may borrow from the idle side.
    2. Only Execution Memory can forcibly reclaim the memory that Storage Memory borrowed from it while it was idle (if cached data is evicted by this forced reclamation, it is recomputed when needed).
    3. Storage Memory cannot forcibly reclaim; it can only wait for Execution Memory to release the borrowed memory voluntarily. (There is no forced reclamation here, because evicting a running task's data would cause the task to fail.)
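The unified sizing rules above can likewise be sketched as arithmetic. This is an illustrative model with the defaults stated in this section (300MB reserved, spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5), not Spark's actual implementation:

```python
RESERVED_MB = 300  # fixed Reserved Memory in Spark 1.6

def unified_memory_layout(heap_mb, memory_fraction=0.75, storage_fraction=0.5):
    """Split an executor heap the way Spark 1.6's unified scheme sizes its regions
    (illustrative sketch)."""
    if heap_mb < 1.5 * RESERVED_MB:
        # Spark refuses executors smaller than 1.5 * Reserved Memory (< 450MB).
        raise ValueError("executor memory must be at least 1.5 * 300MB = 450MB")
    usable = heap_mb - RESERVED_MB
    spark_memory = usable * memory_fraction
    return {
        "user": usable * (1 - memory_fraction),
        # The storage/execution boundary is soft: either side may borrow
        # the other's idle memory, but only execution can force its share back.
        "storage": spark_memory * storage_fraction,
        "execution": spark_memory * (1 - storage_fraction),
    }

layout = unified_memory_layout(4096)  # a hypothetical 4 GB executor heap, in MB
print(layout)
```

For a 4 GB executor this gives 949 MB of User Memory and 1423.5 MB each of Storage and Execution Memory; the halves are only a soft boundary, per rules 1–3 above.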

 
