[Spark] Memory Model

1. Introduction

1.1 Background

Spark is a memory-based distributed computing engine, and its memory model and memory management are core topics. Understanding them helps you develop Spark applications and tune performance (for example, diagnosing jobs where GC, mostly Young GC, takes too long).

1.2 Overall structure and operation process

[Figure 1]
[Figure 2]

The overall running process of Spark:

  1. Build the running environment. The Driver creates a SparkContext, which handles resource requests, task allocation, and monitoring;
  2. Allocate resources. The SparkContext communicates with the Cluster Manager to apply for Executor resources, and the Executor processes are started;
  3. Decompose stages and dispatch tasks. The SparkContext builds a DAG, splits it into stages, and sends each stage's TaskSet to the TaskScheduler; Executors request tasks from the Driver;
  4. Run and release. Tasks run on Executors, report back to the TaskScheduler and DAGScheduler when they finish, and resources are then released.

From the perspective of the overall process, the Driver side creates the SparkContext, submits jobs, and coordinates tasks, while the Executor side executes tasks. From the perspective of memory usage, the memory design on the Executor side is more complicated and is summarized below.

2. Executor-side memory design

2.1 Heap division

[Figure 3]

The Executor started on a worker node is a JVM process, so Executor memory management is built on JVM memory management (on-heap memory). At the same time, Spark introduces off-heap memory, which allocates space directly in the worker node's system memory, avoiding unnecessary serialization and deserialization overhead during data processing and reducing GC overhead.

2.1.1 On-Heap memory

The size of on-heap memory is configured with the --executor-memory option or the spark.executor.memory parameter when the Spark application is started. Concurrent tasks running in the Executor share the JVM heap: the memory occupied when tasks cache RDDs or broadcast data is classified as storage memory, and the memory occupied when tasks execute Shuffle is classified as execution memory. The remaining part is not specially planned; Spark internal object instances and user-defined object instances in the application all occupy that remaining space. Under different management modes, the space occupied by these three parts differs.
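For example, a minimal sketch of setting the executor heap size when building an application (the app name and the 4g value are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-model-demo")     // illustrative name
  .set("spark.executor.memory", "4g")  // same effect as --executor-memory 4g on spark-submit
```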

Spark's management of on-heap memory is only a logical "planning" of it. The actual allocation and release of memory occupied by object instances are done by the JVM, while Spark only records the memory after allocation and before release. Let's take a look at the specific process:

[1] Applying for memory
1. Spark news an object instance in code;
2. The JVM allocates space from the heap, creates the object, and returns the object reference;
3. Spark saves the object reference and records the memory occupied by the object.

[2] Releasing memory
1. Spark records the memory released by the object and deletes the reference to the object;
2. The JVM garbage collector eventually releases the heap memory occupied by the object.

Spark serialization tips:
JVM objects can be stored in serialized form. Serialization converts an object into a binary byte stream; in essence, it converts chained storage in non-contiguous space into contiguous or block storage. Access requires the reverse process, deserialization, which converts the byte stream back into an object. Serialization saves storage space but adds computational overhead on writes and reads.
Serialized objects in Spark exist as byte streams, so the memory they occupy can be calculated exactly. For non-serialized objects, the occupied memory can only be approximated by periodic sampling; that is, the size is not recalculated every time a data item is added. This reduces time overhead, but the error can be large, so the actual memory usage may far exceed expectations at some moment. In addition, object instances that Spark has marked as released may not yet have been reclaimed by the JVM, so the actual available memory is smaller than what Spark records. Spark therefore cannot track the actually available heap memory exactly, and OOM cannot be completely avoided.
Although it cannot precisely control the allocation and release of on-heap memory, Spark can, through its own planning and management of storage memory and execution memory, decide whether to cache a new RDD in storage memory and whether to allocate execution memory to a new task. To a certain extent this improves memory utilization and reduces exceptions.
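As a hedged sketch of this trade-off, the two storage levels below choose between object instances (fast access, size only estimated by sampling) and a byte stream (compact, exactly measurable, extra CPU cost); the helper names are illustrative:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Object instances in the heap: fast access, size only estimated by sampling.
def cacheFast[T](rdd: RDD[T]): RDD[T] = rdd.persist(StorageLevel.MEMORY_ONLY)

// Serialized byte stream: compact and exactly measurable, extra CPU to (de)serialize.
def cacheCompact[T](rdd: RDD[T]): RDD[T] = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
```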

2.1.2 Off-Heap memory

To further optimize memory usage and improve the efficiency of sorting during Shuffle, Spark introduces off-heap memory, which allocates space directly in the worker node's system memory to store serialized binary data. Using the JDK Unsafe API (starting with Spark 2.0, off-heap storage memory is no longer based on Tachyon but, like off-heap execution memory, is implemented on the JDK Unsafe API), Spark can operate on off-heap memory directly, reducing unnecessary memory overhead as well as frequent GC scans and collections, and improving processing performance. Off-heap memory can be requested and released exactly, and the space occupied by serialized data can be computed exactly, so compared with on-heap memory it is easier to manage and has smaller errors.
Off-heap memory is disabled by default. It can be enabled with the spark.memory.offHeap.enabled parameter, and its size is set with the spark.memory.offHeap.size parameter. Except that there is no Other space, off-heap memory is divided in the same way as on-heap memory, and all concurrently running tasks share its storage memory and execution memory.
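A minimal configuration sketch (the 2g size is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true") // off by default
  .set("spark.memory.offHeap.size", "2g")      // must be set when off-heap is enabled
```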

2.2 Memory division and corresponding functions

To reduce the occurrence of OOM, Spark further divides on-heap memory into Storage, Execution, and Other regions, and manages the utilization of each region separately.

  • Under static memory management, the available on-heap memory space is calculated as:
Available storage memory = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction
Available execution memory = systemMaxMemory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction

systemMaxMemory: the size of the current JVM heap
memoryFraction: ……
safetyFraction: lowers the risk of OOM caused by actual memory usage exceeding the preset range (as mentioned above, sampled estimates for non-serialized objects carry errors). Note that this reserved safety region is only a logical plan; Spark does not treat it specially at runtime, and it is handed to the JVM to manage like any other memory.
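As a worked example under the legacy defaults (spark.storage.memoryFraction = 0.6 with safetyFraction = 0.9, spark.shuffle.memoryFraction = 0.2 with safetyFraction = 0.8), for an assumed 10 GB heap:

```scala
val systemMaxMemory = 10L * 1024 * 1024 * 1024             // assume a 10 GB JVM heap
val usableStorage   = (systemMaxMemory * 0.6 * 0.9).toLong // 0.54 of the heap, ≈ 5.4 GB
val usableExecution = (systemMaxMemory * 0.2 * 0.8).toLong // 0.16 of the heap, ≈ 1.6 GB
```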

The off-heap space division is simpler, with only storage memory and execution memory. As shown in Figure 4, the sizes of available execution memory and storage memory are determined directly by the spark.memory.storageFraction parameter. Since off-heap memory usage can be computed exactly, there is no need to set aside a safety region.
The static memory management mechanism is simple to implement, but if users are not familiar with Spark's storage mechanism, or do not configure it according to their specific data scale and computing tasks, it is easy to end up in a "half seawater, half flame" situation: one of the two regions still has plenty of free space while the other fills up early and has to evict or spill old content to store new content. Because of the newer memory management mechanism, this mode is now rarely used by developers, but Spark still retains its implementation for compatibility.
[Figure 4]

By purpose, on-heap and off-heap memory are divided as above. On-heap memory is divided into 4 blocks:
1. Storage Memory: mainly used for caching and internal data transfer, such as caching RDDs or Broadcast data;
2. Execution Memory: mainly used for computation, including shuffles, joins, sorts, and aggregations;
3. Other Memory: mainly used to store user-defined data structures and Spark internal metadata;
4. Reserved Memory: serves the same purpose as Other memory, but because Spark estimates heap memory usage, a reserved region is kept to guarantee sufficient space.

Off-heap memory is divided into 2 blocks (as mentioned earlier, Spark can compute off-heap memory usage exactly):
1. Storage Memory
2. Execution Memory
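Under unified memory management (section 2.3), these on-heap regions are sized roughly as in the sketch below, assuming the Spark 2.x defaults spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 and the fixed 300 MB reserve:

```scala
val systemMemory   = 4L * 1024 * 1024 * 1024      // assume a 4 GB executor heap
val reservedMemory = 300L * 1024 * 1024           // fixed reserved memory
val usableMemory   = systemMemory - reservedMemory
val unifiedMemory  = (usableMemory * 0.6).toLong  // storage + execution, can borrow
val storageMemory  = (unifiedMemory * 0.5).toLong // a soft boundary, not a hard limit
val otherMemory    = usableMemory - unifiedMemory // user data structures, metadata
```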

2.3 Unified memory management mechanism

Spark 1.6 introduced a unified memory management mechanism. The difference from static memory management is that storage memory and execution memory share the same space and can dynamically occupy each other's free region.

2.3.1 Static memory model

The biggest feature of the static memory model is that the size of each on-heap region is fixed while the Spark application runs; the user configures the sizes before startup. This also requires the user to be very familiar with Spark's memory model, otherwise improper configuration may have serious consequences.
The division and proportions of the on-heap memory regions are shown below.
A note on the Unroll process: an RDD is cached in storage memory in the form of Blocks, and its Records occupy a contiguous space in on-heap or off-heap storage memory. The process of converting a Partition from non-contiguous storage space to contiguous space is Unroll, also known as the unrolling operation.
[Figure 5]
[Figure 6]

The division of off-heap memory is simpler: there is only storage memory and execution memory. Because Spark can compute off-heap memory usage accurately, there is no need to reserve extra space to avoid OOM.

Dynamic occupancy mechanism:
The basic storage memory and execution memory regions are set by the spark.memory.storageFraction parameter, which determines the nominal range of each side.
If neither side has enough space, data is spilled to disk; if one side's own space is insufficient while the other side has free space, it can borrow the other side's space. (Insufficient storage space means not enough to hold a complete Block.)
When execution memory space has been occupied by storage, the occupied part can be spilled to disk so that the borrowed space is "returned". When storage memory space has been occupied by execution, it cannot be "returned", because too many factors in the Shuffle process would have to be considered and the implementation would be complicated.
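A simplified, illustrative model of these rules (not Spark's actual UnifiedMemoryManager code): execution can use its own free space, storage's free space, and whatever storage has borrowed beyond its nominal region, since those cached blocks can be evicted or spilled; storage can only use execution's currently free space:

```scala
// Upper bound on what an execution memory request can obtain right now.
def maxExecutionMemory(executionFree: Long, storageFree: Long,
                       storageUsed: Long, storageRegion: Long): Long =
  executionFree + storageFree + math.max(0L, storageUsed - storageRegion)

// Storage cannot force execution to give borrowed space back.
def maxStorageMemory(storageFree: Long, executionFree: Long): Long =
  storageFree + executionFree
```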
[Figure 7]

With the unified memory management mechanism, Spark improves the utilization of on-heap and off-heap memory resources to a certain extent and lowers the difficulty of maintaining Spark memory for developers, but that does not mean developers can sit back and relax. For example, if the storage memory region is too large or the cached data is too large, it can lead to frequent full garbage collection and reduce task execution performance, because cached RDDs usually stay resident in memory for a long time. So to fully exploit Spark's performance, developers need to further understand how storage memory and execution memory are managed and implemented.

3. Storage memory management

3.1 RDD persistence mechanism

An RDD is Spark's data abstraction, created either from a dataset in stable physical storage or by transforming other existing RDDs; the resulting dependencies constitute its lineage. Through lineage, Spark can guarantee that every RDD can be recovered, but all RDD transformations are lazy and are not executed until an action operator is encountered.
When a task starts, it reads a partition and checks whether it has been persisted; if not, it checks the Checkpoint, and otherwise recomputes according to the lineage. Therefore, after the first computation, the persist or cache method can be called to persist or cache the RDD, avoiding repeated computation and improving speed. cache persists the RDD in memory with the default MEMORY_ONLY storage level, so cache is a special kind of persistence. The on-heap and off-heap memory design allows the memory used for caching RDDs to be planned and managed in a unified way. (Caching broadcast data is outside the scope of this article.)
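An illustrative use of the API (assuming an existing SparkContext named sc):

```scala
// cache() is persist(StorageLevel.MEMORY_ONLY) and, like all transformations, lazy.
val pairs = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))
pairs.cache()
pairs.count() // the first action computes the partitions and caches them as Blocks
pairs.count() // later actions read the cached Blocks instead of recomputing
```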
RDD persistence is handled by Spark's Storage module, which decouples RDDs from physical storage. The Storage module is responsible for managing the data generated during Spark computation and encapsulates reading and writing data, in memory or on disk, locally or remotely. Concretely it is a master-slave architecture: the BlockManager on the Driver side is the master, and the BlockManagers on the Executor side are slaves. The Storage module logically uses the Block as its basic storage unit, and each Partition of an RDD corresponds to a unique Block after processing. The master is responsible for managing and maintaining Block metadata for the entire Spark application, while slaves report Block updates and other status to the master and receive master commands, such as adding or deleting an RDD.
[Figure 8]

3.2 RDD caching process

Before an RDD is cached in storage memory, the data in a Partition is generally accessed through an iterator data structure, which yields each data item (Record) in the partition, serialized or not. These Record object instances logically occupy the Other part of the JVM heap, and the spaces of different Records in the same Partition are not contiguous.
After the RDD is cached in storage memory, the Partition is converted into a Block, and its Records occupy a contiguous space in on-heap or off-heap storage memory. The process of transforming a Partition from non-contiguous storage space to contiguous storage space is what Spark calls "Unroll". A Block has two storage formats, serialized and non-serialized, depending on the RDD's storage level. A non-serialized Block is defined by the DeserializedMemoryEntry data structure, which uses an array to store all object instances; a serialized Block is defined by the SerializedMemoryEntry data structure, which uses a byte buffer to store binary data.
The Storage module of each Executor uses a linked map structure (LinkedHashMap) to manage all Block object instances in on-heap and off-heap storage memory. Additions to and removals from this LinkedHashMap indirectly record memory allocation and release.
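A toy model of this bookkeeping, assuming a LinkedHashMap in access order (MemoryEntry and the helper names here are stand-ins, not Spark's actual classes):

```scala
// Block bookkeeping: insertion/access order gives LRU iteration for free.
case class MemoryEntry(size: Long, offHeap: Boolean)

val blocks = new java.util.LinkedHashMap[String, MemoryEntry](32, 0.75f, true)

def recordPut(blockId: String, entry: MemoryEntry): Unit = blocks.put(blockId, entry)
def recordDrop(blockId: String): Unit = blocks.remove(blockId)
```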
Because there is no guarantee that storage space can hold all the data in the Iterator at once, the current computing task must apply to the MemoryManager for enough Unroll space to temporarily occupy when unrolling; if the space is insufficient, the Unroll fails, and if it is sufficient, the process continues.
For a serialized Partition, the required Unroll space can be accumulated and requested once. A non-serialized Partition has to request space incrementally while traversing its Records; that is, each time a Record is read, the required Unroll space is estimated by sampling and requested. If space runs out, the process is interrupted and the occupied Unroll space is released. If the Unroll finally succeeds, the Unroll space occupied by the current Partition is converted into normal storage space for the cached RDD, as shown in the figure below.
[Figure 9]

Under the static memory management described above, a fixed-size portion of storage memory is specially set aside as Unroll space, while unified memory management does not distinguish Unroll space: when storage space is insufficient, the dynamic occupancy mechanism applies.
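A hedged sketch of unrolling a non-serialized partition as described above, where reserveUnroll and estimateSize are hypothetical stand-ins for the MemoryManager request and the sampling-based size estimate:

```scala
def unroll[T](records: Iterator[T],
              reserveUnroll: Long => Boolean,     // hypothetical MemoryManager call
              estimateSize: Vector[T] => Long): Option[Vector[T]] = {
  var buffer   = Vector.empty[T]
  var reserved = 0L
  var count    = 0L
  while (records.hasNext) {
    buffer :+= records.next()
    count += 1
    if (count % 16 == 0) {                        // periodic sampling, not per Record
      val needed = estimateSize(buffer) - reserved
      if (needed > 0 && !reserveUnroll(needed)) return None // unroll failed: abort
      reserved += math.max(0L, needed)
    }
  }
  Some(buffer) // on success, the unroll space becomes normal RDD storage space
}
```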

3.3 Eviction and disk spilling

All computing tasks of the same Executor share the limited storage memory space. When a new Block needs to be cached but the remaining space is insufficient and cannot be dynamically occupied, old Blocks in the LinkedHashMap must be evicted (Eviction); if an evicted Block's storage level also includes disk, it is spilled to disk, otherwise it is deleted directly.
The storage memory eviction rules are:
  • The memory mode of the evicted old Block must be the same as the new Block's, that is, both off-heap or both on-heap;
  • The old Block cannot belong to the same RDD as the new one, to avoid cyclic eviction;
  • The RDD the old Block belongs to cannot be in the middle of being read, to avoid consistency problems;
  • Blocks in the LinkedHashMap are traversed and evicted in least-recently-used (LRU) order until the space required by the new Block is satisfied; LRU ordering is a built-in feature of LinkedHashMap.
The disk spilling process is relatively simple: if the storage level includes disk and the Block is in non-serialized form, it is serialized first; finally the data is written to disk and its information is updated in the Storage module.
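An illustrative selection pass following these rules (not Spark's actual evictBlocksToFreeSpace): scan in LRU order, consider only blocks of the same memory mode that belong to a different RDD, and evict nothing unless enough space can be freed:

```scala
case class BlockEntry(rddId: Int, size: Long, offHeap: Boolean, useDisk: Boolean)

def selectVictims(needed: Long, newRddId: Int, offHeap: Boolean,
                  blocks: java.util.LinkedHashMap[String, BlockEntry]): Seq[String] = {
  val victims = scala.collection.mutable.Buffer.empty[String]
  var freed   = 0L
  val it = blocks.entrySet().iterator()  // LRU order when accessOrder = true
  while (it.hasNext && freed < needed) {
    val kv = it.next()
    val e  = kv.getValue
    if (e.offHeap == offHeap && e.rddId != newRddId) {
      victims += kv.getKey               // spill to disk if e.useDisk, else drop
      freed += e.size
    }
  }
  if (freed >= needed) victims.toSeq else Seq.empty
}
```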

4. Execution memory management

4.1 Memory allocation among multiple tasks

Tasks running in an Executor also share execution memory, and Spark uses a HashMap structure to store the mapping from task to memory consumption. Each task can occupy execution memory in the range from 1/(2N) to 1/N of the pool, where N is the number of tasks running in the Executor. When a task starts, it must obtain at least 1/(2N) of the execution memory from the MemoryManager; if this requirement cannot be met, the task is blocked until other tasks release enough execution memory, at which point it can be awakened.
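The fair-share bounds in a small sketch:

```scala
// With N running tasks, each task may hold between 1/(2N) and 1/N of the pool.
def taskShareBounds(poolSize: Long, numTasks: Int): (Long, Long) =
  (poolSize / (2L * numTasks), poolSize / numTasks)

// e.g. taskShareBounds(8L << 30, 4) == (1 GB, 2 GB): a task that cannot get
// its 1/(2N) minimum blocks until other tasks release execution memory.
```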

4.2 Shuffle memory usage

Execution memory mainly holds the memory occupied by tasks while executing Shuffle. Shuffle is the process of repartitioning RDD data according to certain rules. Consider how execution memory is used in the Shuffle Write and Shuffle Read phases:
Shuffle Write:
If the ordinary sort path is chosen on the map side, an ExternalSorter performs the external sorting, which mainly occupies on-heap execution space when storing data in memory.
If Tungsten sort is chosen on the map side, a ShuffleExternalSorter directly sorts data stored in serialized form; when storing data in memory, it can occupy off-heap or on-heap execution space, depending on whether the user has enabled off-heap memory and whether off-heap execution memory is sufficient.

Shuffle Read:
When data is aggregated on the reduce side, it is handed to an Aggregator for processing, which occupies on-heap execution space when storing data in memory.
If the final result needs to be sorted, the data is handed to an ExternalSorter again, which also occupies on-heap execution space.

In ExternalSorter and Aggregator, Spark uses a hash table called AppendOnlyMap to store data in on-heap memory, but during Shuffle not all data can be kept in this hash table. The memory occupied by the hash table is estimated periodically; when it grows large enough that no new execution memory can be obtained from the MemoryManager, Spark spills the in-memory data to a disk file. This process is called Spill, and the spilled files are eventually merged (Merge).
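A hedged sketch of that decision, where acquireExecution and spill are hypothetical stand-ins for the MemoryManager request and the write-to-disk step:

```scala
// Returns the memory now reserved for the map; 0 after a spill resets it.
def maybeSpill(estimatedSize: Long, reserved: Long,
               acquireExecution: Long => Long, // hypothetical: grants <= requested
               spill: () => Unit): Long = {
  if (estimatedSize <= reserved) reserved
  else {
    val granted = acquireExecution(estimatedSize - reserved)
    if (reserved + granted >= estimatedSize) reserved + granted
    else { spill(); 0L } // cannot get enough execution memory: spill to disk
  }
}
```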
The Tungsten used in the Shuffle Write stage is a plan proposed by Databricks to optimize Spark's memory and CPU usage, which works around some JVM performance limitations and drawbacks. Spark automatically chooses whether to use Tungsten sort according to the Shuffle situation.
Tungsten's page-based memory management is built on top of the MemoryManager; that is, Tungsten abstracts the use of execution memory one step further, so the Shuffle process does not need to care whether data is stored on-heap or off-heap. Each memory page is defined by a MemoryBlock, and two variables, Object obj and long offset, uniformly identify the address of a memory page in system memory.
For an on-heap MemoryBlock, obj is the reference of a long array allocated in the JVM and offset is the array's initial offset address within the JVM; used together, they locate the array's absolute address in the heap. An off-heap MemoryBlock is a directly requested memory block: its obj is null, and its offset is the 64-bit absolute address of that memory block in system memory. With MemoryBlock, Spark cleverly abstracts and encapsulates on-heap and off-heap memory pages, and uses a page table (pageTable) to manage the memory pages requested by each task.
All memory under Tungsten page management is represented by a 64-bit logical address composed of a page number and an in-page offset:
Page number: occupies 13 bits and uniquely identifies a memory page; Spark acquires a free page number before requesting a memory page.
In-page offset: occupies 51 bits; when the memory page is used to store data, it is the offset address of the data within the page.
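A sketch of this encoding, modeled on the layout just described (the helper names are illustrative):

```scala
val OffsetBits = 51                     // in-page offset width
val OffsetMask = (1L << OffsetBits) - 1 // lower 51 bits

def encodeAddress(pageNumber: Int, offsetInPage: Long): Long =
  (pageNumber.toLong << OffsetBits) | (offsetInPage & OffsetMask)

def pageNumber(address: Long): Int    = (address >>> OffsetBits).toInt // upper 13 bits
def offsetInPage(address: Long): Long = address & OffsetMask
```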

With this unified addressing, Spark can use a 64-bit logical address pointer to locate memory either on-heap or off-heap, and the whole Shuffle Write sorting process only needs to sort pointers, with no deserialization. The process is very efficient and brings a clear improvement in memory access and CPU usage efficiency.

Spark's storage memory and execution memory are managed in completely different ways: for storage memory, Spark uses a LinkedHashMap to centrally manage all Blocks, which are converted from the Partitions of RDDs that need to be cached; for execution memory, Spark uses an AppendOnlyMap to store data in the Shuffle process, and in Tungsten sort this is further abstracted into page-based memory management, opening up a new JVM memory management mechanism.

