Spark memory management mechanism

1. Overview

As a memory-based distributed computing engine, Spark relies heavily on its memory management module, which plays a very important role in the whole system. Understanding the basic principles of Spark memory management helps you develop Spark applications and tune their performance. This article aims to lay out the overall picture of Spark memory management and introduce the main ideas, as a starting point for deeper discussion of the topic. The principles explained here are based on Spark 2.1; readers are assumed to have some grounding in Spark and Java, and to be familiar with concepts such as RDD, Shuffle, and the JVM.

When a Spark application is executed, the Spark cluster starts two kinds of JVM processes: the Driver and the Executors. The former is the main control process, responsible for creating the Spark context, submitting Spark jobs (Jobs), converting jobs into computing tasks (Tasks), and coordinating task scheduling across the Executor processes. The latter are responsible for executing the actual computing tasks on the worker nodes and returning the results to the Driver, and they also provide storage for RDDs that need to be persisted [1]. Since the Driver's memory management is relatively simple, this article mainly analyzes the Executor's memory management; "Spark memory" below refers specifically to Executor memory.

2. In-heap and off-heap memory planning

As a JVM process, the Executor manages memory on top of the JVM's own memory management. Spark partitions the JVM's on-heap space in finer detail to make full use of it. At the same time, Spark introduces off-heap memory, allowing it to allocate space directly from the worker node's system memory and further optimize memory use.

                        Figure 1. Schematic diagram of in-heap and off-heap memory

2.1 Heap memory

The size of the on-heap memory is configured by the --executor-memory option or the spark.executor.memory parameter when the Spark application starts. Concurrent tasks running in the Executor share the JVM heap. The memory these tasks occupy when caching RDD data and broadcast data is planned as Storage memory, while the memory they occupy when executing Shuffle is planned as Execution memory. The remaining part is not specially planned: object instances inside Spark, as well as object instances in user-defined Spark application code, all occupy this remaining space. Under different management modes, the sizes of these three parts differ (see Section 3 below).

Spark's management of on-heap memory is a logical, "planning-style" management, because the allocation and release of the memory occupied by object instances are performed by the JVM; Spark can only record this memory after allocation and before release. The specific process is as follows:

  • Applying for memory:
  1. Spark creates an object instance with new in its code
  2. The JVM allocates space from the heap, creates the object, and returns a reference to it
  3. Spark keeps the object reference and records the memory occupied by the object
  • Releasing memory:
  1. Spark records the memory freed by the object and deletes its reference to the object
  2. The JVM's garbage collector eventually reclaims the heap memory the object occupied
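The apply/release flow above amounts to bookkeeping layered over the JVM's real allocator. As a rough illustration (a hypothetical simplification, not Spark's actual MemoryManager code — all names here are made up), the logical tracking might look like:

```scala
// Hypothetical sketch of Spark's *logical* on-heap bookkeeping:
// Spark only tracks estimated sizes; the JVM allocates and frees the real memory.
import scala.collection.mutable

class LogicalHeapTracker(val capacity: Long) {
  private val recorded = mutable.Map.empty[String, Long] // id -> estimated size
  private var used = 0L

  // Called after `new` returns: record the estimate, keep the reference elsewhere.
  def recordAllocation(id: String, estimatedSize: Long): Boolean =
    if (used + estimatedSize > capacity) false // logically "out of memory"
    else { recorded(id) = estimatedSize; used += estimatedSize; true }

  // Called when Spark drops its reference; actual reclamation waits for GC.
  def recordRelease(id: String): Unit =
    recorded.remove(id).foreach(size => used -= size)

  def recordedUsage: Long = used
}
```

Because the sizes are estimates and garbage collection is asynchronous, `recordedUsage` can drift from the true heap occupancy — which is exactly why, as the next paragraphs explain, Spark cannot fully rule out OOM.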

We know that JVM objects can be stored in serialized form. Serialization converts an object into a binary byte stream; in essence, it can be understood as converting the chained storage of non-contiguous space into contiguous or block storage. On access, the reverse process, deserialization, converts the byte stream back into an object. Serialization saves storage space but adds computational overhead at store and read time.

For serialized objects in Spark, since they are byte streams, the occupied memory size can be computed directly. For non-serialized objects, the occupied memory is approximated by periodic sampling, i.e., the size is not recomputed every time a new data item is added. This reduces time overhead but can introduce large errors, so the actual memory at a given moment may far exceed expectations [2]. In addition, object instances that Spark has marked as released may well not have been reclaimed by the JVM yet, making the actually available memory smaller than what Spark has recorded. Spark therefore cannot track the exact amount of available heap memory, and so cannot completely avoid out-of-memory (OOM) exceptions.
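The sampling trade-off can be made concrete with a toy tracker (illustrative only — Spark's actual SizeTracker/SizeEstimator machinery differs in its details): between sample points the estimate extrapolates from the last measured per-record size, so a sudden jump in record size goes unnoticed until the next sample.

```scala
// Illustrative sketch of periodic sampling: measure the true size only every
// `samplePeriod` insertions and extrapolate in between, so the estimate can lag.
class SampledSizeTracker(samplePeriod: Int) {
  private var count = 0L
  private var lastSampledTotal = 0L // true size at the last sample point
  private var lastSampledCount = 0L

  // `trueTotalSize` stands in for an expensive full size measurement.
  def addRecord(trueTotalSize: Long): Unit = {
    count += 1
    if (count % samplePeriod == 0) {
      lastSampledTotal = trueTotalSize
      lastSampledCount = count
    }
  }

  // Extrapolate: assume per-record size stayed what it was at the last sample.
  def estimatedSize: Long =
    if (lastSampledCount == 0) 0L
    else lastSampledTotal / lastSampledCount * count
}
```

If a very large record arrives right after a sample point, the estimate stays near the old per-record average until the next sample — the source of the "actual memory far exceeds expectations" errors described above.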

Although it cannot precisely control the allocation and release of on-heap memory, by independently planning and managing storage memory and execution memory Spark can decide whether to cache a new RDD in storage memory and whether to grant execution memory to a new task. This improves memory utilization to a certain extent and reduces the occurrence of exceptions.

2.2 Off-heap memory

To further optimize memory use and improve the efficiency of sorting during Shuffle, Spark introduces off-heap memory, which lets it allocate space directly from the worker node's system memory to store serialized binary data. Using the JDK Unsafe API (since Spark 2.0, off-heap storage memory is no longer managed via Tachyon but, like off-heap execution memory, is based on the JDK Unsafe API [3]), Spark can operate on off-heap operating-system memory directly, reducing unnecessary memory overhead as well as frequent GC scans and collections, and thereby improving performance. Off-heap memory can be allocated and released precisely, and the space occupied by serialized data can be computed exactly, so compared with on-heap memory it is easier to manage and less error-prone.

Off-heap memory is not enabled by default. It is enabled with the spark.memory.offHeap.enabled parameter, and its size is set with the spark.memory.offHeap.size parameter. Except that there is no "other" region, off-heap memory is divided in the same way as on-heap memory: all concurrently running tasks share its storage memory and execution memory.

2.3 Memory management interface

Spark provides a unified interface for the management of storage memory and execution memory - MemoryManager. Tasks in the same Executor call the methods of this interface to apply for or release memory:

Listing 1. Main methods of the memory management interface

// Apply for storage memory
def acquireStorageMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
// Apply for unroll memory
def acquireUnrollMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
// Apply for execution memory
def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Long
// Release storage memory
def releaseStorageMemory(numBytes: Long, memoryMode: MemoryMode): Unit
// Release execution memory
def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Unit
// Release unroll memory
def releaseUnrollMemory(numBytes: Long, memoryMode: MemoryMode): Unit

As we can see, calling these methods requires specifying a memory mode (MemoryMode), a parameter that determines whether the operation takes place on-heap or off-heap.

As for the concrete implementation of MemoryManager, Spark defaults to the unified memory management (UnifiedMemoryManager) mode as of version 1.6. The static memory management (StaticMemoryManager) mode used before 1.6 is still retained and can be enabled with the spark.memory.useLegacyMode parameter. The two modes differ in how they divide the memory space; Section 3 below introduces each in turn.

3. Memory space allocation

3.1 Static memory management

Under the static memory management mechanism originally adopted by Spark, the sizes of storage memory, execution memory, and "other" memory are fixed while the Spark application runs, but users can configure them before the application starts. The on-heap memory allocation is shown in Figure 2:

                                           Figure 2. Static memory management diagram - within the heap

As you can see, the size of available heap memory needs to be calculated as follows:

Listing 2. Available in-heap memory space
Available storage memory = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction 
Available execution memory = systemMaxMemory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction

Here systemMaxMemory depends on the size of the current JVM heap. The available execution memory or storage memory is obtained by multiplying it by the respective memoryFraction and safetyFraction parameters. The purpose of the two safetyFraction parameters in the formulas above is to logically set aside a safety region of 1 - safetyFraction, reducing the risk of OOM when actual memory use exceeds the preset range (as mentioned above, the sampled size estimates for non-serialized objects carry errors). Note that this reserved region is purely a logical plan: Spark does not treat it specially at run time, and like "other" memory it is left to the JVM to manage.
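Plugging in the historical defaults (spark.storage.memoryFraction = 0.6 with safetyFraction 0.9, spark.shuffle.memoryFraction = 0.2 with safetyFraction 0.8 — the documented pre-1.6 defaults, taken here as an assumption rather than read from a live config), the formulas in Listing 2 give concrete numbers. For a 10 GB heap:

```scala
// Worked example of the static-mode formulas with the pre-1.6 default fractions.
val systemMaxMemory = 10L * 1024 * 1024 * 1024 // 10 GB JVM heap

val storageMemoryFraction = 0.6 // spark.storage.memoryFraction (default)
val storageSafetyFraction = 0.9 // spark.storage.safetyFraction (default)
val shuffleMemoryFraction = 0.2 // spark.shuffle.memoryFraction (default)
val shuffleSafetyFraction = 0.8 // spark.shuffle.safetyFraction (default)

val availableStorage =
  (systemMaxMemory * storageMemoryFraction * storageSafetyFraction).toLong
val availableExecution =
  (systemMaxMemory * shuffleMemoryFraction * shuffleSafetyFraction).toLong
// 54% of the heap is usable storage memory and 16% is usable execution memory;
// the rest is "other" memory plus the two logically reserved safety regions.
```

So on a 10 GB heap, roughly 5.4 GB is usable storage memory and 1.6 GB usable execution memory, with the remaining 30% split between "other" memory and the safety regions.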

The off-heap space allocation is simpler: there are only storage memory and execution memory, as shown in Figure 3. The split between available execution memory and storage memory is determined directly by the spark.memory.storageFraction parameter. Since the space occupied by off-heap memory can be computed exactly, no safety region is needed.

                                            Figure 3. Static memory management diagram - outside the heap

The static memory management mechanism is relatively simple to implement, but if users are unfamiliar with Spark's storage mechanism, or do not configure it for their specific data scale and computing tasks, it easily produces a "half sea water, half flame" situation: one of the two regions (storage or execution) has plenty of space left while the other fills up early and is forced to evict or spill old content to accept new content. Because of the newer memory management mechanism, few developers use this mode now; Spark retains its implementation only for compatibility with older applications.

3.2 Unified memory management

The unified memory management mechanism introduced in Spark 1.6 differs from static memory management in that storage memory and execution memory share a common space and can dynamically occupy each other's free regions, as shown in Figure 4 and Figure 5.

                                            Figure 4. Unified memory management diagram - within the heap

                            Figure 5. Unified memory management diagram - outside the heap

The most important optimization is the dynamic occupancy mechanism, whose rules are as follows:

  • A basic region is set for storage memory and for execution memory (the spark.memory.storageFraction parameter), which determines the range of space each side owns.
  • If both sides are short of space, data is spilled to disk; if one side lacks space while the other side has free space, it may borrow the other side's space (storage being "short of space" means there is not enough space to fit a complete Block).
  • When execution memory has been occupied by the storage side, the storage side's occupied portion can be evicted to disk so that the borrowed space is "returned".
  • When storage memory has been occupied by the execution side, it cannot be made to "return" the space, because too many factors in the Shuffle process would need to be considered and the implementation would be complex [4].
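The rules above can be condensed into a toy model (a deliberate simplification, not Spark's UnifiedMemoryManager — real Spark evicts specific Blocks and tracks per-task pools; all names here are illustrative):

```scala
// Toy model of unified memory's dynamic occupancy: storage and execution each
// have a base region and may borrow the other's free space. Space borrowed by
// storage can be reclaimed by evicting blocks; space borrowed by execution cannot.
class UnifiedPool(total: Long, storageFraction: Double) {
  val storageBase: Long = (total * storageFraction).toLong
  val executionBase: Long = total - storageBase
  private var storageUsed = 0L
  private var executionUsed = 0L

  def acquireStorage(bytes: Long): Boolean =
    if (storageUsed + executionUsed + bytes <= total) { storageUsed += bytes; true }
    else false // storage cannot force execution to give borrowed space back

  def acquireExecution(bytes: Long): Boolean = {
    var free = total - storageUsed - executionUsed
    if (free < bytes) {
      // Execution may reclaim what storage borrowed beyond its base region,
      // modeled here as simply evicting that many bytes of cached blocks.
      val reclaimable = math.max(0L, storageUsed - storageBase)
      val evicted = math.min(reclaimable, bytes - free)
      storageUsed -= evicted
      free += evicted
    }
    if (free >= bytes) { executionUsed += bytes; true } else false
  }

  def usage: (Long, Long) = (storageUsed, executionUsed)
}
```

For example, in a 100-unit pool split 50/50, storage may grow to 70 while execution is idle; when execution later needs its full 50, the 20 borrowed units are evicted back, but the reverse reclamation (storage forcing execution out) never happens.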
                                      Figure 6. Illustration of dynamic occupancy mechanism

With the unified memory management mechanism, Spark improves the utilization of on-heap and off-heap memory resources to a certain extent and lowers the effort developers must spend maintaining Spark memory. That does not mean developers can sit back and relax, however. For example, an oversized storage region, or simply too much cached data, leads to frequent full garbage collections and degrades task execution performance, because cached RDD data usually resides in memory for a long time [5]. To fully exploit Spark's performance, developers therefore need a further understanding of how storage memory and execution memory are managed and implemented.

4. Storage memory management

4.1 RDD persistence mechanism

As Spark's most fundamental data abstraction, a Resilient Distributed Dataset (RDD) is a read-only collection of partition records (Partitions). It can be created only from a dataset in stable physical storage, or by applying a transformation to another existing RDD. The dependency between a transformed RDD and its parent RDD constitutes the lineage. Through lineage, Spark guarantees that every RDD can be recovered. All RDD transformations are lazy, however: only when an action that returns a result to the Driver occurs does Spark create tasks to read the RDD and actually trigger execution of the transformations.
When a Task starts and reads a partition, it first checks whether the partition has been persisted; if not, it must check the Checkpoint or recompute the partition according to the lineage. Therefore, if multiple actions are to be performed on one RDD, you can call persist or cache in the first action to persist or cache the RDD in memory or on disk, speeding up the computation in subsequent actions. In fact, the cache method persists the RDD into memory with the default storage level MEMORY_ONLY, so cache is a special case of persistence. The design of on-heap and off-heap storage memory allows unified planning and management of the memory used for caching RDDs (other uses of storage memory, such as caching broadcast data, are outside the scope of this article).

The persistence of RDDs is handled by Spark's Storage module [7], which decouples RDDs from physical storage. The Storage module manages the data Spark generates during computation, encapsulating access to data in memory or on disk, locally or remotely. Concretely, the Storage modules on the Driver and Executor sides form a master-slave architecture: the BlockManager on the Driver side is the Master, and the BlockManagers on the Executor side are the Slaves. The Storage module logically uses the Block as its basic storage unit; each Partition of an RDD corresponds to a unique Block after processing (the BlockId format is rdd_RDD-ID_PARTITION-ID). The Master manages and maintains the metadata of the Blocks across the entire Spark application, while the Slaves report Block status updates to the Master and receive commands from it, such as adding or deleting an RDD.

                                      Figure 7. Storage module diagram

When persisting an RDD, Spark provides 7 different storage levels (MEMORY_ONLY, MEMORY_AND_DISK, and so on), each a combination of the following 5 variables:

Listing 3. Storage levels

class StorageLevel private(
  private var _useDisk: Boolean,      // Disk
  private var _useMemory: Boolean,    // On-heap memory
  private var _useOffHeap: Boolean,   // Off-heap memory
  private var _deserialized: Boolean, // Whether in non-serialized form
  private var _replication: Int = 1   // Number of replicas
)

From this data structure we can see that a storage level defines how an RDD Partition (equivalently, a Block) is stored along three dimensions:

  • Storage location: disk / on-heap memory / off-heap memory. For example, MEMORY_AND_DISK uses both disk and on-heap memory; OFF_HEAP stores only in off-heap memory — currently, when off-heap memory is selected, the data cannot be stored in other locations at the same time.
  • Storage form: whether the Block is in non-serialized form after being cached in storage memory. For example, MEMORY_ONLY stores data non-serialized, while OFF_HEAP stores it serialized.
  • Number of replicas: when greater than 1, remote redundant backup to other nodes is required. For example, DISK_ONLY_2 needs one remote backup copy.
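A minimal stand-in for the StorageLevel class makes the three dimensions tangible (the flag combinations below follow this article's descriptions; the exact constants in a given Spark release may differ in detail):

```scala
// Minimal stand-in for Spark's StorageLevel (same five fields as Listing 3),
// with a few common levels spelled out to show the three dimensions.
case class Level(useDisk: Boolean, useMemory: Boolean, useOffHeap: Boolean,
                 deserialized: Boolean, replication: Int = 1)

val MEMORY_ONLY =
  Level(useDisk = false, useMemory = true, useOffHeap = false, deserialized = true)
val MEMORY_AND_DISK =
  Level(useDisk = true, useMemory = true, useOffHeap = false, deserialized = true)
val OFF_HEAP = // per this article: off-heap only, stored serialized
  Level(useDisk = false, useMemory = false, useOffHeap = true, deserialized = false)
val DISK_ONLY_2 = // replication 2: one local copy plus one remote backup
  Level(useDisk = true, useMemory = false, useOffHeap = false,
        deserialized = false, replication = 2)
```

Reading a level is then just reading its flags: location from the first three fields, form from `deserialized`, redundancy from `replication`.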

4.2 RDD caching process

Before an RDD is cached in storage memory, the data in a Partition is normally accessed through an iterator (Iterator), the standard way of traversing a data collection in Scala. Each serialized or non-serialized data item (Record) in the partition can be obtained through the Iterator. These Record object instances logically occupy the "other" part of the JVM heap, and the space of different Records in the same Partition is not contiguous.

After the RDD is cached in storage memory, the Partition is converted into a Block, and its Records occupy a contiguous space in on-heap or off-heap storage memory. Spark calls this process of turning a Partition from non-contiguous into contiguous storage space "Unroll". A Block has either a serialized or a non-serialized storage format, depending on the RDD's storage level: a non-serialized Block is defined by the DeserializedMemoryEntry data structure, which stores all object instances in an array, while a serialized Block is defined by the SerializedMemoryEntry data structure, which stores binary data in a byte buffer (ByteBuffer). Each Executor's Storage module manages all Block object instances in on-heap and off-heap storage memory with a linked hash map (LinkedHashMap) [6]; additions to and removals from this LinkedHashMap indirectly record the acquisition and release of memory.

Because there is no guarantee that the storage space can hold all of the Iterator's data at once, the current computing task must apply to the MemoryManager for sufficient Unroll space, which it occupies temporarily; if the application fails, Unroll fails, and if there is enough space it proceeds. For a serialized Partition, the required Unroll space can be accumulated directly and applied for in one step. A non-serialized Partition must apply gradually while its Records are traversed: each time a Record is read, the required Unroll space is estimated by sampling and applied for, and when space runs short the process can be interrupted and the occupied Unroll space released. If Unroll eventually succeeds, the Unroll space occupied by the Partition is converted into ordinary cached-RDD storage space, as shown in Figure 8 below.

                                Figure 8. Spark Unroll diagram
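The gradual, interruptible reservation for a non-serialized partition can be sketched as follows (a toy version, not Spark's MemoryStore; the size "estimate" here is simply each record's byte length):

```scala
// Toy sketch of unrolling a non-serialized partition: reserve unroll memory
// incrementally while traversing the iterator, failing early if a reservation
// cannot be satisfied. `reserve` stands in for MemoryManager.acquireUnrollMemory.
def tryUnroll(records: Iterator[Array[Byte]],
              reserve: Long => Boolean): Option[Vector[Array[Byte]]] = {
  var unrolled = Vector.empty[Array[Byte]]
  for (rec <- records) {
    val need = rec.length.toLong // stand-in for a sampled size estimate
    if (!reserve(need)) return None // insufficient space: unroll fails
    unrolled = unrolled :+ rec
  }
  Some(unrolled) // on success, unroll space becomes ordinary storage space
}
```

On failure the caller would release whatever Unroll space was already reserved; on success the reserved space is simply re-labeled as cached-Block storage, matching the conversion shown in Figure 8.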

As Figure 3 and Figure 5 show, static memory management sets aside a dedicated, fixed-size Unroll region within storage memory, whereas unified memory management does not distinguish an Unroll region: when storage space is insufficient, it is handled by the dynamic occupancy mechanism.

4.3 Eviction and dropping to disk

Since all computing tasks of the same Executor share a limited storage memory space, when a new Block needs to be cached but the remaining space is insufficient and cannot be dynamically borrowed, old Blocks in the LinkedHashMap must be evicted (Eviction); if an evicted Block's storage level also includes disk, it is dropped (Drop) to disk, otherwise it is deleted outright.

The eviction rules for storage memory are:

  • The old Block to be evicted must have the same MemoryMode as the new Block, i.e., both belong to off-heap or both to on-heap memory.
  • The old and new Blocks must not belong to the same RDD, to avoid cyclic eviction.
  • The RDD to which the old Block belongs must not be being read, to avoid consistency problems.
  • Blocks in the LinkedHashMap are traversed and evicted in least-recently-used (LRU) order until the space required by the new Block is freed. The LRU ordering is a built-in property of LinkedHashMap.

Dropping to disk is comparatively simple: if the Block's storage level has _useDisk set to true, then, based on its _deserialized flag, the data is serialized if it was in non-serialized form, written to disk, and finally its information is updated in the Storage module.
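The LRU traversal comes almost for free from java.util.LinkedHashMap's access-order mode, which is the property the text refers to. A simplified sketch (omitting the MemoryMode, same-RDD, and read-lock checks listed above):

```scala
// Sketch of LRU eviction via java.util.LinkedHashMap's access-order mode —
// the same property Spark's block map relies on. Real eviction also checks
// MemoryMode, the owning RDD, and read locks.
import java.util.{LinkedHashMap => JLinkedHashMap}

val blocks = new JLinkedHashMap[String, Long](16, 0.75f, true) // accessOrder = true
blocks.put("rdd_0_0", 10L)
blocks.put("rdd_0_1", 10L)
blocks.put("rdd_1_0", 10L)
blocks.get("rdd_0_0") // access moves this block to the most-recently-used end

// Evict in iteration order (= least recently used first) until enough is freed.
def evictUntil(needed: Long): Seq[String] = {
  var freed = 0L
  val evicted = scala.collection.mutable.Buffer.empty[String]
  val it = blocks.entrySet().iterator()
  while (freed < needed && it.hasNext) {
    val e = it.next()
    freed += e.getValue
    evicted += e.getKey
    it.remove()
  }
  evicted.toSeq
}
```

Because `rdd_0_0` was accessed last, it survives an eviction pass that removes the two least recently used Blocks.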

5. Execution memory management

5.1 Memory allocation among multiple tasks

Tasks running in the Executor also share execution memory, and Spark uses a HashMap to record the mapping from each task to its memory consumption. The execution memory each task can occupy ranges between 1/2N and 1/N of the pool, where N is the number of tasks currently running in the Executor. When a task starts, it must be able to obtain at least 1/2N of the execution memory from the MemoryManager; if the request cannot be satisfied, the task is blocked until other tasks release enough execution memory, at which point it is woken up.
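The 1/2N–1/N bounds can be stated as a tiny helper (a sketch of the policy, not Spark's ExecutionMemoryPool):

```scala
// Per-task execution-memory bounds: with N active tasks, a task may hold at
// most poolSize / N and must secure at least poolSize / (2 * N) before running.
def taskBounds(poolSize: Long, numActiveTasks: Int): (Long, Long) = {
  require(numActiveTasks > 0, "at least one active task")
  val maxPerTask = poolSize / numActiveTasks
  val minPerTask = poolSize / (2L * numActiveTasks)
  (minPerTask, maxPerTask)
}
```

For instance, with a 4096 MB execution pool and 8 active tasks, each task is guaranteed at least 256 MB and capped at 512 MB; as tasks finish, N shrinks and the survivors' bounds grow.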

5.2 Memory usage of Shuffle

Execution memory is mainly used to store the memory occupied by tasks when executing Shuffle. Shuffle is a process of repartitioning RDD data according to certain rules. Let's look at the use of execution memory in the two stages of Shuffle's Write and Read:

  • Shuffle Write
  1. If the ordinary sort path is chosen on the map side, an ExternalSorter performs the external sort; storing data in memory mainly occupies on-heap execution space.
  2. If Tungsten sort is chosen on the map side, a ShuffleExternalSorter sorts the data stored in serialized form directly; storing data in memory may occupy off-heap or on-heap execution space, depending on whether the user has enabled off-heap memory and whether enough off-heap execution memory is available.
  • Shuffle Read
  1. When aggregating data on the reduce side, the data is handed to an Aggregator; storing data in memory occupies on-heap execution space.
  2. If the final result needs to be sorted, the data is handed to an ExternalSorter again, occupying on-heap execution space.

In ExternalSorter and Aggregator, Spark uses a hash table called AppendOnlyMap to store data in on-heap execution memory. Not all data can be kept in this hash table during Shuffle, however: the memory it occupies is estimated by periodic sampling, and when it grows large enough that no more execution memory can be obtained from the MemoryManager, Spark writes its entire contents out to a disk file. This process is called spilling (Spill), and files spilled to disk are eventually merged (Merge).

The Tungsten used in the Shuffle Write stage is Databricks' plan for optimizing Spark's memory and CPU use [9], which addresses some JVM performance limitations and drawbacks. Spark automatically decides, based on the Shuffle situation, whether to use Tungsten sort. Tungsten's paged memory management is built on top of the MemoryManager; that is, Tungsten abstracts the use of execution memory one step further, so that during Shuffle there is no need to care whether data is stored on-heap or off-heap. Each memory page is defined by a MemoryBlock, which uses the pair of variables Object obj and long offset to identify a memory page's address in a uniform way. An on-heap MemoryBlock is memory allocated as a long array: obj is the object reference of that array, offset is the initial offset of the long array within the JVM, and together they locate the array's absolute address in the heap. An off-heap MemoryBlock is a directly allocated memory block: its obj is null, and offset is the block's 64-bit absolute address in system memory. With MemoryBlock, Spark cleverly wraps on-heap and off-heap memory pages in a single abstraction, and it uses a page table (pageTable) to manage the memory pages each Task has applied for.

All memory under Tungsten page management is represented by a 64-bit logical address, consisting of a page number and an offset within the page:

  • Page number: 13 bits, uniquely identifying a memory page; before applying for a memory page, Spark must first apply for a free page number.
  • In-page offset: 51 bits, the offset address of the data within the memory page where it is stored.
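The 13-bit/51-bit split described above can be sketched directly with bit operations (the field widths follow the text; the helper names are illustrative, not Spark's TaskMemoryManager API):

```scala
// Sketch of Tungsten's 64-bit logical address layout as described above:
// high 13 bits for the page number, low 51 bits for the in-page offset.
val OffsetBits = 51
val OffsetMask: Long = (1L << OffsetBits) - 1

def encodeAddress(pageNumber: Long, offsetInPage: Long): Long = {
  require(pageNumber < (1L << 13), "page number must fit in 13 bits")
  require(offsetInPage <= OffsetMask, "offset must fit in 51 bits")
  (pageNumber << OffsetBits) | offsetInPage
}

def decodePage(address: Long): Long = address >>> OffsetBits // unsigned shift
def decodeOffset(address: Long): Long = address & OffsetMask
```

A sorter can compare and move these 64-bit values without ever touching the records they point to, which is what makes the pointer-only sort in the next paragraph possible.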

With this unified addressing, Spark can use a 64-bit logical address pointer to locate memory either on-heap or off-heap. The entire Shuffle Write sort then only needs to sort pointers, with no deserialization — an extremely efficient process that markedly improves memory access efficiency and CPU utilization [10].

Spark's storage memory and execution memory are managed in completely different ways. For storage memory, Spark uses a LinkedHashMap to manage all Blocks centrally, the Blocks being converted from the Partitions of RDDs that need caching. For execution memory, Spark uses an AppendOnlyMap to store Shuffle data, and in Tungsten sort even abstracts it into paged memory management, opening up a whole new JVM memory management mechanism.

Origin blog.csdn.net/qq_42264264/article/details/105931626