Spark Kernel Analysis-Memory Management 7(6)

1. Spark memory management

As a memory-based distributed computing engine, Spark's memory management module plays a very important role in the entire system. Understanding the basic principles of Spark memory management will help you better develop Spark applications and perform performance tuning. The principles explained in this article are based on Spark version 2.1.
When a Spark application is executed, the Spark cluster starts two kinds of JVM processes: the Driver and the Executors. The former is the main control process, responsible for creating the Spark context, submitting Spark jobs (Jobs), converting jobs into computing tasks (Tasks), and coordinating task scheduling across the Executors. The latter is responsible for executing specific computing tasks on the worker nodes and returning the results to the Driver, and it also provides storage for RDDs that need to be persisted. Since the Driver's memory management is relatively simple, this article mainly analyzes the Executor's memory management; "Spark memory" below refers specifically to Executor memory.

1.1 On-heap and off-heap memory planning

As a JVM process, the Executor manages memory on top of the JVM's own memory management. Spark subdivides the JVM's on-heap space in order to make fuller use of memory. At the same time, Spark introduces off-heap memory, which lets it allocate space directly in the worker node's system memory, further optimizing memory usage. On-heap and off-heap memory are illustrated below:

Figure 1. On-heap and off-heap memory

1.1.1 On-heap memory

The size of on-heap memory is configured with the --executor-memory option or the spark.executor.memory parameter when the Spark application is started. Concurrent tasks running in the Executor share the JVM heap. The memory these tasks use to cache RDD data and broadcast data is planned as Storage memory, and the memory they use while executing Shuffle is planned as Execution memory. The remaining part is not specially planned: object instances inside Spark, as well as object instances in user-defined Spark code, all occupy this remaining space. Under different management modes, the proportions occupied by these three parts differ (introduced in Section 1.2 below).
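As a minimal sketch (the 4g value is purely illustrative and memory-demo is a hypothetical application name), this heap size can be set on submission, e.g. spark-submit --executor-memory 4g ..., or programmatically before the SparkContext is created:

val conf = new org.apache.spark.SparkConf()
  .setAppName("memory-demo")           // hypothetical application name
  .set("spark.executor.memory", "4g")  // JVM heap available to each Executor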
Spark's management of on-heap memory is a logical, "planning"-level management, because the allocation and release of the memory occupied by object instances is handled by the JVM; Spark can only record that memory after it is allocated and before it is released. The specific process is as follows:
Applying for memory:
1) Spark creates a new object instance in its code.
2) The JVM allocates space from heap memory, creates the object, and returns a reference to it.
3) Spark saves the reference and records the memory occupied by the object.
Releasing memory:
1) Spark records the memory released by the object and deletes the reference to it.
2) Spark waits for the JVM's garbage collector to actually reclaim the heap memory occupied by the object.
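The bookkeeping described above can be pictured with a purely conceptual sketch (this is not Spark's actual API): Spark only tracks estimated sizes, while the JVM owns the real allocation and garbage collection.

// Conceptual only: record estimated sizes around ordinary JVM allocation.
class OnHeapBookkeeper {
  private var recordedBytes: Long = 0L
  def recordAllocation(estimatedSize: Long): Unit = { recordedBytes += estimatedSize }
  def recordRelease(estimatedSize: Long): Unit = { recordedBytes -= estimatedSize } // real freeing waits for GC
  def recordedUsage: Long = recordedBytes
}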
We know that JVM objects can be stored in serialized form. Serialization converts an object into a binary byte stream, which in essence can be understood as converting the chained storage of non-contiguous space into contiguous or block storage. When the data is accessed, the reverse process, deserialization, converts the byte stream back into objects. Serialization saves storage space, but it adds computational overhead when storing and reading.
For serialized objects in Spark, since they exist as byte streams, the memory they occupy can be calculated directly. For non-serialized objects, the occupied memory is approximated by periodic sampling, that is, the occupied size is not recalculated every time a new data item is added. This reduces time overhead but may introduce considerable error, so the actual memory at a given moment may far exceed expectations [2]. In addition, object instances that Spark has marked as released may well not yet have been reclaimed by the JVM, so the actually available memory can be smaller than the available memory Spark has recorded. Spark therefore cannot precisely track the actually available heap memory, and so it cannot completely avoid out-of-memory (OOM) exceptions.
Although it cannot precisely control the application and release of on-heap memory, Spark can, through its own planning and management of storage memory and execution memory, decide whether to cache new RDDs in storage memory and whether to allocate execution memory to new tasks. To a certain extent this improves memory utilization and reduces the occurrence of exceptions.

1.1.2 Off-heap memory

In order to further optimize memory usage and improve the efficiency of sorting during Shuffle, Spark introduces off-heap memory, which lets it allocate space directly in the worker node's system memory and store serialized binary data there. Using the JDK Unsafe API (starting from Spark 2.0, off-heap storage memory is no longer managed via Tachyon but, like off-heap execution memory, is based on the JDK Unsafe API [3]), Spark can operate directly on off-heap memory, reducing unnecessary memory overhead as well as frequent GC scanning and collection, and improving processing performance. Off-heap memory can be applied for and released precisely, and the space occupied by serialized data can be calculated exactly, so compared with on-heap memory it is easier to manage and less error-prone.
Off-heap memory is not enabled by default. It can be enabled with the spark.memory.offHeap.enabled parameter, and the size of the off-heap space is set with the spark.memory.offHeap.size parameter. Apart from having no "other" space, off-heap memory is divided in the same way as on-heap memory, and all concurrently running tasks share its storage memory and execution memory.
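A hedged configuration sketch (the 2g size is illustrative): off-heap memory stays disabled unless both parameters below are set.

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.offHeap.enabled", "true")  // off by default
  .set("spark.memory.offHeap.size", "2g")       // absolute size of the off-heap region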

1.1.3 Memory management interface

Spark provides a unified interface for managing storage memory and execution memory: MemoryManager. Tasks in the same Executor all call this interface's methods to apply for or release memory. Its main methods are:
// Apply for storage memory
def acquireStorageMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
// Apply for unroll memory
def acquireUnrollMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
// Apply for execution memory
def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Long
// Release storage memory
def releaseStorageMemory(numBytes: Long, memoryMode: MemoryMode): Unit
// Release execution memory
def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Unit
// Release unroll memory
def releaseUnrollMemory(numBytes: Long, memoryMode: MemoryMode): Unit
We can see that calling these methods requires specifying the memory mode (MemoryMode); this parameter determines whether the operation is performed on-heap or off-heap.
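For illustration, a hypothetical call site might look like the following (memoryManager and blockId are assumed to be in scope and MemoryMode is org.apache.spark.memory.MemoryMode; this is a sketch, not code taken from Spark):

// Ask for 1 MB of on-heap storage memory for a block; the Boolean says whether it was granted.
val granted = memoryManager.acquireStorageMemory(blockId, 1024L * 1024L, MemoryMode.ON_HEAP)
if (!granted) {
  // the caller must react, e.g. evict other blocks or give up caching this one
}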
As for the concrete implementation of MemoryManager, since Spark 1.6 the default is the Unified Memory Manager; the Static Memory Manager used before 1.6 is still retained and can be enabled with the spark.memory.useLegacyMode parameter. The two differ in how the space is allocated; Section 1.2 below introduces both.

1.2 Memory space allocation

1.2.1 Static memory management

Under the static memory management mechanism originally adopted by Spark, the sizes of storage memory, execution memory, and "other" memory are fixed while the Spark application runs, but users can configure them before the application starts. The allocation of on-heap memory is shown in Figure 2:
Figure 2. Static memory management: on-heap

As the figure shows, the available on-heap memory is calculated as follows:

Available storage memory = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction
Available execution memory = systemMaxMemory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction
Here systemMaxMemory depends on the size of the current JVM heap, and the final available execution or storage memory is obtained by multiplying it by the respective memoryFraction and safetyFraction parameters. The significance of the two safetyFraction parameters in the formulas above is to logically reserve a safety margin of 1-safetyFraction, reducing the risk of OOM caused by actual memory use exceeding the preset range (as mentioned above, the sampled size estimates for non-serialized objects carry errors). Note that this reserved area is only a logical plan: Spark does not treat it differently in practice, and it is left to the JVM to manage just like the "other" memory.
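Plugging in the historical defaults makes the formulas concrete (spark.storage.memoryFraction = 0.6, spark.storage.safetyFraction = 0.9, spark.shuffle.memoryFraction = 0.2, spark.shuffle.safetyFraction = 0.8; the 10 GB heap is an assumed example):

val systemMaxMemory = 10L * 1024 * 1024 * 1024               // assume a 10 GB Executor heap
val usableStorage   = (systemMaxMemory * 0.6 * 0.9).toLong   // ~5.4 GB of storage memory
val usableExecution = (systemMaxMemory * 0.2 * 0.8).toLong   // ~1.6 GB of execution memory
// the remaining ~3 GB ("other" memory plus the two safety margins) is left to the JVM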
Off-heap space allocation is simpler: there is only storage memory and execution memory, as shown in Figure 3. The share of available execution memory versus storage memory is determined directly by the spark.memory.storageFraction parameter. Since the space occupied by off-heap memory can be calculated precisely, there is no need for a safety margin.
Figure 3. Static memory management: off-heap

The static memory management mechanism is relatively simple to implement, but if users are not familiar with Spark's storage mechanisms, or do not configure them for their specific data scale and computing tasks, it is easy to end up in a "half sea water, half flame" situation: one of storage memory and execution memory has plenty of space left while the other fills up early and has to evict or spill old content to hold new content. Because of the newer memory management mechanism, this mode is now used by few developers; Spark retains its implementation for compatibility with older applications.

1.2.2 Unified memory management

The unified memory management mechanism introduced in Spark 1.6 differs from static memory management in that storage memory and execution memory share a single region and can dynamically occupy each other's free space, as shown in Figures 4 and 5.
Figure 4. Unified memory management: on-heap

Figure 5. Unified memory management: off-heap
The most important optimization is the dynamic occupancy mechanism. Its rules are as follows (a configuration sketch follows Figure 6):
1) Set a base region for storage memory and execution memory (the spark.memory.storageFraction parameter). This setting determines the range of space each side owns.
2) When both sides have insufficient space, data is spilled to disk; if one side's space is insufficient while the other side has free space, it can borrow the other side's space. ("Insufficient storage space" means there is not enough space to hold a complete Block.)
3) After execution memory has been occupied by the storage side, the storage side can spill the occupied portion to disk and "return" the borrowed space.
4) After storage memory has been occupied by the execution side, it cannot be "returned", because many factors in the Shuffle process would have to be considered and the implementation would be complex.
Figure 6. Illustration of the dynamic occupancy mechanism
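Under the unified manager, the two relevant knobs are spark.memory.fraction and spark.memory.storageFraction; a hedged sketch with the Spark 2.x defaults (0.6 and 0.5) written out explicitly:

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.fraction", "0.6")         // share of the usable heap given to storage + execution
  .set("spark.memory.storageFraction", "0.5")  // storage side's base share of that unified region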
With the unified memory management mechanism, Spark improves the utilization of on-heap and off-heap memory resources to a certain extent and lowers the difficulty of maintaining Spark memory for developers, but that does not mean developers can sit back and relax. For example, if the storage memory space is too large, or too much data is cached, it can lead to frequent full garbage collection and reduce task execution performance, because cached RDD data usually resides in memory for a long time [5]. Therefore, to get the most out of Spark, developers need a further understanding of how storage memory and execution memory are managed and implemented.

1.3 Storage memory management

1.3.1 Persistence mechanism of RDD

As Spark's most fundamental data abstraction, a Resilient Distributed Dataset (RDD) is a read-only collection of partitioned records (Partitions). It can only be created from data sets in stable physical storage, or by performing transformation operations on other existing RDDs to produce new RDDs. The dependency relationship between a transformed RDD and its original RDD constitutes the lineage. With lineage, Spark guarantees that every RDD can be recovered. However, all RDD transformations are lazy: only when an action that returns a result to the Driver occurs does Spark create tasks to read the RDD and actually trigger execution of the transformations.
When a Task reads a partition at startup, it first determines whether the partition has already been persisted; if not, it needs to check the Checkpoint or recompute the partition according to the lineage. Therefore, if multiple actions will be performed on one RDD, you can use persist or cache in the first action to persist or cache the RDD in memory or on disk, thereby speeding up the computation in subsequent actions. In fact, the cache method persists the RDD into memory with the default storage level MEMORY_ONLY, so cache is a special case of persistence. The design of on-heap and off-heap storage memory allows unified planning and management of the memory used when caching RDDs (other uses of storage memory, such as caching broadcast data, are outside the scope of this article).
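A small usage sketch of persist/cache as described above (sc is assumed to be an existing SparkContext, and the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val lengths = sc.textFile("hdfs:///path/to/input").map(_.length)
lengths.persist(StorageLevel.MEMORY_AND_DISK)  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
lengths.count()         // first action: computes the partitions and caches them as Blocks
lengths.reduce(_ + _)   // later actions read the cached Blocks instead of recomputing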
RDD persistence is handled by Spark's Storage module [7], which decouples RDDs from physical storage. The Storage module manages the data Spark generates during computation, encapsulating access to data in memory or on disk, locally or remotely. In the concrete implementation, the Storage modules on the Driver side and the Executor side form a master-slave architecture: the BlockManager on the Driver side is the Master, and the BlockManagers on the Executor side are the Slaves. The Storage module logically uses the Block as its basic storage unit; each Partition of an RDD uniquely corresponds to a Block after processing (the BlockId format is rdd_RDD-ID_PARTITION-ID). The Master manages and maintains the metadata of all Blocks in the entire Spark application, while the Slaves report Block update status to the Master and receive commands from the Master, such as adding or deleting an RDD.
Figure 7. Storage module diagram

When persisting an RDD, Spark provides a range of storage levels, such as MEMORY_ONLY and MEMORY_AND_DISK; each storage level is a combination of the following five fields:
Listing 3. Storage level

class StorageLevel private(
  private var _useDisk: Boolean,      // use disk
  private var _useMemory: Boolean,    // this actually means on-heap memory
  private var _useOffHeap: Boolean,   // use off-heap memory
  private var _deserialized: Boolean, // whether stored in non-serialized form
  private var _replication: Int = 1   // number of replicas
)

Analyzing this data structure shows that a storage level defines how an RDD's Partition (that is, the Block it becomes) is stored along three dimensions:
1) Storage location: disk / on-heap memory / off-heap memory. For example, MEMORY_AND_DISK allows the data to use both on-heap memory and disk (partitions that do not fit in memory are stored on disk), while OFF_HEAP stores data only in off-heap memory; currently, when off-heap memory is selected, the data cannot be stored anywhere else at the same time.
2) Storage form: whether the Block is in non-serialized form after being cached in storage memory. For example, MEMORY_ONLY stores data in non-serialized form, while OFF_HEAP stores it serialized.
3) Number of replicas: when greater than 1, remote redundant backup to other nodes is required. For example, DISK_ONLY_2 requires one remote backup replica.
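To make these dimensions concrete, here is a sketch of how a few predefined levels map onto the five fields from Listing 3 (the comments list the field values in the order useDisk, useMemory, useOffHeap, deserialized, replication; the constants themselves live in the StorageLevel companion object):

import org.apache.spark.storage.StorageLevel

StorageLevel.MEMORY_ONLY      // (false, true,  false, true,  1)
StorageLevel.MEMORY_ONLY_SER  // (false, true,  false, false, 1)
StorageLevel.MEMORY_AND_DISK  // (true,  true,  false, true,  1)
StorageLevel.DISK_ONLY_2      // (true,  false, false, false, 2)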

1.3.2 RDD caching process

Before an RDD is cached in storage memory, the data in a Partition is generally accessed through an iterator, the standard way of traversing a data collection in Scala. Each serialized or non-serialized data item (Record) in the partition can be obtained through the Iterator. These Record object instances logically occupy the "other" part of the JVM heap, and the space of different Records within the same Partition is not contiguous.
After the RDD is cached in storage memory, the Partition is converted into a Block, and its Records occupy a contiguous piece of space in on-heap or off-heap storage memory. Spark calls this process of converting a Partition from non-contiguous storage space into contiguous storage space "Unroll". A Block has two storage formats, serialized and non-serialized, depending on the RDD's storage level. A non-serialized Block is defined by the DeserializedMemoryEntry data structure and uses an array to hold all of its object instances, while a serialized Block is defined by the SerializedMemoryEntry data structure and uses a byte buffer (ByteBuffer) to hold the binary data. Each Executor's Storage module uses a linked hash map (LinkedHashMap) to manage all Block object instances in on-heap and off-heap storage memory [6]; additions to and removals from this LinkedHashMap indirectly record the allocation and release of memory.
Because there is no guarantee that the storage space can hold all the data in the Iterator at once, the current computing task must apply to the MemoryManager for enough Unroll space to temporarily occupy; if there is not enough space, the Unroll fails, and it continues only when enough space can be obtained. For a serialized Partition, the required Unroll space can simply be accumulated and requested once; a non-serialized Partition has to request space progressively as the Records are traversed, that is, each time a Record is read, a sample-based estimate of the Unroll space it needs is made and requested. When space runs out, the process can be interrupted and the occupied Unroll space released. If the Unroll finally succeeds, the Unroll space occupied by the current Partition is converted into normal storage space for the cached RDD, as shown in Figure 8 below.
Figure 8. Schematic diagram of Spark Unroll.
As can be seen in Figures 3 and 5, under static memory management Spark sets aside a dedicated Unroll space of fixed size within storage memory; under unified memory management there is no special Unroll region, and when storage space is insufficient it is handled according to the dynamic occupancy mechanism.
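The iterative reservation described above can be sketched as follows (conceptual only, not Spark's actual code; the reservation sizes and check interval are arbitrary assumptions):

// Traverse the iterator, topping up unroll memory periodically; give up if a reservation is refused.
def unrollPartition(values: Iterator[Any], reserveUnrollMemory: Long => Boolean): Option[Vector[Any]] = {
  val initialReservation = 1024L * 1024L  // assumed 1 MB initial ask
  val topUp              = 1024L * 1024L  // assumed size of each further ask
  val checkEvery         = 16             // re-estimate every 16 records
  val buffer = Vector.newBuilder[Any]
  var keepUnrolling = reserveUnrollMemory(initialReservation)
  var recordsRead = 0L
  while (values.hasNext && keepUnrolling) {
    buffer += values.next()
    recordsRead += 1
    if (recordsRead % checkEvery == 0) keepUnrolling = reserveUnrollMemory(topUp)
  }
  if (keepUnrolling) Some(buffer.result())  // success: the space becomes normal storage memory
  else None                                 // failure: the caller releases the unroll space
}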

1.3.3 Eviction and dropping to disk

Since all computing tasks of the same Executor share a limited storage memory space, when a new Block needs to be cached but the remaining space is insufficient and cannot be dynamically borrowed, old Blocks in the LinkedHashMap must be evicted (Eviction). If an evicted Block's storage level also includes storing on disk, it must be dropped to disk (Drop); otherwise it is deleted directly.
The eviction rules for storage memory are:
1) The old Block being evicted must have the same MemoryMode as the new Block, that is, both belong to off-heap or both to on-heap memory.
2) The old and new Blocks cannot belong to the same RDD, to avoid cyclic eviction.
3) The RDD to which the old Block belongs must not be in the read state, to avoid consistency problems.
4) Traverse the Blocks in the LinkedHashMap and evict them in least-recently-used (LRU) order until the space required by the new Block is satisfied; LRU ordering is a property of the LinkedHashMap.
The process of dropping to disk is relatively simple: if the storage level satisfies the condition that _useDisk is true, then _deserialized is checked to determine whether the data is in non-serialized form; if so, it is serialized, and finally the data is stored to disk and its information is updated in the Storage module.
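A conceptual sketch of these eviction rules (not Spark's code; the read-state check of rule 3 is omitted, and Scala's mutable LinkedHashMap stands in for the access-ordered map Spark uses):

import scala.collection.mutable

case class BlockMeta(rddId: Int, size: Long, offHeap: Boolean)

// Walk candidate Blocks in the map's (LRU) order, picking those with the same MemoryMode
// that belong to a different RDD, until enough space would be freed.
def selectBlocksToEvict(blocks: mutable.LinkedHashMap[String, BlockMeta],
                        newRddId: Int, newOffHeap: Boolean, spaceNeeded: Long): Seq[String] = {
  var freed = 0L
  val selected = mutable.ArrayBuffer[String]()
  val it = blocks.iterator
  while (freed < spaceNeeded && it.hasNext) {
    val (blockId, meta) = it.next()
    if (meta.offHeap == newOffHeap && meta.rddId != newRddId) {
      selected += blockId
      freed += meta.size
    }
  }
  if (freed >= spaceNeeded) selected.toSeq else Seq.empty  // evict nothing if the target cannot be met
}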

1.4 Execution memory management

1.4.1 Memory allocation between multiple tasks

Tasks running within an Executor also share execution memory, and Spark uses a HashMap to store the mapping from each task to its memory consumption. The execution memory each task can occupy ranges between 1/(2N) and 1/N of the pool, where N is the number of tasks currently running in the Executor. When a task starts, it must be able to obtain at least 1/(2N) of the execution memory from the MemoryManager; if this request cannot be satisfied, the task is blocked until other tasks release enough execution memory, at which point it can be woken up.
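A worked example of the 1/(2N) to 1/N bounds (the pool size and task count are assumed):

val executionMemoryPool = 8L * 1024 * 1024 * 1024  // assume 8 GB of execution memory
val numActiveTasks      = 4                        // N: tasks currently running in this Executor
val minPerTask = executionMemoryPool / (2 * numActiveTasks)  // 1 GB: a task blocks until at least this much is available
val maxPerTask = executionMemoryPool / numActiveTasks        // 2 GB: the most a single task may hold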

1.4.2 Shuffle memory usage

Execution memory is mainly used to store the memory occupied by tasks while executing Shuffle. Shuffle is the process of repartitioning RDD data according to certain rules. Let's look at how execution memory is used in the Write and Read stages of Shuffle:
Shuffle Write
1) If the ordinary sort path is chosen on the map side, an ExternalSorter is used for external sorting; when it holds data in memory, it mainly occupies on-heap execution space.
2) If Tungsten sort is chosen on the map side, a ShuffleExternalSorter is used to sort data stored directly in serialized form; when it holds data in memory, it can occupy off-heap or on-heap execution space, depending on whether the user has enabled off-heap memory and whether off-heap execution memory is sufficient.
Shuffle Read
1) When aggregating data on the reduce side, the data is handed to an Aggregator for processing, which occupies on-heap execution space while holding data in memory.
2) If the final result needs to be sorted, the data is handed to an ExternalSorter again, occupying on-heap execution space.
In ExternalSorter and Aggregator, Spark uses a hash table called AppendOnlyMap to store data in on-heap execution memory. However, not all data can be kept in this hash table during Shuffle: the memory occupied by the hash table is estimated by periodic sampling, and when it grows large enough that no more execution memory can be obtained from the MemoryManager, Spark stores its entire contents in a disk file. This process is called spilling (Spill), and the files spilled to disk are eventually merged (Merge).
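The spill decision can be pictured with the following conceptual sketch (not Spark's actual code; the function parameters are assumptions standing in for the real ExternalSorter/Aggregator machinery):

// Try to grow the memory reserved for the in-memory map; spill everything to disk if the grant falls short.
def maybeSpill(estimatedMapBytes: Long, reservedBytes: Long,
               acquireExecutionMemory: Long => Long, spillToDisk: () => Unit): Long = {
  if (estimatedMapBytes <= reservedBytes) {
    reservedBytes                                // still within what we already hold
  } else {
    val wanted  = estimatedMapBytes - reservedBytes
    val granted = acquireExecutionMemory(wanted) // the MemoryManager may grant less than requested
    if (granted < wanted) { spillToDisk(); 0L }  // write contents to a spill file and release the memory
    else reservedBytes + granted
  }
}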
The Tungsten path used in the Shuffle Write stage is a plan proposed by Databricks to optimize Spark's memory and CPU usage [9]; it addresses some JVM performance limitations and drawbacks. Spark automatically decides whether to use Tungsten sort according to the Shuffle situation. The paged memory management adopted by Tungsten is built on top of the MemoryManager, i.e. Tungsten abstracts the use of execution memory one step further, so that during Shuffle there is no need to care whether the data is stored on-heap or off-heap. Each memory page is defined by a MemoryBlock, and the two fields Object obj and long offset are used to uniformly identify the address of a memory page in system memory. An on-heap MemoryBlock is memory allocated in the form of a long array: obj is the object reference of that array, and offset is the initial offset of the long array within the JVM; together they locate the array's absolute address within the heap. An off-heap MemoryBlock is a directly allocated block of memory: its obj is null, and offset is the 64-bit absolute address of the block in system memory. With MemoryBlock, Spark cleverly provides a unified abstraction over on-heap and off-heap memory pages, and it uses a page table (pageTable) to manage the memory pages requested by each Task.
Under Tungsten's page management, all memory is addressed by a 64-bit logical address consisting of a page number and an offset within the page:
Page number: occupies 13 bits and uniquely identifies a memory page; Spark must apply for a free page number before requesting a memory page.
In-page offset: occupies 51 bits; it is the offset of the data within the page when the memory page is used to store data.
With this unified addressing scheme, Spark can locate memory either on-heap or off-heap with a single 64-bit logical address pointer. The entire sorting process in Shuffle Write only needs to sort these pointers and requires no deserialization, which makes the whole process very efficient and brings a clear improvement in memory access efficiency and CPU utilization [10].
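The 13-bit/51-bit split described above can be written out explicitly (a sketch; the helper names are ours, not Spark's API):

// High 13 bits: page number; low 51 bits: offset within the page.
val OffsetBits = 51
val OffsetMask = (1L << OffsetBits) - 1

def encodeAddress(pageNumber: Int, offsetInPage: Long): Long =
  (pageNumber.toLong << OffsetBits) | (offsetInPage & OffsetMask)

def pageNumberOf(address: Long): Int = (address >>> OffsetBits).toInt
def offsetOf(address: Long): Long    = address & OffsetMask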
Spark's storage memory and execution memory are managed in completely different ways: for storage memory, Spark uses a LinkedHashMap to centrally manage all Blocks, which are converted from the Partitions of RDDs that need to be cached; for execution memory, Spark uses an AppendOnlyMap to store the data in the Shuffle process, and in Tungsten sort this is further abstracted into paged memory management, opening up a new JVM memory management mechanism.
