Spark Kernel Analysis (7): Spark Memory Management Principles

When a Spark application is executed, the Spark cluster starts two kinds of JVM processes: the Driver and the Executors. The Driver is the main control process, responsible for creating the Spark context, submitting Spark jobs, converting jobs into computing tasks, and coordinating the scheduling of tasks among the Executor processes. The Executors are responsible for executing specific computing tasks on the worker nodes and returning the results to the Driver, and they also provide storage for RDDs that need to be persisted. Since the Driver's memory management is relatively simple, this section mainly analyzes the Executor's memory management; below, "Spark memory" refers specifically to Executor memory.

1. On-heap and off-heap memory planning

As a JVM process, the Executor builds its memory management on top of the JVM's. Spark allocates the JVM's on-heap space in finer detail in order to make full use of memory. At the same time, Spark introduces off-heap memory, which allows it to open up space directly in the worker node's system memory, further optimizing memory usage.

On-heap memory is managed uniformly by the JVM, while off-heap memory is requested from and released back to the operating system directly.

1.1 On-heap memory

The size of the on-heap memory is configured by the --executor-memory or spark.executor.memory parameter when the Spark application starts. The concurrent tasks running in the Executor share the JVM heap. The memory these tasks occupy when caching RDD data and broadcast (Broadcast) data is planned as storage (Storage) memory, the memory they occupy when executing Shuffle is planned as execution (Execution) memory, and the remaining part is given no special plan: object instances internal to Spark, and object instances in user-defined Spark application code, all occupy the remaining space. Under different management modes, the space occupied by these three parts differs.
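For example, a minimal way to set the executor heap size (the application name and the 4g value here are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-demo")          // illustrative application name
  .set("spark.executor.memory", "4g") // same effect as --executor-memory 4g on spark-submit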

Spark's management of on-heap memory is a logical, "planning-style" management, because the request and release of the memory occupied by object instances are carried out by the JVM; Spark can only record that memory after the request and before the release. Let's look at the specific processes.

The specific process of requesting memory:

(1) Spark creates a new object instance in its code;
(2) the JVM allocates space from heap memory, creates the object, and returns an object reference;
(3) Spark saves the object reference and records the memory occupied by the object.

The specific process of releasing memory:

(1) Spark records the memory freed by the object and deletes its reference to the object;
(2) it waits for the JVM's garbage collection mechanism to release the heap memory the object occupies.
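A minimal sketch of this bookkeeping, assuming a simple counter-based manager (this is not Spark's actual MemoryManager, which is considerably richer):

class BookkeepingManager(maxBytes: Long) {
  private var used = 0L

  // requesting, step (3): record the size after the JVM has already allocated the object
  def record(estimatedSize: Long): Boolean = synchronized {
    if (used + estimatedSize > maxBytes) false // would exceed the plan: refuse
    else { used += estimatedSize; true }
  }

  // releasing, step (1): forget the size; the JVM frees the object later, via GC
  def forget(estimatedSize: Long): Unit = synchronized {
    used = math.max(0L, used - estimatedSize)
  }
}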

We know that JVM objects can be stored in serialized form. Serialization converts an object into a binary byte stream; in essence, it can be understood as converting chained storage over non-contiguous space into contiguous space or block storage. When the object is accessed, the reverse process, deserialization, is required to convert the byte stream back into an object. Serialization saves storage space, but increases the computational overhead of storing and reading.

For serialized objects in Spark, since they exist in the form of byte streams, the memory they occupy can be calculated directly. For non-serialized objects, the occupied memory is approximated by periodic sampling; that is, the occupied size is not recalculated every time a new data item is added. This approach reduces time overhead but can carry a large error, so the actual memory at a given moment may far exceed expectations. In addition, object instances that Spark has marked as released may very well not have been reclaimed by the JVM yet, so the memory actually available can be smaller than the available memory Spark has recorded. Spark therefore cannot precisely record the actually available on-heap memory, and thus cannot completely avoid out-of-memory (OOM) exceptions.
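Spark exposes the sampling-based estimator it uses for non-serialized objects as the developer API org.apache.spark.util.SizeEstimator. A small comparison (the printed values will vary by JVM and Spark version):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.util.SizeEstimator

val data = (1 to 1000).map(i => s"record-$i").toArray

// non-serialized: the on-heap footprint can only be estimated (sampling, approximate)
val estimated = SizeEstimator.estimate(data)

// serialized: the footprint is exactly the length of the byte stream
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(data)
oos.close()

println(s"estimated deserialized size: $estimated bytes, exact serialized size: ${bos.size()} bytes")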

Although it cannot precisely control the request and release of on-heap memory, Spark can, through independent planning and management of storage memory and execution memory, decide whether to cache a new RDD in storage memory and whether to allocate execution memory for a new task. To a certain extent this improves memory utilization and reduces the occurrence of exceptions.

1.2 Off-heap memory

In order to further optimize memory usage and improve the efficiency of sorting during Shuffle, Spark introduces off-heap (Off-heap) memory, which allows it to open up space directly in the worker node's system memory to store serialized binary data.

Off-heap memory means that memory objects are allocated outside the heap of the Java virtual machine, in memory managed directly by the operating system rather than the JVM. One result of this is that a smaller heap can be maintained, reducing the impact of garbage collection on the application.

Using the JDK Unsafe API (starting with Spark 2.0, the management of off-heap storage memory is no longer based on Tachyon but, like off-heap execution memory, is implemented on top of the JDK Unsafe API), Spark can operate on off-heap system memory directly, which avoids unnecessary memory overhead as well as frequent GC scans and collections, improving processing performance. Off-heap memory can be requested and released precisely (precisely because its allocation no longer goes through the JVM mechanism but is requested directly from the operating system; the JVM cannot pin down the exact point in time at which memory will be cleaned up, so it cannot achieve precise release), and the space occupied by serialized data can be calculated exactly. Compared with on-heap memory, this lowers both the difficulty of management and the error.

Off-heap memory is not enabled by default. It can be enabled through the spark.memory.offHeap.enabled parameter, and the size of the off-heap space is set by the spark.memory.offHeap.size parameter. Except that there is no "other" space, off-heap memory is divided in the same way as on-heap memory, and all concurrently running tasks share its storage memory and execution memory.
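For example (the 2g value is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true") // off by default
  .set("spark.memory.offHeap.size", "2g")      // shared by off-heap storage and execution memory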

2. Memory space allocation

2.1 Static memory management

Under the static memory management originally adopted by Spark, the sizes of storage memory, execution memory, and other memory are all fixed while the Spark application runs, but the user can configure them before the application starts.

The size of the available on-heap memory under this mode is calculated as in Listing 1-1:
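In sketch form, using the static-mode parameter names (the defaults noted here are the values commonly cited for Spark 1.x and should be checked against your version):

// Listing 1-1 (sketch): usable memory under static management
availableStorageMemory   = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction   // defaults: 0.6 * 0.9
availableExecutionMemory = systemMaxMemory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction   // defaults: 0.2 * 0.8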
Here systemMaxMemory depends on the size of the current JVM heap, and the final available execution or storage memory is obtained by multiplying it by the respective memoryFraction and safetyFraction parameters. The significance of the two safetyFraction parameters in the formulas above is to logically reserve an insurance area of 1 - safetyFraction, reducing the risk of OOM when actual memory use exceeds the currently preset range (as mentioned above, sampling-based estimation for non-serialized objects introduces error). It is worth noting that this reserved insurance area is only a logical plan: Spark does not treat it specially in actual use, and it is handed to the JVM to manage just like "other" memory.

Both storage memory and execution memory have reserved space in order to guard against OOM: since Spark's accounting of on-heap memory size is inexact, an insurance area needs to be set aside.

The space allocation outside the heap is comparatively simple: there are only storage memory and execution memory, and the split between them is determined directly by the spark.memory.storageFraction parameter. Since the space occupied by off-heap memory can be calculated exactly, there is no need to set aside an insurance area.
The static memory management mechanism is relatively simple to implement, but if the user is unfamiliar with Spark's storage mechanism, or does not configure it according to the specific data scale and computing tasks, it is easy to end up in a "half seawater, half flame" situation: one of storage memory and execution memory has a large amount of space left over while the other fills up early and has to evict or drop old content to store new content. Because a new memory management mechanism has since been introduced, few developers use this mode now; Spark keeps its implementation for compatibility with applications built on older versions.

2.2 Unified memory management

The unified memory management mechanism introduced in Spark 1.6 differs from static memory management in that storage memory and execution memory share the same space and can dynamically occupy each other's free area; both on-heap and off-heap memory are divided this way. The most important optimization is the dynamic occupancy mechanism, whose rules are as follows:

(1) A basic region is set for storage memory and execution memory (the spark.memory.storageFraction parameter), which determines the range of space each side owns;

(2) When both sides have insufficient space, data is spilled to disk; if one side has insufficient space while the other has free space, it can borrow the other side's space ("insufficient space" here means not enough to hold a complete Block);

(3) After execution memory has been occupied by the other side, it can make the occupier flush the occupied portion to disk and then "return" the borrowed space;

(4) After storage memory has been occupied by the other side, it cannot make it "return" the space, because too many factors in the Shuffle process would need to be considered, making the implementation complicated.
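A minimal sketch of these four rules, assuming a single memory pool (Spark's actual UnifiedMemoryManager additionally separates on-heap and off-heap pools and evicts real cached blocks):

// sketch only: models the borrowing rules, not Spark's real implementation
class UnifiedSketch(total: Long, storageFraction: Double) {
  private val storageRegion = (total * storageFraction).toLong // rule (1): a soft boundary
  private var storageUsed = 0L
  private var executionUsed = 0L
  private def free: Long = total - storageUsed - executionUsed

  // rules (2) and (3): execution uses any free space, and may additionally evict
  // cached blocks that storage has borrowed beyond its own region
  def acquireExecution(bytes: Long): Boolean = {
    val borrowedByStorage = math.max(0L, storageUsed - storageRegion)
    val toEvict = math.min(borrowedByStorage, math.max(0L, bytes - free))
    storageUsed -= toEvict // evicted blocks are dropped to disk or deleted
    if (free >= bytes) { executionUsed += bytes; true } else false
  }

  // rules (2) and (4): storage may use any free space, but can never force
  // execution to hand borrowed memory back
  def acquireStorage(bytes: Long): Boolean = {
    if (free >= bytes) { storageUsed += bytes; true } else false
  }
}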

With the unified memory management mechanism, Spark improves the utilization of on-heap and off-heap memory resources to a certain extent and lowers the difficulty of maintaining Spark memory for developers, but that does not mean developers can sit back and relax. If the storage memory space is too large, or too much data is cached, it can instead lead to frequent full garbage collection and degrade performance during task execution, because cached RDD data usually stays resident in memory for a long time. Therefore, to bring out Spark's full performance, developers need a further understanding of how storage memory and execution memory are managed and implemented.

3. Storage memory management

3.1 The persistence mechanism of RDD

As Spark's most fundamental data abstraction, a Resilient Distributed Dataset (RDD) is a read-only collection of partitioned records (Partition). An RDD can only be created from a dataset in stable physical storage, or by performing a transformation (Transformation) on another existing RDD. The dependency between a transformed RDD and its original RDD constitutes lineage. By virtue of lineage, Spark guarantees that every RDD can be recovered. However, all RDD transformations are lazy: only when an action that returns results to the Driver occurs does Spark create tasks to read the RDD and actually trigger the execution of the transformations.

When a Task reads a partition at startup, it first checks whether the partition has been persisted; if not, it needs to check the Checkpoint or recompute according to the lineage. So if multiple actions are to be performed on one RDD, you can use the persist or cache method in the first action to persist or cache the RDD in memory or on disk, thereby speeding up computation in subsequent actions.

In fact, the cache method persists the RDD to memory using the default MEMORY_ONLY storage level, so caching is a special kind of persistence. The design of on-heap and off-heap storage memory makes it possible to plan and manage the memory used when caching RDDs in a unified way.
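For example (assuming an existing SparkContext named sc; the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs://input/path").flatMap(_.split(" ")) // placeholder path

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
words.cache()
// alternatively, pick a level explicitly before the first action:
// words.persist(StorageLevel.MEMORY_AND_DISK_SER)

words.count() // first action: computes the RDD and caches its blocks
words.first() // later actions read the cached blocks instead of recomputing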

Spark's Storage module is responsible for persisting RDDs and decouples RDDs from physical storage. The Storage module manages the data Spark produces during computation, encapsulating the functionality for accessing data in memory or on disk, locally or remotely. In the concrete implementation, the Storage modules on the Driver side and the Executor side form a master-slave architecture: the BlockManager on the Driver side is the Master, and the BlockManagers on the Executor side are the Slaves.

Logically, the Storage module uses the Block as its basic storage unit; each Partition of an RDD uniquely corresponds to a Block after processing (the BlockId format is rdd_RDD-ID_PARTITION-ID). The Master on the Driver side is responsible for managing and maintaining Block metadata for the entire Spark application, while the Slaves on the Executor side report Block update status to the Master and receive commands from it, such as adding or deleting an RDD.
When persisting RDDs, Spark provides seven different storage levels such as MEMORY_ONLY and MEMORY_AND_DISK, and a storage level is a combination of the following five variables:

class StorageLevel private(
  private var _useDisk: Boolean,      // use disk
  private var _useMemory: Boolean,    // this actually means on-heap memory
  private var _useOffHeap: Boolean,   // use off-heap memory
  private var _deserialized: Boolean, // whether stored in non-serialized form
  private var _replication: Int = 1   // number of replicas
)

The seven storage levels in Spark are as follows:

DISK_ONLY           : _useDisk = true,  _useMemory = false, _useOffHeap = false, _deserialized = false
MEMORY_ONLY         : _useDisk = false, _useMemory = true,  _useOffHeap = false, _deserialized = true
MEMORY_ONLY_SER     : _useDisk = false, _useMemory = true,  _useOffHeap = false, _deserialized = false
MEMORY_AND_DISK     : _useDisk = true,  _useMemory = true,  _useOffHeap = false, _deserialized = true
MEMORY_AND_DISK_SER : _useDisk = true,  _useMemory = true,  _useOffHeap = false, _deserialized = false
OFF_HEAP            : _useOffHeap = true (stored only in off-heap memory)
_2 variants (e.g. MEMORY_ONLY_2, DISK_ONLY_2): same flags as the base level, with _replication = 2
From an analysis of this data structure, it can be seen that a storage level defines the storage mode of an RDD Partition (which corresponds to a Block) from three dimensions:

(1) Storage location: disk, on-heap memory, or off-heap memory. For example, MEMORY_AND_DISK uses both disk and on-heap memory, while OFF_HEAP stores data only in off-heap memory; currently, off-heap storage cannot be combined with other locations at the same time.

(2) Storage form: whether the Block is stored in non-serialized form after being cached into storage memory. For example, MEMORY_ONLY stores data in non-serialized form, while OFF_HEAP stores it in serialized form.

(3) Number of replicas: when greater than 1, a remote redundant backup on another node is required. For example, DISK_ONLY_2 needs one remote backup replica.

3.2 RDD caching process

Before an RDD is cached into storage memory, the data in a Partition is generally accessed through an iterator (Iterator), the standard way of traversing data collections in Scala. Through the Iterator, each serialized or non-serialized data item (Record) in the partition can be obtained. The object instances of these Records logically occupy the "other" part of the JVM heap, and the storage space of different Records in the same Partition is not contiguous.

After the RDD is cached into storage memory, the Partition is converted into a Block, and its Records occupy a contiguous space in on-heap or off-heap storage memory. Spark calls this process of converting a Partition from non-contiguous storage space into contiguous storage space "unrolling" (Unroll).

A Block has two storage formats, serialized and non-serialized, depending on the storage level of the RDD. A non-serialized Block is defined by the DeserializedMemoryEntry data structure, which uses an array to store all the object instances; a serialized Block is defined by the SerializedMemoryEntry data structure, which uses a byte buffer (ByteBuffer) to store the binary data. The Storage module of each Executor uses a linked map structure (LinkedHashMap) to manage the instances of all Block objects in on-heap and off-heap storage memory; additions to and removals from this LinkedHashMap indirectly record the requesting and releasing of memory.
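In simplified form, the two entry types and the bookkeeping map look roughly like this (the names follow Spark's MemoryStore, but fields are reduced, and the real map is keyed by BlockId rather than String):

import java.nio.ByteBuffer
import scala.reflect.ClassTag

sealed trait MemoryEntry[T] { def size: Long }

// non-serialized Block: an array of object instances
case class DeserializedMemoryEntry[T](value: Array[T], size: Long, classTag: ClassTag[T])
  extends MemoryEntry[T]

// serialized Block: a byte buffer of binary data
case class SerializedMemoryEntry[T](buffer: ByteBuffer, size: Long, classTag: ClassTag[T])
  extends MemoryEntry[T]

// accessOrder = true gives least-recently-used iteration order,
// which the eviction walk in section 3.3 relies on
val entries = new java.util.LinkedHashMap[String, MemoryEntry[_]](32, 0.75f, true)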

Because there is no guarantee that the storage space can hold all the data in the Iterator at once, the current computing task must request enough Unroll space from the MemoryManager when unrolling, to occupy that space temporarily. If the space is insufficient, the Unroll fails; if enough space can be requested, it proceeds.

For a serialized Partition, the required Unroll space can be computed directly by accumulation and requested once.

For a non-serialized Partition, space must be requested successively while the Records are traversed: each time a Record is read, the Unroll space it requires is estimated by sampling and requested. When the space is insufficient, the process can be interrupted and the occupied Unroll space released.

If the Unroll finally succeeds, the Unroll space occupied by the current Partition is converted into normal storage space for the cached RDD.
In static memory management, Spark carves out a dedicated, fixed-size Unroll space within storage memory. Unified memory management does not treat Unroll space specially; when storage space is insufficient, it is handled according to the dynamic occupancy mechanism.
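A sketch of the unroll loop for a non-serialized partition, assuming a reserve callback that requests Unroll space from the memory manager and a crude estimateSize (Spark's real logic lives in MemoryStore.putIteratorAsValues):

import scala.collection.mutable.ArrayBuffer

def estimateSize(buf: Seq[_]): Long = buf.length * 64L // crude placeholder estimate

// returns Some(records) on success, None if Unroll space ran out midway
def unroll[T](records: Iterator[T], reserve: Long => Boolean): Option[Seq[T]] = {
  val buffer = ArrayBuffer.empty[T]
  var reserved = 0L
  var count = 0L
  val checkPeriod = 16 // estimate sizes periodically, not on every Record
  while (records.hasNext) {
    buffer += records.next()
    count += 1
    if (count % checkPeriod == 0) {
      val estimated = estimateSize(buffer)
      if (estimated > reserved && !reserve(estimated - reserved)) {
        return None // insufficient space: the caller releases the occupied Unroll space
      }
      reserved = math.max(reserved, estimated)
    }
  }
  Some(buffer.toSeq) // success: the Unroll space becomes normal storage space
}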

3.3 Eviction and dropping to disk

Since all the computing tasks of one Executor share a limited storage memory space, when a new Block needs to be cached but the remaining space is insufficient and cannot be dynamically occupied, old Blocks in the LinkedHashMap must be evicted (Eviction). If an evicted Block's storage level also includes the requirement to store to disk, it is dropped to disk (Drop); otherwise the Block is deleted directly.

The eviction rules for storage memory are:

(1) The old Block to be evicted must have the same MemoryMode as the new Block, i.e., both belong to off-heap or both to on-heap memory;
(2) the old and new Blocks cannot belong to the same RDD, to avoid cyclic eviction;
(3) the RDD the old Block belongs to cannot be in the reading state, to avoid consistency problems;
(4) the Blocks in the LinkedHashMap are traversed and evicted in least-recently-used (LRU) order until the space required by the new Block is satisfied; the LRU ordering comes from LinkedHashMap itself.

The process of dropping to disk is relatively simple: if the storage level satisfies the condition that _useDisk is true, then _deserialized determines whether the data is in non-serialized form; if it is, it is serialized; finally, the data is stored to disk and its information in the Storage module is updated.
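A sketch of the LRU eviction walk, with rules (1) to (3) reduced to caller-supplied predicates (the real version is MemoryStore.evictBlocksToFreeSpace, which also takes block locks and performs the actual drop to disk):

// walk Blocks in LRU order and evict until `space` bytes are freed (sketch)
def evictToFreeSpace(
    space: Long,
    entries: java.util.LinkedHashMap[String, Long], // blockId -> size, in LRU order
    sameMemoryMode: String => Boolean,              // rule (1)
    belongsToNewRdd: String => Boolean,             // rule (2)
    beingRead: String => Boolean                    // rule (3)
): Long = {
  var freed = 0L
  val it = entries.entrySet().iterator()
  while (freed < space && it.hasNext) {
    val entry = it.next()
    val id = entry.getKey
    if (sameMemoryMode(id) && !belongsToNewRdd(id) && !beingRead(id)) {
      freed += entry.getValue
      it.remove() // rule (4); drop to disk first if the Block's level includes _useDisk
    }
  }
  freed
}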

4. Execution memory management

Execution memory is mainly used to store the memory tasks occupy while executing Shuffle. Shuffle is the process of repartitioning RDD data according to certain rules, and both the Write and Read stages of Shuffle use execution memory:

(1) Shuffle Write

On the map side, ExternalSorter is used for external sorting, and storing data in memory mainly occupies on-heap execution space.

(2) Shuffle Read

When aggregating data on the reduce side, the data is handed to an Aggregator for processing, and storing data in memory occupies on-heap execution space.

If the final result needs to be sorted, the data is handed to ExternalSorter again for processing, again occupying on-heap execution space.

In ExternalSorter and Aggregator, Spark uses a hash table called AppendOnlyMap to store data in on-heap execution memory. During Shuffle, however, not all data can be kept in this hash table: the memory it occupies is estimated by periodic sampling, and when it grows so large that no more execution memory can be obtained from the MemoryManager, Spark stores its entire contents into a disk file. This process is called spilling (Spill), and the files spilled to disk are finally merged (Merge).
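A sketch of the spill decision, modeled on (but much reduced from) the maybeSpill check in Spark's Spillable trait; acquire stands in for requesting execution memory from the MemoryManager:

// decide whether the in-memory map must be spilled to disk (sketch)
def maybeSpill(estimatedMapSize: Long, threshold: Long, acquire: Long => Long): Boolean = {
  if (estimatedMapSize >= threshold) {
    // try to raise the reservation to twice the current estimated size
    val granted = acquire(2 * estimatedMapSize - threshold)
    if (threshold + granted < estimatedMapSize) {
      return true // could not get enough execution memory: spill, merge later
    }
  }
  false
}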

Spark's storage memory and execution memory are managed in completely different ways:

(1) For storage memory, Spark uses a LinkedHashMap to centrally manage all Blocks, which are converted from the Partitions of the RDDs that need to be cached;

(2) For execution memory, Spark uses AppendOnlyMap to store the data in the Shuffle process, and in Tungsten sorting this is even abstracted into page-based memory management, opening up a whole new JVM memory management mechanism.
