Detailed explanation of Spark memory management

This article's directory:

  1. Spark Shuffle evolution history

  2. On-heap and off-heap memory planning

  3. Memory space allocation

  4. Storage memory management

  5. Execution memory management


Foreword

As a memory-based distributed computing engine, Spark's memory management module plays a very important role in the whole system. Understanding the fundamentals of Spark memory management helps you develop Spark applications and tune their performance better. The purpose of this article is to sort out the overall picture of Spark memory management, introduce some of the underlying ideas, and invite readers to discuss the topic in more depth. The principles described in this article are based on Spark 2.1. To read this article, you need some foundation in Spark and Java, and an understanding of concepts such as RDD, Shuffle and the JVM.

When a Spark application executes, the Spark cluster starts two kinds of JVM processes, the Driver and the Executors. The former is the master process, responsible for creating the Spark context, submitting Spark jobs (Job), converting jobs into computing tasks (Task), and coordinating the scheduling of tasks among the Executors. The latter are responsible for executing the specific computing tasks on the worker nodes, returning results to the Driver, and providing storage for RDDs that need to be persisted. Since the Driver's memory management is relatively simple, this article mainly analyzes the Executor's memory management; "Spark memory" below refers to the Executor's memory.

1. Spark Shuffle evolution history

In the MapReduce framework, shuffle is the bridge between Map and Reduce: the output of Map must go through shuffle before it can be used by Reduce. The performance of shuffle directly affects the performance and throughput of the whole program. Spark, as an implementation of the MapReduce model, naturally also implements shuffle logic.

Shuffle is a specific phase in the MapReduce framework, between the Map phase and the Reduce phase. When the output of Map is to be used by Reduce, the output must be partitioned by key (by default, hashed) and distributed to each Reducer; this process is the shuffle. Since shuffle involves disk reads and writes and network transmission, its performance directly affects the running efficiency of the whole program.

(Figure: the overall flow of the MapReduce model; the shuffle phase sits between the Map phase and the Reduce phase.)

Conceptually, shuffle is the bridge that connects data across stages, so how is shuffle (partitioning) actually implemented? Let's take Spark as an example and look at the implementation of shuffle in Spark.

Briefly, the whole process of shuffle in Spark works as follows:

  • First, each Mapper creates buckets according to the number of Reducers. The number of buckets is M×R, where M is the number of Mappers and R is the number of Reducers.

  • Secondly, the results produced by a Mapper are filled into the buckets according to the configured partition algorithm. The partition algorithm can be customized; the default is to hash records to different buckets by key.

  • When a Reducer starts, it fetches the corresponding buckets from remote or local block managers, according to its own task id and the ids of the Mappers it depends on, as its input.

The bucket here is an abstract concept. In the implementation, each bucket can correspond to a file, to a part of a file, or to something else.
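
To make the bucket idea concrete, here is a minimal sketch of hash partitioning, modeled on the behavior of Spark's default HashPartitioner (the class name and values here are illustrative, not Spark's actual implementation):

// Minimal sketch of hash partitioning: a record's key decides which
// Reducer's bucket it lands in.
class SimpleHashPartitioner(numReducers: Int) {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      // Keep the result non-negative even when hashCode is negative
      val mod = key.hashCode % numReducers
      if (mod < 0) mod + numReducers else mod
  }
}

// Example: with 3 Reducers, every record with the same key lands in the same bucket
val partitioner = new SimpleHashPartitioner(3)
println(partitioner.getPartition("user-42"))  // some bucket index in [0, 3)

Records with the same key always hash to the same bucket, which is exactly what lets a Reducer fetch "its" data from every Mapper.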

The shuffle process of Apache Spark is similar to that of Apache Hadoop, and some concepts carry over directly. For example, in the shuffle process, the side that provides data is called the Map side, and each task that generates data on the Map side is called a Mapper. Correspondingly, the side that receives data is called the Reduce side, and each task that pulls data on the Reduce side is called a Reducer. In essence, the shuffle process divides the data obtained on the Map side with a partitioner and sends the data to the corresponding Reducer.

2. On-heap and off-heap memory planning

As a JVM process, an Executor's memory management is built on the JVM's memory management: Spark allocates the JVM's on-heap space in finer detail to make full use of memory. At the same time, Spark introduces off-heap memory, so that it can allocate space directly in the worker node's system memory, further optimizing memory usage.

(Figure: schematic diagram of on-heap and off-heap memory.)

2.1 On-heap memory

The size of the on-heap memory is configured by the --executor-memory parameter or the spark.executor.memory configuration item when the Spark application is started. The concurrent tasks running in an Executor share the JVM heap. The memory these tasks occupy when caching RDD data and broadcast data is planned as Storage memory, the memory they occupy when executing Shuffle is planned as Execution memory, and the remaining part is not specially planned: object instances inside Spark itself, and object instances in the user's Spark application code, all occupy the remaining space. Under different management modes, the sizes of these three parts differ (see Section 3 below).
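
For example, the on-heap size can be set in code through SparkConf; both configuration keys below are real Spark parameters, while the application name and the 4g value are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-demo")            // hypothetical application name
  .set("spark.executor.memory", "4g")   // on-heap size per Executor,
                                        // equivalent to --executor-memory 4g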

Spark's management of on-heap memory is a logical, "planning"-style management, because the allocation and release of the memory occupied by object instances are done by the JVM; Spark can only record that memory after allocation and before release. Let's look at the specific process (a minimal bookkeeping sketch follows the list):

  • Requesting memory:

  1. Spark creates an object instance with new in its code

  2. The JVM allocates space from the on-heap memory, creates the object and returns the object reference

  3. Spark saves the object reference and records the memory occupied by the object

  • Freeing memory:

  1. Spark records the memory freed by the object and deletes the object reference

  2. Wait for the JVM's garbage collection mechanism to actually release the heap memory occupied by the object
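
The bookkeeping sketch promised above: a toy model (not Spark code) of this "logical planning" idea, in which Spark merely maintains a counter while the JVM does the real allocation and collection:

class LogicalMemoryAccounting(val maxBytes: Long) {
  private var usedBytes: Long = 0L

  // Called after the JVM has already allocated the object
  def recordAcquire(bytes: Long): Boolean =
    if (usedBytes + bytes <= maxBytes) { usedBytes += bytes; true }
    else false  // Spark would decline to cache, or spill, instead

  // Called when Spark drops the reference; the JVM frees the memory later, on GC
  def recordRelease(bytes: Long): Unit =
    usedBytes = math.max(0L, usedBytes - bytes)
}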

We know that JVM objects can be stored in serialized form. Serialization converts an object into a binary byte stream; in essence, it can be understood as converting chained storage over non-contiguous space into contiguous or block storage. On access, the reverse process, deserialization, converts the byte stream back into an object. Serialization saves storage space, but adds computation overhead at store and read time.

For serialized objects in Spark, since they are byte streams, the memory they occupy can be computed directly. For non-serialized objects, the occupied memory is estimated by periodic sampling: the occupied size is not recomputed every time a new data item is added. This reduces time overhead but can carry a large error, so the actual memory at a given moment may far exceed expectations. In addition, object instances that Spark has marked as released may well not actually have been reclaimed by the JVM yet, making the real available memory smaller than the available memory Spark records. Spark therefore cannot precisely track the actually available on-heap memory, and cannot completely avoid out-of-memory (OOM) exceptions.

Although it cannot precisely control the allocation and release of on-heap memory, Spark, through its independent planning and management of storage memory and execution memory, can decide whether to cache new RDDs in storage memory and whether to allocate execution memory to new tasks. To a certain extent, this improves memory utilization and reduces the occurrence of exceptions.

2.2 Off-heap memory

To further optimize memory usage and improve the efficiency of sorting during Shuffle, Spark introduces off-heap memory, which allows it to allocate space directly in the worker node's system memory to store serialized binary data. Using the JDK Unsafe API (starting from Spark 2.0, off-heap storage memory is no longer based on Tachyon but, like off-heap execution memory, is implemented on the JDK Unsafe API), Spark can operate on off-heap memory directly, reducing unnecessary memory overhead as well as frequent GC scans and collections, and improving processing performance. Off-heap memory can be allocated and released precisely, and the space occupied by serialized data can be computed exactly, so compared with on-heap memory it is easier to manage and has smaller error.

Off-heap memory is disabled by default. It can be enabled with the spark.memory.offHeap.enabled parameter, and the size of the off-heap space is set with the spark.memory.offHeap.size parameter. Apart from having no "other" space, off-heap memory is divided in the same way as on-heap memory, and all running concurrent tasks share the storage memory and execution memory.
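
For example (the 2g value is illustrative; both keys are the real parameters named above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")  // off by default
  .set("spark.memory.offHeap.size", "2g")       // off-heap space per Executor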

2.3 Memory Management Interface

Spark provides a unified interface for managing storage memory and execution memory: MemoryManager. Tasks in the same Executor call the methods of this interface to request or release memory:

Listing 1: Main methods of the memory management interface

// 1. Request storage memory
def acquireStorageMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean

// 2. Request unroll memory
def acquireUnrollMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean

// 3. Request execution memory
def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Long

// 4. Release storage memory
def releaseStorageMemory(numBytes: Long, memoryMode: MemoryMode): Unit

// 5. Release execution memory
def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Unit

// 6. Release unroll memory
def releaseUnrollMemory(numBytes: Long, memoryMode: MemoryMode): Unit

Notice that when calling these methods you must specify the memory mode (MemoryMode); this parameter determines whether the operation is performed on-heap or off-heap.
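
A toy stand-in for this interface can show the calling pattern, in particular why every call carries a MemoryMode. This is only a sketch of the contract's shape, not Spark's implementation:

// Toy model: each memory mode has its own pool and its own usage counter.
object MemoryMode extends Enumeration { val ON_HEAP, OFF_HEAP = Value }

class ToyMemoryManager(maxOnHeap: Long, maxOffHeap: Long) {
  private val used = scala.collection.mutable.Map(
    MemoryMode.ON_HEAP -> 0L, MemoryMode.OFF_HEAP -> 0L)

  private def max(mode: MemoryMode.Value): Long =
    if (mode == MemoryMode.ON_HEAP) maxOnHeap else maxOffHeap

  def acquireStorageMemory(blockId: String, numBytes: Long,
                           mode: MemoryMode.Value): Boolean =
    if (used(mode) + numBytes <= max(mode)) { used(mode) += numBytes; true }
    else false

  def releaseStorageMemory(numBytes: Long, mode: MemoryMode.Value): Unit =
    used(mode) = math.max(0L, used(mode) - numBytes)
}

val mm = new ToyMemoryManager(maxOnHeap = 512L << 20, maxOffHeap = 256L << 20)
val granted = mm.acquireStorageMemory("rdd_0_1", 4L << 20, MemoryMode.ON_HEAP)
if (granted) mm.releaseStorageMemory(4L << 20, MemoryMode.ON_HEAP)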

As for the concrete implementation of MemoryManager: starting with Spark 1.6 the default is unified memory management (Unified Memory Manager), while the static memory management (Static Memory Manager) used before 1.6 is retained and can be enabled via the spark.memory.useLegacyMode parameter. The two differ in how space is allocated; Section 3 below introduces each of them.

3. Memory space allocation

3.1 Static Memory Management

Under the static memory management mechanism that Spark originally adopted, the sizes of storage memory, execution memory and other memory are fixed while the Spark application runs, but users can configure them before the application starts. The on-heap allocation is shown in the figure below:

(Figure: static memory management, on-heap.)

As shown, the available on-heap memory sizes are calculated as follows:

Available storage memory = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction

Available execution memory = systemMaxMemory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction

Here systemMaxMemory is the size of the current JVM heap, and the final available execution or storage memory is obtained by multiplying it by the respective memoryFraction and safetyFraction parameters. The significance of the two safetyFraction parameters is to logically set aside a 1-safetyFraction insurance area, reducing the risk of OOM when actual memory use exceeds the preset range (as mentioned above, the sampled size estimation of non-serialized objects carries error). Note that this insurance area is only a logical plan: Spark does not treat it differently when memory is used; like "other memory", it is left to the JVM to manage.
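
A worked example, using the legacy defaults (spark.storage.memoryFraction = 0.6 with safetyFraction = 0.9; spark.shuffle.memoryFraction = 0.2 with safetyFraction = 0.8) and assuming an 8 GB heap:

// Worked example with the Spark 1.x legacy defaults
val systemMaxMemory = 8L * 1024 * 1024 * 1024                  // assume an 8 GB heap

val availableStorage   = (systemMaxMemory * 0.6 * 0.9).toLong  // ≈ 4.32 GB
val availableExecution = (systemMaxMemory * 0.2 * 0.8).toLong  // ≈ 1.28 GB
// The remaining ~2.4 GB = "other" memory plus the two logical insurance areas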

The off-heap space allocation is simpler: there are only storage memory and execution memory, as shown in the figure below. The sizes of the available execution memory and storage memory are determined directly by the spark.memory.storageFraction parameter. Since the space occupied by off-heap memory can be computed exactly, there is no need for an insurance area.

(Figure: static memory management, off-heap.)

The static memory management mechanism is relatively simple to implement, but if users are unfamiliar with Spark's storage mechanisms, or do not configure it according to the actual data scale and computing tasks, it easily produces a "half sea water, half flames" situation: one of storage memory and execution memory has plenty of space left, while the other fills up early and has to evict or spill old content to store new content. Because of the newer memory management mechanism, this mode is now rarely used by developers; Spark keeps its implementation for compatibility with older application versions.

3.2 Unified Memory Management

The unified memory management mechanism introduced in Spark 1.6 differs from static memory management in that storage memory and execution memory share one space and can dynamically occupy each other's free area, as shown in the following two figures.

(Figure: unified memory management, on-heap.)

(Figure: unified memory management, off-heap.)

The most important optimization is the dynamic occupancy mechanism, whose rules are as follows (a simplified sketch follows the list):

  • Set a basic storage memory and execution memory region (the spark.memory.storageFraction parameter), which determines the range of space each side owns

  • When neither side's space is sufficient, data is spilled to disk; if one side's space is insufficient while the other side has free space, it can borrow the other side's space ("insufficient storage space" means not enough to hold a complete Block)

  • After execution memory's space has been occupied by the storage side, the storage side can be made to spill the occupied part to disk and "return" the borrowed space

  • After storage memory's space has been occupied by the execution side, it cannot make the other side "return" it, because many factors in the Shuffle process would have to be considered and the implementation would be complicated
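
The simplified sketch promised above: a toy model of the dynamic occupancy rules (not Spark's real UnifiedMemoryManager). Execution can reclaim space it lent to storage by forcing evictions, but storage cannot force execution to give space back:

class ToyUnifiedPool(total: Long, storageFraction: Double) {
  private val storageRegion = (total * storageFraction).toLong  // storage's base region
  private var storageUsed = 0L
  private var executionUsed = 0L

  def acquireExecution(bytes: Long): Long = {
    var free = total - storageUsed - executionUsed
    if (free < bytes && storageUsed > storageRegion) {
      // Storage has borrowed beyond its base region: evict it back ("return" the loan)
      val reclaimed = math.min(bytes - free, storageUsed - storageRegion)
      storageUsed -= reclaimed   // stands in for evicting/spilling cached Blocks
      free += reclaimed
    }
    val granted = math.min(bytes, math.max(0L, free))
    executionUsed += granted
    granted
  }

  def acquireStorage(bytes: Long): Boolean = {
    // Storage may borrow free execution space, but may not force execution out
    val free = total - storageUsed - executionUsed
    if (free >= bytes) { storageUsed += bytes; true } else false
  }
}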

(Figure: the dynamic occupancy mechanism.)

With the unified memory management mechanism, Spark raises the utilization of on-heap and off-heap memory resources to a certain extent and lowers the burden on developers of maintaining Spark memory, but that does not mean developers can sit back and relax. For instance, if the storage memory space is too large, or too much data is cached, it instead leads to frequent full garbage collections and reduces task execution performance, because cached RDD data usually resides in memory for a long time. To get the most performance out of Spark, developers therefore need to further understand how storage memory and execution memory are managed and implemented.

4. Storage memory management

4.1 Persistence Mechanism of RDD

The Resilient Distributed Dataset (RDD), Spark's most fundamental data abstraction, is a read-only collection of partitioned records (Partition). An RDD can only be created from a dataset in stable physical storage, or generated by applying a transformation (Transformation) to another existing RDD. The dependencies between a transformed RDD and its originals form the lineage. Through lineage, Spark guarantees that every RDD can be recovered. However, all RDD transformations are lazy: only when an action (Action) that returns a result to the Driver occurs does Spark create tasks to read the RDD and actually trigger the execution of the transformations.

When a Task starts reading a partition, it first checks whether the partition has been persisted; if not, it checks the Checkpoint or recomputes the partition according to the lineage. So if multiple actions will be performed on one RDD, you can use the persist or cache method in the first action to persist or cache the RDD in memory or on disk, improving the computation speed of subsequent actions. In fact, the cache method persists the RDD to memory with the default MEMORY_ONLY storage level, so caching is a special kind of persistence. The design of on-heap and off-heap storage memory allows unified planning and management of the memory used to cache RDDs (other uses of storage memory, such as caching broadcast data, are outside the scope of this article).
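
In application code this is just the standard RDD API (assuming an existing SparkContext sc; the input path is a placeholder):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://namenode:8020/path/to/input")  // placeholder path
lines.persist(StorageLevel.MEMORY_ONLY)   // same effect as lines.cache()

val total = lines.count()   // first action: computes the RDD and caches it
val first = lines.first()   // second action: reads the cached blocks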

RDD persistence is handled by Spark's Storage module, which decouples RDDs from physical storage. The Storage module manages the data Spark produces during computation and encapsulates the functionality of accessing data in memory or on disk, locally or remotely. Concretely, the Storage modules on the Driver side and the Executor side form a master-slave architecture: the BlockManager on the Driver side is the Master, and the BlockManagers on the Executor side are the Slaves. The Storage module logically uses the Block as its basic storage unit; each Partition of an RDD uniquely corresponds to a Block after processing (the BlockId format is rdd_RDD-ID_PARTITION-ID). The Master manages and maintains the metadata of all Blocks of the whole Spark application, while a Slave reports Block updates and other status to the Master and receives commands from it, such as adding or deleting an RDD.

(Figure: the Storage module's master-slave architecture.)

When persisting an RDD, Spark provides 7 different storage levels, such as MEMORY_ONLY and MEMORY_AND_DISK, and a storage level is a combination of the following 5 variables:

class StorageLevel private(
  private var _useDisk: Boolean,      // use disk
  private var _useMemory: Boolean,    // this actually means on-heap memory
  private var _useOffHeap: Boolean,   // use off-heap memory
  private var _deserialized: Boolean, // whether stored in non-serialized form
  private var _replication: Int = 1   // number of replicas
)

Analyzing this data structure shows that a storage level defines the storage of an RDD Partition (which also corresponds to a Block) along three dimensions (a concrete mapping of levels to flags follows the list):

  • Storage location: disk / on-heap memory / off-heap memory. For example, MEMORY_AND_DISK stores on disk and in on-heap memory at the same time, achieving redundant backup. OFF_HEAP stores only in off-heap memory; currently, when off-heap memory is selected, the data cannot be stored in any other location at the same time.

  • Storage form: whether the Block is in non-serialized form after being cached into storage memory. For example, MEMORY_ONLY stores in non-serialized form, while OFF_HEAP stores in serialized form.

  • Number of replicas: when greater than 1, remote redundant backup to other nodes is required. For example, DISK_ONLY_2 needs one remote backup replica.
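
Concretely, a few of the predefined levels map onto the five flags as follows (mirroring the definitions in Spark's StorageLevel companion object):

level              _useDisk   _useMemory   _useOffHeap   _deserialized   _replication
MEMORY_ONLY        false      true         false         true            1
MEMORY_AND_DISK    true       true         false         true            1
DISK_ONLY_2        true       false        false         false           2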

4.2 The process of RDD caching

Before an RDD is cached into storage memory, the data in a Partition is generally accessed through the iterator (Iterator) data structure, the standard way of traversing a data collection in Scala. Through the Iterator one can obtain each serialized or non-serialized data record (Record) in the partition. Logically, these Record object instances occupy the "other" part of the JVM heap, and the spaces of different Records in the same Partition are not contiguous.

After the RDD is cached into storage memory, the Partition is transformed into a Block, and the Records occupy a contiguous space in on-heap or off-heap storage memory. Spark calls this process of converting a Partition from non-contiguous storage space into contiguous storage space "Unroll". A Block has two storage formats, serialized and non-serialized, depending on the RDD's storage level. A non-serialized Block is defined by the DeserializedMemoryEntry data structure and uses an array to store all its object instances, while a serialized Block is defined by the SerializedMemoryEntry data structure and uses a byte buffer (ByteBuffer) to store the binary data. The Storage module of each Executor manages all the Block object instances in on-heap and off-heap storage memory with a linked map structure (LinkedHashMap); additions to and removals from this LinkedHashMap indirectly record the acquisition and release of memory.

Because there is no guarantee that the storage space can hold all of the Iterator's data at once, the current computing task must request enough Unroll space from the MemoryManager when unrolling, to occupy that space temporarily; if the space is insufficient, the Unroll fails, and the Unroll space already occupied can be released. For a serialized Partition, the required Unroll space can be computed directly and requested once. A non-serialized Partition has to make requests as it traverses the Records: the required Unroll space is periodically estimated by sampling and then requested. If the Unroll ultimately succeeds, the Unroll space occupied by the current Partition is converted into normal storage space for the cached RDD, as shown in the figure below.

(Figure: the Spark Unroll process.)

As seen in the static memory management section above, under static memory management Spark carves out a dedicated Unroll space of fixed size in storage memory. Under unified memory management there is no special Unroll region: when storage space is insufficient, it is handled by the dynamic occupancy mechanism.
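
A heavily reduced sketch of the unroll loop described above (modeled on the logic in Spark's MemoryStore, with the MemoryManager interaction abstracted into a reserve callback; the sample period and initial reservation are illustrative):

def tryUnroll[T](records: Iterator[T],
                 reserve: Long => Boolean,          // asks for more unroll memory
                 estimateSize: Vector[T] => Long): Option[Vector[T]] = {
  val samplePeriod = 16        // estimate the size periodically, not per record
  var reserved = 1L << 20      // initial reservation, 1 MB here
  if (!reserve(reserved)) return None

  var buf = Vector.empty[T]
  var n = 0
  while (records.hasNext) {
    buf = buf :+ records.next()
    n += 1
    if (n % samplePeriod == 0) {
      val needed = estimateSize(buf)                // sampled size estimate
      if (needed > reserved) {
        if (!reserve(needed - reserved)) return None  // Unroll fails
        reserved = needed
      }
    }
  }
  Some(buf)  // success: the Unroll space becomes normal storage space
}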

4.3 Eviction and dropping to disk

Since all computing tasks of the same Executor share a limited storage memory space, when a new Block needs to be cached but the remaining space is insufficient and cannot be dynamically occupied, old Blocks in the LinkedHashMap must be evicted (Eviction). If the evicted Block's storage level also includes storing to disk, it has to be dropped (Drop) to disk; otherwise the Block is deleted directly.

The eviction rules for storage memory are:

  • The evicted old Block must have the same MemoryMode as the new Block, that is, both belong to off-heap or both to on-heap memory

  • The old and new Blocks cannot belong to the same RDD, to avoid circular eviction

  • The RDD the old Block belongs to must not be in a being-read state, to avoid consistency problems

  • Traverse the Blocks in the LinkedHashMap and evict them in least-recently-used (LRU) order until the space required by the new Block is satisfied; the LRU order is a property of LinkedHashMap

Dropping to disk is relatively simple: if the storage level satisfies the condition that _useDisk is true, then judge by _deserialized whether the Block is in non-serialized form; if so, serialize it, and finally store the data to disk and update its information in the Storage module.
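
The LRU ordering mentioned above comes directly from constructing a LinkedHashMap in access order. A minimal sketch (the eviction constraints such as "not the same RDD" and "not being read" are omitted, and the disk-drop step appears only as a comment):

import java.util.{LinkedHashMap => JLinkedHashMap}

// accessOrder = true gives LRU iteration order: least recently used first
val blocks = new JLinkedHashMap[String, Long](16, 0.75f, true)
blocks.put("rdd_0_0", 100L)
blocks.put("rdd_0_1", 200L)
blocks.put("rdd_1_0", 50L)
blocks.get("rdd_0_0")          // access moves this Block to the most-recent end

// Evict LRU Blocks until `needed` bytes are freed
def evictToFree(needed: Long): Long = {
  var freed = 0L
  val it = blocks.entrySet().iterator()
  while (freed < needed && it.hasNext) {
    val entry = it.next()      // least recently used first
    freed += entry.getValue    // a real drop would first write to disk if _useDisk
    it.remove()
  }
  freed
}

println(evictToFree(250L))     // evicts rdd_0_1, then rdd_1_0; prints 250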

5. Execution memory management

5.1 Memory allocation among multiple tasks

The tasks running in an Executor also share the execution memory, and Spark uses a HashMap structure to record the mapping from each task to its memory consumption. The execution memory each task can occupy ranges between 1/2N and 1/N of the total, where N is the number of tasks currently running in the Executor. When a task starts, it must request at least 1/2N of the execution memory from the MemoryManager; if the request cannot be satisfied, the task is blocked until other tasks release enough execution memory, at which point it can be woken up.
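
A worked example of the 1/2N to 1/N bounds (all values illustrative):

val executionMemory = 2048L << 20   // 2 GB of execution memory
val n = 4                           // 4 tasks currently running

val minPerTask = executionMemory / (2 * n)  // 256 MB: below this, the task blocks
val maxPerTask = executionMemory / n        // 512 MB: upper bound per task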

5.2 Shuffle memory usage

Execution memory mainly stores the memory a task occupies when executing Shuffle. Shuffle is the process of repartitioning RDD data according to certain rules. Let's look at how execution memory is used in the Write and Read phases of Shuffle:

Shuffle Write

  • If the ordinary sort path is chosen on the map side, an ExternalSorter is used for external sorting, which mainly occupies on-heap execution space when storing data in memory.

  • If the Tungsten sort path is chosen on the map side, a ShuffleExternalSorter is used to directly sort the data stored in serialized form; storing data in memory can occupy off-heap or on-heap execution space, depending on whether the user has enabled off-heap memory and whether the off-heap execution memory is sufficient.

Shuffle Read

When aggregating data on the reduce side, the data is handed to an Aggregator for processing, which occupies on-heap execution space when storing data in memory.

If the final result needs to be sorted, the data is handed once more to an ExternalSorter, again occupying on-heap execution space.

In ExternalSorter and Aggregator, Spark uses a hash table called AppendOnlyMap to store data in on-heap execution memory. However, not all data can be kept in that hash table during the Shuffle process: the memory occupied by the hash table is estimated by periodic sampling, and when it grows so large that no new execution memory can be obtained from the MemoryManager, Spark stores its entire contents in a disk file. This process is called spilling (Spill), and the files spilled to disk are eventually merged (Merge).

The Tungsten used in the Shuffle Write phase is a plan proposed by Databricks to optimize memory and CPU usage for Spark, solving some of the JVM's performance limitations and drawbacks. Spark automatically chooses whether to use Tungsten sorting depending on the Shuffle situation. The page-based memory management mechanism adopted by Tungsten is built on top of the MemoryManager; that is, Tungsten abstracts the use of execution memory one step further, so that during the Shuffle process there is no need to care whether the data is stored on-heap or off-heap.

Each memory page is defined by a MemoryBlock, and the two variables Object obj and long offset are used to uniformly identify the address of a memory page in system memory. An on-heap MemoryBlock is memory allocated in the form of a long array: obj is the object reference of that array, and offset is the long array's initial offset address inside the JVM; used together, the two can locate the array's absolute address in the heap. An off-heap MemoryBlock is a directly allocated memory block: its obj is null, and offset is this memory block's 64-bit absolute address in system memory. With MemoryBlock, Spark cleverly wraps on-heap and off-heap memory pages in one abstraction, and uses a page table (pageTable) to manage the memory pages allocated by each task.

All memory under Tungsten's page management is represented by a 64-bit logical address, composed of a page number and an in-page offset (a sketch of the encoding follows the list):

  • Page number: occupies 13 bits and uniquely identifies a memory page; before allocating a memory page, Spark must first acquire a free page number.

  • In-page offset: occupies 51 bits; when the memory page is used to store data, it is the offset address of the data within the page.
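
The encoding sketch promised above, modeled on the address layout in Spark's TaskMemoryManager (13-bit page number, 51-bit in-page offset):

val PageNumberBits = 13
val OffsetBits = 51                       // 64 - 13
val OffsetMask = (1L << OffsetBits) - 1

def encode(pageNumber: Int, offsetInPage: Long): Long =
  (pageNumber.toLong << OffsetBits) | (offsetInPage & OffsetMask)

def decodePage(address: Long): Int = (address >>> OffsetBits).toInt
def decodeOffset(address: Long): Long = address & OffsetMask

val addr = encode(5, 1024L)               // page 5, offset 1024
assert(decodePage(addr) == 5 && decodeOffset(addr) == 1024L)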

With this unified addressing, Spark can locate on-heap or off-heap memory with a 64-bit logical address pointer, and the whole Shuffle Write sorting process only needs to sort the pointers, with no deserialization required. The whole process is very efficient, notably improving both memory access efficiency and CPU usage efficiency.

Spark's storage memory and execution memory are managed in completely different ways: for storage memory, Spark uses a LinkedHashMap to centrally manage all Blocks, which are converted from the Partitions of RDDs that need to be cached; for execution memory, Spark uses an AppendOnlyMap to store the data in the Shuffle process, and in Tungsten sorting even abstracts it into page-based memory management, opening up a brand-new JVM memory management mechanism.



Origin blog.csdn.net/helloHbulie/article/details/124125842