The container memory of a Spark executor on YARN consists of two parts: overhead (off-heap) memory and executor memory.
A) Off-heap memory (spark.yarn.executor.memoryOverhead)
Mainly used for the overhead of the JVM itself. Default: MAX(executorMemory * 0.10, 384m)
B) Executor memory (spark.executor.memory)
Execution: memory used for computations such as shuffle, sorting, aggregation, etc.
Storage: memory used for caching and propagating internal data across the cluster (cache, broadcast variables)
Within the executor memory, the Execution and Storage regions above share a unified pool and can borrow from each other when one is idle.
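On YARN, the container size requested for an executor is the sum of the two parts above. A minimal sketch of that arithmetic, assuming the default overhead formula MAX(executorMemory * 0.10, 384m) and a hypothetical 3g executor:

```shell
# Sketch: YARN container size for one executor, using the default
# overhead formula MAX(executorMemory * 0.10, 384m). Sizes in MB.
executor_mb=3072                       # --executor-memory 3g
overhead_mb=$(( executor_mb / 10 ))    # 10% of executor memory
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi
container_mb=$(( executor_mb + overhead_mb ))
echo "overhead=${overhead_mb}m container=${container_mb}m"
```

For a 3g executor, 10% is only 307m, so the 384m floor wins and the container request is 3456m.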
Two important parameters:
spark.memory.fraction
Sets the fraction of usable heap given to the unified Execution + Storage region. The usable heap is (JVM heap memory - 300MB); the 300MB is reserved memory.
The default is 0.6 (60%) in Spark 2.x, i.e. Execution + Storage together take 60% of the usable heap.
In Spark 1.6.x, spark.memory.fraction defaults to 0.75 (75%).
Reference link: http://spark.apache.org/docs/1.6.3/tuning.html (search for spark.memory.fraction)
http://spark.apache.org/docs/latest/tuning.html
The remaining 40% is used for user data structures and Spark internal metadata, and acts as a safeguard against OOM errors.
spark.memory.storageFraction
Sets the fraction of the unified (Execution + Storage) region reserved for Storage (default 0.5). Execution cannot evict cached blocks below this threshold, so setting it prevents cached data blocks from being flushed out of RAM.
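With the Spark 2.x defaults, the split described by these two parameters works out as follows for a hypothetical 3g executor heap (integer MB arithmetic, so the values are approximate):

```shell
# Rough sketch of the Spark 2.x default heap split for a 3g executor heap.
# Assumes spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5.
heap_mb=3072
reserved_mb=300                                        # fixed reserved memory
unified_mb=$(( (heap_mb - reserved_mb) * 60 / 100 ))   # M = Execution + Storage
storage_mb=$(( unified_mb * 50 / 100 ))                # R = eviction-immune storage
other_mb=$(( heap_mb - reserved_mb - unified_mb ))     # user data structures, metadata
echo "M=${unified_mb}m R=${storage_mb}m other=${other_mb}m"
```

So of a 3g heap, roughly 1663m is the unified region M, of which 831m (R) is storage that execution can never evict.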
Spark 1.6.3
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
spark.memory.fraction expresses the size of M as a fraction of (JVM heap space - 300MB) (default 0.75). The rest of the space (25%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
Spark 2.3
Memory Management Overview
Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.
This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
spark.memory.fraction expresses the size of M as a fraction of (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM's old or "tenured" generation. See the discussion of advanced GC tuning below for details.
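That advice can be sanity-checked with rough arithmetic. With the common JVM default -XX:NewRatio=2 the old generation is about 2/3 of the heap (the exact split depends on the collector in use); a sketch comparing M against it for a hypothetical 3g heap:

```shell
# Rough check that M fits in the old generation, assuming -XX:NewRatio=2
# (old gen ~ 2/3 of heap; actual sizing depends on the GC collector).
heap_mb=3072
m_mb=$(( (heap_mb - 300) * 60 / 100 ))   # M with spark.memory.fraction=0.6
old_gen_mb=$(( heap_mb * 2 / 3 ))
if [ "$m_mb" -le "$old_gen_mb" ]; then
  echo "M=${m_mb}m fits in old gen (~${old_gen_mb}m)"
else
  echo "consider lowering spark.memory.fraction or raising heap size"
fi
```

Here M (1663m) fits comfortably inside the ~2048m old generation, so the default fraction is safe for this heap size.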
How do you configure these? For example:
spark-shell \
--master yarn-client \
--num-executors 3 \
--driver-memory 10g \
--executor-memory 3g \
--executor-cores 2 \
--conf spark.yarn.executor.memoryOverhead=1024m \
--conf spark.memory.storageFraction=0.5
================= Overhead (off-heap) memory =================
Overhead memory: besides the executor overhead memory introduced above, the Driver and ApplicationMaster processes also have their own overhead memory.
Driver overhead memory setting:
spark.driver.memoryOverhead
Default: MAX(Driver memory * 0.10, 384m)
ApplicationMaster overhead memory setting:
spark.yarn.am.memoryOverhead
Default: MAX(AM memory * 0.10, 384m)
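As a quick check of the driver default, applying the same formula to the 10g driver from the earlier spark-shell example (a sketch; here the 10% term wins over the 384m floor):

```shell
# Default driver overhead for a 10g driver: MAX(driverMemory * 0.10, 384m).
driver_mb=10240                                  # --driver-memory 10g
driver_overhead_mb=$(( driver_mb / 10 ))         # 10% = 1024m
if [ "$driver_overhead_mb" -lt 384 ]; then driver_overhead_mb=384; fi
echo "spark.driver.memoryOverhead default: ${driver_overhead_mb}m"
```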
The ApplicationMaster memory is likewise set via --conf parameters:
spark2-shell \
--master yarn-client \
--num-executors 4 \
--conf spark.yarn.am.memory=1000m \
--conf spark.yarn.am.memoryOverhead=1000m \
--conf spark.driver.memoryOverhead=1g