Spark Source Code Study: Memory Tuning

Memory Tuning

There are three considerations in tuning memory usage: the amount of memory used by your objects (you may want your entire dataset to fit in memory), the cost of accessing those objects, and the overhead of garbage collection (if you have high turnover in terms of objects).

By default, Java objects are fast to access, but can easily consume a factor of 2-5x more space than the “raw” data inside their fields. This is due to several reasons:

1.Each distinct Java object has an “object header”, which is about 16 bytes and contains information such as a pointer to its class. For an object with very little data in it (say one Int field), this can be bigger than the data.

2.Java Strings have about 40 bytes of overhead over the raw string data (since they store it in an array of Chars and keep extra data such as the length), and store each character as two bytes due to String’s internal usage of UTF-16 encoding. Thus a 10-character string can easily consume 60 bytes.

3.Common collection classes, such as HashMap and LinkedList, use linked data structures, where there is a “wrapper” object for each entry (e.g. Map.Entry). This object not only has a header, but also pointers (typically 8 bytes each) to the next object in the list.

4.Collections of primitive types often store them as “boxed” objects such as java.lang.Integer.

This section will start with an overview of memory management in Spark, then discuss specific strategies the user can take to make more efficient use of memory in his/her application. In particular, we will describe how to determine the memory usage of your objects, and how to improve it – either by changing your data structures, or by storing data in a serialized format. We will then cover tuning Spark’s cache size and the Java garbage collector.

1.Memory Management Overview

Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.

This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

1.spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MiB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.

2.spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.

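As a quick sanity check on how these two settings interact, here is a small worked example in Scala. The 4 GiB heap is an assumed figure, and the two settings shown are just the defaults made explicit:

```scala
// Worked example (assumed numbers): with a 4096 MiB executor heap and the
// defaults spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5:
//   M = 0.6 * (4096 - 300) MiB ≈ 2277 MiB  (unified execution + storage region)
//   R = 0.5 * M               ≈ 1139 MiB  (storage within M immune to eviction)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-fraction-demo")
  .config("spark.memory.fraction", "0.6")        // default, shown for illustration
  .config("spark.memory.storageFraction", "0.5") // default, shown for illustration
  .getOrCreate()
```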

2.Determining Memory Consumption

The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.

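A minimal sketch of this workflow (the input path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-sizing-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input; substitute your own dataset.
val rdd = sc.textFile("hdfs:///data/events.log")
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count() // an action is needed so the cache is actually materialized

// Then open the driver web UI (port 4040 by default) and check the
// "Storage" page to see how much memory the cached RDD occupies.
```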

To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.

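For example, the following sketch uses SizeEstimator.estimate to compare a primitive array with a boxed equivalent (the element count is arbitrary, and the exact numbers depend on the JVM):

```scala
import org.apache.spark.util.SizeEstimator

// Primitive array: about 4 bytes per element plus a small array header.
val primitive: Array[Int] = Array.tabulate(1000000)(i => i)

// Boxed array: each element is a distinct java.lang.Integer object, paying
// an object header plus a reference on top of each 4-byte value.
val boxed: Array[java.lang.Integer] = Array.tabulate(1000000)(i => Int.box(i))

println(SizeEstimator.estimate(primitive)) // roughly 4 MB
println(SizeEstimator.estimate(boxed))     // typically several times larger
```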

3.Tuning Data Structures

The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this:

1.Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library. (See the sketch after this list.)

2.Avoid nested structures with a lot of small objects and pointers when possible.

3.Consider using numeric IDs or enumeration objects instead of strings for keys.

4.If you have less than 32 GiB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers be four bytes instead of eight. You can add these options in spark-env.sh.

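As promised above, a small sketch of the layout difference (the names are illustrative, and fastutil is an optional dependency):

```scala
// Pointer-heavy layout: every key and value is boxed, and each entry
// also carries a Map.Entry wrapper object.
val boxedMap = new java.util.HashMap[java.lang.Integer, java.lang.Double]()

// Leaner layout for the same mapping: two parallel primitive arrays
// indexed by position, with no per-entry wrappers and no boxing.
val keys = new Array[Int](1000)
val values = new Array[Double](1000)

// fastutil provides primitive collections with a Map-like API, e.g.
// it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap (requires the fastutil
// dependency on the classpath).
```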

4.Serialized RDD Storage

When your objects are still too large to efficiently store despite this tuning, a much simpler way to reduce memory usage is to store them in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. Spark will then store each RDD partition as one large byte array. The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly. We highly recommend using Kryo if you want to cache data in serialized form, as it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).

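A minimal sketch combining the two suggestions (a serialized storage level plus the Kryo serializer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("serialized-cache-demo")
  // Kryo usually yields much smaller serialized data than Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Each partition of this RDD will be cached as one serialized byte array.
val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()
```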

5.Garbage Collection Tuning

JVM garbage collection can be a problem when you have large “churn” in terms of the RDDs stored by your program. (It is usually not a problem in programs that just read an RDD once and then run many operations on it.) When Java needs to evict old objects to make room for new ones, it will need to trace through all your Java objects and find the unused ones. The main point to remember here is that the cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects (e.g. an array of Ints instead of a LinkedList) greatly lowers this cost. An even better method is to persist objects in serialized form, as described above: now there will be only one object (a byte array) per RDD partition. Before trying other techniques, the first thing to try if GC is a problem is to use serialized caching.

GC can also be a problem due to interference between your tasks’ working memory (the amount of space needed to run the task) and the RDDs cached on your nodes. We will discuss how to control the space allocated to the RDD cache to mitigate this.

1.Measuring the Impact of GC

The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for info on passing Java options to Spark jobs.) Next time your Spark job is run, you will see messages printed in the worker’s logs each time a garbage collection occurs. Note these logs will be on your cluster’s worker nodes (in the stdout files in their work directories), not on your driver program.

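One way to pass these flags, sketched via SparkConf (these particular logging flags apply to JVMs before Java 9, and executor options must be in place before the application starts):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```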

2.Advanced GC Tuning

1.To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:

(1)Java Heap space is divided into two regions, Young and Old.

The Young generation is meant to hold short-lived objects while the Old generation is intended for objects with longer lifetimes.

(2)The Young generation is further divided into three regions [Eden, Survivor1, Survivor2]. (Note: by default these are sized in an 8:1:1 ratio.)

(3)A simplified description of the garbage collection procedure:

When Eden is full, a minor GC (YGC) is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old enough or Survivor2 is full, it is moved to Old. Finally, when Old is close to full, a full GC (FGC) is invoked.

2.Some steps which may be useful are:

The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs to collect temporary objects created during task execution. Some steps which may be useful are:

(1)Check if there are too many garbage collections by collecting GC stats.

If a full GC is invoked multiple times before a task completes, it means that there isn’t enough memory available for executing tasks.

(2)If there are too many minor collections but not many major GCs, allocating more memory for Eden would help.

You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be E, then you can set the size of the Young generation using the option -Xmn=4/3*E. (The scaling up by 4/3 is to account for space used by survivor regions as well.)

(3)In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction;

it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering -Xmn if you’ve set it as above. If not, try changing the value of the JVM’s NewRatio parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds spark.memory.fraction.

(4)Try the G1GC garbage collector with -XX:+UseG1GC.

It can improve performance in some situations where garbage collection is a bottleneck. Note that with large executor heap sizes, it may be important to increase the G1 region size with -XX:G1HeapRegionSize.

(5)As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using the size of the data block read from HDFS.

Note that the size of a decompressed block is often 2 or 3 times the size of the block. So if we wish to have 3 or 4 tasks’ worth of working space, and the HDFS block size is 128 MiB, we can estimate the size of Eden to be 4*3*128 MiB (see the worked calculation after this list).

(6)Monitor how the frequency and time taken by garbage collection changes with the new settings.
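
Putting the numbers from step (5) together, a worked calculation under those assumptions:

```scala
// All figures assumed from the HDFS example above:
//   HDFS block size       = 128 MiB
//   decompression factor  ≈ 3x
//   concurrent tasks      = 4
//   Eden estimate E       = 4 * 3 * 128 MiB = 1536 MiB
//   Young generation      = 4/3 * E         = 2048 MiB (extra 1/3 for survivors)
// which corresponds to the JVM flag -Xmn2048m, passed e.g. via:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xmn2048m")
```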

Our experience suggests that the effect of GC tuning depends on your application and the amount of memory available. There are many more tuning options described online, but at a high level, managing how frequently full GC takes place can help in reducing the overhead.

GC tuning flags for executors can be specified by setting spark.executor.defaultJavaOptions or spark.executor.extraJavaOptions in a job’s configuration.

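A sketch of what that can look like (the flag values are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.defaultJavaOptions", "-XX:+UseG1GC")
  // extraJavaOptions is appended after defaultJavaOptions, so it can add
  // further flags, e.g. a larger G1 region size for big executor heaps:
  .set("spark.executor.extraJavaOptions", "-XX:G1HeapRegionSize=16m")
```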

Reposted from blog.csdn.net/weixin_48929324/article/details/116057650