Spark Tuning

Because most Spark computations are "in-memory" by nature, any resource in the cluster (CPU, network bandwidth, or memory) can become the bottleneck of a Spark program. Usually, if the data fits in memory, network bandwidth becomes the bottleneck, but you may still need to optimize the program, for example by storing RDDs (Resilient Distributed Datasets) in serialized form to reduce memory usage. This article mainly covers two topics: data serialization and memory optimization. Data serialization not only improves network performance but also reduces memory usage. We also discuss several smaller topics.

Data Serialization

Serialization plays a very important role in the performance of distributed programs. A poor serialization format (one that is slow to serialize, or that produces large output) will greatly slow down computation. In many cases, serialization is the first thing to tune in a Spark application. Spark tries to strike a balance between convenience and performance, and provides two serialization libraries:

Java Serialization: By default, Spark uses Java's ObjectOutputStream to serialize objects. This works for any class that implements java.io.Serializable, and you can control serialization performance more closely by implementing java.io.Externalizable. Java serialization is very flexible, but it is slow and in many cases produces relatively large serialized results.
Kryo Serialization: Spark can also serialize objects using the Kryo library (version 2). Kryo is significantly faster than Java serialization (often by as much as 10x) and produces more compact results. Its disadvantage is that it does not support all types, and for best performance you need to register in advance the classes your program uses.
You can switch to Kryo by calling System.setProperty("spark.serializer", "spark.KryoSerializer") before creating the SparkContext. The only reason Kryo is not the default is the class registration requirement; we recommend it for any "network-intensive" application.

Finally, to register your classes with Kryo, extend spark.KryoRegistrator and set the system property spark.kryo.registrator to point to your class, as follows:


import com.esotericsoftware.kryo.Kryo

class MyRegistrator extends spark.KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}

// Make sure to set these properties *before* creating a SparkContext!
System.setProperty("spark.serializer", "spark.KryoSerializer")
System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(...)

The Kryo documentation describes many more advanced registration options, such as adding custom serialization code.

If your objects are large, you may also need to increase the value of the property spark.kryoserializer.buffer.mb. Its default value is 32, but the buffer needs to be large enough to hold the largest object you will serialize.
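
For example, a minimal sketch (64 is only an illustrative value; as with the serializer properties above, set it before creating the SparkContext):

// Enlarge the Kryo serialization buffer so it can hold your largest object.
// 64 MB is an arbitrary example value, not a recommendation.
System.setProperty("spark.kryoserializer.buffer.mb", "64")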

Finally, if you do not register your classes, Kryo will still work, but it will have to store the full class name with every object, which is very wasteful.

Memory Optimization

Memory optimization has three considerations: the amount of memory used by your objects (you may want all of your data to fit in memory), the cost of accessing those objects, and the overhead of garbage collection.

In general, Java objects are fast to access, but they usually take up 2-5 times more space than the raw data inside their fields. This is mainly due to the following reasons:

Each Java object has an "object header" of about 16 bytes, containing a pointer to the object's class and other bookkeeping information. If the object itself holds very little data, the header can be larger than the data.
A Java String carries roughly 40 bytes of overhead on top of the raw string data (String stores the characters in a char array and keeps extra fields such as the length), and because characters are stored as UTF-16, each one takes two bytes. A string of length 10 therefore easily consumes about 60 bytes.
Common collection classes, such as HashMap and LinkedList, use linked data structures in which each entry is a wrapper object. Each entry carries not only an object header but also a pointer (usually 8 bytes) to the next entry.
Collections of primitive types usually store them as "boxed" objects, such as java.lang.Integer.
This section discusses how to estimate the memory used by your objects and how to improve on it, either by changing your data structures or by storing data in serialized form. We then cover tuning Spark's cache size and the Java garbage collector.

Determining Memory Consumption

The best way to determine how much memory an object needs is to create an RDD, put it in the cache, and then read the SparkContext logs in your driver program. The logs tell you how much memory each partition occupies; you can aggregate this information to determine the total amount of memory the RDD consumes. The log messages look like this:

INFO BlockManagerMasterActor: Added rdd_0_1 in memory on mbk.local:50311 (size: 717.5 KB, free: 332.3 MB)

This message indicates that partition 1 of RDD 0 consumed 717.5 KB of memory.
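
As a minimal sketch of how to trigger these log lines (the file path is only a placeholder):

// Cache an RDD and force its evaluation; the BlockManager messages then appear in the driver log.
val rdd = sc.textFile("hdfs://...").cache()   // "hdfs://..." is a placeholder path
rdd.count()                                   // materializes the cached partitions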

Optimizing Data Structures

The first way to reduce memory usage is to avoid Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this:

Prefer arrays of objects and arrays of primitive types over the standard Java or Scala collection classes (see the sketch after this list). The fastutil library provides very convenient collection classes for primitive types that are compatible with the Java standard library.
Avoid nested data structures with lots of small objects and pointers where possible.
Consider using numeric IDs or enumerations instead of String keys.
If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers 4 bytes instead of 8. In Java 7 or later, you can also try -XX:+UseCompressedStrings to store ASCII characters as 8 bits each. You can add these options to spark-env.sh.
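
A minimal sketch of the first point (fastutil, mentioned above, is assumed to be on the classpath; the sizes are illustrative):

// Pointer-heavy: every element is a boxed java.lang.Integer inside a linked-list node.
val boxed = new java.util.LinkedList[Integer]()
boxed.add(42)

// Compact alternatives: a primitive array, or a fastutil primitive collection (no boxing).
val primitive = new Array[Int](1000000)
val compact = new it.unimi.dsi.fastutil.ints.IntArrayList()
compact.add(42)
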
Serialized RDD Storage

If, after these optimizations, your objects are still too large to store efficiently, a much simpler way to reduce memory usage is to store them in serialized form, using a serialized StorageLevel in the RDD persistence API, such as MEMORY_ONLY_SER. Spark then stores each RDD partition as one large byte array. The only downside of serialized storage is slower access, because each object has to be deserialized on the fly. If you want to cache data in serialized form, we strongly recommend Kryo, since its output is much smaller than Java serialization (and certainly smaller than the raw Java objects).
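
A minimal sketch (rdd is any RDD; the spark.storage import path is assumed to match the old spark.* package layout used in the examples above):

import spark.storage.StorageLevel   // package path assumed for the Spark version this article targets

// Store each cached partition as one serialized byte array instead of deserialized objects.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)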

Optimizing Garbage Collection

JVM garbage collection can become a problem when your program repeatedly "churns" through large amounts of RDD data (it is usually not a problem if you read an RDD once and then run many operations on it). When old objects need to be evicted to make room for new ones, the JVM has to trace all Java objects to find the ones that are no longer needed. The key point is that the cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects (for example an array of ints instead of a LinkedList) greatly reduces this cost. An even better approach is to persist objects in serialized form, as described above: each RDD partition is then stored as a single object (a byte array). If garbage collection is a problem, try serialized caching first before trying anything else.

The working memory required by tasks and the RDDs cached on a node can interfere with each other, and this interference can also cause garbage collection problems. Below we discuss how to control the space allocated to RDD caching in order to mitigate this effect.

Measuring the Impact of Garbage Collection

The first step in GC tuning is to collect statistics, such as how often garbage collection occurs and how long it takes. You can do this by adding the parameters -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the environment variable SPARK_JAVA_OPTS. The next time your Spark job runs, you will see a message in the logs each time a garbage collection occurs. Note that these logs appear on the cluster's worker nodes, not in your driver program.

Optimizing Cache Size

How much memory to use for caching RDDs is one of the most important parameters for garbage collection. By default, Spark uses 66% of the configured executor memory (spark.executor.memory or SPARK_MEM) to cache RDDs, which leaves 33% of the memory available for objects created during task execution.

If tasks are running slowly and the JVM is garbage-collecting frequently or running out of memory, lowering the cache size can reduce memory pressure. To use, say, 50% of the memory for caching, call System.setProperty("spark.storage.memoryFraction", "0.5"), as sketched below. Combined with serialized caching, a smaller cache is usually enough to resolve most garbage collection problems. If you are interested in tuning Java garbage collection further, read on.
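
A minimal sketch, following the same pattern as the serializer properties earlier (the SparkContext arguments are elided as in the example above):

// Reserve only 50% of executor memory for the RDD cache, leaving more room for task execution.
// Set this before creating the SparkContext.
System.setProperty("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext(...)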

Advanced Garbage Collection Tuning

To tune garbage collection further, we need to understand some basics of JVM memory management:

The Java heap is divided into two regions: the young generation and the old generation. The young generation holds short-lived objects, while the old generation holds objects with longer lifetimes.
The young generation is further divided into three regions: Eden, Survivor1, and Survivor2.
A simplified description of the collection process: when Eden fills up, a minor GC runs and the objects still alive in Eden and Survivor1 are copied to Survivor2; the survivor regions are then swapped. If an object has survived long enough, or Survivor2 is full, it is promoted to the old generation. Finally, when the old generation is close to full, a full GC runs.
The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the old generation, and that the young generation is large enough to hold short-lived objects. This helps avoid full GCs during task execution. Some steps that may be useful:

Check whether garbage collection is too frequent by collecting GC statistics. If a full GC runs multiple times before a task finishes, there is not enough memory left for executing tasks.
In the GC logs, if the old generation is nearly full, reduce the amount of memory used for caching through the property spark.storage.memoryFraction. Caching fewer objects is well worth it if it speeds up execution.
If there are many minor GCs but not many full GCs, allocating more memory to Eden helps. Set the Eden size somewhat higher than what each task is expected to need. If the estimated Eden size is E, set the young generation size with -Xmn=4/3*E (the 4/3 factor accounts for the space used by the survivor regions).
For example, if a task reads data from HDFS, its memory needs can be estimated from the size of the blocks it reads. Note that a decompressed block is usually 2-3 times larger than the compressed one. So if we want room for 3 or 4 tasks at once and the HDFS block size is 64 MB, we can estimate Eden at 4*3*64 MB = 768 MB, which gives a young-generation setting of roughly -Xmn = 4/3 * 768 MB, i.e. about 1 GB.
Keep monitoring how often garbage collection occurs and how long it takes, and adjust the settings accordingly.
Our experience is that effective garbage collection tuning depends on your application and on the amount of memory available. There are many more tuning options described online, but at a high level, keeping the frequency of full GCs under control is the most helpful way to reduce overhead.

Other Considerations

Parallelism

Clusters are not used efficiently unless the level of parallelism of each operation is high enough. Spark automatically sets the number of "map" tasks for a file according to its size (you can also control this through SparkContext configuration parameters); for distributed "reduce" operations (such as groupByKey or reduceByKey), it uses the largest number of partitions among the parent RDDs. You can pass the level of parallelism as a second argument (see the documentation for spark.PairRDDFunctions), or change the default by setting the system property spark.default.parallelism. In general, we recommend 2-3 tasks per CPU core in your cluster.
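
A minimal sketch (pairs is assumed to be an RDD of key-value pairs; 100 is only an illustrative partition count):

// Raise the default shuffle parallelism; set before creating the SparkContext.
System.setProperty("spark.default.parallelism", "100")

// Or pass the number of partitions directly as the second argument of a shuffle operation.
val counts = pairs.reduceByKey(_ + _, 100)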

Memory Usage of Reduce Tasks

Sometimes you will get an OutOfMemory error not because your RDDs do not fit in memory, but because the working set of a single task is too large, for example a reduce task running groupByKey. Spark's "shuffle" operations (sortByKey, groupByKey, reduceByKey, join, and so on) build a hash table in each task to perform the grouping, and this table can become very large. The easiest fix is to increase the level of parallelism so that each task's input is smaller, as sketched below. Spark can efficiently support tasks as short as roughly 200 ms, because it reuses a single JVM across many tasks and keeps task-launch costs low, so you can safely increase the parallelism to well above the number of CPU cores in your cluster.
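
For example, a sketch (wordPairs is a hypothetical RDD of key-value pairs; 200 partitions is only an illustration):

// Ask for more partitions so each task builds a smaller hash table.
val grouped = wordPairs.groupByKey(200)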

Broadcasting "Large Variables"

Using SparkContext's broadcast functionality can greatly reduce the size of each serialized task and the cost of launching a job on the cluster. If your tasks use a large object from the driver program (such as a static lookup table), consider turning it into a broadcast variable. Spark prints the serialized size of each task on the master, so you can check it to decide whether your tasks are too large; as a rule of thumb, tasks larger than about 20 KB are probably worth optimizing.
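
A minimal sketch (lookupTable and ids are hypothetical; ids is an RDD of integer keys):

// Build the table once in the driver and ship it to each node as a broadcast variable.
val lookupTable = Map(1 -> "one", 2 -> "two")   // hypothetical driver-side data
val broadcastTable = sc.broadcast(lookupTable)

// Tasks read broadcastTable.value instead of capturing lookupTable in their closures.
val resolved = ids.map(id => broadcastTable.value.getOrElse(id, "unknown"))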

Summary

This article has pointed out the main things to focus on when tuning a Spark program, the most important being data serialization and memory tuning. For most programs, switching to Kryo serialization and caching data in serialized form will solve the majority of common performance problems. You are very welcome to ask tuning-related questions on the Spark mailing list.

Translation: https://www.oschina.net/translate/spark-tuning
Original article: http://spark.incubator.apache.org/docs/latest/tuning.html
This translation is provided for learning and exchange purposes only; when reprinting, please credit the translator and the source and include a link to the article.
Our translation work follows the CC license. If our work infringes your rights, please contact us promptly.
