Spark performance optimization
Tuned mainly for memory usage
Spark performance optimization techniques:
- Use high-performance serialization libraries
- Optimize data structures
- Persist or checkpoint RDDs that are used multiple times
- Choose the persistence level: MEMORY_ONLY -> MEMORY_ONLY_SER (serialized)
- Java virtual machine garbage collection tuning
- Shuffle tuning
First, judging Spark memory usage
To optimize in a targeted way, we must first see how much memory is actually being used.
(1) Memory cost
- Every Java object has an object header of about 16 bytes containing metadata, such as a pointer to its class. If the object itself is small (e.g., a single int field), the header can be larger than the object's own data
- A Java String occupies roughly 40 bytes more than the raw character data. The char array inside the String holds the character sequence, and the String also stores fields such as its length. Chars use UTF-16 encoding, so each character occupies 2 bytes. For example, a String of 10 characters takes about 2 * 10 + 40 = 60 bytes
- Java collection types such as HashMap and LinkedList use linked data structures internally, wrapping each element in an Entry object. An Entry carries not only an object header but also a pointer to the next Entry, which occupies 8 bytes
- When the element type is a primitive (int), collections usually store the boxed wrapper type (Integer) internally, adding per-element overhead
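As a rule of thumb, the String overhead above can be written as a tiny helper (a sketch: the 40-byte constant is the approximate fixed overhead cited above; real footprints vary by JVM version, pointer compression, and alignment):

```scala
// Rough estimate of the heap footprint of a java.lang.String,
// following the rule of thumb above: ~40 bytes of fixed overhead
// (object header, char[] header, length/hash fields) plus 2 bytes
// per character (UTF-16). Actual sizes vary by JVM.
object StringFootprint {
  def approxStringBytes(numChars: Int): Int = 2 * numChars + 40

  def main(args: Array[String]): Unit = {
    // A 10-character String: 2 * 10 + 40 = 60 bytes
    println(approxStringBytes(10)) // prints 60
  }
}
```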
(2) Judging memory consumption
- Set the parallelism of the RDD (parallelize and textFile both accept a second parameter that sets the number of RDD partitions; alternatively, set spark.default.parallelism in SparkConf to set the partition count of all RDDs in the application uniformly)
- Cache the RDD
- Observe the logs: the driver log is under Spark's work folder
- Add up the MemoryStore size information in the log; that total is the memory used by the cached RDD
- Optimize the cache buffer size (if a registered custom type is extremely large, e.g. 100 fields, the serialized object may be too big for Kryo's internal buffer to hold. In that case, increase the spark.kryoserializer.buffer.max parameter)
- Register custom types in advance (Kryo works without registering custom types, but it then writes each object's fully qualified class name into the output, which consumes memory. It is recommended to pre-register the custom types to be serialized)
- Prefer arrays and strings over collection classes (use Array instead of ArrayList, LinkedList, or HashMap where possible); an int[] uses less memory than a List<Integer>
- Convert object to string
- Avoid using multiple layers of nested object structures
- Where String can be avoided, prefer int (e.g., for identifiers)
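The memory-judging steps above can be sketched as a small Spark application (a sketch that assumes Spark on the classpath; the local[*] master, app name, and partition counts are illustrative only):

```scala
// Sketch: set a default parallelism, cache an RDD with the serialized
// storage level, and then look for "MemoryStore" lines in the driver log
// (or the Storage tab of the web UI) to see the cached RDD's footprint.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSizing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cache-sizing")
      .setMaster("local[*]")                  // for local experimentation only
      .set("spark.default.parallelism", "8")  // uniform partition count
    val sc = new SparkContext(conf)

    // Explicit per-RDD parallelism via the second argument to parallelize
    val rdd = sc.parallelize(1 to 1000000, 8)
    rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized cache, smaller footprint
    rdd.count()                               // force materialization

    sc.stop()
  }
}
```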
Second, use a high-performance serialization class library
1. Data serialization
Data serialization converts an object or data structure into a specific format so that it can be transmitted over the network or stored in memory or in a file
Deserialization is the opposite operation, which restores the object from the serialized data
The serialized data format can be any format such as binary, xml, json, etc.
The focus of object and data serialization is on data exchange and transmission
In any distributed system, serialization plays an important role
If the serialization technology in use is slow, or the serialized data is still large, it reduces the performance of the distributed application.
Spark itself serializes data by default in some places, such as during shuffle. In addition, when we use external data (custom types), we must also make them serializable
By default, Spark favors convenience and uses Java's built-in serialization mechanism, which is very easy to use. However, the Java serialization mechanism performs poorly: serialization is slow, and the serialized data is large and takes up memory
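To see why Java's built-in mechanism is considered bulky, a minimal measurement sketch (exact serialized sizes vary across JVM versions, but the overhead over the 40-byte raw payload is consistently large):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize a small payload with Java's built-in mechanism and compare
// the serialized size with the raw data size.
object JavaSerSize {
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray.length
  }

  def main(args: Array[String]): Unit = {
    val raw = 10 * 4                            // ten ints = 40 bytes of payload
    val ser = serializedSize(Array.fill(10)(7)) // Java-serialized size
    println(s"raw=$raw serialized=$ser")        // serialized is noticeably larger
  }
}
```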
2. Kryo
Spark supports serialization using the Kryo class library
Kryo is fast and has a smaller footprint: its serialized data is up to 10 times smaller than Java's
3. Using the Kryo serialization mechanism
(1) Set the Spark conf
spark.master spark://6.7.8.9:7077
spark.executor.memory 4g
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
(2) To get the best performance from Kryo, register the classes to be serialized in advance:
conf.registerKryoClasses(Array(classOf[Count],...))
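The settings above can also be combined programmatically (a sketch: Count stands in for the example class from the text, and the buffer value is illustrative; raise spark.kryoserializer.buffer.max when registered types are very large):

```scala
import org.apache.spark.SparkConf

// Hypothetical application type to be serialized with Kryo
case class Count(word: String, n: Int)

object KryoSetup {
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-example")
    // Same setting as the spark.serializer line in the conf above
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Raise the buffer ceiling for very large registered types
    .set("spark.kryoserializer.buffer.max", "64m")
    // Pre-register to avoid Kryo writing fully qualified class names
    .registerKryoClasses(Array(classOf[Count]))
}
```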
4. Optimization of the Kryo library
Third, optimize the data structure
Overview
To reduce memory consumption, in addition to using efficient serialization libraries, data structures must also be optimized to avoid additional memory overhead caused by Java syntax
Core: optimize the local data used inside operator functions, as well as the external data that operator functions reference
Purpose: To reduce the consumption and occupation of memory
Practice
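A minimal sketch of the "prefer primitives and arrays" advice, using only the standard library; the computation is identical either way, only the memory layout differs:

```scala
// Prefer primitive arrays over collections of boxed wrappers:
// an Array[Int] stores 4 bytes per element, while a linked collection
// of Integer pays an object header, a boxed Integer, and a node pointer
// per element.
object PrimitiveVsBoxed {
  def sumArray(xs: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) { total += xs(i); i += 1 }
    total
  }

  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(1000)(identity) // 0, 1, ..., 999 as a primitive int[]
    println(sumArray(xs))                   // prints 499500
  }
}
```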
Fourth, the tuning of the Java virtual machine
1. Overview
If a large amount of data is persisted when the RDD is persisted, then the garbage collection of the Java virtual machine may become a bottleneck
The Java virtual machine periodically performs garbage collection: it traces all Java objects, finds those no longer in use, cleans up these old objects, and frees space for new ones.
The performance overhead of garbage collection is proportional to the number of objects in memory
Java virtual machine tuning only makes sense after the previous tuning work (serialization, data structures) has been done.
2. Spark GC principle
3. Monitoring garbage collection
Monitor how frequently garbage collection occurs, how much time each collection takes, and so on
In the spark-submit script, add: --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
Worker logs: the logs folder under Spark
Driver log: the work folder under Spark
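Put together, a hedged spark-submit invocation with GC logging enabled (the class name and jar path are placeholders, not from the original text):

```shell
# Hypothetical spark-submit invocation enabling executor GC logging;
# substitute your own application class and jar.
spark-submit \
  --class com.example.MyApp \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```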
4. Optimize the Executor memory ratio
Purpose: To reduce the number of GC
For GC tuning, the most important knob is the ratio between the memory occupied by the RDD cache and the memory occupied by objects created during operator execution
By default, Spark uses 60% of each Executor's memory to cache RDDs, leaving only 40% for objects created during task execution.
Configuration: conf.set("spark.storage.memoryFraction", "0.5")
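The same setting expressed as a self-contained sketch (the object name is arbitrary; note this knob belongs to the legacy memory manager):

```scala
import org.apache.spark.SparkConf

object MemoryFractionSetup {
  // Give the RDD cache 50% of executor memory instead of the default 60%,
  // leaving more room for objects created during task execution.
  // (spark.storage.memoryFraction applies to the legacy memory manager;
  // Spark 1.6+ defaults to unified memory management.)
  val conf: SparkConf = new SparkConf()
    .set("spark.storage.memoryFraction", "0.5")
}
```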