Advanced Spark GC optimization

  • The Java heap is divided into two regions, Young and Old. The Young region mainly holds short-lived objects, while the Old region holds longer-lived objects

  • The Young region is further subdivided into three parts: Eden, Survivor1, and Survivor2

  • Briefly, the GC process works as follows: when Eden fills up, a minor GC is triggered; live objects in Eden and Survivor1 are copied to Survivor2, and the two Survivor areas are then swapped. If an object has survived long enough, or Survivor2 is full, it is moved to the Old region. Finally, when the Old region is nearly exhausted, a full GC is triggered, which is far more expensive

    • Spark's GC optimization goal is to ensure that long-lived RDDs stay in the Old region while the Young region keeps enough space for short-lived objects, so that the temporary objects generated during task execution do not trigger full GCs. The following steps can be used to optimize:

    1: By observing the GC stats, check whether multiple full GCs occur before a task finishes. If full GC is triggered many times, there is not enough memory available for executing tasks.
    2: If the GC stats show that minor GCs are triggered too frequently, increase the size of the Eden area. Set it to more memory than each task is expected to require: assuming the estimated value is E, set the size of the Young region with -Xmn=4/3*E (scaled up to 4/3 because the Survivor spaces must also be accounted for).
    3: If the GC stats show that the OldGen space is close to exhaustion, reduce the total amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Alternatively, consider shrinking the Young region: if -Xmn was set as above, reduce that value appropriately; otherwise, adjust the JVM's NewRatio parameter. Its default in most JVMs is 2, meaning the Old region occupies 2/3 of the entire heap, and it should be large enough that the Old region exceeds the fraction of the heap given by spark.memory.fraction.
    4: Turn on G1GC with the -XX:+UseG1GC option; it can significantly improve performance when GC is a bottleneck. Note that for tasks that consume a lot of memory, increasing the G1 region size via -XX:G1HeapRegionSize is important.
    5: For example, when a task reads data from HDFS, its memory consumption can be estimated from the HDFS block size. Note that a decompressed block is generally 2-3 times larger than the block itself. If we want working space for 3-4 tasks in the same executor process, and the HDFS block size is 128MB, the Eden estimate E works out to 4 × 3 × 128MB.
    6: After applying the settings, monitor whether the frequency and duration of GCs change.
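The sizing rules from steps 2 and 5 can be sketched as a small calculation. This is a minimal illustration using the text's own numbers (3-4 concurrent tasks, a 2-3x decompression factor, a 128MB HDFS block); the constant names are illustrative, not Spark API.

```python
# Rule-of-thumb Eden/Young sizing from the steps above (illustrative).
HDFS_BLOCK_MB = 128       # HDFS block size
DECOMPRESSION_FACTOR = 3  # a decompressed block is ~2-3x its on-disk size
TASKS_PER_EXECUTOR = 4    # 3-4 tasks sharing one executor process

# Step 5: estimated working-set size E for Eden
eden_mb = TASKS_PER_EXECUTOR * DECOMPRESSION_FACTOR * HDFS_BLOCK_MB

# Step 2: size the whole Young generation at 4/3 * E to leave room
# for the Survivor spaces.
young_gen_mb = eden_mb * 4 // 3

print(f"Estimated Eden size E: {eden_mb} MB")   # 1536 MB
print(f"Suggested -Xmn value:  {young_gen_mb}m")  # 2048m
```

With these numbers, E comes out to 1536MB and the suggested Young-generation size to roughly 2GB.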

PS: To monitor GC, add -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options.
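In a Spark application, these JVM options are typically passed to executors via spark.executor.extraJavaOptions. A sketch of the submit command (the jar name and memory size are placeholders to adapt to your job):

```shell
# Enable GC logging on Spark executors (flags from the PS note above).
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.executor.memory=4g \
  your-app.jar
```

Note that heap size itself (-Xmx) must be set through spark.executor.memory, not through extraJavaOptions; the GC logs then appear in each executor's stdout.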

A typical young-generation GC log line looks like this:

2014-07-18T16:02:17.606+0800: 611.633: [GC 611.633: [DefNew: 843458K->2K(948864K), 0.0059180 secs] 2186589K->1343132K(3057292K), 0.0059490 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

Its fields mean roughly the following:

  • 2014-07-18T16:02:17.606+0800: wall-clock timestamp
  • 611.633: seconds since JVM start
  • [GC: a young (minor) GC
  • [DefNew: the single-threaded Serial young-generation collector
  • 843458K->2K(948864K): young-generation size before collection -> size after collection (total young-generation capacity)
  • 0.0059180 secs: time spent on the young-generation collection
  • 2186589K->1343132K(3057292K): whole-heap size before collection -> size after collection (total heap capacity)
  • 0.0059490 secs: total collection time
  • [Times: user=0.00 sys=0.00, real=0.00 secs]: user, system, and wall-clock time
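As an illustration, the numeric fields of the sample line can be extracted programmatically. This is a minimal sketch that assumes exactly the DefNew format shown above; real GC log layouts vary by collector and JVM version.

```python
import re

# The sample young-generation GC log line from the text.
line = ("2014-07-18T16:02:17.606+0800: 611.633: [GC 611.633: "
        "[DefNew: 843458K->2K(948864K), 0.0059180 secs] "
        "2186589K->1343132K(3057292K), 0.0059490 secs] "
        "[Times: user=0.00 sys=0.00, real=0.00 secs]")

# Match "[DefNew: before->after(total), pause secs]".
m = re.search(r"\[DefNew: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]", line)
before_kb, after_kb, total_kb = int(m.group(1)), int(m.group(2)), int(m.group(3))
pause_secs = float(m.group(4))

print(f"Young gen: {before_kb}K -> {after_kb}K of {total_kb}K in {pause_secs} s")
```

Tracking these numbers over time (e.g. how close `before_kb` sits to `total_kb` at each minor GC) is a quick way to see whether Eden is undersized.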

An old-generation (full GC) log line looks like this:

2014-07-18T16:19:16.794+0800: 1630.821: [GC 1630.821: [DefNew: 1005567K->111679K(1005568K), 0.9152360 secs]1631.736: [Tenured: 2573912K->1340650K(2574068K), 1.8511050 secs] 3122548K->1340650K(3579636K), [Perm : 17882K->17882K(21248K)], 2.7854350 secs] [Times: user=2.57 sys=0.22, real=2.79 secs]

The end of the GC log contains a heap snapshot printed before the JVM exits:

Heap
 def new generation   total 1005568K, used 111158K [0x00000006fae00000, 0x000000073f110000, 0x0000000750350000)
  eden space 893888K,  12% used [0x00000006fae00000, 0x0000000701710e90, 0x00000007316f0000)
  from space 111680K,   3% used [0x0000000738400000, 0x000000073877c9b0, 0x000000073f110000)
  to   space 111680K,   0% used [0x00000007316f0000, 0x00000007316f0000, 0x0000000738400000)
 tenured generation   total 2234420K, used 1347671K [0x0000000750350000, 0x00000007d895d000, 0x00000007fae00000)
   the space 2234420K,  60% used [0x0000000750350000, 0x00000007a2765cb8, 0x00000007a2765e00, 0x00000007d895d000)
 compacting perm gen  total 21248K, used 17994K [0x00000007fae00000, 0x00000007fc2c0000, 0x0000000800000000)
   the space 21248K,  84% used [0x00000007fae00000, 0x00000007fbf92a50, 0x00000007fbf92c00, 0x00000007fc2c0000)
No shared spaces configured.
 


Origin blog.csdn.net/qq_32445015/article/details/104870491