Spark Learning from 0 to 1 (10): Spark Tuning (3) - Memory Tuning

1. Memory Tuning

(Figure: JVM heap layout with Eden, two Survivor spaces, and the old generation.)

The young generation of the JVM heap is divided into a larger Eden space and two smaller Survivor spaces. Only Eden and one Survivor space are in use at any time. During a collection, the live objects in Eden and the active Survivor space are copied into the other Survivor space, and then Eden and the just-used Survivor space are cleared. In other words, while tasks run, new objects are allocated in Eden and Survivor1 while Survivor2 stays empty. When Eden and Survivor1 fill up, a minor GC is triggered to clean up objects that are no longer referenced, and the surviving objects are moved into Survivor2.

If the surviving objects are larger than Survivor2 can hold, the JVM moves the overflow directly into the old generation.

If the young generation is not very large, minor GC will run frequently. Under frequent minor GC, objects that survive for even a short time (still referenced at collection time, so they cannot be released) are copied back and forth between the Survivor spaces, and each minor GC an object survives increases its age by one. Once an object's age exceeds the tenuring threshold (15 by default), it is promoted to the old generation, even though it may not be needed for long.

The result is an old generation full of short-lived objects, when it should hold only a small number of long-lived objects such as database connection pool objects. Eventually the old generation overflows and triggers a full GC. Because full GC is expected to run rarely (the old generation normally holds few objects), it uses a simpler but slower, more costly collection algorithm. Both minor GC and full GC stop the JVM's worker threads while they run.
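These generation sizes and promotion thresholds are ordinary JVM flags, and Spark lets you pass them to Executors through spark.executor.extraJavaOptions. Below is a minimal sketch in Scala; the flag values are illustrative examples, not recommendations, and the heap size itself cannot be set this way:

    import org.apache.spark.SparkConf

    // Sketch: passing generational-GC flags to Spark Executors.
    // The values below are illustrative examples, not recommendations.
    val conf = new SparkConf()
      .setAppName("gc-generation-tuning")
      // -XX:SurvivorRatio=4 makes Eden 4x the size of each Survivor
      // space (so each Survivor gets 1/6 of the young generation).
      // -XX:MaxTenuringThreshold=10 promotes objects to the old
      // generation after 10 survived minor GCs instead of the default 15.
      .set("spark.executor.extraJavaOptions",
        "-XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=10")
    // The heap size itself must be set via spark.executor.memory,
    // not through extraJavaOptions.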

Summary: the impact of insufficient heap memory:

  • Frequent minor GC.
  • A large number of short-lived objects promoted into the old generation, which eventually triggers full GC.
  • More GC pauses, which degrade Spark's performance and running speed.

Spark JVM tuning is mainly about reducing GC time, and you can do this by adjusting the Executor memory-fraction parameters.

RDD cache data and the objects created by the operator functions a task runs can occupy a lot of heap memory. Once memory fills up, GC runs frequently, and if GC still cannot free enough memory, an OOM error is thrown.

For example, a task creates N objects as it runs, and these objects are first allocated in the JVM's young generation.

For example, when we use foreach to write data to a database, each record is wrapped in an object before being stored, so however many records there are, that many objects are created in the JVM.
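A common way to cut down this per-record churn is to write with foreachPartition instead of foreach, so each partition shares one connection and one batched statement. A minimal sketch, assuming a JDBC sink; the URL, table name, and credentials are placeholders:

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    // Sketch: batching database writes per partition instead of per record.
    // The JDBC URL, table name, and credentials below are placeholders.
    def saveBatched(rdd: RDD[String]): Unit = {
      rdd.foreachPartition { records =>
        // One connection and one statement per partition, not per record.
        val conn = DriverManager.getConnection(
          "jdbc:mysql://localhost:3306/test", "user", "password")
        val stmt = conn.prepareStatement("INSERT INTO t (value) VALUES (?)")
        try {
          records.foreach { r =>
            stmt.setString(1, r)
            stmt.addBatch() // accumulate rows into one batch
          }
          stmt.executeBatch() // a single batched write per partition
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }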

2. How to optimize memory in Spark?

Memory in the Spark Executor heap is divided as follows (taking static memory management as an example):

  • RDD cache data and broadcast variables:

    spark.storage.memoryFraction 0.6

  • Shuffle aggregate memory:

    spark.shuffle.memoryFraction 0.2

  • Task running memory: the remaining 0.2

So how do we tune it?

  1. Increase the overall memory size of the Executor.
  2. Reduce the storage memory fraction or the shuffle aggregation memory fraction, leaving more room for task execution (see the sketch below).
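A minimal sketch of both levers, using the static memory management parameters listed above; the 10g figure is illustrative:

    import org.apache.spark.SparkConf

    // Sketch: static memory management knobs (legacy, pre-unified model).
    // With a 10g Executor heap, the default split is roughly:
    //   storage 0.6 -> ~6g, shuffle 0.2 -> ~2g, task execution -> ~2g.
    val conf = new SparkConf()
      .setAppName("memory-fraction-tuning")
      .set("spark.executor.memory", "10g")        // 1. bigger overall heap
      .set("spark.storage.memoryFraction", "0.4") // 2. shrink the cache share
      .set("spark.shuffle.memoryFraction", "0.2") //    keep the shuffle share
    // Whatever the two fractions do not claim (here 0.4) is left for
    // the objects created while tasks run.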

3. How to check GC?

In the Spark Web UI, drill down from job -> stage -> task.
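Each task row there includes a GC Time column. For more detail, you can also turn on GC logging in the Executor JVMs and read the Executor stdout logs; a minimal sketch using standard HotSpot flags:

    import org.apache.spark.SparkConf

    // Sketch: enabling GC logs on Executors so minor/full GC activity
    // shows up in each Executor's stdout log.
    val conf = new SparkConf()
      .setAppName("gc-logging")
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")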

Origin: blog.csdn.net/dwjf321/article/details/109056283