Learning Spark from 0 to 1 (6): Spark memory management


When Spark runs an application, the cluster starts two kinds of JVM processes: the Driver and the Executors. The Driver creates the SparkContext, submits jobs, and schedules tasks. The Executors run the tasks and return the results to the Driver, and they also provide storage for RDDs that need to be persisted. Memory management on the Driver side is relatively simple, so the Spark memory management discussed here refers to the Executor side.

Spark memory management comes in two forms: static memory management and unified memory management. Static memory management was used before Spark 1.6, and unified memory management was introduced in Spark 1.6.

1. Static memory management

In static memory management, the sizes of storage memory, execution memory, and other memory are fixed while the Spark application is running, but users can configure them before the application starts.

Spark 1.6 and later use unified memory management by default. You can switch back to static memory management by setting the parameter spark.memory.useLegacyMode to true (the default is false).
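As a small illustration of the version rule above, here is a minimal sketch that returns which memory manager applies. The function name memory_manager and the (major, minor) version tuple are made up for this example. The branch for Spark 3.0 and later rests on my understanding that the legacy mode was removed there, so treat it as an assumption.

```python
def memory_manager(spark_version, use_legacy_mode=False):
    """Return "static" or "unified" for the Executor memory manager.

    spark_version   -- (major, minor) tuple, e.g. (1, 6); hypothetical helper input
    use_legacy_mode -- mirrors spark.memory.useLegacyMode (default false)
    """
    if spark_version < (1, 6):
        # Before Spark 1.6 only static memory management exists.
        return "static"
    if spark_version >= (3, 0):
        # Assumption: legacy mode is no longer available from Spark 3.0 on.
        return "unified"
    # Spark 1.6 up to 2.x: unified by default, static only when legacy mode is on.
    return "static" if use_legacy_mode else "unified"


print(memory_manager((1, 5)))
print(memory_manager((2, 4)))
print(memory_manager((2, 4), use_legacy_mode=True))
```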

1.1 Static memory management distribution map

[Figure: static memory management distribution map]

1.2 Detailed explanation of static memory management

  • 60% of the memory is used for Spark storage. 10% of this storage memory is reserved to prevent OOM exceptions, and the other 90% is used to store data. Of that 90%, 20% is used to unroll data (decompression and deserialization), and the remaining 80% stores RDD cache data and broadcast variables.
  • 20% of the memory is used for Spark shuffle. 80% of this shuffle memory is used for shuffle aggregation, and the other 20% is reserved to prevent OOM exceptions.
  • The remaining 20% of the memory is used for task calculations.
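To make the percentages above concrete, here is a minimal sketch that splits an Executor heap according to the static layout. The function name static_memory_layout and the 1000 MB heap are made up for this example. The default fractions are the ones quoted in the list; as far as I know they correspond to legacy settings such as spark.storage.memoryFraction and spark.shuffle.memoryFraction, but that mapping is my assumption and is not stated in the list itself.

```python
def static_memory_layout(executor_memory_mb,
                         storage_fraction=0.6,         # 60% for storage
                         storage_safety_fraction=0.9,  # 90% of storage is usable
                         unroll_fraction=0.2,          # 20% of usable storage for unrolling
                         shuffle_fraction=0.2,         # 20% for shuffle
                         shuffle_safety_fraction=0.8): # 80% of shuffle is usable
    """Split an Executor heap (in MB) into the static-management regions."""
    storage_total = executor_memory_mb * storage_fraction
    storage_usable = storage_total * storage_safety_fraction
    unroll = storage_usable * unroll_fraction

    shuffle_total = executor_memory_mb * shuffle_fraction
    shuffle_usable = shuffle_total * shuffle_safety_fraction

    return {
        "storage_reserved": storage_total - storage_usable,  # OOM guard
        "storage_unroll": unroll,                            # unrolling data
        "storage_cache": storage_usable - unroll,            # RDD cache + broadcast
        "shuffle_aggregation": shuffle_usable,
        "shuffle_reserved": shuffle_total - shuffle_usable,  # OOM guard
        "task_calculation": executor_memory_mb - storage_total - shuffle_total,
    }


# Illustrative heap size only; print each region rounded to 1 decimal place.
for region, size_mb in static_memory_layout(1000).items():
    print(f"{region}: {size_mb:.1f} MB")
```

For a 1000 MB heap this works out to roughly 432 MB for cached RDDs and broadcast variables, 160 MB for shuffle aggregation, and 200 MB for task calculations. Only about 43% of the heap is actually available for caching even though 60% is nominally assigned to storage.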

2. Unified memory management

The difference between unified memory management and static memory management is that storage memory and execution memory share the same space and can borrow each other's space.

2.1 Unified memory management distribution map

[Figure: unified memory management distribution map]

2.2 Detailed explanation of unified memory management

  • 300 MB of the total memory is set aside for the operation of the JVM itself.
  • 60% of the remaining memory is used by Spark: half of it stores RDD cache data and broadcast variables, and the other half is used for shuffle aggregation.
  • The other 40% of the remaining memory is used for task calculations.
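The same kind of arithmetic applies to the unified layout. This is a minimal sketch: the function name unified_memory_layout and the 4300 MB heap are made up for this example, and the fractions are simply the ones listed above. The storage and execution sizes it returns are only the initial split, because the two regions can borrow from each other at runtime.

```python
RESERVED_MB = 300  # set aside before anything else is divided


def unified_memory_layout(executor_memory_mb,
                          memory_fraction=0.6,    # share of the remainder used by Spark
                          storage_fraction=0.5):  # initial storage share of Spark memory
    """Split an Executor heap (in MB) into the unified-management regions."""
    if executor_memory_mb <= RESERVED_MB:
        raise ValueError("heap must be larger than the 300 MB reserved region")

    usable = executor_memory_mb - RESERVED_MB
    spark_memory = usable * memory_fraction
    storage = spark_memory * storage_fraction

    return {
        "reserved": RESERVED_MB,
        "storage": storage,                       # RDD cache + broadcast variables
        "execution": spark_memory - storage,      # shuffle aggregation
        "task_calculation": usable - spark_memory,
    }


# Illustrative heap size only; print each region rounded to 1 decimal place.
for region, size_mb in unified_memory_layout(4300).items():
    print(f"{region}: {size_mb:.1f} MB")
```

With a 4300 MB heap, 4000 MB remain after the reserved region, which gives about 1200 MB each for storage and execution and 1600 MB for task calculations.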

3. How to deal with OOM in reduce?

  1. Reduce the amount of data pulled each time
  2. Increase the memory ratio of shuffle aggregation
  3. Increase the total memory of Executor
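The three remedies above could be expressed as spark-submit options roughly like this. It is a minimal sketch: the helper reduce_oom_conf and all the values in it are made up for illustration. The mapping of each remedy to a setting (spark.reducer.maxSizeInFlight, spark.shuffle.memoryFraction, spark.executor.memory) is my assumption, and to my knowledge spark.shuffle.memoryFraction only takes effect under static memory management.

```python
def reduce_oom_conf(max_size_in_flight="24m",
                    shuffle_memory_fraction=0.3,
                    executor_memory="4g"):
    """Build spark-submit --conf arguments for the three remedies.

    All default values are illustrative, not recommendations.
    """
    conf = {
        # 1. Pull less data per fetch on the reduce side.
        "spark.reducer.maxSizeInFlight": max_size_in_flight,
        # 2. Give shuffle aggregation a larger share (static mode setting).
        "spark.shuffle.memoryFraction": str(shuffle_memory_fraction),
        # 3. Give each Executor more memory overall.
        "spark.executor.memory": executor_memory,
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-submit"] + reduce_oom_conf()))
```

It is usually worth trying the remedies one at a time, so you can see which one actually removes the OOM.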

Origin blog.csdn.net/dwjf321/article/details/109048241