Spark performance tuning | memory optimization

Let’s first understand what kinds of memory Spark uses

 1. Storage memory: holds cached data; its footprint can be estimated in advance.
 2. Shuffle (execution) memory: used by computations such as join and groupBy; its footprint is hard to estimate.
 Before Spark 1.6 these two regions were sized statically; since Spark 1.6 they are managed dynamically (unified memory management), with the storage share defaulting to 0.5. A configuration sketch follows below.
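As a hedged sketch of how these two regions are tuned under unified memory management (the values shown are the documented defaults; the application name is made up for the demo):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: unified memory manager knobs (Spark 1.6+); values are the defaults.
val spark = SparkSession.builder()
  .appName("memory-config-demo")
  .config("spark.memory.fraction", "0.6")        // share of (heap - 300 MB) used for execution + storage
  .config("spark.memory.storageFraction", "0.5") // portion of that region protected for storage (the 0.5 default above)
  .getOrCreate()
```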

A friendly tip

Try to avoid writing raw RDD code in production (its performance is poor)

RDD demonstration (Spark version 2.1.1)

We convert the data to an RDD, run the job, and check how much memory it occupies.
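A hypothetical sketch of this demo, assuming the data lives at a made-up HDFS path; the final while loop is what keeps the application (and its UI) alive:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: load text data, turn it into an RDD, cache it, and keep the app alive
// so the Storage / Executors tabs of the Spark UI can be inspected.
val spark = SparkSession.builder().appName("rdd-cache-demo").getOrCreate()

val rdd = spark.read.textFile("hdfs:///tmp/demo_data").rdd   // input path is an assumption
rdd.persist(StorageLevel.MEMORY_ONLY)                        // default RDD cache level: deserialized Java objects
println(rdd.count())                                         // action that materializes the cache

while (true) Thread.sleep(10000)                             // block so the UI stays up (the while loop mentioned below)
```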
We can also check the memory usage on the Executors page.
It shows red because I wrote a while loop to keep the application running.

RDD optimization

See the official website
https://spark.apache.org/docs/2.4.5/configuration.html#compression-and-serialization
We use Kryo serialization (this tuning applies only to RDDs).
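A minimal sketch of enabling Kryo, following the configuration page above; the Record case class is a hypothetical stand-in for whatever is held in the RDD:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Record(id: Long, name: String)   // hypothetical class cached in the RDD

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record]))   // optional: avoids storing full class names with each object

val spark = SparkSession.builder().config(conf).appName("kryo-demo").getOrCreate()
```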
Next, look at the RDD storage (cache) levels:
https://spark.apache.org/docs/2.4.5/rdd-programming-guide.html#which-storage-level-to-choose
Use a serialized storage level, e.g. MEMORY_ONLY_SER, as sketched below.
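A sketch of caching with a serialized storage level, reusing the same assumed input path as the RDD demo above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: cache the RDD as serialized bytes (one byte array per partition) instead of
// deserialized Java objects; much more compact, at the cost of some extra CPU.
val spark = SparkSession.builder().appName("ser-cache-demo").getOrCreate()
val rdd = spark.read.textFile("hdfs:///tmp/demo_data").rdd   // input path is an assumption

rdd.persist(StorageLevel.MEMORY_ONLY_SER)
println(rdd.count())                                         // materialize, then check the Storage tab
```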
The cached size dropped from 1.7 GB to 270 MB, which is a substantial improvement!

DataFrame and Dataset demonstration

See the official website
https://spark.apache.org/docs/2.4.5/sql-getting-started.html#creating-datasets
A Dataset uses its own Encoder for serialization.
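A minimal sketch in the spirit of the linked "Creating Datasets" guide; the Person case class and values mirror the guide's pattern:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)    // case class in the style of the guide

val spark = SparkSession.builder().appName("ds-demo").getOrCreate()
import spark.implicits._                      // brings the implicit Encoders into scope

val ds = Seq(Person("Andy", 32)).toDS()       // Dataset[Person], serialized with its Encoder
ds.cache()                                    // Dataset/DataFrame cache default: MEMORY_AND_DISK
println(ds.count())
```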
Memory size: 34.2 MB
We can also cache the Dataset at a serialized level (the change is small).
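A sketch of the serialized variant, continuing from the ds defined in the previous sketch:

```scala
import org.apache.spark.storage.StorageLevel

// Persist the Dataset as serialized bytes; because the Encoder output is already
// compact, the saving here is small (34.2 MB -> 33.9 MB in this demo).
ds.unpersist()
ds.persist(StorageLevel.MEMORY_ONLY_SER)
println(ds.count())
```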
33.9 MB after optimization

Origin blog.csdn.net/qq_46548855/article/details/112533018