SparkSQL Performance Tuning and Optimization Guide

Spark memory leaks

1. How memory leaks manifest under high concurrency

Unfortunately, Spark's architecture was never designed for highly concurrent requests. We ran 100 concurrent queries against a cluster with poor network conditions and, after three days of stress testing, found a memory leak.

a) While stress testing with a large number of small SQL queries, we noticed that many active jobs remained in the pending state on the Spark UI and never finished, as shown in the following figure.

 

b) We also found that the driver's memory was full.

 

c) Analysis with a memory profiling tool showed the following.

 

2. Web UI memory leak caused by AsynchronousListenerBus under high concurrency

When Spark receives a large number of SQL statements in a short period of time, and those statements contain many unions and joins, a huge number of event objects are created and the event queue grows past its 10,000-event limit, so events get dropped.

These events are what drive resource cleanup; any event that is dropped means the corresponding resources are never reclaimed. To address this problem on the UI side, we removed the limit on the queue length.
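As a rough illustration of an alternative (not the patch described above): in newer Spark releases the listener-bus queue size is exposed as a configuration property and the cap can simply be raised instead of being removed in source; the property name below exists in recent Spark versions and may differ or be absent in yours.

import org.apache.spark.SparkConf

object ListenerBusQueueConfig {
  // Illustrative only: on newer Spark versions the listener-bus queue cap is a
  // config property (default 10000); on the version tested here it was hard-coded,
  // which is why the limit had to be removed in the source instead.
  val conf: SparkConf = new SparkConf()
    .setAppName("high-concurrency-sql")
    .set("spark.scheduler.listenerbus.eventqueue.capacity", "100000")
}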


3. Memory leak caused by AsynchronousListenerBus itself

packet capture


These events are written to the queue via the post method.


Draining the queue, however, is handled by a single thread calling postToAll.


Under high concurrency, the single-threaded postToAll cannot keep pace with post, so more and more events pile up in the queue. With sustained high-concurrency SQL traffic, this turns into a memory leak.
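The imbalance can be reproduced with a minimal, self-contained sketch (an illustration of the mechanism, not Spark's AsynchronousListenerBus code): many threads post events into a queue while a single consumer drains it, so the backlog grows whenever the handler is slower than the producers.

import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

object SingleConsumerBacklog {
  def main(args: Array[String]): Unit = {
    val queue = new LinkedBlockingQueue[String]()

    // Single consumer, analogous to the one thread running postToAll.
    val consumer = new Thread(new Runnable {
      override def run(): Unit = while (true) {
        queue.take()
        Thread.sleep(2) // simulate a slow listener, e.g. one blocked on logging
      }
    })
    consumer.setDaemon(true)
    consumer.start()

    // Many producers, analogous to concurrent SQL queries calling post().
    val producers = Executors.newFixedThreadPool(8)
    (1 to 8).foreach { id =>
      producers.submit(new Runnable {
        override def run(): Unit = while (!Thread.currentThread().isInterrupted) {
          queue.put(s"event-from-$id")
          Thread.sleep(1) // post() is cheap but not free
        }
      })
    }

    // The backlog keeps growing because production outpaces consumption.
    (1 to 5).foreach { _ =>
      TimeUnit.SECONDS.sleep(1)
      println(s"queued events: ${queue.size()}")
    }
    producers.shutdownNow()
  }
}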

 

Next, let's analyze which path inside postToAll is the slowest, i.e., which piece of logic makes event handling so slow.


You may find it hard to believe, but jstack captures showed that the program spends most of its time blocked on writing log records.

 

Event throughput can be improved by disabling the logging at this spot, for example by adding the following line to log4j.properties under Spark's conf directory:

 

log4j.logger.org.apache.spark.scheduler=ERROR


4. Memory leak in the Cleaner under high concurrency

Speaking of which, the Cleaner is arguably the worst-designed part of Spark. Spark's ContextCleaner is responsible for reclaiming and cleaning up broadcast and shuffle data that is no longer needed. Under high concurrency, however, we found that the data accumulating here keeps growing, until the driver's memory fills up and the process dies.

- Let's first look at how memory reclamation is triggered.

 

That's right: memory is reclaimed via System.gc(). If the JVM is configured to ignore explicit GC requests (for example with -XX:+DisableExplicitGC, which many JVM tuning guides recommend), this logic is effectively dead.
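A simplified sketch of the trigger mechanism described above (an illustration, not Spark's actual ContextCleaner code): cleanup depends on weak references being enqueued by the garbage collector, so an explicit System.gc() is scheduled periodically; under -XX:+DisableExplicitGC that call becomes a no-op and nothing is ever enqueued.

import java.util.concurrent.{Executors, TimeUnit}

object PeriodicGcTrigger {
  // Roughly what the periodic trigger amounts to: without GC activity on the
  // driver, weak references to finished broadcasts/shuffles are never enqueued,
  // so the cleaner has nothing to process.
  def start(intervalMinutes: Long = 30L): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc() // no-op under -XX:+DisableExplicitGC
    }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES)
  }
}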

- The clean process

This is single-threaded logic, and every cleanup has to be coordinated across many machines, so it is relatively slow. When SQL concurrency is high, cleanup work is generated faster than it can be processed, and the driver leaks memory. Moreover, when broadcasts occupy too much memory they also spill into a very large number of small files on local disk; in our tests under sustained high concurrency, the local directories used to store block manager data took up 60% of our storage space.


Let's now analyze which part of the clean logic is slowest.

 

The real bottleneck is removeBroadcast in BlockManagerMaster, because that logic has to reach across many machines.

 

To deal with this problem, we did the following:

- We added SQLWAITING logic at the SQL layer that checks the backlog length; if the backlog exceeds a configured threshold, execution of new SQL statements is blocked (see the sketch after this list). The threshold can be set via ydb.sql.waiting.queue.size in ya100_env_default.sh under the conf directory.

 

- Give the cluster more bandwidth: a 10-gigabit network cleans up much faster than a gigabit network.

- Give the cluster a chance to rest; do not run high concurrency continuously, and allow the cluster some idle intervals.

- Increase the size of Spark's thread pools, which can be tuned via the corresponding values in spark-defaults.conf under the conf directory.
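A minimal sketch of the SQLWAITING idea (the property name ydb.sql.waiting.queue.size comes from the text above; the class and method names here are made up for illustration): before admitting a new SQL statement, check the backlog and block until it falls below the threshold.

// Hypothetical gate implementing the SQLWAITING idea: block new SQL submissions
// while the number of pending/in-flight statements exceeds a configured limit.
class SqlWaitingGate(maxPending: Int) {
  private var pending = 0

  def runSql[T](body: => T): T = {
    synchronized {
      while (pending >= maxPending) wait() // back-pressure new SQL
      pending += 1
    }
    try body
    finally synchronized {
      pending -= 1
      notifyAll()
    }
  }
}

// Usage (hypothetical): val gate = new SqlWaitingGate(maxPending = 100)
//                       gate.runSql { sqlContext.sql(query).collect() }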


5. Memory leak caused by thread pools and ThreadLocal

Spark, Hive, and Lucene are all very fond of using ThreadLocal to manage temporary session objects, expecting those objects to be released automatically once the SQL finishes. At the same time, however, Spark uses thread pools, and the threads in those pools never terminate, so these resources are never released and memory builds up over time.

To address this, Yanyun (延云) modified the implementation of Spark's key thread pools so that every hour the pool is forcibly replaced with a new one; the old threads can then exit and release what they hold.
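A rough sketch of this rotation approach (an illustration of the idea, not the actual patch): wrap an ExecutorService, swap in a fresh pool on a fixed schedule, and shut the old one down so that its threads, and the ThreadLocal values they pin, become garbage once in-flight tasks finish.

import java.util.concurrent.{Executors, ExecutorService, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

// Hypothetical wrapper: tasks always go to the current pool; every hour the pool
// is replaced and the old one is shut down, releasing its threads and whatever
// ThreadLocal state (Spark/Hive/Lucene session objects) those threads were holding.
class RotatingThreadPool(poolSize: Int, rotateEvery: Long = 1L, unit: TimeUnit = TimeUnit.HOURS) {
  private val current =
    new AtomicReference[ExecutorService](Executors.newFixedThreadPool(poolSize))

  private val rotator = Executors.newSingleThreadScheduledExecutor()
  rotator.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      val old = current.getAndSet(Executors.newFixedThreadPool(poolSize))
      old.shutdown() // let in-flight tasks finish, then let the old threads die
    }
  }, rotateEvery, rotateEvery, unit)

  def submit(task: Runnable): Unit = current.get().submit(task)
}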

 

6. File leaks

You will find that as the number of request sessions grows, Spark creates a huge number of directories on HDFS and on local disk; eventually there are so many directories in both places that the file systems grind to a halt. In YDB we added handling for this situation as well.

 

7. deleteOnExit memory leak


Why are these objects held there? Let's look at the source code.
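Although the heap screenshots are not reproduced here, the relevant JDK behaviour is that File.deleteOnExit() records the path in a static set inside java.io.DeleteOnExitHook that is only drained when the JVM shuts down, so in a long-running driver every registered temporary file is retained forever. A minimal sketch of the leaky pattern and a safer alternative (illustrative, not Spark's code):

import java.io.File
import java.nio.file.Files

object TempFileUsage {
  // Leaky pattern in a long-lived JVM: the path string stays in
  // java.io.DeleteOnExitHook's static set until the process exits.
  def leaky(): File = {
    val f = Files.createTempFile("session", ".tmp").toFile
    f.deleteOnExit()
    f
  }

  // Safer pattern: delete eagerly as soon as the session/query is done.
  def eager[T](work: File => T): T = {
    val f = Files.createTempFile("session", ".tmp").toFile
    try work(f)
    finally f.delete()
  }
}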


8. JDO memory leak

More than 100,000 JDOPersistenceManager instances had accumulated.


9. Listener memory leak

Monitoring with debugging tools showed that, as time goes on, Spark's listener notification (post) gets slower and slower.

All of the code turned out to be stuck in onPostEvent.


The jstack output was as follows.


Studying the call logic, we found that it loops over all the listeners, and the listeners themselves do essentially nothing; that is what produces the jstack trace above.


A look at the heap showed more than 300,000 listeners registered.


Most of them turned out to be the same listener, so we checked the corresponding source code.


We finally pinpointed the problem.

It is indeed a bug at this spot: every time a JDBC connection is created, Spark adds another listener, so over time the listeners pile up without bound. To fix this, I changed a single line of code and moved on to the next round of stress testing.
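A generic sketch of the kind of one-line guard that fixes this class of bug (the object and method names below are placeholders, not the code that was actually patched): register the shared listener once per SparkContext instead of once per JDBC connection.

import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.SparkListener

// Hypothetical guard: register the shared listener exactly once instead of on
// every new JDBC connection.
object ListenerRegistration {
  private val registered = new AtomicBoolean(false)

  def ensureRegistered(sc: SparkContext, listener: SparkListener): Unit = {
    if (registered.compareAndSet(false, true)) {
      sc.addSparkListener(listener) // previously this ran once per connection
    }
  }
}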


22. Spark source code tuning

Testing showed that even with only one record, a single SQL query through Spark takes about one second. For many ad-hoc queries, a one-second wait is very unfriendly to the user experience. To address this, we made targeted optimizations in the detailed code paths of Spark and Hive; after tuning, response time dropped from the original 1 second to 200-300 milliseconds.


The changes we made are listed below.

1. Directory creation in SessionState takes a long time


Users of Hadoop NameNode HA will also notice that if the first NameNode listed is in standby state, this step becomes even slower, well beyond one second. So besides changing the source code, anyone using NameNode HA should make sure the active node is listed first.

2. HiveConf initialization takes too much time

Frequent HiveConf initialization has to read several XML files such as core-default.xml, hdfs-default.xml, yarn-default.xml, mapreduce-default.xml, and hive-default.xml, all of which are embedded inside jar files. First, extracting them from the jars takes a fair amount of time; second, re-parsing these XML files on every initialization also takes time.
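One way to avoid the repeated XML parsing is sketched below (an assumption about the general approach, not the exact change we made): build a template HiveConf once and hand out copies via HiveConf's copy constructor, so the default XML resources are read from the jars only the first time.

import org.apache.hadoop.hive.conf.HiveConf

object CachedHiveConf {
  // Parse core-default.xml, hdfs-default.xml, hive-default.xml, ... only once.
  private lazy val template: HiveConf = new HiveConf()

  // Hand out cheap copies instead of re-running the full XML load on every call.
  def newConf(): HiveConf = new HiveConf(template)
}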


3. Serializing the Hadoop Configuration passed via broadcast is expensive

- The Configuration is serialized with compression, which suffers from a global-lock problem.

- Each serialization of the Configuration carries far too many useless settings: more than 1,000 entries, over 60 KB. After removing the entries that do not need to be shipped, we got it down to 44 entries, about 2 KB in size (a sketch of this trimming follows the list).
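A sketch of the trimming idea (the whitelist below is purely illustrative; the actual 44 keys we kept are not listed in this document): copy only the entries the tasks really need into a fresh Configuration before broadcasting it.

import org.apache.hadoop.conf.Configuration

object SlimConfiguration {
  // Illustrative whitelist only; the real list of 44 required keys is not shown here.
  private val requiredKeys = Seq(
    "fs.defaultFS",
    "hive.metastore.uris",
    "mapreduce.input.fileinputformat.split.maxsize"
  )

  // Build a minimal Configuration so the broadcast payload shrinks from ~60 KB
  // of >1,000 entries to a couple of KB.
  def slim(full: Configuration): Configuration = {
    val out = new Configuration(false) // don't reload the default XML resources
    requiredKeys.foreach { key =>
      val value = full.get(key)
      if (value != null) out.set(key, value)
    }
    out
  }
}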


4. Improving the Cleaner for Spark broadcast data

 

Because of the SPARK-3015 bug, Spark's cleaner currently reclaims data in single-threaded mode.

Note the comment in the Spark source code:


The single-threaded bottleneck is the cleaner for broadcast data, which has to reach many machines and therefore interacts over the network via Akka.

The SPARK-3015 bug report notes that if reclamation runs with very high concurrency, the network becomes congested and a large number of timeouts appear.

Why is there so much to reclaim at once? Because the cleaner is essentially driven by System.gc(), which runs periodically: by default, after about 30 minutes of accumulation, or after a GC occurs, the cleaner fires. Everything is then released in one concentrated burst of concurrent Akka calls, so it is no surprise that the network is momentarily overwhelmed.

Single-threaded reclamation, however, means the reclamation rate is essentially constant. When query concurrency is high, reclamation cannot keep up with the rate at which cleanup work is generated, the cleaner's backlog keeps growing, and the process eventually runs out of memory (YDB has been modified to limit the concurrency of foreground queries).

Neither OOM nor capping concurrency is what we want; under high concurrency, this single-threaded reclamation simply cannot keep up with demand.


In our view, the official approach is not a complete solution for the cleaner. Concurrent reclamation has to be supported; the real problem to solve is the Akka timeouts.
So let's think it through: Akka times out because the cleaner grabs too many resources, so why not bound the cleaner's concurrency? For example, use 4 worker threads instead of letting it saturate every available thread by default. Wouldn't that address the cleaner's reclamation speed and the Akka problem at the same time?

In the end, we modified Spark's ContextCleaner to reclaim broadcast data with multiple threads while capping the number of concurrent threads, which solved the problem.
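A rough sketch of this bounded-concurrency idea (an illustration, not the actual ContextCleaner patch): push removeBroadcast-style work onto a small fixed-size pool, so reclamation runs in parallel but can never saturate the network or Akka.

import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical bounded reclamation pool: broadcast cleanup tasks run on 4 worker
// threads instead of either a single thread (too slow) or unbounded concurrency
// (Akka timeouts from network congestion).
class BoundedBroadcastCleaner(parallelism: Int = 4) {
  private val pool = Executors.newFixedThreadPool(parallelism)

  def removeBroadcastAsync(broadcastId: Long)(remove: Long => Unit): Unit =
    pool.submit(new Runnable {
      override def run(): Unit = remove(broadcastId) // e.g. blockManagerMaster.removeBroadcast(...)
    })

  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}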
