Memory leak: investigating leaked threads

If you only care about the actual troubleshooting of this case, jump straight to the section "Back on track: the real processing logic" below.
Reference link: https://www.cnblogs.com/guozp/p/10597327.html

Basics

Memory leaks (Memory Leak)

  1. In the JVM, memory is managed automatically and reclaimed by the GC, so memory leaks do not occur under normal circumstances and are therefore easy to overlook.
  2. A memory leak means that useless objects (objects no longer used) keep occupying memory, or that their memory cannot be released in time; the resulting waste of memory space is what we call a memory leak. Leaks can be serious yet hard to detect, so developers may not even realize one exists and have to watch for it themselves; in severe cases no more memory can be allocated and the application goes straight to OOM.
  3. Be careful to distinguish a memory leak from a memory overflow.

Symptoms of a memory leak

  • The heap or perm/metaspace area keeps growing with no downward trend, eventually triggering continuous Full GC or even a crash.
  • For low-traffic applications this may be hard to notice, but the end state is the same as described above: memory keeps growing.

perm / metaspace leak

  • This area stores class and method metadata and run-time constants. If an application loads a large number of classes, the data kept in the perm region is correspondingly large. A large number of interned String objects can also make this area keep growing (a small sketch follows this list).
  • A more common case is leakage caused by dynamically compiled Groovy classes, which is not expanded on here.
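To make the interned-String point concrete, here is a minimal sketch (my own illustration, not code from the article): every iteration interns a brand-new string, so the intern pool only ever grows. On JDK 6 and earlier interned strings lived in PermGen; since JDK 7 they live on the heap, but the string table grows in the same way.

    public class InternGrowthDemo {
        public static void main(String[] args) {
            // Every string interned here is unique, so nothing is ever reused:
            // the intern pool keeps growing for as long as the loop runs.
            for (long i = 0; ; i++) {
                ("leaked-intern-" + i).intern();
                if (i % 1_000_000 == 0) {
                    System.out.println("interned " + i + " unique strings");
                }
            }
        }
    }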

heap leak

The more common causes of heap memory leaks:
  1. Static collections.
  2. Listeners: listeners are registered but often not removed when the owning object is released, which increases the chance of a memory leak.
  3. Various connections: database, network, IO, etc.
  4. References between inner and outer classes: references held by inner classes are easy to forget, and once retained they can keep a whole chain of objects from being released. A non-static inner class holds an implicit strong reference to its enclosing instance, so as long as the inner object is alive, the outer object cannot be released, causing a memory leak (see the sketch after this list).
  5. Singletons: incorrect use of the singleton pattern is a common cause of memory leaks. Once initialized, a singleton exists as a static variable for the entire lifetime of the JVM; if it holds a reference to an external object, that object can never be collected and memory leaks.
  6. Other third-party classes.
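To illustrate points 4 and 5 above, here is a minimal sketch (mine, not from the article): a non-static inner class instance carries an implicit strong reference to its enclosing object, and a static, singleton-style holder keeps those references alive for the lifetime of the JVM, so the outer objects can never be collected.

    import java.util.ArrayList;
    import java.util.List;

    public class InnerClassLeakDemo {

        // Singleton-style static holder: lives as long as the JVM does.
        private static final List<Object> CACHE = new ArrayList<>();

        // Roughly 1 MB per outer instance, reachable only through the inner class.
        private final byte[] payload = new byte[1024 * 1024];

        // Non-static inner class: every Task implicitly references its
        // enclosing InnerClassLeakDemo instance (and its payload).
        class Task {
        }

        public static void main(String[] args) {
            // Only the tiny Task objects are stored, but each 1 MB outer object
            // stays reachable through the implicit reference and leaks; with a
            // default heap this loop eventually ends in an OutOfMemoryError.
            for (int i = 0; i < 10_000; i++) {
                InnerClassLeakDemo outer = new InnerClassLeakDemo();
                CACHE.add(outer.new Task());
            }
            System.out.println("cached tasks: " + CACHE.size());
        }
    }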

This case (a thread leak)

Symptoms in this case

  1. Memory usage was around 80%+ and kept climbing, peaking at 94%
    Memory footprint

  2. Young GC was frequent, and when memory was relatively high it was accompanied by Full GC
    gc times

  3. The number of threads was unusually high, peaking at 20k+ (this is the key clue, but attention only turned to it later)
    Thread

  4. The logs contained a large number of exceptions, mainly of three kinds:
    • fastJson errors
      fastJson error.

    • Errors when calling the language translation service interface
      translation service

      Translation Error Codes

    • Request errors against the second-party package provided by the algorithm team
      predict error

      Algorithms call error

The detour: early wrong turns

  1. At first we only noticed that the machine's memory usage was high, 80%+, so our thinking revolved around memory.
  2. At that point we had not looked at the thread count; going by symptoms 1, 2 and 4 we investigated without noticing symptom 3, got nowhere, and only discovered symptom 3 when we re-examined the problem.
  3. Because of the many errors in symptom 4 plus the high memory usage, we formed these hypotheses (the service consumes a lot of messages from MQ):
    • The errors of symptom 4 increased the tasks in the MQ retry queue, the message backlog enlarged the MQ consumption task queue, and memory grew.
    • The errors increased the number of tasks; the exception-handling logic retries through a thread pool, so its task queue grew and memory grew with it.

The detour: ruling out the suspects

  • Locating the exceptions
    • The fastJson parsing exception at first looked like a known fastJson bug (in earlier versions, a Long value written into a Map was parsed back by the Int parser by default, causing an overflow error; later versions fixed this: now even a value written as Long is parsed as int if it fits within the int range and as long if it exceeds it, and a field declared as Long parsed directly comes back as Long). The business code, however, calls parse directly, and we found that the second-party package declares the field as int while the value in some messages exceeds the int range.
    • The EAS algorithm call-chain errors (404) existed before, but their root cause was never located (pointers welcome); they are handled here with try/catch.
    • The translation service exception: its root cause was not found, and it recovered after the application was restarted. We had forgotten the try/catch here; it seems every call to an external dependency needs one.
  • Confirm whether the error retry queue in the business logic is the problem
    • No: the retry flow is only entered for business-related failures; more on this later.
  • Confirm whether the cause is a backlog in the MQ message queue or the MQ retry queue, and whether the consumer adjusts its threads automatically (metaq/rocketmq)
    • No: MQ protects the consumption queue with flow control.
    • The consumer pulls messages from the broker asynchronously; when the processQueue holds too many messages, the pull rate is throttled. For concurrent consumption there are three flow-control checks (sketched after this list):
      1. whether the number of messages in the queue exceeds 1000
      2. whether the estimated memory occupied by the messages exceeds 100 MB
      3. the offset span of the messages still in the queue (mostly messages whose consumption failed and whose feedback to the broker also failed); too large a span may mean more duplicates, and the default maximum span is 2000.
    • Flow-control source: com.alibaba.rocketmq.client.impl.consumer.DefaultMQPushConsumerImpl#pullMessage; the variables circled in the screenshot all have default values in that class.
      Flow Control Source
  • metaq also adjusts its thread pool dynamically on its own: in theory it adds threads when they run short and removes them when there are too many (adjustThreadPoolNumsThreshold defaults to 100,000), but that code is commented out, so in theory no automatic adjustment takes place; an excess of tasks would therefore not create an excessive number of threads here either.
    • When start is called, a batch of scheduled tasks is launched
      mqStart

    • Among those scheduled tasks is the one that adjusts the thread pool
      Start timing adjustment

    • Launching the adjustment task
      Adjustment
      Adjust the specific code.
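Since the three flow-control checks above carry the main reasoning, here is a deliberately simplified sketch of them. It is not the actual RocketMQ source: the class, method, and constant names are mine, and only the threshold values (1000 messages, 100 MB, offset span 2000) come from the text; the real logic lives in com.alibaba.rocketmq.client.impl.consumer.DefaultMQPushConsumerImpl#pullMessage.

    public class PullFlowControlSketch {

        static final long MAX_CACHED_MESSAGES = 1000;  // check 1: messages cached per queue
        static final long MAX_CACHED_SIZE_MB  = 100;   // check 2: estimated memory per queue
        static final long MAX_OFFSET_SPAN     = 2000;  // check 3: offset span of unconsumed msgs

        static boolean shouldDelayPull(long cachedCount, long cachedSizeMb, long offsetSpan) {
            return cachedCount > MAX_CACHED_MESSAGES
                    || cachedSizeMb > MAX_CACHED_SIZE_MB
                    || offsetSpan > MAX_OFFSET_SPAN;
        }

        public static void main(String[] args) {
            // A queue with a big backlog of failed, un-acked messages trips check 3,
            // so the consumer slows its pulls instead of piling more into memory.
            System.out.println(shouldDelayPull(800, 50, 5000)); // true
            System.out.println(shouldDelayPull(200, 10, 100));  // false
        }
    }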

Back on track: the real processing logic

  • From the analysis above, the growth was not caused by exceptions inflating the task queues. At this point we noticed symptom 3: the number of active threads was clearly far too high. That had to be a thread leak the GC could not reclaim, which is why memory kept growing. So by this point the cause was essentially identified; what remained was to verify it and locate the exact code block.
  • Without dedicated monitoring, the usual approach is to look at the machine's memory and CPU and at the JVM heap and GC; when the thread picture is clearly abnormal, jstack the relevant threads; and if the code block still cannot be located, take a dump and analyze it (a small in-JVM helper is sketched below).
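As a complement to the machine-level checks that follow (an addition of mine, not a step from the original investigation), the same thread-count and thread-state overview can also be pulled from inside the JVM with the standard ThreadMXBean, which helps when you cannot log in to the box:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.EnumMap;
    import java.util.Map;

    public class ThreadStatsProbe {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            System.out.println("live threads:  " + mx.getThreadCount());
            System.out.println("peak threads:  " + mx.getPeakThreadCount());
            System.out.println("total started: " + mx.getTotalStartedThreadCount());

            // Group live threads by state, roughly what a jstack summary shows.
            Map<Thread.State, Long> byState = new EnumMap<>(Thread.State.class);
            for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
                if (info != null) {
                    byState.merge(info.getThreadState(), 1L, Long::sum);
                }
            }
            byState.forEach((state, count) -> System.out.println(state + ": " + count));
        }
    }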
Logging in to the affected machine
  • top: check memory usage (the screenshot here was taken some time after a restart). CPU usage was also fairly high but dropped quickly, which cost us some time; then top -Hp pid to see which thread had the high usage, and jstack to see what that thread was doing
    top

  • Confirm whether a thread size was explicitly specified in the JVM parameters; it was not, so the default value is used

    gc parameters

  • Check the heap and GC status
  • Check the thread situation: jstack already shows that there are a lot of threads and can locate them, but for convenience we took a dump to inspect the stacks in detail
    • Thread count
      • cat /proc/{pid}/status (the thread count was astonishingly high)
        Command Line number of threads

      • Since the thread count was already high yet new threads could still be created, we checked the number of processes an ordinary Linux user may create with: cat /etc/security/limits.d/90-nproc.conf. The configured value is quite large, far above the current count.

    • Thread information
      The number of threads

    • Thread states
      Thread State

    • Locating the problem thread
      • AbstractMultiworkerIOReactor ==> httpAsyncClient ==> as shown in the figure, this alone does not point to a code block, so we used maven to find the service that references the jar ==> the specific second-party package
    • If a new thread is created on every call and never terminated: running threads are GC roots, so a thread that has not ended is never collected. Creating a large number of running threads therefore drives memory usage up. But how many threads can actually be created in production?

    • The problem code block
      • At the start of the method, a new client is initialized on every call; the underlying wrapper uses httpAsyncClient, which uses the NIO model, and each initialization creates one boss thread and 10 worker threads
        The method begins

      • At the end of the method, shutDown is called on the client
        End method

    • From the symptoms and the corresponding thread stacks we can conclude that the threads leak here: the shutDown call fails to close the client's thread pool, and because the threads were started in NIO mode and never end, they keep accumulating. Changing the client to a singleton, or restricting it with a system-wide thread pool, solves the problem in production (a minimal sketch of the singleton approach follows below).
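As noted in the last point above, here is a minimal sketch of the singleton approach. It assumes Apache HttpAsyncClient's public API (HttpAsyncClients.createDefault); the class name and the shutdown-hook arrangement are mine, not the project's actual second-party package code.

    import java.io.IOException;
    import java.util.concurrent.Future;

    import org.apache.http.HttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
    import org.apache.http.impl.nio.client.HttpAsyncClients;

    public final class SharedHttpAsyncClient {

        // Created and started once; the NIO boss/worker threads exist only once.
        private static final CloseableHttpAsyncClient CLIENT = HttpAsyncClients.createDefault();

        static {
            CLIENT.start();
            // Close the shared client only when the JVM shuts down.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    CLIENT.close();
                } catch (IOException ignored) {
                }
            }));
        }

        private SharedHttpAsyncClient() {
        }

        public static Future<HttpResponse> get(String url) {
            // No per-call client creation and no per-call shutDown, so the
            // "new client + failed shutdown on every call" leak pattern is gone.
            return CLIENT.execute(new HttpGet(url), null);
        }
    }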

A partial walkthrough of the httpAsyncClient source
  • Startup
    • Permanent threads
      • Reactor Thread: handles connect
      • Worker Thread: handles read and write
    • The http startup thread
    • Thread pool naming: these are the pool-N-thread-M threads seen above (a short illustration follows after this list)
      Ordinary thread pool naming
    • ioEventDispatch thread
      • Startup
        Startup
      • Worker thread
        Worker thread
      • Worker thread name
        Worker thread name
      • IO worker run in detail
      • Worker thread implementation


  • shutdown is not analyzed in detail here: after it is called, the threads should exit their loops and terminate, along with a series of cleanup actions such as closing connections
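A side note on the pool-N-thread-M naming mentioned above (my illustration, not part of the httpAsyncClient source walkthrough): that pattern is simply what java.util.concurrent's Executors.defaultThreadFactory() produces, which is why the leaked threads were hard to attribute at first; giving a pool a named ThreadFactory makes such threads obvious in a jstack dump.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ThreadNamingDemo {
        public static void main(String[] args) throws InterruptedException {
            // Default factory: threads are named pool-1-thread-1, pool-1-thread-2, ...
            ExecutorService anonymous = Executors.newFixedThreadPool(2);
            anonymous.submit(() -> System.out.println(Thread.currentThread().getName()));

            // Named factory: threads show up as http-async-worker-1, -2, ... in jstack.
            AtomicInteger seq = new AtomicInteger();
            ThreadFactory named = r -> new Thread(r, "http-async-worker-" + seq.incrementAndGet());
            ExecutorService namedPool = Executors.newFixedThreadPool(2, named);
            namedPool.submit(() -> System.out.println(Thread.currentThread().getName()));

            Thread.sleep(200);
            anonymous.shutdown();
            namedPool.shutdown();
        }
    }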
Remaining doubts
    • Although the method creates a new client on every call, shutDown is still called in the finally block in the end, so why does closing fail? Using a singleton as above merely papers over whatever makes every new-client-then-shutdown cycle fail.
    • When an httpAsyncClient request fails, httpclient.close() blocks the main thread. Reading the internal source, the close method closes the connection pool first and then the thread; the corresponding httpAsyncClient thread is still running but stays blocked in epollWait (see the thread states above). We have not yet determined why shutdown fails to close the client after a failed call, and there is no exception log, but this is the main cause of the thread leak.
    • In local tests the shutdown method closes normally, which is very strange. If you know the specific reason, please share it.
    • Description link


Origin www.cnblogs.com/Leo_wl/p/10943302.html