Troubleshooting frequent OOMs in Open-Falcon Graph

This article walks through the investigation of frequent OOMs in the Falcon-Graph module and the fixes that followed.

Business Background

Falcon-Graph is responsible for persisting monitoring data so that users can later query it, aggregate it, and so on.

In early April, Open-Falcon's business volume grew steadily, from 290 million counters to the current 320 million. As a result, memory usage across the graph cluster rose by 8% on average (now 73%), and machine load (1-minute load average) rose by 5% on average (now 18).

Summarizing the on-site observations over three days: at 20:00, memory across the whole cluster rises and some machines hit OOM, while other machines hit OOM at irregular times.

Investigation process

1. Investigating the service itself

We used go pprof to profile the service itself and captured the following information while the problem was occurring:

cpu:


mem:


Compared with normal conditions, the CPU profile showed no major change in each function's share, but memory kept growing.
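For readers who want to capture a comparable memory snapshot, here is a minimal sketch using Go's standard runtime/pprof package. It is not the code used in the investigation; the helper names (writeHeapProfile, heapProfileSize, heapInUse) are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile forces a GC so the statistics are current, then writes
// the heap profile in pprof format to w.
func writeHeapProfile(w io.Writer) error {
	runtime.GC()
	return pprof.WriteHeapProfile(w)
}

// heapProfileSize (hypothetical helper) captures a heap profile into memory
// and reports how many bytes were written.
func heapProfileSize() (int, error) {
	var buf bytes.Buffer
	if err := writeHeapProfile(&buf); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// heapInUse returns the number of bytes in in-use heap spans, a quick
// figure to compare between a healthy and an unhealthy process.
func heapInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	n, err := heapProfileSize()
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Printf("heap profile: %d bytes, heap in use: %d bytes\n", n, heapInUse())
}
```

In a real service the profile would normally be written to a file, or served over HTTP via net/http/pprof, and then inspected with go tool pprof.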

Since data flows in at a steady rate, we suspected that blocking or a similar problem during persistence had slowed disk writes, causing data to pile up in memory.
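The reasoning behind this suspicion can be put as simple arithmetic: if data arrives faster than it is flushed, the difference accumulates in memory. The sketch below only illustrates that relationship; the function name and the numbers are hypothetical and are not measurements from the cluster.

```go
package main

import "fmt"

// backlog estimates how much data piles up in memory after the given number
// of seconds when data arrives at inRate units/s but is flushed to disk at
// outRate units/s. It returns 0 when the disk keeps up.
func backlog(inRate, outRate, seconds float64) float64 {
	if outRate >= inRate {
		return 0
	}
	return (inRate - outRate) * seconds
}

func main() {
	// Hypothetical figures: 100 MB/s in, 80 MB/s out, for 60 seconds.
	fmt.Printf("backlog: %.0f MB\n", backlog(100, 80, 60))
}
```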

The block information reported by go pprof is shown below:

The total is 0, which rules out blocking in the service's write path. We then started investigating the other services on the machine.
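As a rough illustration of what a block profile records, the sketch below turns on block profiling, makes one goroutine wait on a channel, and counts the stacks that end up in the profile. Note that in Go the block profile only collects events after runtime.SetBlockProfileRate has been set to a non-zero value. The function name blockedStacks and the sleep standing in for a slow write are assumptions for illustration, not code from Falcon-Graph.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

// blockedStacks enables block profiling, forces the caller to wait on a
// channel, and returns the number of stacks recorded in the block profile.
func blockedStacks() int {
	runtime.SetBlockProfileRate(1) // record every blocking event
	defer runtime.SetBlockProfileRate(0)

	done := make(chan struct{})
	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for a slow disk write
		done <- struct{}{}
	}()
	<-done // the caller blocks here until the "write" finishes

	return pprof.Lookup("block").Count()
}

func main() {
	fmt.Println("stacks in block profile:", blockedStacks())
}
```

A write path that really did stall on disk I/O or lock contention would be expected to show up with non-zero entries in such a profile.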

2. Investigating other services on the machine

(1) The on-site investigation found that the cleanup service (graph-clean), which runs every day at 20:00, consumes a large amount of CPU for a short time, causing the load to rise rapidly (> 30), as shown below:

(2) After discussing the on-site findings with colleagues, we found that the data forwarding service (Transfer) sees momentary bursts of TCP connections and data validation that make CPU consumption surge, pushing the instantaneous load up to 32, as shown below:

Solution

1. For the unreasonable graph-clean code

  • Modify the graph-clean code to flatten the peak and reduce the run frequency, thereby lowering CPU overhead

  • Verify on the test cluster (completed)

  • Canary release on one production machine and observe (completed: the short bursts of heavy CPU usage were resolved, and the graph service on that machine no longer hit OOM for this reason after deployment)

  • Gradually roll out to the remaining production machines (completed: over one week of observation, this problem no longer caused graph service OOMs)

2. For the mixed deployment of transfer and graph

  • Deploy transfer and graph separately (being split gradually as part of the internationalization process)

3. Review the graph code, fix unreasonable data structures, and tune graph to reduce system overhead.
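The article does not show the graph-clean change itself. One common way to flatten a CPU peak is to process the work in small batches with a pause in between, instead of all at once. The sketch below shows that general technique only; the function names, the batch size, the pause, and the file names are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// splitBatches cuts items into consecutive slices of at most size elements.
func splitBatches(items []string, size int) [][]string {
	if size <= 0 {
		size = 1
	}
	var batches [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		batches = append(batches, items[start:end])
	}
	return batches
}

// cleanSmoothly runs clean on every item, sleeping between batches so the
// CPU cost is spread over time. It returns the number of items cleaned.
func cleanSmoothly(items []string, size int, pause time.Duration, clean func(string)) int {
	cleaned := 0
	batches := splitBatches(items, size)
	for i, batch := range batches {
		for _, item := range batch {
			clean(item)
			cleaned++
		}
		if i < len(batches)-1 {
			time.Sleep(pause) // give other services on the machine room to run
		}
	}
	return cleaned
}

func main() {
	// Hypothetical file names standing in for the data to be cleaned.
	files := []string{"a.rrd", "b.rrd", "c.rrd", "d.rrd", "e.rrd"}
	n := cleanSmoothly(files, 2, 50*time.Millisecond, func(name string) {
		fmt.Println("cleaning", name)
	})
	fmt.Println("cleaned", n, "files")
}
```

The trade-off is that a full cleanup pass takes longer, which is usually acceptable for a background job that shares a machine with latency-sensitive services.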

This article first appeared on the WeChat public account "Xiaomi Cloud Technology".


Reproduced from: https://juejin.im/post/5cecd1256fb9a07f0c46620e

Origin blog.csdn.net/weixin_34114823/article/details/91472442