A record of troubleshooting frequent etcd OOMs

This article documents the process of investigating and locating frequent OOMs in an etcd cluster, written up here to share with everyone.

The story begins on a June evening. I had just left the office building when I got a message from my boss: one of the etcd clusters that our database depends on had been hitting OOM frequently in recent days, which was a serious stability risk and needed to be investigated. I do database development at an e-commerce company, and June is right in the middle of the 6.18 period when every e-commerce company is promoting hard. Ours is a core database service that stores data for the whole company, so the problem was severe.

At the time I did not know etcd well and had not read its source code, but I did know its overall architecture and design, so I started analyzing. The whole analysis took two days and can be divided into four stages. What follows is the complete process of analysis and root-causing.

Stage one. I knew that etcd uses boltDB for storage from version 3.0 onward, and boltDB is something I am familiar with: its write transaction is globally exclusive. Since I had not touched the boltDB code in a long time (and had not brought my work laptop home that day), I could only reason from memory. My suspicion was lock contention between reads and writes: if there is a large write transaction that runs for a long time, all read transactions will starve. Combined with that night's analysis, my initial conclusion was that reads were being blocked by a write, most likely a node taking a snapshot, since that generally involves a large transaction. And because etcd is written in Go, if the standard library lock is used, the read requests cannot be terminated from the client side; the clients keep timing out and retrying, which keeps pushing memory usage up (the whole episode lasted about three hours). This conclusion seemed reasonable.
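
As background for that line of reasoning, here is a minimal sketch of boltDB's transaction API (assuming the go.etcd.io/bbolt import path; the file, bucket, and key names are made up for illustration). bolt allows any number of concurrent read-only transactions but only one read-write transaction at a time, which is the "globally exclusive" write mentioned above.

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open (or create) a bolt database file. Path, bucket, and key names
	// here are made up purely for illustration.
	db, err := bolt.Open("example.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Update runs a read-write transaction. bolt allows only one
	// read-write transaction at a time, so a long-running Update holds
	// the exclusive writer slot for its entire duration.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("keys"))
		if err != nil {
			return err
		}
		return b.Put([]byte("foo"), []byte("bar"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// View runs a read-only transaction against a consistent snapshot;
	// any number of these can run concurrently.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("keys")).Get([]byte("foo"))
		fmt.Printf("foo = %s\n", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```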

The figure below is a screenshot of the monitoring for one etcd node.

[Figure: memory monitoring of an etcd node]

Stage two. To verify the previous night's conclusion, I began collecting material on etcd's design, including a very well written article that helped me a lot and that I recommend to anyone who needs it: "The implementation principles of etcd, a highly available distributed store". Fortunately the etcd code is much easier to read than the cockroach code, although personally I think cockroach's code quality is an order of magnitude better than etcd's, in both architectural design and coding style.

Back to the topic. After carefully reading the etcd code related to snapshots and to the read/write path, I found that the previous night's reasoning did not hold: etcd does not simply use boltDB as its storage. I will not expand on that part here; in short, the problem was not caused by the read-write lock.

Stage three. I started analyzing the logs (without knowing etcd's code logic, reading the logs did not help much) and found that during the periods when memory was growing the system showed no anomalies: CPU, network, and disk were all normal, and the number of TCP connections was also very stable. The investigation hit a dead end. But the turning point came quickly. The colleague responsible for maintaining the etcd cluster removed the node that OOMed most often (that node had reportedly suffered a disk failure at some point) and added a new node in its place, and a colleague also tried removing the proxy (which can be regarded as the etcd client). At first I did not know about this important change, and then the magic happened: the cluster stopped OOMing, whereas before it had OOMed roughly every three or four hours. The matter seemed to be over. But look at the following monitoring chart.


[Figure: memory monitoring of the newly added node]

Although OOM no longer occurred, memory on the new node was still growing a lot. What was etcd doing during this time? Carefully analyzing this node's logs for the corresponding period, I found that at the points in time where memory dropped off a cliff there were many log entries about network connections being broken. This was the key issue. Cross-checking the logs of the several other nodes whose memory had also grown but which fortunately had not OOMed, I found that every cliff-style memory drop was accompanied by the same kind of network-anomaly log.

So I started looking at all of the OOMs together, and found that the OOMs on different nodes form a continuous line on the timeline; see below.


[Figure: OOM timeline across the cluster nodes]

You can see that when one node crashes from OOM, another node's memory immediately starts to grow until it OOMs as well, and the cycle repeats. A preliminary analysis concluded that the cause was client reads (the actual business writes very little). This business mainly stores cluster topology information in the database and performs many watch operations, which explains why the read pressure shifts to another node at the moment of an OOM.
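
For readers who have not used etcd's watch, here is a minimal sketch of what such a watch looks like with the clientv3 API (the endpoint and the /topology/ prefix are placeholders, and the import path assumes a recent client release): the client registers a watch on a key prefix and receives change events over a channel.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint and key prefix are placeholders for illustration.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Each Watch call registers a watcher on the etcd server; the
	// returned channel delivers change events until the context is
	// cancelled or the client is closed.
	for resp := range cli.Watch(ctx, "/topology/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```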

Stage four. Since the problem came from watches, I began to analyze etcd's watch mechanism, but no matter how "reasonable" my deductions were, they could not fully explain the monitoring data. We noticed that there is an HA proxy sitting between the proxy and the etcd servers, responsible for load balancing. A colleague told me that the HA proxy had indeed been closing connections on the client side without closing the corresponding server side, so we began to suspect the HA proxy. Verification was simple: we restarted the HA proxy, and no miracle happened, which ruled it out. Then an important piece of monitoring data caught our attention: etcd's watcher statistics were abnormal. There were only a few hundred watch connections (clients), but tens of thousands of watchers, which is completely inconsistent with the business logic, because the business does not watch nearly that much. We were getting closer and closer to the truth. So we turned to the code that uses the etcd client. The reason we had not suspected it at first was that, on one hand, this proxy service had been running for a long time without any problem, and on the other hand, the relevant code is really very simple, just a few lines; yet it is precisely in those few lines that the pit was hiding.

Because I did not initially suspect those few simple lines of code, I was drawn to another question: why doesn't the etcd client cancel the watch? I spent half a day going through the etcd client's code logic and confirmed that the excessive watchers were not caused by a bug in the client itself. At the same time I discovered that the proxy, when using the etcd client, never used the watch cancellation mechanism. Specifically, let's look at the code:


[Figure: screenshot of the proxy code that creates the watch]

I began to focus on the circumstances under which a watcher would be closed (a watcher is a channel), because a closed watcher automatically triggers a re-watch, and perhaps that was the reason for the large number of watchers. But after re-reading and verifying the etcd client code, I found that when a watcher is closed the client does cancel the subscription. So I started checking whether the external logic calls the watch API repeatedly, and then the truth emerged.


[Figures: screenshots of the business code that calls the watch API and re-watches on error]

As the code shows, when an error occurs the program does not cancel the previous watch but directly starts a new one. This is what caused the ever-growing number of watchers. Once the problem was found, the fix was very simple, so I will not go into it here.
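
Since the original code only survives as screenshots, the following is a hedged reconstruction of the pattern described above rather than the actual business code: the buggy loop opens a new Watch on every processing error without cancelling the previous one, so the old watcher stays registered on the etcd server; the fixed loop gives each watch its own cancellable context and cancels it before re-watching.

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// handle stands in for the business logic that processes watch events.
func handle(resp clientv3.WatchResponse) error {
	for _, ev := range resp.Events {
		log.Printf("%s %q", ev.Type, ev.Kv.Key)
	}
	return nil
}

// watchLoopBuggy reconstructs the problematic pattern: on any processing
// error it opens a new watch without cancelling the old one, so watchers
// keep piling up on the etcd server.
func watchLoopBuggy(cli *clientv3.Client, key string) {
	for {
		wch := cli.Watch(context.Background(), key, clientv3.WithPrefix())
		for resp := range wch {
			if err := handle(resp); err != nil {
				break // re-watch, but the previous watch is never cancelled
			}
		}
	}
}

// watchLoopFixed gives each watch its own cancellable context and cancels
// it before re-watching, so the previous subscription is released.
func watchLoopFixed(cli *clientv3.Client, key string) {
	for {
		ctx, cancel := context.WithCancel(context.Background())
		wch := cli.Watch(ctx, key, clientv3.WithPrefix())
		for resp := range wch {
			if err := handle(resp); err != nil {
				break
			}
		}
		cancel() // release the old watcher before opening a new one
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	watchLoopFixed(cli, "/topology/")
}
```

With the fixed loop, cancelling the context tells the server to drop the old watcher, so the watcher count stays proportional to the number of keys actually being watched.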

Summary

This troubleshooting exercise made me familiar with etcd's design and implementation, so future problems can be analyzed and located quickly. It also surfaced a few etcd details worth noting, summarized as follows:

a. Too many watches can blow up the etcd server's memory, and in extreme cases lead to OOM.

b. Every watch a client creates must be unsubscribed when it is no longer needed (be sure to do this); see the sketch after this list.

For the specific steps, refer to: http://holys.im/2016/10/19/how-to-stop-etcd-watcher/

The etcd client allows watching multiple keys at the same time (the keys may be the same or different), but the first watcher has the authority to cancel the subscriptions of all watchers: when the first watcher unsubscribes, all the other watchers are unsubscribed automatically as well.

The other watchers can only cancel their own subscriptions, and that cancellation happens only at the etcd client level; the etcd server still holds the subscription until it needs to push a new change for the key, at which point the client is triggered to truly cancel the subscription on the server side.

If you do not cancel a watch subscription and do not consume the watch's data queue, memory consumption keeps growing; the etcd client puts no limit on this (TODO).

c. The etcd server is not suitable for storing large amounts of data. One reason is that its raft snapshot handling is relatively crude: it caches the full data set in memory, which drives up the memory footprint; in containers especially, this easily leads to OOM.
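
As a minimal sketch for point b (endpoints and keys are placeholders, and this illustrates the clientv3 API rather than the proxy's actual code): an individual watch is unsubscribed by cancelling the context passed to Watch, and several watches opened through one Watcher can be ended together by closing that Watcher.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Option 1: give the watch its own context and cancel it to
	// unsubscribe when the watch is no longer needed.
	ctx, cancel := context.WithCancel(context.Background())
	ch := cli.Watch(ctx, "/topology/nodes/", clientv3.WithPrefix())
	go func() {
		for resp := range ch {
			for _, ev := range resp.Events {
				log.Printf("node event: %s %q", ev.Type, ev.Kv.Key)
			}
		}
	}()
	time.Sleep(10 * time.Second)
	cancel() // unsubscribes this watch

	// Option 2: open several watches through one Watcher and close the
	// Watcher to end all of them together. In real code the returned
	// channels must be consumed; they are ignored here only to keep the
	// Close() usage visible.
	w := clientv3.NewWatcher(cli)
	defer w.Close()
	_ = w.Watch(context.Background(), "/topology/shards/", clientv3.WithPrefix())
	_ = w.Watch(context.Background(), "/topology/leaders/", clientv3.WithPrefix())
}
```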

When locating a problem, do not overlook any detail; the truth is often hidden in them, which is perhaps what people mean by "the darkest spot is right under the lamp". Also, the simpler the code, the bigger the pit it may hide (and when there is a problem, it is basically always a big one), and the more it deserves special attention.

One more point: it pays to accumulate knowledge in related fields during ordinary times, so that when the day comes that you need it, you are not caught completely off guard.

Reproduced from: https://www.jianshu.com/p/05486908814d
