Practical notes! Investigating a Kafka stall incident

After a feature went live, the volume of data landing dropped sharply, which made us pretty nervous! The investigation turned out to be a learning process too (most of the credit goes to the leads, but it should be fine to share it with everyone).

1. Is the problem real?

The data team told me that the volume of a certain kind of data had dropped sharply, and I knew right away how serious that was. Worse, the drop started right after my feature went live. My first reaction: where did I get the code wrong? Still, follow the process: compare the request volume against the volume that actually landed, across several dimensions, and confirm the problem.

In fact, during this step we could not confirm that the data we send had declined at all. But that didn't rule the decline out either, so on to the next step!

2. Review the code with experienced colleagues and diff it against the original feature

Honestly, this step was a bit of a shot in the dark. Step 1 hadn't produced solid proof that the problem was on our side, but ours was the only release in that window, so all we could do was examine ourselves.

Fortunately, the review really was useful: I found a pit I had dug myself, and it genuinely could cause the data volume to drop. Fix it immediately!

I breathed a sigh of relief and thought that was the end of it. Not at all: the data volume still didn't recover. How embarrassing!

I started to doubt everything. Was the code not actually deployed? Was production somehow out of sync with my local copy? The test environment passed again and again. I was tempted to push the test-environment code straight to production. Hey, forget it; many things don't bend to human will. Stay rational, and no shortcuts!

3. Sit next to the DBA and watch the data volume together

Since self-examination couldn't save us, off to the DBA. I asked them to help count how the data volume changed after the release, and the result showed barely any difference. Maybe the window was too short to show anything, so we counted again later. Still no change! My God, the blame was still on us.

If the aggregate numbers told us nothing, then I would test with my own account. I performed the operation, watched the data, and found it was sometimes there and sometimes not! Well, nothing more to say.

4. Local debugging, right?

I had originally assumed this was purely a production issue that could be handled with a quick fix. The facts exceeded my expectations. Pushing changes straight online without verification would be irresponsible to both the users and the data, so let's start locally.

Local debugging over the VPN is a bit annoying, but it ran anyway. No problem at all! Now that is embarrassing.

So, on to the next idea!

5. Is the online environment configuration different from the test environment?

So we tried to find any difference: even one extra file, or a file whose modification time didn't match, was worth a try! Of course, for safety's sake we couldn't verify directly in production unless we had solid evidence that the online configuration was at fault. We never found such evidence, so in the end we just copied everything from production into the test environment to verify, and it ran without a hitch!

There was another reason this path didn't hold up: would a configuration that had been running fine suddenly break by itself? Impossible. No way!

6. If nothing else works, is the only option to change the code and debug in production?

Debugging step one: logs, logs everywhere! Add complete logging wherever the previous request logging was incomplete, and ship another release! With logs there would be evidence. But sure enough, the logging itself was botched: it printed the parameters as memory addresses, which told us nothing.
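That kind of useless log usually means an object was logged without a meaningful toString(). A minimal Java sketch of the idea; the class and field names here are made up for illustration, not taken from the real code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical request-parameter class, purely for illustration.
public class ReportRequest {
    private static final Logger log = LoggerFactory.getLogger(ReportRequest.class);

    private final String userId;
    private final String topic;

    public ReportRequest(String userId, String topic) {
        this.userId = userId;
        this.topic = topic;
    }

    // Without this override, log.info("{}", request) prints something like
    // ReportRequest@1a2b3c4d -- effectively a memory address, i.e. the kind of
    // useless log we shipped the first time.
    @Override
    public String toString() {
        return "ReportRequest{userId=" + userId + ", topic=" + topic + "}";
    }

    public void logBeforeSend() {
        log.info("sending report to kafka, request={}", this);
    }
}
```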

After changing the logs, test again with my own account. Same story: sometimes the data gets in, sometimes it doesn't. (Our monitoring method: the DBA set up a temporary Kafka consumer and pulled the data out so we could look at it.) Now what?

Was some machine broken? Requests routed to the bad machine fail, requests routed to a healthy machine succeed. Much later it turned out this was more or less the right direction, but at the time the idea got shot down.

7. Fine, shall we capture some packets?

tcpdump, the packet-capture power tool, with lsof to assist.

The capture was meant to confirm one thing: that the client machine was sending requests to the server machine and the traffic was flowing normally. And indeed it showed the client holding a large number of long-lived connections to the server, with data going out and coming back normally (the SYNs went through). That at least proved the client side was fine! Which left only one possibility: the problem was on the server side! We firmly believed it, but of course we still needed evidence.

By the same logic we captured packets in the other direction on the server machine, grabbing the traffic coming in from the client: perfectly smooth as well! Er...

8. No ideas left; restart the machine?

No, I mean restarting the service. There had been a recent change, and by convention whoever changes it restarts it. That was useless, though, since the previous releases had already restarted it n times. So what now? The only thing left was to restart the server side, that is, the Kafka service itself. Nothing to lose; treat the dead horse as if it could still be saved!

After the restart, verify again. The result: still a mix of successes and failures, it seemed!

9. Change the asynchronous request to a synchronous one?

Out of ideas again, but not willing to give up: why is the test environment fine while production isn't? What's different?

The conclusion: production concurrency is high, and the test environment has none. Then I noticed that this piece of code sends via an asynchronous thread. Could the problem be here?

Never mind, just try changing it to a synchronous request. Another release!

And sure enough, after switching to synchronous sends, user requests slowed to a crawl, but the Kafka messages really did arrive. So was it actually this? Even so, we can't leave it that way: user experience comes first, and turning asynchronous into synchronous just for this would be too painful. Change it back and keep digging!
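For context, a minimal sketch of the two send modes with the standard Kafka Java producer; the broker address and topic name are placeholders, not the real ones from this incident:

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendModes {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("report-topic", "key", "value"); // placeholder topic

            // Asynchronous: send() only appends to the client buffer and returns;
            // failures surface later in the callback, so the caller never blocks.
            producer.send(record, (RecordMetadata meta, Exception e) -> {
                if (e != null) {
                    e.printStackTrace(); // the message was lost or timed out
                }
            });

            // Synchronous: block until the broker acknowledges (or throws).
            // This is what slowed user requests to a crawl, but proved the data arrived.
            RecordMetadata meta = producer.send(record).get();
            System.out.println("acked at offset " + meta.offset());
        }
    }
}
```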

10. Back to the test environment for a concurrent stress test?

After changing it back to asynchronous, we were back to the original mix of successes and failures.

Since the suspicion was high concurrency in production, why not reproduce high concurrency in the test environment? I quickly wrote a looping request script in shell, fired a large number of requests at Kafka, and saw nothing abnormal, so the concurrency theory was crossed off.
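The original loop was a shell script; a rough sketch of the same idea in Java, with the broker, topic, and message counts all placeholders, would look something like this:

```java
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LoopStressTest {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "test-broker:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        int threads = 20;          // simulated concurrency (placeholder)
        int perThread = 10_000;    // messages per thread (placeholder)
        AtomicLong failures = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);

        // KafkaProducer is thread-safe, so all threads can share one instance.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int t = 0; t < threads; t++) {
                new Thread(() -> {
                    for (int i = 0; i < perThread; i++) {
                        producer.send(new ProducerRecord<>("report-topic", "stress-" + i),
                                (meta, e) -> { if (e != null) failures.incrementAndGet(); });
                    }
                    done.countDown();
                }).start();
            }
            done.await();
            producer.flush(); // push out anything still sitting in the client buffer
        }
        System.out.println("failed sends: " + failures.get());
    }
}
```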

11. Let’s check the code again, shall we?

I don't know how many times the code had been checked already, but check it again we must; otherwise what else was there to do? Several of us went through the code together!

However, this was of no use.

12. Send requests directly from the command line, bypassing user behavior?

User behavior is the most authentic verification, but it is also the most cumbersome.

So we set aside all the intermediate links and sent requests straight to the Kafka servers!

There were two ways: (1) send the request through our own code path; (2) send it with the tooling that ships with Kafka. The two gave different results: the request through our code did not get through, while Kafka's own tooling got a response in milliseconds. Hey, doesn't that make me doubt the code all over again?
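A minimal sketch of the "code path" probe: time a single synchronous send with an explicit timeout, so a stuck broker shows up as an exception rather than a silently dropped message. The broker and topic here are placeholders; the "Kafka's own tooling" side was presumably something like the console producer that ships with Kafka.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SingleSendProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("max.block.ms", "10000"); // fail fast instead of hanging forever

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long start = System.nanoTime();
            // Block on the returned future with a timeout so a stall is visible as an error.
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("report-topic", "probe")) // placeholder topic
                    .get(10, TimeUnit.SECONDS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("acked partition=" + meta.partition()
                    + " offset=" + meta.offset() + " in " + elapsedMs + " ms");
        }
    }
}
```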

13. Nowhere left to go; shall we just stare at the data again?

Truly out of ideas, all I could do was watch the data and pass the time.

Accidents happen when you least expect them: the data had gone back to normal! What the...

Tracing the timeline backwards, the recovery lined up with the Kafka restart; that was what brought the data back up.

Well, the problem was located: Kafka had been stalling. We couldn't hold out any longer; send the conclusion email, and let's go wash up and sleep first!

14. Why did Kafka stall?

That is the real root of the problem! We just didn't have the energy to dig any further at the time.

The request volume on the topic was too large and the number of partitions too small, so throughput dropped. After the partition count was increased, everything finally returned to normal!
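For reference, increasing a topic's partition count can be done with Kafka's AdminClient; a minimal sketch, with the broker address, topic name, and target count all placeholders (the same change can also be made with the kafka-topics tool that ships with Kafka):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the topic to 32 partitions total (placeholder number).
            // Note: partitions can only be increased, never decreased, and existing
            // key-based partitioning of old data is not reshuffled.
            admin.createPartitions(Collections.singletonMap(
                            "report-topic", NewPartitions.increaseTo(32)))
                    .all()
                    .get();
            System.out.println("partition count increased");
        }
    }
}
```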

Eh, it feels like a lot of the work along the way was wasted, but there was no way around it!
