An inventory of some unusual pitfalls I stepped into over two years of using Kafka

Preface

My previous company built a catering (restaurant ordering) system. During the lunch and dinner rush, the system's concurrency was not to be underestimated. To be safe, the company required every team to take turns being on call during meal hours, so that production problems could be handled promptly.

I was on the kitchen display system team at the time. Our system sits downstream of the order system: after a user picks dishes and places an order, the order system sends a Kafka message to us. We consume the message, run our business logic, persist the order and dish data, and display them on the kitchen client. That way the chefs know which dishes to cook for which order. Once a dish is ready, the system notifies the waiter to serve it, and when the waiter marks it as served, the customer can see which dishes have arrived and which are still pending. The system greatly improves efficiency along the whole path from kitchen to customer.

Experience proved that the key to all of this is the message middleware, Kafka: if it has a problem, the kitchen display system is directly affected.

Next, let me walk through the pitfalls we hit during two years of using Kafka.

Message ordering problems

1. Why must message order be guaranteed?

In the beginning our system had few merchants, and to ship features quickly we did not think too hard. Since communication went through the message middleware Kafka, the order system simply put the full order details in the message body. Our kitchen display system only had to subscribe to the topic to get the data and then handle its own business logic.

However, this scheme hinges on one key requirement: the order of the messages must be guaranteed.

Why?

An order has many statuses: placed, paid, completed, cancelled, and so on. What if the "order placed" message has not been consumed yet, but the "paid" or "cancelled" message is read first? Wouldn't the data end up in a mess?

So yes, we did need to guarantee message order.

2. How do we guarantee message order?

As we all know, a Kafka topic as a whole is unordered, but a topic consists of multiple partitions, and each partition is ordered internally.

The idea then becomes clear: as long as the producer writes messages to the same partition according to some rule, and different consumers each read from different partitions, the order of production and consumption can be preserved.

This is what we did at first: messages with the same merchant number were written to the same partition, the topic had 4 partitions, and we deployed 4 consumer nodes forming one consumer group, one partition per consumer node. In theory, this scheme guarantees message order.
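
For illustration, here is a minimal producer sketch of this keyed routing using the standard Java client; the broker address, topic name, merchant number and JSON payload below are made up, not taken from our real code. Kafka's default partitioner hashes the record key, so all records with the same key land in the same partition.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class OrderMessageProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String merchantNo = "10086";   // hypothetical merchant number
            String orderJson = "{\"orderNo\":\"2001\",\"status\":\"CREATED\"}";
            // Using the merchant number as the record key: the default partitioner
            // hashes the key, so all messages of one merchant go to the same partition
            // and stay in order within that partition.
            producer.send(new ProducerRecord<>("order_topic", merchantNo, orderJson));
        }
    }
}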

Everything seemed "seamless", and we went online "smoothly".

3. An accident happens

The feature had been live for a while, and at first everything looked normal.

But the good times did not last long. We soon received complaints from users: some orders and dishes never showed up on the kitchen client, so the dishes could not be scheduled.

I tracked down the cause: during that period the company's network was often unstable, business interfaces timed out from time to time, and requests occasionally failed to reach the database.

For sequential messages, this situation is devastating.

Why?

Suppose the order system sends three messages in sequence: "Order placed", "Paid", and "Completed".

The "order" message our system failed to process due to network reasons, and the data of the next two messages cannot be stored in the database, because only the data of the "order" message is complete data, and other types of messages only update status.

On top of that, we had no failure-retry mechanism at the time, which magnified the problem: once the "order placed" message failed to be persisted, the user would never see that order or its dishes.

So how do we solve this urgent problem?

4. Resolution process

Our first idea was: when the consumer fails to process a message, retry 3 to 5 times on the spot. But what if a request does not succeed until the 6th attempt? We cannot retry forever, and such synchronous retries would block the reading of other merchants' order messages.

Obviously, such synchronous retries would seriously slow the consumer down and cut its throughput whenever anything went wrong.

So we had to retry asynchronously.

With an asynchronous retry mechanism, a failed message has to be saved into a retry table.

But a new question immediately appeared: how can order be guaranteed if only the single failed message is saved?

Indeed, saving just that one message cannot guarantee order. Suppose the "order placed" message fails and has not yet been retried asynchronously; the "paid" message is then consumed, and it obviously cannot be processed correctly.

Should the "paid" message then just keep waiting, checking every so often whether the messages before it have been consumed?

If we really did that, there would be two problems:

  1. The "Payment" message is followed by the "Order" message. This situation is relatively simple. But if there are N kinds of messages in front of a certain type of message, how many times need to be judged. This judgment is too coupled with the order system, which is equivalent to moving part of their system logic to our system.
  2. Affect consumers' consumption speed

Then a simpler solution emerged: when the consumer processes a message, it first checks whether the retry table already contains data for that order number. If it does, the current message is saved directly into the retry table; if not, the business logic proceeds, and if an exception is thrown, the message is saved into the retry table.
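
A minimal sketch of that consumer-side rule; the listener class, the retry-table helpers and the message type below are hypothetical placeholders, not our real code.

// Sketch of the "check the retry table first" rule; all collaborators are placeholders.
public class OrderMessageListener {

    public void onMessage(OrderMessage msg) {
        // An earlier message of this order already failed and is parked in the retry
        // table, so park this one too to keep the per-order sequence intact.
        if (retryTableHasOrder(msg.getOrderNo())) {
            saveToRetryTable(msg);
            return;
        }
        try {
            handleBusiness(msg);          // persist the order and dishes, update status, ...
        } catch (Exception e) {
            saveToRetryTable(msg);        // park it for the asynchronous retry job
        }
    }

    // Placeholders for the retry-table DAO and the business logic.
    static class OrderMessage { String orderNo; String status; String getOrderNo() { return orderNo; } }
    private boolean retryTableHasOrder(String orderNo) { return false; }
    private void saveToRetryTable(OrderMessage msg) { }
    private void handleBusiness(OrderMessage msg) { }
}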

Later we used elastic-job to build the failure-retry mechanism: if a message still fails after 7 retries, it is marked as failed and the developers are notified by email.
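
A rough sketch of such a scheduled retry job, written against the elastic-job-lite 2.x SimpleJob interface (package names differ in newer ElasticJob versions); the retry-table access and alerting methods below are hypothetical placeholders.

import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;
import java.util.Collections;
import java.util.List;

public class RetryMessageJob implements SimpleJob {

    private static final int MAX_RETRY = 7;   // after 7 failed attempts, give up and alert

    @Override
    public void execute(ShardingContext context) {
        for (RetryMessage msg : loadPendingFromRetryTable()) {
            try {
                processBusiness(msg);                 // re-run the original consumer logic
                markSuccess(msg);                     // flag the row as done
            } catch (Exception e) {
                if (msg.retryCount + 1 >= MAX_RETRY) {
                    markFailed(msg);                  // flag as permanently failed
                    emailDevelopers(msg);             // notify the developers by email
                } else {
                    incrementRetryCount(msg);
                }
            }
        }
    }

    // The pieces below are placeholders for the retry table and alerting.
    static class RetryMessage { long id; String orderNo; int retryCount; String body; }
    private List<RetryMessage> loadPendingFromRetryTable() { return Collections.emptyList(); }
    private void processBusiness(RetryMessage msg) { }
    private void markSuccess(RetryMessage msg) { }
    private void markFailed(RetryMessage msg) { }
    private void incrementRetryCount(RetryMessage msg) { }
    private void emailDevelopers(RetryMessage msg) { }
}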

With this in place, the problem of users never seeing certain orders and dishes on the kitchen client because of network instability was solved. At worst a merchant occasionally sees a dish with some delay, which is far better than never seeing it at all.

Message backlog

As the sales team promoted the product, the number of merchants in our system kept growing, and with it the volume of messages. The consumers could not keep up and messages frequently piled up. The impact on merchants was very direct: orders and dishes might not appear on the kitchen client until half an hour later. A delay of a minute or two is tolerable; a half-hour delay is not, and the more hot-tempered merchants complained immediately. During that period we regularly received complaints about delayed orders and dishes.

Adding server nodes would have solved the problem, but the company's custom is to save money, so system optimization had to come first. Thus began our journey of fighting the message backlog.

1. The message body is too large

Although Kafka claims to support million-level TPS, a message still has a journey: the producer sends it to the broker over one network IO, the broker persists it with one disk IO (a write), the consumer fetches it from the broker with another disk IO (a read), and the message then crosses the network once more to reach the consumer.

So from production to consumption, even a simple message goes through 2 network IOs and 2 disk IOs. If the message body is too large, every one of those IOs takes longer, production and consumption slow down, consumers fall behind, and a backlog builds up.

Besides that, oversized message bodies waste broker disk space; if you are not careful, the disk can simply run out.

So the time had come to do something about the oversized message body.

How to optimize it?

We went back over the business and realized we did not need the intermediate states of an order, only its final state.

So we redesigned it like this (a code sketch follows the list):

  1. The message body sent by the order system contains only key fields such as the order id and status.
  2. After consuming a message, the kitchen display system calls the order system's order-detail query interface with that id to fetch the full data.
  3. The kitchen display system checks whether the order already exists in its own database: it inserts the order if not, and updates it if it does.
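
A minimal sketch of this "thin message plus callback query" flow; OrderEvent, OrderDetail and the helper methods below are hypothetical placeholders, not our real classes.

// The message now carries only the id and status; the full data comes from the
// order system's detail interface. All collaborators are placeholders.
public class KitchenDisplayConsumer {

    public void onMessage(OrderEvent event) {
        // 1. Fetch the full order details by id from the order system.
        OrderDetail detail = queryOrderDetail(event.orderId);

        // 2. Insert if we have never stored this order, otherwise update it.
        if (orderExistsLocally(event.orderId)) {
            updateOrder(detail);
        } else {
            insertOrder(detail);
        }
    }

    static class OrderEvent { long orderId; String status; }
    static class OrderDetail { long orderId; /* dishes, amounts, ... */ }
    private OrderDetail queryOrderDetail(long orderId) { return new OrderDetail(); }
    private boolean orderExistsLocally(long orderId) { return false; }
    private void updateOrder(OrderDetail detail) { }
    private void insertOrder(OrderDetail detail) { }
}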

Sure enough, after this adjustment the message backlog did not reappear for a long time.

2. Unreasonable routing rules

But we celebrated too soon. One day at noon, a merchant complained again about delayed orders and dishes. When we checked the Kafka topic, messages were indeed backing up.

This time it was a bit odd, though: not every partition had a backlog, only one of them did.

At first I suspected something was wrong with the consumer node reading that partition, but investigation found nothing abnormal.

This was weird. Where was the problem?

Checking the logs and the database, I found that a few merchants had an extremely large volume of orders, and those merchants happened to be routed to the same partition, so that partition received far more messages than the others.

Only then did we realize that routing to partitions by merchant number was unreasonable: some partitions could get far more messages than their consumer could handle, while other partitions got so few that their consumers sat idle.

To avoid this skew, we had to adjust the routing rule used when sending messages.

After some thought, routing by order number is much more even: a single order does not produce many messages, unless someone keeps adding dishes, and since adding dishes costs money, in practice one order does not generate much traffic.

After the change, messages are routed to partitions by order number, so every message with the same order number still lands in the same partition.

After this adjustment, the backlog did not reappear for a long time, even though the number of merchants kept growing quickly during that period.

3. A chain reaction caused by a batch operation

In high-concurrency scenarios, message backlog is a constant companion, and there is no once-and-for-all fix. It looks solved on the surface, but sooner or later it shows up again, like this time:

One afternoon the product manager came over: several merchants were complaining that dishes were delayed, please find the cause, quickly.

This time the problem looked a bit strange.

Why?

First, the timing was odd. Problems normally hit during the lunch or dinner rush; why was this one happening in the middle of the afternoon?

Going by past experience, I went straight to the Kafka topic, and sure enough there was a backlog. But this time every partition had hundreds of thousands of unconsumed messages, several hundred times the backlog we had seen before. This backlog was extremely unusual.

I hurried to check the service monitoring to see whether the consumers had gone down; luckily they had not. I checked the service logs again and found nothing abnormal. Now I was stuck. On a hunch I asked the order team whether anything had happened that afternoon. It turned out there had been a promotion, and they had run a JOB to batch-update the order data of some merchants.

Suddenly it all made sense: the problem was caused by the flood of messages from their batch JOB. Why hadn't they told us? Terrible.

We knew the cause now, but how were we going to deal with the hundreds of thousands of backlogged messages in front of us?

Simply adding partitions would not help: the historical messages were already sitting in the 4 existing partitions, and only new messages would go to any new partition. It was the existing partitions we had to drain.

Adding consumer nodes directly would not help either: Kafka allows one consumer to consume several partitions, but a partition can be consumed by only one consumer within the same group, so extra consumers beyond the partition count would just sit idle and waste resources.

That left multithreaded processing.

As an emergency measure, I switched the consumer to process messages with a thread pool, with both the core and the maximum thread count set to 50.
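
Roughly what that emergency change looked like, as a sketch: the handler body is a placeholder, and note that handing records to a shared pool like this gives up the per-partition ordering discussed earlier.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PooledMessageProcessor {

    // Emergency configuration: 50 core threads, 50 max threads.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            50, 50,
            60L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(10_000),
            // If the queue fills up, run in the caller thread so no message is dropped.
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void onMessage(String message) {
        // Each record is handled by whichever pool thread is free,
        // which trades message ordering for throughput.
        pool.execute(() -> handle(message));
    }

    private void handle(String message) {
        // placeholder: query details, upsert order and dishes
    }
}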

After the change, the backlog did indeed start to shrink.

But then a more serious problem appeared: an alert email arrived saying that two order-system nodes were down.

Soon a colleague from the order team came to me: the concurrency of our calls to their order query interface had surged to several times their estimate and had knocked out two of their service nodes. The query function was deployed as a single service with 6 nodes; 2 were down, and without intervention the other 4 would follow. The order service is one of the company's core services; if it fails, the losses are huge, so the situation was extremely urgent.

To relieve the pressure, the only immediate option was to reduce the thread count.

Fortunately, our thread counts could be adjusted dynamically through Zookeeper. I dropped the core thread count to 8 and the maximum to 10.

Ops then restarted the two downed order-service nodes and they recovered; just in case, two more nodes were added. To keep the order service safe we held our consumption rate where it was, and the kitchen display system's backlog cleared about an hour later.

Afterwards we held a review meeting and concluded:

  1. The order system must notify the downstream teams in advance of any batch operation.
  2. Downstream teams must load-test any multithreaded calls to the order query interface.
  3. This was a wake-up call for the order query service: as a core company service it did not handle the high-concurrency scenario well and needs optimization.
  4. Monitor the message backlog.

As an aside, for scenarios that strictly require message order, the single thread pool can be replaced with multiple queues, each drained by a single thread, as in the sketch below.
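
A minimal sketch of that idea, hashing by order number so that all messages of one order go through the same single-threaded queue; the class name and queue count are illustrative only.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class KeyedOrderedProcessor {

    private static final int QUEUE_COUNT = 8;   // illustrative number of queues
    private final ExecutorService[] workers = new ExecutorService[QUEUE_COUNT];

    public KeyedOrderedProcessor() {
        for (int i = 0; i < QUEUE_COUNT; i++) {
            // One single-threaded executor per queue: tasks submitted to the same
            // executor are processed strictly in submission order.
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void onMessage(String orderNo, String message) {
        // Hash the order number so all messages of one order share a queue,
        // preserving per-order ordering while still processing orders in parallel.
        int index = (orderNo.hashCode() & 0x7fffffff) % QUEUE_COUNT;
        workers[index].execute(() -> handle(message));
    }

    private void handle(String message) {
        // placeholder business processing
    }
}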

4. The table is too large

To keep the backlog from recurring, the consumer kept using multiple threads to process messages from then on.

Yet one day at noon we still received a flood of alert emails warning that messages were backing up on the Kafka topic. While we were investigating, the product manager rushed over: another merchant was complaining about delayed dishes, please look at it right away. This time she was visibly impatient, because we had already optimized several times and the same problem kept coming back.

From a layman's point of view it looks simple: why can't the same problem ever be fixed for good?

They just don't know the pain on the engineering side.

On the surface the symptom is always the same, delayed dishes, and all they know is that messages are backing up. What they don't see is that the backlog can have many different root causes; that is probably a common pain of using message middleware.

I said nothing, bit the bullet, and went to find the cause.

Checking the logs, I found that consuming a single message now took up to 2 seconds. It used to be 500 milliseconds. How had it become 2 seconds?

Strange: the consumer code had not been changed significantly, so why?

I checked the dish table in production and found that the single table had reached tens of millions of rows, and the other tables were similar. These single tables were simply holding too much data.

Our team went over the business again: the client actually only needs to display dishes from the last 3 days.

That made it easy: the server side was keeping redundant data, so we could simply archive it. The DBA archived the old rows for us and kept only the last 7 days of data in the tables.

After that adjustment, the backlog was resolved and calm returned.

Primary key conflict

But don't relax too soon; there were other problems. For example, alert emails often reported a database exception: Duplicate entry '6' for key 'PRIMARY', a primary key conflict.

This generally happens when two or more INSERT statements with the same primary key run at about the same time: the first insert succeeds and the second fails with a primary key conflict, since the primary key must be unique.

I read the code carefully: it first queries the table by primary key to see whether the order exists, updates the status if it does, and inserts if it does not. At first glance, no problem.

That check works when concurrency is low. Under high concurrency, however, two requests can both find that the order does not exist; one inserts first, and when the other inserts too, the primary key conflict is thrown.

The most common way to solve this kind of problem is locking.

That was my first thought too. A pessimistic database lock was out of the question: the performance cost is too high. An optimistic lock based on a version number is generally used for updates and is basically of no use for this kind of insert.

That leaves a distributed lock. Our system already uses Redis, so we could add a Redis-based distributed lock on the order number.

But after thinking about it carefully:

  1. A distributed lock would also slow down the consumer's message processing.
  2. The consumer would then depend on Redis; if Redis had a network timeout, our service would be in trouble.

So I decided not to use a distributed lock.

Instead, we chose MySQL's INSERT INTO ... ON DUPLICATE KEY UPDATE syntax:

INSERT INTO table_name (column_list)
VALUES (value_list)
ON DUPLICATE KEY UPDATE
  c1 = v1,
  c2 = v2,
  ...;

It first tries to insert the row, and if the primary key conflicts, it updates the listed fields instead.

After changing the original insert statement to this form, the primary key conflict never appeared again.

Database master-slave delay

Shortly afterwards, a merchant complained that after a customer placed an order, the order showed up on the kitchen client but with an incomplete list of dishes, and sometimes neither the order nor the dish data were visible at all.

This one was different from before. Out of habit I first checked whether the Kafka topic had a backlog, but this time there was none.

Then I checked the service logs and found that some calls to the order system's interface returned empty data, and some returned only the order but no dishes.

Very strange. I went straight to the order team; they checked their service carefully and found nothing wrong. We then wondered whether the database was the problem and went to the DBA together. Sure enough, the DBA found that replication from the master database to the slave occasionally lagged because of the network, sometimes by as much as 3 seconds.

If less than 3 seconds pass between the message being sent and our consuming it, then when we call the order-detail query interface the data may not be there yet, or may not be the latest.

This is a serious problem: it leads directly to wrong data on our side.

To solve it, we again leaned on the retry mechanism: when the query interface returns empty data, or returns the order without its dishes, we put the message into the retry table.
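
A minimal sketch of that extra check, in the spirit of the earlier consumer sketch; queryOrderDetail(), getDishes(), saveToRetryTable() and upsertOrder() are hypothetical placeholders.

public void onMessage(OrderEvent event) {
    OrderDetail detail = queryOrderDetail(event.orderId);

    // Because the slave can lag behind the master by a few seconds, the detail query
    // may return nothing, or an order without its dishes; park the message and let
    // the retry job pick it up later instead of storing incomplete data.
    if (detail == null || detail.getDishes() == null || detail.getDishes().isEmpty()) {
        saveToRetryTable(event);
        return;
    }

    upsertOrder(detail);   // INSERT ... ON DUPLICATE KEY UPDATE under the hood
}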

After the adjustment, the complaint from the merchant was resolved.

Repeated consumption

Kafka supports three modes when consuming messages:

  • at most once: commit each message first, then process it. Messages may be lost but will never be processed twice.
  • at least once: process each message first, then commit. Messages will not be lost but may be processed more than once.
  • exactly once: treat the offset and the message as one unit with a unique id and make the processing atomic, so each message is processed exactly once, neither lost nor repeated. This is the hardest to achieve.

Kafka's default behavior is effectively at least once, but that can lead to repeated consumption, so our business logic must be idempotent.
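
For illustration, a minimal at-least-once consumer loop with the standard Java client, assuming auto-commit is disabled and the offset is committed only after processing; the broker address, group id and topic name are made up.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "kitchen-display");            // made-up group id
        props.put("enable.auto.commit", "false");            // commit manually, after processing
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("order_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record.value());   // idempotent processing, e.g. an upsert
                }
                // Committing only after processing: if we crash before this line,
                // the same records will be redelivered (at least once).
                consumer.commitSync();
            }
        }
    }

    private static void handle(String message) {
        // placeholder for idempotent business logic
    }
}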

In our scenario we save data with the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax, inserting when the row does not exist and updating when it does, which is naturally idempotent.

Multi-environment consumption problems

Our online setup was divided into pre (the pre-release environment) and prod (the production environment), and the two environments shared the same database and the same Kafka cluster.

Note that the Kafka topics had to carry a prefix to distinguish the environments: topics in pre start with pre_, such as pre_order, and topics in prod start with prod_, such as prod_order, so that messages from the two environments never get mixed up.
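
Something as simple as deriving the topic name from an environment flag is enough to enforce this convention; a tiny sketch, where the "app.env" property name is made up.

public class TopicNames {
    // The active environment, e.g. "pre" or "prod"; "app.env" is a made-up property name.
    private static final String ENV = System.getProperty("app.env", "pre");

    // Resolves to pre_order in the pre-release environment, prod_order in production.
    public static String orderTopic() {
        return ENV + "_order";
    }
}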

But once, while Ops was switching nodes in the pre environment, the topic was misconfigured to the prod topic. That very day we were testing a new feature in pre. The result was tragic: some prod messages were consumed by the pre consumers, and because the new feature had changed the message body, the pre consumers kept failing to process them.

As a result, some messages were effectively lost to the production environment. Fortunately, the production consumers eventually recovered by resetting the offset and re-reading that range of messages, so no great damage was done.

Postscript

In addition to the above problems, I have also encountered:

  • Kafka consumer CPU usage hitting 100%
  • A broker node in the Kafka cluster crashing, and crashing again right after a restart

These two problems are a bit involved, so I will not go through them here.

I am very grateful for these two years of experience with Kafka. Although I ran into many problems, stepped into many pits and took many detours, I accumulated a lot of valuable experience and grew quickly.

Kafka is in fact excellent message middleware. Most of the problems I ran into were not Kafka's own fault (except the 100% CPU usage, which was caused by a bug in Kafka).

Author: Susan

Original link: https://mp.weixin.qq.com/s?__biz=MzUxODkzNTQ3Nw==&mid=2247486202&idx=1&sn=23f249d3796eb53aff9cf41de6a41761

If this article helped you, you can follow my official account and reply with the keyword [Interview] to get a compilation of Java core knowledge points and an interview gift pack! There are more technical articles and materials shared there; let's learn and improve together!
