Kafka in Practice: Two Years of Pitfalls

Reprinted from the WeChat public account: Java soulmate

Preface

My previous company ran a catering system. During the lunch and dinner rush, the system's concurrency was not to be underestimated. To be safe, the company required every department to take turns being on call during meal hours, so that online problems could be handled in time.

I was on the kitchen display system team at the time. Our system is a downstream service of the order system: after a user places an order, the order system sends a message through Kafka to our system, which reads the message, runs its business logic, persists the order and dish data, and displays them on the dish-serving client. That way the chef knows which dishes to make for which order, and some dishes can even be prepared ahead of time through the system. The system automatically notifies the waiter to serve a dish, and once the waiter serves it and updates its status, the user can see which dishes have been served and which have not. The system greatly improves efficiency all the way from the back kitchen to the user.


Facts proved that the key to all of this is the message middleware, Kafka: if it has a problem, the kitchen display system is directly affected.

Next, let me share the Kafka pits we stepped into over the past two years.

Message ordering problems

1. Why should the order of messages be guaranteed?

In the beginning there were few merchants in our system, and in order to deliver the feature quickly we didn't think too hard. Since the systems communicate through Kafka, the order system simply put the full order details into the message body when sending a message. Our kitchen display system only had to subscribe to the topic to get the message data and then run its own business logic.

However, this scheme hinges on one key requirement: the order of the messages must be guaranteed.

Why?

An order has many states, such as: created, paid, completed, cancelled, and so on. If the "created" message is not read first and the "paid" or "cancelled" message is consumed before it, wouldn't the data get messed up?

Well, it seems necessary to ensure the order of messages.

2. How to ensure the order of messages?

We all know that a Kafka topic as a whole is unordered, but a topic contains multiple partitions, and messages within each partition are ordered.


So the idea becomes clear: as long as the producer writes messages to the same partition according to some fixed rule, and different consumers read different partitions, the order of production and consumption can be guaranteed.

That is how we started: messages with the same merchant number were written to the same partition, the topic was created with 4 partitions, and 4 consumer nodes were deployed to form a consumer group, with each partition consumed by one node. In theory, this scheme guarantees message order.
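
To make the routing concrete, here is a minimal producer sketch; the broker address, topic name, and merchant number are placeholders. With the merchant number used as the record key, Kafka's default partitioner hashes the key, so all of one merchant's messages land in the same partition.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderMessageProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String merchantNo = "10086";                              // hypothetical merchant number
            String orderJson = "{\"orderNo\":\"202103150001\",\"status\":\"CREATED\"}";

            // Using the merchant number as the record key: Kafka's default partitioner
            // hashes the key, so every message of the same merchant goes to the same
            // partition and stays ordered within that partition.
            producer.send(new ProducerRecord<>("order_topic", merchantNo, orderJson));
        }
    }
}
```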

Everything seemed "seamless", and we went online "smoothly".

3. An accident occurs

The feature had been online for a while and ran quite normally at first.

However, the good times didn't last long. We soon received complaints from users that some orders and dishes never showed up on the dish-serving client, so the kitchen couldn't schedule them.

I tracked down the cause: during that period the company's network was often unstable, the business interfaces timed out from time to time, and business requests occasionally failed to connect to the database.

For ordered messages, this kind of situation is devastating.

Why do you say that?

Suppose the order system sends three messages: "created", "paid", and "completed". If our system fails to process the "created" message because of a network problem, the data of the next two messages cannot be stored either, because only the "created" message carries the complete order data; the other message types merely update the status.

On top of that, we had no failure retry mechanism at the time, which magnified the problem into: once the "created" message fails to be stored, the user will never see that order or its dishes.

So how can this urgent problem be solved?

4. Resolution process

At the beginning, our idea was: when the consumer fails to process a message, immediately retry 3 to 5 times. But what if some requests don't succeed until the 6th try? We can't retry forever, and this kind of synchronous retry blocks the consumption of other merchants' order messages.

Obviously, a synchronous retry mechanism in such abnormal situations would seriously slow the consumer down and reduce its throughput.

So it seemed we had to use an asynchronous retry mechanism.

With an asynchronous retry mechanism, failed messages must be saved to a retry table.

But a new question immediately appeared: if only the failed message is saved, how do we still guarantee message order?

Indeed, saving only the failed message does not guarantee order: if the "created" message fails and has not yet been retried asynchronously, and the "paid" message is consumed at that moment, it still cannot be processed normally.

Should the "paid" message then wait indefinitely, checking at some interval whether the preceding message has been consumed?

If we really did that, two problems would arise:

  1. If the only message before the "paid" message is the "created" message, the check is simple. But if a given message type can be preceded by N kinds of messages, how many checks would be needed? This couples us far too tightly to the order system; it effectively moves part of their system logic into ours.

  2. It slows down the consumer's consumption speed.

At that point a simpler solution surfaced: when the consumer processes a message, it first checks whether the retry table already contains data for that order number. If it does, the current message is saved directly to the retry table. If not, the business logic proceeds, and if an exception occurs, the message is saved to the retry table.

Later, we built a failure retry mechanism on top of elastic-job. If a message still fails after 7 retries, its status is marked as failed and an email is sent to notify the developers.
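
Putting the pieces together, the consumer-side flow looks roughly like the sketch below. OrderMessage, RetryRepository, and OrderService are hypothetical stand-ins for our actual message class, DAO, and business layer; the 7-retry job itself runs separately in elastic-job.

```java
/** Hypothetical message payload: only the fields the example needs. */
class OrderMessage {
    String orderNo;
    String status;
}

/** Hypothetical DAO for the retry table. */
interface RetryRepository {
    boolean existsPendingRetry(String orderNo);
    void save(OrderMessage message);
}

/** Hypothetical business layer. */
interface OrderService {
    void process(OrderMessage message);
}

public class OrderMessageHandler {

    private final RetryRepository retryRepository;
    private final OrderService orderService;

    public OrderMessageHandler(RetryRepository retryRepository, OrderService orderService) {
        this.retryRepository = retryRepository;
        this.orderService = orderService;
    }

    public void handle(OrderMessage message) {
        // If this order already has a failed message in the retry table, the current
        // message must queue behind it so that per-order ordering is preserved.
        if (retryRepository.existsPendingRetry(message.orderNo)) {
            retryRepository.save(message);
            return;
        }
        try {
            orderService.process(message);   // normal business processing
        } catch (Exception e) {
            // On failure, park the message in the retry table; a scheduled job
            // (elastic-job in our case) retries it and marks it failed after 7 attempts.
            retryRepository.save(message);
        }
    }
}
```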

With that, the problem of users being unable to see some orders and dishes in the dish-serving client due to network instability was solved. Merchants now at most occasionally see dishes with a delay, which is far better than never seeing them at all.

Message backlog

As the sales team's marketing promotion brought more and more merchants into our system, message volume grew larger and larger, consumers couldn't keep up, and message backlogs became frequent. The impact on merchants was very direct: orders and dishes might not show up on the dish-serving client until half an hour later. A delay of a minute or two can be tolerated, but a delay of half an hour cannot, and some hot-tempered merchants complained immediately. During that period we frequently received complaints about delayed orders and dishes.

Although adding server nodes could solve the problem, the company's practice, to save money, is to optimize the system first. So our journey of solving message backlogs began.

1. The message body is too large

Although Kafka claims to support millions of TPS, sending a message from the producer to the broker takes one network IO, and the broker writing the data to disk takes one disk IO (a write). For the consumer to get the message from the broker, there is another disk IO (a read) and then another network IO.

So a simple message goes through 2 network IOs and 2 disk IOs from production to consumption. If the message body is too large, it inevitably increases the IO time, which slows down Kafka's production and consumption speed. Slow consumers then lead to message backlogs.

Besides that, an oversized message body also wastes disk space on the servers; if you're not careful, you can run out of disk space.

So the time had come to optimize the oversized message body.

How to optimize it?

We re-examined the business and realized we didn't need to know the intermediate states of an order, only its final state.

So we could redesign it like this:

  1. The message body sent by the order system contains only key information such as the id and status.

  2. After the kitchen display system consumes the message, it calls the order system's order detail query interface with the id to fetch the data.

  3. The kitchen display system checks whether the order already exists in its database: if not, it inserts the data; if it does, it updates it.
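
A simplified sketch of the adjusted consumer, assuming a hypothetical OrderClient for the order system's detail query interface and an OrderRepository for our own database:

```java
/** Hypothetical slim message body: only identifiers and the final status. */
class OrderEvent {
    Long orderId;
    String status;
}

/** Hypothetical order detail returned by the order system (order fields plus the dish list). */
class OrderDetail {
    Long orderId;
}

/** Hypothetical client for the order system's detail query interface. */
interface OrderClient {
    OrderDetail queryDetail(Long orderId);
}

/** Hypothetical DAO of the kitchen display system. */
interface OrderRepository {
    boolean exists(Long orderId);
    void insert(OrderDetail detail);
    void update(OrderDetail detail);
}

class SlimMessageConsumer {

    private final OrderClient orderClient;
    private final OrderRepository orderRepository;

    SlimMessageConsumer(OrderClient orderClient, OrderRepository orderRepository) {
        this.orderClient = orderClient;
        this.orderRepository = orderRepository;
    }

    void onMessage(OrderEvent event) {
        // The message only carries the id and status, so fetch the full
        // order details back from the order system.
        OrderDetail detail = orderClient.queryDetail(event.orderId);

        // Insert when the order is new, update when it already exists.
        if (orderRepository.exists(event.orderId)) {
            orderRepository.update(detail);
        } else {
            orderRepository.insert(detail);
        }
    }
}
```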


Sure enough, after such adjustments, the message backlog problem did not reappear for a long time.

 

2. Unreasonable routing rules

But don't celebrate too early. One day at noon a merchant complained that orders and dishes were delayed, and when we checked the Kafka topic there was indeed a message backlog.

But this time it was a bit strange: not all partitions had a backlog, only one of them.


At first I thought something was wrong with the consumer node reading that partition, but after investigation no abnormality was found.

This is weird. Where is the problem?

Later, checking the logs and the database, I found that several merchants had extremely large order volumes, and it so happened that these merchants were routed to the same partition, making that partition's message volume far larger than the others'.

Only then did we realize that routing messages to partitions by merchant number was an unreasonable rule: it can leave some partitions with too many messages for their consumers to keep up with, while other partitions have so few that their consumers sit idle.

In order to avoid this uneven distribution, we need to adjust the routing rules for sending messages.

After some thought, routing by order number distributes messages much more evenly, and a single order won't generate too many messages. Unless someone keeps adding dishes, but adding dishes costs money, so in practice there aren't many messages for the same order.

So we adjusted the rule to route by order number: messages with different order numbers go to different partitions, and messages with the same order number are always sent to the same partition.


After the adjustment, the message backlog problem did not reappear for a long time, even though our merchant count kept growing fast during that period.

3. Chain reaction caused by a batch operation

In high-concurrency scenarios, message backlogs are a constant companion, and there is really no way to solve them once and for all. They look resolved on the surface, but sooner or later they come back, like this time:

One afternoon the product manager came over and said: several merchants have complained that dishes are delayed, find out why quickly.

This time the problem appeared a bit strange.

Why do you say that?

First of all, the timing was odd. Problems usually happen during the lunch or dinner rush, so why did this one appear in the afternoon?

Drawing on past experience, I went straight to the Kafka topic data. Sure enough there was a backlog, but this time every partition had well over a hundred thousand unconsumed messages, hundreds of times more than the backlogs we had seen before. This backlog was extremely unusual.

I hurried to check the service monitoring to see whether the consumers had crashed. Fortunately they hadn't. I checked the service logs again and found nothing abnormal. At this point I was a bit lost, so I tried my luck and asked the order team whether anything had happened that afternoon. They said there had been a promotion, and they had run a JOB to batch-update the order information of some merchants.

It suddenly dawned on me: the problem was caused by the batch of messages their JOB sent. Why didn't they notify us? That was really bad.

We knew the cause, but how should we deal with a backlog of well over a hundred thousand messages per partition?

Simply increasing the number of partitions wouldn't help at this point: the historical messages were already stored in the 4 existing partitions, and only new messages would go to the new partitions. What we had to deal with were the existing partitions.

Adding consumer nodes directly wouldn't help either: in Kafka, one consumer in a group may consume multiple partitions, but one partition cannot be consumed by multiple consumers in the same group, so the extra nodes would just sit idle and waste resources.

It seemed the only option left was multi-threaded processing.

To solve the problem urgently, I switched to processing messages with a thread pool, with both the core and maximum thread counts configured to 50.
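
For reference, here is a minimal sketch of that emergency change; the broker address, group id, and topic name are placeholders. The poll loop hands each record to a fixed pool of 50 threads, so it is no longer limited by how long any single message takes, at the cost of giving up ordering within a partition.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PooledConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker address
        props.put("group.id", "kitchen-display");            // hypothetical group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // 50 worker threads, matching the core/max sizes we configured at the time.
        ExecutorService pool = Executors.newFixedThreadPool(50);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("prod_order"));   // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record is handled by the pool, so the poll loop no longer waits
                    // for business processing. Ordering between records is not preserved.
                    pool.submit(() -> handle(record.value()));
                }
            }
        }
    }

    private static void handle(String payload) {
        // business processing goes here
    }
}
```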

After the adjustment, sure enough, the message backlog kept shrinking.

But then a more serious problem appeared: I received an alert email saying that two nodes of the order system were down.

Soon a colleague from the order team came to me and said that the concurrency of our calls to their order query interface had surged to several times the estimate, bringing down two service nodes. They had split the query function into a standalone service with 6 nodes deployed, 2 of which had crashed; if nothing was done, the other 4 would go down too. The order service is one of the company's core services; if it fails, the losses are huge, and the situation was extremely urgent.

To relieve the pressure, the first step could only be to reduce the thread count.

Fortunately, the thread counts could be adjusted dynamically through Zookeeper. I turned the core thread count down to 8 and the maximum thread count down to 10.

Later, the operations team restarted the two crashed order service nodes and things returned to normal, and two more nodes were added just in case. To make sure the order service stayed safe, we kept the reduced consumption rate, and the kitchen display system's message backlog returned to normal about an hour later.

Later, we held a review meeting and concluded that:

  1. Batch operations in the order system must be announced to the downstream teams in advance.

  2. Downstream teams must do load testing before calling the order query interface from multiple threads.

  3. This incident was a wake-up call for the order query service: as a core company service, it handles high-concurrency scenarios poorly and needs to be optimized.

  4. Monitor the backlog of messages.

 

By the way, in scenarios that require a strict guarantee of message order, the thread pool can be replaced with multiple queues, each processed by a single thread.
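
A minimal sketch of that idea, assuming we key by order number: messages with the same order number always hash to the same single-thread executor, so they are processed in order, while different orders still run in parallel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class KeyedOrderedExecutor {

    private final ExecutorService[] workers;

    public KeyedOrderedExecutor(int queueCount) {
        workers = new ExecutorService[queueCount];
        for (int i = 0; i < queueCount; i++) {
            // Each worker is a single thread with its own queue.
            workers[i] = Executors.newSingleThreadExecutor();
        }
    }

    public void submit(String orderNo, Runnable task) {
        // Same order number -> same worker -> sequential processing for that order.
        int index = Math.abs(orderNo.hashCode() % workers.length);
        workers[index].submit(task);
    }
}
```

In the consumer, a call like submit(orderNo, () -> handle(message)) would replace the plain pool.submit call from the earlier sketch.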

4. The table is too large

To prevent the message backlog from recurring, the consumers kept using multi-threading to process messages from then on.

But one day at noon we still received a flood of alert emails warning of a backlog on the Kafka topic. While we were investigating, the product manager ran over and said: another merchant is complaining that dishes are delayed, take a look quickly. This time she looked a little impatient: we had already optimized many times, yet the same problem kept appearing.

From a layperson's point of view: why can't the same problem be fixed once and for all?

In fact, they don't know the bitterness of technology.

On the surface the symptom is the same, delayed dishes, and they know it comes from a message backlog. But they don't see the underlying causes. There are actually many different reasons for a backlog, which is probably a common pain of using message middleware.

I was silent and could only bite the bullet and locate the cause.

Checking the logs, I found that consuming a single message now took as long as 2 seconds. It used to be 500 milliseconds, so how had it become 2 seconds?

Strange, the consumer code hadn't been changed much, so why was this happening?

I checked the dish table online and found that the single table had reached tens of millions of rows, and the other related tables were similar. The single tables were simply storing too much data.

Our team went over the business again: in fact, the client only needs to display dishes from the most recent 3 days.

That made things easy: the server was storing redundant data, so the excess data in the tables could simply be archived. The DBA helped us archive the historical data, keeping only the most recent 7 days.

After this adjustment, the message backlog was resolved and the old calm returned.

Primary key conflict

Don't get too comfortable, though; there were other problems. For example, alert emails often reported a database exception: Duplicate entry '6' for key 'PRIMARY', meaning a primary key conflict.

This kind of problem generally happens when two or more INSERT statements with the same primary key run at the same time: after the first insert succeeds, the second one reports a primary key conflict, because the table's primary key is unique and duplicates are not allowed.

I carefully checked the code: the logic first queries the table by primary key to see whether the order exists, updates the status if it does, and inserts the data if it doesn't. At first glance, no problem.

This check works fine when concurrency is low. But in a high-concurrency scenario, two requests may both find that the order does not exist; one inserts the data first, and when the other inserts again, a primary key conflict exception is thrown.

The most conventional way to solve this problem is locking.

That was my first thought too. A database pessimistic lock was definitely out: it hurts performance too much. A database optimistic lock based on a version number is generally used for update operations and is basically useless for this kind of insert.

That leaves distributed locks. Our system already uses Redis, so we could add a Redis-based distributed lock keyed on the order number.

But after thinking about it carefully:

  1. Adding a distributed lock would also slow down the consumers' message processing.

  2. The consumers would then depend on Redis; if Redis hit a network timeout, our service would be in trouble.

So I decided not to use a distributed lock.

Instead, I chose MySQL's INSERT INTO ... ON DUPLICATE KEY UPDATE syntax:

INSERT INTO table (column_list)
VALUES (value_list)
ON DUPLICATE KEY UPDATE
  c1 = v1,
  c2 = v2,
  ...;

It first tries to insert the row, and if the primary key conflicts, it updates the specified fields instead.

After the original insert statement was rewritten this way, the primary key conflict never occurred again.
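
For illustration, here is a small JDBC-style sketch of how such an upsert could be executed; the table name, columns, and connection string are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OrderUpsertDao {

    private static final String UPSERT_SQL =
        "INSERT INTO kitchen_order (id, order_no, status) "   // hypothetical table and columns
      + "VALUES (?, ?, ?) "
      + "ON DUPLICATE KEY UPDATE status = VALUES(status)";

    public void saveOrUpdate(long id, String orderNo, String status) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/kitchen", "user", "password");   // placeholder DSN
             PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setLong(1, id);
            ps.setString(2, orderNo);
            ps.setString(3, status);
            // One round trip: inserts when the primary key is new, updates the status
            // when the key already exists, so concurrent consumers no longer race.
            ps.executeUpdate();
        }
    }
}
```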

Database master-slave delay

Shortly afterwards, I received a merchant complaint: after an order was placed, the order could be seen on the dish-serving client, but the dishes shown were incomplete, and sometimes the order and dish data could not be seen at all.

This problem was different from before. Based on experience, I first checked whether the Kafka topic had a backlog, but this time there was none.

Checking the service logs, I found that some calls to the order system's interface returned empty data, and some returned only the order data without the dish data.

Very strange. I went straight to a colleague on the order team; they checked their service carefully and found nothing wrong. Then we both wondered whether the database was the problem and went to the DBA together. Sure enough, the DBA found that replication from the master database to the slave was occasionally delayed by network issues, sometimes by as much as 3 seconds.

If less than 3 seconds pass between the order system sending the message and our system consuming it, then when we call the order detail query interface, the data may not be found yet, or the data found may not be the latest.

This problem is serious: it leads directly to wrong data on our side.

To solve it, we again reached for the retry mechanism: when the query interface returns empty data, or returns the order without its dishes, the message is added to the retry table.

After the adjustment, the complaint from the merchant was resolved.

Repeated consumption

 

Kafka supports three delivery modes when consuming messages:

  • at most once: each message is committed first and only then processed. Messages may be lost, but they will not be repeated.

  • at least once: each message is processed first and committed only after processing succeeds. Messages will not be lost, but they may be repeated.

  • exactly once: the offset is treated together with the message as a unique id, and processing is made atomic. Each message is processed exactly once, neither lost nor repeated, but this mode is hard to achieve.

Kafka's default is at least once, but this mode can cause repeated consumption, so our business logic must be designed to be idempotent.

Our scenario saves data with the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax, inserting when the row doesn't exist and updating when it does, so it naturally supports idempotence.
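
As a rough illustration of at-least-once consumption, here is a sketch with auto-commit disabled and the offset committed only after the batch is processed (broker address, group id, and topic name are placeholders). Combined with the idempotent upsert above, any re-delivered messages are harmless.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "kitchen-display");           // hypothetical group id
        props.put("enable.auto.commit", "false");            // commit only after processing
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("prod_order"));   // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    handleIdempotently(record.value());
                }
                // Commit after processing: a crash before this line means the batch is
                // re-delivered, so the handler must tolerate duplicates (idempotent upsert).
                consumer.commitSync();
            }
        }
    }

    private static void handleIdempotently(String payload) {
        // INSERT ... ON DUPLICATE KEY UPDATE, as described above
    }
}
```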

 

Multi-environment consumption problems

Our online environments were pre (pre-release) and prod (production), and the two environments shared the same database and the same Kafka cluster.

Note that when configuring the Kafka topics, a prefix is added to distinguish the environments: the pre environment uses pre_, such as pre_order, and production uses prod_, such as prod_order, to keep messages from crossing between environments.
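
A tiny sketch of how such a prefix might be applied, assuming the environment name comes from a configuration property (the property name is hypothetical):

```java
public class TopicNames {

    // Hypothetical property: "pre" in pre-release, "prod" in production.
    private static final String ENV = System.getProperty("app.env", "prod");

    /** "order" becomes "pre_order" in pre-release and "prod_order" in production. */
    public static String of(String baseName) {
        return ENV + "_" + baseName;
    }
}
```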

But one time, while switching node configurations in the pre environment, operations got the topic wrong and pointed it at the prod topic. That day we happened to release a new feature to the pre environment. The result was tragic: some prod messages were consumed by the pre environment's consumer, and because the message body had been changed for the new feature, the pre environment's consumer failed to process them.

As a result, some messages were effectively lost to the production environment. Fortunately, the production consumer eventually solved the problem by resetting the offset and re-reading that portion of messages, so no great damage was done.

Postscript

In addition to the above problems, I have also encountered:

  • A Kafka consumer using the automatic acknowledgment mechanism caused 100% CPU usage.

  • A broker node in the Kafka cluster crashed, and crashed again after being restarted.

These two issues are a bit involved, so I won't go into them one by one here. Friends who are interested can follow my official account and add me on WeChat to chat privately.

I'm very grateful for the experience of using Kafka as message middleware over the past two years. Although I ran into many problems, stepped in many pits, and took many detours, I accumulated a lot of valuable experience and grew quickly.

In fact, Kafka is a very good piece of message middleware. Most of the problems I encountered were not Kafka's own fault (except for the 100% CPU usage, which was caused by a bug in Kafka itself).


Source: blog.csdn.net/dmw412724/article/details/115037536