Message queue (Kafka): tips and common problems

Table of contents

【Overview of message queues】

【Kafka】

Message loss problem

Message duplication problem

Order of consumption

Message backlog problem

Kafka cluster deployment


【Overview of message queues】

Message queues mainly solve problems such as application coupling, asynchronous messaging, and traffic peak shaving, and they are indispensable middleware for large-scale distributed systems. The producer only publishes messages to the MQ without caring who consumes them, and the consumer only takes messages from the MQ without caring who produced them, so neither side needs to know that the other exists.

Message queues are a good fit for flows that involve many hand-offs or take too long to complete synchronously. For example, after a user successfully places an order, the order service can publish a message and let downstream steps, such as awarding gift points, run asynchronously.

Commonly used message queue middleware are as follows:

  • ActiveMQ: open-source message middleware from Apache.
  • RabbitMQ: an open-source message queuing system written in Erlang and based on the AMQP protocol.
  • RocketMQ: developed in-house by Alibaba; Taobao's internal transaction system uses Taobao's self-developed Notify middleware, which uses MySQL as its message storage medium, and RocketMQ can be fully scaled horizontally.
  • Kafka: an excellent distributed message queue developed in Scala and Java, characterized by huge throughput (millions of messages per second), horizontal scalability, and fault tolerance.

What problems can be solved by using message queues:

  • Asynchronous processing: for example, the bonus points mentioned above after placing an order, or collecting logs asynchronously;
  • Application decoupling: in the microservice era, multiple services can exchange data through a message queue without blocking one another;
  • Traffic peak shaving: in flash-sale (seckill) scenarios, requests that cannot be processed immediately can be temporarily queued and digested slowly in the background, relieving pressure on the servers;
  • Publish/Subscribe: a message can be broadcast to any number of listeners; the Producer is only responsible for sending the Message, and any Consumer can subscribe to it.

Example: in an e-commerce system, the order service and the payment service are deployed separately, and the order system, logistics system, and financial system all subscribe to messages from the payment system. When an order is paid successfully, the payment system broadcasts a message that the user has confirmed payment; after receiving it, the order system marks the order as paid, the logistics system starts shipping, and the financial system starts invoicing. This whole flow can be driven by a message queue.

Introducing message middleware also increases the complexity of the system and brings the following problems:

  • Message loss problem: no system is foolproof. If the Producer sends 10,000 messages and the Consumer only receives 9,999, you must evaluate whether the lost message is acceptable. For an SMS notification after a successful order, losing one message is tolerable; if the user has paid but the goods are never shipped, the user will certainly not be happy.
  • Message duplication problem: similarly, if the Producer sends 10,000 messages but the Consumer receives 10,001, one of them is a duplicate. Can the business accept it? One extra SMS notification after a successful order is harmless; shipping the same order twice makes the seller take the loss.
  • Message ordering: for example, the Producer sends in the order 1->2->3 but the Consumer receives 1->3->2. You need to consider whether the consumer is sensitive to ordering.
  • Consistency problem: if a message is lost and cannot be recovered, the data of the two systems will end up permanently inconsistent; if a message is merely delayed, the inconsistency is temporary. Countermeasures for both cases should be planned in advance.

【Kafka】

Some keywords in Kafka:

  • Producer: the party that creates and sends messages.
  • Consumer: the party that receives and processes messages.
  • Topic: every message published to the cluster belongs to a category called a topic; it can be understood as the name of a class of messages, and all messages are organized by topic.
  • Partition: Kafka's physical partitioning concept. Each Topic is spread across one or more Partitions, and each Partition is ordered. If a topic's data is too large, it is split into smaller pieces. Kafka uses a multi-replica model for partitions with a "one leader, multiple followers" design, and automatic failover between replicas ensures availability.
  • Broker: can be understood as one server node; a cluster contains one or more such servers, each called a broker. The broker acts as a buffer between producers and consumers: the producer sends messages at its own pace without caring when they are consumed, and the consumer pulls messages at its own pace without haste. Messages may pile up in between, yet consumer pressure stays within bounds; this buffering service is what decouples producers from consumers.
  • Kafka Cluster: a collection of Brokers; multiple Brokers form a highly available cluster.

Compared with similar middleware such as RabbitMQ and ActiveMQ, Kafka supports pulling messages in batches, which greatly increases throughput. Kafka is distributed and scalable, and a Kafka cluster can be expanded transparently by adding new servers to it.

Kafka supports three sending modes: 1) send and forget; 2) synchronous send; 3) asynchronous send with a callback. The three mainly differ in latency, and faster is not automatically better; which one to use depends on the business scenario (a minimal sketch of all three follows this list):

  • If the business requires that messages be sent strictly in order, use the second, synchronous mode, and send to a single partition only;
  • If the business only cares about throughput, tolerates a small number of failed sends, and does not care about ordering, use the first, send-and-forget mode;
  • If the business needs to know whether each message was sent successfully but does not care about ordering, use the third, asynchronous mode with a callback.
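A minimal sketch of the three sending modes with the Java producer client. The broker address, topic name "order-events", and key "order-1001" are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SendModes {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("order-events", "order-1001", "paid"); // hypothetical topic/key

            // 1. Send and forget: ignore the returned Future; highest throughput, loss possible.
            producer.send(record);

            // 2. Synchronous send: block on the Future so failures surface immediately.
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("sync send -> partition=%d offset=%d%n", meta.partition(), meta.offset());

            // 3. Asynchronous send with callback: non-blocking, the callback reports success or failure.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // e.g. log and retry, or alert
                } else {
                    System.out.printf("async send -> offset=%d%n", metadata.offset());
                }
            });
        }
    }
}
```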

Why is Kafka's throughput much higher than that of similar middleware? Kafka is a high-throughput distributed messaging system with persistence; its architecture relies on distributed parallel processing together with sequential disk I/O and batching.

  • It exploits the fact that sequential disk reads and writes are far faster than random ones. Internally it batches messages and uses the zero-copy mechanism; storing and fetching data are sequential, batched operations on the local disk with O(1) complexity, so message handling is very efficient.
  • Its concurrency model splits a topic into multiple partitions, and the unit of reading and writing in Kafka is a partition, so splitting a topic into more partitions increases throughput (see the sketch after this list for creating such a topic). Ideally, different partitions should sit on different disks (they can be on the same machine); if several partitions share one disk, multiple processes read and write multiple files on that disk at the same time and the sequential access pattern is broken.
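As a small illustration, a multi-partition topic can be created with the Kafka AdminClient. The topic name, partition count, and replication factor below are assumptions, not recommendations:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread reads/writes; replication factor 2 keeps a follower copy.
            NewTopic topic = new NewTopic("order-events", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```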

Message loss problem

A message goes through the following stages from production to consumption:

  • Message production stage: as long as the producer receives the Broker's ack, the send succeeded, so handling return values and exceptions properly is enough to avoid loss at this stage;
  • Message storage stage: this stage is usually handled by the MQ middleware itself; for example, the Broker can replicate a message so that it is synchronized to at least two nodes before returning the ack;
  • Message consumption stage: the consumer pulls messages from the Broker; as long as it does not acknowledge immediately on receipt but waits until the business logic has finished before sending the consumption confirmation, messages will not be lost at this stage either.

In which stages can messages be lost?

  • During production: the usual remedy is retransmission. If the failure is transient (for example, the queue is briefly unavailable or the network to it is interrupted), retry two or three times. Note that this scheme can produce duplicate messages.
  • Inside the message queue: Kafka stores messages on local disks, and to reduce random disk I/O it first writes them to the operating system's page cache and flushes them to disk at an appropriate time. For example, Kafka can be configured to flush after a certain time interval or after a certain number of messages have accumulated, i.e. asynchronous flushing. If the system has a low tolerance for message loss, deploy Kafka as a cluster with multiple replicas so that data is backed up and messages are lost as rarely as possible.
  • During consumption: the consumption progress must be committed only after a message has been received and fully processed, but this in turn can cause duplicates: if the consumer crashes right after processing a message and before committing, it will consume the same message again after restarting (a commit-after-processing sketch follows this list).
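A minimal sketch of commit-after-processing on the consumer side, with auto-commit disabled. The broker address, group id "points-service", and topic "order-events" are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "points-service");          // hypothetical consumer group
        props.put("enable.auto.commit", "false");         // disable auto-commit so we control progress
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    handle(record);        // run the business logic first ...
                }
                consumer.commitSync();     // ... and only then advance the consumption progress
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        System.out.printf("processing %s=%s%n", record.key(), record.value());
    }
}
```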

How do you know a message was lost?

  • Assign a globally unique ID to every message on the producer side, or attach a continuously increasing sequence number, and verify it on the consumer side. For the globally unique ID you can use a UUID or the Snowflake algorithm; see my earlier sharding article: MySQL partition sub-database sub-table and distributed cluster.

How to prevent message loss?

  • The interceptor mechanism can be used: an interceptor on the producer injects a message sequence number before sending, and an interceptor on the consumer checks the continuity of the sequence numbers (or the consumption status) after receiving. The advantage is that the detection code does not intrude into the business code, and lost messages can be located by a separate task (a sketch of such an interceptor follows this list).
  • Alternatively, persist each message to a database first, and only mark the record as completed (or delete it) after the consumer explicitly confirms that consumption finished. If a message is lost, fetch it from the database and consume it again. If the lack of atomicity across these steps is a concern, transactions can be used, at a higher cost.
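A minimal sketch of the producer-side half of this idea. The "msg-seq" header name is an illustrative choice, and the interceptor is assumed to be registered via the standard "interceptor.classes" producer property; the counter assumes a single producer instance:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

/**
 * Stamps every outgoing record with an increasing sequence number in a header,
 * so a matching consumer-side check (or a separate task) can detect gaps.
 */
public class SequenceInterceptor implements ProducerInterceptor<String, String> {

    private final AtomicLong sequence = new AtomicLong(); // assumption: one producer instance

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        record.headers().add("msg-seq",
                Long.toString(sequence.incrementAndGet()).getBytes(StandardCharsets.UTF_8));
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) { }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```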

How to handle messages that fail to be consumed?

  • Option one: the consumer deletes the current message only after the business task completes successfully; if processing fails, the message is kept. The drawback is that a message whose content is inherently bad will occupy the head of the queue forever. Option two: delete the message even when consumption fails, but record a failure count and put the message back at the end of the queue; once the count reaches a threshold, send an alarm for manual intervention (a sketch of this option follows).
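A sketch of the second option under stated assumptions: the retry counter travels in a hypothetical "retry-count" header, the failed message is re-published to the tail of the same topic, and MAX_RETRIES is an assumed threshold:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class RetryHandler {
    private static final int MAX_RETRIES = 3;              // assumed threshold
    private final KafkaProducer<String, String> producer;  // shared producer used for re-enqueueing

    public RetryHandler(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    /** Called when business processing of a record has thrown an exception. */
    public void onFailure(ConsumerRecord<String, String> record) {
        int retries = readRetryCount(record) + 1;
        if (retries >= MAX_RETRIES) {
            alertOps(record);                               // give up and ask for manual intervention
            return;
        }
        // Put the message back at the tail of the topic with an updated retry counter.
        ProducerRecord<String, String> retry =
                new ProducerRecord<>(record.topic(), record.key(), record.value());
        retry.headers().add("retry-count",
                Integer.toString(retries).getBytes(StandardCharsets.UTF_8));
        producer.send(retry);
    }

    private int readRetryCount(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("retry-count");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }

    private void alertOps(ConsumerRecord<String, String> record) {
        System.err.printf("giving up on %s after %d attempts%n", record.key(), MAX_RETRIES);
    }
}
```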

Message duplication problem

How to deal with duplicate messages? First of all, the message content must allow "idempotent" handling, that is, the result is the same no matter how many times the same message is consumed. Strong idempotence can be built on a unique business identifier, such as an order number or serial number.

Idempotence means that for an HTTP or RPC interface called with the same parameters, the result is identical no matter how many times the request is made; the outcome does not change with the number of calls. For example, if the operation sets the amount to 100 yuan, the final result is 100 yuan no matter how many times it is repeated; but if the operation is an increment, say add 20 yuan, every extra request adds another 20 yuan.

Common design schemes for idempotent interfaces:

  • Client-side submit limiting: disable the button after each submission.
  • Back-end logic layer: generate and store a unique ID, and check on every request whether that ID already exists; if it does, the request is a repeat, so return the result of the previous operation directly.
  • Token verification: the client requests a token before submitting; the same token is processed only once, and requests without a token, or reusing a token, are not processed.
  • Distributed locks, for example a Redis lock (SET with NX), to prevent other requests from repeating the operation (a sketch of a Redis-based guard follows this list).
  • Request queue: push requests through an MQ so they are processed one at a time in order.
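A minimal idempotency-guard sketch using the Jedis client, assuming a Redis instance at localhost:6379, that every request carries a unique business ID (e.g. an order number), and an illustrative "dedup:" key prefix and one-day expiry:

```java
import redis.clients.jedis.Jedis;

public class IdempotencyGuard {
    private final Jedis jedis = new Jedis("localhost", 6379); // assumed Redis address

    /** Returns true if this ID has not been processed before and is now claimed. */
    public boolean tryClaim(String bizId) {
        // SETNX succeeds only for the first request with this ID; duplicates get 0.
        long created = jedis.setnx("dedup:" + bizId, "1");
        if (created == 1) {
            jedis.expire("dedup:" + bizId, 24 * 3600); // keep the marker for a day
            return true;
        }
        return false; // duplicate request: return the result of the previous operation instead
    }
}
```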

Order of consumption

In a multi-node messaging architecture, if the consumer requires messages to arrive in order, how is ordered consumption achieved? For example, if the Producer sends 1->2->3, the Consumer should also receive 1->2->3. The basic idea: keep the same class of messages in a single partition and process them single-threaded.

  • Producer: send messages synchronously; message 1 must be confirmed before message 2 is sent. Asynchronous sending is not allowed, so the sending order is guaranteed (a sketch follows this list).
  • Server: a strict one-to-one pipeline Producer -> Kafka server -> Consumer certainly keeps messages in order, but then any problem in any link blocks the whole chain, and this single-channel model becomes a performance bottleneck.
  • Topic without partitioning: all messages of the topic go into a single queue. In a distributed setup, once a topic spans multiple partitions, the ordering of messages across partitions cannot be guaranteed.
  • Consumer: consume serially and forbid multi-threading, at the cost of performance and, to some extent, stability.
  • Use the message content: store an ID in each message; if the consumer finds the received IDs are not consecutive, the order is broken. However, the consumer then has to store and look up IDs, which also costs performance.
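A minimal sketch of ordered sending: synchronous sends, one message key so everything lands in the same partition, and at most one in-flight request so a retry cannot overtake a later message. The broker address, topic, and key are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("retries", "3");                                  // retransmit on transient failures
        props.put("max.in.flight.requests.per.connection", "1");    // keep retries from reordering messages

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 1; i <= 3; i++) {
                // Same key -> same partition, so messages 1, 2, 3 stay in order within that partition.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("order-events", "order-1001", "step-" + i);
                producer.send(record).get();   // synchronous: wait before sending the next one
            }
        }
    }
}
```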

Message backlog problem

A message backlog caused by a bug, whatever the reason, can be frightening; if the backlog cannot be drained for a long time, serious problems follow. Possible causes of a backlog:

  • Incomplete service monitoring: perhaps no backlog alarm was configured on the Kafka side. A small backlog caused by normal network jitter is expected, but a reasonable alarm threshold should be set based on normal operating levels, for example sending an alarm when the backlog exceeds 5,000 messages so that maintainers can react in time (a lag-check sketch follows this list).
  • The problem was hidden during off-peak hours: at low traffic the consumer lags a little but slowly catches up, so the impact is controlled; at peak traffic the consumer suddenly cannot keep up, the impact becomes uncontrollable, and the backlog and latency grow sharply.
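A rough lag-check sketch that compares a group's committed offsets with the latest broker offsets. The broker address, the monitored group name "points-service", and the 5,000 threshold (taken from the example above) are assumptions:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagMonitor {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "lag-monitor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {

            // Committed offsets of the business consumer group (hypothetical group name).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("points-service").partitionsToOffsetAndMetadata().get();
            // Latest offsets on the broker for the same partitions.
            Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());

            long totalLag = 0;
            for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
                if (e.getValue() == null) continue;      // no committed offset for this partition yet
                totalLag += end.get(e.getKey()) - e.getValue().offset();
            }
            if (totalLag > 5000) {                       // threshold from the alerting example above
                System.err.println("ALERT: consumer lag is " + totalLag);
            }
        }
    }
}
```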

What should be done when a backlog occurs?

  • If it is a sudden online incident, temporarily scale out to add more consumers; at the same time, some non-core business can be degraded, so that expansion plus degradation absorbs the traffic.
  • Troubleshoot and fix the underlying fault, for example by checking monitoring and logs to see whether the consumer's business logic has a problem, and then optimize the consumer-side processing.

How to avoid a message backlog? Possible improvements:

  • Prepare a plan B and add a queue compensation mechanism: when the queue has problems, messages can be forwarded another way, for example via a direct connection.
  • Reduce the release frequency, and watch the online metrics promptly after every release.
  • Establish an effective alerting mechanism: once the backlog reaches a certain level, notify the responsible people in time.
  • Most importantly, keep the code robust, especially on the consumer side.

Kafka cluster deployment

In a Kafka cluster, a Leader handles message writes and reads, and multiple Followers hold backup copies of the data. Among the followers there is a special set called the ISR (in-sync replicas); when the Leader fails, the new Leader is elected from the ISR. By default the Leader's data is replicated to the Followers asynchronously, so when the Leader loses power or crashes, a follower from the ISR takes over and consumption continues from it, which reduces the chance of message loss.

Because replication from Leader to Follower is asynchronous by default, messages that have not yet been copied to a Follower are still lost when the Leader goes down. To address this, Kafka offers the producer option "acks". When it is set to "all", every message must be written to the Leader and confirmed by all in-sync replicas before the send is considered successful; messages are then lost only if the Leader and all ISR replicas fail at the same time.
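A minimal producer configuration sketch for this setup. The broker list is an assumption, and the note about min.insync.replicas describes a broker/topic-side setting that is not shown here:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class DurableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // assumed cluster
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");   // wait for the leader and all in-sync replicas to confirm
        props.put("retries", "3");  // retransmit on transient failures (may produce duplicates)
        // Broker/topic side (not shown): min.insync.replicas=2 so "all" really means at least
        // two replicas; otherwise the send fails instead of silently degrading.
        return new KafkaProducer<>(props);
    }
}
```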

  • If you must ensure that no message is lost, it is recommended not to rely on synchronous flushing in the queue itself, but to solve it with the cluster: configure the send to succeed only after all ISR Followers have received the message.
  • If some message loss is tolerable, a cluster is not strictly required; even when deployed as a cluster, it is enough to configure success after only one follower acknowledges.
  • Business systems generally have some tolerance for lost messages; for example, if the message that grants points after an order is lost, the missing points can be compensated for the affected users later.
