Analysis of a RabbitMQ production failure

1. Cause of the problem

  The BI-collector-xx queue of one service became blocked, and this made the entire RabbitMQ cluster unavailable. The MQ producer services of several applications hung, so the failure reached much of the system and had a serious business impact. To restore service quickly, the operations team took a rough-and-ready approach: they purged a batch of blocked queues and then restarted the whole cluster.

While reviewing the failure afterwards, I was left with many doubts. At the very least, these questions needed answers:

  1. Why did the queue become blocked?
  2. Why did blocking in one queue affect other queues (that is, why do queues interfere with each other)?
  3. Why did a problem with one application's MQ queue make the application itself unavailable?

2. Test queue blocking

One weekend at home, I set up a test environment, installed RabbitMQ, and tried to reproduce the failure with simulation tests.

I wrote two demo applications (standing in for two separate projects), each with its own producer and consumer, using the queues testA and testB respectively.

To stay as close as possible to the production setup, the first tests used a single vhost; later tests used separate vhosts.

Producer A (sample code):

Consumer A:

MQ configuration:

Producer B, which produces 100,000 messages per run:

Consumer B. Its code is deliberately wrong, to simulate a failure: the message body is not a valid JSON string, so parsing it throws an exception.

First, look at how the RabbitMQ client establishes its connection. A Wireshark packet capture of the process looks like this:

A brief introduction to AMQP helps here. The capture shows the AMQP protocol method of each request. An AMQP method consists of a class name, a method name and parameters; this column mainly displays the class name and the method name.

  • Connection.Start: start negotiating the connection (the server sends this method once the client has connected)
  • Channel.Open: ask the server to open a channel
  • Queue.Declare: declare a queue
  • Basic.Consume: start a consumer and request messages from the specified queue

For the full list of methods, see the AMQP reference on the official website: https://www.rabbitmq.com/amqp-0-9-1-reference.html

Work process analysis:

Basic.Publish: the client sends a Basic.Publish method request, and the exchange on the RabbitMQ server routes the message to a queue according to the routing rules;

Basic.Deliver: the server sends a Basic.Deliver method request and delivers the message to the client consumer listening on the queue;

Basic.Ack: the client sends a Basic.Ack method request to tell the RabbitMQ server that the message has been received and processed.

After both applications start, their parameters and monitoring metrics can be observed in the RabbitMQ management console.

At first, application A produced and consumed normally.

Consumer B, with its faulty code, threw exceptions and logged errors continuously.

After about 30 minutes, the console of producer A's application also began to show exceptions.

The connection status on the server showed connections in the blocked state, which is very similar to the production failure.

At this point the client was my own machine: its CPU and memory usage had risen sharply, the fan was very loud, and it lagged noticeably. After another 30 minutes the applications were basically unusable.

Analysis of the cause

The errors above show that consumer B could not ack its messages, and the queue blocked because nothing was being acked. Why does this happen? It is in fact a protection mechanism of RabbitMQ: it prevents a flood of messages from overwhelming and crashing the consumer when traffic surges.

RabbitMQ provides a QoS (quality of service) feature. When messages are not acknowledged automatically, it limits the maximum number of unacknowledged messages that a consumer on a channel may hold. The limit is set through prefetchCount. In automatic acknowledgement mode the prefetchCount setting has no effect.

An analogy: think of a buffer placed in front of the consumer that can hold at most prefetchCount messages. While the buffer is not full, RabbitMQ delivers messages into it; once it is full, delivery stops. When the consumer acks a message, that message is removed from the buffer and a new one is put in.

Looking at the configuration above, I had set prefetch to only 2 and concurrency to only 1. After two bad messages were delivered, neither was acked because parsing failed. With the buffer full, RabbitMQ concluded that the consumer had no capacity left and stopped pushing messages to it, which blocked the queue.
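The buffer analogy can be sketched in a few lines of plain Java. This is only a toy model of the prefetch window, not the RabbitMQ client API; the class name PrefetchWindow and its methods are made up for illustration. It assumes a single consumer with prefetch = 2, as in my test configuration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Toy model of a prefetch window; it is not the RabbitMQ client API. */
final class PrefetchWindow {
    private final int prefetchCount;
    private final Deque<String> ready = new ArrayDeque<>(); // messages waiting in the queue
    private final Set<String> unacked = new HashSet<>();    // delivered but not yet acked

    PrefetchWindow(int prefetchCount) {
        this.prefetchCount = prefetchCount;
    }

    void publish(String message) {
        ready.addLast(message);
    }

    /** Delivers the next ready message, or returns null when the window is full or the queue is empty. */
    String deliver() {
        if (unacked.size() >= prefetchCount || ready.isEmpty()) {
            return null;
        }
        String message = ready.pollFirst();
        unacked.add(message);
        return message;
    }

    /** Acking a message frees one slot in the window. */
    void ack(String message) {
        unacked.remove(message);
    }

    int readyCount() { return ready.size(); }

    int unackedCount() { return unacked.size(); }

    public static void main(String[] args) {
        PrefetchWindow window = new PrefetchWindow(2); // prefetch = 2, one consumer
        for (int i = 1; i <= 5; i++) {
            window.publish("msg-" + i);
        }
        String first = window.deliver();
        String second = window.deliver();
        // Neither message is acked (parsing failed), so the window stays full.
        System.out.println("third delivery: " + window.deliver());
        System.out.println("ready=" + window.readyCount() + " unacked=" + window.unackedCount());
        window.ack(first); // a single ack frees a single slot
        System.out.println("after ack: " + window.deliver());
    }
}
```

As long as the two delivered messages stay unacked, deliver() keeps returning null and the remaining messages pile up as ready, which is the blocked state seen in the management console.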

Determining whether a queue is at risk of blocking

  When the ack mode is manual and unacked messages appear in production, there is no need to panic. QoS limits the maximum number of unacknowledged messages that the consumers on a channel can hold, so the number of unacked messages that may legitimately appear can be derived as channelCount * prefetchCount * number of consumer nodes.

channelCount is determined by concurrency and max-concurrency.

  • min = concurrency * prefetch * number of consumer nodes
  • max = max-concurrency * prefetch * number of consumer nodes

From this it follows that:

  • unacked_msg_count < min: the queue will not block, but the unacked messages still need to be dealt with promptly.
  • unacked_msg_count >= min: blocking may occur.
  • unacked_msg_count >= max: the queue is certainly blocked.
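The thresholds are simple arithmetic, so they are easy to turn into a quick check. The sketch below is only an illustration: the class UnackedRisk and the deployment numbers in main are hypothetical, and the unacked count would in practice be read from the management console.

```java
/** Classifies blocking risk from the min/max thresholds described above. */
final class UnackedRisk {
    enum Level { SAFE, MAY_BLOCK, BLOCKED }

    /** channelCount * prefetch * number of consumer nodes. */
    static int threshold(int channelCount, int prefetch, int consumerNodes) {
        return channelCount * prefetch * consumerNodes;
    }

    static Level classify(int unacked, int concurrency, int maxConcurrency,
                          int prefetch, int consumerNodes) {
        int min = threshold(concurrency, prefetch, consumerNodes);
        int max = threshold(maxConcurrency, prefetch, consumerNodes);
        if (unacked >= max) {
            return Level.BLOCKED;
        }
        if (unacked >= min) {
            return Level.MAY_BLOCK;
        }
        return Level.SAFE;
    }

    public static void main(String[] args) {
        // Hypothetical deployment: concurrency 2, max-concurrency 4, prefetch 10, 3 nodes.
        // min = 2 * 10 * 3 = 60, max = 4 * 10 * 3 = 120
        System.out.println(classify(30, 2, 4, 10, 3));
        System.out.println(classify(60, 2, 4, 10, 3));
        System.out.println(classify(120, 2, 4, 10, 3));
    }
}
```

With the test setup from this article (concurrency 1, prefetch 2, one node, and assuming max-concurrency is also 1), min and max are both 2, so two unacked messages already put the queue in the blocked category.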
Important notes

1. Unacked messages are automatically returned to the head of the queue when the consumer's connection is dropped (for example, on a restart) and it then reconnects.

2. If the ack mode is changed to auto, QoS no longer takes effect. Messages can then flood into the consumer, with the risk of bringing the consumer down.

Going back to the program configuration, I analysed it and made some adjustments.

I wrapped the faulty code in consumer B in try-catch-finally, so that whatever goes wrong in between, the message is still acknowledged (acked).

 

After the code change, both queues ran normally, and so did both client applications.

After a period of consumption, consumer B had worked through the accumulated backlog.

 

3. Analysis of the cause of the third problem

Again, look at the packet capture:

Basic.Reject: the client sends a Basic.Reject method request, indicating that it cannot process the message and rejects it. Here the requeue parameter is true, so the message is returned to the original queue;

Basic.Deliver: the server sends Basic.Deliver again. Unlike the first Basic.Deliver, the redelivered flag is now true, meaning that the message is being re-delivered to the consumer listening on the queue. These two steps then repeat.

When a RabbitMQ message listener throws an exception, the consumer sends Basic.Reject to the RabbitMQ server to indicate that the message is rejected. Because Spring's default-requeue-rejected setting defaults to true, the message is put back on the queue and the server re-delivers it. The result is effectively an infinite loop, which easily drives up resource usage on the consumer side, especially the number of TCP connections, threads and I/O. If some of the affected code holds transactions, database connections or similar resources without releasing them, those resources are exhausted and the application hangs (when this happens, the logs of the affected application show a large number of connection timeout errors).

Therefore, business scenarios that do not require strong data consistency (log collection, for example) can set default-requeue-rejected: false, for instance on the listener container factory:

factory.setDefaultRequeueRejected(false);

  The rejected message is then either discarded outright or, if a dead-letter-exchange is configured for the queue, routed to it.
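The difference between the two settings can be illustrated with a toy simulation in plain Java. It does not use the RabbitMQ client or Spring; the class RequeueLoop is invented for this sketch, the listener is assumed to fail on every delivery, and a delivery budget stands in for "runs until someone intervenes".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of reject/requeue; it does not use the RabbitMQ client. */
final class RequeueLoop {
    final Deque<String> queue = new ArrayDeque<>();
    final List<String> deadLetters = new ArrayList<>();

    /**
     * Delivers messages to a listener that always fails, for at most maxDeliveries deliveries.
     * Returns the number of deliveries that were made.
     */
    int run(boolean requeueRejected, int maxDeliveries) {
        int deliveries = 0;
        while (!queue.isEmpty() && deliveries < maxDeliveries) {
            String message = queue.pollFirst();
            deliveries++;
            try {
                // The listener always fails, like consumer B parsing invalid JSON.
                throw new IllegalStateException("cannot parse " + message);
            } catch (IllegalStateException e) {
                if (requeueRejected) {
                    queue.addFirst(message);  // reject with requeue=true: back on the queue
                } else {
                    deadLetters.add(message); // requeue=false: dropped or dead-lettered
                }
            }
        }
        return deliveries;
    }

    public static void main(String[] args) {
        RequeueLoop looping = new RequeueLoop();
        looping.queue.add("bad-1");
        // With requeue=true one bad message uses up the whole delivery budget.
        System.out.println("requeue=true deliveries: " + looping.run(true, 1000));

        RequeueLoop bounded = new RequeueLoop();
        bounded.queue.add("bad-1");
        System.out.println("requeue=false deliveries: " + bounded.run(false, 1000)
                + ", dead letters: " + bounded.deadLetters);
    }
}
```

With requeue enabled, a single poison message is delivered as many times as the budget allows; with requeue disabled it is delivered once and set aside.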

It is very important that the consumer side uses the manual-acknowledgement code structure correctly!

try {
    // Business logic.
} catch (Exception e) {
    // Log the error.
} finally {
    // Acknowledge (ack) the message.
}
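A slightly fuller sketch of this structure follows. To keep it self-contained, AckChannel is a hypothetical stand-in that only mirrors the shape of basicAck(long deliveryTag, boolean multiple) on the RabbitMQ Java client's Channel (the real method also declares IOException), and Handler is an invented placeholder for the business logic.

```java
import java.util.ArrayList;
import java.util.List;

final class AckInFinally {
    /** Hypothetical stand-in for the one Channel method this sketch needs. */
    interface AckChannel {
        void basicAck(long deliveryTag, boolean multiple);
    }

    /** Hypothetical business handler; it may throw on bad input. */
    interface Handler {
        void handle(String body) throws Exception;
    }

    static void onMessage(AckChannel channel, long deliveryTag, String body, Handler handler) {
        try {
            handler.handle(body); // business logic
        } catch (Exception e) {
            // Log the error so the failed message can be traced later.
            System.err.println("failed to process delivery " + deliveryTag + ": " + e);
        } finally {
            // Always ack, using the delivery tag that came with this message.
            channel.basicAck(deliveryTag, false);
        }
    }

    public static void main(String[] args) {
        List<Long> acked = new ArrayList<>();
        AckChannel channel = (tag, multiple) -> acked.add(tag);
        onMessage(channel, 1L, "{\"ok\":true}", body -> { });
        onMessage(channel, 2L, "not json", body -> {
            throw new IllegalArgumentException("bad json");
        });
        System.out.println("acked tags: " + acked);
    }
}
```

Both deliveries end up acked, including the one whose handler threw. Note that a message acked this way is gone from the queue, so the error log (or a dead-letter route) is the only trace of the failure.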

4. Verifying the queue maximum length limit

Set the queue's maximum length limit (queueLengthLimit) with x-max-length=5.

The producer tries to produce 10 messages.

Because of the queue's maximum length limit, only 5 messages actually remain in the queue.

The consumer receives only 5 messages, No.6 to No.10.
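This result is what one would expect if the queue drops its oldest message whenever a new one arrives at a full queue (drop-from-head overflow, which is RabbitMQ's default). A minimal sketch under that assumption, with an invented BoundedQueue class standing in for the broker queue:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of a length-limited queue that drops from the head on overflow. */
final class BoundedQueue {
    private final int maxLength;
    private final Deque<String> ready = new ArrayDeque<>();
    private int dropped = 0;

    BoundedQueue(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Never fails for the publisher; the oldest message is silently dropped when the queue is full. */
    void publish(String message) {
        if (ready.size() >= maxLength) {
            ready.pollFirst();
            dropped++;
        }
        ready.addLast(message);
    }

    List<String> contents() {
        return new ArrayList<>(ready);
    }

    int droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        BoundedQueue queue = new BoundedQueue(5); // corresponds to x-max-length=5
        for (int i = 1; i <= 10; i++) {
            queue.publish("NO." + i);
        }
        System.out.println(queue.contents() + " dropped=" + queue.droppedCount());
    }
}
```

Publishing ten messages leaves NO.6 to NO.10 in the queue and counts five dropped, while publish itself never reports an error to the producer.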

Next, change the programs so that the producer keeps generating messages while the consumer clearly cannot keep up with the producer's rate.

From the consumer's point of view, the messages that get through look random. The queue never holds more than 5 messages; no matter how many are sent, the excess does not stay in the queue. The producer sees no exception, but messages are lost (not every message reaches the consumer). RabbitMQ's default overflow behaviour drops the oldest messages from the head of the queue, which is why No.6 to No.10 survived in the previous test.

The system runs fine and raises no exceptions, apart from the fact that not all messages are queued.

Consumption is relatively slow, but the CPU and memory metrics of the machine are normal.

If messages become unacked because of an exception, the maximum queue length limit does not count them, as shown in the figure below.

After the exception, the MQ management console shows the following:

The producer kept producing messages for 30 minutes. The producer application looked normal, but the messages could not be put into the queue.

 

 

5. Check the actual business-side code

Now look at the consumer code in our business systems. It contains a variety of non-standard patterns on the consumer side. Here are a few typical examples:

1. Manual acknowledgement with an ACK, but no try-catch-finally structure. The consumer-side business code is as follows:

2. A try-catch-finally structure is present, but the deliveryTag is hard-coded to 0, which causes problems.

3. Automatic acknowledgement: with a large volume of messages, it can easily bring down the consumer application.

 

6. Summary

  • Automatic ack mode is not recommended in a production environment, because it prevents QoS from taking effect.
  • With manual ack, take great care that every message is acknowledged. Business code should use a try-catch-finally structure so that the ack is not skipped when the business logic throws.
  • Standardize the MQ client code and configure RabbitMQ correctly.
  • Using separate vhosts for different business projects isolates some of the impact and improves RabbitMQ resource usage.
  • Consider configuring a dead-letter-exchange. With requeue=false, rejected messages can be routed to the dead-letter-exchange, which makes problems quicker to locate.
  • A queue's maximum length can be limited by the number of messages (parameter: x-max-length), by the total number of bytes (parameter: x-max-length-bytes; this counts the bytes of all message bodies and ignores message properties and headers), or by both, in which case whichever limit is reached first applies. Only messages in the ready state are counted; unacknowledged messages do not count toward the limit. A maximum length can throttle the producing side, but it carries the risk of losing messages, and limiting the number of messages cannot completely solve the queue blocking problem.
  • Prefer direct exchanges where possible. The direct type is the fastest at delivering messages.
    • Direct: the routing key is matched exactly. A queue is bound to the exchange with a specific routing key, and a message is forwarded only if its routing key is identical. If a queue is bound with routing key "A", only messages with routing key "A" are forwarded to it; messages with routing key "B" are not;
    • Topic: the routing key is matched against a pattern, so the queue is bound to the exchange with a pattern. The symbol "#" matches zero or more words, and the symbol "*" matches exactly one word;
    • Fanout: the routing key is ignored. Simply bind the queue to the exchange; a message sent to this type of exchange is broadcast to all queues bound to it;
    • Headers: the routing key is ignored, and matching is based on the headers attribute of the message. A set of key-value pairs is specified when the queue is bound to the exchange; when a message arrives, RabbitMQ compares its headers with those key-value pairs. If they match completely, the message is routed to the queue; otherwise it is not.
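To make the direct and topic rules concrete, here is a small matcher written from the rules in the list above. It is a sketch for illustration only: the class ExchangeRouting is invented, it is not how the broker implements routing, and it assumes words in a routing key are separated by dots.

```java
/** Illustrative matcher for direct and topic bindings; not the broker's implementation. */
final class ExchangeRouting {
    /** Direct exchange: binding key and routing key must be identical. */
    static boolean directMatches(String bindingKey, String routingKey) {
        return bindingKey.equals(routingKey);
    }

    /** Topic exchange: "*" matches exactly one word, "#" matches zero or more words. */
    static boolean topicMatches(String pattern, String routingKey) {
        return match(pattern.split("\\."), 0, routingKey.split("\\."), 0);
    }

    private static boolean match(String[] p, int i, String[] k, int j) {
        if (i == p.length) {
            return j == k.length; // pattern used up: the key must be used up too
        }
        if (p[i].equals("#")) {
            // Try letting "#" absorb zero, one, two, ... of the remaining words.
            for (int next = j; next <= k.length; next++) {
                if (match(p, i + 1, k, next)) {
                    return true;
                }
            }
            return false;
        }
        if (j == k.length) {
            return false; // pattern still expects a word but the key has none left
        }
        if (p[i].equals("*") || p[i].equals(k[j])) {
            return match(p, i + 1, k, j + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(directMatches("A", "B"));
        System.out.println(topicMatches("order.*", "order.created"));
        System.out.println(topicMatches("order.*", "order.created.eu"));
        System.out.println(topicMatches("order.#", "order.created.eu"));
    }
}
```

The direct check is a single string comparison, while the topic check has to walk the pattern word by word, which fits the advice to prefer direct exchanges when pattern matching is not needed.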


Origin blog.csdn.net/weixin_45925028/article/details/132969251