Low Cost and Small Error: Ctrip's Practice of a Kafka-Based Serverless Delay Queue

About the Author

Pin focuses on cloud-native technologies such as RPC, Service Mesh, and Serverless.

1. Background

As our cloud project advances, a large number of applications need to be deployed on AWS, and many of them rely on a delay queue. On AWS we have chosen Kafka as the message queue, but Kafka itself does not support delay queues, so we needed to work out how to implement one on top of Kafka.

2. Requirements

After going through all the scenarios that need a delay queue, we summarized the following characteristics:

  • The delay time is not fixed. Some topics need to support a 5-minute delay, while others need to support a 7-day delay.

  • The volume of delayed messages is relatively small. Across all scenarios, no more than 100 million delayed messages are produced per day, and each message is no larger than 1 MB.

  • Delayed messages must not be lost, but ordering does not need to be guaranteed.

  • The delay error must be small. Delay error is the difference between the time a message is actually consumed and the time it was expected to be consumed. Based on the business scenarios we surveyed, the delay error must stay within 2 seconds.

  • The production peak of delayed messages is high. Businesses often create 10 million delayed messages at once, all with the same delay duration.

3. Goals

There are many ways to implement a delay queue, so on top of meeting the requirements we set several goals: low cloud cost, low operations cost, low development cost, high stability, and small delay error.

4. Product selection

The message queue products available on AWS include RabbitMQ, Apache ActiveMQ, and SQS. For RabbitMQ and Apache ActiveMQ, AWS mainly hosts their installation and deployment rather than offering them as Serverless services. Moreover, we have already chosen Kafka as our message queue; replacing it just to gain delay queue functionality would clearly be far too costly.

AWS also provides SQS, which supports delay queues. Although SQS is Serverless, it has its own limitation: it supports a delay of at most 15 minutes, which clearly cannot meet our needs.

Clearly, no existing cloud product meets our needs out of the box, so we began studying how delayed messages are implemented elsewhere to see whether a small amount of development could get us there.

5. Solution research

There are many delay queue implementations in the industry. We briefly analyzed the common ones, as follows:

5.1 RabbitMQ

RabbitMQ implements delayed messages with TTL plus a dead-letter queue: a TTL is set on the message, and if the TTL expires before the message is consumed, the message is routed to the dead-letter queue. There are two types of TTL:

  • Queue-level TTL: a single TTL shared by all messages in the queue

  • Message-level TTL: Each message can have a different TTL, but there is a head-of-line blocking problem

The advantage of this solution is that it is simple to implement, but the delay error is uncertain.
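For illustration only (not part of the original setup), here is a minimal sketch of the TTL plus dead-letter pattern using the Go amqp091 client; the connection URL, queue name, and exchange name are placeholders.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Messages that expire in "delay.queue" are dead-lettered to the "target"
	// exchange, where an ordinary consumer picks them up.
	if _, err = ch.QueueDeclare("delay.queue", true, false, false, false, amqp.Table{
		"x-dead-letter-exchange": "target", // hypothetical exchange name
	}); err != nil {
		log.Fatal(err)
	}

	// Message-level TTL: each message carries its own delay (5 minutes here).
	// Note the head-of-line blocking issue: a message is only dead-lettered
	// once it reaches the head of the queue.
	if err = ch.Publish("", "delay.queue", false, false, amqp.Publishing{
		Body:       []byte(`{"orderId":"123"}`),
		Expiration: "300000", // TTL in milliseconds, passed as a string
	}); err != nil {
		log.Fatal(err)
	}
}
```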

5.2 Apache ActiveMQ

Apache ActiveMQ implements delayed messages with scheduled delivery: a delay time or a cron expression describes the delivery strategy, and a Java Timer-based scheduler stores messages hierarchically in files and memory.

The advantage of this solution is that it is simple to implement and the delay error is controllable, but it may occupy a lot of memory.

5.3 RocketMQ

RocketMQ implements delayed messages with timed scheduling plus delay levels: a delayed message is sent to the queue of the specified delay level (there are 18 levels in total), and a timer polls these ConsumeQueues to achieve the delay. The specific flow is as follows:

  • Rewrite the message's topic name and queue information and deliver it to the ConsumeQueue of the corresponding delay level.

  • ScheduleMessageService consumes the messages in ConsumeQueue and re-delivers them to CommitLog.

  • The messages in the CommitLog are then delivered to the target topic, where consumers consume them.

The advantage of this solution is that the delay error is controllable, but the implementation is complicated.

5.4 Redis

There are many ways to implement delay queues based on Redis, two of which are briefly described here:

1) Regular polling

The general steps of the program are as follows:

  • Use the message's delayed timestamp as the zset score and the message ID as the zset member.

  • Use the message ID as a key and store the serialized message body as its value in Redis.

  • Poll the zset periodically; when a message's timestamp is not later than the current time, push it to a Redis List for consumers to consume (a sketch follows below).
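A minimal sketch of the polling approach above, using the go-redis client; the key names and the use of Unix-second scores are assumptions.

```go
package delaydemo

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

const zsetKey = "delay:index" // score = due timestamp (seconds), member = message id

// Produce stores the body under its own key and indexes the id by due time in a zset.
func Produce(ctx context.Context, rdb *redis.Client, id, body string, dueAt time.Time) error {
	if err := rdb.Set(ctx, "delay:body:"+id, body, 0).Err(); err != nil {
		return err
	}
	return rdb.ZAdd(ctx, zsetKey, redis.Z{Score: float64(dueAt.Unix()), Member: id}).Err()
}

// PollOnce moves every due message id onto a list that consumers read from.
func PollOnce(ctx context.Context, rdb *redis.Client) error {
	now := fmt.Sprint(time.Now().Unix())
	ids, err := rdb.ZRangeByScore(ctx, zsetKey, &redis.ZRangeBy{Min: "-inf", Max: now}).Result()
	if err != nil {
		return err
	}
	for _, id := range ids {
		if err := rdb.LPush(ctx, "delay:ready", id).Err(); err != nil {
			return err
		}
		// Remove from the index only after the id has been handed over.
		rdb.ZRem(ctx, zsetKey, id)
	}
	return nil
}
```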

2) Key expiration listener

Set an expiration time on each message, listen for the key-expiration event, and then deliver the message to the target topic.

The advantage of implementing a delay queue based on Redis is that it is simple to implement, but messages may be lost and the storage cost is high.

6. Implementation plan

Since no single cloud product meets our needs, the only option left was to implement the Kafka-based delay queue with a small amount of development combined with the features of cloud products. We considered the following implementations:

6.1 RabbitMQ or Apache ActiveMQ

Both RabbitMQ and Apache ActiveMQ are available on AWS and could meet the requirements functionally. However, our message queue is already built on Kafka, and combining it with RabbitMQ or Apache ActiveMQ for the delay feature has one main problem: we have little operational experience with either product, and AWS only manages them at the deployment level, so when problems arise our developers would have to troubleshoot them themselves. This option was therefore not considered.

6.2 Multi-level queue based on SQS

Since SQS already supports delays of up to 15 minutes, could a longer delay be built out of multiple levels of delay queues? The idea is as follows (a rough sketch follows the list):

  1. Add a field times to the delayed message to indicate which round it is currently in (borrowing the idea of the timing-wheel algorithm).

  2. If the delay time of the delayed message is less than 15 minutes, set the times of the delayed message to 0 and deliver it directly to SQS.

  3. If the delay time of the delayed message is greater than 15 minutes, calculate the value of times (delay time/15 minutes), and then deliver it directly to SQS.

  4. When the Consumer receives a delayed message from SQS and times is greater than 0, it decrements times by 1 and delivers the message to SQS again, repeating until times reaches 0.

  5. When the Consumer receives a delayed message from SQS and times is 0, the delay has elapsed and the Consumer delivers the message directly to the corresponding target topic.
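A rough sketch of the round-splitting logic described above; the DelayedMsg envelope and the Queue interface are hypothetical stand-ins for the SQS producer and consumer.

```go
package multilevel

import "time"

const maxDelay = 15 * time.Minute // SQS per-message delay ceiling

// DelayedMsg is a hypothetical envelope; Times counts the remaining 15-minute rounds.
type DelayedMsg struct {
	Body  []byte
	Times int
}

// Queue abstracts SQS: Send enqueues a message with a per-message delay,
// Forward delivers a due message to its target Kafka topic.
type Queue interface {
	Send(msg DelayedMsg, delay time.Duration) error
	Forward(msg DelayedMsg) error
}

// Produce splits a long delay into full 15-minute rounds plus a remainder (steps 2 and 3).
func Produce(q Queue, body []byte, delay time.Duration) error {
	times := int(delay / maxDelay)
	remainder := delay % maxDelay
	return q.Send(DelayedMsg{Body: body, Times: times}, remainder)
}

// OnReceive handles a message that has become visible in SQS (steps 4 and 5).
func OnReceive(q Queue, msg DelayedMsg) error {
	if msg.Times > 0 {
		msg.Times--
		return q.Send(msg, maxDelay) // wait another full round
	}
	return q.Forward(msg) // delay reached: hand over to the target topic
}
```

For example, producing a message with a 40-minute delay enqueues it with a 10-minute initial delay and times = 2, so it is re-queued twice for 15 minutes each before being forwarded; every extra round is another billed SQS request, which is exactly the cost problem described below.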

This solution does implement the delay queue, and since SQS itself is Serverless, the maintenance cost is relatively low.

However, when we looked into SQS pricing, we found that SQS charges mainly by the number of messages, so the longer the delay, the more the message count is amplified. In our actual business the delay is rarely within 15 minutes and is usually between 1 hour and 7 days, so this solution is not feasible.

6.3 Based on SQS and a timed scheduling strategy

The biggest problem with the SQS multi-level queue is cloud cost, or more precisely storage cost: the solution keeps every delayed message in SQS, and that is the main driver of the cost. So could we instead write messages with a delay greater than 15 minutes to low-cost storage, and only query them and deliver them to SQS once their remaining delay is under 15 minutes? That way the length of the delay no longer affects the SQS cost; we only need to pick a Serverless product with low storage cost that is easy to read and write as the store for delayed messages.

Based on this idea, we designed a solution built on SQS and a timed scheduling strategy:

(Figure: initial architecture based on SQS and timed scheduling)

The specific process is as follows:

  1. Normal messages from Producers are delivered directly to the target Kafka topic; delayed messages are delivered to a dedicated Delay Message Topic in Kafka.

  2. The Consumer consumes messages from the Delay Message Topic. If a message's delay is less than 15 minutes, it is delivered directly to SQS (Delay Queue); if the delay is greater than 15 minutes, the message is written to the Message Store.

  3. The Scheduler periodically scans the messages in the Message Store; any message whose remaining delay is less than 15 minutes is delivered to SQS (Delay Queue). The Scheduler is triggered through EventBridge.

  4. The Emitter consumes messages from SQS (Delay Queue) and delivers them to the target topic.

The whole process is not complicated, and the AWS services involved are all Serverless. However, with this many components involved, troubleshooting becomes more complicated.

To address this, we improved and simplified the solution as follows:

(Figure: simplified architecture with the logic consolidated into the Service cluster)

The specific process is as follows:

  1. Normal messages from Producers are delivered directly to the target Kafka topic; delayed messages are delivered to a dedicated Delay Message Topic in Kafka.

  2. The Service consumes messages from the Delay Message Topic. If a message's delay is less than 15 minutes, it is delivered directly to SQS (Delay Queue); if the delay is greater than 15 minutes, the message is written to the Message Store (a routing sketch follows this list).

  3. The Service periodically scans the messages in the Message Store; any message whose remaining delay is less than 15 minutes is delivered to SQS (Delay Queue).

  4. The Service consumes messages from SQS (Delay Queue) and delivers them to the target topic.
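The routing decision in step 2 might look roughly like this; MessageStore and DelayQueue are hypothetical interfaces standing in for DynamoDB and the SQS delay queue.

```go
package service

import "time"

const sqsMaxDelay = 15 * time.Minute

// MessageStore and DelayQueue are placeholders for the DynamoDB table and the SQS delay queue.
type MessageStore interface {
	Put(dueAt time.Time, id string, body []byte) error
}

type DelayQueue interface {
	Send(body []byte, delay time.Duration) error
}

// Route sends short delays straight to SQS and parks long delays in the
// Message Store until the scheduling loop picks them up.
func Route(store MessageStore, queue DelayQueue, id string, body []byte, dueAt time.Time) error {
	remaining := time.Until(dueAt)
	if remaining < sqsMaxDelay {
		return queue.Send(body, remaining)
	}
	return store.Put(dueAt, id, body)
}
```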

The simplified solution concentrates the Consumer, Emitter, and Scheduler logic in a single Service, which is deployed as a cluster. With all the logic in one place, troubleshooting becomes much more convenient. Once the general direction was settled, the following details still needed to be worked out:

1) How messages are stored

The Message Store's main job is to hold delayed messages whose delay exceeds 15 minutes and to serve time-based queries from the Scheduler. There are many Serverless storage services; after research we settled on DynamoDB.

In DynamoDB, the partition key is the message's delivery time and the sort key is the message id, so every message can be located uniquely by partition key plus sort key without conflicts. When querying, we only need to fetch all the messages under the partition key for a given time slot, and there are no hot spots or uneven partitions.

For example, if the partition key is 1677400776 (the second-precision timestamp for 2023-02-26 16:39:35), then all messages under that partition key are delayed messages due between 2023-02-26 16:39:35 and 2023-02-26 16:39:36. Because each message has a unique message id, using the message id as the sort key causes no conflicts. The Scheduler only needs to pass in the timestamp to query in order to pull all messages in that time slot; if the query returns nothing, there are no delayed messages in that slot.

We also set a TTL on the DynamoDB items so that data is deleted automatically. The TTL is 24 hours longer than the delay time, mainly to make troubleshooting easier; once a delayed message in DynamoDB has been delivered to SQS, we call the API to delete the item. Besides the keys, each item also stores the topic, the message body, and other information.
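As a sketch of this layout with the AWS SDK for Go v2 — the table name delay_messages and the attribute names due_second, message_id, topic, body, and expire_at are our own placeholders, not from the article:

```go
package store

import (
	"context"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

const tableName = "delay_messages" // hypothetical table name

// PutDelayed writes one delayed message. Partition key: the due timestamp in
// seconds; sort key: the unique message id; expireAt is the TTL attribute
// (due time + 24h, as described above).
func PutDelayed(ctx context.Context, db *dynamodb.Client, dueSecond int64, msgID, topic string, body []byte, expireAt int64) error {
	_, err := db.PutItem(ctx, &dynamodb.PutItemInput{
		TableName: aws.String(tableName),
		Item: map[string]types.AttributeValue{
			"due_second": &types.AttributeValueMemberN{Value: strconv.FormatInt(dueSecond, 10)},
			"message_id": &types.AttributeValueMemberS{Value: msgID},
			"topic":      &types.AttributeValueMemberS{Value: topic},
			"body":       &types.AttributeValueMemberB{Value: body},
			"expire_at":  &types.AttributeValueMemberN{Value: strconv.FormatInt(expireAt, 10)},
		},
	})
	return err
}

// QueryDue returns every message whose partition key equals the given second.
func QueryDue(ctx context.Context, db *dynamodb.Client, dueSecond int64) (*dynamodb.QueryOutput, error) {
	return db.Query(ctx, &dynamodb.QueryInput{
		TableName:              aws.String(tableName),
		KeyConditionExpression: aws.String("due_second = :s"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":s": &types.AttributeValueMemberN{Value: strconv.FormatInt(dueSecond, 10)},
		},
	})
}
```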

2) Single point of failure

The single point of failure concern is this: when scanning DynamoDB for delayed messages whose delay exceeds 15 minutes, if the Scheduler that receives the scan notification fails, the messages for that time slot are never delivered to SQS and are lost. Since the Scheduler's functions are now built into the Service, and the Service is deployed as a cluster, the Scheduler itself is no longer a single point of failure.

But another problem still has to be solved: how do we ensure that only one Scheduler in the cluster scans the data in DynamoDB at a time, while the other Schedulers can take over if that one fails?

To solve this we use an SQS FIFO queue. SQS offers two queue types, Standard and FIFO. A FIFO queue strictly guarantees message order and supports message visibility: for a period of time a message is visible to only one consumer and inaccessible to the others. A FIFO queue also supports deduplication. With these features, the single point of failure is easy to handle. The implementation is as follows:

  1. A Timer in each Service instance delivers a notification message to the SQS FIFO queue once a minute. The body of the notification is the current timestamp, accurate to the minute. So even if n Timers deliver n such messages in the same minute, only one is accepted into the FIFO queue; the other n-1 are dropped by its deduplication.

  2. The visibility timeout of the FIFO queue is set to 5 minutes (configurable), which guarantees that only one Scheduler consumes a given notification within those 5 minutes; if that Scheduler fails, another Scheduler can pick the notification up. When a Scheduler consumes a notification, it converts the message body into a timestamp, queries DynamoDB for all messages in that time range, adjusts each message's remaining delay, delivers them to the SQS Standard queue, and finally deletes the notification from the FIFO queue.

This scheme handles the single point of failure well.
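A sketch of the minute-level notification, assuming the AWS SDK for Go v2; the queue URL and message group id are placeholders. The deduplication id is the minute-aligned timestamp, which is what lets the FIFO queue discard duplicate ticks; the 5-minute visibility timeout is configured on the queue itself rather than in code.

```go
package scheduler

import (
	"context"
	"strconv"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
)

// placeholder queue URL
const fifoQueueURL = "https://sqs.us-east-1.amazonaws.com/123456789012/delay-ticks.fifo"

// Tick is called by every Service replica once a minute. Because the
// deduplication id is the minute-aligned timestamp, the FIFO queue keeps only
// one notification per minute no matter how many replicas send one.
func Tick(ctx context.Context, client *sqs.Client, now time.Time) error {
	minute := strconv.FormatInt(now.Truncate(time.Minute).Unix(), 10)
	_, err := client.SendMessage(ctx, &sqs.SendMessageInput{
		QueueUrl:               aws.String(fifoQueueURL),
		MessageBody:            aws.String(minute),
		MessageGroupId:         aws.String("scheduler"),
		MessageDeduplicationId: aws.String(minute),
	})
	return err
}
```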

3) Message loss problem

Because both the Timer and the Scheduler run inside the clustered Service, neither is a single point of failure. Moreover, the SQS FIFO queue keeps the notifications strictly ordered, so messages are not lost. The only possible issue is excessive delay when there is a large message backlog.

4) How to query delayed messages

A message the Scheduler queries must have a remaining delay of less than 15 minutes, so after receiving a notification and converting it into the corresponding timestamp, the Scheduler queries for messages at that timestamp plus 14 minutes (keeping the remaining delay under SQS's 15-minute limit).
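One plausible reading of this step, sketched below: a notification for minute T covers the per-second partitions in the minute starting at T + 14 minutes, and each message found there goes to SQS with its remaining delay, which by construction is under 15 minutes.

```go
package scheduler

import "time"

// queryWindow returns the one-minute range of per-second partition keys that a
// notification for minute T should scan (an assumed interpretation of the text).
func queryWindow(notify time.Time) (from, to time.Time) {
	from = notify.Add(14 * time.Minute)
	return from, from.Add(time.Minute)
}

// remainingDelay clamps a message's residual delay to what SQS accepts.
func remainingDelay(dueAt, now time.Time) time.Duration {
	if d := dueAt.Sub(now); d > 0 {
		return d
	}
	return 0
}
```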

5) How the Service is deployed

The Service is deployed with ECS + Fargate. The entire deployment is driven by Terraform scripts that create resources such as CodePipeline, DynamoDB, SQS, and ECS, so all resources are defined as code. The deployment design follows the GitOps approach.

After a comprehensive evaluation of the options, we chose the solution based on SQS and a timed scheduling strategy to implement delayed messages.

6.4 Performance optimization

While running the solution above in practice, we made many optimizations, roughly summarized as follows:

1) Message backlog

When consumption capacity is insufficient, delayed messages waiting to be processed build up into a backlog. We tackle this from the following angles:

  • The Delay Message Topic is set to 64 partitions. Kafka consumption capacity can be increased by adding consumers, provided that the number of partitions is greater than or equal to the number of consumers.

  • Reduce the resource size of each Service instance and increase the number of Service replicas. The Service cluster consumes the Delay Message Topic, so the more replicas, the greater the consumption capacity.

2) WCU and RCU in DynamoDB

A large part of DynamoDB's cost is driven by WCU and RCU. WCU measures the number of writes per unit time, and RCU measures the number of reads per unit time. If writes per unit time exceed the WCU limit, writes fail; reads that exceed the RCU limit fail in the same way.

Setting WCU and RCU to peak values would certainly avoid read and write failures, but it would waste a lot of money. Instead, we let WCU and RCU scale up and down dynamically and retry any request that fails while capacity is scaling up. After tuning the relevant parameters, we reached a good balance.

3) ECS scaling settings

The smallest running unit in ECS is a task, and we want tasks to scale out quickly and scale in slowly. The biggest obstacle to fast scale-out is the time it takes to start the Service; since the Service is written in Go, starting a new task basically completes within 8 seconds. Scaling is driven by peak CPU usage: each scale-out adds 4 tasks, and each scale-in removes 1 task.

4) Smoothing message processing

Because the write peak into the Delay Message Topic can be quite large, consuming those messages too quickly would put heavy read and write pressure on DynamoDB. We therefore limit how many messages each Service instance consumes from the Delay Message Topic. Even though multiple Service instances consume in parallel, each instance writes only a small number of messages, so writes to DynamoDB stay smooth instead of arriving in one large burst, and the probability of write failures drops significantly.
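The article does not describe the throttling mechanism itself; one common way to do this in Go is a token-bucket limiter in front of the DynamoDB writes, as in this sketch (the rate and burst values are placeholders):

```go
package service

import (
	"context"

	"golang.org/x/time/rate"
)

// writeLimiter caps how fast a single Service replica writes to DynamoDB:
// at most 50 items per second with a small burst (placeholder numbers).
var writeLimiter = rate.NewLimiter(rate.Limit(50), 10)

// throttledWrite blocks until a token is available, then performs the write.
// Spreading the writes this way keeps each replica's WCU usage smooth even
// when a large batch of delayed messages arrives at once.
func throttledWrite(ctx context.Context, write func(context.Context) error) error {
	if err := writeLimiter.Wait(ctx); err != nil {
		return err
	}
	return write(ctx)
}
```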

6.5 Practical results

The solution has been running stably in production for 6 months, and all metrics are healthy. Below is the data from the last 4 weeks.

1) Delayed message success rate

(Figure: delayed message success rate)

As shown in the figure above, the success rate of delayed messages with a delay error within 2 seconds is basically 100%.

2) Number of delayed messages

(Figure: number of delayed messages)

As shown in the figure above, the peak value of delayed messages reaches 150,000 within 5 minutes, which means that the peak value is 500 delayed messages per second.

3) DynamoDB performance indicators

(Figure: DynamoDB performance metrics)

The PutItem ThrottledRequests metric shows no write failures when writing messages to DynamoDB, and the QueryThrottledRequests metric shows no failed queries. The QueryReturnedItemCount metric shows a peak of 3,350 delayed messages returned within 5 minutes, i.e. fewer than 60 messages per second; this is because we buffer writes in the Service, which reduces concurrent read and write pressure.

4) Kafka message backlog

(Figure: Kafka message backlog)

As shown in the figure above, the peak message backlog in Kafka is 60,000 within 5 minutes, and the backlog of messages can be consumed quickly.

5) Timer performance indicators

(Figure: Timer metrics)

The Timer in each Service replica delivers a message to the SQS FIFO queue every minute, so the number of messages sent equals the number of Service replicas. The figure shows at most 300 messages delivered within 5 minutes (the Service scales to at most 64 replicas), but only 5 messages were actually received in those 5 minutes, i.e. one per minute, because the rest were removed by the FIFO queue's deduplication.

7. Summary

Because the implementation is entirely Serverless, the maintenance cost is very low. The development is somewhat complex, but that is a one-time investment. Judging from the data of recent months, the cloud cost is under US$200 per month, the delay error is small, and the system has run stably so far.
