How does RocketMQ handle 150 billion messages per day?

Tongcheng Yilong's air ticket, train ticket, bus ticket, and hotel businesses are all connected to RocketMQ, which shaves traffic peaks and reduces pressure on back-end systems.



At the same time, RocketMQ decouples conventional systems by turning some synchronous processing into asynchronous processing, handling 150 billion messages every day.


At a recent Apache RocketMQ Meetup, Cha Jiang, an architect in Tongcheng Yilong's ticketing division, shared how the company's messaging system handles 150 billion messages per day.


Through this article, you will learn:

  • Tongcheng Yilong messaging system usage

  • Application scenarios of Tongcheng Yilong messaging system

  • Pitfalls encountered along the way

  • Improvements based on RocketMQ


Tongcheng Yilong messaging system usage


[Figure: RocketMQ cluster architecture]

The RocketMQ cluster consists of two parts: Name Servers and Brokers. The Name Servers run in dual-master mode, one reason being performance and the other being safety. The Brokers, which hold the actual data, are divided into many groups, and each group consists of a Master and a Slave.
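As a minimal sketch of how a client talks to the dual-master Name Servers (the addresses below are hypothetical, not the production ones), both Name Server addresses are simply listed, separated by a semicolon, so either can serve routing information:

```java
import org.apache.rocketmq.client.producer.DefaultMQProducer;

public class DualNameServerExample {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("example_producer_group");
        // Two Name Servers in dual-master mode; hypothetical addresses
        producer.setNamesrvAddr("ns1.example.com:9876;ns2.example.com:9876");
        producer.start();
        // ... send messages ...
        producer.shutdown();
    }
}
```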


At present, our air ticket, train ticket, bus ticket, and hotel businesses are all connected to RocketMQ, which shaves traffic peaks and reduces pressure on back-end systems.


At the same time, it decouples conventional systems by turning some synchronous processing into asynchronous processing, handling 150 billion messages every day.


The reasons for choosing RocketMQ are:

  • Easy to integrate; only a few Java dependencies need to be introduced

  • Developed in pure Java, with a clear design

  • Overall performance is stable, and it holds up even with a large number of topics


Application scenarios of Tongcheng Yilong messaging system


Unsubscribe system


The first scenario is our unsubscribe system. The user clicks the unsubscribe button on the front end, the system calls its own unsubscribe interface, which in turn calls the supplier's unsubscribe interface to complete the operation.

[Figure: unsubscribe system message flow]

If the supplier's interface is unreliable, the unsubscribe can fail; with a synchronous design, the user would have to click the button and retry.


Therefore, we introduced RocketMQ to turn the synchronous flow into an asynchronous one. The front-end user sends an unsubscribe request; when the unsubscribe system receives it, it records the request in its own database and marks the user as unsubscribing.


At the same time, an unsubscribe message is sent through the message engine to the system that integrates with the supplier, which then calls the supplier's interface.


If the call succeeds, the record in the database is marked as successfully unsubscribed. We also added a compensation script that picks up records still marked as unsubscribing and retries them, so that a lost message does not leave an unsubscribe unfinished.
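A minimal sketch of the asynchronous hand-off, assuming a hypothetical UNSUBSCRIBE_TOPIC and order-ID payload (topic, tag, and field names are illustrative, not the production ones):

```java
import java.nio.charset.StandardCharsets;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class UnsubscribeProducer {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("unsubscribe_producer_group");
        producer.setNamesrvAddr("ns1.example.com:9876;ns2.example.com:9876");
        producer.start();

        // 1. Record the "unsubscribing" state in the local database (omitted here).
        // 2. Publish the unsubscribe event; the supplier-facing system consumes it
        //    and calls the supplier's interface.
        String orderId = "order-10001";                      // hypothetical order id
        Message msg = new Message("UNSUBSCRIBE_TOPIC",       // hypothetical topic
                                  "hotel",                   // tag: business line
                                  orderId,                   // key: used for lookup
                                  orderId.getBytes(StandardCharsets.UTF_8));
        SendResult result = producer.send(msg);
        System.out.println("sent: " + result.getSendStatus());

        producer.shutdown();
    }
}
```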


Warehouse system


The second scenario is our warehouse system, a fairly conventional use of messaging. We collect basic hotel information and detail data from suppliers and feed it into the message system, where it is processed by the back-end distribution system, lowest-price system, and inventory system.


When a supplier changes a price, the price-change event is also delivered to our back-end business systems through the message system, keeping the data accurate and up to date.


Subscription system for the supplier database


The database subscription system also uses messaging. Normally, database synchronization is done by reading the binlog and replaying the data into the target database.


During this process, what we care about most is the ordering of the data, so on top of binlog row mode we added a feature that guarantees strict ordering within each Queue.


Ordering within a single Queue is already guaranteed by RocketMQ; the feature we rely on is that messages with the same ID are routed to the same Queue.


For example, in the upper right corner of the figure, all messages whose primary key is id1 go into Queue1 and stay in order.


In Queue2, the two id3 messages are interleaved with two id2 messages, but each ID is still read in order when actually consumed. Spreading IDs across multiple queues in this way improves overall concurrency while preserving per-ID ordering.
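A minimal sketch of routing same-ID messages to the same queue with RocketMQ's MessageQueueSelector, assuming the binlog row's primary key is passed as the selector argument (topic and group names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.MessageQueueSelector;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageQueue;

public class OrderedBinlogProducer {
    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("binlog_producer_group");
        producer.setNamesrvAddr("ns1.example.com:9876;ns2.example.com:9876");
        producer.start();

        long recordId = 3L;  // primary key of the changed row, e.g. id3
        Message msg = new Message("BINLOG_TOPIC",  // hypothetical topic
                ("row change for id" + recordId).getBytes(StandardCharsets.UTF_8));

        // Messages with the same ID always hash to the same queue, so ordering
        // is preserved per ID while different queues are consumed in parallel.
        producer.send(msg, new MessageQueueSelector() {
            @Override
            public MessageQueue select(List<MessageQueue> mqs, Message m, Object arg) {
                long id = (Long) arg;
                return mqs.get((int) (id % mqs.size()));
            }
        }, recordId);

        producer.shutdown();
    }
}
```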


Pitfalls encountered along the way


Supplier system scenario


[Figure: two consumers in the same group with inconsistent topic subscriptions]

In the figure above, one MQ instance serves two consumers, both in Group1. At first both subscribe only to Topic1, and consumption works normally.


However, if you add Topic2 to only the first consumer, consumption stops or becomes erratic.


This is caused by RocketMQ's own mechanism: all consumers in the same group must keep identical subscriptions, so Topic2 must also be added to the second consumer before consumption returns to normal.
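A minimal sketch of the fix, with hypothetical addresses: every consumer instance in Group1 subscribes to the same set of topics.

```java
import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class Group1Consumer {
    public static void main(String[] args) throws Exception {
        // Every instance of Group1 must keep an identical subscription set;
        // adding Topic2 to only one instance breaks consumption for the group.
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("Group1");
        consumer.setNamesrvAddr("ns1.example.com:9876;ns2.example.com:9876");
        consumer.subscribe("Topic1", "*");
        consumer.subscribe("Topic2", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (List<MessageExt> msgs,
                ConsumeConcurrentlyContext ctx) -> {
            msgs.forEach(m -> System.out.println("consumed: " + m.getMsgId()));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```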


Payment transaction system scenario


[Figure: two applications in the same group consuming different tags]

The other case is the payment transaction system. Here there are also two applications, both under the same group and the same topic; one consumes Tag1 data and the other consumes Tag2 data.


Under normal circumstances this works, but one day we found that one application failed to start, and the other, which only wanted Tag2 data, took over the Tag1 messages as well because of RocketMQ's mechanism, and the Tag1 data was silently discarded.


This caused users' payments to fail. To fix it, we moved the Tag2 consumer into Group2, so the two groups no longer consume the same messages.


Personally, I would suggest that RocketMQ add a mechanism whereby a consumer only receives the tags it subscribes to and never takes over unrelated ones.
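A minimal sketch of the separation, assuming a hypothetical PAY_TOPIC: each tag gets its own consumer group, so neither group can discard the other's messages.

```java
import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

public class TagSeparatedConsumers {
    public static void main(String[] args) throws Exception {
        startConsumer("Group1", "Tag1");  // first application: Tag1 only
        startConsumer("Group2", "Tag2");  // second application: Tag2 only
    }

    static void startConsumer(String group, String tag) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer(group);
        consumer.setNamesrvAddr("ns1.example.com:9876;ns2.example.com:9876");
        consumer.subscribe("PAY_TOPIC", tag);  // hypothetical topic name
        consumer.registerMessageListener((MessageListenerConcurrently) (List<MessageExt> msgs,
                ConsumeConcurrentlyContext ctx) -> {
            msgs.forEach(m -> System.out.println(group + " consumed tag " + m.getTags()));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```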


Scenarios where a large amount of old data is read



In the train ticket scenario, we found 20 billion old messages that had never been consumed. When a new consumer started, RocketMQ by default began reading from offset 0, and disk IO immediately soared to 100%, affecting reads from other consumers, even though loading this old data served no practical purpose.


Therefore, our improvements for reading a large amount of old data are:

  • For new consumer groups, consumption starts from LAST_OFFSET by default (see the sketch after this list).

  • When the backlog of a single Topic on a Broker exceeds 10 million messages, consumption is disabled and the administrator must be contacted to re-enable it.

  • Monitoring must be in place so that when disk IO spikes, the responsible consumer can be contacted immediately.
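A minimal sketch of the offset setting, using RocketMQ's standard ConsumeFromWhere option (group and topic names are illustrative); note this only takes effect for a group that has no previously committed offset:

```java
import java.util.List;
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.consumer.ConsumeFromWhere;
import org.apache.rocketmq.common.message.MessageExt;

public class LastOffsetConsumer {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("train_ticket_group");
        consumer.setNamesrvAddr("ns1.example.com:9876;ns2.example.com:9876");
        // A brand-new group starts from the latest offset instead of offset 0,
        // so billions of old messages are not replayed and disk IO stays flat.
        consumer.setConsumeFromWhere(ConsumeFromWhere.CONSUME_FROM_LAST_OFFSET);
        consumer.subscribe("TRAIN_TICKET_TOPIC", "*");  // hypothetical topic
        consumer.registerMessageListener((MessageListenerConcurrently) (List<MessageExt> msgs,
                ConsumeConcurrentlyContext ctx) -> ConsumeConcurrentlyStatus.CONSUME_SUCCESS);
        consumer.start();
    }
}
```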


Server-side scenarios


① A futex kernel bug in CentOS 6.6 caused the Name Server and Broker processes to hang frequently and stop working.


Solution: upgrade to CentOS 6.7.


② Two server threads could create the same CommitLog and add it to the list, corrupting the message-offset calculation, so messages failed to parse and could not be consumed; restarting did not help.


Solution: this is a thread-safety issue; switch the creation to a single thread.


③ Resetting the consumption offset in Pull mode caused the server to fill a Map with a large amount of data, and the Broker's CPU usage soared to 100%.


Solution: the Map local variable was not actually used in this scenario, so it was removed.


④ When the Master recommends that the client consume from the Slave, if the data has not yet been synchronized to the Slave, the pullOffset is reset, resulting in a large amount of repeated consumption.


Solution: Do not reset the offset


⑤ Master-Slave synchronization carried no MagicCode, so when security scans hit the synchronization port, the Master mis-parsed the scan traffic and ran into problems.


Solution: add MagicCode verification to the synchronization protocol.


Improvements based on RocketMQ


New clients


We added a .NET client, developed natively by porting the Java source code, and an HTTP client that implements a subset of the features and connects to RocketMQ through a Netty server.

[Figure: client architecture with the .NET and HTTP clients]

New message rate-limiting function


If client code is written incorrectly and gets stuck in an infinite loop, it produces a large amount of duplicate data; the production threads fill up and the queue overflows, which seriously affects the stability of the MQ cluster and impacts other businesses.

[Figure: rate-limiting model placed in front of the Topic]

The figure above is a model diagram of the rate limiter. We added a rate-limiting layer in front of the Topic, through which both a rate limit and a size limit can be configured.


The rate limit is implemented with the token bucket algorithm: a fixed number of tokens is added to the bucket per second, and each write to the Topic consumes tokens, which caps how much data can be written per second. Both settings can be modified dynamically.
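A minimal token bucket sketch to illustrate the idea (this is not Tongcheng Yilong's implementation; the capacity and refill rate are illustrative):

```java
public class TokenBucket {
    private final long capacity;       // maximum tokens the bucket can hold
    private final long refillPerSec;   // tokens added per second
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, long refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Try to take `permits` tokens; returns false (throttle the send) if not enough. */
    public synchronized boolean tryAcquire(long permits) {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillNanos = now;
        if (tokens >= permits) {
            tokens -= permits;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(1000, 1000);  // e.g. 1000 messages/second
        for (int i = 0; i < 5; i++) {
            // Before producing to the Topic, check the bucket; drop or delay when throttled.
            System.out.println("send allowed: " + bucket.tryAcquire(1));
        }
    }
}
```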


Monitoring backend


We also built a monitoring backend that tracks messages over the full link, including:

  • Full-link message tracing, covering the entire life cycle of production, consumption, and expiration

  • Message production and consumption curves

  • Alerts on abnormal message production

  • Alerts on message backlog, identifying which consumer IP is too slow


Other functions:

  • Producing and consuming messages over HTTP

  • Topic consumption permissions: a Topic can only be consumed by designated groups, preventing uncontrolled subscriptions in production

  • New consumer groups can start consuming from the latest position (the default was to consume from the beginning)

  • Consumption-progress synchronization in broadcast mode (progress is displayed on the server)


This concludes the overview of Tongcheng Yilong's practice in building its messaging system.


