Four basic characteristics of an IM system: consistency (and solutions)

Consistency: in an IM system, this generally refers to the consistency of message ordering.

Consistent message ordering requires that messages be "timing comparable", that is, that any two messages can be compared against a common "timing reference". The key question is therefore: can we find a timing reference that makes our messages "timing comparable"?

In an engineering implementation, the problem can be divided into the following steps.

1. First: how to find the timing reference.

2. Second: the availability of the timing reference.

3. Finally: once we have a timing reference, are there other sources of error, and is there any way to reduce them?

How to find the timing reference?

First, can the sender's local sequence number or local clock be used as the "timing reference"?

To clarify: using the sender's local sequence number or local clock means that when the sender sends a message, it attaches a local timestamp or a locally maintained sequence number. The IM server forwards this timestamp or sequence number to the receiver together with the message, and the receiver sorts messages by it. On closer analysis, neither the sender's local sequence number nor its local clock is suitable as a "timing reference" for ordering on the receiver side, for the following reasons.

1. The sender's clock is unstable. The user can adjust it at any time, which can cause timestamps to roll back.

2. If the sender's app is reinstalled, the locally maintained sequence number is reset to zero, so the sequence number rolls back as well.

3. Multi-sender scenarios such as group chat and single-user multi-device login both exist, so several messages may be sent to the same recipient at the same moment. For example, several people in the same group speak at the same time, or the same user is logged in on two devices and both devices send a message to the same recipient at the same time. Because clocks are not synchronized across devices, there is no guarantee that the time reported by a device is accurate. User A may speak first in a group and user B second, but if A's phone clock is half a minute ahead of B's and this time is used as the "timing reference" for sorting, A's message will be considered later than B's.

Therefore, it is unreliable to use the sender's local clock or local sequence number as the "timing reference".
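The clock-skew problem can be illustrated with a small simulation. This is only a sketch: the `Message` class, the field names, and the 30-second offset (sender A's clock assumed to run ahead of real time) are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    text: str
    true_time: float     # real time at which the message was sent, in seconds
    clock_offset: float  # sender's local clock minus real time (hypothetical)

    @property
    def local_timestamp(self) -> float:
        # What the sender would attach to the message from its own clock.
        return self.true_time + self.clock_offset


# A really speaks 5 seconds before B, but A's phone clock runs 30 seconds ahead.
a = Message("A", "hello", true_time=100.0, clock_offset=30.0)
b = Message("B", "hi A", true_time=105.0, clock_offset=0.0)

by_sender_clock = sorted([a, b], key=lambda m: m.local_timestamp)
by_true_time = sorted([a, b], key=lambda m: m.true_time)

print([m.sender for m in by_sender_clock])  # order the receiver would display
print([m.sender for m in by_true_time])     # order in which they were really sent
```

Sorting by the sender-supplied timestamp places B's reply before A's original message, even though A spoke first.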

Can the local clock of the IM server be used as the "timing reference"?

Again, to clarify: using the IM server's local clock as the "timing reference" means that after the sender submits a message to the IM server, the server generates a timestamp from its own clock and attaches it when pushing the message to the receiver, and the receiver sorts messages by this timestamp. On analysis, the IM server's local clock is not suitable as a "timing reference" for message ordering either. In real projects, IM services are deployed as clusters, with many servers running at the same time. Although the servers can use NTP time synchronization to reduce the clock difference between machines to the millisecond level, some clock error remains. IM server clusters are also relatively large, so maintaining clocks uniformly is challenging and it is hard to keep the overall clock error extremely low. For these reasons, the IM server's local clock generally cannot be used as the "timing reference" for messages.

Since the local clock or sequence number of a single machine is problematic, can the problem be solved with a global clock or sequence number? If all messages are sorted by a global sequence number, clock synchronization is no longer an issue.

Can a global sequence number on the IM server side be used as the "timing reference"?

If there is a globally increasing sequence number generator, it should avoid the problem of clock synchronization across servers. The IM server can use the sequence number issued by this generator as the "timing reference" for message ordering. Such a "global sequence number generator" can be implemented in several ways:

1. The atomic increment command INCR in Redis.

2. The auto-increment ID built into a database.

3. An algorithm similar to Twitter's snowflake.

4. Other "time-related" distributed sequence number generation services.
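As an illustration of the time-related approach, here is a minimal snowflake-style generator. It is a sketch, not Twitter's actual implementation: the bit widths follow the commonly described layout (timestamp, 10-bit worker ID, 12-bit counter), and the epoch, class name, and injectable `clock` parameter are assumptions made for this example.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """Illustrative layout: millisecond timestamp | 10-bit worker id | 12-bit counter."""

    WORKER_BITS = 10
    COUNTER_BITS = 12
    EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an arbitrary custom epoch

    def __init__(self, worker_id, clock=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        # The clock is injectable so the behaviour can be checked deterministically.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._counter = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # A clock rollback would break monotonicity, so refuse to issue.
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self._last_ms:
                self._counter = (self._counter + 1) % (1 << self.COUNTER_BITS)
                if self._counter == 0:
                    # Counter exhausted for this millisecond: spin until the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._counter = 0
            self._last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.COUNTER_BITS))
                | (self.worker_id << self.COUNTER_BITS)
                | self._counter
            )


generator = SnowflakeLikeGenerator(worker_id=1)
print(generator.next_id() < generator.next_id())
```

IDs issued by one worker are strictly increasing. IDs from different workers are only ordered as well as the workers' clocks agree.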

Availability of the "timing reference":

Using the sequence number issued by the "global sequence number generator" as the "timing reference" for message ordering solves the problem that messages lack a standard "production date". But in scenarios with high concurrency and high-availability requirements, the availability of the "global sequence number generator" itself must also be considered. First, both the Redis atomic increment and the database auto-increment ID require the "take a number" operation to run on the primary instance, and the primary is basically a single point, so its availability guarantee is relatively poor. In addition, a single primary that handles highly concurrent number-taking can easily become a performance bottleneck. Time-related distributed "sequence number generators" similar to the snowflake algorithm generally have no problem with issuing performance, but they have other issues. One is that the time precision of the issued IDs is limited, usually to the second or millisecond level. For example, Weibo's ID generator is accurate to the second. Another is that most of these services are deployed as clusters and the time they embed is the server time, so clock inconsistency between servers still exists (although keeping these clocks synchronized is easier than doing so across a large number of IM servers).

As shown above, the "global sequence number generator" still has a number of problems. Does this mean that sorting messages by the sequence numbers it generates is not feasible?

Let's analyze this from the perspective of back-end business implementation. From a business perspective, scenarios such as group chat and multi-device login do not need absolute ordering across groups. They only need the order of messages within a given group. So if each group has an independent "ID generator", the load can be spread across multiple primary instances by a hash rule, which greatly reduces the concurrency pressure compared with many groups sharing one "ID generator". For most instant messaging services, the product can also tolerate small errors in message order. For example, if several messages from the same group arrive within the same second, it is acceptable for the business to sort the messages in that second by "order of reception". Such small errors are basically imperceptible to users. Sorting by sequence numbers from a "distributed time-related ID generator" is therefore fine as long as the time precision is acceptable to the business. From what WeChat and Weibo have shared publicly, we can learn:

1. The ordering of WeChat chat and Moments messages is also implemented through an "incrementing" version number service. This version number lives in an independent space for each user, and it is guaranteed to be increasing but not necessarily consecutive.

2. Weibo's message box relies on a "distributed time-related ID generator" to sort private messages, group chats, and similar services. Its current precision guarantees ordering at the second level.
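The idea of giving each group its own sequence space and spreading groups across several primary instances by hash can be sketched as follows. The `CounterShard` class is an in-memory stand-in for one primary instance (for example, one Redis primary serving increments), and the class names, key format, and CRC32-based routing are assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict


class CounterShard:
    """In-memory stand-in for one primary instance that serves atomic increments."""

    def __init__(self):
        self._counters = defaultdict(int)

    def incr(self, key: str) -> int:
        self._counters[key] += 1
        return self._counters[key]


class PerGroupSequencer:
    """Routes each group to one shard so that groups never share a counter."""

    def __init__(self, shard_count: int):
        self.shards = [CounterShard() for _ in range(shard_count)]

    def shard_index(self, group_id: str) -> int:
        # CRC32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(group_id.encode("utf-8")) % len(self.shards)

    def next_seq(self, group_id: str) -> int:
        shard = self.shards[self.shard_index(group_id)]
        return shard.incr(f"seq:{group_id}")


sequencer = PerGroupSequencer(shard_count=4)
print(sequencer.next_seq("group-a"), sequencer.next_seq("group-a"))
print(sequencer.next_seq("group-b"))
```

Each group's numbers increase independently, so ordering is only meaningful within a group, which is all the business case above requires.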

 

Errors beyond the "timing reference"

With a "timing reference" in place, can we be sure that messages arrive at the receiver in the intended order? Not always, for the following two reasons.

1. IM servers are deployed as clusters, and each server's machine performance differs, so processing speed differs and there is no guarantee that the message that arrives first is pushed to the receiver first. For example, a server may be slow or may happen to hit a GC pause, so that it receives a message earlier but pushes it out later than faster machines do.

2. After the IM server receives the sender's message, the subsequent processing is generally multi-threaded: "fetching the sequence number", "temporarily storing the message", "looking up the receiver's connection information", and so on. Because of this multi-threaded flow, there is no guarantee that the message that obtains its sequence number first will reach the receiver first, so different receivers may see messages in different orders.
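Both effects come down to the same thing: push time equals receive time plus a variable processing delay. The following deterministic sketch makes that concrete. The `InFlight` type and all timings are invented for illustration; real delays would come from GC pauses, thread scheduling, and machine load.

```python
from typing import NamedTuple


class InFlight(NamedTuple):
    seq: int             # sequence number issued by the timing reference
    received_at_ms: int  # when the IM server received the message
    processing_ms: int   # hypothetical processing delay on that server or thread

    @property
    def pushed_at_ms(self) -> int:
        return self.received_at_ms + self.processing_ms


messages = [
    InFlight(seq=1, received_at_ms=0, processing_ms=120),  # slow server, e.g. a GC pause
    InFlight(seq=2, received_at_ms=10, processing_ms=5),
    InFlight(seq=3, received_at_ms=20, processing_ms=10),
]

# The order in which the receiver actually gets the messages.
push_order = [m.seq for m in sorted(messages, key=lambda m: m.pushed_at_ms)]
print(push_order)
```

Message 1 obtained its sequence number first but is pushed last, so a receiver that simply appends messages in arrival order would display them out of sequence.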

Solution: in-package rectification (reordering) on the message server side

Although in most cases instant messaging services such as chat and live-stream interaction can tolerate "message disorder with small errors", in some specific scenarios the IM service may be required to guarantee absolute ordering.

For example, a single action by the sender triggers several messages at the same time, and at the business level these messages must be delivered strictly in the order in which they were triggered.

An example: user A sends a final break-up message to user B and also ticks the option "Remove the other party". Two messages, "send message" and "remove", may then be generated at the same time. If the server processes the "remove" signaling message first, the chat message may fail to be sent because the other party has already been removed.

Approaches:

1. In this case, we can generally adjust the implementation and merge the requests at the business level on the sender side, combining multiple messages into one.

2. The sender can also ensure that the two messages arrive in order by using a single sending thread and a single TCP connection.

But even if the IM server receives the messages in order, multi-threaded processing means they may still get out of order when they are actually processed or pushed down. This requirement, "the absolute order of multiple messages must be guaranteed", can be met through in-package rectification on the IM server.

For example, when we implement offline push, each gateway machine automatically subscribes to a topic for its own IP after it starts. When a user comes online, the gateway informs the business layer that the user has come online, and the business layer then publishes that user's offline messages to the topic subscribed to by the gateway machine the user is connected to. When the gateway machine receives these messages, it pushes them to the user over the long-lived connection. The whole process is roughly like the following figure.

However, in many cases the sharding of the Redis queue component and the multi-threaded consumption on the gateway machine cause disorder. If some signaling operations (such as deleting all sessions) are then pushed to the client out of order, they may cause logic errors on the client.

The server-side rectification process for offline push works as follows:

1. First, the producer generates a packageID for each message package and adds an ordered, auto-incrementing seqId to each message in the package.

2. Second, the consumer performs rectification according to the packageID and seqId of each message. The final execution module performs the final operation only after it has received all messages of the package, complete and in order, within a certain timeout. Otherwise, it triggers a retry or gives up, depending on business needs. Server-side in-package rectification is roughly as shown in the figure. What we have to do is, at the point where the server obtains the TCP connection and pushes messages down, collect the messages that arrive within a certain time window and sort them according to the package ID, so that even if the server processes the messages out of order, they are rectified back into order before they are finally pushed to the client.
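The two steps above can be sketched as a small consumer-side buffer. This is one possible shape, not the original author's code: it assumes each message also carries the total number of messages in its package (`total`), which the text does not specify, and the class names, timeout value, and injectable `clock` are illustrative.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PendingPackage:
    expected: int    # number of messages in the package (assumed to be known)
    deadline: float  # time after which the package is considered timed out
    parts: Dict[int, str] = field(default_factory=dict)


class PackageRectifier:
    def __init__(self, timeout_s: float = 2.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._pending: Dict[str, PendingPackage] = {}

    def offer(self, package_id: str, seq_id: int, total: int,
              payload: str) -> Optional[List[str]]:
        """Buffer one message; return the package in seqId order once it is complete."""
        pkg = self._pending.get(package_id)
        if pkg is None:
            pkg = PendingPackage(expected=total,
                                 deadline=self._clock() + self.timeout_s)
            self._pending[package_id] = pkg
        pkg.parts[seq_id] = payload
        if len(pkg.parts) < pkg.expected:
            return None  # still waiting for the rest of the package
        del self._pending[package_id]
        return [pkg.parts[i] for i in sorted(pkg.parts)]

    def expire(self) -> List[str]:
        """Drop timed-out packages and return their IDs so the caller can retry or give up."""
        now = self._clock()
        expired = [pid for pid, pkg in self._pending.items() if now >= pkg.deadline]
        for pid in expired:
            del self._pending[pid]
        return expired


rectifier = PackageRectifier()
rectifier.offer("pkg-1", 2, 2, "delete all sessions")
print(rectifier.offer("pkg-1", 1, 2, "last chat message"))
```

The signaling message arrives first here, but nothing is released until the package is complete, and it is then released in seqId order. A caller would invoke `expire()` periodically and decide per business rules whether to retry or drop the returned packages.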

Solution: rectification (reordering) on the message receiving end (client side)

After messages with different sequence numbers reach the receiving end, "a message generated earlier arrives later" and "a message generated later arrives first" can both occur. Rectification on the receiving end solves this problem.

Local rectification on the message client can be implemented according to the characteristics of the specific service. The common implementation in the industry is relatively simple. The steps are as follows:

1. When a message is pushed down, its sequence number is pushed to the receiver together with the message.

2. On receiving the message, the receiver checks whether its sequence number is greater than that of the previous message. If so, the message is appended to the conversation.

3. Otherwise, the receiver walks backwards through the second-to-last message, the third-to-last, and so on, until it finds the first message whose sequence number is smaller than that of the newly pushed message, then inserts the new message right after it and displays it there.
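The three steps above amount to an insertion that scans backwards from the end of the conversation. A minimal sketch follows, assuming messages are represented as `(sequence number, text)` tuples; the function name and representation are illustrative only.

```python
from typing import List, Tuple

Message = Tuple[int, str]  # (sequence number, text)


def insert_in_order(conversation: List[Message], incoming: Message) -> None:
    """Insert `incoming` so that the conversation stays sorted by sequence number."""
    seq = incoming[0]
    position = len(conversation)
    # Walk backwards past every message whose sequence number is larger.
    # In the common case (a newer message) the loop body never runs: a plain append.
    while position > 0 and conversation[position - 1][0] > seq:
        position -= 1
    conversation.insert(position, incoming)


conversation: List[Message] = []
for message in [(1, "hi"), (3, "see you"), (2, "how are you?")]:
    insert_in_order(conversation, message)
print([seq for seq, _ in conversation])
```

Scanning from the tail keeps the usual case cheap, because late messages are normally only a few positions out of place.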

Summary:

The key to keeping message order consistent is to find a timing reference that identifies the order of each message. This timing reference can be provided by a global sequence number generator. Common implementations are a resource that supports monotonically increasing sequence numbers, or a distributed time-related ID generation service. Both have limitations, and you can choose between them according to the characteristics of your business.

Even with message sequence numbers determined by the timing reference, differences between IM servers and multi-threaded processing mean there is no guarantee that the message that reaches the server first is pushed to the receiver first. The "server-side in-package rectification" mechanism can be used to guarantee correct execution of batches of messages that require strict ordering, or the receiver can rectify messages locally according to their sequence numbers, so as to ensure eventual consistency across multiple receivers.

Origin blog.csdn.net/madongyu1259892936/article/details/106051576