Four basic characteristics of IM systems: reliability (and solutions)

In what situations can messages be lost?

Let's take the most common type of IM system, "server-side routing relay" (non-P2P), as an example. "Server-side routing relay" means that after user A sends a message, it first goes to the IM server, which then relays (pushes) it to user B. This is currently the most common message distribution model in IM systems.

So, let's assume a scenario: user A sends a message to user B. Which steps along the way are at risk of losing the message?

Referring to the above sequence diagram, message delivery is roughly divided into two parts:

1. User A sends a message to the IM server, the server temporarily stores the message, and then returns a successful result to the sender A (steps 1, 2, 3);

2. The IM server then pushes the temporarily stored message sent by user A to the recipient user B (step 4). 

Messages may be lost in the following scenarios. In the first part, steps 1, 2, and 3 may all fail. Because sending a message is a "request" and "response" process, user A is told that sending failed in any of these cases:

1. User A fails to send the message to the IM server because of a network disconnection or a similar problem;

2. The IM server receives the message but fails to store it;

3. User A waits for the timeout period, but the IM server has not returned a result.

Notes:

1. In the first part, when user A is told that sending failed, the client can recover by retrying. Note that retrying can produce duplicate messages. For example, the client does not receive a response within the timeout period and retries, but the server had in fact already processed the request successfully and the response was merely slow. The server therefore needs deduplication logic: generally the sender attaches a unique ID to the message and reuses the same ID on every retry, so the server can recognize and discard the duplicate.

2. In the second part, after the message is stored, the IM server responds to user A that the message was sent successfully, and then pushes the message to user B's online device. If the server loses power during the push preparation phase, or after writing the message to the kernel buffer, the message will not reach user B. In this case the IM server that user B was connected to may no longer be functioning at all, so a later remedial measure is needed to recover the lost message. Even if the message is successfully delivered to user B's device over TCP, it can still be lost if the device fails while processing it. For example, user B's device hits an exception while writing the message to its local DB and fails to store it. Because delivery has already succeeded at the network level while user B still cannot see the message, this case is harder to deal with.

Generally we will use the following corresponding solutions:

1. For the first part, we can basically solve the problem through the timeout retransmission of client A and the deduplication mechanism of the IM server;

2. For the second part, the industry generally borrows from the ACK mechanism of the TCP protocol and implements a business-layer ACK protocol.
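
The first of these two solutions (client retry plus server-side deduplication) can be sketched roughly as follows. This is a minimal in-memory illustration, not a real IM server: the names `ImServer`, `handle_send`, `send_with_retry`, and the `drop_first_responses` parameter are all hypothetical, and the "timeout" is simulated by discarding the first few responses.

```python
import uuid


class ImServer:
    """Toy in-memory server that deduplicates by a client-generated message ID."""

    def __init__(self):
        self.stored = {}  # msg_id -> message body

    def handle_send(self, msg_id, body):
        # A retried message carries the same msg_id, so it is stored only once.
        if msg_id not in self.stored:
            self.stored[msg_id] = body
        return "ok"


def send_with_retry(server, body, max_attempts=3, drop_first_responses=0):
    """Send `body`, retrying with the SAME msg_id when no response arrives.

    `drop_first_responses` simulates responses that arrive after the timeout.
    """
    msg_id = str(uuid.uuid4())  # generated once, reused for every retry
    for attempt in range(1, max_attempts + 1):
        response = server.handle_send(msg_id, body)
        if attempt <= drop_first_responses:
            response = None  # server processed the request, client timed out
        if response == "ok":
            return msg_id, attempt
    raise TimeoutError("send failed after %d attempts" % max_attempts)


server = ImServer()
msg_id, attempts = send_with_retry(server, "hello B", drop_first_responses=2)
print(attempts, len(server.stored))  # 3 attempts, but only 1 stored message
```

The important design point is that the ID is created once per message rather than once per request, so the server can tell a retry from a genuinely new message.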

Solution to message loss: business-layer ACK mechanism

Let's first explain ACK. ACK is short for "acknowledgement", meaning confirmation. TCP provides an ACK mechanism by default: the receiver uses the protocol's standard ACK packets to confirm the data it has received, telling the sender that the data arrived successfully. The business-layer ACK mechanism is similar. The problem it solves is how to confirm, after the IM server pushes a message, that the message was actually delivered to the receiver. The concrete implementation is as follows:

When the IM server pushes a message, the message carries an identifier, the SID (security identifier, similar to the TCP sequence number). After the message is pushed, it is added to the "waiting-for-ACK list". After client B successfully receives the message, it returns a business-layer ACK packet to the IM server, carrying the SID of the received message. When the IM server receives this ACK, it deletes the message from the "waiting-for-ACK list", and only then is the push truly complete.
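
As a rough sketch of this bookkeeping, the server side might look like the following. The class `PushServer`, its `deliver` callback (standing in for the real network transport), and the use of a simple counter to generate SIDs are assumptions made for illustration only.

```python
import itertools


class PushServer:
    """Tracks pushed messages until the receiver acknowledges them."""

    def __init__(self, deliver):
        self.deliver = deliver       # callable(sid, body): hypothetical transport
        self.pending_ack = {}        # sid -> body, pushed but not yet ACKed
        self._next_sid = itertools.count(1)

    def push(self, body):
        sid = next(self._next_sid)
        self.pending_ack[sid] = body  # record first, so an early ACK finds it
        self.deliver(sid, body)
        return sid

    def on_ack(self, sid):
        # pop() with a default makes a duplicate or unknown ACK harmless.
        self.pending_ack.pop(sid, None)


received = []
server = PushServer(deliver=lambda sid, body: received.append((sid, body)))

sid = server.push("hello B")
pending_before_ack = dict(server.pending_ack)  # {1: 'hello B'}

server.on_ack(sid)                             # client B confirms receipt
print(pending_before_ack, server.pending_ack)  # {1: 'hello B'} {}
```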

1. Message retransmission in the ACK mechanism

What if the message is lost while being pushed to user B? For example: user B's network is actually unreachable, but the IM server has not yet noticed; user B's device crashes before it reads the data from the kernel buffer; or the message is dropped by some intermediate device in the network and TCP-layer retransmission keeps failing. All of these cause user B to miss the message.

Solution: The common strategy again borrows from the retransmission mechanism of the TCP protocol. The IM server's "waiting-for-ACK list" maintains a timeout timer. If the ACK packet from user B is not received within a certain period of time, the message is taken from the list and pushed again.
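
A minimal sketch of such a timer-driven re-push is shown below. To keep it deterministic, the current time is passed in explicitly instead of being read from a clock. The names `RetransmitQueue`, `tick`, `ack_timeout`, and `max_retries` are hypothetical, and the retry limit is an added assumption (the text above does not specify when the server should give up).

```python
class RetransmitQueue:
    """Re-pushes messages whose ACK has not arrived before a deadline."""

    def __init__(self, deliver, ack_timeout=5.0, max_retries=3):
        self.deliver = deliver        # callable(sid, body): hypothetical transport
        self.ack_timeout = ack_timeout
        self.max_retries = max_retries  # assumption: give up after this many re-pushes
        self.waiting = {}             # sid -> [body, deadline, retries_so_far]

    def push(self, sid, body, now):
        self.waiting[sid] = [body, now + self.ack_timeout, 0]
        self.deliver(sid, body)

    def on_ack(self, sid):
        self.waiting.pop(sid, None)

    def tick(self, now):
        """Called periodically; re-push every message whose deadline has passed."""
        for sid, entry in list(self.waiting.items()):
            body, deadline, retries = entry
            if now < deadline:
                continue
            if retries >= self.max_retries:
                del self.waiting[sid]  # stop retrying; another mechanism must recover it
                continue
            entry[1] = now + self.ack_timeout
            entry[2] = retries + 1
            self.deliver(sid, body)    # same SID, so the receiver can deduplicate


sent = []
queue = RetransmitQueue(deliver=lambda sid, body: sent.append(sid), ack_timeout=5.0)
queue.push(1, "hello B", now=0.0)
queue.tick(now=3.0)    # deadline not reached: nothing is re-sent
queue.tick(now=5.0)    # deadline reached: message 1 is pushed again
queue.on_ack(1)
queue.tick(now=20.0)   # already ACKed: nothing to do
print(sent)            # [1, 1]
```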

2. The problem of repeated message pushes

As mentioned earlier, if the ACK packet for a pushed message is not received within a certain period of time, the server retransmits it. There are two reasons the ACK might not arrive: the pushed message itself was lost, so user B never returned an ACK, or user B's ACK packet was lost on the way back. In the second case, the retransmission causes the receiver to get the same pushed message more than once.

Solution:

The general solution is for the server to attach a Sequence ID to each message it pushes. The Sequence ID must be unique within the connection session and does not change when the same message is pushed again. The receiver deduplicates at the business layer based on this Sequence ID, so user B still sees a single message and the user experience is not affected.
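
On the receiving side, this deduplication could look roughly like the sketch below. The `Receiver` class and its `send_ack` callback are hypothetical names. Note one design choice that goes slightly beyond the text: the receiver ACKs duplicates too, because a duplicate usually means the earlier ACK was lost.

```python
class Receiver:
    """Client-side handler that shows each Sequence ID to the user only once."""

    def __init__(self, send_ack):
        self.send_ack = send_ack  # callable(seq_id): hypothetical transport
        self.seen = set()         # Sequence IDs already processed in this session
        self.inbox = []           # what the user actually sees

    def on_push(self, seq_id, body):
        if seq_id not in self.seen:
            self.seen.add(seq_id)
            self.inbox.append(body)
        # ACK even when the message is a duplicate, so the server stops re-pushing.
        self.send_ack(seq_id)


acks = []
user_b = Receiver(send_ack=acks.append)
user_b.on_push(7, "hello B")
user_b.on_push(7, "hello B")     # server retransmission after a lost ACK
print(user_b.inbox, acks)        # ['hello B'] [7, 7]
```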

3. Does this really prevent message loss?

If you are careful, you may have noticed that the combined mechanism of "ACK + timeout retransmission + deduplication" solves most message push loss while the user is online. But does it cover every message loss scenario? Suppose an IM server goes down because of a hardware failure right after pushing a message. If that message is lost, the IM server responsible for it is down and cannot trigger a retransmission, so receiver B never gets the message. Worse, when user B reconnects and comes back online, he may not even know that a message was lost. How do we deal with this failed-retransmission case?

Remedial measures: message integrity check

Let's analyze the retransmission failure caused by server downtime. The problem is that once the server machine is down, retransmission is no longer possible. However, if an integrity check can be performed when user B comes back online and it finds that messages are missing, the lost data can be resynchronized or repaired. A common implementation of the message integrity check is "timestamp comparison".

Let's look at how the "timestamp mechanism" checks message integrity, using the following example.

1. The IM server pushes msg1 to receiver B together with its timestamp, timestamp1. After receiving msg1, receiver B updates its local latest-message timestamp to timestamp1.

2. The IM server pushes the second message, msg2, with timestamp2. During the push, the connection between receiver B and the IM server is broken for some reason, so msg2 is not delivered to receiver B.

3. User B reconnects and sends its local latest timestamp, timestamp1. The IM server returns all messages temporarily stored for user B whose timestamp is greater than timestamp1, including msg2, which failed to arrive earlier.

4. After receiving msg2, user B updates its local latest-message timestamp to timestamp2.
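
The four steps above can be sketched as a small simulation. The classes `OfflineStore` and `Client` and their methods are hypothetical, and integer timestamps are used for readability.

```python
class OfflineStore:
    """Server-side temporary storage, queried by 'newer than this timestamp'."""

    def __init__(self):
        self.messages = {}  # user -> list of (timestamp, body)

    def save(self, user, timestamp, body):
        self.messages.setdefault(user, []).append((timestamp, body))

    def fetch_after(self, user, last_timestamp):
        return [m for m in self.messages.get(user, []) if m[0] > last_timestamp]


class Client:
    def __init__(self):
        self.last_timestamp = 0  # timestamp of the latest message held locally
        self.inbox = []

    def receive(self, timestamp, body):
        self.inbox.append(body)
        self.last_timestamp = max(self.last_timestamp, timestamp)

    def reconnect(self, store, user):
        # Step 3: report the local timestamp and pull everything newer.
        for timestamp, body in store.fetch_after(user, self.last_timestamp):
            self.receive(timestamp, body)


store, user_b = OfflineStore(), Client()

store.save("B", 1, "msg1")
user_b.receive(1, "msg1")        # step 1: msg1 is pushed and received

store.save("B", 2, "msg2")       # step 2: msg2 is stored, but the push is lost

user_b.reconnect(store, "B")     # steps 3 and 4: msg2 is recovered
print(user_b.inbox, user_b.last_timestamp)  # ['msg1', 'msg2'] 2
```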

Through the above timestamp mechanism, user B can recover the missing msg2. Note that timestamps depend on the clocks of multiple machines, which may not be synchronized, so there may be some deviation and the client may fetch too few messages. In actual implementations, you can therefore use a global auto-increment sequence as the version number instead.
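
To make the clock problem concrete, here is a small illustrative comparison. It assumes two servers whose clocks disagree, and it uses `itertools.count` as a stand-in for a shared auto-increment sequence (in a real deployment this would have to come from a shared service or database, which is outside the scope of this sketch). The function `save` and all values are hypothetical.

```python
import itertools

global_seq = itertools.count(1)  # stand-in for a shared auto-increment sequence

# Each stored record: (timestamp_from_local_clock, global_sequence, body)
stored = []


def save(body, local_clock):
    stored.append((local_clock, next(global_seq), body))


save("msg1", local_clock=100)  # written by a server whose clock reads 100
save("msg2", local_clock=90)   # written LATER by a server whose clock runs behind

# User B received only msg1 and then reconnects with its local versions.
last_timestamp, last_seq = 100, 1

by_timestamp = [body for ts, seq, body in stored if ts > last_timestamp]
by_sequence = [body for ts, seq, body in stored if seq > last_seq]

print(by_timestamp)  # []        msg2 is missed because its timestamp is older
print(by_sequence)   # ['msg2']  the sequence preserves the real write order
```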

To sum up:

Ensuring reliable message delivery is a crucial part of IM system design. "No lost messages" and "no duplicate messages" both have a major impact on user experience. We can use the following methods to make message push reliable.

1. In most scenarios and actual implementations, the business-layer ACK confirmation and retransmission mechanism solves most message loss during the push process.

2. The client's deduplication mechanism hides the duplicate messages produced by retransmission, so the user experience is not affected.

3. For the special scenarios where retransmission cannot reach the receiver, a fallback integrity check detects message loss promptly and triggers a follow-up push to repair it. The integrity check can be implemented by timestamp comparison or with a global auto-increment sequence.

 

Origin blog.csdn.net/madongyu1259892936/article/details/106018457