IM system features: the heartbeat mechanism (and solutions)

Why you need a heartbeat mechanism

The "long connection" approach brings many benefits. To deliver messages reliably over a long connection, the most important part is maintaining the connection itself. The TCP connection underneath is not a real physical connection; it is a virtual connection that cannot sense the state of the path it runs over. When an intermediate link breaks, the two ends are not notified. A key issue in maintaining a long connection is therefore to detect quickly when something goes wrong on the intermediate link, and then to establish a new usable connection through "reconnection", so that the long connection stays in a "highly available" state.

The mechanism that identifies connection availability "quickly" and "continuously" is called the "heartbeat mechanism". It tests the availability of the connection by continuously sending "simulated data" over it. At the same time, it keeps data flowing on the connection when there is no real business data to send or receive, so that intermediate network operators do not conclude that the connection is no longer in use and cut it.

1. Reduce the overhead of server connection maintenance

In most instant messaging scenarios, both the sender and the receiver are often on mobile networks. Changes in mobile signal strength and failures of intermediate routers can leave the "long connection" unusable in practice.

For example, when a user walks into an elevator, the mobile network signal may disappear completely. The long connection is no longer usable, but the IM server cannot perceive this. Likewise, if the home router suddenly drops its Internet connection, the long connection previously established between the app and the IM server becomes unusable, yet neither the client nor the IM server can perceive it.

The IM server keeps state for every online connection. "Server push" of messages works because the IM server maintains a mapping between "user device" and "network connection" for every device that comes online. In addition, to save network overhead, client information that does not need to travel with every request (such as app version, operating system, and network status) is often cached on the server: the client sends it once after the long connection is established, and subsequent requests rely on the cached copy. Many IM implementations also maintain information such as "user online status" and "all online devices" on the server for business use.

If the IM server cannot perceive these abnormal connections, several problems follow. The server may hold a large number of "invalid connections", which wastes connection handles. It also keeps caching "mapping relationship", "device information", and "online status" data that is no longer useful, which wastes more resources. Finally, pushing messages to "invalid long connections", and retrying those pushes, reduces the overall performance of the service.

2. Support client reconnect after disconnection

Quickly identifying connection availability through the "heartbeat" not only reduces resource overhead on the server; it also supports the client's disconnect-and-reconnect mechanism. The client sends heartbeat packets and waits for the IM server's responses. If no response arrives within a timeout (allowing for network transmission delay, the timeout should be longer than one heartbeat interval), for example after two consecutive heartbeat packets go unanswered, the client can conclude that the long connection is unavailable, close it, and reconnect. The server may fail to respond because the network path to it is broken somewhere in the middle, or because the server is too heavily loaded to answer heartbeat packets. In either case the client needs to reconnect, and doing so lets it restore a usable connection quickly and automatically.
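The client-side rule above can be sketched roughly as follows. This is only an illustration, not code from any real IM client: the class `HeartbeatMonitor`, the parameter `max_missed`, and the method names are all hypothetical, and the caller is assumed to invoke `on_tick` once per heartbeat interval.

```python
class HeartbeatMonitor:
    """Counts unanswered heartbeats and decides when to reconnect (hypothetical helper)."""

    def __init__(self, max_missed=2):
        self.max_missed = max_missed   # consecutive unanswered heartbeats tolerated
        self.missed = 0
        self.awaiting_reply = False

    def on_tick(self):
        """Call once per heartbeat interval; returns "send" or "reconnect"."""
        if self.awaiting_reply:
            # The previous heartbeat went a full interval without a response.
            self.missed += 1
        if self.missed >= self.max_missed:
            # Give up on this connection and start over with clean state.
            self.missed = 0
            self.awaiting_reply = False
            return "reconnect"
        self.awaiting_reply = True
        return "send"

    def on_reply(self):
        """Call when the server's heartbeat response arrives."""
        self.awaiting_reply = False
        self.missed = 0
```

With `max_missed=2`, the client gives up only after two heartbeats in a row have each waited a full interval without an answer, so the effective timeout is two intervals, which satisfies the "longer than one heartbeat interval" requirement.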

3. Connection keep-alive

To maintain a "highly available" long connection, another important task is to try to make the established long connection survive longer.

Here you may ask: can a long connection be killed even when the user's network and the intermediate routing network are both working normally? The answer is yes.

To understand why, we can start with IPv4. Because public IPv4 addresses are limited (approximately 4.3 billion), mobile phones that reach the Internet through a mobile operator are usually assigned only an IP address on the operator's internal network, which saves public addresses.

When the phone accesses the Internet, the carrier gateway uses a two-way mapping table between "external IP and port" and "internal IP and port" so that phones with internal IP addresses can communicate with external networks. This address conversion process is called NAT (Network Address Translation).

There is nothing wrong with NAT itself. The problem is that many operators, to save resources and reduce the load on their gateways, remove connections that have had no traffic for some time from the NAT mapping table, and neither the phone nor the IM server is aware of the removal. Once the NAT mapping is gone, messages can no longer be sent or received on the long connection. How long an idle mapping survives differs from operator to operator, from a few minutes to a few hours.

So if a user sends and receives no messages for a few minutes, the long connection may already be unusable. If the client sends some signaling to the server during idle periods, it can prevent the operator's NAT from killing the long connection. This signaling is generally implemented with heartbeat packets.
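To make the idle-eviction behavior concrete, here is a toy model of a NAT mapping table. It is a simplified sketch under stated assumptions, not how any particular carrier gateway works: `NatTable`, its method names, the starting port 40000, and the timeout value are all invented for illustration.

```python
class NatTable:
    """Toy carrier NAT table that evicts idle mappings (hypothetical model)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout   # seconds a mapping may stay unused
        self.by_public_port = {}           # public_port -> [private_addr, last_used]
        self.next_port = 40000             # arbitrary starting port for the sketch

    def _expire(self, now):
        # Silently remove mappings that have been idle too long.
        stale = [p for p, (_, t) in self.by_public_port.items()
                 if now - t > self.idle_timeout]
        for p in stale:
            del self.by_public_port[p]

    def outbound(self, private_addr, now):
        """Packet from the phone: reuse (and refresh) its mapping or create a new one."""
        self._expire(now)
        for port, entry in self.by_public_port.items():
            if entry[0] == private_addr:
                entry[1] = now
                return port
        port = self.next_port
        self.next_port += 1
        self.by_public_port[port] = [private_addr, now]
        return port

    def inbound(self, public_port, now):
        """Packet from the server: returns the phone's address, or None if the mapping is gone."""
        self._expire(now)
        entry = self.by_public_port.get(public_port)
        if entry is None:
            return None
        entry[1] = now
        return entry[0]
```

In this model, any packet in either direction refreshes the mapping, which is why a periodic heartbeat shorter than `idle_timeout` keeps the server's pushes deliverable.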

Several implementation methods of heartbeat detection

There are currently three commonly used implementation methods in the industry: TCP Keepalive, Application Layer Heartbeat, and Smart Heartbeat.

1. TCP Keepalive

TCP Keepalive is part of the operating system's TCP/IP protocol stack. During idle periods on a local TCP connection, it automatically sends probe packets that carry no data, at a set frequency, to detect whether the peer is still alive. The operating system leaves this feature off by default, and the application must turn it on. It has three configuration items; on Linux the defaults are an idle time of 2 hours before the first probe, 9 probes before the connection is declared dead, and 75 s between probes. All three can be adjusted.
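As a rough sketch of how an application turns the feature on and adjusts the three items for a single socket, the snippet below uses Python's standard `socket` module. The helper names `enable_keepalive` and `detection_time` and the example values are assumptions made for illustration; the `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options exist on Linux but may be missing on other platforms, so the code skips any that are unavailable.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP Keepalive for one socket and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is None:
            continue  # this platform does not expose the option
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass  # option defined but rejected by the OS; keep the system default


def detection_time(idle, interval, count):
    """Approximate seconds before a silent peer is declared dead."""
    return idle + interval * count
```

With the Linux defaults quoted above, `detection_time(7200, 75, 9)` gives 7875 seconds, roughly 2 hours 11 minutes, which shows why the defaults are far too slow for an IM service and are normally shortened.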

Because it is an existing part of the system's TCP/IP stack, TCP Keepalive requires no extra development work and is convenient as a connection liveness check: the upper-layer application only needs to handle the connection errors that are detected. The probe packets carry no data, so very little bandwidth is wasted. Thanks to this ease of use and low network cost, many IM systems enable TCP Keepalive. An earlier packet capture showed that WhatsApp uses TCP Keepalive with an idle period of 10 seconds for liveness detection.

Despite these advantages, TCP Keepalive has shortcomings. The heartbeat interval is inflexible: a server can only use one fixed interval at any given time. In addition, TCP Keepalive detects liveness at the connection layer only, which does not mean that the application layer is actually available.

For example, when the IM service is deadlocked or blocked in its code, it cannot process business requests. The TCP Keepalive probe, however, needs no application-layer participation and is still answered normally by the kernel. The detection result is then wrong, and machines that have lost the ability to process requests cannot be discovered in time.

2. Application layer heartbeat

To address these shortcomings of TCP Keepalive, many IM services use an application-layer heartbeat to improve the flexibility and accuracy of detection. With an application-layer heartbeat, the client sends a business-layer data packet to the IM server at regular intervals to tell the server that it is still alive. If the IM server receives no heartbeat packet within a certain period, it decides that the client is unreachable for some reason, closes the connection, and clears the other resources allocated to it.
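The server-side half of this can be sketched as a table of last-heartbeat timestamps that is swept periodically. This is a minimal illustration under assumptions: `ConnectionRegistry`, its fields, and the device identifiers are hypothetical, timestamps are passed in explicitly instead of read from a clock, and closing the actual socket is left out.

```python
class ConnectionRegistry:
    """Server-side table of live connections keyed by device id (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout    # seconds of silence tolerated before cleanup
        self.last_seen = {}       # device_id -> time of the last heartbeat
        self.device_info = {}     # device_id -> cached client information

    def on_heartbeat(self, device_id, now, info=None):
        """Record a heartbeat; optionally cache client info sent with it."""
        self.last_seen[device_id] = now
        if info is not None:
            self.device_info[device_id] = info

    def sweep(self, now):
        """Drop every device whose last heartbeat is older than the timeout."""
        expired = [d for d, t in self.last_seen.items() if now - t > self.timeout]
        for d in expired:
            del self.last_seen[d]
            self.device_info.pop(d, None)  # release the cached state as well
        return expired
```

A real server would also close the underlying connection and update the user's online status for each device returned by `sweep`.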

Unlike TCP Keepalive, the application-layer heartbeat is not part of the TCP/IP protocol stack, so it adds some data transmission overhead. Most application-layer heartbeat packets are designed to be as small as possible, usually just a few bytes. Some are simply an empty keep-alive packet; others carry only the heartbeat interval, which the client uses to schedule its next heartbeat. The extra data overhead is therefore very small. Because the application layer must send and receive the heartbeat itself, it reflects the availability of the application rather than just the availability of the network. The application-layer heartbeat can also set the interval flexibly according to actual network conditions. Given how inconsistent NAT timeouts are among domestic operators, a flexible heartbeat interval has clear advantages for saving traffic and keeping connections alive.

At present, most IM systems use an application-layer heartbeat for connection keep-alive and availability detection. For example, earlier packet captures showed that WhatsApp uses application-layer heartbeat intervals of 30 seconds and 1 minute, WeChat's interval is mostly 4.5 minutes, and Weibo's long connection currently uses a 2-minute interval.

IM clients also differ in when they send heartbeats. The simplest strategy is to send heartbeat packets at a fixed frequency regardless of whether the connection is idle; an earlier packet capture of mobile QQ showed the app sending a heartbeat roughly every 45 s. A slightly more complicated strategy is to send a heartbeat only after the connection has been idle for a while. This saves more traffic but is a little harder to implement.

The following is a typical application-layer heartbeat processing flow chart for the client and server. As the figure shows, the client and the server each use the heartbeat mechanism, for "reconnection after disconnection" and "resource cleanup" respectively.

Note that the client treats the connection as idle once the configured heartbeat interval has passed without traffic. The server, allowing for network transmission delay, must use an idle timeout greater than the heartbeat interval, so that transmission delay does not cause it to misjudge connection availability.
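The client-side "send a heartbeat only when idle" rule can be sketched like this. It is an illustrative assumption, not code from any of the apps mentioned: `IdleHeartbeatTimer` and its methods are hypothetical, and times are plain numbers supplied by the caller.

```python
class IdleHeartbeatTimer:
    """Decides when an idle connection needs a heartbeat (hypothetical helper)."""

    def __init__(self, interval):
        self.interval = interval     # configured heartbeat interval in seconds
        self.last_activity = 0.0     # time of the last packet in either direction

    def on_traffic(self, now):
        """Any real message sent or received resets the idle clock."""
        self.last_activity = now

    def heartbeat_due(self, now):
        """True once the connection has been idle for a full interval."""
        if now - self.last_activity >= self.interval:
            # The heartbeat we are about to send counts as traffic too.
            self.last_activity = now
            return True
        return False
```

On the server, the matching idle timeout would be set somewhat larger than `interval` (for example, the interval plus an allowance for network delay), in line with the rule above.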

3. Smart heartbeat

In domestic mobile networks, NAT timeouts vary widely between regional operators and network types. A fixed-frequency application-layer heartbeat is simple to implement, but to avoid NAT timeouts its interval must be shorter than the shortest NAT timeout across all network environments. That solves the problem, but it does not save the device's CPU, battery, and network traffic as much as it could. To improve on this, many instant messaging systems adopt a "smart heartbeat" to balance "NAT timeout" against "device resource saving".

A smart heartbeat adjusts the heartbeat interval automatically according to the network environment. By adjusting the interval continuously, it gradually approaches the critical point of the NAT timeout, saving as much device resource as possible while making sure the NAT mapping does not time out. WeChat is said to use a smart heartbeat to optimize its heartbeat interval.

From a personal point of view, however, mobile data has become much cheaper and phone hardware keeps improving, so the resources a smart heartbeat saves are limited. In addition, a smart heartbeat has to keep probing while it searches for the NAT timeout's critical point, which can reduce connection availability during that "timeout confirmation phase". I therefore suggest weighing whether it is necessary against the needs of your own business scenario.
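One simple way to approach the NAT timeout's critical point is to lengthen the interval step by step until a heartbeat fails, then back off and settle. The sketch below illustrates that idea only; it is not WeChat's actual algorithm, and `SmartHeartbeat`, the step size, and the interval bounds are assumptions chosen for the example.

```python
class SmartHeartbeat:
    """Step-wise search for the longest safe heartbeat interval (hypothetical)."""

    def __init__(self, min_interval=60, max_interval=600, step=30):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.step = step
        self.interval = min_interval   # start from a conservative value
        self.stable = False            # True once the boundary has been found

    def on_success(self):
        """Heartbeat at the current interval was answered: try a longer one."""
        if not self.stable and self.interval + self.step <= self.max_interval:
            self.interval += self.step
        return self.interval

    def on_failure(self):
        """Connection died at this interval: step back and stop probing."""
        self.interval = max(self.min_interval, self.interval - self.step)
        self.stable = True
        return self.interval
```

Each failure during the search costs one dropped connection, which is the availability trade-off described above; a production design would likely also restart the search when the network type changes.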

To sum up:

The "heartbeat mechanism" between the client and the IM server identifies quickly and automatically whether the connection is available, and at the same time prevents the connection from being cut when the operator's NAT times out. It solves the following three problems:

1. Reduce the server's overhead of maintaining invalid connections.

2. Let the client quickly identify invalid connections and reconnect automatically after disconnection.

3. Keep the connection alive so that it is not cut by the operator's NAT timeout.

In the industry, heartbeat detection is mostly implemented in the following two ways:

1. TCP Keepalive. It comes with the operating system's TCP/IP protocol stack, needs no additional development, is easy to use, carries no data, and consumes little network traffic. However, it is not flexible enough and cannot tell whether the application layer is available.

2. Application-layer heartbeat. Implementing the heartbeat yourself takes some development work and consumes slightly more network traffic, but the heartbeat interval is flexible. Combined with a smart heartbeat, it can save as much device resource as possible without hitting the NAT timeout, and it reports the true availability of the application layer more accurately.

Origin blog.csdn.net/madongyu1259892936/article/details/106211230