IM interview questions

1. In message storage, if the content table and the index table need to be sharded across databases and tables, which field should each be hashed on? Can the index table and the content table be merged into one table?

Answer: The content table should be sharded by hashing on its primary key, the message ID, which makes it easy to locate a specific message; the index table should be sharded by hashing on the indexed user's UID, so that all of one user's contacts fall into the same table and there is no need to traverse every table. The index table can be merged with the content table into one table. The advantage is obvious: it reduces database IO when pulling historical messages. The downside is that the message content is stored redundantly, which wastes space.
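
To make the routing rule above concrete, here is a minimal sketch in Java; the table count and method names are illustrative assumptions, not from the original text:

```java
public class ShardRouter {
    // Hypothetical number of sub-tables; a real deployment chooses its own.
    private static final long TABLE_COUNT = 64;

    // Content table: hash on the primary-key message ID, so a specific
    // message can be located directly.
    static int contentTable(long messageId) {
        return (int) Math.floorMod(messageId, TABLE_COUNT);
    }

    // Index table: hash on the owning user's UID, so all of one user's
    // contact rows land in the same table.
    static int indexTable(long uid) {
        return (int) Math.floorMod(uid, TABLE_COUNT);
    }
}
```

Because the index table routes on UID alone, every contact row of a given user maps to the same table number, regardless of which messages are involved.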

2. The information needed for the recent-contacts list can be obtained from the index table. Why is a separate contact table still needed?

Answer: If you fetch all of a user's contact information (including the last chat content and time) from the index table, the SQL needs a GROUP BY with a top-1 per group, and the performance is not ideal. In addition, the unread count between the current user and each individual contact needs to be maintained, and storing it in a field of the contact table is much more convenient than deriving it from the index table.

3. With TCP long connections, how do we ensure that "when a message needs to be sent to a certain user, the network connection corresponding to that user can be accurately found"?

Answer: First, the user goes through a login process: (1) the TCP client and the server establish a TCP connection via the three-way handshake; (2) the client sends a login request over this connection; (3) the server parses and validates the login request, and if it is valid, establishes a mapping between the current user's uid and the socket descriptor (fd) identifying the current TCP connection; (4) this mapping is usually stored in a local cache or a distributed cache. Later, when the server receives a message to be sent to this user, it first looks up the fd in the cache by uid, and if found, pushes the message out over that fd.
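
A minimal sketch of the uid-to-connection mapping described in steps (3)-(4); the connection is represented here by an fd-like integer so the sketch stays self-contained, whereas a real netty gateway would map to a Channel:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionRegistry {
    private final Map<Long, Integer> uidToFd = new ConcurrentHashMap<>();

    // Step (3)-(4): after a valid login, remember which connection the user owns.
    void onLogin(long uid, int fd) {
        uidToFd.put(uid, fd);
    }

    // Delivery path: look up the fd by uid, then push the message on it.
    Integer lookup(long uid) {
        return uidToFd.get(uid);
    }

    // Clean up the mapping when the connection is closed.
    void onDisconnect(long uid) {
        uidToFd.remove(uid);
    }
}
```

In a distributed deployment the same map would live in a shared cache keyed by uid, with the value extended to include which gateway machine holds the connection.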

4. With the ACK mechanism of the TCP protocol itself, why do we need the ACK mechanism of the business layer?

Answer: This question is best explained from the perspective of how the operating system (linux/windows/android/ios) implements the TCP protocol:
     1 The operating system creates a TCP send buffer on the sending side and a TCP receive buffer on the receiving side;
     2 When the application's send() call returns successfully, the data has in fact only been written into the TCP send buffer;
     3 According to the TCP protocol, while the connection is healthy, data in the TCP send buffer arrives in the TCP receive buffer in an "ordered and reliable" manner, and the receiving application is then notified that data has arrived;
     4 But when the TCP connection is broken, data may remain in both the TCP send buffer and the TCP receive buffer. How does the operating system handle this?
           For data still unsent in the TCP send buffer, the operating system does not notify the application (just imagine: send() has already returned success, and later you are told it failed; how would you even design such an API? Far too complicated...). The usual treatment is simply to reclaim the TCP send buffer and its socket resources.
           On the receiving side, once the connection is broken no more data is written into the TCP receive buffer, so whatever is already there can still be processed in time; but when the disconnection is detected, in order to release resources promptly, the TCP receive buffer and the corresponding socket resources are reclaimed directly.

The summary is: when the sender's send() call returns successfully, the data has merely been written into the TCP send buffer; there is no guarantee it has been processed by the receiver's application. What can be done? Only an application-layer ACK mechanism can close this gap. Even if the data is successfully delivered to the receiving device, things can still go wrong when the TCP layer hands the data to the application, for example a failure writing the client's local db, so that at the business layer the message was not actually received. A business-layer ack protects against this: the client returns the ack to the server only after every step has succeeded.
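
The business-layer ACK described above can be sketched as follows; the class and method names are illustrative, and the timer/retransmit wiring is omitted:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The server keeps each pushed message in a pending map and only drops it
// when the client acks, which the client does only after fully processing
// the message (including the local-db write).
public class AckTracker {
    private final Map<Long, String> pending = new ConcurrentHashMap<>();

    // Called when the server pushes a message down the connection.
    void onPush(long msgId, String payload) {
        pending.put(msgId, payload); // keep until the business-layer ack arrives
    }

    // Called when the client's ack arrives; the message is now safe to forget.
    void onAck(long msgId) {
        pending.remove(msgId);
    }

    // A timer would scan for entries like this and re-push them.
    boolean needsRetry(long msgId) {
        return pending.containsKey(msgId);
    }
}
```

The key point mirrors the text: a successful send() tells the server nothing, so delivery is only considered complete when onAck fires.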

5. In the instant messaging scenario, why does the sequence number generator that guarantees message ordering not need to be globally incrementing?

Answer: This is determined by the business scenario. Messages of one group and messages of another group are logically completely isolated, so it is enough for sequence numbers to increase within a local scope such as a single group. Of course, a global increment would be ideal if it could be had cheaply, but it consumes a lot of resources without bringing extra benefit.
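
A minimal sketch of per-conversation sequence generation; the class name is illustrative, and a production system would persist the counters rather than keep them in memory:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Each conversation (e.g. each group) gets its own counter, so numbers only
// ever increase within one conversation instead of through one global,
// contended generator.
public class SeqGenerator {
    private final Map<String, AtomicLong> perConversation = new ConcurrentHashMap<>();

    long next(String conversationId) {
        return perConversation
                .computeIfAbsent(conversationId, k -> new AtomicLong())
                .incrementAndGet();
    }
}
```

Two different groups each start from 1 independently, which is exactly the "local increment" the answer describes.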

6. Can TLS detect a client emulator imitating a real user's access? If not, what better approaches are there?

Answer: TLS is an encryption protocol at the transport layer. It ensures that messages are not eavesdropped on, tampered with, or forged in transit, but it cannot tell whether the peer is a real user.
         If a client emulator accesses the server exactly like a real user, there is usually no need to identify it, because in that case the emulator is generally just helping a real user do something, with no malicious behavior. If there is malicious behavior, identification is done with machine learning: for example, a client emulator tends to send messages at an abnormally high frequency, and online access traffic can be screened for that pattern.

7. Can an atomic embedded-script approach like Redis+Lua guarantee the consistency of the unread-count changes? For example, will there be a problem if the machine loses power during execution?

If redis loses power while the lua script is executing, the two unread counts may become inconsistent. Executing a lua script in redis only guarantees that the multiple commands run atomically with respect to other clients, and that the script as a whole is replicated to the slave and written to the aof; if power is lost mid-execution, the remaining part of the interrupted script is simply never executed. In practice this probability is very small. As a fallback, when the number of conversations is small, you can fetch the per-conversation unread counts in full and use them to overwrite the total unread count, so the two have a chance to converge eventually.

8. Can TCP keepalive and an application-layer heartbeat be used together?

From a functional perspective, there is no need to combine transport-layer and application-layer heartbeats: the connection liveness that a transport-layer heartbeat detects can also be detected by the application-layer heartbeat. From a debugging perspective, however, an application-layer heartbeat alone cannot tell whether a failure is a network problem or an application problem, and the transport-layer heartbeat is a useful aid there, though the implementation becomes more complex.

9. What is the binary-search logic for tuning the heartbeat interval?

The next, dynamically adjusted heartbeat interval is the midpoint between the current largest interval confirmed safe and the smallest interval confirmed too large. For example, if the last heartbeat interval was 4 minutes and N consecutive heartbeats were acknowledged, then the confirmed-safe interval is currently 4 minutes. Assuming 10 minutes has already been confirmed to be too large, the next interval to try is (4 + 10) / 2 = 7 minutes.
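
The adjustment rule above is a one-liner; this sketch (names are illustrative) just makes the worked example executable:

```java
public class HeartbeatTuner {
    // Next probe interval = midpoint of the largest interval confirmed safe
    // and the smallest interval confirmed too large (in minutes here).
    static int nextInterval(int confirmedSafe, int confirmedTooLarge) {
        return (confirmedSafe + confirmedTooLarge) / 2;
    }

    public static void main(String[] args) {
        // 4 minutes confirmed safe, 10 minutes confirmed too large -> try 7.
        System.out.println(nextInterval(4, 10)); // 7
    }
}
```

Each probe result then narrows the bracket: a success raises confirmedSafe to the probed value, a failure lowers confirmedTooLarge, exactly as in binary search.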

10. How to enable TCP keepalive?

For example, in netty it can be enabled with the following code:

ServerBootstrap bootstrap = new ServerBootstrap();
bootstrap.childOption(ChannelOption.SO_KEEPALIVE, true);

In addition, the heartbeat intervals of TCP keepalive are configured by modifying the system's /etc/sysctl.conf, similar to the following:

net.ipv4.tcp_keepalive_time=120
net.ipv4.tcp_keepalive_intvl=30
net.ipv4.tcp_keepalive_probes=3

11. If a user has a lot of offline messages, is there a way to reduce the amount of data transferred when the user comes online?

Answer: Not all of a user's offline messages are equally interesting to the user. The user may only read the most recent messages with a recent contact and never look at the earlier ones, so pulling all previous offline messages to the device would be a waste of resources. The usual practice is:
1 Separate all of the user's offline messages by contact;
2 When the user opens the chat window with a contact after logging in, load only the 10 most recent offline messages with that contact;
3 When the user scrolls back on the phone, pull one more page of 10 messages at a time.
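
The per-contact paging in steps 2-3 can be sketched like this; the class name and in-memory list are illustrative stand-ins for a real per-contact message store:

```java
import java.util.List;

public class OfflinePager {
    static final int PAGE_SIZE = 10;

    // messages are assumed ordered oldest -> newest; page 0 is the newest
    // page (what the chat window loads first), page 1 the next older, etc.
    static List<String> page(List<String> messages, int pageNo) {
        int end = Math.max(0, messages.size() - pageNo * PAGE_SIZE);
        int start = Math.max(0, end - PAGE_SIZE);
        return messages.subList(start, end);
    }
}
```

Opening the chat window corresponds to page(messages, 0); each backward scroll increments the page number and fetches ten more.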

12. When scaling down, what is the difference between a long-connection access gateway machine and an ordinary web server machine?

An ordinary web server machine serves short-lived http connections. Removing such a machine when scaling down will fail some in-flight front-end requests, but through nginx's load balancing the reconnecting clients land on another server, so for the client it is basically imperceptible. For a long-connection access gateway machine, however, removing the machine breaks every long connection it holds, affecting all users attached to that gateway. The clients can reconnect to a new gateway via the entry scheduling service, but the user experience still suffers.

13. To avoid querying the user's online status for every message, all messages are delivered to every gateway node, which multiplies each gateway machine's traffic. Will this slow the rate at which consumers push messages? After all, with 50 gateway nodes, each gateway that originally needed to fetch only 1 message now fetches 50, of which 49 are useless.

So this requires a trade-off. If most of the business scenarios are point-to-point messaging, precise delivery using the global online status is better; for large fan-out scenarios such as group chat and live streaming, having all gateways subscribe to the full message stream is recommended.

14. When notifying friends that a user has gone online or offline, do you first query the friends' online status to find the servers they are connected to, and then push the notification to those servers? Is there a way to optimize querying hundreds of online friends out of online-status data covering hundreds of millions of users?

A user's friends are limited in number. If online status is kept in a central kv store, concurrently querying a few hundred friends is not a problem; performance will not be too slow, but the storage pressure is higher. If you really want to optimize, and the friend list is large, one option is to look up the user's friends, assemble them into a special message, and send it to all gateways; each gateway then picks out the friends connected to its own machine and pushes the online/offline notification to them.

15. In an automatic circuit-breaking mechanism, how is the Fail-fast breaking threshold determined (for example: trip the breaker when the proportion of requests taking longer than 1s per unit time reaches 50%)?

On the one hand, determine it through load testing; on the other hand, use a half-open mechanism combined with flow control that lets only 50% of the traffic through.

16. Are there good open-source recommendations for rate limiting?

Guava's RateLimiter is recommended for single-machine rate limiting; a global rate limiter is simple to write directly on Redis+Lua.
Either way, the rate-limit threshold needs to be calibrated with load tests, to avoid a threshold so loose that the service still gets dragged down, or so tight that a little jitter trips the breaker for the whole system.
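
To show the idea behind such limiters without pulling in a dependency, here is a minimal single-machine token-bucket sketch (Guava's RateLimiter is a production-grade version of the same idea; Redis+Lua extends it to a global limit). The class and parameter names are illustrative:

```java
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastNanos;

    TokenBucket(long capacity, double permitsPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.tokens = capacity;       // start full
        this.lastNanos = System.nanoTime();
    }

    // Non-blocking acquire: refill based on elapsed time, then try to spend
    // one token; returns false when the caller should be rate-limited.
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastNanos) * refillPerNano);
        lastNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

With capacity 2 and a refill rate of 1 permit/second, two immediate calls succeed and a third is rejected until roughly a second has passed.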

 


Origin blog.csdn.net/madongyu1259892936/article/details/106003906