Detailed Design of an Instant Messaging (IM) System for Billion-Scale Traffic (Full)

Foreword

For the related Java background, you can read my article:
The zero-based Java framework learning route from beginner to master (super complete)

The approach is to design the system's business logic first and then optimize it in a targeted way, mastering the deep knowledge points of each framework from the overall architecture so as to identify gaps and fill them in.

Other system designs are as follows:

  1. Detailed design of a seckill (flash-sale) system
  2. Detailed design of a short URL system (full)
  3. Detailed explanation of the LVS layer-4 load balancing architecture

The knowledge points below are mainly derived from the system design interview question: how to design an instant messaging (IM) system with one billion traffic?
While taking notes, I incorporated some of my own insights and extensions.

1. Background

Desired functionality:

  • Add friends
  • Chat session list
  • Single chat and group chat
  • Multi-terminal login (state pulled from the database)
  • Message roaming (after one terminal receives many messages, logging in on another terminal synchronizes them)
  • Message read/unread status and a read/unread list

Constraints to consider:

  • QPS and storage capacity
  • Reliability
  • Latency of sending and receiving messages
  • Consistency of message ordering (messages are received in the order they were sent, with none duplicated or missed)
  • Group chat
  • Maintainability and ease of operations

2. High performance

Access layer optimization

  1. The polling (pull) mode cannot meet real-time requirements; instead, messages can be pushed to the client over a TCP long connection

  2. How does the client establish a long connection with the server?
    a. The client connects directly to the server's public IP via socket programming
    b. An IPConf service sends the public IPs to the client, which uses them flexibly for the long connection (the client picks the optimal one and caches it; if the connection fails it tries another IP, reducing the pressure on IPConf)
    c. After the long connection is established, the business logic layer maps the connection to the uid
    d. IPConf interacts with the business logic layer through a coordination service and load-balances across machines

  3. Long connections consume resources, and coupling them with DB reads and writes slows everything down.
    Split the services: the long-connection service is responsible only for sending and receiving messages, while the business server handles the business logic

  4. Scheduling and resource optimization. The messaging logic iterates frequently to support business development, and restarting a stateful service is slow.
    So split the frequently changing services from the stable ones. In general: a state-machine service holds connection state; the long-connection service only sends and receives messages and updates the state machine; and the IPConf service makes scheduling decisions by querying the state machine.
    Frequently changing logic (such as login and logout) controls the closing of long connections by sending close commands through MQ to the long-connection scheduler

  5. When establishing a long connection, how does the client know which access-layer server to connect to? The candidate schemes are compared below:

Scheme 1: Broadcast
  How it works:
  1. The logic layer sends each message to all access-layer machines
  2. Each access-layer machine handles only the connections of its own uids (checking a local map for whether it holds the uid)
  Advantages: simple to implement; suitable for large chat rooms
  Disadvantages:
  1. Too much useless communication in single-chat scenarios
  2. A message storm can crash the system

Scheme 2: Consistent hashing
  How it works:
  1. IPConf and the business logic layer use the same hash function, and connections are sharded onto servers by uid
  2. Servers are discovered through a unified service registry; horizontal scaling uses virtual-node migration, and migration is performed by disconnecting and reconnecting
  Advantages: simple computation with no performance overhead
  Disadvantages:
  1. Heavy reliance on service discovery
  2. Horizontal scaling requires connection migration
  3. Consistent hashing has uniformity limitations

Scheme 3: Routing service layer
  How it works:
  1. The bottom layer uses a KV store for the uid -> access-machine mapping
  2. The mapping is updated per session
  3. Message queues connect the routing service and the business logic layer
  4. Any routing-service instance consumes from MQ, parses the message's routing information, and determines the access-layer machine
  5. The message is sent to the queue dedicated to that access-layer machine, using a routeKey derived from the machine's identifier
  Advantages:
  1. MQ is reliable and provides decoupling and peak shaving
  2. The routing service is stateless and scales horizontally
  3. The routing layer stores the long-connection mapping
  Disadvantages:
  1. The routing-layer cluster must be maintained independently
  2. The service depends on the stability of the underlying KV store and MQ

Supplementary reading: Principle and Application Analysis of the Consistent Hash Algorithm
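
To make the consistent-hashing scheme concrete, here is a minimal sketch (Go; the names and the choice of crc32 are my own assumptions, not the article's): a hash ring with virtual nodes that IPConf and the logic layer could share, so both sides map a uid to the same access-layer server.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strconv"
)

// Ring maps uids to access-layer servers; IPConf and the logic
// layer must use the exact same hash so they agree on placement.
type Ring struct {
	virtual int               // virtual nodes per server, smooths uniformity
	keys    []uint32          // sorted hashes of all virtual nodes
	owner   map[uint32]string // virtual-node hash -> server address
}

func NewRing(virtual int) *Ring {
	return &Ring{virtual: virtual, owner: map[uint32]string{}}
}

func (r *Ring) AddServer(addr string) {
	for i := 0; i < r.virtual; i++ {
		h := crc32.ChecksumIEEE([]byte(addr + "#" + strconv.Itoa(i)))
		r.keys = append(r.keys, h)
		r.owner[h] = addr
	}
	sort.Slice(r.keys, func(i, j int) bool { return r.keys[i] < r.keys[j] })
}

// Locate returns the access server responsible for a uid: the first
// virtual node clockwise from hash(uid) on the ring.
func (r *Ring) Locate(uid string) string {
	h := crc32.ChecksumIEEE([]byte(uid))
	i := sort.Search(len(r.keys), func(i int) bool { return r.keys[i] >= h })
	if i == len(r.keys) { // wrap around the ring
		i = 0
	}
	return r.owner[r.keys[i]]
}

func main() {
	ring := NewRing(128)
	ring.AddServer("10.0.0.1:9000")
	ring.AddServer("10.0.0.2:9000")
	fmt.Println(ring.Locate("uid-42")) // both layers compute the same server
}
```

Adding a server only remaps the uids whose ring positions fall on its virtual nodes, which is why horizontal scaling needs connection migration only for that slice of users.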

Storage layer optimization

  1. Reads are few and writes are many. How do we control message fan-out to prevent resource exhaustion?
    Make message distribution as concurrent as possible under control (not one thread per message, but dispatching work across a bounded set of threads), so that the system does not collapse when group messages are sent

  2. In the above situation, messages may pile up and latency may increase. What then?
    a. Cache, package, and compress: let the router send in batches, e.g. flushing a window once every 10 ms; overall throughput improves accordingly (see the batching sketch after this list)
    b. Compress and package through a push-pull combination (for example, the server sends a notification and the client pulls the messages explicitly, reducing per-message polling)

  3. The storage system suffers severe write amplification (when A sends to B, C, and D, the message itself is written once, but each of the three channels must also be written once). How do we reduce the cost?
    a. Ultra-large groups are downgraded to read-diffusion mode for storage: the message is written synchronously only to the group inbox (when A reads the messages of B, C, and D, it only reads the corresponding group inbox; read complexity becomes O(n) but each message is written only once)
    b. Group status messages are stored asynchronously

  4. How are read receipts handled? How do we ensure read/unread consistency in group chats with thousands of members?
    a. Real-time stream processing: the receiver's read receipt is sent synchronously, while the receiver's read state is stored asynchronously
    b. Messages in ultra-large groups are downgraded through a group-status-change service
    c. Asynchronous writes guarantee eventual consistency through retries
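
Here is a minimal sketch of the 10 ms send-window idea from point 2 (Go; the Msg type and parameters are hypothetical): messages are buffered and flushed either when the window timer fires or when the batch fills, so the router pays one downstream call per window rather than one per message.

```go
package main

import (
	"fmt"
	"time"
)

type Msg struct{ SessionID, Body string }

// batcher buffers incoming messages and flushes them either when the
// window timer fires (e.g. every 10ms) or when the batch is full.
func batcher(in <-chan Msg, window time.Duration, maxBatch int, flush func([]Msg)) {
	buf := make([]Msg, 0, maxBatch)
	emit := func() {
		out := make([]Msg, len(buf))
		copy(out, buf) // hand flush its own copy; keep reusing buf
		flush(out)
		buf = buf[:0]
	}
	ticker := time.NewTicker(window)
	defer ticker.Stop()
	for {
		select {
		case m, ok := <-in:
			if !ok {
				if len(buf) > 0 {
					emit()
				}
				return
			}
			buf = append(buf, m)
			if len(buf) >= maxBatch {
				emit()
			}
		case <-ticker.C:
			if len(buf) > 0 {
				emit() // one network call per window, not per message
			}
		}
	}
}

func main() {
	in := make(chan Msg)
	done := make(chan struct{})
	go func() {
		defer close(done)
		batcher(in, 10*time.Millisecond, 100, func(batch []Msg) {
			fmt.Printf("flushing %d messages in one packet\n", len(batch))
		})
	}()
	for i := 0; i < 5; i++ {
		in <- Msg{SessionID: "s1", Body: "hello"}
	}
	close(in)
	<-done
}
```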

How to optimize latency when writing to disk?
(It is best to batch the processing on disk when persisting data)

Hierarchical storage:

  • Per uid, a sortable message list is maintained in the database, and the maximum number of entries is adjusted according to the activity level
  • Very large group chats degenerate into read-diffusion mode: the list is maintained at the session level and message status is cached asynchronously. Each group-chat session keeps a local LRU cache of the message list keyed by ID (the local LRU can use the HotRing algorithm to identify hotspot data)
  • Use protobuf serialization and compression to reduce storage space
  • Broadcast status to local caches via a message queue (the Gossip protocol can also be used)
  • Data older than a certain period is compressed into the file system, with OLAP queries provided

The advantages of this solution are as follows:
storage is tiered by read popularity, the scope of offline-message synchronization is easy to operate, and it also supports push-pull combination for group chats.
The disadvantages are as follows:
the local cache warms up slowly, so there is jitter when the service restarts;
a large amount of memory is used, making operations difficult;
and issues such as cache hit rate need attention.


Multiple DB storage: (different storage for different fields)

  • RocksDB stores KV pairs: the key is the session ID and the value is the serialized message list
  • If the message list is too long, it can be split and sorted: the key is the session ID and the value is the meta-index information
  • Use session ID + segment seqID as the key and a message-list segment as the value, merging the segments on read
  • etc.

The advantage is that combining DBs lets their strengths cover each other's weaknesses, and disk-based storage solves the capacity problem.
The disadvantage is that single-node RocksDB requires a self-developed distributed proxy layer, and because a disk KV reads and writes the disk, message-pull performance suffers
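
As a small illustration of the "session ID + segment seqID" key layout above (Go; the fixed-width encoding is an assumption): big-endian keys make a disk KV store's lexicographic order match (session, segment) order, so one prefix scan returns a conversation's segments in order for the read-time merge.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// segmentKey builds "sessionID | segment seqID" as a fixed-width,
// big-endian key, so lexicographic KV order == (session, segment) order.
func segmentKey(sessionID, segment uint64) []byte {
	key := make([]byte, 16)
	binary.BigEndian.PutUint64(key[0:8], sessionID)
	binary.BigEndian.PutUint64(key[8:16], segment)
	return key
}

// sessionPrefix is the scan prefix covering every segment of one
// session; iterating the KV store over it yields the segments in
// order, which the reader then merges into the full message list.
func sessionPrefix(sessionID uint64) []byte {
	p := make([]byte, 8)
	binary.BigEndian.PutUint64(p, sessionID)
	return p
}

func main() {
	fmt.Printf("key  = %x\n", segmentKey(42, 3))
	fmt.Printf("scan = %x*\n", sessionPrefix(42))
}
```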


Graph database:

Stores the relationships in the data, such as the various linked ("zipper") relationships of conversation message lists.
It can run near-line OLAP queries to quickly identify hot messages, etc., so message processing is fast and hits accurately

Advantages:

  1. Low latency
  2. Rich query functionality

The disadvantage is that the operations cost of a graph database is relatively high, and it is not easy to maintain


Storage layer proxy service:

Adding this layer hides the underlying details. The proxy layer shards by hashing the key and replicates the KV data based on a consensus protocol.
For super-large group chats, many requests may pull the same message list (so the proxy keeps a short-lived cache, reducing access to the downstream DB)

Advantages:

  1. Business isolation
  2. The proxy layer is stateless and scales horizontally
  3. Message lists are cached at the proxy layer, reducing the load on the underlying storage

The disadvantage is that it adds a layer of logic and increases complexity

3. High consistency

Goal: the message order seen by sender and receiver is consistent, and no message is lost

How do we ensure that messages are not lost?

TCP itself does not lose the data it delivers, but packets can still effectively be lost end to end (disconnects, delays).
With this background, a retry mechanism can be used

  • The upstream client retries (when no ack arrives, the client retries until the server returns an ack)
  • The downstream server retries (likewise retrying until the client returns an ack)

However, retries have a flaw: if an ack is lost, the message will be retried indefinitely, which also breaks ordering (for example, the original sequence ABCD becomes ABCCD, causing confusion)

To solve this duplicate-send problem, can we deduplicate with a UUID?

No: using a UUID to deduplicate acks is inappropriate here.

Each ack packet (carrying a UUID) would have to be checked against a global table for duplicates.
If the traffic is small, a global table (stored in one place) works.
If the traffic is larger, deduplication can be restricted to a small time window,
but at billion-scale traffic there are still far too many entries even within a small window

So what is the correct way to solve duplicate sends?

Borrow from TCP's handshake mechanism and validate with ack+1 (a sketch follows this list):

  • Upstream: the client generates an auto-incrementing cid per message in a session, and the server stores the largest cid seen so far.
    If an arriving message is not cid+1, it is discarded (the server cannot wait indefinitely, because subsequent messages would be blocked and resources wasted).
    The cid here is per client and may repeat across different clients
  • Downstream (handled similarly to upstream): the server assigns a seqid to every message it sends, and the client stores the largest seqid seen so far.
    If an arriving message is not seqid+1, it is discarded (similar to the upstream handling).
    The seqid here is server-side and must be guaranteed unique and increasing
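
A minimal sketch of the ack+1 rule above (Go; the names are hypothetical): the server keeps the largest cid accepted per connection and stores a message only when it is exactly preMaxCID + 1; anything else is a duplicate or a gap and is discarded, and the client keeps retrying until it receives the ack.

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup tracks, per connection, the largest client message ID accepted.
type Dedup struct {
	mu     sync.Mutex
	maxCID map[string]uint64 // connID -> preMaxCID
}

func NewDedup() *Dedup { return &Dedup{maxCID: map[string]uint64{}} }

// Accept applies the TCP-handshake-style rule: only preMaxCID+1 is
// stored; duplicates (cid <= max) and gaps (cid > max+1) are rejected,
// and the client keeps retrying until it receives the ack.
func (d *Dedup) Accept(connID string, cid uint64) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if cid == d.maxCID[connID]+1 {
		d.maxCID[connID] = cid
		return true // store the message, then send ack(cid)
	}
	return false // discard: duplicate or out of order
}

func main() {
	d := NewDedup()
	fmt.Println(d.Accept("conn-1", 1)) // true
	fmt.Println(d.Accept("conn-1", 1)) // false: duplicate resend
	fmt.Println(d.Accept("conn-1", 3)) // false: gap, cid 2 missing
	fmt.Println(d.Accept("conn-1", 2)) // true
}
```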

How do we guarantee message order?

From the above we already know that messages are neither duplicated nor lost,
so they can be sorted by their increasing IDs

  • The upstream client's messages are assigned seqids according to their cids (sorting is needed because the server holds seqids from more than one client)
  • Downstream, messages are sorted by seqid

A special case arises after a server crash: while the seqid store is doing master-slave replication, there is replication lag.

(How do we ensure IDs are not repeated after a restart?)
Every time Redis restarts, it checks whether the instance ID is the same; if not, a restart has occurred. After a restart, a hash plus a timestamp is used to ensure IDs are not repeated (the IDs keep increasing, but only in trend, not strictly monotonically)

How to generate an incrementing message id?

You can read this article: Ultra-detailed analysis of distributed ID generation methods (full)

Specific scheme design

Scheme 1: Pure pull
  How it works: only uplink (sender-side) message consistency is guaranteed
  Advantages: simpler to implement
  Disadvantages: poor real-time performance (ordering goes purely by the server's view)

Scheme 2: Monotonically increasing ID (via Redis + Lua)
  Advantages: simpler to implement
  Disadvantages:
  1. Too many round trips; poor group-chat performance
  2. Depends on the reliability of a distributed ID-generation system

Scheme 3: Double-ID method
  How it works: the sender sends the current ID plus the previous message's ID, and the receiver keeps the last ID it saw (sender and receiver detect missing messages by comparing IDs)
  Advantages: does not rely on monotonic ID generation
  Disadvantages:
  1. Too many round trips; poor group-chat performance
  2. The downlink (receiver-side) mechanism is complex to implement

Scheme 4: Push-pull combination
  How it works: the server notifies the client to pull, and the client pulls the messages (the pull itself can serve as the ack)
  Advantages:
  1. Downlink messages need no ack mechanism, and the server need not maintain timeout-based resends
  2. All sessions can be pulled in a single request, reducing the number of calls
  3. Batch fetching enables message compression and improves bandwidth utilization
  Disadvantages: cannot resolve the ordering of uplink messages

Combining the above schemes yields a better overall solution (sketched below):

  • Uplink messages (client side) ensure ordering consistency through the previous-ID pairing strategy (the double-ID method)
  • Downlink messages (server side) use the push-pull combination to ensure high throughput
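
A minimal sketch of the chosen downlink scheme (Go; the wire types are hypothetical): the server pushes only a lightweight notify carrying the session's newest seqID; the client compares it with what it already holds and pulls the missing range, and the pull doubles as the ack, so the server keeps no downlink retransmission timers.

```go
package main

import "fmt"

// Notify is the tiny push frame: no message body, just the newest seqID.
type Notify struct {
	SessionID string
	MaxSeqID  uint64
}

type Client struct {
	have map[string]uint64 // sessionID -> largest seqID already stored
	pull func(session string, from, to uint64) []string
}

// OnNotify pulls only the gap [have+1, MaxSeqID]; the pull request
// itself tells the server everything up to MaxSeqID was delivered.
func (c *Client) OnNotify(n Notify) {
	from := c.have[n.SessionID] + 1
	if from > n.MaxSeqID {
		return // nothing new
	}
	msgs := c.pull(n.SessionID, from, n.MaxSeqID) // batched, compressible
	c.have[n.SessionID] = n.MaxSeqID
	fmt.Printf("session %s: pulled %d messages\n", n.SessionID, len(msgs))
}

func main() {
	c := &Client{
		have: map[string]uint64{"s1": 7},
		pull: func(session string, from, to uint64) []string {
			out := make([]string, 0, to-from+1)
			for i := from; i <= to; i++ {
				out = append(out, fmt.Sprintf("msg-%d", i))
			}
			return out
		},
	}
	c.OnNotify(Notify{SessionID: "s1", MaxSeqID: 10}) // pulls 8..10
}
```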

4. High availability

Complex, long call chains easily create bottlenecks that destroy high availability

4.1 Broken connections

The entire link crosses the public network (operators). TCP keepalive (a 2-hour heartbeat by default) tears down a long connection that does not respond before the timeout. The heartbeat mechanism must cover the entire link: kernel-level keepalive can keep the connection alive hop by hop, but it cannot observe business-level liveness, so to maintain end-to-end validity the heartbeat should be maintained at the business logic layer.

1. Heartbeat mechanism:

The heartbeat mechanism should be placed in the business logic layer

  • Server pushing to the client (×): the server would have to probe every connection itself, which is unrealistic when the number of connections is large
  • Client pushing to the server (√): the client periodically sends its heartbeat to the server's gateway, which resets its internal timer

Packet size: the heartbeat control packet should not be large; keep it under 0.5 KB

Heartbeat interval:

  • Too long: dead client connections linger, lowering efficiency and resource utilization
  • Too short: too many heartbeat requests, putting excessive traffic pressure on the gateway

The best approach is an adaptive heartbeat: the client starts from a fixed heartbeat (no single interval suits every link), and the backend estimates the NAT eviction time adaptively (converging on the midpoint between the known-good minimum and the failing maximum), as in the sketch below
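
One possible shape for the adaptive heartbeat (Go; the probe protocol and starting bounds are assumptions, not from the article): binary-search between the longest interval known to keep the connection alive and the shortest interval known to get the NAT mapping evicted.

```go
package main

import "fmt"

// HeartbeatTuner binary-searches the NAT idle-eviction time: it keeps
// the longest interval that kept the connection alive and the shortest
// interval that got it dropped, and always probes their midpoint.
type HeartbeatTuner struct {
	goodSec int // longest known-safe heartbeat interval
	badSec  int // shortest interval known to lose the NAT mapping
}

// Next returns the interval to try in the next probe cycle.
func (t *HeartbeatTuner) Next() int { return (t.goodSec + t.badSec) / 2 }

// Report feeds back whether the connection survived the probed interval.
func (t *HeartbeatTuner) Report(interval int, alive bool) {
	if alive {
		t.goodSec = interval // can afford longer, cheaper heartbeats
	} else {
		t.badSec = interval // the NAT evicted us; back off
	}
}

func main() {
	t := &HeartbeatTuner{goodSec: 30, badSec: 300} // assumed starting bounds, in seconds
	for i := 0; i < 6; i++ {
		iv := t.Next()
		alive := iv < 200 // simulate a NAT that evicts idle flows after ~200s
		t.Report(iv, alive)
		fmt.Printf("probe %ds -> alive=%v\n", iv, alive)
	}
	fmt.Printf("use safe interval %ds\n", t.goodSec)
}
```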

2. Disconnection and reconnection

Background 1: on a high-speed train, the client's network switches frequently, causing repeated session creation and destruction and excessive clearing of resources. How can a long connection be re-established more stably and quickly?

Solution: at the moment of disconnection, start a session timeout. If the client reconnects within the window, its session resources are not cleared and the link is restored directly (session multiplexing), guaranteeing link stability. A new TCP channel is created and simply re-associated with the session via its fid.

This approach avoids frequent creation and destruction and prevents a Redis avalanche

Background 2: if a server crash forces clients to reconnect, how do we handle the avalanche of requests?

Solution: the IPConf service uses its discovery mechanism to quickly identify failed server nodes and schedule around them. After disconnecting, the client reconnects using a randomized strategy to be rescheduled; if the original server is still alive, it is preferred (this provides fault self-discovery plus load balancing)


3. Message storms (how to ensure message reliability)

Background 3: under long connections, how are unsent messages resent after a service crash?

Solution:

  1. After the connection is (re-)established, the client must call the offline-message synchronization interface to actively pull messages (mainly to synchronize the gateway's status messages)
  2. Connection status lives in a separate status server; the rest of the system accesses it via RPC or shared memory (the status server can persist its state, similar to snapshotting)

Background 4: too many heartbeat timeouts create a huge number of timers that occupy memory, freezing the whole system and causing messages to time out. What then?

Solution: a binary heap gives O(log n) timer operations, but under massive creation and cancellation the data structure itself becomes the bottleneck, so switch to the timing-wheel algorithm (at the cost of some timing precision); a minimal sketch follows
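
A minimal single-level timing-wheel sketch (Go; this is a common textbook form, not plato's actual implementation): timers hash into slots by expiry tick, making insertion O(1) at the cost of firing only at tick granularity, which is exactly the precision trade-off mentioned above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type wheel struct {
	mu    sync.Mutex
	slots [][]func() // slots[i] holds callbacks expiring at that tick
	cur   int
	tick  time.Duration
}

func newWheel(size int, tick time.Duration) *wheel {
	return &wheel{slots: make([][]func(), size), tick: tick}
}

// after schedules cb to run after roughly d: O(1) insert, precision of
// one tick. Delays longer than a full rotation would need a per-entry
// round counter or a second wheel level, omitted here for brevity.
func (w *wheel) after(d time.Duration, cb func()) {
	w.mu.Lock()
	defer w.mu.Unlock()
	pos := (w.cur + int(d/w.tick)) % len(w.slots)
	w.slots[pos] = append(w.slots[pos], cb)
}

// run advances one slot per tick and fires everything in it.
func (w *wheel) run() {
	ticker := time.NewTicker(w.tick)
	defer ticker.Stop()
	for range ticker.C {
		w.mu.Lock()
		w.cur = (w.cur + 1) % len(w.slots)
		expired := w.slots[w.cur]
		w.slots[w.cur] = nil
		w.mu.Unlock()
		for _, cb := range expired {
			cb()
		}
	}
}

func main() {
	w := newWheel(512, 10*time.Millisecond) // one rotation ~ 5.12s
	go w.run()
	w.after(50*time.Millisecond, func() { fmt.Println("heartbeat timeout fired") })
	time.Sleep(100 * time.Millisecond)
}
```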

4.2 Weak network

Optimize the TCP connection at the link level (a socket-tuning sketch follows this list):

  • Keep TCP packets small: IP fragments packets larger than about 1400 bytes, so keep packets under 1400
  • Enlarge the congestion-control window to avoid send-side congestion stalls
  • Tune the socket read/write buffers to avoid packet overflow
  • Adjust the initial RTO to keep retries from worsening network congestion
  • Disable the Nagle algorithm so small packets are not buffered (TCP normally accumulates small writes before sending; here we prevent that accumulation)
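
A sketch of the socket-level knobs from this list (Go). Buffer sizes and Nagle can be set directly on a *net.TCPConn; the initial congestion window and initial RTO are kernel settings on Linux (adjusted via ip route / sysctl rather than per socket), so they appear only as comments.

```go
package main

import (
	"log"
	"net"
)

func dialTuned(addr string) (*net.TCPConn, error) {
	c, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tcp := c.(*net.TCPConn)

	// Disable Nagle: send small frames immediately instead of letting
	// the kernel accumulate them (IM packets are tiny and latency-bound).
	if err := tcp.SetNoDelay(true); err != nil {
		return nil, err
	}
	// Enlarge socket read/write buffers to avoid overflow under bursts.
	if err := tcp.SetReadBuffer(256 << 10); err != nil {
		return nil, err
	}
	if err := tcp.SetWriteBuffer(256 << 10); err != nil {
		return nil, err
	}
	// The initial congestion window and initial RTO are kernel-level
	// settings (e.g. `ip route change ... initcwnd 10` on Linux), not
	// per-socket options. Application framing should also keep each
	// packet under ~1400 bytes to avoid IP fragmentation.
	return tcp, nil
}

func main() {
	conn, err := dialTuned("example.com:9000") // hypothetical gateway address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```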

Optimize by strategy:

  • Speed-test multiple IPs and choose the best connection line
  • Choose different timeouts for different network environments; the timeout parameters are pushed down dynamically by the policy service
  • Degrade to short connections: when link state changes so often that the long connection is constantly re-established, fall back to polling for messages

Protocol optimization:

  • Binary protocol
  • QUIC protocol (UDP-based)

4.3 Geo-distributed multi-site architecture

Communication across data centers is a wide-area request (cross-data-center requests have relatively high latency).

The core idea here is to reduce wide-area requests as much as possible (to keep latency low)

5. High reliability

The client and server communicate over full-duplex TCP

Basic concepts

Short vs. long connections:

  • Short connection: the client proactively pulls data from the server, but excessive polling overloads the server (not chosen as the default). Under weak networks, however, a short connection is the better choice
  • Long connection: avoids the polling pattern and pushes to the client faster (√)

IDs:

  • connID: identifies the long connection the client establishes with the server; it can be allocated via a global ID allocator, the snowflake algorithm, etc. (it only needs to be globally unique)
  • sessionID: business-logic level; marks the chat box between A and B (essentially the conversation ID)
  • msgID: also business-logic level

Push-pull mode:
the server pushes a notification to the client, and the client responds by pulling

Network calls are the most expensive part of the entire system, and message storms must be taken into account (bandwidth fills up easily)

Reliability is generally weighed together with consistency:

  • Reliability: after a message is sent successfully, it arrives end to end
  • Consistency: at any time, the order of received messages matches the order in which they were sent

Background:
when designing an instant messaging system, the reliability and consistency of the transport layer only guarantee communication at the bottom of the stack, not the upper-layer system. Designing this architecture well is a serious discipline.
During transmission, the overall architecture must guarantee, at every step:
Reliability: reliable uplink messages + reliable server-side processing + reliable downlink messages
Consistency: consistent uplink messages + consistent server-side processing + consistent downlink messages

Problems that can arise along the whole path:

Suppose the client sends two msg messages to the server, both arriving over the same TCP link.
Three situations commonly occur when sending a message:

  1. The client's message reaches the server (reliable at the TCP level), but the business logic layer crashes and loses it; the business layer does not know, and the client believes the message was received
  2. The business logic layer processes the messages successfully, but the server is multi-threaded and handles the two messages separately; different processing speeds for the message bodies can put the messages out of order
  3. The server's processed data reaches the client, but one message fails to be stored, so messages end up lost or out of order

As the example shows, TCP/UDP can only guarantee the reliability of the underlying transport, not the reliability of the business logic.

The particular challenges and difficulties are:

  • Different operating systems deliver messages differently, so successful arrival cannot be guaranteed at the network level alone
  • Message order cannot be determined; there is no global clock to establish a unique order
  • Multiple clients / multiple servers / multiple threads and coroutines process messages, making the order even harder to determine

Scheme selection:
to solve the above problems, the system must guarantee timeliness, reachability, idempotence, and ordering (a retry-timer sketch follows this list)

  • Timeliness: end-to-end sending and receiving is prompt and real-time (latency must stay low even at peak hours)
  • Reachability: timeout retries plus ACK confirmation.
    Timeout retry works by setting a timer per message.
    When the client sends a message to the server, the client's timer stops once the server returns an ack confirming receipt.
    When the server sends a message to the client, the server's timer stops once the client returns an ack confirming receipt.
  • Idempotence: assign a seqID and store it on the server;
    each message may be stored only once, and keeping the assigned IDs in a map ensures a resent message is not applied twice
  • Ordering: seqIDs are comparable, and the server sorts messages according to the order in which the sender sent them.
    IDs must be not only unique but also ordered (a later message's ID is larger than an earlier one's).
    Via the client ID and the server ID: if the client-assigned ID is larger, the server-assigned ID must also be larger (they stay consistent)

5.1 Uplink messages

The client sends its messages to the server (we must guarantee that this channel works)

Supplementary definitions:

  • Strictly increasing: each new ID is exactly the previous ID plus one (no gaps)
  • Trend increasing: each new ID is larger than the previous one, but gaps are allowed
  • clientID is the ID of a single client; seqID is the ID assigned by the server

The candidate schemes are as follows (each step builds on the previous one):

  1. Simply use the client's own ID for pairing (not chosen): ordering should rely on a session-global ID, not on individual client IDs.
  2. Use a UUID as the clientID: it can guarantee uniqueness but not ordering, and deduplication requires keeping the seen IDs in in-memory maps (one map per client, across N-plus clients), which wastes space (not chosen)
  3. The server keeps the mapping connID -> preMaxCID, the largest CID the client has transmitted, and pairs each incoming CID against it (the incoming ID must be exactly one larger, i.e. preMaxCID + 1 = CID). Under a weak network, however, a single lost packet retried endlessly could crush the backend (chosen), with weak networks handled specially.
  4. To address that case, a sliding window could be used: while the window is not full (an earlier ID has not arrived), the thread pool does not dispatch threads. But this requires maintaining extra long-connection state (not chosen)
  5. Trend increasing in a linked-list style (which wastes protocol message bandwidth): both client and server store a preID; after a successful pairing, the server stores the client-sent ID as the new preID. Two IDs are stored for comparison (not chosen)

We want a scheme that is reliable yet has a relatively small memory footprint.
Summarizing the methods above in a table:

Scheme 1: Strictly incrementing clientID
  How it works:
  1. The client creates a session and establishes a long connection with the server; the initial clientID is 0
  2. The first message sent is assigned a clientID, which increments strictly within the session (a session representing, say, the chat box between A and B)
  3. The server stores a message only if clientID = preClientID + 1
  4. After receiving the message, the server returns an ACK to the client (if none is received, the client retries up to three times)
  Advantages:
  1. Low communication latency over the long connection
  2. Ordering follows the sender's sequence, guaranteeing strict order
  Disadvantages: under a weak network, the loss of a single message triggers retries that paralyze the flow, and timeliness cannot be guaranteed

Scheme 2: clientID linked list
  How it works:
  1. The client uses a local timestamp as the clientID and also sends the previous message's clientID
  2. The server likewise stores preClientID and clientID, and accepts a clientID only when it pairs with the current preClientID
  Advantages: same as above
  Disadvantages: wastes protocol message bandwidth

Scheme 3: clientID list
  How it works:
  1. The server stores multiple clientIDs per connection, forming a clientID list
  2. The list of clientIDs acts as a sliding window to guarantee message idempotence
  Advantages: reduces the message storms caused by weak-network retransmission
  Disadvantages:
  1. Complex to implement
  2. The gateway needs more memory to maintain connection state
  3. TCP itself already optimizes for weak networks at the transport layer, so maintaining a window at the application layer yields little benefit

Among the schemes above, the chosen one only needs to guarantee that the clientID increases monotonically; weak networks in particular are better handled by optimizing the transport-layer protocol (QUIC). Long connections are inherently ill-suited to weak networks anyway; packet loss and disconnection are transport-layer problems.

All of the above assumes the same sessionID, so every stored record must include the sessionID in addition to the clientID

5.2 Message forwarding + downlink messages

  • Allocate a seqID
  • Store the message asynchronously
  • Process the business logic
  • Forward the message to the other clients

Supplement:

We have been discussing the clientID; a seqID (assigned by the server) is also needed. A conversation may be a single chat or a group chat, and no individual client's ID can serve as the message ID for the whole conversation, or ordering conflicts would arise; the server must be authoritative.
The server assigns a globally increasing ID within the conversation (for example, msg1 and msg2 may come from two different terminals, so the server-assigned seq1 and seq2 must preserve the same relative order as on the client side, guaranteeing message ordering)

For example: msg1 carries cid1 and seq1; the next message is msg2 with cid2 and seq2, and so on (each message has a lifecycle, and when it expires the message is discarded to save bandwidth and storage). This guarantees ordering within a single session


How to store the seqID:
Store the seqID in Redis (INCRBY), with master-slave replication. If a single conversation's QPS is too high, this simple layout no longer works: a single Redis instance saturates at roughly 100k ops.
Handle it at the business level instead, guaranteeing order only within each conversation: concatenate msgID/sessionID/seqID as the key, with a 64-bit integer value, and shard the keys across different Redis instances by hash. Message uniqueness can be guaranteed with a timestamp (e.g. snowflake-style bit shifting).

How to make the seqID durable:
The server's seqID increases monotonically just like the client's, with one difference: if a client disconnects, its ID can restart from zero, but the server's ID is global to the conversation, and restarting from zero would make messages inconsistent.

Message durability must also survive faults (e.g. power loss): the sessionID's sequence must always move forward. Even with master-slave replication, when the master fails and a slave is elected master, replication lag can leave the sequence a beat behind, causing message inconsistency and ID rollback.

To solve this, combine Redis with Lua. Every time the key is fetched, a runID is also compared (the Lua script enforces the consistency check). If the compared runIDs really do differ (as in the master-slave failover above), a rollback can still happen; to cover that case, combine the seqID with a timestamp, making the ID trend-increasing. A sketch follows.
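
A sketch of the Redis + Lua idea just described, assuming the go-redis/v9 client; the key layout and runID handling are my own illustration of the mechanism, not the article's exact code. The script atomically increments the session's seqID and returns the instance runID stored beside it, so the caller can detect a possible failover and fall back to a timestamp-based trend-increasing ID.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allocSeq atomically bumps the session's seqID and returns the runID
// stored next to it; a changed runID means a restart or master-slave
// failover may have rolled the counter back.
var allocSeq = redis.NewScript(`
local seq = redis.call('INCR', KEYS[1])
local run = redis.call('GET', KEYS[2])
if not run then
  redis.call('SET', KEYS[2], ARGV[1])
  run = ARGV[1]
end
return {seq, run}
`)

func nextSeqID(ctx context.Context, rdb *redis.Client, sessionID, myRunID string) (int64, error) {
	// Hash tags keep both keys of one session in the same cluster slot.
	keys := []string{"seq:{" + sessionID + "}", "run:{" + sessionID + "}"}
	res, err := allocSeq.Run(ctx, rdb, keys, myRunID).Slice()
	if err != nil {
		return 0, err
	}
	seq := res[0].(int64)
	if run, _ := res[1].(string); run != myRunID {
		// Possible rollback after failover: degrade to a trend-increasing
		// ID by folding in a timestamp so the sequence never moves backward.
		seq = time.Now().UnixMilli()<<20 | (seq & 0xFFFFF)
	}
	return seq, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"})
	id, err := nextSeqID(context.Background(), rdb, "sessionA", "run-2024-01")
	fmt.Println(id, err)
}
```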

Trend-increasing IDs also have a flaw (solved as follows):
the client may receive messages 1 and 10. A jump like that is hard to classify as a real loss or a benign gap, so the client pulls from the server (a paged query over [1, 10]); if messages really are missing, they are back-filled (push-pull combination; such network churn rarely happens on normal networks and mostly under weak ones)

Organizing the text above into tables gives the following:

Reliable message forwarding:

  1. Forward the message to MQ for asynchronous storage, ensuring no message is lost
  2. The seqID does not need to be globally ordered; ordering within the current session is enough, which removes the single-point bottleneck

Scheme 1: What if the request fails or the process crashes before the seqID is allocated?
  How it works: return the ACK only after the seqID has been allocated
  Advantages: guarantees the seqID is valid
  Disadvantages:
  1. ACK replies are slow, so sending and receiving slow down
  2. The seqID becomes a bottleneck
  3. If message storage fails, the message is lost

Scheme 2: How to handle failures in server-side message storage, business logic, etc.?
  How it works:
  1. Return the ACK only after the message is stored; if the ACK fails, the client resends
  2. If the connection drops during business processing, the client can pull to fill in the missing messages, keeping them reliable
  3. If storage succeeds but the business logic fails, catch the exception and have the client pull historical messages
  Advantages:
  1. Guarantees business availability
  2. Failures are handled almost at the instant they occur
  Disadvantages:
  1. Uplink message latency increases
  2. Overall complexity is high
  3. Weak networks require protocol upgrade/downgrade

The corresponding downlink-reliability schemes, as a table:

Scheme 1: The client periodically polls with pull requests
  Advantages: simple to implement, reliable
  Disadvantages:
  1. High client power consumption (poor user experience)
  2. High message latency; not timely

Scheme 2: Strictly increasing seqID
  How it works:
  1. Allocate seqIDs in the order messages arrive at the server (server-generated, hence globally consistent); in particular, generate them with Redis INCRBY, keyed by sessionID or connID
  2. The server's seqID increases strictly and is returned to the client; the client achieves idempotence by checking preSeqID + 1 = seqID
  3. The server waits for the client's ACK and retransmits on timeout
  Advantages: guarantees strict monotonic increase to the greatest extent possible
  Disadvantages:
  1. Weak-network retransmission problems
  2. Redis single-point problem: strict monotonic increase cannot truly be guaranteed
  3. Must maintain timeout retransmission and timers
  4. Cannot handle delivery when the other client is offline

Scheme 3: Trend-increasing seqID
  How it works:
  1. A Lua script stores the maxSeqID; every fetch checks it for consistency, and a mismatch indicates a master-slave failover has occurred
  2. When the client detects an inconsistency, it issues a pull; if the pull returns nothing, it was merely a seqID jump. If the other terminal is offline, look up its status and only store the message without pushing
  Advantages:
  1. Continuous, and monotonically increasing at any moment
  2. Session-level seqIDs need no global distributed ID; Redis can scale horizontally with cluster mode
  3. Can recognize whether the user is online, saving network bandwidth
  Disadvantages: message storms in group-chat scenarios

Scheme 4: Push-pull combination + server-side batching
  Advantages: solves the message storm

Scheme 5: seqID linked list
  How it works: both server and client store the seqID and preSeqID; each message is checked against the previous one, and if the check fails the client pulls
  Advantages: removes the dependence on trend-increasing seqID generation
  Disadvantages: every record must additionally store a preSeqID

Point-to-point delivery no longer works for group chat; it can be handled with batching instead:

  • Put multiple msgs into one window (sort them within the window, then send them to the client), and let the client ack all of them at once
  • Compress the messages (reducing stutter)

If the batched message grows too large, it instead hurts TCP segmentation. Mix long and short connections, using short connections for group chat (as an optimization): the server pushes a notification over the long connection and lets the client actively pull via short-connection HTTP requests, reducing the server's load


The overall flow of plato is roughly as follows:

  1. After client A creates a connection, each message it sends is assigned a clientID (starting from 0 and incrementing)
  2. A message timer is started (cleared by an ack, retransmitting on timeout), and the message is sent to the server
  3. The server shards the session via Redis (INCRBY can be used) and writes asynchronously to MQ to guarantee reliable transmission
  4. After the message is processed, the server returns an ack to client A, which cancels its timer on receipt (this step can be asynchronous)
  5. A downlink message timer is started and the message is sent to client B, which validates it against the session's maxSeqID + 1
  6. Client B replies to the server with accept or reject, which determines whether the timer is closed
