IM: four important characteristics

Real-time

The central question for real-time delivery is: once a message has been sent, how does the system make sure it is noticed and delivered to the recipient as quickly as possible, while consuming as few resources as possible? Two points matter here: the fastest possible delivery, and the lowest possible resource consumption.

Let's look at the representative stages that IM architectures have gone through in pursuit of real-time message delivery.

Short polling

Short polling grew out of the question-and-answer request/response model, so it has low migration cost and is easy to put into production. But its disadvantages are also obvious:

1. To improve real-time behavior, the polling frequency is usually fairly high, yet most polling requests return nothing useful, so the client wastes both battery and data traffic.

2. High-frequency requests put considerable pressure on server resources: first, the servers must carry a large volume of high-frequency polling QPS; second, the back-end storage comes under greater pressure.

Short polling is therefore generally used for small applications with few users, where the team does not want to spend much on migrating the service.

Long polling

The biggest improvement of long polling over short polling is this: in short polling, the server responds immediately to every poll, regardless of whether a new message has been produced. In long polling, if a request finds no new message, the server does not return right away; it suspends (hangs) the request and waits for a period of time. If a new message is produced during that time, the server responds with it immediately.
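The difference between the two polling modes can be sketched in a few lines. This is only an illustration under stated assumptions: `LongPollMailbox` and its method names are hypothetical, and an in-memory `queue.Queue` stands in for the message store and for the suspended HTTP request.

```python
import queue
import threading


class LongPollMailbox:
    """Hypothetical per-user mailbox; the queue stands in for the message store."""

    def __init__(self):
        self._messages = queue.Queue()

    def publish(self, message):
        # Called when a new message for this user is produced.
        self._messages.put(message)

    def short_poll(self):
        # Short polling: answer immediately, even when there is nothing new.
        try:
            return [self._messages.get_nowait()]
        except queue.Empty:
            return []

    def long_poll(self, timeout_seconds):
        # Long polling: hang the request until a message arrives
        # or the timeout expires, whichever comes first.
        try:
            return [self._messages.get(timeout=timeout_seconds)]
        except queue.Empty:
            return []  # timed out; the client is expected to poll again


if __name__ == "__main__":
    mailbox = LongPollMailbox()
    # Simulate a message that is produced shortly after the poll starts.
    threading.Timer(0.1, mailbox.publish, args=["hello"]).start()
    print(mailbox.short_poll())    # nothing has been published yet
    print(mailbox.long_poll(2.0))  # returns as soon as the message is published
```

Because the waiting request blocks on the queue here, it does not repeatedly query a store; a server that suspends the request but still polls storage in a loop behaves differently.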

Compared with short polling, long polling significantly reduces the network overhead and power consumption caused by the client's useless high-frequency polling, and it also lowers the QPS of requests the server has to handle. In that sense it is the more advanced of the two models.

Long polling is commonly used where real-time requirements are fairly high but the overall number of users is not too large. It is also widely used on browser clients that do not support WebSocket.

But the following problems remain:

1. Suspending requests on the server only reduces the QPS of incoming requests; it does not reduce the polling pressure on back-end resources. If 1000 requests are waiting for messages, that may mean 1000 threads are continuously polling the message store.

2. If a long poll obtains no message within the timeout, the request still returns when the timeout expires, so the problem of useless client requests is not completely solved.

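Server push: truly edge-triggered

With the arrival of HTML5, full-duplex WebSocket thoroughly solved the server-push problem.

Compared with short polling and long polling, an IM service built on WebSocket needs only a single handshake between client and server to create a persistent long-lived connection over which data can flow in both directions at any time. When the server receives a new message, it can push it directly over the established WebSocket connection. This is truly edge-triggered, and it also ensures that messages arrive in real time.

The advantages of WebSocket are:

1. It supports bidirectional communication with server push, greatly reducing polling pressure on the server.

2. The control overhead of data exchange is low, reducing network overhead for both sides.

3. It is natively supported on the web and relatively simple to implement.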
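The edge-triggered idea can be illustrated without any network code. The sketch below is an assumption-laden stand-in, not a WebSocket implementation: `PushGateway` is a hypothetical name, and a plain callable plays the role of the `send` side of an established long-lived connection.

```python
class PushGateway:
    """Hypothetical registry of live connections, keyed by user ID."""

    def __init__(self):
        # user_id -> callable that writes to that user's open connection
        self._connections = {}

    def connect(self, user_id, send):
        # Called once, after the handshake has created the long-lived connection.
        self._connections[user_id] = send

    def disconnect(self, user_id):
        self._connections.pop(user_id, None)

    def on_new_message(self, user_id, message):
        # Edge-triggered: work happens only when a message actually arrives.
        send = self._connections.get(user_id)
        if send is None:
            return False  # user is not connected; nothing is pushed
        send(message)
        return True


if __name__ == "__main__":
    gateway = PushGateway()
    gateway.connect("B", lambda message: print("pushed to B:", message))
    gateway.on_new_message("B", "hello")
```

Nothing in this sketch loops or polls: the only work done is in response to `on_new_message`, which is what distinguishes push from the two polling models.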

IM protocols derived from TCP long-lived connections

XMPP, MQTT, or a private protocol of your own built on TCP or UDP.

 

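Reliability

In what situations can a message be lost?

Referring to the sequence diagram above, sending a message breaks down into roughly two parts:

1. User A sends a message to the IM server; the server temporarily stores the message and then returns a success result to sender A (steps 1, 2, 3).

2. The IM server then pushes the temporarily stored message from user A to the recipient, user B (step 4).

Messages can be lost in the following scenarios:

In the first part, steps 1, 2, and 3 can each fail.

Because sending a message is a request/response exchange for user A, user A is told that the send failed in any of these cases: the message fails to reach the IM server because of a network outage or a similar cause; the IM server receives the message but fails to store it; or user A waits for the timeout period without the IM server ever returning a result.

User A can then compensate by retrying, among other means. Note that this can lead to duplicate messages being sent.

For example, the client receives no response within the timeout and retries, but the request may in fact have been processed successfully on the server and only the response was slow. The server therefore needs deduplication logic. Typically the sender attaches a unique ID that stays the same across retries of the same message, so that the server can deduplicate.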
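A minimal sketch of server-side deduplication, assuming the client attaches an ID that it reuses on every retry of the same message. The names `MessageStore`, `handle_send`, and `client_msg_id` are hypothetical, and an in-memory dict stands in for whatever storage a real IM server would use.

```python
class MessageStore:
    """Hypothetical server-side store that deduplicates retried sends."""

    def __init__(self):
        self._by_client_id = {}  # (sender, client_msg_id) -> server message ID
        self._log = []           # stored messages, in arrival order

    def handle_send(self, sender, client_msg_id, body):
        """Store the message once; return (server_id, was_duplicate)."""
        key = (sender, client_msg_id)
        if key in self._by_client_id:
            # A retry of a message that was already stored: acknowledge it
            # again with the same server ID, but do not store it twice.
            return self._by_client_id[key], True
        self._log.append((sender, body))
        server_id = len(self._log)
        self._by_client_id[key] = server_id
        return server_id, False

    def stored_count(self):
        return len(self._log)


if __name__ == "__main__":
    store = MessageStore()
    print(store.handle_send("A", "c-1", "hello"))  # first attempt is stored
    print(store.handle_send("A", "c-1", "hello"))  # retry is recognized
    print(store.stored_count())
```

Keying on the pair of sender and client ID, rather than the client ID alone, is a design choice of this sketch so that two senders cannot collide.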

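In the second part, after the message has been stored on the IM server, the server tells user A that the send succeeded and then pushes the message to user B's online devices.

If the server loses power during the preparation for the push, or after the message has been written to the kernel buffer, the message cannot be delivered to user B. In this case the IM server holding the connection may no longer be functioning at all, so the lost message has to be recovered by remedial measures after the fact, which are described in detail later.

Even if the message reaches user B's device over the TCP connection, it can still be lost if something goes wrong while the device processes it. For example, an exception occurs while user B's device is writing the message to its local DB, so the message is never stored. Here the delivery succeeded at the network level, yet user B cannot see the message, which makes this case harder to handle.

Solutions:

1. For the first part, timeout retransmission by client A combined with deduplication on the IM server basically solves the problem.

2. For the second part, the industry generally takes TCP's ACK mechanism as a reference and implements an ACK protocol at the business layer.

Solving message loss: the business-layer ACK mechanism

The implementation is shown in the figure below:

When pushing a message, the IM server attaches an identifier called the SID (a security identifier, similar to TCP's sequence number). After pushing the message, it adds the message to a list of messages awaiting ACK. Once client B has fully received the message, it sends the IM server a business-layer ACK packet carrying the SID of the received message. On receiving the ACK, the IM server removes that message from the awaiting-ACK list, and only then is the push truly complete.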
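The bookkeeping described above might look roughly like the following sketch. `AckTracker` and its methods are hypothetical names, SIDs are assumed to be simple incrementing integers, and `send` is a stand-in for writing to user B's connection.

```python
class AckTracker:
    """Hypothetical awaiting-ACK list kept by the IM server for one connection."""

    def __init__(self):
        self._next_sid = 1
        self._pending = {}  # sid -> message still waiting for a business-layer ACK

    def push(self, message, send):
        sid = self._next_sid
        self._next_sid += 1
        # Recorded before sending (a choice made in this sketch) so that
        # a very fast ACK can never arrive for an SID the server does not know.
        self._pending[sid] = message
        send(sid, message)
        return sid

    def on_ack(self, sid):
        # The push is only truly complete once the ACK removes the entry.
        if sid in self._pending:
            del self._pending[sid]
            return True
        return False  # unknown or already acknowledged SID

    def pending_sids(self):
        return sorted(self._pending)


if __name__ == "__main__":
    tracker = AckTracker()
    sid = tracker.push("hello", lambda s, m: print("push", s, m))
    print(tracker.pending_sids())  # the message is awaiting ACK
    tracker.on_ack(sid)
    print(tracker.pending_sids())  # the list is empty again
```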

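Message retransmission in the ACK mechanism

What if the message is lost while being pushed to user B? For example:

1. B's network is actually unreachable, but the IM server has not yet detected this.

2. User B's device crashes before it has finished reading the data from the kernel buffer.

3. The message is dropped by some intermediate device along the network path, and TCP-layer retransmission keeps failing.

The common strategy here also borrows from TCP's retransmission mechanism. Similarly, the IM server's awaiting-ACK queue generally maintains a timeout timer; if no ACK packet from user B arrives within a certain time, the server takes the message from the ACK queue again and re-pushes it.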
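One way to sketch the timeout-driven re-push is shown below. This is an illustration only: `RetransmitQueue` is a hypothetical name, the clock is injectable purely so the behavior can be exercised without sleeping, and a retry limit (which a real server would likely want) is left out.

```python
import time


class RetransmitQueue:
    """Hypothetical awaiting-ACK queue with a per-message timeout."""

    def __init__(self, ack_timeout, clock=time.monotonic):
        self._ack_timeout = ack_timeout
        self._clock = clock
        self._pending = {}  # sid -> (message, deadline)

    def track(self, sid, message):
        # Called right after a message has been pushed.
        self._pending[sid] = (message, self._clock() + self._ack_timeout)

    def on_ack(self, sid):
        self._pending.pop(sid, None)

    def resend_expired(self, send):
        """Re-push every message whose ACK deadline has passed; return their SIDs."""
        now = self._clock()
        resent = []
        for sid, (message, deadline) in list(self._pending.items()):
            if deadline <= now:
                send(sid, message)  # same SID, so the receiver can deduplicate
                self._pending[sid] = (message, now + self._ack_timeout)
                resent.append(sid)
        return resent


if __name__ == "__main__":
    fake_now = [0.0]
    q = RetransmitQueue(ack_timeout=5.0, clock=lambda: fake_now[0])
    q.track(1, "msg1")
    fake_now[0] = 5.0  # no ACK has arrived by the deadline
    q.resend_expired(lambda sid, msg: print("re-push", sid, msg))
```

In a running server, `resend_expired` would be called periodically by a timer or event loop; that scheduling is outside this sketch.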

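The problem of duplicate pushes

Server retransmission caused by a lost ACK packet can make the recipient receive the same message more than once.

The usual solution: the server attaches a sequence ID to each pushed message. The sequence ID must be unique within the current connection session, and it stays the same when a message is re-pushed. The recipient uses this unique sequence ID to deduplicate at the business layer, so that after deduplication user B still sees the message only once and the user experience is unaffected.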
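On the receiving side, deduplication by sequence ID can be as small as the sketch below. `Receiver` and `on_push` are hypothetical names, and the set of seen IDs is assumed to live for one connection session, matching the scope in which the sequence ID is unique.

```python
class Receiver:
    """Hypothetical client-side handler for one connection session."""

    def __init__(self):
        self._seen = set()   # sequence IDs already processed in this session
        self.displayed = []  # what the user actually sees

    def on_push(self, sequence_id, body, send_ack):
        if sequence_id not in self._seen:
            self._seen.add(sequence_id)
            self.displayed.append(body)
        # ACK duplicates too: a re-push usually means the earlier ACK was lost,
        # and the server keeps retransmitting until it receives one.
        send_ack(sequence_id)


if __name__ == "__main__":
    receiver = Receiver()
    receiver.on_push(7, "hello", lambda sid: print("ack", sid))
    receiver.on_push(7, "hello", lambda sid: print("ack", sid))  # re-push
    print(receiver.displayed)  # the message appears only once
```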

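Remedial measure: message integrity check

Suppose an IM server goes down because of a hardware failure right after pushing a message. If that message really was lost, the IM server responsible for it is down and cannot trigger a retransmission, so recipient B never gets the message.

The problem is that with the server machine down, retransmission is no longer an option.

If, however, the server is able to run an integrity check when user B comes back online and detect that user B has missed messages, it can resynchronize or repair the missing data.

A common way to implement the message integrity check is timestamp comparison:

1. The IM server pushes msg1 to recipient B along with the latest timestamp, timestamp1. After receiving msg1, recipient B updates its local latest-message timestamp to timestamp1.

2. The IM server pushes a second message, msg2, along with the current latest timestamp, timestamp2. During the push, the connection between recipient B and the IM server breaks for some reason, so msg2 is not delivered to recipient B.

3. User B reconnects, sending its local latest timestamp, timestamp1. The IM server returns to user B all of user B's temporarily stored messages whose timestamp is greater than timestamp1, which includes the previously undelivered msg2.

4. After receiving msg2, user B updates its local latest-message timestamp to timestamp2.

Note that timestamps can be affected by clock skew between machines, so they may be slightly off and the data retrieved may not be exact. In practice, a global auto-incrementing sequence can be used as a version number instead.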
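The four steps above can be sketched with an auto-incrementing version number in place of the timestamp, as the closing note suggests. Everything named here (`MessageLog`, `Client`, `since`, `reconnect`) is hypothetical, and the server's temporary store is assumed to be a single in-memory log for one recipient.

```python
class MessageLog:
    """Hypothetical server-side store of messages temporarily kept for user B."""

    def __init__(self):
        self._messages = []  # (version, body) pairs in ascending version order

    def append(self, body):
        version = len(self._messages) + 1  # auto-incrementing version number
        self._messages.append((version, body))
        return version

    def since(self, last_version):
        # Integrity check: everything newer than what the client reports having.
        return [(v, b) for v, b in self._messages if v > last_version]


class Client:
    """Hypothetical recipient that remembers the newest version it has received."""

    def __init__(self):
        self.last_version = 0
        self.inbox = []

    def receive(self, version, body):
        if version > self.last_version:
            self.inbox.append(body)
            self.last_version = version

    def reconnect(self, log):
        # Report the local version and apply whatever the server sends back.
        for version, body in log.since(self.last_version):
            self.receive(version, body)


if __name__ == "__main__":
    log, b = MessageLog(), Client()
    b.receive(log.append("msg1"), "msg1")  # step 1: msg1 is delivered
    log.append("msg2")                     # step 2: msg2 is stored but never arrives
    b.reconnect(log)                       # steps 3 and 4: B resyncs on reconnect
    print(b.inbox, b.last_version)
```

The sketch only covers recovery at reconnect time; detecting a gap while the connection stays up is outside its scope.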

 


Origin blog.csdn.net/qq_28119741/article/details/103847683