Real-time audio and video technology with a delay of less than 500 milliseconds, developed for instant messaging

First, let's define real-time video: a video service is real-time when people cannot perceive any delay along the whole path from image generation to consumption. Any video service that meets this requirement can be called real-time video.

The real-time nature of video can be divided into three levels:

    Pseudo real-time: consumption delay exceeds 3 seconds; viewing is one-way, watch-only. The typical architecture is CDN + RTMP + HLS, and essentially all conventional live broadcasts use this class of technology.
    Quasi real-time: consumption delay of 1 to 3 seconds; two-way interaction is possible but noticeably hampered. Some live-streaming sites implement this with TCP/UDP + FLV; YY live streaming belongs to this class.
    True real-time: consumption delay under 1 second, around 500 milliseconds on average. This is genuine real-time technology: people conversing through it feel no obvious delay. QQ, WeChat, Skype, and WebRTC all implement this class of technology.


Most real-time video solutions on the market transmit at 480P or below, which is hard to work with in online education and online teaching, and even then smoothness is often a serious problem.

To make video feel real-time, you must shorten the delay, and to shorten the delay you first need to know where it comes from. Delay is introduced at every link in the chain, from video capture and encoding through transmission to final playback and consumption.

Imaging delay: software can do little here, since it is determined by the CCD hardware. The best CCDs on the market today capture 50 frames per second, which still means an imaging delay of about 20 milliseconds; ordinary CCDs run at only 20 to 25 frames per second, for an imaging delay of 40 to 50 milliseconds.

Encoding delay: this depends on the encoder itself and, as discussed below, leaves relatively little room for optimization.

Our design therefore focuses on network delay and playback-buffer delay. Before going into the full technical details, let's review the essentials of video encoding and network transmission.

We know that the image format captured from a CCD is generally RGB (BMP), which takes a very large amount of storage: three bytes describe the color value of each pixel, so a single 1080P image occupies 1920 x 1080 x 3 ≈ 6 MB. Even converted to JPG, each frame is still close to 200 KB; at 12 frames per second, JPG alone would need nearly 2.4 MB/s of bandwidth, which is unacceptable for transmission over the public network.
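To make the arithmetic concrete, here is a minimal Python sketch; the numbers are the ones assumed in the paragraph above:

    # Back-of-the-envelope bandwidth check: raw RGB vs. JPEG at 1080P.
    WIDTH, HEIGHT = 1920, 1080
    BYTES_PER_PIXEL = 3              # RGB: one byte per channel
    FPS = 12
    JPEG_FRAME_KB = 200              # rough per-frame size after JPEG compression

    raw_frame_mb = WIDTH * HEIGHT * BYTES_PER_PIXEL / (1024 * 1024)
    jpeg_mb_per_s = JPEG_FRAME_KB * FPS / 1024

    print(f"Raw RGB frame:   {raw_frame_mb:.1f} MB")       # ~5.9 MB
    print(f"JPEG @ {FPS} fps: {jpeg_mb_per_s:.1f} MB/s")   # ~2.3 MB/s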

The video encoder exists to solve exactly this problem. It performs motion detection on the differences between successive images and, through various compression techniques, sends only the changes to the other side. After H.264 encoding, 1080P video needs only around 200 to 300 KB/s of bandwidth. Our technical solution uses H.264 as the default encoder (we are also studying H.265).

As mentioned above, the video encoder compresses selectively according to the changes between successive images. Since the receiving end has no image at all at the beginning, the encoder must start with one full compression of the picture; in H.264 this full frame is the I frame. Subsequent images are compressed incrementally against it, and these incrementally compressed frames are called P frames. To cope with packet loss and reduce bandwidth, H.264 also introduces the bidirectionally predicted B frame, which uses a preceding I or P frame and a following P frame as its reference frames. Finally, to keep the loss of a P frame in the middle of the stream from corrupting the picture indefinitely, H.264 introduces grouped-sequence (GOP) coding: a full I frame is sent at regular intervals, and everything between one I frame and the next forms one GOP.
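The following fragment only illustrates the frame-type pattern inside one GOP; the GOP size and B-frame count are arbitrary example values, not H.264 requirements:

    # Illustrative GOP (group of pictures) layout: one intra-coded I frame,
    # then P frames coded incrementally against earlier frames, with B frames
    # that reference frames on both sides.
    def gop_pattern(gop_size: int, b_frames: int) -> list:
        frames = ["I"]
        while len(frames) < gop_size:
            frames.extend(["B"] * min(b_frames, gop_size - len(frames) - 1))
            frames.append("P")
        return frames[:gop_size]

    print(" ".join(gop_pattern(gop_size=12, b_frames=2)))
    # I B B P B B P B B P B P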

As mentioned earlier, if a P frame inside a GOP is lost, the image at the decoding end becomes corrupted, and this corruption shows up as a mosaic. Because the motion information in the middle has been lost, H.264 fills in from the previous reference frame when decoding, but the filled-in data does not reflect the real motion, so the colors come out wrong. This is the so-called mosaic phenomenon.

This is not something we want users to see. To avoid it, the usual policy is: if a P frame or I frame is lost, none of the remaining frames in that GOP are displayed; the picture refreshes only when the next I frame arrives. But I frames only come once per GOP cycle, which can be a fairly long period, and if nothing is displayed until the next I frame arrives, the video appears still: the so-called freeze phenomenon. Losing too many consecutive frames, so that the decoder has nothing left to decode, also causes severe stuttering. In short, both freezing and mosaics at the decoding end are caused by frame loss, so the best strategy is to lose as few frames as possible.
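A minimal sketch of the display-gating policy just described, assuming a hypothetical Frame type with a sequence number and a frame type (this is an illustration, not a real decoder API):

    from dataclasses import dataclass

    @dataclass
    class Frame:
        seq: int          # stream sequence number
        frame_type: str   # "I" or "P"

    class GopGate:
        """Hide every frame after a loss until the next I frame arrives."""
        def __init__(self):
            self.expected_seq = 0
            self.waiting_for_i = False

        def on_frame(self, frame: Frame) -> bool:
            """Return True if this frame should be decoded and displayed."""
            if frame.seq != self.expected_seq:   # gap => at least one frame lost
                self.waiting_for_i = True
            self.expected_seq = frame.seq + 1
            if frame.frame_type == "I":          # new GOP: safe to resume display
                self.waiting_for_i = False
            return not self.waiting_for_i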

Once you understand the principles of H.264 and its GOP-based coding, the so-called instant-open (秒开) technique is fairly simple: the sender transmits the cached frames of the latest GOP, beginning with its I frame, to a newly joined receiver, which can then decode a complete image and display it immediately. This does send some extra frame data at the start of the video connection, which would add playback delay, so the receiver should decode, but not display, the frames whose time has already passed, and only start rendering once the current video frame falls within the playback time window.
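A sketch of the sender-side GOP cache this implies; all names here are hypothetical, and a real implementation would live inside the media server:

    class GopCache:
        """Keep the current GOP, I frame first, for late-joining viewers."""
        def __init__(self):
            self.frames = []

        def on_encoded_frame(self, frame):
            if frame.frame_type == "I":   # a new GOP starts: drop the old one
                self.frames = []
            self.frames.append(frame)

        def snapshot_for_new_viewer(self):
            # Everything from the latest I frame up to now, oldest first.
            return list(self.frames)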

Among the four delays discussed earlier, we mentioned encoding delay: the time the H.264 encoder takes to turn the CCD's RGB data into frame data. We tested the latency of the latest x264 at each resolution on an ordinary client machine with an 8-core CPU. At 1080P the encoded bit rate reaches about 300 KB/s, a single I frame reaches about 80 KB, and a single P frame can reach 30 KB, which poses a serious challenge for real-time network transmission.
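An 80 KB I frame does not fit in one UDP datagram, so each frame must be split across many packets. A rough count, assuming a typical ~1200-byte payload per packet (an assumed value chosen to stay under the path MTU, not a figure from our tests):

    import math

    PAYLOAD = 1200   # bytes of video payload per UDP packet (assumption)
    for name, size_kb in [("I frame", 80), ("P frame", 30)]:
        packets = math.ceil(size_kb * 1024 / PAYLOAD)
        print(f"{name}: {size_kb} KB -> {packets} packets")
    # I frame: 80 KB -> 69 packets
    # P frame: 30 KB -> 26 packets

Losing any one of those packets loses the whole frame, which is why the transmission layer matters so much.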

A key part of real-time interactive video is network transmission technology. Whether in early VoIP or in today's popular live video streaming, the primary means of communication is the TCP/IP protocol suite. But an IP network is inherently an unreliable transport, and video carried over it is prone to freezes and delays. Let's look at several key factors that affect transmission quality on an IP network.

Based on these characteristics of video coding and network transmission, we designed a transmission model for the real-time delivery of 1080P ultra-clear video. The model consists of a codec object whose bit rate adapts automatically to network conditions, a network sending module, a network receiving module, and a reliable-delivery protocol model built on UDP.
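A very small sketch of the automatic bit rate idea: scale the encoder's target down when loss reports come in and probe back up when the network is clean. The thresholds and step sizes are illustrative guesses, not values from our model:

    class BitrateController:
        def __init__(self, kbps: int, floor: int = 300, ceiling: int = 2400):
            self.kbps = kbps
            self.floor, self.ceiling = floor, ceiling

        def on_report(self, loss_rate: float) -> int:
            """Feed a periodic loss report; returns the new encoder target."""
            if loss_rate > 0.10:                     # heavy loss: back off hard
                self.kbps = max(self.floor, int(self.kbps * 0.6))
            elif loss_rate > 0.02:                   # mild loss: back off gently
                self.kbps = max(self.floor, int(self.kbps * 0.9))
            else:                                    # clean network: probe upward
                self.kbps = min(self.ceiling, self.kbps + 50)
            return self.kbps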

No matter how good a model is, it must be validated with a sound measurement method, all the more so in a delay-sensitive field like multimedia transmission. In the laboratory we generally use netem to simulate the various conditions of the public network; once results in the simulated environment are satisfactory, we organize testers to run trials on the real public network. Here is how we test the entire transmission model.
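One way a test script can drive netem (Linux, requires root); the tc/netem invocations below are standard usage, while the device name and impairment values are example choices for a lab run, not our actual test matrix:

    import subprocess

    DEV = "eth0"  # adjust to the interface under test

    def set_netem(delay_ms: int, jitter_ms: int, loss_pct: float) -> None:
        # Apply delay with jitter plus random packet loss to outgoing traffic.
        subprocess.run(
            ["tc", "qdisc", "replace", "dev", DEV, "root", "netem",
             "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
             "loss", f"{loss_pct}%"],
            check=True,
        )

    def clear_netem() -> None:
        subprocess.run(["tc", "qdisc", "del", "dev", DEV, "root"], check=True)

    # Example: simulate a mediocre public-network path.
    # set_netem(delay_ms=80, jitter_ms=30, loss_pct=2.0)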


Source: blog.csdn.net/weikeyuncn/article/details/128219390