Three factors that help explain why live-streaming latency is so high

The development of communication technology has driven the rise of video-on-demand and live-broadcast services, and the advance of 4G and 5G networks has made streaming media technology more and more important. However, network technology alone cannot solve the high latency of live streaming. This article will not discuss the impact of the network on the live-broadcast business; instead, it analyzes a common phenomenon in live broadcasts: the clearly perceptible delay between the anchor and the audience. Beyond the deliberately delayed broadcasts that some businesses require, what factors make the latency of live video so high?

When the audience interacts with the anchor through bullet comments (barrage), it can take 5 seconds or even longer from the moment we send a comment to the moment we see the anchor react. Although the time at which the anchor sees a comment is not very different from the time at which the audience sends it, it takes a long time for the live-broadcast system to deliver the anchor's audio and video data to the client or browser. This time for data to travel from the anchor to the audience is generally called the end-to-end audio and video delay.


Live streaming involves a very long pipeline, from audio and video capture and encoding to decoding and playback, which requires the anchor, the streaming media server, and the audience. These three parties provide different functions:

  • Anchor side: audio and video capture, audio and video encoding, and stream pushing;
  • Streaming media server: live stream ingestion, audio and video transcoding, and live stream distribution;
  • Audience side: stream pulling, audio and video decoding, and audio and video playback;

In this lengthy capture-and-distribution pipeline, different techniques are used at different stages to guarantee the quality of the live broadcast. These methods for ensuring reliability and reducing system bandwidth together cause the high latency of live broadcasts. This article analyzes why the end-to-end delay of live streaming is high from the following three aspects:

  • The encoding format used for audio and video determines that the client can only start decoding from a specific frame;
  • The size of the network protocol slice used for audio and video transmission determines the interval at which the client receives data;
  • The server and the client keep caches to guarantee user experience and live-broadcast quality;

Data encoding

Live video must use audio and video coding technology. The current mainstream coding methods are Advanced Audio Coding (AAC) and Advanced Video Coding (AVC); AVC is more often called H.264. This section does not discuss audio codecs; instead, let us analyze in detail why H.264 encoding is needed and how it affects live-broadcast delay. Suppose we want to watch a 2-hour 1080p, 60 FPS movie. If each pixel requires 2 bytes of storage, the whole movie would occupy the following amount of space:

2 × 3600 s × 60 FPS × 1920 × 1080 pixels × 2 bytes/pixel ≈ 1668.54 GB

However, in reality a movie occupies only a few hundred MB or a few GB of disk space, several orders of magnitude less than the result we just calculated. Audio and video encoding exists precisely to compress this data and reduce the disk space and network bandwidth it consumes.
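
As a quick sanity check, here is a minimal Python sketch of the same calculation; the 2 bytes per pixel is the same assumption made above:

# Back-of-the-envelope calculation of the raw (uncompressed) size of a
# 2-hour 1080p, 60 FPS movie, assuming 2 bytes per pixel as in the text.
duration_s = 2 * 3600          # 2 hours
width, height = 1920, 1080     # 1080p resolution
fps = 60                       # frames per second
bytes_per_pixel = 2            # assumption from the text

raw_bytes = duration_s * fps * width * height * bytes_per_pixel
print(f"raw size: {raw_bytes / 1024**3:.2f} GB")  # ~1668.54 GB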

H.264 is the industry standard for video compression. Because a video is a sequence of pictures with strong continuity between adjacent pictures, H.264 uses key frames (Intra-coded pictures, I frames) as the full data of the video, and then uses forward reference frames (Predicted pictures, P frames) and bidirectional reference frames (Bidirectional predicted pictures, B frames) to incrementally modify that full data, achieving compression.

H.264 uses I frames, P frames, and B frames to compress the video data into a sequence of pictures, and these three types of video frames play different roles:

  • I frame: a complete image, comparable to a standalone JPG or BMP picture;
  • P frame: compressed by referencing data from the previous video frame;
  • B frame: compressed by referencing data from both the previous and the following video frames;

The compressed video data is a series of continuous video frames. When decoding it, the client first looks for the first key frame in the data and then applies the incremental modifications on top of it. If the first video frame the client receives is a key frame, it can start playing immediately; if the client misses the key frame, it has to wait for the next key frame before it can start playing.
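
A minimal sketch of this behavior, assuming a simplified view of the stream as a plain list of frame types: a client that joins mid-stream has to discard everything until the first I frame, because P and B frames only describe changes relative to other frames.

# Not a real decoder, just an illustration of "wait for the first key frame".
def first_playable_index(frames):
    for i, frame_type in enumerate(frames):
        if frame_type == "I":
            return i          # decoding (and playback) can start here
    return None               # no key frame received yet, keep waiting

# Joining just after a key frame: every frame before the next "I" is dropped.
stream = ["P", "B", "B", "P", "B", "B", "I", "P", "B", "B"]
print(first_playable_index(stream))  # -> 6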

A group of pictures (GOP) specifies how video frames are organized, and an encoded video stream consists of consecutive GOPs. Because each GOP starts with a key frame, the GOP size affects the delay on the playback side; it is also closely related to the network bandwidth the video occupies. Under normal circumstances, the GOP of a mobile live broadcast is set to 1 ~ 4 seconds, although a longer GOP can be used to reduce the bandwidth consumed.

The GOP determines the interval between key frames and therefore how long the client needs to find the first playable key frame, which in turn affects the delay of live streaming. This seconds-level delay has a very noticeable impact on the live-video business, so the GOP setting is a trade-off between video quality, bandwidth, and delay.
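
A rough back-of-the-envelope sketch of this trade-off, assuming the player joins the stream at a random moment and has to wait for the next key frame (rather than being served a cached GOP by the server):

# Extra start-up delay contributed by the GOP alone, under the assumption
# that the join point is uniformly random within a GOP.
def keyframe_wait(gop_seconds):
    worst = gop_seconds          # joined right after a key frame
    average = gop_seconds / 2    # uniformly random join point
    return average, worst

for gop in (1, 2, 4):
    avg, worst = keyframe_wait(gop)
    print(f"GOP {gop}s -> average wait {avg:.1f}s, worst case {worst:.1f}s")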

Data transmission

Different application-layer protocols can be used to transmit audio and video data. The two most common are the Real-Time Messaging Protocol (RTMP) and HTTP Live Streaming (HLS), and they deliver audio and video in different ways: roughly speaking, RTMP distributes data as an audio/video stream, while HLS distributes audio and video data as files.

RTMP is an application-layer protocol built on TCP. It splits the audio and video streams into segments for transmission; by default an audio segment is 64 bytes and a video segment is 128 bytes. When RTMP is used, all data is transmitted in chunks:

Each RTMP chunk carries a 1- to 18-byte protocol header consisting of three parts: the Basic Header, the Message Header, and the Extended Timestamp. Apart from the Basic Header, which contains the chunk stream ID and type, the other two parts can be omitted, so once the connection enters the steady transmission phase an RTMP chunk needs only a 1-byte header, which means extremely low extra overhead.
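
Based only on the numbers quoted above (1- to 18-byte headers, 64-byte audio chunks and 128-byte video chunks), an illustrative estimate of the per-chunk overhead might look like this:

# Illustrative overhead ratio per RTMP chunk; the chunk and header sizes
# are simply the defaults and bounds mentioned in the text.
def overhead_ratio(chunk_size, header_size):
    return header_size / (header_size + chunk_size)

for name, chunk_size in (("audio", 64), ("video", 128)):
    best = overhead_ratio(chunk_size, 1)    # steady state: 1-byte header
    worst = overhead_ratio(chunk_size, 18)  # full header with extended timestamp
    print(f"{name}: {best:.1%} ~ {worst:.1%} overhead per chunk")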

HLS is a bitrate-adaptive streaming protocol built on HTTP that Apple released in 2009. When a player is given an HLS playback (pull) address, it first fetches an m3u8 file like the one below from that address:

#EXTM3U
#EXT-X-TARGETDURATION:10

#EXTINF:9.009,
http://media.example.com/first.ts
#EXTINF:9.009,
http://media.example.com/second.ts
#EXTINF:3.003,
http://media.example.com/third.ts

m3u8 is a playlist file format for multimedia. The file contains a list of video stream segments, and the player plays each segment in order according to the description in the file. The HLS protocol splits the live stream into small files and uses the m3u8 file to organize these clips; when the player plays the live stream, it plays the split ts files in sequence as described by the m3u8.
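
A minimal sketch of how a player might read the playlist shown above; the parsing here is deliberately simplified and ignores most m3u8 tags:

# Collect the segment URLs and their #EXTINF durations, then play them in order.
def parse_m3u8(text):
    segments, duration = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            duration = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#"):
            segments.append((line, duration))
    return segments

playlist = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:9.009,
http://media.example.com/first.ts
#EXTINF:9.009,
http://media.example.com/second.ts
#EXTINF:3.003,
http://media.example.com/third.ts"""

for url, seconds in parse_m3u8(playlist):
    print(f"{seconds:>6.3f}s  {url}")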

The size of the ts segments produced by HLS affects the end-to-end live-broadcast delay. Apple's official documentation recommends 6-second ts segments, which means the delay from the anchor to the audience increases by at least 6 seconds. Using shorter segments is not infeasible, but it brings considerable extra overhead and storage pressure.

Although all application-layer protocols are limited by the MTU of the physical devices and can only transmit audio and video data in segments, the granularity with which different protocols segment the data determines the end-to-end network delay. Stream-based protocols such as RTMP and HTTP-FLV use very small slices and keep the delay under about 3 seconds, so they can be regarded as real-time transmission protocols; HLS is a file-based protocol with very large slices, and in practice it can introduce a delay of 20 ~ 30 seconds.
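
As a hedged back-of-the-envelope estimate: if a player buffers a few complete segments before starting playback (three is a commonly cited rule of thumb, though the exact number depends on the player), the HLS latency scales directly with the segment duration:

# Rough estimate only: the latency is at least the segment still being
# produced plus the segments the player buffers before starting playback.
def hls_latency(segment_seconds, buffered_segments=3):
    producing = segment_seconds                     # segment still being written
    buffering = segment_seconds * buffered_segments # player-side start-up buffer
    return producing + buffering

for seg in (2, 6, 10):
    print(f"{seg}s segments -> roughly {hls_latency(seg)}s or more of latency")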

It should be noted that file-based distribution does not automatically mean high latency; the slice size is the key factor that determines the delay. Keeping slices small while limiting the extra overhead is exactly what a real-time streaming protocol has to balance.

Multi-end caching

The pipeline of a live-video architecture is often very long, and we cannot guarantee the stability of every link in it. To provide smooth data transmission and a smooth user experience, both the server and the client add caches to cope with stuttering in the live audio and video.

The server usually caches part of the live data before transmitting it to the client; when the network suddenly jitters, the server can use the cached data to keep the live stream smooth, and it refills the cache once the network recovers. The client likewise uses a read-ahead buffer to improve the quality of the broadcast. We can shrink these buffers to improve real-time performance, but when the network is jittery this seriously hurts the client-side user experience.
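
The numbers below are placeholders rather than measurements, but they sketch how the individual stages and buffers discussed in this article add up to the end-to-end delay:

# Purely illustrative delay budget; every value is an assumed placeholder.
delay_budget_s = {
    "capture + encode": 0.1,
    "wait for key frame (GOP)": 1.0,
    "uplink + server ingest": 0.3,
    "server-side cache": 1.0,
    "CDN / distribution": 0.3,
    "player read-ahead buffer": 2.0,
    "decode + render": 0.1,
}
print(f"estimated end-to-end delay: {sum(delay_budget_s.values()):.1f}s")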

Summary

The high latency of live streaming is a systemic engineering problem. Compared with one-to-one real-time communication such as a WeChat video call, the link between the producer and the consumer of the video stream is extremely long, and many factors along the way affect the delay between the anchor and the audience. Because of bandwidth costs, historical inertia, and network uncertainty, we can only address each problem with a different technique, and what gets sacrificed is often the user experience:

  1. The total amount of audio and video data is too large - audio and video encoding uses key frames plus incremental modifications to compress the data, and the key-frame interval (GOP) determines the maximum time a client has to wait before it can render the first picture;
  2. Browsers have limited support for real-time streaming protocols - HLS is used to distribute live slices over HTTP, which can introduce a delay of 20 to 30 seconds between the anchor and the audience;
  3. The long link introduces uncertainty - servers and clients use caches to cope with the impact of network jitter on live-broadcast quality;

All of the factors above affect the end-to-end delay of a live-broadcast system. An ordinary live system using RTMP or HTTP-FLV can achieve a delay of less than 3 seconds, but the GOP and the caches at each end push this figure up, so a delay within 10 seconds is normal. Finally, here are some relatively open questions that interested readers can think through:

  • How much additional overhead will a file-based streaming protocol bring?
  • What are the compression rates of different video encoding formats?

