In-depth explanation of mobile live broadcast technology (1)

2016 is known as the "first year of mobile live broadcasting": China's online live broadcast users exceeded 300 million, and Baidu, Tencent, Taobao, NetEase and others all run their own live broadcast platforms. This article shares some of the technical points behind mobile live broadcasting.

 

1. What is video

 

Structurally speaking, any video file is composed in this way (a minimal parsing sketch follows the list):

 

- The most basic content elements are images and audio;

- The images are compressed with a video encoding format (usually H.264);

- The audio is compressed with an audio encoding format (such as AAC);

- The corresponding meta information (Metadata) is attached;

- Finally, everything is packed into a container (such as MP4) to form a complete video file.
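
To make the container structure concrete, here is a minimal Python sketch that walks the top-level boxes (atoms) of an MP4 file, such as ftyp, moov (the metadata) and mdat (the encoded audio/video samples). The file name is a placeholder; any local MP4 file works.

```python
import struct

def list_top_level_boxes(path):
    """Walk the top-level boxes (atoms) of an ISO BMFF / MP4 file.

    Each box starts with a 4-byte big-endian size followed by a
    4-byte type, e.g. 'ftyp', 'moov' (metadata) or 'mdat' (the
    interleaved, encoded audio/video samples).
    """
    boxes = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, box_type = struct.unpack(">I4s", header)
            name = box_type.decode("ascii", "replace")
            if size == 1:                 # 64-bit "largesize" follows the type
                size = struct.unpack(">Q", f.read(8))[0]
                payload = size - 16
            elif size == 0:               # box extends to the end of the file
                boxes.append((name, "to end of file"))
                break
            else:
                payload = size - 8
            boxes.append((name, size))
            f.seek(payload, 1)            # skip the box body
    return boxes

if __name__ == "__main__":
    # "sample.mp4" is a placeholder path.
    for name, size in list_top_level_boxes("sample.mp4"):
        print(name, size)
```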

 

 2. What is live video

 

In short, live broadcasting is the process of stamping each frame of data (video / audio / data frame) with a timestamp and then streaming it out. The sending end continuously collects audio and video data, encodes it, packages it, and pushes the stream, which is then relayed and distributed through a distribution network. In this way, the "production, transmission, consumption" process of live broadcasting is realized.
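
The "stamp each frame with a timestamp" idea can be illustrated with a toy Python sketch; in a real pipeline the payload would be encoded H.264/AAC data rather than dummy bytes, and the frames would be handed to a packager instead of collected in a list.

```python
import time
from dataclasses import dataclass

@dataclass
class MediaFrame:
    kind: str           # "video" or "audio"
    timestamp_ms: int   # presentation time relative to the start of the stream
    payload: bytes      # encoded data in a real system; dummy bytes here

def capture_frames(duration_s=1.0, video_fps=30):
    """Toy capture loop: stamp each (fake) video frame with the elapsed
    time since the stream started, which is what the packager later
    writes into the container as the frame's timestamp."""
    start = time.monotonic()
    frames = []
    while time.monotonic() - start < duration_s:
        ts_ms = int((time.monotonic() - start) * 1000)
        frames.append(MediaFrame("video", ts_ms, b"\x00" * 10))
        time.sleep(1 / video_fps)
    return frames

if __name__ == "__main__":
    for f in capture_frames(0.2)[:5]:
        print(f.kind, f.timestamp_ms, "ms")
```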

 

3. GOP (Group of Pictures)

 

In order to store and transmit video content conveniently, its volume usually needs to be reduced; that is, the original content elements (images and audio) need to be compressed, and the compression algorithm is also referred to as the encoding format. For example, the original image data in a video is compressed with the H.264 encoding format, and the audio sample data is compressed with the AAC encoding format. When the content is played back, a corresponding decoding process is required. Therefore, between encoding and decoding, a convention that both the encoder and the decoder can understand must be agreed upon. As far as video image encoding and decoding are concerned, this convention is simple:

 

The encoder encodes multiple images into GOPs (Groups of Pictures) segment by segment, while the decoder reads GOP segments, decodes them, and then renders the pictures for display. A GOP is a group of consecutive pictures consisting of one I frame and several B/P frames. It is the basic access unit of the video encoder and decoder, and this arrangement repeats until the end of the image sequence. I frames are intra-coded frames (also called key frames), P frames are forward-predicted frames (forward reference frames), and B frames are bidirectionally interpolated frames (bidirectional reference frames). Simply put, an I frame is a complete picture, while P and B frames record changes relative to an I frame; without the I frame, P and B frames cannot be decoded. Therefore, the image data of a video is a series of GOPs, and a single GOP is a series of I/P/B frame images.
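
To make the GOP structure concrete, here is a small illustrative sketch (the frame sequence is made up) that groups a sequence of frame types into GOPs, each beginning with an I frame:

```python
def split_into_gops(frame_types):
    """Group a frame-type sequence into GOPs.

    Each GOP starts at an I frame (key frame); the P/B frames that
    follow can only be decoded with that I frame as their reference.
    """
    gops = []
    current = []
    for ft in frame_types:
        if ft == "I" and current:
            gops.append(current)
            current = []
        current.append(ft)
    if current:
        gops.append(current)
    return gops

if __name__ == "__main__":
    # A hypothetical stream layout: two GOPs of I + P/B frames.
    stream = list("IBBPBBPIBBPBBP")
    for i, gop in enumerate(split_into_gops(stream)):
        print(f"GOP {i}: {''.join(gop)} ({len(gop)} frames)")
```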

 

4. Push streaming, pull streaming, and the RTMP and HLS protocols

 

Push streaming refers to transmitting the content packaged in the collection stage to the server; pull streaming refers to pulling the live content from the server using a specified address.

 

RTMP

 

RTMP (Real Time Messaging Protocol) is a real-time messaging protocol developed by Adobe for transmitting audio, video, and data between the Flash/AIR platform and a server. RTMP is based on TCP and includes the basic protocol as well as variants such as RTMPT/RTMPS/RTMPE. Under RTMP, video must be encoded as H.264 and audio as AAC or MP3, and the streams are mostly packaged in the FLV format. RTMP is currently the most mainstream streaming media transmission protocol: it is well supported by CDNs and is relatively easy to implement, so it is the choice of most live broadcast platforms. However, RTMP has one major shortcoming: it is not supported by browsers, and Adobe no longer maintains it. Therefore, if a live service needs to support the browser, it needs another protocol for delivery.
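
As an illustration of pushing a stream over RTMP, here is a sketch that drives ffmpeg from Python. It assumes ffmpeg is installed, and both the input file and the RTMP ingest URL are placeholders you would replace with your own.

```python
import subprocess

# Placeholder values: the input file and the RTMP ingest URL depend on
# your own media and your streaming provider.
INPUT_FILE = "demo.mp4"
RTMP_URL = "rtmp://live.example.com/app/stream-key"

# -re        read the input at its native frame rate (simulates a live source)
# -c:v/-c:a  re-encode to H.264 + AAC, the codecs RTMP/FLV expect
# -f flv     package the stream as FLV, the container RTMP carries
cmd = [
    "ffmpeg", "-re", "-i", INPUT_FILE,
    "-c:v", "libx264", "-preset", "veryfast",
    "-c:a", "aac", "-b:a", "128k",
    "-f", "flv", RTMP_URL,
]
subprocess.run(cmd, check=True)
```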

 

HLS

 

HTTP Live Streaming (HLS) is an HTTP-based streaming network transmission protocol proposed by Apple, and is part of Apple's QuickTime X and iPhone software systems. It works by dividing the whole stream into small HTTP-based files to download, a few at a time; essentially, HLS implements live broadcasting by means of on-demand technology. Since the data is transmitted over HTTP, there is no need to worry about firewalls or proxies, and because the segmented files are short, the client can quickly select and switch bit rates to adapt to playback under different bandwidth conditions. However, this characteristic of HLS also means that its delay is generally higher than that of ordinary live streaming protocols. The content transmitted by HLS consists of two parts: the M3U8 description file and the TS media files. The video in a TS media file must be H.264-encoded and the audio must be AAC- or MP3-encoded.

 

Each .m3u8 file corresponds to several ts files; the ts files store the actual video data, while the m3u8 file only stores configuration information and the paths of the ts files. While the video is playing, the .m3u8 changes dynamically. Generally, to speed things up, the .m3u8 is placed on a web server and the ts files are placed on a CDN. The .m3u8 file is actually a UTF-8-encoded m3u file; it cannot be played by itself, but is a text file that stores playback information.
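
To show what such a playlist looks like and how little it contains, here is a small sketch that parses a made-up media playlist (the segment names and sequence number are invented) into (duration, uri) pairs:

```python
SAMPLE_M3U8 = """\
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:42
#EXTINF:10.0,
segment-42.ts
#EXTINF:10.0,
segment-43.ts
#EXTINF:10.0,
segment-44.ts
"""

def parse_m3u8(text):
    """Extract (duration, uri) pairs from a simple media playlist."""
    segments = []
    duration = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            # "#EXTINF:10.0," -> 10.0
            duration = float(line[len("#EXTINF:"):].split(",")[0])
        elif line and not line.startswith("#"):
            segments.append((duration, line))
            duration = None
    return segments

if __name__ == "__main__":
    for dur, uri in parse_m3u8(SAMPLE_M3U8):
        print(f"{uri}: {dur}s")
```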

 

HLS request process: the client makes an HTTP request for the m3u8 URL; the server returns the m3u8 playlist, which is updated in real time and generally lists about 5 segments at a time; the client parses the m3u8 playlist and then requests the URL of each segment in sequence to obtain the ts data stream.
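
The request process above can be sketched as a simple polling client. The playlist URL is a placeholder, and a real player would also buffer, decode and render the downloaded segments rather than just printing their sizes.

```python
import time
import urllib.parse
import urllib.request

PLAYLIST_URL = "https://cdn.example.com/live/stream.m3u8"  # placeholder

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def poll_live_playlist(playlist_url, rounds=3, interval_s=5):
    """Repeatedly re-fetch the playlist and download only the segments
    not seen yet, which is essentially what an HLS player does."""
    seen = set()
    for _ in range(rounds):
        playlist = fetch(playlist_url).decode("utf-8")
        for line in playlist.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            seg_url = urllib.parse.urljoin(playlist_url, line)
            if seg_url in seen:
                continue
            seen.add(seg_url)
            ts_data = fetch(seg_url)   # hand this to the demuxer/decoder
            print(f"downloaded {seg_url} ({len(ts_data)} bytes)")
        time.sleep(interval_s)         # wait for the playlist to roll forward

if __name__ == "__main__":
    poll_live_playlist(PLAYLIST_URL)
```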

 

HLS delay: assuming the playlist contains 5 ts files and each ts file contains 5 seconds of video, the overall delay is 25 seconds, because by the time users see these segments, the anchor has already recorded and uploaded them. Of course, the list length and the duration of a single ts file can be shortened to reduce the delay; in the extreme, the list length can be reduced to 1 and the ts duration to 1 second, but this increases the number of requests and the pressure on the server, and causes more buffering when the network is slow. Apple's official recommendation for the ts duration is 10 seconds, which results in a delay of about 30 seconds.
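
As a quick back-of-the-envelope check of these numbers, a sketch that ignores network and player buffering:

```python
def playlist_window_latency(segment_count, segment_duration_s):
    """Rough estimate: the viewer is roughly one full playlist window
    behind the anchor, ignoring network and player buffering."""
    return segment_count * segment_duration_s

print(playlist_window_latency(5, 5))   # 25 s, the example above
print(playlist_window_latency(3, 10))  # ~30 s with 10 s segments
```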

 

| Protocol | Principle | Delay | Advantages | Applicable scenarios |
| --- | --- | --- | --- | --- |
| RTMP | Long connection over TCP; each moment's data is sent out as soon as it is received | ~2 s | Low latency | Instant interaction |
| HLS | Short connections over HTTP; collects a period of data, generates ts segment files, and updates the m3u8 file | 10-30 s | Cross-platform | H5 live broadcasting |

 

 

 
