How Webcasting Works - Introduction to Streaming Media Protocols

Reference: Geek Time "Interesting Talk about Network Protocol"
This article is a record of how video is transmitted in live webcasting.

Streaming media refers to the technology and process of compressing a series of media data, sending the data over the Internet in segments, and transmitting video and audio in real time for viewing. This technique lets data packets flow like water; without it, the entire media file would have to be downloaded before it could be used. Streaming can carry live audio and video, or video pre-stored on a server. When viewers watch these audio and video files, the data is played by specific playback software as soon as it reaches the viewer's computer.

We know that live video is a series of pictures, and we describe the quality of a video through frames, the frame rate (FPS), pixels, and so on.

For example, an FPS of 30 means playing 30 pictures per second. Assume the resolution is 1024×768 and each pixel is composed of RGB, 8 bits per channel, for 24 bits in total.

Then one second of video is: 30 frames × 1024 × 768 × 24 = 566,231,040 bits = 70,778,880 bytes, which comes to roughly 4 GB per minute.
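
To make that arithmetic easy to reproduce, here is a small Python snippet (my own addition, not from the original article) that computes the same figures:

```python
# A quick sanity check of the raw (uncompressed) bandwidth figures above.
fps = 30                 # frames per second
width, height = 1024, 768
bits_per_pixel = 24      # 8 bits each for R, G, B

bits_per_second = fps * width * height * bits_per_pixel
bytes_per_second = bits_per_second // 8
bytes_per_minute = bytes_per_second * 60

print(f"{bits_per_second:,} bits/s")           # 566,231,040 bits/s
print(f"{bytes_per_second:,} bytes/s")         # 70,778,880 bytes/s
print(f"{bytes_per_minute / 1e9:.2f} GB/min")  # ~4.25 GB per minute
```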

Therefore, live video must go through encoding to reduce transmission cost. Encoding is a compression process.

Features exploited by video and picture compression

  1. Spatial redundancy: adjacent pixels of an image are strongly correlated. Neighbouring pixels usually change gradually rather than abruptly, so it is not necessary to store every pixel completely; some pixels can be stored and the ones in between computed by an algorithm.
  2. Temporal redundancy: adjacent images in a video sequence are similar in content. Consecutive pictures in a video do not change abruptly, so they can be predicted and inferred from pictures already available.
  3. Visual redundancy: the human visual system is not sensitive to certain details, so not every detail is noticed and some data can be allowed to be lost.
  4. Coding redundancy: different pixel values occur with different probabilities. High-probability values use fewer bits, low-probability values use more bits, similar to the idea of Huffman coding (a toy sketch follows this list).
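
As a toy illustration of coding redundancy only (my own sketch, not part of any real codec), the snippet below builds a Huffman code for a handful of pixel values: the most frequent value ends up with the shortest bit string.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build Huffman codes: frequent symbols get shorter bit strings."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, tie_breaker, {symbol: code_so_far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# A flat region of a picture: pixel value 200 dominates, so it gets the shortest code.
pixels = [200] * 90 + [180] * 6 + [90] * 3 + [15] * 1
print(huffman_codes(pixels))   # e.g. {200: '1', 180: '01', 90: '001', 15: '000'}
```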

The general encoding process of commonly used encoding algorithms:

[Figure: general encoding process]

After encoding, the frame-by-frame images become a string of binary data, which can be placed in a file and saved in a certain format.

Common file formats:

  • AVI, MPEG, RMVB, MP4, MOV, FLV, WebM, WMV, ASF, MKV.

Common encoding compression techniques:

  • H.261, H.262, H.263, H.264, H.265.

    These are standards developed by VCEG (Video Coding Experts Group) under the ITU (International Telecommunication Union).

  • MPEG-1, MPEG-2, MPEG-4, MPEG-7.

    These are standards developed by MPEG (Moving Picture Experts Group) under the ISO (International Organization for Standardization).

The general process of live video streaming

The general process can be divided into pushing the stream to the server, transcoding, distribution, pulling the stream on the client, and decoding:

  • The anchor pushes the stream, and the server receives it: a network protocol carries the encoded video stream from the anchor to the server, where a service speaking the same protocol receives the network packets and reconstructs the video stream inside them. On the anchor's side this process is called pushing the stream.

  • Server transcoding: after the server receives the video stream, it can process it further, for example by transcoding, that is, converting from one encoding format to another. Because viewers use very different clients, this is needed to ensure that all of them can watch the live broadcast.

  • Client-side pulling: once the streams have been processed, the server waits for viewers' clients to request them. The process of a viewer's client requesting the stream is called pulling the stream.

  • Video stream distribution: if a large number of viewers watch the same live video at the same time and all of them pull the stream from a single server, the pressure on that server is too great, so a video distribution network is needed to preload the video onto nearby edge nodes. Most viewers then pull the video from an edge node, which reduces the load on the origin server.

  • Client decoding: after the viewer's client pulls down the video stream, it must decode it, that is, run the reverse of the encoding process to turn the unintelligible binary back into frame after frame of vivid pictures, which the client then plays.

[Figure: overall live-streaming pipeline: push, transcode, distribute, pull, decode]

General flow of coding

The purpose of encoding is to convert the pictures into a binary stream.

As analyzed before, a video is a sequence of pictures. If every picture were stored completely it would consume a huge amount of space, so the video sequence is divided into three types of frames:

  • I frame, also known as a key frame. It contains the complete picture; only the data of this frame is needed to decode it.
  • P frame, a forward predictive coded frame. A P frame stores the difference between this frame and the previous key frame (or P frame); when decoding, the previously cached picture is superimposed with the difference defined by this frame to produce the final picture.
  • B frame, a bidirectional predictive interpolated coded frame. A B frame records the differences between the current frame and both the previous and the following frames. To decode a B frame, both the previously cached picture and the following decoded picture are needed; the final picture is obtained by superimposing the data of the surrounding pictures with the data of the current frame.

The I frame is the most complete, the B frame has the highest compression rate, and the compressed frames typically appear in a repeating pattern such as IBBP. This is encoding along the time axis.
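
The following toy sketch (my own illustration, not real codec logic) shows the decoding idea behind the three frame types: the I frame is stored completely, while P and B frames store only differences that are superimposed on already-decoded pictures.

```python
import numpy as np

# Toy "decoder": the I frame is a complete picture, the P and B frames only store differences.
i_frame = np.full((4, 4), 100, dtype=np.int16)   # key frame: saved completely
p_diff = np.zeros((4, 4), dtype=np.int16)        # P frame: mostly zeros...
p_diff[1, 2] = 5                                 # ...only the pixels that actually changed

decoded_p = i_frame + p_diff                     # decode P by superimposing its diff on the cached picture

next_ref = np.full((4, 4), 110, dtype=np.int16)  # a later reference picture
b_diff = np.zeros((4, 4), dtype=np.int16)        # B frame: diff against both neighbours
decoded_b = (decoded_p + next_ref) // 2 + b_diff # decode B using the pictures before and after it

print(decoded_p[1, 2], decoded_b[0, 0])          # 105 105
```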

[Figure: a frame divided into slices, macroblocks and sub-blocks]

As shown in the figure above, each frame can be divided into multiple slices, each slice into multiple macroblocks, and each macroblock into multiple sub-blocks. In this way a large image is decomposed into small blocks, which facilitates spatial encoding.

Although video is encoded along this time axis, the result is ultimately a binary stream.

This stream is structured: it is divided into network abstraction layer units (NALU, Network Abstraction Layer Unit). Because transmission on the network happens packet by packet, the stream is split into these individual units.

[Figure: NALU structure: start code, NALU header, payload]

As shown in the figure above, each NALU begins with a start code, which is used to separate NALUs from each other; then comes the NALU header, which mainly carries the NALU type; finally, the payload contains the data carried by the NALU.

  • NALU header: its main content is the type, the NAL Type.

    • 0x07 means SPS, the sequence parameter set, which contains information about the whole image sequence, such as the image size and video format.

    • 0x08 means PPS, the picture parameter set, which contains information shared by all the slices of a picture, such as the picture type and sequence number.

      Before transmitting the video stream, these two parameter sets must be transmitted first, otherwise the stream cannot be decoded. To ensure fault tolerance, both parameter sets are sent again before every I frame.

      If the type in the NALU header is SPS or PPS, the payload is the content of the actual parameter set.

      If the type is a frame, the payload is the actual video data, and each NALU stores one slice. For each slice, whether it belongs to an I frame, a P frame, or a B frame, the slice structure also contains its own header with a type, followed by the slice content.

This defines the whole format: a video is split into a sequence of frames, each frame is split into slices, each slice is placed in a NALU, and NALUs are separated by a special start code. In front of the first slice of every I frame, a NALU that carries the SPS and PPS is inserted, finally forming a long sequence of NALUs.
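
As a rough sketch of what that NALU sequence can look like on the wire, assuming the common Annex-B byte-stream format where NALUs are separated by 0x000001 / 0x00000001 start codes and the NAL type sits in the low 5 bits of the header byte, the snippet below splits a buffer at the start codes and reports each NALU's type (it ignores details such as emulation prevention bytes):

```python
# Minimal Annex-B NALU splitter (H.264). nal_unit_type lives in the low 5 bits
# of the first header byte: 7 = SPS, 8 = PPS, 5 = IDR (I) slice, 1 = non-IDR slice.
NAL_TYPES = {1: "non-IDR slice", 5: "IDR slice (I frame)", 7: "SPS", 8: "PPS"}

def split_nalus(stream: bytes):
    """Yield (nal_type, payload) for each NALU found between start codes."""
    positions = []
    i = 0
    while i < len(stream) - 2:
        if stream[i:i + 3] == b"\x00\x00\x01":
            positions.append(i + 3)          # NALU data begins right after the start code
            i += 3
        else:
            i += 1
    positions.append(len(stream) + 3)        # sentinel for the last NALU
    for start, nxt in zip(positions, positions[1:]):
        nalu = stream[start:nxt - 3].rstrip(b"\x00")  # drop the next start code's leading zeros
        if nalu:
            yield nalu[0] & 0x1F, nalu[1:]

# Example buffer: [SPS][PPS][IDR slice], as described above.
demo = b"\x00\x00\x00\x01\x67xx" + b"\x00\x00\x00\x01\x68yy" + b"\x00\x00\x01\x65zz"
for ntype, payload in split_nalus(demo):
    print(NAL_TYPES.get(ntype, ntype), payload)
```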

The flow of data stream transmission from the anchor to the server

After encoding, the binary stream of the video needs to be packaged into network packets before it can be sent, for example with the RTMP (Real Time Messaging Protocol) protocol (RTMP is used as the example here; of course, other protocols can also be used for sending, and the principles are similar). This step is the push-streaming process.

RTMP, short for Real Time Messaging Protocol, is a proprietary protocol developed by Adobe for transmitting audio and video data between Flash players and servers. It is a plaintext protocol that works over TCP and uses port 1935 by default. The basic data unit in the protocol is the Message, and a message is split into smaller Chunk units during transmission. The chunks are then transmitted over TCP, and the receiving end reassembles the received chunks to restore the streaming media data.

RTMP is based on TCP, so the two parties must first establish a TCP connection. On top of the TCP connection, an RTMP connection must also be established; in a program this means calling the Connect function of an RTMP class library to create the connection explicitly.

Establishing the connection involves an RTMP handshake, during which version numbers and timestamps are negotiated.

  • If the version numbers of the client and the server are inconsistent, they cannot communicate.
  • When the video is played and data flows between the two sides, it must be timestamped so that both sides know when each piece of data is needed.

During the handshake, six messages need to be sent: the client sends C0, C1, C2, and the server sends S0, S1, S2.

C0 and C1 represent the version number and timestamp of the client.

S0 and S1 represent the version number and timestamp of the server.

C2 and S2 represent the client's confirmation of the server's timestamp and the server's confirmation of the client's timestamp.

  • First, the client sends C0 to announce its own version number and, without waiting for the other side's reply, sends C1 with its own timestamp.
  • Only after the server receives C0 can it return S0 with its own version number. If the versions do not match, the connection can be dropped.
  • After sending S0, the server immediately sends its own timestamp in S1 without waiting for anything. When the client receives S1, it sends C2 to acknowledge the server's timestamp; likewise, when the server receives C1, it sends S2 to acknowledge the client's timestamp (a minimal handshake sketch follows this list).
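
Below is a minimal client-side sketch of this handshake, assuming the byte layout from the RTMP specification (C0/S0 are a single version byte, C1/S1 are 1536 bytes of timestamp, zeros and random data, C2/S2 echo the peer's packet). The host name is a placeholder, and C2 is simplified to a plain echo of S1.

```python
import os
import socket
import struct
import time

def read_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        piece = sock.recv(n - len(buf))
        if not piece:
            raise ConnectionError("socket closed during handshake")
        buf += piece
    return buf

def rtmp_handshake(host: str, port: int = 1935) -> socket.socket:
    """Client side of the RTMP handshake (C0/C1/C2 vs. S0/S1/S2)."""
    sock = socket.create_connection((host, port))

    c0 = b"\x03"                                          # C0: version number 3
    c1 = struct.pack(">II", int(time.time()) & 0xFFFFFFFF, 0) + os.urandom(1528)
    sock.sendall(c0 + c1)                                 # send C0 and C1 without waiting

    s0 = read_exact(sock, 1)                              # S0: server's version number
    if s0 != b"\x03":
        raise ConnectionError(f"unsupported RTMP version: {s0!r}")
    s1 = read_exact(sock, 1536)                           # S1: server's timestamp + random bytes

    sock.sendall(s1)                                      # C2: acknowledge the server's timestamp (echo S1)
    read_exact(sock, 1536)                                # S2: the server echoes our C1
    return sock

# sock = rtmp_handshake("live.example.com")              # hypothetical ingest server
```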

[Figure: RTMP handshake: C0/C1/C2 and S0/S1/S2]

After the handshake, the two sides exchange some control information, such as the Chunk size and the window size.

When actually transferring data, a Stream still has to be created, and the stream is then pushed through this Stream via publish.

Pushing the stream means sending the NALUs inside Messages, which are also called RTMP Packets. The format of a Message is as follows:

[Figure: RTMP Message (RTMP Packet) format]

When sending, the NALU start code is removed, because it is only delimiter metadata. The SPS and PPS parameter sets are encapsulated into an RTMP packet and sent first, and then the NALUs are sent one by one.

When sending and receiving data, RTMP does not use the Message as the unit; instead, a Message is split into Chunks for sending, and the next Chunk can only be sent after the previous one has been sent. Each Chunk carries a Message ID indicating which Message it belongs to, and the receiving end assembles the Chunks back into Messages according to this ID.

The Chunk size that was set during connection setup refers to this Chunk. Splitting large messages into small chunks before sending reduces network congestion under low bandwidth.
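
Here is a minimal sketch of that idea, under the simplifying assumption that we only look at payload splitting and reassembly by message ID (real RTMP chunks also carry chunk headers with chunk stream IDs, timestamps and type information):

```python
def split_into_chunks(message_payload: bytes, chunk_size: int = 128):
    """Split one RTMP Message payload into pieces no larger than chunk_size."""
    return [message_payload[i:i + chunk_size]
            for i in range(0, len(message_payload), chunk_size)]

def reassemble(chunks_with_ids):
    """Receiver side: group chunk payloads by message ID and concatenate them."""
    messages = {}
    for msg_id, piece in chunks_with_ids:
        messages[msg_id] = messages.get(msg_id, b"") + piece
    return messages

pieces = split_into_chunks(b"\x00" * 300, chunk_size=128)
print([len(p) for p in pieces])                                 # [128, 128, 44]
print(reassemble((42, p) for p in pieces)[42] == b"\x00" * 300) # True
```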

Packaged as RTMP Packets, the data continuously arrives at the streaming media server. The general process is as follows:

[Figure: pushing RTMP packets to the streaming media server]

At this point, to prevent a large number of users from overloading the server, and to avoid the slow downloads that would result if every user fetched from the same place, the streaming media server also needs to distribute the video stream through a distribution network.

The distribution network is divided into two layers, the center and the edge:

  • Edge layer servers are deployed across the country and across major operators, very close to users;

  • The central layer is a streaming media service cluster responsible for content forwarding. An intelligent load balancing system selects, according to the user's geographic location, the nearest edge server to provide push/pull streaming services to the user (a toy sketch follows this list).

    The central layer is also responsible for transcoding services, for example, converting the code stream of the RTMP protocol into an HLS code stream.
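
As a toy illustration of the scheduling idea only (not any real CDN's logic; the node list and coordinates are made up), the sketch below picks the edge node closest to the viewer:

```python
import math

# Hypothetical edge nodes with (latitude, longitude); a real system would also
# weigh the operator, current load and node health, not just distance.
EDGE_NODES = {
    "beijing":   (39.9, 116.4),
    "shanghai":  (31.2, 121.5),
    "guangzhou": (23.1, 113.3),
}

def nearest_edge(viewer_lat: float, viewer_lon: float) -> str:
    """Return the name of the edge node closest to the viewer's coordinates."""
    def dist(node):
        lat, lon = EDGE_NODES[node]
        return math.hypot(lat - viewer_lat, lon - viewer_lon)
    return min(EDGE_NODES, key=dist)

print(nearest_edge(30.6, 114.3))   # a viewer near Wuhan -> "shanghai"
```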

[Figure: distribution network with a central layer and an edge layer]

The client pulls the stream

If the client wants to watch the live broadcast, it also needs to pull the stream from the server through the RTMP protocol.

[Figure: the client pulls and decodes the stream]

The client first reads the H.264 decoding parameters, such as the SPS and PPS, then decodes the received NALUs frame by frame and hands them to the player for playback, and a colorful video image appears.
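
As a hedged sketch of what this client-side pull-and-decode step might look like in code, here is an example using the third-party PyAV library (FFmpeg bindings); the URL is a placeholder, and FFmpeg must be built with RTMP support for it to work:

```python
import av  # PyAV, FFmpeg bindings for Python (pip install av)

# Hypothetical pull URL; the server address and stream key are placeholders.
container = av.open("rtmp://live.example.com/app/stream_key")

for frame in container.decode(video=0):
    # Each decoded frame is a complete picture again; a real player would render it
    # and keep audio and video in sync. Here we only convert it to an RGB array.
    rgb = frame.to_ndarray(format="rgb24")
    print(frame.pts, rgb.shape)
    break  # decode a single frame just to show the idea
```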

Note: the RTMP protocol is based on TCP. Because TCP must guarantee reliable delivery, it may introduce high latency. In practice, other UDP-based protocols may be used to implement webcasting (such as ARTC provided by Alibaba Cloud); the RTMP protocol is used here only as a case study.
