The past and present of ultra-low-latency live broadcast technology

Authors: Li Chenguang, Kuang Jianxin, Chen Jianping

Foreword:

According to the "Statistical Report on China's Internet Development" released by the China Internet Network Information Center (CNNIC), as of June 2022 China had 716 million live-streaming users, 68.1% of all internet users. A major driver was the 2020 pandemic: with far more people working from home and looking for entertainment, interactive live streaming became one of the most important leisure activities for Chinese netizens.

As the live-streaming industry chain keeps expanding and upgrading, the division of labor in each of its links has become clearer and the number of participants in each link has grown, creating new jobs to match the different demands. Live streaming is also empowering the upgrading and transformation of traditional industries, integrating and innovating with advanced technology to optimize their business models, such as livestream e-commerce and the shift of advertising to new media.

Rich content, including traditional culture, news, competitive sports, law and knowledge sharing, can be presented and disseminated far more efficiently through mobile interactive live streaming. High-quality live content can spread explosively, while users gain more opportunities to experience, learn from, and even actively participate in the broadcast, a win for both the content supply side and the demand side.

It can be said that ultra-low-latency live broadcast technology is embarking on a new development path. InfoQ has joined hands with the Volcano Engine video live-streaming team to launch the series "The Evolution of Ultra-Low Latency Live Broadcast Technology", which explores how ultra-low-latency live broadcast technology has evolved, the challenges and breakthroughs behind it, and its impact on the future of the live-streaming industry.

In today’s article, let’s talk about the past and present of ultra-low-latency live broadcast technology~

Upgrades to network infrastructure, iteration of audio and video transmission technology, and the open-sourcing of WebRTC have steadily driven down the latency of audio and video services, making ultra-low-latency live streaming a hot research direction. Real-time audio and video services are booming in the consumer internet and are gradually penetrating the industrial internet. After the industry's first wave of explosive growth, China's real-time audio and video industry has deepened its scenario penetration and entered a stage of rational growth.

The choice of latency target depends largely on how tightly users and content producers interact, and the scenarios vary widely.

(Figure: typical latency requirements across live-streaming interaction scenarios)

In the most interactive of these scenarios, user-side latency is expected to be as low as possible. A low-latency mode close to real-time communication maximizes the user's sense of participation, allows seamless interaction with content producers, and turns what users see into engagement. For example, at the key moments of a host show, such as PK battles, gift giving, guild rankings and reward campaigns, the big-spending players on each side want to observe in real time how their own host reacts once gifts hit the leaderboard, so that the back-office operations team or the follow-up campaign strategy gets immediate feedback.

The figure below summarizes the role of low-latency live-streaming technology from the technology, product and operations perspectives, and shows how technological change drives the whole positive ecological cycle through both external and internal factors.

(Figure: the role of low-latency live streaming from the technology/product/operations perspectives)

(1) Limitations of traditional standard live broadcast technology

1. The delay problem of the RTMP protocol

RTMP is the most traditional live-streaming protocol. The host pushes H.264/H.265 and AAC encoded video and audio via RTMP to a cloud vendor's CDN server, which remuxes and distributes it; end-to-end delay is generally kept within 3 to 7 seconds. The problem is that RTMP has scalability defects, and probing for lower delay runs into real technical difficulty: with RTMP, reducing delay means compressing the player's download buffer, which causes significant stuttering and an uncomfortable viewing experience once the delay drops below about 2 seconds.
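To make the 3-to-7-second figure concrete, here is a rough sketch of a typical RTMP end-to-end latency budget; every number is an illustrative assumption, not a measurement from this article.

    # Illustrative RTMP end-to-end latency budget; all values are assumptions.
    budget_ms = {
        "capture + encode": 100,            # encoder lookahead, B-frames
        "uplink (RTMP push over TCP)": 200,
        "CDN remux + edge distribution": 600,
        "player download buffer": 3000,     # the dominant term: absorbs jitter
        "decode + render": 100,
    }
    total_s = sum(budget_ms.values()) / 1000
    print(f"end-to-end latency ~= {total_s:.1f} s")  # ~4.0 s, inside the 3-7 s range
    # Cutting the player buffer below ~2 s is what makes RTMP stutter visibly,
    # since TCP retransmission bursts can no longer be absorbed.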

(Figure: latency of the traditional RTMP live-streaming pipeline)

2. The shortcomings of traditional live broadcast technology in real-time interactive scenarios

  • There is a significant gap between the video latency and the latency of on-screen comments (danmaku), so the chat interaction does not match the rhythm of the video being shown;

(Figure: mismatch between video latency and comment-interaction latency)
  • The form of interaction between audience and host is limited: content flows only one way and cannot become two-way (a problem that could not be substantially solved before the introduction of RTC technology).

  • The first manifestation of the one-way limitation: the stream delivered to the viewer cannot adapt to network conditions. Users can only receive the stream at a fixed bitrate, with no dynamic awareness. When network conditions change in real time (weak networks, mobile base-station handover, etc.), fixed-bitrate one-way transmission is very likely to cause frame loss, stuttering and other problems that hurt the viewing experience; conversely, when the network improves, fixed-bitrate transmission cannot raise the video bitrate to deliver the higher image quality the connection could support (a simple adaptive-bitrate sketch follows this list).

  • In interactive rooms where ordinary live streaming and co-hosting (mic-connect) coexist, a host pushing via traditional RTMP runs into switching problems in a mic-connect PK scenario, between the pushed stream, the local mic-connect stream and the server-side mic-connect stream; each switch causes a momentary freeze for the audience. An ultra-low-latency solution based on WebRTC handles this push-stream/mic-connect switching far more gracefully: only the server-side forwarding, i.e. the distribution logic of the subscribed stream channel, needs to change, without rerouting the pushed media stream itself.
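As a hedged illustration of what fixed-bitrate delivery cannot do, here is a minimal adaptive-bitrate sketch of the kind of dynamic adjustment described above; the bitrate ladder and safety margin are invented for the example.

    # Minimal adaptive-bitrate selection sketch (illustrative values only).
    BITRATE_LADDER_KBPS = [500, 1200, 2500, 4500]  # assumed renditions, low to high

    def pick_bitrate(throughput_kbps: float, headroom: float = 0.8) -> int:
        """Pick the highest rendition that fits within a safety margin of the
        measured downlink throughput; fall back to the lowest otherwise."""
        budget = throughput_kbps * headroom
        for rate in reversed(BITRATE_LADDER_KBPS):
            if rate <= budget:
                return rate
        return BITRATE_LADDER_KBPS[0]

    print(pick_bitrate(1000))  # 500: a weak network degrades smoothly, no stutter
    print(pick_bitrate(8000))  # 4500: a good network gets higher image quality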

3. The difference between ultra-low latency live broadcast and standard live broadcast

  • Ultra-low-latency live streaming is a new type of application that has emerged in recent years. Scenarios such as e-commerce live streaming and live event coverage are highly concurrent and latency-sensitive: the 3-20 s delay of traditional live streaming cannot meet their needs, yet their interactivity requirements fall short of typical real-time audio and video applications such as video conferencing, so latency does not have to drop below 400 ms. Ultra-low-latency live streaming therefore blends the technical architectures of traditional live streaming and real-time audio and video, borrowing from each to land the end-to-end delay between the two. Although vendors have not converged on a standard technical path, the work can be roughly summarized as transforming the ingest protocol, the network architecture and the playback protocol. In practice, vendors weigh factors such as cost and performance metrics when choosing among protocols and network architectures.

  • The fundamental difference lies in the transport-layer protocol (reliability optimizations built on top of UDP, providing the basis for weak-network countermeasures).

    • Traditional FLV/RTMP live streaming uses the TCP protocol (or QUIC). TCP is a reliable transport that sacrifices timeliness for data integrity: the three-way handshake before any data can flow adds significant delay, especially in weak networks. UDP, an unreliable transport, offers the best timeliness but guarantees neither arrival nor ordering of data. Real-time audio and video products (such as RTM ultra-low-latency live streaming) therefore typically use UDP, and add protocol-layer and algorithm-layer optimizations on top of it to restore the reliability and ordering logic that transmission needs.

  • Optimizations built on top of UDP:

    • In practice UDP almost always appears together with the RTP/RTCP protocols. RTP carries the data: the sequence number, payload type, timestamp and other fields in its header give the receiver a logical basis for grouping, reassembling and reordering packets. RTCP, the control protocol for RTP, feeds back statistics on RTP transmission quality and supplies the control parameters for weak-network countermeasures (see the header-parsing sketch below).

    • (Figure: the RTP/RTCP protocol stack on top of UDP)
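To ground the field names above, here is a minimal sketch that parses the fixed 12-byte RTP header (RFC 3550) and pulls out the sequence number and timestamp used for reordering; it assumes a raw packet already read from a UDP socket.

    import struct

    def parse_rtp_header(packet: bytes) -> dict:
        """Parse the fixed 12-byte RTP header (RFC 3550)."""
        if len(packet) < 12:
            raise ValueError("packet shorter than an RTP header")
        b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
        return {
            "version": b0 >> 6,                # always 2 for RTP
            "has_extension": bool(b0 & 0x10),  # RFC 5285 header extensions follow
            "payload_type": b1 & 0x7F,         # identifies the codec/stream
            "marker": bool(b1 & 0x80),         # often marks a frame's last packet
            "sequence_number": seq,            # basis for reordering / loss detection
            "timestamp": timestamp,            # basis for jitter measurement / sync
            "ssrc": ssrc,                      # identifies the media source
        }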

(2) Evolution of ultra-low latency live broadcast technology

  • The evolution of live broadcast technology, driven by the development of business scenarios (with delay as the main thread)

  • Evolution of the RTM protocol itself: RTM negotiates a set of private RTP header extensions through SDP extmap attributes, for example:

    • a=extmap:18 "http://www.webrtc.org/experiments/rtp-hdrext/decoding-timestamp"
      a=extmap:19 "uri:webrtc:rtc:rtp-hdrext:video:CompositionTime"
      a=extmap:21 "uri:webrtc:rtc:rtp-hdrext:video:frame-seq-range"
      a=extmap:22 "uri:webrtc:rtc:rtp-hdrext:video:frame-type"
      a=extmap:23 "uri:webrtc:rtc:rtp-hdrext:video:reference-frame-timestamp"
      a=extmap:27 "uri:webrtc:rtc:rtp-hdrext:audio:aac-config"
    • a=extmap:18 "http://www.webrtc.org/experiments/rtp-hdrext/decoding-timestamp"

    • a=extmap:19 "uri:webrtc:rtc:rtp-hdrext:video:CompositionTime"

    • a=extmap:21 uri:webrtc:rtc:rtp-hdrext:video:frame-seq-range

    • a=extmap:22 uri:webrtc:rtc:rtp-hdrext:video:frame-type

    • a=extmap:23 uri:webrtc:rtc:rtp-hdrext:video:reference-frame-timestamp

    • a=extmap:27 uri:webrtc:rtc:rtp-hdrext:audio:aac-config

    • RTP carries the DTS/CTS values in private RTP header extensions. Every RTP packet of a frame carries that frame's DTS in an RFC 5285 header extension, and the first RTP packet of each frame, as well as the VPS/SPS/PPS packets, carries the frame's CTS the same way; the current frame's timestamp is then computed as PTS = DTS + CTS. This drives both fast audio-video synchronization at startup and precise audio-video synchronization in the player's playback control logic (a parsing sketch follows this list).

    • The extension header carries the start/end sequence numbers of the frame: if the first packets of the first frame are lost, retransmission can be requested immediately based on the start sequence number, speeding up first-frame display; if the last packets of the current frame are lost, retransmission can be requested immediately based on the end sequence number, cutting delay and reducing stutter.

    • The extension header carries the frame type: when the correct frame type is carried and parsed, the client does not need to parse metadata itself; and under weak-network conditions the client can skip B-frames and decode P-frames directly, speeding up frame output and avoiding potential stutter.

    • The extension header carries the reference-frame information for P-frames: under weak-network conditions the client can skip the decoding of B-frames according to the reference relationships and the corresponding timestamps specified in the extension header, reducing stutter.

    • To speed up signaling, the CDN can, under certain conditions, directly return its supported audio and video capabilities to the client without first querying the media information; in that case the SDP media description contains no specific audio/video configuration details. On the audio side, the AnswerSDP then lacks the header information needed for AAC decoding, so the RTP extension header is used to carry AAC-Config, letting the client parse it from the RTP packets and handle decoding on its own. This reduces signaling interaction time and raises the stream-pull success rate.

    • MiniSDP signaling standard implementation (Douyin)

    • Asynchronous CDN back-to-origin for signaling

    • RTP extension header components
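As a hedged sketch of the DTS/CTS mechanism described above, the snippet below parses RFC 5285 one-byte header extensions from an RTP packet and computes PTS = DTS + CTS. The extension IDs (18 for decoding-timestamp, 19 for CompositionTime) follow the extmap lines quoted earlier, but the payload layout (big-endian 32-bit values) is an assumption made for illustration.

    import struct

    DTS_ID = 18  # a=extmap:18 ... decoding-timestamp
    CTS_ID = 19  # a=extmap:19 ... video:CompositionTime

    def parse_one_byte_extensions(ext_payload: bytes) -> dict:
        """Parse RFC 5285 'one-byte header' elements: each element is one byte
        (ID << 4 | (len - 1)) followed by `len` data bytes; a zero byte is
        padding and ID 15 is reserved."""
        elems, i = {}, 0
        while i < len(ext_payload):
            b = ext_payload[i]
            if b == 0:                 # padding byte
                i += 1
                continue
            ext_id, length = b >> 4, (b & 0x0F) + 1
            if ext_id == 15:           # reserved ID: stop parsing
                break
            elems[ext_id] = ext_payload[i + 1 : i + 1 + length]
            i += 1 + length
        return elems

    def compute_pts(ext_payload: bytes) -> int | None:
        """PTS = DTS + CTS, assuming (for illustration) both are carried as
        big-endian 32-bit values in their extension elements."""
        elems = parse_one_byte_extensions(ext_payload)
        if DTS_ID not in elems:
            return None
        dts = struct.unpack("!I", elems[DTS_ID][:4])[0]
        cts = struct.unpack("!I", elems[CTS_ID][:4])[0] if CTS_ID in elems else 0
        return dts + cts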

1. Porting the WebRTC protocol into the live-stream player

  • RTM low-latency live streaming is derived from WebRTC technology. Establishing point-to-point transmission on the WebRTC standard generally involves the following steps (a minimal sketch follows below):

    • The two communicating parties first perform media negotiation, exchanging session descriptions via SDP (Session Description Protocol);

    • They then negotiate network addresses interactively (discovering the peer's real IP address) to prepare the media transmission channel;

    • Once both steps are complete, point-to-point (peer-to-peer) media data transmission begins.

(Figure: WebRTC point-to-point connection establishment flow)
  • The client and server sides of the signaling are developed separately, using the standard SDP message format; the media transmission part uses the open-source WebRTC framework together with ByteDance's self-developed real-time audio and video media engine.
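A minimal sketch of the offer/answer step, using the aiortc Python library as a stand-in for the actual player (the real implementation builds on the C++ WebRTC framework; the signaling exchange is deliberately left abstract):

    import asyncio
    from aiortc import RTCPeerConnection

    async def negotiate():
        pc = RTCPeerConnection()
        # Pull-stream direction: the player only receives audio and video.
        pc.addTransceiver("video", direction="recvonly")
        pc.addTransceiver("audio", direction="recvonly")

        # Step 1: media negotiation - generate and apply the local SDP offer.
        offer = await pc.createOffer()
        await pc.setLocalDescription(offer)
        print(pc.localDescription.sdp[:200])  # peek at the generated SDP

        # Steps 2-3 (address negotiation via ICE, then media flow) start once
        # the remote AnswerSDP is applied; obtaining it from the signaling
        # server is not shown here.
        await pc.close()

    asyncio.run(negotiate())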

2. RTM signaling protocol upgrade (MiniSDP compression protocol)

https://github.com/zhzane/mini_sdp

(Figure: MiniSDP signaling flow)
  • A standard SDP is relatively long (about 5-10 KB), which hampers fast, efficient transmission and, in live-streaming scenarios, hurts first-frame time in particular. MiniSDP applies high-performance compression to the standard SDP text protocol, converting native SDP into a much smaller binary format that can be carried in a single UDP packet.

  • This cuts signaling interaction time, improves network transmission efficiency, reduces the render time of the live stream's first frame, and improves QoS metrics such as instant-open rate and stream-pull success rate (a rough compression illustration follows the table below).

Play protocol                        RTM-HTTP signaling   RTM-MiniSDP signaling   FLV
First-frame time (preview)           600 ms               510 ms                  350 ms
Stream-pull success rate (preview)   97.50%               98.00%                  98.70%
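The real MiniSDP format is a custom binary schema (see the GitHub link above); as a rough, hedged illustration of why shrinking the SDP lets the exchange fit one UDP datagram, the sketch below simply zlib-compresses an SDP-like string and compares sizes:

    import zlib

    # A fragment standing in for a full ~5-10 KB SDP (illustrative only).
    sdp = "\r\n".join([
        "v=0",
        "o=- 8711490186821615 2 IN IP4 127.0.0.1",
        "s=-", "t=0 0",
        "m=video 9 UDP/TLS/RTP/SAVPF 96",
        "a=rtpmap:96 H264/90000",
        'a=extmap:18 "http://www.webrtc.org/experiments/rtp-hdrext/decoding-timestamp"',
    ] * 40)  # repeat to approximate a real SDP's size

    compressed = zlib.compress(sdp.encode(), level=9)
    print(len(sdp), "->", len(compressed), "bytes")
    # A typical MTU is ~1500 bytes, so the compressed form fits in one UDP
    # packet, while the original would need fragmentation or a TCP exchange.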

3. Asynchronous CDN back-to-origin optimization for RTM signaling

  • Goal: reduce the RTM signaling interaction time and the render time of the RTM stream's first frame.

  • In the original flow, on a server cache miss the server must wait for the back-to-origin fetch to complete before returning the AnswerSDP with the AacConfig information; the client sends STUN only after receiving the AnswerSDP, and the server can start sending data only after receiving the STUN packet (left side of the figure below). With asynchronous back-to-origin, the server no longer waits for the origin fetch and returns the AnswerSDP immediately, after which the origin fetch and the WebRTC connection establishment proceed in parallel; as soon as the WebRTC connection is up and the origin data has arrived, RTP data is sent immediately (right side of the figure below; a toy model of the two flows follows the figure).

(Figure: left, synchronous back-to-origin; right, asynchronous back-to-origin)
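A toy asyncio sketch of the two flows in the figure; the function names and timings are hypothetical and only model the change in ordering, not the real CDN implementation.

    import asyncio

    async def fetch_from_origin():      # hypothetical: cache miss, pull from origin
        await asyncio.sleep(0.20)       # assumed 200 ms back-to-origin time

    async def establish_webrtc():       # hypothetical: AnswerSDP -> STUN -> ready
        await asyncio.sleep(0.15)       # assumed 150 ms connection setup

    async def original_flow():
        await fetch_from_origin()       # server waits for the origin first...
        await establish_webrtc()        # ...then the connection is built: ~350 ms

    async def async_back_to_origin_flow():
        # AnswerSDP is returned immediately; the origin fetch and the connection
        # setup overlap, and RTP starts as soon as both finish: ~200 ms.
        await asyncio.gather(fetch_from_origin(), establish_webrtc())

    asyncio.run(async_back_to_origin_flow())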

4. Optimization of video rendering stutter (stall time per 100 s of playback reduced by over 4 s on average)

  • To increase average viewing time per user, the RTC engine's frame-queuing/decoding strategy was changed: frame dropping is forbidden in the RTC low-latency mode, which improves live video rendering stutter.

Test group                                         Video rendering stall per 100 s (live-room scene)
RTM default JitterBuffer strategy                  8.3 s
RTM improved no-frame-drop JitterBuffer strategy   3.6 s
  • A traditional RTC scenario prioritizes latency, so the whole pipeline triggers various kinds of frame dropping (including but not limited to the decoding module and the network module), whereas FLV live streaming prioritizes the viewing experience (no frame drops, good audio-video synchronization). For RTM to reduce stutter and gain QoE benefits, the playback control strategy must be customized; the customization points are as follows (a sketch of the bottom-line checks follows the figure below):

    • To guarantee that the JitterBuffer is not blocked by time-consuming operations such as software decoding or the hardware decoder's dequeueInputBuffer and other API calls, the kernel layer keeps a layer of mandatory audio-video synchronization logic to protect the playback experience;

    • Meanwhile, the upper layer monitors the buffer lengths of the network module and the decoding module and applies the corresponding bottom-line (fallback) logic:

  1. If the hardware decoder is judged to be stuck (too many dec_cache_frames), an error is reported and playback is downgraded to software decoding;

  2. If the JitterBuffer is abnormal (too many cached frame_lists), the player's exception logic is triggered: an error is reported and the stream is pulled again.

(Figure: improved RTM playback control with bottom-line fallback logic)
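A minimal sketch of those two bottom-line checks; the threshold values and the Player fields are assumptions made up for the illustration, not values from the article.

    from dataclasses import dataclass

    MAX_DEC_CACHE_FRAMES = 30     # assumed cap on frames queued at the decoder
    MAX_JITTER_FRAME_LISTS = 200  # assumed cap on frame_lists in the JitterBuffer

    @dataclass
    class Player:                 # hypothetical stand-in for the playback kernel
        using_hw_decoder: bool
        dec_cache_frames: int
        jitter_frame_lists: int

        def watchdog(self) -> str:
            # Check 1: hardware decoder stalled -> report error, downgrade to
            # software decoding.
            if self.using_hw_decoder and self.dec_cache_frames > MAX_DEC_CACHE_FRAMES:
                self.using_hw_decoder = False
                return "error reported; downgraded to software decode"
            # Check 2: JitterBuffer abnormal -> report error, re-pull the stream.
            if self.jitter_frame_lists > MAX_JITTER_FRAME_LISTS:
                return "error reported; restarting stream pull"
            return "ok"

    print(Player(True, 45, 10).watchdog())   # decoder stalled case
    print(Player(False, 0, 300).watchdog())  # JitterBuffer overflow case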

5. Optimization of the RTM playback control logic

  • To improve the penetration of watching and broadcasting on mobile, a flaw inherent in the unified-RTC-kernel solution had to be addressed: the MediaCodec hardware decoder takes a long time to initialize. The RTM video decoding module was migrated out of the RTC kernel into the TTMP playback kernel, reusing the FLV path's video decoding module (avoiding MediaCodec reinitialization); this significantly reduces first-frame render time on Android and improves the stream-pull success rate.

  • General logic of the RTC kernel

(Figure: general logic of the RTC kernel)
  • Improved RTM kernel playback control logic

(Figure: improved RTM kernel playback control logic)

That is all for this "Evolution" installment of the ultra-low-latency live broadcast technology series. In the second, "Practice" installment, we will focus on how to deploy ultra-low-latency live streaming at scale. Please stay tuned!


Origin blog.csdn.net/ByteDanceTech/article/details/132200487