New live streaming architecture: fully supporting Taobao's Double 11 live broadcasts

Taobao Live has driven substantial sales growth for three consecutive years. Since 2020, people from more than 100 professions have moved into Taobao live rooms; whether established hosts or merchants, large numbers of newcomers have been drawn in by this new opportunity. Handling the peak live-streaming demand of Double 11 undoubtedly poses higher technical requirements and challenges for Taobao Live. At the same time, e-commerce live streaming is strongly interactive: the faster a host can respond to viewer comments, the more effectively purchases are driven.

A/B tests have verified that lowering latency has a positive effect on e-commerce live GMV. However, it is difficult to reduce latency with conventional live protocols such as HLS, FLV, and RTMP, and conventional live CDNs are no longer suitable for lower-latency broadcasts; the entire technical stack needs an upgrade. The industry has several approaches to reducing live-streaming latency:

Scheme comparison

| Scheme | Effect | Versatility | Implementation difficulty |
| --- | --- | --- | --- |
| Private protocol | Best | Poor | Difficult |
| TCP stack optimization | Fair | Best | Ranges from shallow to deep |
| QUIC | Better | Good | Moderate |
| SRT | Better | Lower | Moderate |
| WebRTC | Good | Good | Difficult |

The latency of LHLS, CMAF, and even LL-HLS solutions is basically above 2 seconds, so we do not compare them here. All things considered, WebRTC best matches our needs. The Tao Technology Department and Alibaba Cloud jointly built a low-latency multimedia transmission network based on WebRTC.


Low-latency transmission network combining communication and live broadcast


Latency has always been a major problem in live streaming, and many teams are working on reducing it. Low-latency transmission is a holistic problem: it must be approached end to end, requiring not only careful design but also close cooperation among the client, the server, and the data systems. Without upgrading the underlying transport protocol, there will always be a latency ceiling.

RTCP protocol header

The traditional RTMP, HLS, and HTTP-FLV protocols are built on TCP, a reliable transport: under a weak network it must wait for missing data to arrive before continuing. But for audio and video, some data can safely be lost, for example frames that nothing else references. In addition, TCP's congestion control lives in the kernel: when congestion occurs the sliding window is halved outright, collapsing throughput, and the application layer has no control over it, making it inflexible and imprecise for audio/video scenarios. TCP's retransmission and acknowledgment aggregation also slow down data confirmation and increase latency.
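The window-halving behavior can be illustrated with a tiny simulation (a toy sketch with made-up numbers, not TCP's actual kernel code):

```python
def simulate_aimd(rtt_count, loss_rtts, cwnd=10):
    """Toy AIMD model of kernel TCP congestion control:
    grow the window by 1 segment per RTT, halve it on any loss."""
    history = []
    for rtt in range(rtt_count):
        if rtt in loss_rtts:
            cwnd = max(1, cwnd // 2)  # multiplicative decrease: throughput craters
        else:
            cwnd += 1                 # additive increase: slow recovery
        history.append(cwnd)
    return history

# One loss at RTT 5 wipes out five RTTs of growth in a single step.
window = simulate_aimd(10, {5})
```

A single loss event drops the window from 15 to 7 segments, which is exactly the throughput collapse the application layer cannot prevent.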

Because of these TCP characteristics, traditional players keep a buffer of more than 5 s to absorb network jitter. This is also why real-time audio/video communication is almost never built on TCP, but on UDP.

Once live-streaming latency is reduced, its transmission has much in common with real-time communication, so the two can be designed together. WebRTC over UDP offers semi-reliable transport and mature technology, well suited to audio/video. For signaling we adopted an RTCP APP private protocol, staying compatible with the RTCP standard, and audio/video transmission shares a single socket connection. The connection-setup protocol is streamlined so that media data can flow after as little as 1 RTT, enabling fast startup.

GRTN transmission network

The design concept of GRTN is to transmit live multimedia data with communication technology. Since the underlying techniques are shared, it is natural for one network to support both real-time audio/video communication and live streaming.

With a unified architecture there is only one codebase and one operations system, reducing maintenance cost. Multiple services share the network, and their bandwidth peaks occur at different times, so the combined peak is lower than the sum of the individual peaks. Operators usually bill bandwidth by peak or by 95th-percentile peak, so staggering service peaks directly lowers cost. In the future, video-on-demand and file distribution are also planned to run on this network, amplifying the peak-shifting and cost-reduction effect.
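The billing effect of peak staggering can be sketched with the nearest-rank 95th percentile (the sample numbers below are hypothetical, purely to show the arithmetic):

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile, the common bandwidth-billing metric."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

# Hypothetical bandwidth samples (Gbps) for two services whose peaks
# fall at different times of day.
live  = [10, 10, 80, 10, 10, 10, 10, 10, 10, 10]
comms = [10, 10, 10, 10, 10, 10, 80, 10, 10, 10]

billed_separately = p95(live) + p95(comms)                 # each pays its own peak
billed_merged = p95([a + b for a, b in zip(live, comms)])  # one shared network
```

Because the peaks do not coincide, the merged network is billed far below the sum of the individual peaks.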

On the CDN's downstream edge nodes, protocol conversion to an RTC protocol also enables low-latency live streaming; this is how the most advanced systems on the market achieve it.

GRTN, however, does not merely switch the downlink playback to RTC; it runs RTC over the full link. Full-link RTC also addresses last-mile quality on the host side, resists packet loss between servers, and removes the losses introduced by intermediate protocol conversion, achieving lower transmission latency. With full-link RTC, congestion control becomes end to end: strategies such as end-to-end FEC, switching between high- and low-bitrate streams, and SVC/GOP dropping can work together to improve the user experience. End-to-end low-latency RTC also makes viewer co-streaming very simple: the viewer uploads one stream, the host pulls it down and plays it, and the connection is established.

Some readers may ask: how does an end-to-end RTC system differ from a traditional conferencing system? Traditional conferencing systems are usually deployed in a central machine room with few access nodes, whereas GRTN is deployed directly on the CDN, improving quality through global coverage.

The third feature of GRTN is its decentralized architecture with dynamic path planning. Anyone who has built a live CDN will remember: when there are many small hosts, origin-pull bandwidth is very expensive. Small hosts have low cache-hit rates and a high origin-pull ratio; in some cases more than 50% of the bandwidth goes to pulling from L2 (intermediate-origin) nodes, which incurs significant cost.

Moreover, because L2 (intermediate-origin) nodes and the central machine room carry stronger guarantees than L1 (edge) nodes, their bandwidth usually costs more, making origin-pull bandwidth even more expensive. Every live CDN tries to reduce origin-pull bandwidth, for example with 302 redirects or dedicated cold-stream clusters, but pulling through L2 remains unavoidable. A decentralized architecture means the pull path can go directly between L1 nodes without passing through L2 or the central machine room, shortening the transmission path; this greatly reduces origin-pull cost while improving transmission quality. Combined with dynamic path planning, it finds the best trade-off between cost and quality. And since audio/video streams need not traverse the center, the impact of a central failure is greatly reduced.

Most GRTN modules are shared across businesses, but some parts must be customized per business. Business customization mainly covers three areas: the client, congestion control and transmission strategy, and stream media processing, each with a unified access interface. The parts most relevant to transmission quality are congestion control and transmission strategy.

Congestion control algorithms adapted to the business



▐ Deep customization: push-stream stability on strong networks


Deep network optimization proceeds in three steps. First, add a strong-network detector to stabilize push-stream image quality for users on good networks, controlling the bitrate to avoid occasional packet loss and latency spikes. Second, optimize the speed-up and slow-down behavior of the AIMD bitrate control algorithm to raise bandwidth utilization on strong networks. Finally, bound what counts as a weak network: the native congestion control algorithm is overly sensitive to delay jitter, so its bitrate-adjustment strategy is smoothed.
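The smoothing idea can be sketched as follows (the back-off factor and patience threshold are illustrative, not the production values): react only to sustained congestion, and then back off gently instead of halving.

```python
def adjust_bitrate(bitrate_kbps, congested, state,
                   beta_smooth=0.85, patience=3):
    """Smoothed rate control sketch: ignore isolated congestion signals
    (jitter) and decrease gently (0.85x) instead of the native halving."""
    if congested:
        state["hits"] += 1
        if state["hits"] >= patience:        # sustained congestion only
            state["hits"] = 0
            return bitrate_kbps * beta_smooth
        return bitrate_kbps                  # isolated jitter: hold steady
    state["hits"] = 0
    return bitrate_kbps * 1.02               # slow probe upward on a clean report
```

A single jittery report no longer cuts the push bitrate, which is the strong-network stability described above.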

▐ Searching for the optimum systematically: self-learning parameter optimization


Conventional audio/video transmission tuning usually requires specialists adjusting parameters from experience, which demands scarce expertise. We instead let the system automate this tuning: systematically tune the theoretical default values of WebRTC's congestion control algorithm, map out parameter ranges from prior knowledge, make the parameters configurable, probe parameter batches guided by that prior knowledge, and evaluate the algorithm from multiple angles. Through continuous iteration the system converges on the optimal congestion-control parameters. A further benefit of self-learning parameters is that when the environment changes, the system updates the optimum automatically, whereas human experts can rarely adapt that quickly. After enabling self-learning parameter optimization on the push side, the push-stream stall rate dropped by 40% and latency dropped by 12%.
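The batch-search idea can be sketched as a grid search over prior-knowledge parameter ranges with a multi-metric score. Everything here is illustrative: `trial` stands in for replaying real network traces or A/B experiments, and the parameter names and weights are invented for the sketch.

```python
from itertools import product

def search(space, trial, w_stall=0.7, w_delay=0.3):
    """Batch-probe every parameter combo from the prior-knowledge ranges
    and keep the one with the best weighted stall/delay score."""
    best, best_score = None, float("inf")
    for combo in product(*space.values()):
        params = dict(zip(space.keys(), combo))
        stall, delay = trial(params)               # would be a real A/B trial
        score = w_stall * stall + w_delay * delay  # multi-angle evaluation
        if score < best_score:
            best, best_score = params, score
    return best

# Hypothetical ranges distilled from prior knowledge.
space = {"start_bitrate_kbps": [300, 600, 900],
         "loss_threshold": [0.02, 0.05, 0.10]}

# Synthetic trial: pretend 600 kbps and the lowest loss threshold behave best.
fake_trial = lambda p: (abs(p["start_bitrate_kbps"] - 600) / 1000,
                        p["loss_threshold"] * 10)
best = search(space, fake_trial)
```

Re-running the search whenever the environment shifts is what lets the system re-find the optimum faster than manual expert tuning.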


▐ Frontier exploration: congestion control based on reinforcement learning

Inspired by Pensieve [1], a leading line of academic research, we worked with Beijing University of Posts and Telecommunications to build a reinforcement-learning congestion control algorithm customized for Taobao Live. Combining this self-developed strategy with a traditional congestion control algorithm, on a stable network it reduces latency by 20% and stalls by about 25% without lowering bandwidth utilization.

Taobao Live's research in this direction has been recognized by the academic community, with papers published at the top conference MobiCom (Concerto [2], OnRL [3]) for two consecutive years.

[1] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. "Neural Adaptive Video Streaming with Pensieve." Proceedings of the Conference of the ACM Special Interest Group on Data Communication (SIGCOMM 2017), Los Angeles, CA, USA, August 21-25, 2017, 197-210.

[2] Anfu Zhou, et al. "Learning to Coordinate Video Codec with Transport Protocol for Mobile Video Telephony." The 25th Annual International Conference on Mobile Computing and Networking (MobiCom). 2019.

[3] Huanhuan Zhang, et al. "OnRL: Improving Mobile Video Telephony via Online Reinforcement Learning." The 26th Annual International Conference on Mobile Computing and Networking (MobiCom). 2020.


Network transmission strategies suited to live streaming


A congestion control algorithm mainly detects and reports network congestion; it cannot relieve congestion by itself. Relieving congestion requires cooperating network transmission strategies, including switching between high- and low-bitrate streams, SVC, GOP dropping, smooth (paced) sending, FEC, and ARQ/NACK.

Our congestion control module was fully modularized and refactored out of WebRTC, with more than twice WebRTC's performance. It is deeply optimized for live scenarios, balancing fast startup against latency, and supports both the GCC and BBR algorithms in pursuit of maximum throughput. It is unaffected by small network jitter and tolerates up to 20% packet loss and jitter within 500 ms. Unlike communication scenarios, which pursue minimal latency and smoothness above all, live streaming pursues high image quality to protect the viewing experience. The whole algorithm is iterated continuously, driven by online A/B data from live broadcasts.

▐ Zero-transcoding adaptive bitrate system


In most live systems in the industry, when a viewer's network is poor they can switch to a lower resolution to keep playback smooth.

Low-definition streams require cloud transcoding, which decodes and then re-encodes the stream. Encoding and decoding consume substantial computing resources and therefore incur significant transcoding cost.

Communication technology needs no transcoding: when the congestion control algorithm detects a congested link, it automatically lowers the sending bitrate to relieve the congestion. That approach cannot be applied directly in a live system, because a few viewers with poor networks would drag down quality for the entire audience. Live bitrate control must therefore treat the two directions separately. If the host's uplink is congested, the host's encoding bitrate should indeed be reduced. If some viewers' downlinks are congested, the host's bitrate must not change; instead, less important information is discarded at the CDN edge, achieving a lossy bitrate reduction. Strategies for reducing the bitrate include SVC, GOP dropping, and high/low-bitrate stream switching.

SVC is a technique that allows part of the video content to be dropped: viewing quality degrades slightly, but the content remains watchable. It suits relatively static scenes with little motion, where the loss is almost invisible, such as live commerce streams.

SVC comes in two flavors: spatial SVC and temporal SVC. Spatial SVC drops part of the content while keeping the frame rate unchanged, lowering the definition of each frame; temporal SVC drops part of the content and lowers the frame rate, while each remaining frame keeps nearly its original definition. In short, temporal SVC sacrifices fluency to preserve clarity, while spatial SVC sacrifices clarity to preserve fluency. At present, most spatial-SVC encoders in the industry suffer reduced compression efficiency, sometimes doing worse than simply encoding separate high- and low-bitrate streams, so spatial SVC is rarely used.
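A sketch of the temporal side: with a dyadic layer assignment (a common scheme; the layout below is illustrative, not S265's actual structure), dropping the highest layer at the edge halves the frame rate while every remaining frame still has its references.

```python
def temporal_layer(idx, layers=3):
    """Dyadic temporal-layer id for frame idx in a GOP (idx 0 = I frame).
    Lower layers are referenced by higher ones, so they must be kept."""
    if idx == 0:
        return 0
    period = 1 << (layers - 1)        # spacing of layer-0 frames
    for layer in range(layers):
        if idx % (period >> layer) == 0:
            return layer
    return layers - 1

def drop_top_layer(frame_indices, layers=3):
    """Edge-side congestion relief: discard the highest temporal layer,
    halving frame rate without breaking any reference chain."""
    return [i for i in frame_indices
            if temporal_layer(i, layers) < layers - 1]

kept = drop_top_layer(list(range(8)))   # frames 0..7 of a GOP
```

Only every other frame survives, so fluency halves but each surviving frame decodes at full quality, matching the temporal-SVC trade-off above.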

Flat structure

The structure commonly used in the industry, for example in x265, is a flat, single-layer reference structure.

Pyramid structure

Taobao Live uses the S265 encoder, which supports a multi-layer pyramid structure. Its advantage is that reference frames are closer and more strongly correlated. The higher a frame sits in the pyramid, the lower its priority and the smaller the impact of discarding it when the network is congested.

Consider time, quality, and level when selecting reference frames

Reference-frame selection matters a great deal. The better the quality of the chosen reference frame, the better the decoded quality; the closer the reference frame, the higher the inter-frame similarity and the better the compression. Because each layer loses some quality, the higher the layer, the worse its quality; so the higher the layer, the more important it is to shorten the reference distance and pick a high-quality reference frame.

Thus, with S265's temporal SVC, the CDN edge gains the ability to drop frames under congestion and lower the stall rate, while lossless playback actually compresses better: the overall compression rate improved by 5 to 6 percentage points.

Under extreme congestion this is still not enough; a GOP-drop strategy is needed to relieve congestion quickly. The bitrate is computed at GOP granularity: if bandwidth is insufficient, subsequent frames are discarded until the next I frame. This creates a problem of its own: the second half of the GOP then carries only audio, so there is too little data and the congestion-controlled bandwidth estimate drops sharply. A series of algorithm-level adjustments were made to handle this.
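The GOP-drop rule can be sketched like this (the packet shapes are hypothetical; the real implementation works on RTP packets at the edge node):

```python
def drop_to_next_keyframe(queue):
    """Extreme-congestion GOP drop: discard queued video frames until the
    next I frame; audio always passes so the room stays audible."""
    out, dropping = [], True
    for pkt in queue:
        if pkt["kind"] == "audio":
            out.append(pkt)             # audio is never dropped
        elif pkt["kind"] == "I":
            dropping = False            # a new GOP starts: resume video
            out.append(pkt)
        elif not dropping:
            out.append(pkt)             # P/B frames after the new I frame
    return out

queue = [{"kind": "P"}, {"kind": "audio"}, {"kind": "P"},
         {"kind": "I"}, {"kind": "P"}]
survivors = [p["kind"] for p in drop_to_next_keyframe(queue)]
```

The audio-only tail this produces is exactly what starves the bandwidth estimator and motivates the algorithm-level adjustments mentioned above.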

▐ Strategies against packet loss

Besides the bitrate-reduction techniques above, strategies against packet loss include FEC, ARQ/NACK, and smooth (paced) sending.

FEC (forward error correction) encodes extra redundancy so that the receiver can recover the original data even when some packets are lost. FEC was long used in broadcast television, where it is essential to one-way transmission, and later became standard in real-time audio/video communication. But a live system differs from both. With a huge audience, encoding FEC against each individual viewer's packet loss would send redundancy even to viewers with good networks, wasting bandwidth and raising cost; recomputing redundancy per viewer at the CDN edge would require heavy FEC matrix computation and drive up CDN compute cost. Therefore the live system encodes FEC with fixed redundancy on the host side, and the CDN edge node selects, per viewer network, what percentage of the FEC packets to pass through, countering packet loss while keeping cost down.
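The recovery idea behind FEC can be shown with the simplest possible code, a single XOR parity packet (production systems use larger Reed-Solomon-style matrices; this sketch only conveys the principle):

```python
def fec_encode(packets):
    """Append one XOR parity packet: any single lost packet is recoverable."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return packets + [bytes(parity)]

def fec_recover(received):
    """XOR every packet that arrived; the result is the one missing packet."""
    present = [p for p in received if p is not None]
    out = bytearray(len(present[0]))
    for pkt in present:
        for i, b in enumerate(pkt):
            out[i] ^= b
    return bytes(out)

group = fec_encode([b"pkt1", b"pkt2", b"pkt3"])  # 3 media + 1 parity packet
group[1] = None                                  # lose one packet in transit
restored = fec_recover(group)
```

The redundancy is computed once by the sender, which is why fixed host-side encoding plus selective edge pass-through is cheap for the CDN.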

ARQ and NACK are both techniques that request retransmission when packet loss is detected. When the network is congested, however, retransmission requests aggravate the congestion, so they must be used with care.

One major culprit behind playback stalls is network jitter: Wi-Fi or 4G interference can interrupt the network briefly, after which all the buffered data arrives at once. Sending it all in one burst can congest the network and cause stuttering; worse, a traffic spike from a very large host can saturate a CDN node outright, causing stalls over a much wider area.

The smooth-sending (pacing) strategy prevents bursts and smooths network traffic, which matters especially when many users enter a live room at the same time, and it is deeply customized for fast startup. We redesigned the sending mechanism and algorithms, improving sending performance to more than 1.4 times that of native WebRTC; smooth sending builds its sending logic on the UDP multi-packet system call sendmmsg, greatly improving efficiency. The playback side of a live scenario must start within seconds, so during startup the server sends extra GOP data according to configuration, which differs somewhat from the pacer of communication scenarios.

For the live scenario, we subdivide smooth sending into three stages:

  • Instant first frame

    • Mainly the data of the first I frame, which is usually very large in live streaming; it is sent to the player at the maximum configured speed.

  • Fast GOP-cache drain

    • After the first I frame is sent, the GOP buffer data is sent at a relatively high speed.

  • Normal sending

    • Data is sent at the configured speed, or at the bandwidth given by congestion control.
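The three stages above can be sketched as a rate selector (the names and rates are illustrative, not the production configuration):

```python
def pacing_rate(bytes_sent, first_iframe, gop_cache,
                burst_bps, catchup_bps, steady_bps):
    """Pick the send rate for the three startup stages of smooth sending."""
    if bytes_sent < first_iframe:
        return burst_bps                     # stage 1: first I frame at max speed
    if bytes_sent < first_iframe + gop_cache:
        return catchup_bps                   # stage 2: drain the GOP cache fast
    return steady_bps                        # stage 3: configured / CC-given rate

RATES = (10_000_000, 4_000_000, 2_000_000)   # hypothetical bps settings
```

A usage example: with a 100 KB first I frame and a 400 KB GOP cache, the pacer bursts through the key frame, drains the cache at the elevated rate, then settles to the steady rate.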


The live scenario also has periodic I frames to contend with. In communication, large I frames are sent only when PLI or FIR requests one, so traffic is relatively stable; in live streaming an I frame arrives on a fixed 2-4 s cadence (as in the right-hand side of the accompanying figure), and when bandwidth is insufficient a large I frame may take 200-400 ms to send. Since audio and video share the smooth-sending queue, audio would be blocked behind it. Therefore, during the startup stage audio and video are sent in interleaved order; after startup, audio is sent first.
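The startup interleaving can be sketched as follows (a simplification; the real pacer operates on a timed queue rather than plain lists):

```python
def schedule(audio, video, in_startup):
    """During startup, alternate audio and video packets so a large I frame
    cannot starve audio; after startup, audio always goes first."""
    if not in_startup:
        return audio + video
    out = []
    for a, v in zip(audio, video):
        out += [a, v]                       # interleave pairwise
    longer = audio[len(video):] or video[len(audio):]
    return out + longer                     # append whichever queue is longer

order = schedule(["a1", "a2"], ["v1", "v2", "v3"], in_startup=True)
```

Even while a multi-packet I frame drains, audio packets keep flowing between its chunks, so the viewer never loses sound during startup.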

Summary


With the rapid rise of e-commerce live streaming, delivering live content efficiently and reliably can no longer be achieved by single-point optimization, whether in algorithms, the client, or the server; a breakthrough requires designing and transforming the live system as a whole.

Taobao Live has now pioneered this new live architecture, completing RTC link access for the user's first and last mile; low-latency live streaming has brought the 5-7 second delays of the old era down to about 1 second. This milestone rests on the optimization and innovation of Taobao Live's congestion control algorithms and strategies, and on the communication-grade GRTN link system built jointly by Taobao Live and Alibaba Cloud.

For live scenarios we customized the congestion control algorithm: stabilizing push streams on strong networks, raising strong-network bandwidth utilization, and smoothing the bitrate control strategy. The self-learning parameter optimization system probes parameter batches based on prior knowledge and evaluates the algorithm from multiple angles, iterating continuously to find the optimal congestion-control parameters; after enabling it on the push side, the push-stream stall rate dropped by 40% and latency by 12%. For reinforcement-learning congestion control, we cooperated with Beijing University of Posts and Telecommunications on an algorithm customized for Taobao Live that combines a self-developed strategy with traditional congestion control; on a stable network it reduces latency by 20% and stalls by about 25% while maintaining bandwidth utilization. Taobao Live's research in this direction has been recognized by the academic community, with papers published at the top conference MobiCom for two consecutive years.

For a communication-grade live system, GRTN's design concept is to transmit live multimedia data with communication technology, so that a single network supports both audio/video communication and live streaming. Operators usually bill bandwidth by peak or by 95th-percentile peak, so staggering the peaks of multiple services shifts load and cuts cost. On the CDN's downstream edge nodes, protocol conversion to RTC enables low-latency live streaming; but GRTN goes further than converting only the downlink playback, running RTC over the full link. GRTN is also a decentralized architecture, and with dynamic path planning it finds the best trade-off between cost and quality. On top of this system run the network transmission strategies: high/low-bitrate stream switching, SVC, GOP dropping, smooth sending, FEC, ARQ/NACK, and more.

Taobao Live's Double 11 now runs entirely on the GRTN system. Compared with last year's Double 11, the fast-start rate increased by 32%, the stall rate decreased by 79%, and stalled views (stall VV) decreased by 44%. Going forward, the Taobao Live team will continue optimizing the system and exploring new interactive formats to support the rapid growth of e-commerce live streaming.




Author | Tao Technology Audio and Video Team

Editor | Orange

Produced by | Alibaba New Retail Technology


Origin blog.csdn.net/Taobaojishu/article/details/111189128