A detailed description of streaming media transmission network MediaUni

A network of "diverse integration".

Haiyu Huang | Speaker

Hello everyone, I am Huang Haiyu from Alibaba Cloud Video Cloud. The topic of my talk today is MediaUni: Design and Practice of a Future-Oriented Streaming Media Transmission Network.

Next, I will cover: application requirements on streaming media transmission networks; MediaUni's positioning and system architecture; an analysis of MediaUni's technology; applications built on MediaUni; and the future of streaming media transmission networks.

01 Application requirements on streaming media transmission networks

As shown in the figure, latency increases from left to right across scenarios. At the far right, latency above 30 seconds is typical of broad-distribution scenarios such as video on demand; the industry serves such transmission needs with a VOD CDN.

Latency of 3-10 seconds is common in live streaming. Protocols such as RTMP, HTTP-FLV, and HLS are used for streaming or small-segment delivery, meeting a 3-5 second latency target, which is usually good enough for bullet-comment (danmaku) interaction. The industry typically serves this with a live-streaming CDN.

The recently emerging tier of roughly 800 milliseconds to 1 second of latency can generally meet the transmission requirements of live sports broadcasts. In addition, practice in Taobao and other e-commerce scenarios shows that live streams at this latency level can significantly improve GMV compared with ordinary 3-10 second live streams.

The latency of video conferencing, live co-streaming, and voice chat is usually between 250 and 400 milliseconds. The industry typically builds low-latency audio/video communication networks on BGP or tri-line bandwidth to support real-time transmission.

Scenarios with even lower latency, 50-80 milliseconds, are the controllable and immersive scenarios. A typical controllable scenario is parallel driving: when autonomous driving runs into trouble, a remote operator takes over the vehicle through remote commands. Cloud gaming and cloud rendering are typical immersive scenarios.

As can be seen, lower latency unlocks more business scenarios and more new ways to play.

Taking cloud rendering, the scenario with the lowest latency, as an example: from its top-level architecture diagram, the user generates a series of operation commands, which travel over the transmission network to the business server, which in turn drives the rendering server. The rendered data passes through a streaming media server, is carried over the streaming media transmission network, and finally reaches the user's device.

Compared with traditional media transmission, this architecture has two characteristics: first, the video comes from rendering rather than shooting, which involves a large amount of computation; second, the network transmits not only media but also control commands, which requires low latency and reliable delivery of those commands.

Cost and quality are two unavoidable topics for a transmission network. The figure on the left shows the cloud cost composition of a typical live-streaming app customer of Alibaba Cloud: about 70% comes from transmission, which means that in today's cost-cutting environment, enterprises demand lower transmission costs.

The chart on the right shows the relationship between playback time and stall rate for another typical video app. Through A/B testing, we found that as the stall rate decreases, playback time increases significantly.

Overall, applications place four requirements on a streaming media transmission network: first, lower latency, which unlocks more business scenarios; second, low cost, since media transmission is a major part of a video application's cost; third, higher quality, which increases users' viewing time; fourth, multi-dimensionality: besides media, the network must also carry interactive messages and control signaling, and connect closely with computing.

02 MediaUni positioning and system architecture

Based on these considerations, Alibaba Cloud designed and implemented MediaUni, where "Uni" comes from the word "unified", meaning serving diverse business scenarios through one unified network.

MediaUni is an upgrade of Alibaba Cloud's global real-time transmission network GRTN: a fully distributed, ultra-low-latency, multi-service, converged streaming media transmission network built on a wide range of heterogeneous nodes.

The core concept of MediaUni is integration, which is driven by three considerations.

First, mixed business operation improves resource utilization. The basic business logic of cloud vendors is that the cloud is an elastic, pay-as-you-go service, available immediately after purchase. Since different users use resources at different times, fixed resources can be time-shared across customers, so that the cloud's resource utilization exceeds what any single customer's purchased resources would achieve, thereby reducing cost. Unlike general cloud computing, a notable feature of streaming media is the temporal clustering of traffic: most consumer entertainment live streaming happens between 8 and 11 p.m., video conferencing is concentrated in daytime working hours, and sports events can occur at any time. If we supported only one of these businesses, resource reuse could not be achieved.

The theoretical basis is the central limit theorem of probability theory: the sum of many independent random variables approaches a normal distribution, and the more variables there are, the more tightly their sum concentrates, in relative terms, around its mean. For streaming transmission, each customer's resource usage is a random variable, and their total demand is the total amount of resources the provider must guarantee. To achieve a higher reuse rate, we expand the scale and diversity of the business as much as possible, thereby reducing overall cost.
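As a back-of-the-envelope illustration (a sketch, not part of MediaUni), the following simulation shows that the relative fluctuation of aggregate demand shrinks as more independent customers share a pool, which is why provisioning for the pooled sum needs proportionally less headroom:

```python
import random
import statistics

def relative_fluctuation(num_customers: int, trials: int = 2000) -> float:
    """Coefficient of variation of total demand when `num_customers`
    independent customers each draw a random bandwidth demand."""
    rng = random.Random(42)  # fixed seed for reproducibility
    totals = [
        sum(rng.uniform(0, 100) for _ in range(num_customers))
        for _ in range(trials)
    ]
    return statistics.stdev(totals) / statistics.mean(totals)

# More independent customers -> smoother aggregate demand, so the
# provider needs less spare capacity per unit of traffic served.
small_pool = relative_fluctuation(5)
large_pool = relative_fluctuation(500)
```

By the central limit theorem the coefficient of variation falls roughly as one over the square root of the number of customers, so `large_pool` comes out about ten times smaller than `small_pool`.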

Second, technology reuse lowers the marginal cost of R&D. Whether it is a VOD transmission network, a live-streaming network, or real-time audio/video communication, common technologies are involved, such as protocol optimization, QoS optimization, and scheduling. Implementing them once on one network yields better reuse and lets each single-point technology go deeper, thereby delivering greater business value.

Third, cloud products become more convenient and efficient. From the cloud vendor's perspective, integrating all transmission capabilities makes products easier to use. For example, on one integrated network, users can upgrade an ordinary live broadcast (3-5 second delay) to a low-latency interactive live broadcast (400 milliseconds).

The figure shows the architecture of MediaUni.

MediaUni is built on Alibaba Cloud's extensive heterogeneous resources: 3200+ edge computing nodes, 29 regions, and more than 180 Tbps of bandwidth. The computing resources involved include CPU, GPU, and ASIC.

There are two very important systems here. The first is the perception system, which needs three perception capabilities:

① Perceiving business information. Some application events are only known after they actually happen; for example, if the audience of a live stream suddenly surges, we must perceive it in real time in order to provide differentiated service;

② Perceiving resource information, including base resource water levels, CPU and memory status, bandwidth usage, network latency, packet loss rate, etc.;

③ Perceiving service quality, such as the stall rate of each service, first-frame latency, and push/pull streaming success rate.

The second is the decision-making system. After perception information is collected, it is pushed to the decision-making system, which makes scheduling decisions that balance cost and quality: access scheduling, route selection, real-time escape strategies, computing-power scheduling, and more.

The overall architecture of MediaUni has several distinct features:

•  Multi-protocol support. Protocols are terminated at the edge: edge nodes offload the various access protocols, and the center adopts a unified internal transmission protocol, which helps us expand business scenarios without affecting the internal core system;

•  Hybrid networking. Tree topologies serve scenarios with relaxed latency requirements, while peer-to-peer (mesh) topologies serve latency-critical scenarios such as audio and video communication;

•  Inter-network routing selects dynamic back-to-origin paths to serve users;

•  Full-link programmability. When many businesses run on the same network, they may interfere with one another. To minimize system changes, we support full-link programmability, so that most business requirements can be met programmatically;

•  Combination of computing and network. Computing is deployed on network nodes, which improves resource utilization and reduces the cost and latency of moving data to the computation;

•  Heterogeneous network and computing resources. The system supports single-line, multi-line, and BGP network resources, as well as CPU, GPU, and ASIC computing;

•  Continuously iterated QoS. QoS algorithms are pluggable, with complete A/B testing capabilities to support continuous iteration.
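To make the pluggable-QoS-plus-A/B idea concrete, here is a minimal illustrative sketch; the registry, algorithm names, and FEC ratios are hypothetical, not MediaUni's actual API:

```python
import hashlib

# Registry of pluggable QoS algorithms; each entry is a callable here.
QOS_ALGORITHMS = {}

def register_qos(name):
    """Decorator that plugs a QoS algorithm into the registry."""
    def wrap(fn):
        QOS_ALGORITHMS[name] = fn
        return fn
    return wrap

@register_qos("conservative")
def conservative(loss_rate):
    return {"fec_ratio": 0.05 if loss_rate < 0.02 else 0.15}

@register_qos("aggressive")
def aggressive(loss_rate):
    return {"fec_ratio": 0.10 if loss_rate < 0.02 else 0.30}

def pick_variant(session_id, arms=("conservative", "aggressive"), ratio=0.5):
    """Stable A/B bucketing: hash the session id so a session always
    lands in the same experiment arm across reconnects."""
    h = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    bucket = (h % 1000) / 1000.0
    return arms[0] if bucket < ratio else arms[1]

variant = pick_variant("session-1234")
policy = QOS_ALGORITHMS[variant](loss_rate=0.05)
```

Because bucketing is a pure function of the session id, per-arm quality metrics (stall rate, first frame) can be compared cleanly to decide which algorithm wins the iteration.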

Convergence brings many benefits but also extra challenges. The first is the adaptation and management of heterogeneous resources. The second, and the biggest for this network, is providing reliable service on top of unreliable resources. A typical audio/video communication network uses premium resources to achieve low latency and stability; for a unified network, selecting better nodes from a much larger pool and providing better service is critical.

Another challenge is the rapid iteration demanded by multi-service mixed operation: the more businesses run on the network, the faster the system must evolve. Meeting that pace of iteration while containing the resulting stability risk is also a great challenge. Finally, there are the challenges of standardization and intelligence.

03 Analysis of MediaUni Technology

The following briefly introduces MediaUni's support for different scenarios.

The live streaming scenario has several characteristics: large scale brings high concurrency and cost sensitivity, and live events can be sudden. Alibaba Cloud takes the following measures. First, hybrid networking: tree topologies for broadcasts with 3-10 second latency, and peer-to-peer topologies when the latency requirement reaches about 400 milliseconds. Second, second-level perception of hot streams: when a stream suddenly turns from cold to hot, serving resources are rapidly and dynamically scaled out. In addition, since the industry's live streaming business is still dominated by TCP-based protocols, we have deeply optimized the protocol stack for better quality of service. Finally, quality and cost are operable, and can be tuned according to each business's quality and cost preferences.
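A second-level hot-stream detector could be sketched as follows; the window size and thresholds are illustrative assumptions, not Alibaba Cloud's real values:

```python
from collections import deque

class HotStreamDetector:
    """Sketch of second-level cold->hot detection for a live stream."""
    def __init__(self, window_secs=5, hot_viewers=1000, growth_factor=5.0):
        self.samples = deque(maxlen=window_secs)  # one viewer count per second
        self.hot_viewers = hot_viewers
        self.growth_factor = growth_factor

    def observe(self, viewers: int) -> bool:
        """Feed one per-second viewer count; return True if the stream
        should be promoted to 'hot' and its serving resources scaled out."""
        baseline = min(self.samples) if self.samples else viewers
        self.samples.append(viewers)
        if viewers >= self.hot_viewers:
            return True  # absolute size alone makes it hot
        # Sudden growth relative to the recent window also counts.
        return baseline > 0 and viewers / baseline >= self.growth_factor

det = HotStreamDetector()
cold = det.observe(50)      # steady cold stream
hot = det.observe(300)      # 6x surge within the window
```

Keeping the window only a few seconds long is what makes the reaction second-level: the decision depends solely on very recent samples, not on long-term averages.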

The real-time communication scenario has two characteristics: low latency and scenario diversity. To guarantee low latency, various QoS policies are usually adopted, but QoS policies differ greatly across scenarios; for example, the QoS techniques used in video conferencing and in ordinary voice chat are very different. MediaUni therefore adopts dynamic networking, nearby access, pluggable QoS, and second-level escape to address these problems.

Nearby access means the link status between users and nodes is perceived quickly, so the server can quickly adjust QoS policies. Pluggable QoS helps the team meet the QoS needs of various scenarios. Alibaba Cloud's transmission relies on a large number of edge nodes of varying reliability, so these nodes must be sensed quickly to prevent failures and service quality degradation.

The cloud rendering scenario is characterized by ultra-low latency and control-signaling transmission. Here we adopt nearby access and nearby heterogeneous computing: the computing node and transmission node closest to the user are selected for access, and heterogeneous computing serves the user.

Data transmission scenarios usually require low latency and high reliability, with relatively small data volumes. Data transmission is supported through protocol multiplexing and extension, while multipath and FEC technologies improve reliability and reduce retransmission.
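The idea behind FEC reducing retransmission can be shown with the simplest scheme, an XOR parity packet that recovers any single loss in a group; real deployments use stronger codes such as Reed-Solomon, so this sketch is only illustrative:

```python
def xor_parity(packets):
    """Build one XOR parity packet over equal-length data packets;
    it can recover any single lost packet in the group."""
    assert packets and len({len(p) for p in packets}) == 1
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """XOR of the surviving packets and the parity yields the lost one."""
    return xor_parity(list(received) + [parity])

group = [b"pktA", b"pktB", b"pktC", b"pktD"]
parity = xor_parity(group)
# Suppose pktC is lost in transit; it is rebuilt without retransmission:
restored = recover([b"pktA", b"pktB", b"pktD"], parity)
```

The receiver pays one extra packet per group of bandwidth but avoids a full round trip of retransmission delay, which is the trade-off that matters in low-latency scenarios.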

Delivering high quality on top of unreliable resources is a huge challenge for a transmission network, and it is also one of its key technologies.

The unreliability of the transmission network comes mainly from two aspects. The first is unreliable resources: on the one hand, the availability of edge nodes varies; on the other hand, the Internet is a best-effort network that frequently jitters. Counteracting network-quality jitter is a big challenge.

The second is the business characteristics of streaming media transmission, including long-lived connections and hotspot bursts. To keep latency low, streaming media usually uses long-lived connections, and system resources must be allocated when the connection is established. But the audio/video bitrate fluctuates dynamically: during the World Cup, for example, a stream may run at about 1 Mbps at the start and spike to 4 Mbps, 5 Mbps, or higher at a goal. Moreover, when a hotspot event breaks out, resource demand expands rapidly, so user services require highly elastic resources.

We made two optimizations in response. The first is weak-network resistance. From the production side through transmission to playback there are many strategies: on the production side, noise reduction in pre-processing, bitrate reduction through Alibaba Cloud's narrowband HD technology, and support for simulcast and SVC; on the transmission side, dynamic routing, protocol-stack optimization, FEC, and multipath; on the playback side, jitter buffer, NetEQ, post-processing, and so on.
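On the playback side, the role of the jitter buffer can be sketched minimally: hold a few packets so that out-of-order arrivals are released in sequence order (the depth and data structure here are illustrative, not a production design):

```python
import heapq

class JitterBuffer:
    """Minimal playback-side jitter buffer: absorb network reordering by
    holding `depth` packets and releasing them in sequence order."""
    def __init__(self, depth=3):
        self.depth = depth
        self.heap = []  # (sequence_number, payload) pairs

    def push(self, seq, payload):
        """Insert a (possibly out-of-order) packet; return the packet
        released for playback once the buffer exceeds its depth."""
        heapq.heappush(self.heap, (seq, payload))
        if len(self.heap) > self.depth:
            return heapq.heappop(self.heap)
        return None

buf = JitterBuffer(depth=2)
out = [buf.push(s, "frame-%d" % s) for s in (1, 3, 2, 4)]
released = [x for x in out if x is not None]
# Packets come out in sequence order despite arriving as 1, 3, 2, 4.
```

The depth is the classic latency/smoothness trade-off: a deeper buffer rides out more reordering and jitter but adds fixed playback delay, which is why low-latency scenarios keep it as shallow as the network allows.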

The second optimization is perception and escape. Because the network is built on unreliable resources, it must perceive resource status and business status in real time, then perform second-level escape through intelligent decision-making.

Below are some details of the key technologies of perception and escape.

We devised a multidimensional perception scoring system. Streaming transmission usually only perceives the availability of service nodes; to support multiple services with better quality, we introduced the concept of scoring, rating each node and each inter-node link.

Scores come from three sources. First, node logs, including resource information, resource water levels, and software running status. Second, an end-to-end probing system: we cannot trust server logs alone, because quality or availability problems may prevent some customers from ever reaching the service at all, a form of survivorship bias. We therefore deployed hundreds of thousands of end devices that continuously probe the system, and their results are an important scoring input. Finally, business-layer quality data, such as core indicators like stall rate, first frame, and streaming success rate, also serve as a major basis for scoring.
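A minimal way to picture combining the three scoring sources; the weights and normalization here are assumptions for illustration, not the production formula:

```python
def node_score(log_health, probe_success, qos_quality,
               weights=(0.3, 0.4, 0.3)):
    """Combine the three perception sources into one 0-100 node score.
    All inputs are normalized to [0, 1]; weights are illustrative."""
    w_log, w_probe, w_qos = weights
    assert abs(w_log + w_probe + w_qos - 1.0) < 1e-9
    score = 100 * (w_log * log_health
                   + w_probe * probe_success
                   + w_qos * qos_quality)
    return round(score, 1)

# A node whose own logs look healthy but whose external probes fail
# scores low anyway, which is exactly the survivorship bias the
# end-to-end probing system exists to catch.
healthy = node_score(log_health=0.95, probe_success=0.98, qos_quality=0.90)
suspect = node_score(log_health=0.95, probe_success=0.20, qos_quality=0.90)
```

Scoring links the same way (loss, RTT, jitter as inputs) gives the decision system one comparable number per node and per edge to rank candidates against cost.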

After the multidimensional perception system produces these scores, it outputs a score for each node and each link and feeds the results to the intelligent decision-making system. That system makes decisions based on business type, service level, and resource cost and water level, covering node control, user access scheduling, dynamic route switching, and lossless switching. Because audio and video use long-lived connections, they are hard to reschedule: when a fault occurs or perceived quality declines, the original connection must be migrated, including lossless migration of processing tasks and of links along the transmission path.

There are three difficulties here. The first is perception speed. We use two methods: we classify traffic and use high-priority channels to transmit data even when system resources are highly utilized; and we compute at the edge, performing many calculations, including scoring, on the nodes themselves, which greatly reduces the data to be transmitted and thus speeds up transmission. The second difficulty is lossless switching: imperceptible audio/video switchover requires frame-level continuation, and details such as timestamp continuity and distributed-system call exceptions must be handled. The third difficulty is decision-making: with so much information and different quality requirements per business, making intelligent decisions that also meet cost requirements is a big challenge.
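Frame-level continuation with timestamp continuity can be sketched as follows: when the link is migrated, an offset is chosen so that restamped output continues seamlessly from the last emitted frame (the class and field names are hypothetical, not MediaUni's implementation):

```python
class TimestampRestamper:
    """Sketch of keeping timestamps continuous when a stream is migrated
    to a new upstream link whose clock has a different base."""
    def __init__(self, frame_interval_ms=40):  # 40 ms = 25 fps
        self.frame_interval = frame_interval_ms
        self.last_output = None
        self.offset = 0

    def on_switch(self, first_new_ts):
        """Given the first timestamp seen on the new link, choose an
        offset so output continues one frame after the last emitted."""
        expected = (self.last_output or 0) + self.frame_interval
        self.offset = expected - first_new_ts

    def restamp(self, ts):
        self.last_output = ts + self.offset
        return self.last_output

r = TimestampRestamper()
old = [r.restamp(t) for t in (0, 40, 80)]        # original link
r.on_switch(first_new_ts=100000)                  # new link, different clock base
new = [r.restamp(t) for t in (100000, 100040)]    # output stays continuous
```

The player downstream never sees the clock discontinuity, which is the precondition for the switch being imperceptible.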

After technical iterative optimization, Alibaba Cloud has successfully achieved second-level switching when a node fails, and guaranteed minute-level escape when the quality deteriorates.

Another key technology is application-layer linkage with the protocol stack, mainly for TCP scenarios.

TCP scenarios have long had an optimization called unilateral acceleration: the server controls its own protocol stack and can modify the kernel, while the client's TCP stack is provided by the client operating system and cannot be modified. QUIC, which emerged in recent years, can perform congestion control and stack optimization at the application layer, but since TCP remains the main protocol for streaming media transmission, we optimized it in depth. Traditional unilateral acceleration applies a generic acceleration by modifying the congestion-control strategy, but the transport and application layers remain independent of each other: the layer-4 TCP stack cannot perceive layer-7 application information, and layer 7 cannot perceive layer-4 network changes. This is mainly because layer 4 is designed as a general-purpose protocol that does not distinguish streaming media from files. We therefore opened up the information between layers 4 and 7, enabling linkage in both directions.

The picture above on the right shows the streaming media transmission engine Tengine-Live. It sits at the application layer, where it can obtain business information and bitrate information and control packet-dropping strategies. During transmission, service preferences, bandwidth requirements, bitrate fluctuations, and bandwidth size are passed down to the operating system through a layer-4 channel, and the operating system takes different countermeasures at different stages of transmission.

Specifically, the operating system can select the congestion-control algorithm based on application-layer information, such as our self-developed Cubic-like and BBR-like algorithms, and can also select sub-algorithms. It can further fine-tune an algorithm with the imported information: for example, choosing the initial window according to the bitrate, balancing the aggressiveness of the strategy according to the playback stage, and adjusting algorithm parameters.
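A toy sketch of this layer-7-to-layer-4 linkage; the thresholds, window-sizing rule, and algorithm names are illustrative assumptions, not the real selection logic:

```python
def choose_congestion_control(bitrate_kbps, playback_stage):
    """Pick a congestion-control algorithm and initial window from
    application-layer hints (illustrative thresholds only)."""
    # Loss-based control for modest bitrates; model-based (BBR-like)
    # control when high bitrate makes bandwidth probing worthwhile.
    algo = "bbr-like" if bitrate_kbps >= 2000 else "cubic-like"
    # Size the initial window so roughly the first second of media fits
    # without waiting for slow start to ramp up (assumes 1500-byte MSS).
    bytes_per_sec = bitrate_kbps * 1000 // 8
    init_cwnd = max(10, min(100, bytes_per_sec // 1500))
    # Startup favors a fast first frame; steady state favors smoothness.
    aggressiveness = "high" if playback_stage == "startup" else "normal"
    return {"algo": algo, "init_cwnd": init_cwnd,
            "aggressiveness": aggressiveness}

hd = choose_congestion_control(bitrate_kbps=4000, playback_stage="startup")
sd = choose_congestion_control(bitrate_kbps=500, playback_stage="steady")
```

The point of the linkage is that these choices vary per stream and per playback stage, which a generic layer-4 stack that cannot see the application simply has no basis to make.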

Transmission information also flows back from layer 4 to layer 7: the underlying congestion controller passes real-time bandwidth, congestion status, RTT, packet loss rate, and other information up to the application layer. If the application layer finds the underlying network seriously congested, it can adopt a more aggressive packet-dropping strategy.

After the technology was launched, playback first-frame time was reduced by 12.1 milliseconds, the stall rate was reduced by 32.5%, and the streaming success rate increased by 0.101%.

Full-link programmability copes with the challenges of multi-service mixed operation, which causes explosive growth in requirements; a minimally intrusive method is needed to iterate on requirements quickly.

To this end, we developed a full-link programmable system with programmable extension points for streaming media transmission and processing, abstracting atomic C-code capabilities into corresponding Lua APIs. By combining calls to these APIs, we assemble business capabilities covering playback behavior, timestamps, QoS algorithm iteration, and dynamic transcoding rules. Once written, Lua code is delivered through the configuration center to the streaming transmission and processing networks, enabling rapid iteration of new features and optimizations.

A few points deserve attention. High performance: for transmission nodes, Lua performance matters greatly because node throughput is very high; we use Lua FFI extensively instead of Lua C functions to reduce Lua stack interactions, and we also optimize hotspots. Security: the operable range of Lua must be strictly limited to avoid affecting the system. Flexibility is the very purpose of programmability: the implementation supports dynamic delivery, grayscale release, A/B experiments, hot update, pluggability, and other capabilities that improve programming flexibility.

04 Applications built on MediaUni

During the Qatar World Cup, we supported ultra-low-latency live broadcasts, reducing end-to-end latency to under one second. We also delivered a better playback experience: the stall rate dropped significantly, from 3.19% to 2.13%.

In cloud rendering scenarios, nearby transmission and nearby processing reduce end-to-end latency to 58 milliseconds. This technology has been applied in the Central Expo Cloud Temple Fair and in Taobao Future City during last year's Double 11.

Cloud-based art exams use multiple capabilities together. In the remote proctoring scenario, when a teacher wants to focus on a student, one click enters live mode with 400-millisecond latency. If the teacher needs to talk with the student, turning on the microphone enables two-way communication within 400 milliseconds. Meanwhile, the media processing service composes these streams in real time into a single picture of one teacher facing multiple students. In addition, on top of the data transmission capability, the application layer implements an announcement function, transmitting exam-room rules to students over the network to be read out and displayed.

05 The future of streaming media transmission networks

The future of streaming media transmission networks has four main directions: lower latency, smarter, more open, and computing-network integration.

With the development of AIGC, more and more video content is no longer limited to what cameras shoot but can be generated by AIGC. When so much content is produced this way, lower latency will undoubtedly enable more new experiences.

Both the scheduling system described above and the layer-4/layer-7 linkage strategy embed a great deal of intelligent logic. More broadly, the transmission system contains a huge number of parameter and logic combinations. These are currently chosen empirically or through A/B tests, yet their effects are measurable: by measuring core playback metrics, we can form a supervised learning loop. For tuning complex in-network parameters based on feedback, smarter is undoubtedly the better path.

In the field of video, the two most important directions are video processing and video transmission. Video processing already has many international standards, such as H.265, H.266, and AV1. In video transmission, however, standards are less widely defined or used: low-level protocols such as RTP specify only a very basic packet format, while most application-facing protocols, such as RTMP and HLS, were defined by individual companies and have never been well unified. With the arrival of AIGC, the division of labor along the video pipeline becomes finer, and the industry needs more standards to let different tools and service providers exchange data. Alibaba Cloud is actively participating in the standardization of several protocols to promote more open protocols.

The last point is computing-network integration. As mentioned earlier, AIGC will cause more video to be generated through computing. It then becomes necessary to fold more computing power into network capability and schedule them jointly, further reducing overall resource consumption and bringing users higher-quality service.

Thank you all!


Origin my.oschina.net/u/4713941/blog/10093985