WebRTC server design summary

Introduction

In the previous article, How to Design an RTC PaaS Service?, we described from an overall perspective the functions and services needed for a complete RTC PaaS offering. In this article, let's talk through our thinking on the details of server design.
First of all, let's reiterate the problems and challenges we face:

  • Network latency (Latency/Delay)
  • Network packet loss (Packet Loss)
  • Out-of-order delivery
  • Jitter/Delay variations
  • Network congestion (Bandwidth Overuse)
  • Service reliability / High Availability
  • High concurrency
  • High performance

Setting aside the last three, which are general server design goals, the first five are the factors that any good streaming media server must account for in order to guarantee audio and video quality. The starting point of many of the designs below is closely tied to them.

Scheduling server

Let's start from the article Improving Scale and Media Quality with Cascading SFUs (Boris Grozev) to discuss why a scheduling server is needed, what its main functions are, and the thinking behind it.

As that article notes, the design of the whole RTC system must not only support flexible scaling across multiple media servers, but also provide the optimal network link and media latency for every user joining a conference. Real-time communication is very sensitive to network conditions such as bandwidth, latency, jitter, and packet loss rate: a lower bitrate means worse video quality, a longer network delay means a longer end-to-end audio/video latency, and packet loss often shows up as intermittent audio and choppy video. This is the value of the scheduling server: it selects the optimal network link for media transmission for the users in a conference.

First, the scheduling server needs to manage all the media servers. It must know each media server's public IP, service port, private IP, geographical location, maximum resource load, real-time resource load, and so on, so that it can allocate the optimal media server based on the user's information. Registration plus a timed heartbeat easily satisfies these requirements and also supports horizontal scaling of media servers (distributed SFU), as sketched below. So how do we allocate the optimal media server for a user?
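To make the registration-plus-heartbeat idea concrete, here is a minimal sketch in Go. The endpoint paths, field names, and the 5-second interval are all assumptions for illustration, not a real scheduler API:

```go
// Hypothetical sketch: a media server registers itself with the scheduler
// and refreshes its state with a periodic heartbeat. All names are made up.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// NodeInfo is what the scheduler needs in order to pick an optimal media server.
type NodeInfo struct {
	NodeID    string `json:"node_id"`
	PublicIP  string `json:"public_ip"`  // external network IP
	PrivateIP string `json:"private_ip"` // internal network IP
	Port      int    `json:"port"`       // service port
	Region    string `json:"region"`     // geographical location
	MaxLoad   int    `json:"max_load"`   // maximum resource load
	CurLoad   int    `json:"cur_load"`   // real-time resource load
}

func report(url string, info NodeInfo) error {
	body, _ := json.Marshal(info)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func currentLoad() int { return 0 } // placeholder for real load sampling

func main() {
	node := NodeInfo{NodeID: "sfu-01", PublicIP: "1.2.3.4", PrivateIP: "10.0.0.4",
		Port: 4443, Region: "east-china", MaxLoad: 1000}

	// Register once, then refresh with a timed heartbeat; the scheduler would
	// mark a node offline after several consecutive missed heartbeats.
	if err := report("http://scheduler/register", node); err != nil {
		log.Fatal(err)
	}
	for range time.Tick(5 * time.Second) {
		node.CurLoad = currentLoad() // e.g. active streams / CPU / bandwidth
		if err := report("http://scheduler/heartbeat", node); err != nil {
			log.Println("heartbeat failed:", err)
		}
	}
}
```

A real deployment would also authenticate registration requests and expire stale nodes on the scheduler side.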

If we only follow the strategy of "assign all users of the same conference to the same media server", we obviously cannot achieve the optimum in network link and media latency for cross-country, cross-region, and cross-carrier scenarios. The figure below shows such a star topology: all endpoints connect to the same central media server and push/pull streams through it. This strategy, the classic centrally forwarded video-conferencing architecture, is bad for user experience because:

  • Transmission link quality is not guaranteed when users connect from far away, especially across borders or regions.
  • When domestic users connect across carriers, network jitter is large and packet loss is high, hurting conference quality.
  • With limited CPU, memory, and bandwidth, the capacity and load of a single-point media server are limited.

So how do we solve these problems and build a real-time transmission network with intelligent scheduling?

In general, the overall principles are as follows:

  • Deploy media servers per region and per carrier, and let users connect to the nearest node through the access service, guaranteeing last-mile quality.
  • Deploy routing nodes flexibly and on demand, distribute media traffic through them, and select the optimal transmission path according to real-time network quality.
  • Guarantee transmission quality inside the transport network itself, for example by placing media servers in the same VPC LAN or connecting different VPCs with private lines, with the cloud provider ensuring link stability and reliability. The actual topology is shown in the figure below:

Assuming the media nodes are all deployed in multi-line BGP data centers, the commonly used scheduling strategies are as follows (a sketch follows this list):

  • The same room/conference in the same region is preferentially assigned to the same media node, avoiding unnecessary server cascading.
  • By the geography-first principle, users in the same area are assigned to the nearest node first.
  • By the load-first principle, among the nodes that qualify, the least-loaded one is chosen.
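Here is an illustrative sketch of those three rules in Go. The types and the scoring are assumptions; a production scheduler would score regions with measured latency maps rather than string equality:

```go
// Hypothetical scheduling decision following the three rules above.
package sketch

type MediaNode struct {
	ID      string
	Region  string
	CurLoad int
	MaxLoad int
}

// pick applies: (1) a room stays on its existing node when possible,
// (2) otherwise prefer the user's region, (3) break ties by lowest load.
func pick(nodes []MediaNode, roomNode, userRegion string) *MediaNode {
	var best *MediaNode
	for i := range nodes {
		n := &nodes[i]
		if n.CurLoad >= n.MaxLoad {
			continue // node is full
		}
		if n.ID == roomNode {
			return n // rule 1: avoid cascading when the room already has a node
		}
		if best == nil ||
			regionScore(n.Region, userRegion) > regionScore(best.Region, userRegion) ||
			(regionScore(n.Region, userRegion) == regionScore(best.Region, userRegion) &&
				n.CurLoad < best.CurLoad) {
			best = n // rules 2 and 3: geography first, then load
		}
	}
	return best
}

func regionScore(nodeRegion, userRegion string) int {
	if nodeRegion == userRegion {
		return 1 // same region wins; real systems would use latency data
	}
	return 0
}
```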

Generally, node locations are selected according to the distribution of users, and then a zoning strategy is formulated (this requires actual data to support it). For example, four nodes are deployed in China: North China, South China, East China, and Southwest China; the rest of Asia is assigned to a Singapore node, and Europe to a Frankfurt node.

For users in regions without node coverage, in theory the least-loaded node in the overall pool is assigned.

Second, the scheduling server needs to support shortest-path intelligent routing and assist with media-server cascading. Allocating the optimal node from user information solves the last mile, but the paths between media nodes are affected by node health and the actual network links (service crashes, data-center failures, fiber cuts, and so on), and are also constrained by the physical network topology. If nodes are deployed on Alibaba Cloud's Cloud Enterprise Network, dynamic routing need not be a concern; the cloud provider guarantees it.

For the implementation of media-server cascading, you can refer to the ideas in How to use open source SFU to build RTC cloud services. For a cascading implementation based on mediasoup, see Broadcasting of stream from One to Many using multiple Mediasoup Server instances.

Media Server SFU

For an SFU, the basic functions are media packet forwarding and QoS guarantees. The key is how to use UDP to transmit audio and video in real time over a complex public network, so as to give users a low-latency, clear, and smooth interactive experience.

Broadly, QoS can be approached from two angles:

From the channel-coding angle: ARQ/NACK packet-loss retransmission, FEC, congestion control (REMB/Transport-CC), Jitter Buffer, IDR requests, and so on.

From the source-coding angle: scalable video coding (SVC), Simulcast (large and small streams), Long-Term Reference (LTR) frames, and so on.

For an introduction to the commonly used QoS methods, refer to the following resources:

From entry to advanced|How to build a video conference based on WebRTC
Chat about WebRTC gateway server 4: QoS solution analysis
webrtc video QOS method summary
webrtc audio QOS method summary
WebRTC Qos optimization miscellaneous notes

I don't want to pile up term definitions here; interested readers can use the resources above to compare their differences and applicable scenarios.

Many QoS design methods and ideas are inherited from TCP and VoIP (Voice over IP), so for some keywords and concepts you can consult resources from those fields.

For example, studying congestion control is inseparable from TCP congestion-control theory and algorithms. You can refer to these resources:

TCP congestion control diagrams (excluding RTO, because it is too simple)
Google's TCP BBR congestion control algorithm analysis
30 diagrams: TCP retransmission, sliding window, flow control, and congestion control
A long-read summary of the strongest TCP/IP congestion control on the Internet
10,000 words in detail: TCP congestion control explained
TCP Those Things (Part 1)
TCP Those Things (Part 2)
WebRTC GCC Congestion Control Algorithm Detailed Explanation
WebRTC Congestion Control Based on GCC (Part 1) - Algorithm Analysis
WebRTC Congestion Control Based on GCC (Part 2) - Implementation Analysis
WebRTC Send-side Bandwidth Estimation (Sendside-BWE) Based on Transport-CC and Trendline Filter
BBR Congestion Control in Real-time Video Transmission
What is RMCAT congestion control, and how will it affect WebRTC?
Congestion control for real-time video streaming - NADA, GCC, SCReAM
Bug: webrtc:9883 Remove unused BBR congestion controller

I only came across Long-Term Reference (LTR) frames a while ago, and it looks like a good tool. Although I haven't had a chance to use it in practice yet, the known benefits are: avoiding frequent keyframe requests, reducing the impact on the network, and helping to cope with weak networks and alleviate congestion.

So I'd still like to recommend a few detailed introductions:

From entry to advanced | How to build a video conference based on WebRTC
Video coding with long-term reference frames (LTR) based on WebRTC
Improve Encoding Efficiency and Video Quality with Adaptive LTR
Cisco ClearPath Whitepaper

Let me share a few of my own thoughts:

  1. End-to-end/full-link packet loss rate vs. uplink/downlink segmented packet loss rate


Native WebRTC was designed for P2P scenarios, so its QoS policies operate end to end: when the receiver detects packet loss, it sends a NACK to the sender to request retransmission. If the full-link path (RTT) is too long, the efficiency of retransmission and data recovery suffers.

As shown in the figure above, under an SFU architecture the retransmission request is no longer full-link feedback but is segmented between client and server. On one side, the server can issue NACK requests itself: as soon as it detects uplink loss, it promptly asks the sender for retransmission. On the other side, the server can answer the receiver's NACK requests in time, as sketched below. The improved retransmission efficiency helps reduce end-to-end delay and improves the audio/video experience.
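The following sketch shows the server-side half of this idea: a small retransmission cache from which the SFU answers a receiver's NACK locally, forwarding upstream only the sequence numbers it never received itself. All names and the cache size are hypothetical:

```go
// Minimal sketch of segmented NACK handling on the SFU side.
package sketch

type rtxBuffer struct {
	packets map[uint16][]byte // seq -> serialized RTP packet
	order   []uint16          // insertion order, for eviction
	cap     int
}

func newRtxBuffer(capacity int) *rtxBuffer {
	return &rtxBuffer{packets: make(map[uint16][]byte), cap: capacity}
}

// store keeps a recently forwarded packet so downlink NACKs can be
// answered from the edge instead of traveling the whole link.
func (b *rtxBuffer) store(seq uint16, pkt []byte) {
	if len(b.order) >= b.cap { // evict the oldest entry
		old := b.order[0]
		b.order = b.order[1:]
		delete(b.packets, old)
	}
	b.packets[seq] = pkt
	b.order = append(b.order, seq)
}

// onNack retransmits locally cached packets; only packets the SFU never
// received from upstream need an upstream NACK of their own.
func (b *rtxBuffer) onNack(seqs []uint16, send func([]byte)) (missing []uint16) {
	for _, s := range seqs {
		if pkt, ok := b.packets[s]; ok {
			send(pkt) // local retransmission shortens recovery time
		} else {
			missing = append(missing, s)
		}
	}
	return missing
}
```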

In terms of implementation, one option is to keep the RTP sequence number (Seq) unchanged: receive the RTP packet from the sender, rewrite the SSRC, and forward it directly to the receiver. This has a serious problem: the packet loss perceived on the downlink is cumulative. If a user sees downlink loss, it may come from the actual downlink between the user and its SFU node, or it may have happened upstream of that link: loss on the streaming sender's uplink, or loss in SFU-to-SFU cascade forwarding. If the sender's uplink is poor, the receiver will still perceive loss and send NACK retransmission requests that can never be satisfied; such invalid requests are very inefficient. In this design, the loss rate computed at the receiver is the end-to-end/full-link loss rate.

The design above is clearly bad. The improvement is to distinguish, when forwarding through an SFU node, between the original Seq and a rewritten per-hop Seq. The original Seq is used to compute the end-to-end/full-link loss rate, while the per-hop rewritten Seq is used to compute the loss rate of each downlink segment. A retransmission request is then triggered only by loss within the corresponding segment, improving the efficiency of retransmission and data recovery. A sketch follows.
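A minimal sketch of the dual-Seq idea, assuming the original sequence number is carried end to end (for example in a header extension) while each hop assigns its own contiguous numbering; the types are illustrative:

```go
// Sketch of per-hop sequence rewriting for segmented loss statistics.
package sketch

type seqRewriter struct {
	nextOut uint16 // next outgoing per-hop sequence number
}

type forwardedPacket struct {
	OrigSeq uint16 // end-to-end Seq from the original sender (full-link loss)
	HopSeq  uint16 // rewritten Seq, valid only on this downlink (segment loss)
	Payload []byte
}

func (r *seqRewriter) rewrite(origSeq uint16, payload []byte) forwardedPacket {
	p := forwardedPacket{OrigSeq: origSeq, HopSeq: r.nextOut, Payload: payload}
	r.nextOut++ // per-hop Seq stays contiguous even if upstream lost packets
	return p
}
```

Because the per-hop Seq stays contiguous even when upstream packets were lost, a receiver's gap detection now reflects only its own segment, which is exactly what makes the segmented NACK above efficient.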
 

  2. End-to-end/full-link RTT vs. uplink/downlink segmented RTT

First, following the discussion in Voice over IP End-to-End Delay Measurements, here is a general overview of end-to-end/full-link delay:

When we evaluate the performance and quality of a real-time communication system, end-to-end delay is a very important metric and reference standard.
For the client, which must decide how long to wait before rendering, full-link end-to-end delay statistics are particularly important. For SFU NACK retransmission control, an accurate RTT is also needed, but this is the RTT of the uplink or downlink segment; the RTCP-XR scheme can be used to compute the segmented RTT, as sketched below.
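For reference, a hedged sketch of the arithmetic involved: with a standard RTCP receiver report the sender computes RTT = arrival - LSR - DLSR from the "middle 32 bits" NTP timestamps (RFC 3550), and RTCP-XR (RFC 3611, RRTR/DLRR blocks) applies the same arithmetic in the reverse direction so each client-to-SFU segment gets its own RTT:

```go
// Sketch of one segment's RTT from RTCP report timestamps.
package sketch

// segmentRTT: arrivalNTP is the local receive time of the report, lsr echoes
// the timestamp of our last report, and dlsr is the peer's processing delay;
// all three are "middle 32 bits" NTP values in 16.16 fixed point.
func segmentRTT(arrivalNTP, lsr, dlsr uint32) uint32 {
	if lsr == 0 {
		return 0 // the peer has not yet echoed one of our reports
	}
	return arrivalNTP - lsr - dlsr // uint32 arithmetic wraps correctly
}

// To convert the 16.16 fixed-point result to milliseconds:
// ms = (uint64(rtt) * 1000) >> 16
```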

  3. Does an SFU need to reorder RTP packets? Does it need a Jitter Buffer?

To answer this, we only need to be clear about the role of the Jitter Buffer. As the name implies, it maintains a data buffer at the receiving end to absorb a certain degree of network jitter, packet loss, and reordering. Its essence is to smooth decoding and reduce the impact of reordering and jitter on the decoder.

An SFU node only forwards audio/video packets, without decoding or re-encoding, so it has no need to buffer or reorder and no need for a Jitter Buffer. On the one hand, buffering would introduce unnecessary delay; on the other hand, before decoding and rendering, the receiving user already buffers, reorders, and assembles frames from the packets arriving off the network, and it is there that the Jitter Buffer design must balance receive latency against stuttering.

For the control logic of the client-side video Jitter Buffer, see WebRTC Video JitterBuffer Detailed Explanation; for audio, see the WebRTC NetEQ series:

Audio-related NetEQ in WebRTC (1): overview
Audio-related NetEQ in WebRTC (2): data structures
Audio-related NetEQ in WebRTC (3): packet reception and delay calculation
Audio-related NetEQ in WebRTC (4): control command decisions
Audio-related NetEQ in WebRTC (5): DSP processing

By the same reasoning, a Jitter Buffer must be considered when designing an MCU mixing server, since the MCU decodes and mixes media rather than just forwarding it.

  4. In a 1-to-many scenario, one user's downlink is poor and packet loss is high. How do we avoid affecting everyone else?

In a 1-to-many video-conferencing scenario, each receiver's bandwidth and device capability differ. If one receiver has a poor downlink, high packet loss, or a weak device, it is obviously not advisable to force the sender to lower its upstream resolution and bitrate, since that would hurt the audio/video experience of every other user in the conference.

To solve this, the first step is a segmented congestion-control strategy: ideally, the sender controls its source encoding and sending bitrate according to the uplink bandwidth estimate, while the SFU uses the downlink bandwidth estimate to cap the bitrate delivered to each receiver.

The practical problem is that when the SFU has only one video stream to forward, controlling delivery per link is like asking the cleverest cook to prepare a meal without rice. Two source-coding strategies help here: Simulcast and SVC.

Simulcast: the large-and-small-stream approach, meaning encoding/sending multiple streams of the same video at once, for example one 720p stream and one 180p stream. When forwarding to a receiver, the SFU can choose the resolution that fits that receiver's downlink bandwidth limit, taking care of the experience on every end.

A system using Simulcast looks like this:
 

SVC: scalable video coding, a layered approach that provides temporally or spatially scalable coding combinations. In RTC applications, temporal SVC is usually chosen, achieving scalability by varying the frame rate. The SFU can extract different temporal layers from the same SVC stream according to the actual downlink bandwidth and deliver them to each receiver, likewise achieving differentiated stream forwarding. A minimal selection sketch follows.
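Here is a minimal sketch of per-receiver selection under either strategy, assuming a made-up bitrate table; a real SFU would also apply hysteresis to avoid rapid switching between layers:

```go
// Illustrative per-receiver layer selection from a downlink bandwidth estimate.
package sketch

type layer struct {
	Name    string
	Bitrate int // bits per second consumed when forwarding this layer
}

// Example table: high/low simulcast streams, or SVC temporal layer sets.
var layers = []layer{
	{"720p", 1_500_000},
	{"180p", 200_000},
}

// selectLayer returns the best layer that fits the receiver's estimate,
// falling back to the lowest layer so the receiver is never starved.
func selectLayer(downlinkBwe int) layer {
	for _, l := range layers {
		if l.Bitrate <= downlinkBwe {
			return l
		}
	}
	return layers[len(layers)-1]
}
```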
 

Note: this section draws heavily on the discussions and figures in From entry to advanced | How to build a video conference based on WebRTC; if there is any infringement, please contact me to delete.

  5. Do real-time bandwidth estimation and congestion-control algorithms need to consider fairness? Is it ethical to blindly grab bandwidth?


I haven't thought this through yet, so I'll just raise the question. Anyone who has studied TCP congestion control knows that fairness should be taken into account when designing a congestion-control algorithm. As an application that must guarantee audio/video quality, should we be conservative and yield for the sake of fairness, or grab more whenever possible? I don't have an answer yet, but blindly grabbing bandwidth feels immoral, disharmonious, and ugly.

Self-developed vs. open source transformation

Currently popular open source media servers are:

There are already many comparisons of them online, and that is not today's focus. What I want to discuss is: when selecting a technology early on, should you build from scratch or adapt and extend an open source project? And should you choose C? C++? Or Go?

If you have rich video-conferencing experience and a strong team, building from scratch is certainly understandable: you get 100% control of the system you design, and you can draw on past experience to choose the most suitable and concise solutions. Whether self-development succeeds, of course, depends on the right timing, place, and people. For most small companies, adapting an excellent open source project is definitely the better choice.

Here I suggest choosing an excellent network I/O library or framework and an efficient, convenient programming language. Go is arguably a great fit today. However, the impact of its GC on streaming real-time performance deserves our attention and study.
