TOP100summit [Session Notes - NetEase] Building a Cloud Live-Streaming Distribution Network

This article is based on a case study shared at the 2016 TOP100summit by Shao Feng, a server-side technology expert at NetEase Video Cloud, NetEase Hangzhou Research Institute.

Editor: Cynthia

On November 9-12, 2017, the 6th TOP100summit will be held at the Beijing National Convention Center; leave a comment for a chance to win a free trial ticket.

 

Shao Feng: Server-side technology expert, NetEase Video Cloud, NetEase Hangzhou Research Institute

He graduated from Zhejiang University with a Ph.D. in Computer Science.

Since graduating, he has worked on databases and distributed storage, with about ten years of server-side development experience.

He currently leads product R&D at NetEase Video Cloud and has extensive hands-on experience in server, storage, and database development.

 

Introduction: While developing NetEase Video Cloud's live-streaming product, the R&D team ran into the problem of live-stream stalling. Providing a stable, smooth, stall-free live-streaming service was an urgent problem at the time. Through client-side analysis, network statistics, and other means, the team traced the root cause of stalling to a poor live-stream distribution network. Whether a reliable distribution network can be provided determines whether the stream stalls, and ultimately determines the viewer's live-streaming experience.

To guarantee a smooth, essentially stall-free live-streaming experience, the technical team adopted a converged distribution network architecture. With this converged network, the live-streaming cloud service largely eliminated stalling and ensured a smooth experience. This article describes how NetEase's cloud live-stream distribution network was built and optimized.

 

1. The Problem

 

The live-streaming business has grown rapidly, but the technology behind it has a relatively high barrier to entry. To lower that barrier and let product developers build live-streaming products quickly, the live-streaming cloud service emerged. A live-streaming cloud service provides an end-to-end solution: capture, encoding, and streaming on the broadcaster side; transcoding and distribution on the network side; decoding and playback on the viewer side; and so on. Each link involves a large body of technology and affects the quality of the live stream.

NetEase Video Cloud provides exactly this live-streaming cloud service for developers and users. While building it, we found that network experience is the pain point of every live-streaming product: stability, smoothness, and freedom from stalling are universal demands. So how do we make our live-streaming cloud service deliver a good, stall-free experience? At first we applied a series of audio/video techniques to optimize the broadcaster and player ends, but the experience did not substantially improve. After analysis we turned our attention to the network side, so the question became: how do we optimize the distribution network so that live streams do not stall and the experience is good?

On the distribution side, we initially used a traditional third-party CDN, but third-party CDNs have many limitations in bandwidth configuration and node deployment, and live-stream line adjustments were done manually, so they could not meet our optimization needs. We then built a distribution network ourselves, but with too few nodes its distribution performance was not ideal. We subsequently connected to several third-party CDNs, yet none of them could fully satisfy our requirement of "no stalling, good experience". Finally, we asked: could we build a converged live-stream distribution network on top of the existing single-CDN networks to meet the requirements of stability and smoothness? Thus began our journey of building a converged distribution network.

 

2. The practical process

 

Along the time dimension, the construction of our cloud live-stream distribution network falls into three progressive stages: the single-CDN stage, the multi-CDN stage, and the converged-CDN stage.

 

2.1 Single CDN Stage

 

Single CDN - Build Phase

 

 

For the cloud live-streaming service we chose a traditional third-party CDN distribution network. By wrapping and calling the third-party CDN's distribution interfaces, we built a complete distribution network service; its basic architecture is shown in Figure 1.

 

Figure 1. Single CDN distribution network

 

The advantage of a single-CDN distribution network is that it is simple to implement and can be wrapped and deployed quickly. In the early days of our cloud service, this solution helped us ship the product fast and apply it in real scenarios.

But the drawbacks are obvious: the network is unstable, the stall rate is high, and line tuning is cumbersome.

 

Single CDN - Optimization Phase

 

Once our user base reached a certain scale, the problems of a single CDN became concentrated. Above all, the network was unstable, with frequent stalls and disconnections, and support for domestic carrier networks was uneven: China Telecom and China Unicom lines performed well, while China Mobile lines performed poorly.

 

Investigating these problems together with the third-party CDN vendor, we found the root causes: insufficient node coverage and insufficient bandwidth resources.

 

Having the CDN vendor add node resources and optimize lines partially alleviated the stalling, but problems such as network coverage could not be fundamentally solved.

 

2.2 Multi-CDN Stage

 

Multi-CDN - Build Phase

 

To address the problems of a single third-party CDN, we considered a multi-CDN solution. After benchmarking several CDN vendors, we found that each has local strengths and weaknesses. For example, CDN vendor A supports China Mobile lines better, while CDN vendor B supports China Telecom/China Unicom lines better. Exploiting this, we connected multiple CDNs so that their nodes and lines complement one another. For some special areas, such as small carriers and overseas regions, we deployed our own nodes to build a simple self-developed CDN for regional coverage. The result was a multi-CDN distribution network; its architecture is shown in Figure 2.

 

Figure 2. Multi-CDN distribution network

 

Multi-CDN - Optimization Phase

 

In the multi-CDN distribution network, the cloud management center selects the distribution line for each broadcaster. During stall-rate analysis, we found that the stability of the upstream push stream plays a decisive role. We therefore look up the location of the streaming source from the broadcaster's IP, and then select the best CDN for streaming distribution.

 

For example, if broadcaster A is on a Beijing China Mobile line, we choose CDN II, which has the better uplink for that carrier; if broadcaster B is on a Shanghai China Telecom line, we choose CDN I, which supports Telecom better. The selection policy is configured in the cloud management center, derived from benchmark results or online feedback, and adjusted regularly.
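The uplink selection described above can be sketched as a policy-table lookup. This is a minimal illustration only: the policy table, CDN names, and the IP geolocation stub are assumptions, not NetEase's actual implementation.

```python
# Hypothetical sketch of uplink CDN selection by broadcaster location.
# Policy table: (region, carrier) -> preferred CDN; a default covers the rest.
POLICY_TABLE = {
    ("Beijing", "China Mobile"): "CDN-II",
    ("Shanghai", "China Telecom"): "CDN-I",
}
DEFAULT_CDN = "CDN-I"

def locate(ip):
    """Resolve a push-stream source IP to (region, carrier).
    Stand-in for a real IP geolocation database lookup."""
    demo = {
        "1.2.3.4": ("Beijing", "China Mobile"),
        "5.6.7.8": ("Shanghai", "China Telecom"),
    }
    return demo.get(ip, ("Unknown", "Unknown"))

def select_cdn(push_ip):
    """Pick the CDN with the best uplink for this broadcaster."""
    region, carrier = locate(push_ip)
    return POLICY_TABLE.get((region, carrier), DEFAULT_CDN)

print(select_cdn("1.2.3.4"))  # Beijing Mobile -> CDN-II
print(select_cdn("9.9.9.9"))  # unknown source -> default CDN-I
```

In practice the policy table would be loaded from the cloud management center and refreshed as benchmark results and online feedback come in.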

 

2.3 Converged CDN Stage

 

Converged CDN - Build Phase

 

The multi-CDN distribution network greatly reduced the stall rate, but after running it for a while we found it still has defects: for example, the upstream lines of third-party CDNs cannot be optimized, and live-stream lines cannot be re-tuned on the fly.

 

To address this, we restructured the distribution network and proposed a converged CDN architecture, shown in Figure 3. On top of multiple CDNs, the converged CDN distribution network mainly adds two components: the origin station and the intelligent cloud dispatch center.

● By building our own streaming origin station, we can optimize the live-stream uplink to the greatest extent.

● Through the intelligent cloud dispatch center, we can adapt to the network environment and dynamically adjust uplink and downlink lines as the network changes.

 

Figure 3. Converged CDN distribution network

 

Converged CDN - Optimization Phase

 

We are currently in the converged-CDN usage phase, but we continue to optimize the distribution network. On the downlink, third-party CDN vendors cannot fully cover every area, and the construction/maintenance cost of a self-developed distribution network is too high. Therefore, for downlink areas the CDN vendors cannot cover, if user access density is high, we add a layer of service forwarding at the downlink edge.

 

There are two benefits to this:

● It increases edge coverage while reducing CDN traffic costs;

● Route decisions become more accurate, avoiding the CDN vendors' route drift.
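The cost benefit of edge forwarding can be illustrated with a toy model: an edge node pulls each stream from the CDN once and fans it out to nearby viewers, so N local viewers cost one CDN pull instead of N. The class and method names below are illustrative assumptions.

```python
# Toy model of downlink edge forwarding: one upstream CDN pull per
# stream, fanned out to every local viewer attached to the edge node.
class EdgeNode:
    def __init__(self):
        self.upstream_pulls = 0   # pulls made against the CDN
        self.local_viewers = {}   # stream_id -> local viewer count

    def attach_viewer(self, stream_id):
        if stream_id not in self.local_viewers:
            # The first local viewer triggers the single upstream pull;
            # later viewers are served from the edge relay.
            self.upstream_pulls += 1
            self.local_viewers[stream_id] = 0
        self.local_viewers[stream_id] += 1

edge = EdgeNode()
for _ in range(50):
    edge.attach_viewer("room-42")
print(edge.upstream_pulls)            # 1 pull serves all 50 viewers
print(edge.local_viewers["room-42"])  # 50
```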

 

Its architecture is shown in Figure 4; we are currently building out this optimization phase.

Figure 4. Converged CDN Distribution Network - Improvements

 

 

In building the live-stream distribution network, the design and construction of the converged CDN distribution network is the most critical part.

 

Next, the design of its two key modules is described in detail: the streaming origin station and the intelligent dispatch center.

 

Origin station

 

In the original design, the purpose of the origin station was clear: receive the push stream from the broadcaster and forward it to the CDNs. Since live streams use the RTMP protocol, the origin station mainly implements RTMP protocol handling. Internally, the origin station architecture has three layers: the interface protocol layer, the logical processing layer, and the network distribution layer.

● The interface layer receives and parses the RTMP stream protocol;

● The processing layer performs streaming-media processing;

● The network distribution layer forwards RTMP.

 

Note that each push stream is forwarded to every connected CDN network, so that viewers can pull the stream from different CDNs.
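This fan-out can be sketched as follows. It is a minimal illustration of the distribution layer's relay logic; the class and method names are assumptions, and real RTMP parsing is elided.

```python
# Illustrative sketch of the origin station's fan-out: each incoming
# RTMP push stream is relayed to every connected CDN, so viewers can
# pull the same stream from whichever CDN serves them best.
class OriginStation:
    def __init__(self, cdns):
        self.cdns = cdns      # downstream CDN ingest endpoints
        self.forwarded = []   # record of (stream_id, cdn) relays

    def on_push(self, stream_id, packet):
        # The interface layer has already parsed the RTMP chunk;
        # the distribution layer relays it to every CDN.
        for cdn in self.cdns:
            self.forward(stream_id, cdn, packet)

    def forward(self, stream_id, cdn, packet):
        # Stand-in for an RTMP publish to the CDN's ingest point.
        self.forwarded.append((stream_id, cdn))

origin = OriginStation(["CDN-A", "CDN-B", "CDN-C"])
origin.on_push("room-42", b"\x00" * 128)
print(origin.forwarded)  # one relay record per CDN
```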

 

As the cloud live-streaming business expanded, requirements for interactive live streaming and co-streaming (connected mic) were introduced into the live-streaming framework. We therefore extended the origin station into a multi-protocol origin. The protocol introduced is RTP: live streams with high interactivity or real-time requirements all use RTP, with UDP as the underlying transport. For ordinary broadcast requirements, we connect seamlessly to the existing CDNs through RTMP repackaging and mixed-screen processing. The overall architecture is shown in Figure 5.

 

Figure 5. Multi-protocol origin

 

Origin scheduling

 

We have deployed origin-station clusters in more than 20 major regions across the country, using BGP networks in key regions such as Beijing, Shanghai, Guangzhou, and Hangzhou, and multi-line access elsewhere. This ensures high network quality between users and the origin stations. Origin scheduling is handled by the global scheduling center (GSLB). The dispatch center senses real-time conditions through heartbeat detection and, through its configuration module, dynamically adjusts origin-station configuration such as traffic limits and black/white lists. Before pushing a stream, the broadcaster requests an origin route from the dispatch center, which selects the optimal origin station based on the push source address and the policy table. The overall architecture is shown in Figure 6.
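The heartbeat-driven selection described above can be sketched as follows. This is a hedged sketch only: the timeout value, the region-affinity scoring rule, and all names are illustrative assumptions, not the GSLB's actual logic.

```python
# Minimal sketch of GSLB-style origin selection: origins report
# liveness via heartbeats, and the dispatch center returns the best
# healthy origin for a broadcaster's region.
import time

class Dispatcher:
    HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat -> unhealthy

    def __init__(self):
        self.origins = {}  # name -> {"region": str, "last_beat": float}

    def heartbeat(self, name, region):
        self.origins[name] = {"region": region, "last_beat": time.time()}

    def pick_origin(self, broadcaster_region):
        now = time.time()
        healthy = [
            (name, info) for name, info in self.origins.items()
            if now - info["last_beat"] < self.HEARTBEAT_TIMEOUT
        ]
        # Prefer an origin in the broadcaster's own region.
        for name, info in healthy:
            if info["region"] == broadcaster_region:
                return name
        # Otherwise fall back to any healthy origin.
        return healthy[0][0] if healthy else None

d = Dispatcher()
d.heartbeat("origin-bj", "Beijing")
d.heartbeat("origin-sh", "Shanghai")
print(d.pick_origin("Shanghai"))  # -> origin-sh
```

A production scheduler would also weigh load, carrier, and the policy table, not just region and liveness.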

 

Figure 6. Origin scheduling

 

 

Intelligent tuning by the dispatch center

 

The dispatch center is the core of the entire distribution network: it schedules the uplink access points and downlink pull points in a unified way. The most important element of the dispatch center is the routing rule table. A traditional rule table is statically configured and adapts poorly to the real network. In the converged network, we designed a set of intelligent tuning strategies that dynamically adjust the rules according to actual network conditions. The tuning process, shown in Figure 7, runs as a five-step cycle.

● Step 1, the GSLB dispatch center obtains/analyzes user address information;

● Step 2, the dispatch center obtains the existing dispatch rules;

● Step 3, the dispatch center generates a routing address and delivers it to the client;

● Step 4, both ends report the stall information to the cloud statistics center;

● Step 5, the cloud statistics center periodically analyzes the data, triggers rules, and adjusts the rule base.

Figure 7. Dispatch Center Route Tuning

 

Through these steps, the dispatch center realizes statistical self-tuning.
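The self-tuning loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the 5% threshold is taken from the article's target, but the data shapes, fallback mechanism, and class names are hypothetical.

```python
# Sketch of the statistical self-tuning loop: stall reports accumulate
# in a statistics center, and when a line's stall rate crosses the
# threshold, the rule base is switched to an alternative line.
class StatsCenter:
    STALL_THRESHOLD = 0.05  # 5% target from the article

    def __init__(self, rules, fallback):
        self.rules = rules        # (region, carrier) -> current line
        self.fallback = fallback  # (region, carrier) -> alternative line
        self.reports = {}         # key -> (stalled_minutes, total_minutes)

    def report(self, key, stalled, total):
        # Step 4: both ends report stall information.
        s, t = self.reports.get(key, (0, 0))
        self.reports[key] = (s + stalled, t + total)

    def retune(self):
        # Step 5: periodically analyze data and adjust the rule base.
        for key, (stalled, total) in self.reports.items():
            if total and stalled / total > self.STALL_THRESHOLD:
                if key in self.fallback:
                    self.rules[key] = self.fallback[key]
        self.reports.clear()

rules = {("Beijing", "Mobile"): "line-A"}
stats = StatsCenter(rules, {("Beijing", "Mobile"): "line-B"})
stats.report(("Beijing", "Mobile"), stalled=12, total=100)  # 12% stall rate
stats.retune()
print(rules[("Beijing", "Mobile")])  # switched to line-B
```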

 

3. Effect evaluation

 

We ran a series of comparative tests on the above distribution networks in the production environment, with the stall rate as the core metric. To raise the quality bar for the cloud live-streaming product, we chose the stricter one-minute stall rate rather than the usual duration-based stall rate.

 

The one-minute stall rate works as follows: if the player stalls twice or more within a given minute, that minute counts as stalled. The duration-based stall rate works at one-second granularity: if the player stalls within a second, that second counts as stalled. A player stall itself is defined as follows: the decoding thread fetches data from the player buffer every 3 ms; if the buffer is empty, a stall has occurred. As a rule of thumb, the one-minute stall rate is roughly 4 to 15 times the duration-based stall rate.
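The two metrics defined above can be computed from a per-second stall log, as in the small illustration below. The mapping of "two stalls within a minute" to two stalled seconds per minute is an interpretation for illustration, not the article's exact counting rule.

```python
# Computes the two stall metrics from a per-second stall log
# (True = the player stalled at some point in that second).
def duration_stall_rate(stalls):
    """Fraction of seconds in which the player stalled."""
    return sum(stalls) / len(stalls)

def one_minute_stall_rate(stalls):
    """Fraction of minutes containing at least two stalled seconds."""
    minutes = [stalls[i:i + 60] for i in range(0, len(stalls), 60)]
    stalled_minutes = sum(1 for m in minutes if sum(m) >= 2)
    return stalled_minutes / len(minutes)

# 10 minutes of playback; minute 0 stalls twice, minute 3 stalls once.
log = [False] * 600
log[5] = log[30] = True   # two stalls in minute 0 -> a stalled minute
log[190] = True           # one stall in minute 3 -> not counted

print(duration_stall_rate(log))    # 3/600 = 0.005
print(one_minute_stall_rate(log))  # 1/10  = 0.1
```

On this toy log the one-minute rate is 20x the duration-based rate, in the same spirit as the 4-15x rule of thumb, though the exact ratio depends on how stalls cluster.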

 

 

Figure 8. Two-week stall rate comparison

 

As shown in Figure 8, we compared stall-rate data from the first two weeks of three months, X, Y, and Z. The single-CDN distribution network ran in month X, the multi-CDN network in month Y, and the converged CDN network in month Z. An aggregate stall rate was computed for each day. In those months the cloud platform carried roughly 5 TB, 12 TB, and 20 TB of traffic per day respectively, with more than 98% of traffic inside China and no significant regional variation. The figure shows that the stall rate dropped markedly and, on the converged CDN distribution network, reached our target of <5%.

 

Figure 9. Stall rate optimization ratio

 

As shown in Figure 9, the average stall-rate reductions for single CDN, multi-CDN, and converged CDN are given. The two-week average stall rate with the multi-CDN distribution network was 26% lower than with the single-CDN network, and with the converged CDN network it was a further 44% lower than with the multi-CDN network.

 

Based on these statistics, we conclude that the converged CDN distribution network greatly improves network distribution and brings the stall rate down into the high-quality range below 5%. Next, to push the experience further, we will continue to improve the converged CDN distribution network and consider further optimization on the streaming side.

 

4. Practical Suggestions

● Optimize the network in stages, using a progressive approach;

● Before optimizing the network architecture, analyze it thoroughly to find the key bottlenecks;

● Network data collection is vital; collect as much as possible;

● Dig deep into the details; every small module offers room for substantial work;

● The domestic network environment is peculiar; carrier and regional factors must be considered;

● Edge acceleration is very important; get as close to the user as possible;

● Make good use of third-party services, and build and improve on top of them.

 

From November 9 to 12 at the Beijing National Convention Center, at the 6th TOP100 Global Software Case Study Summit, NetEase Cloud Communication and Video technology expert Liu Xinkun will share "Network Congestion Control and Its Application in the Field of Real-time Communication".
