TOP100summit: [Session Notes - Qzone] The technical optimization behind billion-scale live broadcasting

This article is based on the case shared by Wang Hui, R&D director of the Qzone client, at the 2016 TOP100summit.
Editor: Cynthia

Wang Hui: R&D director in Tencent's SNG social platform department and senior engineer in charge of Tencent Qzone mobile client technology. He has worked on Qzone technology since 2009, lived through the transition from the Web era to mobile client technology, and has solid technical experience on both the Web and mobile.

Introduction: With the rapid growth of the mobile Internet, 2016 saw explosive growth of video technology in social networks. Short video, live video, video filters, face animation, music, karaoke, voice changing, co-streaming, and other features launched one after another, and the challenge became how to guarantee a stable, smooth experience while shipping features fast.
The main challenges were:
1. In complex network environments, how do we keep the video playback success rate high, keep live streams smooth, and reduce stutter?
2. For the experience, how do we make playback fast and fluid, opening a live stream in about a second and starting playback instantly?
3. For performance, with filters, beautification, and face animation all enabled during a live broadcast, how do we protect the anchor-side frame rate?
4. With hundreds of millions of concurrent users, how do we safeguard quality and push the flexible bandwidth strategy to its limit?
This case study focuses on these challenges and walks through the optimizations the Tencent Qzone team attempted.

1. Case introduction

Qzone is currently the largest SNS community in China: at daily peak, 500 million pictures are uploaded and 1 billion videos are played. 630 million users share their lives and keep their memories here, and the mainstream user group is young people born after 1995.
Precisely because they are young — and the young are the main driving force behind lifestyle change — they are not satisfied with the traditional picture-plus-video way of sharing. Through live broadcast they want you to see them "now, right away, immediately". This is an upgrade of content consumption, and with better phone hardware and falling traffic costs, mobile live broadcast became feasible.
Our live broadcast product is positioned around the everyday life of our target users plus the power of social distribution, which differs from today's mainstream celebrity-plus-gaming live broadcast model.
The benefit: user-created content is more diverse and closer to life, so it resonates and spreads among friends.
The problem: the range of mobile devices we have to support is also massive, so performance issues are our central concern.

Against this background we started building live broadcast. The goal was to stand up a closed-loop live broadcast capability within one month — that is, ship fast. We had to implement starting a live broadcast, watching a live broadcast (one broadcaster, many viewers), watching replays (playback), live interaction (comments in the live room, gifts, and so on), and content precipitation (Feeds, sharing a broadcast, and so on) (refer to Figure 1); support Android, iOS, and HTML5 at the same time (which requires a mature solution); and support Qzone both standalone and embedded in mobile QQ.

[Figure 1]

At the outset we faced a tight schedule and a team with little accumulated live broadcast technology. Even so, we gritted our teeth, kept talking with the major relevant technology providers, and selected technology against our own criteria and their recommendations. Our criteria were:
● Professionalism (low live broadcast latency, full-platform support, well-established basic services);
● Source code made available;
● A high level of support, with any problem solvable through direct communication at any time;
● Support for dynamic scaling.

In the end we selected the ILVB live broadcast solution from Tencent Cloud, not least because its audio/video group has years of accumulated technology in this area and, being in the same department, we could cooperate for a win-win.
It is worth mentioning that our closed-loop R&D model also pushed us and our partners to keep improving product quality: go online fast (complete the product requirements and build monitoring), study the monitoring data after launch (data analysis), apply it to optimization work (follow the data, run targeted optimizations), verify the optimization on a grayscale slice of users, and finally decide from the results whether to roll it into the product (see Figure 2).

[Figure 2]

In the end we hit the one-month launch target and supported both Qzone and mobile QQ (the Qzone embedded in mobile QQ). Since then we have iterated through 12+ versions. Viewing volume grew from one million in May to ten million in August and has now reached one hundred million — a live broadcast product with nationwide participation. Product data kept climbing along with user demand, and with that came all kinds of feedback, above all performance issues, which are the focus of this article.

2. Live broadcast architecture

Before introducing the live broadcast architecture, it is worth reviewing H.264 encoding, since live video on the market today is basically all H.264. H.264 has three frame types: a fully encoded frame is an I-frame; a frame that references the preceding I-frame and encodes only the differences is a P-frame; a frame encoded with reference to both the frames before and after it is a B-frame. The H.264 compression process is: grouping (divide the frames into GOPs, i.e. frame sequences), determining frame types, predicting frames (use the I-frame as the base frame, predict P-frames from it, then predict B-frames from the I- and P-frames), and transmitting the data.
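To make these frame types concrete, here is a minimal sketch (my illustration, not from the talk) that walks an H.264 Annex B byte stream, finds the start codes, and prints each NAL unit type; type 5 is an IDR slice, i.e. the I-frame that begins a new GOP (telling P- from B-frames would require parsing the slice header, omitted here):

```java
// Minimal H.264 Annex B scanner: finds start codes (00 00 01 / 00 00 00 01)
// and prints the NAL unit type of each unit. Type 5 = IDR slice (an I-frame,
// the start of a GOP); type 1 = non-IDR slice (P/B frames).
public final class NalScanner {
    public static void scan(byte[] stream) {
        for (int i = 0; i + 3 < stream.length; i++) {
            boolean sc3 = stream[i] == 0 && stream[i + 1] == 0 && stream[i + 2] == 1;
            boolean sc4 = i + 4 < stream.length && stream[i] == 0 && stream[i + 1] == 0
                    && stream[i + 2] == 0 && stream[i + 3] == 1;
            if (!sc3 && !sc4) continue;
            int headerPos = i + (sc3 ? 3 : 4);
            int nalType = stream[headerPos] & 0x1F;  // low 5 bits of the NAL header
            System.out.println("offset=" + headerPos + " nal_unit_type=" + nalType
                    + (nalType == 5 ? "  <-- IDR (I-frame, GOP start)" : ""));
            i = headerPos;                           // continue after this header
        }
    }
}
```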
To explain the video model with a simple analogy: if a GOP (Group of Pictures) is a train hauling goods, then a video is a freight fleet made up of N trains (Figure 3).

[Figure 3]

A live broadcast is flowing video data: shooting, transmission, playback. The data is produced and loaded by the anchor, carried across the network (the railway), and unloaded at the audience side for playback (consumption).
Trains need scheduling, and so do video streams — that is the video streaming protocol, which keeps the video flowing to the audience in order.
Common protocols are shown in Figure 4:

[Figure 4]

We use QAVSDK, a UDP-based protocol developed by Tencent Cloud.

With protocols covered, let's look at our live broadcast model, shown in Figure 5:

[Figure 5]

The video room (the video stream) and the business room (the related business logic) have broadly similar structures; the difference is the direction of data flow (note the arrows in Figure 5).
In the video room, data flows from the anchor to the video server over the video streaming protocol, and the server sends it on to the audience over the same protocol; the audience decodes and plays it. The anchor only uploads and the audience only downloads. In the business room, anyone may send business requests to the server (comments, for example — although the client blocks some special cases, such as an anchor sending gifts to himself). A more detailed structure is shown in Figure 6:

[Figure 6]

Note: iOS mobile QQ viewers use the RTMP protocol not because iOS cannot support QAVSDK, but because mobile QQ is under pressure to keep its install package small, and the QAVSDK-related SDK takes up considerable space.

3. Technical optimization

Now for the main subject of this article: technical optimization, in four parts — instant-open optimization (a latency exercise), performance optimization, stutter optimization (a problem-analysis exercise), and playback optimization (a cost exercise).
Before optimizing, the essential groundwork is monitoring and statistics. We instrument the data points we care about and build reports and alarms on top of them to support optimization analysis.
Monitoring covers five areas:
● Success rates: the success rate of starting a broadcast, the success rate of watching one, and the error list;
● Latency: time to start broadcasting and time to enter a broadcast;
● Stream quality: stutter frame rate and zero-frame rate;
● Problem localization: per-step flow records, live 2s flow records, and client logs;
● Real-time alarms, via SMS, WeChat, and other channels.

With these in place we can view data, analyze and localize problems, and get real-time alarms, which makes problems much easier to solve.
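As an illustration of this kind of instrumentation — a sketch with assumed names (reportStage, checkAlarm), not the team's actual reporting code — success-rate and per-stage-timing counters might look like:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of client-side quality monitoring: success rate plus per-stage timing.
// Names and the alarm threshold are illustrative, not from the talk.
public final class LiveMonitor {
    private final AtomicLong attempts = new AtomicLong();
    private final AtomicLong successes = new AtomicLong();
    private static final double ALARM_THRESHOLD = 0.90;  // alert below 90% success

    public void reportWatchResult(boolean ok) {
        attempts.incrementAndGet();
        if (ok) successes.incrementAndGet();
    }

    public void reportStage(String stage, long elapsedMs) {
        // Real code would ship this to a reporting backend, not stdout.
        System.out.println("stage=" + stage + " cost=" + elapsedMs + "ms");
    }

    public void checkAlarm() {
        long a = attempts.get();
        double rate = a == 0 ? 1.0 : (double) successes.get() / a;
        if (rate < ALARM_THRESHOLD) {
            System.err.println("ALARM: watch success rate " + rate); // hook SMS/WeChat here
        }
    }
}
```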

4. Instant-open optimization

Almost everyone was complaining: "why is the live broadcast so slow to open — the competitors are much faster than us!" We couldn't stand it ourselves either. We wanted instant open: from tapping the broadcast to seeing the picture in under one second. The measured average on the external network was 4.27 seconds — some distance from that goal.
So we laid out the timing sequence from tapping to rendering the first frame and measured the cost of each stage. The flow and timings are roughly as in Figure 7:

[Figure 7]

Process and data analysis surfaced two causes of the cost: fetching the first frame of data took too long, and the core logic was serial. We attacked both.
First frame too slow: the core problem is getting the first GOP to the viewer faster.
Our plan: have the interface machine cache the first-frame data, and at the same time modify the player to start playback as soon as the I-frame is parsed. This greatly shortens the time until the viewer sees the first picture.
Serial core logic: here we applied the following measures (a sketch of the serial-to-parallel step follows this list):
● Preloading — prepare the environment and data in advance, e.g. pre-pull the live broadcast process from Feeds and fetch the interface machine's IP data early;
● Lazy loading — defer UI, comments, and similar logic, yielding system resources to the first frame;
● Caching — e.g. cache the interface machine's IP data and reuse it for a period of time;
● Serial to parallel — pull data concurrently to save time;
● Per-step optimization — audit each step's cost and trim it.
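A minimal sketch of the serial-to-parallel step (method names are hypothetical stand-ins, not the real client logic): fetch the interface machine's IP and the cached first-frame data concurrently instead of one after the other:

```java
import java.util.concurrent.CompletableFuture;

// Sketch: turn two serial network steps into parallel ones, so overall
// latency becomes max(a, b) instead of a + b.
public final class ParallelOpen {
    public static void openLiveRoom(String roomId) {
        CompletableFuture<String> ipFuture =
                CompletableFuture.supplyAsync(() -> fetchInterfaceIp(roomId));
        CompletableFuture<byte[]> frameFuture =
                CompletableFuture.supplyAsync(() -> fetchFirstFrame(roomId));

        // Begin rendering as soon as both results have arrived.
        ipFuture.thenAcceptBoth(frameFuture,
                (ip, frame) -> startPlayback(ip, frame)).join();
    }

    private static String fetchInterfaceIp(String roomId) { return "10.0.0.1"; } // stub
    private static byte[] fetchFirstFrame(String roomId) { return new byte[0]; } // stub
    private static void startPlayback(String ip, byte[] firstFrame) { /* stub */ }
}
```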
The optimized flow and timings are roughly as in Figure 8: the cost dropped to 680 ms — goal achieved!

[Figure 8]

5. Performance optimization

The product kept iterating, live broadcast gameplay got richer and richer, and performance problems kept surfacing. In particular, after we added animated stickers, filters, and voice changing, large numbers of users reported that broadcasts stuttered badly.
Statistics showed the anchor-side frame rate was very low: the picture was discontinuous, which is subjectively felt as stutter. In addition, a large share of users were on low-end devices. See Figure 9.

[Figure 9]

Analysis showed the low frame rate was mainly caused by single-frame image processing taking too long, which drags down the encoding frame rate. Roughly, total cost = processing workload × per-frame cost (for example, at 25 fps the per-frame budget is 40 ms; if filters and stickers alone take longer than that, the frame rate has to fall). So we optimized both factors in turn.

Reduce the image-processing workload
● Match the capture resolution to the processing resolution. For example, we encode at 960×540, but since some phones cannot capture at that resolution, capture is typically 1280×1024; scaling first and processing afterwards cuts the time filters and animated stickers spend per image. See Figure 10.

[Figure 10]

● Drop frames before processing them. We set a capture frame rate on the system camera, but on many models it does not take effect, so we discard surplus frames by policy to shrink the image-processing load. For example, with the capture rate set to 15 but an actual rate of 25, the extra 10 frames per second would be thrown away at encoding time anyway; dropping them up front saves the resources they would have consumed (see the sketch after this list).
● Tier the devices. Different models get different capture and encoding frame rates according to their hardware capability, to keep things smooth; the frame rate is also adjusted dynamically when the device overheats and when it recovers.
● Optimize face-recognition capture: run recognition once every two frames instead of every frame, which avoids face-sticker drift while reducing processing cost.
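The frame-dropping strategy can be as simple as timestamp decimation. A sketch (illustrative, not the production code): keep at most targetFps frames per second and discard the rest before they enter the filter/sticker pipeline:

```java
// Sketch of pre-processing frame dropping: if the camera actually delivers
// 25 fps but the tier's target is 15 fps, discard surplus frames *before*
// the expensive filter/sticker stage rather than after encoding.
public final class FrameDropper {
    private final long frameIntervalNs;
    private long nextDeadlineNs = 0;

    public FrameDropper(int targetFps) {
        this.frameIntervalNs = 1_000_000_000L / targetFps;
    }

    /** Returns true if this camera frame should be processed, false to drop it. */
    public boolean accept(long frameTimestampNs) {
        if (frameTimestampNs < nextDeadlineNs) {
            return false;                          // too soon: drop before processing
        }
        nextDeadlineNs = frameTimestampNs + frameIntervalNs;
        return true;
    }
}
```

Dynamic adjustment per device tier (or on overheating) then amounts to swapping in a FrameDropper with a lower target rate.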
Reduce per-frame cost
● Rework the capture pipeline, removing roughly 33% of unnecessary cost, as shown in Figure 11.

[Figure 11]

● Render animated stickers on their own GL thread: sticker rendering moves to a separate OffScreenThread, so it no longer takes time away from the beautification pipeline. The effect is shown in Figure 12:

[Figure 12]

● Animated stickers use OpenGL blending mode;
● Image-processing algorithm optimizations, such as ShareBuffer (copying data between the GPU and memory quickly without CPU involvement, saving the texture-to-RGBA step; the cost is roughly halved and FPS improves by at least 2-3 frames) and filter LUT optimization, as shown in Figure 13.

[Figure 13]

Beyond these two main lines of work, we also pushed more devices onto hardware encoding: first, encoding becomes stable and the frame rate stops fluctuating; second, CPU usage drops to some extent.
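On Android, preferring a hardware encoder can be as simple as the following sketch (my assumption about the approach, not code from the talk; the "OMX.google." name prefix is the classic heuristic for software codecs on devices that predate MediaCodecInfo.isHardwareAccelerated(), API 29+):

```java
import android.media.MediaCodecInfo;
import android.media.MediaCodecList;

// Sketch: pick a hardware H.264 encoder when one exists, else fall back to
// a software encoder.
public final class EncoderPicker {
    public static String pickAvcEncoder() {
        MediaCodecList list = new MediaCodecList(MediaCodecList.REGULAR_CODECS);
        String software = null;
        for (MediaCodecInfo info : list.getCodecInfos()) {
            if (!info.isEncoder()) continue;
            for (String type : info.getSupportedTypes()) {
                if (!type.equalsIgnoreCase("video/avc")) continue;
                if (info.getName().startsWith("OMX.google.")) {
                    software = info.getName();     // remember as fallback
                } else {
                    return info.getName();         // hardware encoder: use it
                }
            }
        }
        return software;                           // may be null on unusual devices
    }
}
```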
Those are our main optimization points; after they landed, user complaints about stuttery broadcasts dropped sharply.

6. Stutter optimization

First, some definitions.
Stuck user: a user for whom (stuck time / total viewing time) > 5%; stuck rate = stuck users / total users.
Anchor-side stuck point: a sample where the frame rate after upstream big-picture encoding is below 5.
Viewer-side stuck point: a sample where the frame rate after decoding is below 5.
Our goal was to bring the stuck rate below 50%.
Note that an upstream stall makes every viewer stutter, while a downstream stall affects only the individual viewer.
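Expressed as code over per-user statistics (field and class names are illustrative), the stuck-rate definition reads:

```java
import java.util.List;

// Sketch: compute the stuck rate from per-user viewing stats, using the
// definitions above (a user is "stuck" if stuckMs / totalMs > 5%).
public final class StuckRate {
    public static final double STUCK_USER_THRESHOLD = 0.05;

    public static class UserStats {
        public long stuckMs;   // accumulated time with frame rate < 5
        public long totalMs;   // total viewing time
    }

    public static double stuckRate(List<UserStats> users) {
        if (users.isEmpty()) return 0.0;
        long stuckUsers = users.stream()
                .filter(u -> u.totalMs > 0
                        && (double) u.stuckMs / u.totalMs > STUCK_USER_THRESHOLD)
                .count();
        return (double) stuckUsers / users.size();
    }
}
```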
Now let's look at the causes of stutter, in Figure 14:

[Figure 14]

Stutter can arise in three broad modules: the anchor side, the network, and the viewer side. Anchor-side performance was largely handled above, so we turned to the network and the viewer side.
Statistics showed network quality accounted for roughly 50% of stutter, so it clearly needed work. We optimized the uplink as shown in Figure 15: reduce the data in each frame and reduce the number of frames — in the train analogy, lighten each train's load and run fewer trains.

[Figure 15]
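A sketch of the uplink idea (thresholds and step sizes are illustrative, not production values): watch the sender's backlog, step the encoder's bitrate and frame rate down under congestion, and probe back up as the queue drains:

```java
// Sketch of adaptive uplink control: when the send queue backs up, reduce
// per-frame data (bitrate) and frame count (fps); recover when it drains.
public final class UplinkController {
    private int bitrateKbps = 800;
    private int fps = 20;

    public void onQueueDepth(int queuedFrames) {
        if (queuedFrames > 30) {            // congested: lighten each train, run fewer
            bitrateKbps = Math.max(200, bitrateKbps - 100);
            fps = Math.max(10, fps - 2);
        } else if (queuedFrames < 5) {      // draining: probe back up slowly
            bitrateKbps = Math.min(800, bitrateKbps + 50);
            fps = Math.min(20, fps + 1);
        }
        applyToEncoder(bitrateKbps, fps);   // hypothetical hook into the encoder
    }

    private void applyToEncoder(int kbps, int fps) { /* stub */ }
}
```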

Viewer-side downlink optimization is, in the same analogy, to stockpile goods and discard the surplus: buffer ahead to absorb jitter, and when the backlog grows too large, throw frames away to catch up, as shown in Figure 16.

[Figure 16]
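The viewer-side counterpart, again as a sketch with illustrative watermarks: keep a small play-out buffer of decoded frames, and once the backlog passes a high watermark, drop the oldest frames to catch back up to real time:

```java
import java.util.ArrayDeque;

// Sketch of viewer-side "stockpile then discard": buffer decoded frames to
// absorb jitter; if a burst pushes the backlog past the high watermark,
// discard old frames rather than playing them late.
public final class PlayoutBuffer<F> {
    private static final int HIGH_WATERMARK = 50;  // illustrative
    private static final int KEEP_AFTER_DROP = 10; // illustrative
    private final ArrayDeque<F> frames = new ArrayDeque<>();

    public synchronized void push(F decodedFrame) {
        frames.addLast(decodedFrame);
        if (frames.size() > HIGH_WATERMARK) {
            while (frames.size() > KEEP_AFTER_DROP) {
                frames.removeFirst();              // catch up to real time
            }
        }
    }

    public synchronized F nextFrameToRender() {
        return frames.pollFirst();                 // null when the buffer is empty
    }
}
```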

Figure 17 shows the optimized result, a clear advantage over competing products:

[Figure 17]

Meanwhile the anchor-side stuck rate dropped to 30% and the viewer-side stuck rate to 40%. Goal achieved.

7. Playback optimization

Figure 18 shows the general flow of live replay (playback):

[Figure 18]

Playback had its own problems, mainly in two areas:
● Replay quality: the replay video is whatever was saved on the server side, so its quality is heavily affected by the anchor's momentary network conditions;
● Server cost: besides relaying the private-protocol stream, the server also has to transcode the private stream into HLS and MP4 for replay.

On format selection: MP4 playback is mature, fast, and gives a good user experience, while HLS has weaker system support and longer user wait times. Does that mean we should simply choose MP4?
Not quite. Both HLS and MP4 can serve replays, but during a live broadcast the data is still changing, and there HTML5 can only use HLS. If we used MP4 for replays, the server would have to transcode the private-protocol stream into both MP4 and HLS — clearly uneconomical. That pushed us to choose HLS, so the server transcodes each stream only once.
Having chosen HLS, we then had to solve HLS's problems head-on.
On Android, HLS has been supported since Android 3.0. Later, as Google pushed DASH as a replacement, HLS support gradually weakened; the official documentation barely mentions HLS at all. In practice we found that the native Android player merely supports HLS, with no optimization whatsoever: there are redundant m3u8 file requests, and everything after playback starts runs serially, which badly hurts the time until the first video frame is visible (about 4.5 s on average).
We solved this with a local proxy that downloads ahead of time. With the download proxy in place, the proxy layer scans the m3u8 file's contents and triggers parallel downloads of the ts segments into a cache. The player still downloads serially, but since the data is already prepared it is returned immediately, cutting the first-frame cost; after this change the average time to a visible first frame fell to about 2 s.
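A sketch of the proxy-side prefetch using only the standard library (class and constant names are mine, not the team's): scan the playlist for segment entries and download them in parallel into a cache the local proxy then serves from:

```java
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.util.Map;
import java.util.concurrent.*;

// Sketch: scan an m3u8 playlist and prefetch its .ts segments in parallel,
// so the (serial) system player gets instant cache hits from the local proxy.
public final class HlsPrefetcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    public void prefetch(String m3u8Url, String m3u8Body) {
        for (String line : m3u8Body.split("\n")) {
            String entry = line.trim();
            if (entry.isEmpty() || entry.startsWith("#")) continue;  // skip tags
            String tsUrl = URI.create(m3u8Url).resolve(entry).toString();
            pool.submit(() -> cache.computeIfAbsent(tsUrl, HlsPrefetcher::download));
        }
    }

    private static byte[] download(String url) {
        try (InputStream in = new URL(url).openStream()) {
            return in.readAllBytes();              // Java 9+
        } catch (Exception e) {
            return null;                           // real code would retry
        }
    }
}
```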
The flowchart before and after optimization is shown in Figure 19:

[Figure 19]

On caching strategy, the industry has no mature solution for HLS caching. We implemented automatic detection and support for the three modes in Figure 20, so users never need to care about the underlying caching and download logic.

[Figure 20]

In the end we saved 50% of the server-side transcoding computation and storage cost, and replay loading became faster as well.

8. Case summary

From the preceding cases and optimization analysis, we distilled general problem patterns and the optimization approach for each:
● Speed problems: lay out the timing sequence, measure the cost of each stage, and knock them down one by one;
● Performance problems: use tracing to locate the points where performance is lost, and knock them down one by one.

Figure 21 sums this up as a routine:

[Figure 21]

The case also illustrates these takeaways:
● Iterate fast — small steps, quick runs;
● Let monitoring drive optimization;
● Build models to make abstract problems intuitive to analyze;
● Product positioning determines the direction of optimization;
● In a massive-scale service, small savings add up to big ones.

Finally, this article closes with a topology map of the live broadcast system (Figure 22).

[Figure 22]

The TOP100 Global Software Case Study Summit has been held six times, selecting outstanding software R&D cases from around the world, with 2,000 attendees each year. It covers product, team, architecture, operations, big data, artificial intelligence, and other technical tracks, offering first-hand R&D practice from front-line Internet companies such as Google, Microsoft, Tencent, Alibaba, and Baidu.
