Implementation Approaches for WebRTC P2P Interactive Live Streaming (Mic-Linking)

The live-streaming industry continues to grow rapidly. After platforms competed on latency, HD quality, beauty filters, and instant startup, one of the hottest recent features is Lianmai, i.e. mic-linking or guest co-streaming. What is mic-linking?
Simply put, during a live broadcast the host can interact with one of the fans in real time, while the other fans watch the interaction.
Mic-linking upgrades host-fan interaction from text chat to audio and video, which immediately strengthens fans' sense of participation; at the same time, the rest of the audience gets to watch the interaction, which is very satisfying for the fan on the mic.

The mic-linking flow is as follows:

[Figure: mic-linking flow chart]

· During the live broadcast, the host announces an interactive segment, at which point viewers can ask to participate;

· A viewer requests to join, and the host accepts one viewer's request;

· The viewer joins the broadcast, and the interaction between the viewer and the host is streamed live to all other fans.

So how can mic-linking be implemented? This article introduces several approaches.

1. Implementation with two RTMP streams

Live broadcasting today generally uses RTMP. RTMP is a proprietary protocol, implemented by Adobe, for transmitting audio, video, and data between the Flash player and a server. It is built on TCP and multiplexes multiple logical channels over a single connection, so signaling and media travel over the same connection.

Domestic live CDNs basically all use this protocol, with a latency of about 3 seconds. Because data in this protocol flows in only one direction, implementing mic-linking with it requires publishing, and subscribing to, two separate video streams.

The schematic diagram is as follows:
[Figure: two-RTMP-stream scheme]

1. The host first publishes a video stream to the streaming server, and viewers pull the stream from it;

2. When a viewer wants to link mics, they send a request to the host through the signaling server, and the host accepts the request;

3. The guest publishes their own video stream to the streaming server;

4. The host and the other viewers pull the guest's stream and display it picture-in-picture on the phone.

In this scheme, the host and the guest each publish one video stream, and every viewer pulls two streams simultaneously. Technically this is very simple, but the experience has several problems:

First, the interaction latency between the host and the guest is too high. One hop of RTMP adds about 3 seconds of delay, so a round trip, from the host asking a question to hearing the guest's answer, takes roughly 6 seconds in principle, which is unacceptable for real-time interaction.

Second, the audio quality is poor because of echo. The audio pipeline of a typical live-broadcast client does not perform echo cancellation, so the host cannot play the guest's audio while broadcasting; otherwise that audio would be re-captured by the host's microphone and fed back as an echo.

Finally, each client receives two video streams, which is expensive. Viewers must pull both the host's and the guest's streams to see both of them, which roughly doubles bandwidth consumption, and decoding two streams also costs extra CPU.
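To make the viewer-side cost concrete, here is a rough back-of-the-envelope sketch; the bitrates are illustrative assumptions, not measurements from the article:

```python
# Rough viewer-side bandwidth comparison: two-RTMP-stream scheme versus
# pulling a single pre-mixed stream. Bitrates are illustrative assumptions.

HOST_STREAM_KBPS = 1500   # assumed bitrate of the host's stream
GUEST_STREAM_KBPS = 800   # assumed bitrate of the guest's stream

# Two-stream scheme: every viewer pulls both streams.
two_stream_download = HOST_STREAM_KBPS + GUEST_STREAM_KBPS

# Mixed scheme (used by the later solutions): viewers pull one mixed stream.
mixed_download = HOST_STREAM_KBPS

print(two_stream_download)  # 2300 kbps per viewer
print(mixed_download)       # 1500 kbps per viewer
```

The gap widens with every extra guest, since each one adds another full stream for every viewer in the two-stream scheme.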


From the analysis above, this is clearly not an acceptable mic-linking solution: the scenario demands low latency, and RTMP cannot deliver it. A better solution must ensure that interaction among the mic participants (two or more) meets video-conference standards, i.e. latency within 600 ms, while the overall interaction is mixed and output as a single RTMP stream. In other words, such a solution actually involves two systems: a low-latency multi-party audio/video interaction system, and a standard CDN live-broadcast system. The live-broadcast side is already familiar to everyone.

The following focuses on the characteristics of the low-latency interactive system:

1. A live-broadcast system is a one-way data channel, whereas a low-latency video conference system is a set of two-way channels. This makes it harder to scale to large concurrency than a broadcast system, and its network topology is more complex;

2. The transmission layer of the low-latency system generally uses UDP, and the application layer uses the RTP/RTCP protocol to ensure the instantaneousness of the packet; in order to ensure security, more systems use the SRTP protocol, which is based on RTP. Layer security and authentication measures; client connection establishment often uses the ICE protocol, which combines the environment of the host in the private network. The communication parties first collect as many connection addresses as possible from STUN and TURN, and then prioritize the addresses, select The best way to connect; this method is also good for scenarios that do not use NAT penetration; it can ensure the connection rate of different network customers, for example, some overseas customers directly connect to domestic servers because the effect is not good enough, you can consider using TURN services Transit to ensure service quality;

3. Using UDP brings network delay and packet loss, so QoS must be considered. The main strategies are:

a. Use a jitter buffer to absorb network jitter and deliver packets to downstream modules at a stable rate; audio and video each need their own jitter buffer, and the two are then synchronized;
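The reordering role of a jitter buffer can be sketched in a few lines. This toy version only restores packet order (a real one also paces delivery by timestamp and bounds its depth):

```python
import heapq

class JitterBuffer:
    """Toy jitter buffer: packets arrive out of order; pop() releases them
    strictly in sequence order, returning None while the next expected
    packet has not arrived yet (the gap a concealment algorithm would fill)."""

    def __init__(self):
        self._heap = []      # min-heap ordered by sequence number
        self._next_seq = 0   # next sequence number to release

    def push(self, seq, payload):
        heapq.heappush(self._heap, (seq, payload))

    def pop(self):
        if self._heap and self._heap[0][0] == self._next_seq:
            _, payload = heapq.heappop(self._heap)
            self._next_seq += 1
            return payload
        return None  # not yet arrived: downstream waits or conceals

buf = JitterBuffer()
for seq in (1, 0, 2):               # the network delivered packets reordered
    buf.push(seq, f"pkt{seq}")
print([buf.pop() for _ in range(3)])  # ['pkt0', 'pkt1', 'pkt2']
```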

b. For audio, a packet-loss concealment algorithm is needed; GIPS's NetEQ is widely recognized as the best VoIP anti-jitter algorithm in the industry, and it has been open-sourced as part of the WebRTC project;

c. For video, an adaptive feedback model is needed that adjusts the loss-protection strategy according to network congestion: when the RTT is large, use FEC to protect the data; when the RTT is small, use the NACK retransmission mechanism.
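The RTT-based choice in point (c) reduces to a simple rule: with a large RTT a retransmission would arrive too late, so redundancy (FEC) is preferred; with a small RTT, NACK is cheaper in bandwidth. A sketch, where the 100 ms threshold is an illustrative assumption rather than a value from the article:

```python
# Sketch of RTT-based loss-protection selection. The threshold is an
# assumed illustrative value; real systems adapt it to loss rate and
# available bandwidth as well.

RTT_THRESHOLD_MS = 100

def loss_protection_strategy(rtt_ms: float) -> str:
    """Pick redundancy (FEC) when retransmits would be too slow, else NACK."""
    return "FEC" if rtt_ms >= RTT_THRESHOLD_MS else "NACK"

print(loss_protection_strategy(30))   # NACK: a retransmit arrives in time
print(loss_protection_strategy(250))  # FEC: a retransmit would arrive too late
```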

Next, based on the model discussed above, we introduce two ways to implement mic-linking, both of which can guarantee a good experience. The main difference is that one connects the guest through P2P technology, while the other supports mic-linking through a multi-party video conference system.

2. P2P + live broadcast

1. The host first publishes a video stream to the streaming server, and viewers pull the stream from it;

2. The guest requests to link mics; a request dialog pops up on the host's side, the host selects the guest, and the guest establishes a P2P connection with the host;

3. Audio and video data are exchanged between the host and the guest over this P2P channel;

4. The host's client captures the host's video from the camera, receives the guest's video over the P2P channel, mixes the two pictures into one, and then publishes the mixed stream for live broadcast.
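The picture-in-picture mix in step 4 can be sketched as a simple overlay. Real clients composite decoded YUV/RGB frames (and re-encode before publishing); here each "frame" is just a 2D grid of pixel labels to show the operation:

```python
# Toy picture-in-picture mix: paste the guest's small frame onto a corner
# of the host's frame before the mixed result is published to the CDN.

def mix_pip(host_frame, guest_frame, x=0, y=0):
    """Return a new frame with guest_frame overlaid at (x, y) of host_frame."""
    mixed = [row[:] for row in host_frame]  # copy; leave the input untouched
    for gy, row in enumerate(guest_frame):
        for gx, px in enumerate(row):
            mixed[y + gy][x + gx] = px
    return mixed

host = [["H"] * 4 for _ in range(4)]    # 4x4 host frame
guest = [["G"] * 2 for _ in range(2)]   # 2x2 guest frame
mixed = mix_pip(host, guest, x=2, y=0)  # guest in the top-right corner
print(mixed[0])  # ['H', 'H', 'G', 'G']
```

Doing this mix on the host's device is what lets the broadcast remain a single stream, at the cost of the host's extra upload described below.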

The advantages of this implementation are:

1. The interaction latency between the host and the guest is small. Because the connection is P2P, the network delay is generally only a few hundred milliseconds, so the interaction feels very smooth;

2. The audio quality is good: the host side runs an echo-cancellation module, so the guest's echo is eliminated; at the same time, the voice conversation between the host and the guest is included in the broadcast.

The problems with this approach are:

1. The host effectively has two video uploads (the live stream plus the interactive stream to the guest) and one video download (the guest's stream), which raises the network requirements. In our team's tests on ordinary Wi-Fi and on 4G networks from China Telecom and China Unicom, the host's bandwidth was fully sufficient;

2. It does not support multiple guests communicating simultaneously.

3. Video conference + live broadcast

[Figure: video conference + live broadcast architecture]

To allow multiple fans to link mics at the same time, a video conference system can be used between the host and the guests, with an MCU (Multipoint Control Unit) forwarding the media data. The MCU mixes the multiple streams and then pushes the mixed stream to the CDN. The principle is as follows:

1. The host joins the video conference system; note that the host no longer pushes video directly to the CDN;

2. The video conference system pushes the host's video stream to the CDN, and the audience watches the host through the CDN;

3. Viewers who want to link mics log in to the same conference channel as the host. The host and the guests then interact through a real-time video conference, and their video is mixed on the server and output to the CDN;

4. Other viewers watch the interaction between the host and the guests through the CDN.
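The MCU's fan-in role in the steps above can be sketched as follows. This is a bare-bones illustration of the data flow only; stream contents are stand-in strings, and a real MCU decodes, composites, re-encodes, and pushes RTMP:

```python
# Minimal sketch of the MCU data flow: each participant uploads one stream,
# the MCU mixes all of them, and a single mixed stream goes to the CDN.

class MiniMCU:
    def __init__(self):
        self.streams = {}  # participant -> latest frame

    def publish(self, user, frame):
        """A participant (host or guest) uploads its current frame."""
        self.streams[user] = frame

    def mix_and_push(self):
        """Mix all participant frames into the one output sent to the CDN."""
        mixed = "+".join(self.streams[u] for u in sorted(self.streams))
        return mixed  # a real MCU would encode this and push RTMP here

mcu = MiniMCU()
mcu.publish("host", "host_frame")
mcu.publish("guest1", "guest1_frame")
mcu.publish("guest2", "guest2_frame")
print(mcu.mix_and_push())  # guest1_frame+guest2_frame+host_frame
```

Note how each client uploads only its own stream; the mixing cost lands on the server, which is exactly the trade-off listed in the disadvantages below.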

The advantages of this approach are:

1. The interaction latency between the host and the guests is very small; although the conference system adds one server forwarding hop, the delay is basically under one second;

2. The host only carries the traffic of the conference interaction and does not bear the upload traffic of the live broadcast, so the network requirements are lower than in the P2P approach;

3. Support multi-person interaction.

The disadvantages are:

1. Compared with an ordinary live-broadcast system, a video conference system is added on the server side, which is complex to develop;

2. Audio and video mixing is done on the server side, which demands high server performance.

The above is a brief introduction to mic-linking implementations. All three approaches are used in real projects. In principle the latter two give a better experience; in particular, the third option supports real-time interaction among a small group, but it requires a large amount of development work, and teams familiar with both video conferencing and live broadcasting are relatively scarce, so it places high demands on the R&D team. The second option can be built on top of WebRTC and existing live-broadcast technology, so teams familiar with these areas can try integrating them.

Q&A

Question 1: Is mic-linking implemented on the client side or the server side? What are the advantages and disadvantages of each?

Answer 1: The second solution above is implemented mainly on the client side (the server still does some work), while the third is implemented mainly on the server side. Their advantages and disadvantages were covered above and can be used as a reference.

Question 2: Is there an open-source basic version of mic-linking?

Answer 2: The P2P solution can be implemented on top of WebRTC. For the video conference + live broadcast solution, I have not yet seen an open-source project; one option is to modify a video conference system to output an RTMP live stream.

Question 3: What is the minimum bandwidth the host and the guest need to link mics smoothly?

Answer 3: The P2P solution places higher bandwidth requirements on the host; the third (conference) mode is less demanding, basically one upload and one download. For the second (P2P) solution, our tests on 4G and on 10 Mbps China Unicom and China Telecom connections all worked fine.

Question 4: Is your P2P stack developed in-house or based on an existing one?

Answer 4: We modified WebRTC. The WebRTC video has to be composited with the camera's image; and in the headphone case, the audio also needs to be mixed in software.

Question 5: Do you use STUN or ICE for firewall/NAT traversal?

Answer 5: ICE must be used. In P2P mode, many networks cannot connect directly and must be relayed through a TURN server; in conference mode, TURN relays can also be used, which helps work around unstable long-distance connections.

Question 6: In each solution, if a user's connection drops, do they have to go through the mic-linking process again, or can the client reconnect automatically?

Answer 6: Automatic reconnection is possible; there is no need to go through the mic-linking process again.

Question 7: Why doesn't the second solution support simultaneous communication among multiple guests?

Answer 7: P2P can in fact support multi-party interaction, but with several simultaneous connections, both the CPU load and the network load on the host become too high.

Question 8: Which video and audio codecs do you use?

Answer 8: The usual scheme is H.264 for video and AAC for audio; if you control both endpoints, H.265 is recommended for its higher compression ratio.

Question 9: For the third solution, which video conference system do you recommend?

Answer 9: If you are interested, take a look at licode.

Question 10: How many people are needed on the development team for the third solution, and how long is the development cycle?

Answer 10: It is not mainly a matter of headcount; the key is a good understanding of the video conference system. If you build on licode, you need to add RTMP streaming output on the server side. A team familiar with ffmpeg and similar tools can produce a basic version in about a month, but a lot more work is needed to make it stable.



Origin: blog.csdn.net/lingshengxueyuan/article/details/108024868