Using ChatGPT for WebRTC audio and video performance optimization

Does ChatGPT replace programmers, or give programmers a buff?

AI news has come thick and fast over the past two weeks. On March 23, Google opened up Bard, its long-in-beta AI conversation service, emphasizing that the product is positioned as a source of creativity for users: it can generate writing drafts or act as a chatbot in daily life. A week earlier, in the early morning of March 15, OpenAI released GPT-4, an upgraded model arriving four months after GPT-3.5. According to the launch event, GPT-4 supports image input, role-playing, and stronger writing skills. Right after that, on March 16, Baidu released ERNIE Bot (Wenxin Yiyan), offering five capabilities: literary creation, commercial copywriting, mathematical and logical calculation, Chinese-language understanding, and multi-modal generation.

With major vendors releasing AI products one after another in recent days, the topic of AI replacing human jobs continues to heat up. Will AI greatly liberate human productivity, or will it disrupt a large number of occupations?

I am currently writing WebRTC-related technical blog posts, so why not ask the AI and see what insights it has.

Like most people, I have not yet been granted beta access to Bard or ERNIE Bot. Knowing that New Bing uses the GPT-4 model, let's ask both GPT-3.5 and New Bing (GPT-4) which QoS technologies WebRTC uses to improve audio and video call quality, and compare their answers.

As shown in the figures below, neither GPT-3.5 nor GPT-4 is stumped by basic technical questions. Digging deeper, I asked both to give concrete examples:

NewBing(GPT-4)

Results given by GPT-3.5

NewBing (GPT-4) directly gives specific operation examples

The results given by GPT-3.5 (somewhat vague)

GPT-4 and GPT-3.5 comparison conclusion

Comparing the two versions' responses to the same questions: for ordinary text processing the difference between GPT-4 and GPT-3.5 may be small, but when the question is specific and complex enough, GPT-4 is more accurate and creative than GPT-3.5 and can handle users' more subtle instructions.

Of course, the point of this article is not the specific differences between GPT-3.5 and GPT-4, but how programmers can use ChatGPT to improve their efficiency and gain the strongest buff. Below, drawing on my own development experience, I share "How WebRTC's QoS Improves Audio and Video Quality" for audio and video developers.

Overview of WebRTC Technology

WebRTC improves audio and video call quality through a series of QoS technologies: packet loss recovery (NACK, FEC), congestion control (TWCC/REMB), SVC or multiple tracks, video quality adaptation, the Pacer, the JitterBuffer, and more.

The overall QOS architecture is shown in the figure below:

Figure 1

1 Packet Loss Recovery Strategy

1.1 NACK

In contrast to ACK, NACK (Negative Acknowledgment) is a selective retransmission mechanism based on "non-arrival acknowledgment". The basic principle: the sender caches sent data; the receiver detects loss from gaps in the arriving packet sequence and, taking RTT and reordering into account, sends a retransmission request to the sender at an appropriate time.

Figure 2

As shown in the figure, after receiving packet 4 the Receiver notices that packets 2 and 3 have not arrived and provisionally puts them into the NACK list of lost packets. Once a reordering threshold is exceeded (computed from a reorder histogram; assuming it is 2 here, receiving packet 4 lets us treat packet 2 as lost) or a jitter time has elapsed (computed from the RTT), the Receiver asks the Sender to retransmit the lost packets 2 and 3. The request is carried to the Sender in an RTP FB (feedback) message; see RFC 4585 for the exact NACK format. On receiving the NACK, the Sender retransmits packets 2 and 3.

Note that the effectiveness of NACK-based recovery depends on the timing of retransmission requests, which hinges on two calculations: the RTT (WebRTC's default RTT is 100 ms) and the reordering threshold. Poorly paced retransmission requests can easily trigger retransmission storms, aggravating congestion and causing stuttering on the receiving side.
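To make the mechanism above concrete, here is a minimal, hypothetical sketch of a receiver-side NACK list. It assumes a fixed reordering threshold for simplicity; WebRTC derives this threshold from a reorder histogram and also times requests using the measured RTT, which this sketch omits.

```python
class NackList:
    """Toy receiver-side NACK list (illustrative, not WebRTC's actual class)."""

    def __init__(self, reorder_threshold=2):
        self.reorder_threshold = reorder_threshold
        self.highest_seq = None
        self.missing = set()

    def on_packet(self, seq):
        """Record an arriving packet; return the sequence numbers to NACK."""
        if self.highest_seq is None:
            self.highest_seq = seq
            return []
        if seq > self.highest_seq:
            # every sequence number skipped over is provisionally lost
            self.missing.update(range(self.highest_seq + 1, seq))
            self.highest_seq = seq
        else:
            self.missing.discard(seq)  # late arrival: no longer lost
        # only NACK packets at least `reorder_threshold` behind the newest
        return sorted(s for s in self.missing
                      if self.highest_seq - s >= self.reorder_threshold)
```

With a threshold of 2, receiving packets 1 and then 4 immediately flags packet 2 as lost (4 - 2 >= 2), while packet 3 is still treated as possibly reordered, matching the Figure 2 walkthrough.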

Reference: https://www.rfc-editor.org/rfc/rfc4585.html#page-34

1.2 FEC

FEC (Forward Error Correction) is commonly used for error correction in data transmission and storage. WebRTC also uses this technique for packet loss recovery.

WebRTC implements this redundancy in three ways:

1.2.1 RED

The previous packet is carried directly inside the new packet, and the receiver parses out the primary and redundant payloads.

Figure 3

As shown in the figure, each packet directly contains the preceding one, so when a packet is lost it can be recovered from an adjacent packet. The drawback is that this scheme copes poorly with bursts of consecutive loss, but it is simple to implement.

Opus in-band FEC uses this approach for error correction: important information is re-encoded at a lower bitrate and appended to the following packet, and the Opus decoder decides, based on which packets were lost, whether to use the redundancy carried in the current packet for recovery.

Opus in-band FEC details: RFC 6716 - Definition of the Opus Audio Codec

RED details: https://www.rfc-editor.org/rfc/rfc2198.html

1.2.2 ULPFEC

Redundant information is generated by XOR-ing multiple packets, allowing lost packets to be recovered at the receiver when needed. ULPFEC provides different (uneven) levels of protection for different packets by choosing how many bytes to protect and how many preceding packets the XOR is applied over.

Figure 4

As shown in the figure, FEC packet 1 protects packets A and B at level L0; FEC packet 2 protects A and B at level L0 and additionally protects packets C and D at level L1.
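A toy illustration of the XOR primitive behind ULPFEC (and FLEXFEC below): one parity packet is the XOR of a group of media packets, and any single loss within the group can be rebuilt by XOR-ing the survivors with the parity. The function name and packet contents are made up for the example.

```python
def xor_packets(packets):
    """XOR a list of byte strings together, padding to the longest."""
    size = max(len(p) for p in packets)
    out = bytearray(size)
    for p in packets:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

media = [b"pktA", b"pktB", b"pktC"]
fec = xor_packets(media)          # parity packet over the protected group

# packet B is lost: XOR the surviving packets with the parity to rebuild it
recovered = xor_packets([media[0], media[2], fec])
assert recovered == b"pktB"
```

This also shows the scheme's limit: a single parity packet cannot repair two losses in the same group, which is why FLEXFEC arranges packets into rows and columns.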

Reference: https://www.rfc-editor.org/rfc/rfc5109.html

1.2.3 FLEXFEC

Compared with ULPFEC, FLEXFEC can flexibly choose 1-D row XOR, 1-D column XOR, or 2-D row-and-column XOR to strengthen resistance to packet loss.

1-D row XOR error correction

Figure 5

1-D column XOR error correction

Figure 6

2-D row-column XOR error correction

Figure 7

Although FLEXFEC has stronger recovery capability than the previous two schemes, some loss patterns remain uncorrectable: for example, losing the interleaved packets (1, 2, 5, 6) in Figure 7 cannot be repaired.

Overall, the FEC schemes WebRTC uses have relatively weak recovery capability. The industry generally uses Reed-Solomon FEC instead: with K data packets plus N FEC packets, Reed-Solomon can genuinely recover any N or fewer lost packets.

For detailed implementation of FLEXFEC, please refer to: https://www.rfc-editor.org/rfc/rfc8627.html

2 Bandwidth Estimation and Bitrate Control

2.1 REMB-GCC

Figure 8

Figure 8 shows the REMB-GCC architecture. The basic idea is that the receiver estimates the bandwidth and feeds the estimate back to the sender via RTCP REMB. The sender computes a bandwidth estimate As based on the packet loss rate, takes the REMB result Ar, and uses min(As, Ar) as the final bandwidth.

2.2 SendSide BWE

Figure 9

Compared with REMB-GCC, the main difference in TFB-GCC (transport-wide feedback GCC) is that most of the bandwidth calculation moves to the sending side, and the filter is no longer a Kalman filter but a TrendLine filter.

Packets from the sender must carry a transport-wide sequence number in the RTP extension header.

The receiver periodically sends transport-wide feedback messages to inform the sender about packet reception, including packet arrival times and packet formats. On receiving a transport-wide feedback message, the sender runs the delay filtering (TrendLine) calculation over the information the message carries.
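The core of the TrendLine filter is an ordinary least-squares fit over recent delay samples: a persistently positive slope means queuing delay is growing, i.e. the link is being overused. The sketch below is a minimal illustration of that idea; the sample representation, smoothing, and thresholds in WebRTC's actual trendline estimator are more involved.

```python
def trendline_slope(samples):
    """Least-squares slope over (arrival_time_ms, accumulated_delay_ms) pairs.

    A positive slope suggests growing queues (overuse); near zero
    suggests a stable path. Illustrative only.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    return num / den if den else 0.0
```

For example, delay samples rising 1 ms per 1 ms of arrival time yield a slope of 1.0, while constant delay yields 0.0; the overuse detector then compares this slope against an adaptive threshold.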

Transport-wide feedback message format reference: https://datatracker.ietf.org/doc/html/draft-holmer-rmcat-transport-wide-cc-extensions-01

2.3 Rate Control

Figure 10

Figure 11

The signal s produced by the over-use detector drives the finite state machine shown in Figure 10, which adjusts the sending bitrate.
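The state machine can be sketched as a transition table plus a rate update. This is a hedged illustration of the Hold / Increase / Decrease logic described in the GCC literature; the multiplicative factors below are placeholders, not WebRTC's tuned constants.

```python
# (current state, detector signal) -> next state
TRANSITIONS = {
    ("hold", "normal"): "increase",
    ("hold", "overuse"): "decrease",
    ("hold", "underuse"): "hold",
    ("increase", "normal"): "increase",
    ("increase", "overuse"): "decrease",
    ("increase", "underuse"): "hold",
    ("decrease", "normal"): "hold",
    ("decrease", "overuse"): "decrease",
    ("decrease", "underuse"): "hold",
}

def step(state, signal, rate_kbps):
    """Advance the rate-control FSM one step and adjust the bitrate."""
    state = TRANSITIONS[(state, signal)]
    if state == "increase":
        rate_kbps *= 1.05   # gentle multiplicative increase (illustrative)
    elif state == "decrease":
        rate_kbps *= 0.85   # back off on detected overuse (illustrative)
    return state, rate_kbps
```

Note that "overuse" forces Decrease from any state, while "underuse" parks the controller in Hold, letting queues drain before probing upward again.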

GCC algorithm principle detailed reference: https://c3lab.poliba.it/images/6/65/Gcc-analysis.pdf

3 SVC and Multi-track

3.1 SVC

SVC (Scalable Video Coding) is an extension of the traditional H.264/MPEG-4 AVC codec. It improves coding flexibility and offers three kinds of scalability: temporal (Temporal Scalability), spatial (Spatial Scalability), and quality (SNR/Quality/Fidelity Scalability).

In WebRTC, H.264 does not support SVC encoding; VP8 supports only temporal scalability; VP9 and AV1 support both temporal and spatial scalability.

Figure 12

The figure above illustrates temporal scalability. Assume the layers shown in the legend are displayed at a frame rate of 30 fps. If we remove all L2 frames, the remaining layers (L0 and L1) can still be decoded successfully, producing a 15 fps video. If we further remove all L1 frames, the remaining L0 layer can still be decoded, producing a 7.5 fps video. So even under packet loss, temporal scalability noticeably improves video smoothness on weak networks compared with non-scalable encoding.
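The layer assignment behind that 30 / 15 / 7.5 fps hierarchy can be sketched as a simple function of the frame index (an illustrative three-layer pattern, not tied to any particular codec's syntax):

```python
def temporal_layer(frame_index):
    """Assign a frame to temporal layer L0/L1/L2 in a 3-layer hierarchy."""
    if frame_index % 4 == 0:
        return 0          # L0: every 4th frame -> 7.5 fps base layer
    if frame_index % 2 == 0:
        return 1          # L1: adds the remaining even frames -> 15 fps
    return 2              # L2: all odd frames -> full 30 fps

layers = [temporal_layer(i) for i in range(8)]
# keeping L0+L1 halves the frame rate; keeping only L0 quarters it
```

Because lower layers never reference higher ones, a receiver (or an SFU) can drop L2 or L1 packets at will and the remaining stream stays decodable.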

Figure 13

As shown in Figure 13, the L0 base layer is encoded at the minimum resolution, and higher layers carry higher resolutions. When a lower resolution is needed in practice, the higher-layer data can simply be discarded before decoding.

Users with different bandwidth conditions and different device performance can flexibly adjust the resolution.

SVC extension reference:  http://ip.hhi.de/imagecom_G1/assets/pdfs/Overview_SVC_IEEE07.pdf

SVC combined with H264 reference:  https://www.itu.int/rec/T-REC-H.264-201704-I

3.2 Multi-track

Mainstream browsers now support Unified Plan SDP, so we can add multiple video tracks during SDP negotiation. A common practice in production is to add two video tracks (similar in spirit to SVC's spatial scalability) that reuse the same DTLS transport channel.

Figure 14

Figure 14 is a schematic of the frame output of a large stream and a small stream, a typical use of WebRTC's multi-track support.

Supporting multiple video tracks (large and small streams) allows the receiver to dynamically switch to a resolution it can sustain when downlink bandwidth is limited, improving the weak-network experience.

Multiple video tracks (large and small streams) are less flexible than SVC in adapting to packet loss and bandwidth limits, but they are simple to implement and their encoding and decoding overhead is low, so they are widely used in real business scenarios.

Multi-track requires Unified Plan SDP negotiation; see WebRTC's notes: https://webrtc.github.io/webrtc-org/web-apis/chrome/unified-plan/

4 Video Quality Adjustment Strategy

When network transmission quality is poor (insufficient uplink bandwidth), CPU usage is high, or the encoder's QP value is high, WebRTC degrades quality to keep the video call going. The degradation strategy is either reducing the frame rate (clarity-priority mode) or reducing the resolution (smoothness-priority mode), configured via MediaStreamTrack content hints.

Clarity-priority mode: WebRTC pays more attention to video detail when encoding. When degradation is needed in the situations above, it reduces the frame rate while keeping the resolution unchanged, preserving the subjective experience for viewers. This matters especially when the shared content is a slide deck or is displayed on a large screen.

Smoothness-priority mode: when the sending side needs to degrade, it first reduces the resolution while maintaining a certain frame rate, preserving a smooth experience for viewers.

When bandwidth or CPU is no longer constrained, WebRTC raises the video quality back up according to the degradation-preference setting.

Choose the setting appropriate to your business scenario, so that the subjective experience does not degrade too badly in extreme cases.

5 Pacer

WebRTC's Pacer module spreads the packets to be sent as evenly as possible across each send window according to the estimated network bandwidth, smoothing the send rate and avoiding network congestion.

Suppose we have a 5 Mbps, 30 fps video stream. Ideally each frame is about 21 kB, packed into 18 RTP packets. The average bitrate over a one-second window is 5 Mbps, but on a shorter timescale this looks like a burst of roughly 167 kbit every 33 ms. Moreover, video encoders can overshoot the target frame size during sudden motion, and in screen sharing frames 10x or even 100x the target size are common.

Sending these packets as soon as they are encoded causes several problems: network congestion, buffer bloat, and even packet loss. Most sessions also carry more than one media stream, for example audio, video, and data streams at the same time. If a whole frame is sent at once on one transport channel and those packets take 100 ms to go out, no audio packets can get out in time.

The Pacer solves this with a buffer: media packets are queued there and then released onto the network by a leaky-bucket algorithm. The buffer holds separate FIFO queues per media track, so that, for example, audio can be prioritized over video, and streams of equal priority can be sent round-robin so that no single stream blocks the others.
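The queuing-and-draining behavior described above can be sketched as a minimal leaky-bucket pacer. The class name, the per-interval byte budget, and the strict audio-over-video priority are simplifying assumptions for illustration; WebRTC's actual Pacer also handles padding, probing, and round-robin among equal-priority streams.

```python
from collections import deque

class Pacer:
    """Toy leaky-bucket pacer: drain at most one interval's budget per tick."""

    def __init__(self, target_bitrate_bps, interval_ms=5):
        # bytes allowed per pacing interval at the target bitrate
        self.budget = target_bitrate_bps / 8 * interval_ms / 1000
        self.queues = {"audio": deque(), "video": deque()}

    def enqueue(self, kind, size_bytes):
        self.queues[kind].append(size_bytes)

    def drain_once(self):
        """Send up to one interval's budget; returns (kind, size) pairs."""
        sent, remaining = [], self.budget
        for kind in ("audio", "video"):   # audio has strict priority here
            q = self.queues[kind]
            while q and q[0] <= remaining:
                size = q.popleft()
                remaining -= size
                sent.append((kind, size))
        return sent
```

At 1.6 Mbps the 5 ms budget is 1000 bytes, so a queued 200-byte audio packet and one 600-byte video packet go out in the first interval, while a second 600-byte video packet waits for the next tick instead of bursting onto the wire.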

Figure 15

6 JitterBuffer

Figure 16

After the WebRTC receiver gets an RTP packet, it puts it into the PacketBuffer for caching and ordering. As shown in the figure above, once the packet carrying the Mark (end-of-frame) flag is received, frame assembly proceeds backwards from that packet. An assembled frame is placed in the buffer of the GOP it belongs to and ordered according to inter-frame reference relationships; once those references are resolved, the frame is handed to the decoder.

The JitterBuffer can thus be seen as performing packet ordering, frame ordering, and GOP ordering in turn. All this work is needed because the network itself introduces jitter and even packet loss; when packets are lost, frame assembly must wait for loss recovery, so frame arrival times jitter relative to send times. The jitter buffer solves this well: it smooths the flow of data into the decoder on the receiving side, ensuring the rendered video plays back fluidly.
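The PacketBuffer stage described above can be sketched as follows. This hypothetical version only checks sequence-number continuity up to the marker packet; the frame ordering, GOP ordering, and reference resolution that follow in WebRTC are omitted.

```python
class PacketBuffer:
    """Toy PacketBuffer: assemble a frame once its packets are contiguous."""

    def __init__(self):
        self.packets = {}   # seq -> payload
        self.markers = set()  # seqs carrying the end-of-frame Mark flag

    def insert(self, seq, payload, marker, frame_start_seq):
        """Buffer a packet; return the full frame payload if complete."""
        self.packets[seq] = payload
        if marker:
            self.markers.add(seq)
        return self._try_assemble(frame_start_seq)

    def _try_assemble(self, start):
        # find this frame's end-of-frame marker, if it has arrived yet
        end = next((s for s in sorted(self.markers) if s >= start), None)
        if end is None:
            return None
        chunks = []
        for s in range(start, end + 1):
            if s not in self.packets:
                return None          # gap: wait for NACK retransmission
            chunks.append(self.packets[s])
        return b"".join(chunks)      # complete frame, ready for the decoder
```

If packet 2 of a three-packet frame is lost, receiving the marker packet 3 still yields nothing; only when the retransmitted packet 2 arrives does `insert` return the assembled frame, which is exactly the jitter the buffer must absorb.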

7 Key Frame Request

Video streams are usually sent as one key frame followed by N delta frames, each of which depends on previous frames for decoding and display. If for some reason the SPS/PPS is lost or packets are corrupted and no remedial action is taken, the stream becomes hard to decode and the video freezes until the next key frame. In many deployments the GOP is set very large for encoding stability, which can mean a long freeze or black screen.

Figure 17

As shown in the figure, the receiver fails to assemble Frame 9 because of unrecoverable packet loss; even if assembly succeeded, the frame could not be decoded. At this point the receiver must ask the sender for an I-frame to refresh the current video stream.

WebRTC requests a key frame from the sender via RTCP. The key frame request message formats are relatively simple; RFC 4585 (RTP/AVPF) and RFC 5104 (AVPF) specify two of them: Picture Loss Indication (PLI) and Full Intra Request (FIR). In the current implementation, when WebRTC receives a PLI or FIR it asks the encoder to output a key frame, which is then sent to the receiver.

PLI message format reference:  https://www.rfc-editor.org/rfc/rfc4585.html#page-36

FIR reference:  https://www.rfc-editor.org/rfc/rfc5104.html

Summary of QoS Technologies

This article has briefly introduced the QoS technologies used in WebRTC, each improving quality from a different angle: NACK and FEC recover lost packets, alleviating the audio and video freezes that loss causes; bandwidth estimation and congestion control adjust the encoding and sending bitrates to automatically adapt to changing network bandwidth; SVC and multi-track deliver appropriate video quality to receivers on networks of different quality; the Pacer and the JitterBuffer improve smoothness on the sending and receiving sides respectively; and key frame requests enable fast video recovery after severe network disruption. WebRTC combines these technologies to improve overall QoS. The best way to understand the details is to read the WebRTC source code.

WebRTC's QoS technologies noticeably improve overall audio and video quality, but they still leave plenty of room for optimization. Audio and video vendor ZEGO's self-developed WebRTC gateway optimizes several of these strategies, including a self-developed bandwidth estimation algorithm, NACK algorithm, and large/small streams.

So if your business needs a stable, reliable audio and video service, consider trying a real-time audio and video (RTC) service.

Click through to ZEGO's real-time audio and video services to learn more WebRTC best practices.


Origin blog.csdn.net/sinat_40572875/article/details/129758384