WebRTC audio and video synchronization principle and implementation

Any system that captures audio and video, transmits them over a network, and plays them back has to deal with audio/video synchronization. WebRTC, as a representative modern real-time communication system for the Internet, is no exception.

This article takes a close look at the principle of audio and video synchronization and how WebRTC implements it.

1. Timestamp

Synchronization is, at heart, a question of pacing: it requires relating the audio and video media streams to time, which brings us to the concept of the timestamp.

A timestamp marks the sampling instant of the media payload and is taken from a monotonically, linearly increasing clock whose resolution is determined by the sampling frequency of the RTP payload. Audio and video use different frequencies: audio is typically sampled at 16 kHz, 44.1 kHz, or 48 kHz, while video is characterized by its frame rate, commonly 25 fps, 29.97 fps, or 30 fps.

By convention, the audio timestamp advances at the sampling rate. With 16 kHz sampling and one frame captured every 10 ms, each frame's timestamp is numerically 16 × 10 = 160 larger than the previous frame's; in other words, the audio timestamp advances at 16 ticks per millisecond. Video timestamps are usually based on a 90 kHz clock, i.e. 90,000 ticks per second; 90 kHz is used because it is an integer multiple of the common frame rates listed above. Video timestamps therefore advance at 90 ticks per millisecond.
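
To make these rates concrete, here is a tiny sketch (ordinary C++, not WebRTC code) computing the per-frame timestamp increments for the common configurations mentioned above:

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Audio: 16 kHz sampling, one frame every 10 ms -> 160 ticks per frame.
  constexpr uint32_t kAudioSampleRateHz = 16000;
  constexpr uint32_t kAudioFrameMs = 10;
  const uint32_t audio_increment = kAudioSampleRateHz * kAudioFrameMs / 1000;

  // Video: 90 kHz RTP clock, 25 fps -> one frame every 40 ms -> 3600 ticks per frame.
  constexpr uint32_t kVideoRtpClockHz = 90000;
  constexpr uint32_t kVideoFrameMs = 40;
  const uint32_t video_increment = kVideoRtpClockHz * kVideoFrameMs / 1000;

  std::cout << "audio timestamp += " << audio_increment << " per frame\n"
            << "video timestamp += " << video_increment << " per frame\n";
}
```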

2. Generation of timestamps

Generation of audio frame timestamps

WebRTC's audio frame timestamps start at 0 for the first packet and accumulate from there; each frame adds encoded frame length (ms) × sampling rate / 1000. With a 16 kHz sampling rate and a 20 ms frame length, each audio frame's timestamp grows by 20 × 16000 / 1000 = 320. This is only the timestamp of the audio frame before packetization: when the frame is packed into an RTP packet, a random offset (generated in the constructor) is added to the frame timestamp, and the sum is sent as the RTP packet's timestamp, as sketched below. Note that this logic also applies to video packets.
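
A minimal sketch of this logic follows. The class and member names here are made up for illustration; the real WebRTC code spreads this across the audio send stream and the RTP sender, but the arithmetic is the same:

```cpp
#include <cstdint>
#include <random>

// Minimal sketch, not the actual WebRTC classes: the frame timestamp starts at 0
// and grows by frame_length_ms * sample_rate / 1000 per frame; a random offset
// chosen once (in the constructor) is added when building the RTP packet.
class AudioTimestamper {
 public:
  AudioTimestamper() : rtp_offset_(std::random_device{}()) {}

  // Returns the timestamp of the next audio frame (before packetization).
  uint32_t NextFrameTimestamp(int frame_length_ms, int sample_rate_hz) {
    const uint32_t ts = frame_timestamp_;
    frame_timestamp_ += frame_length_ms * sample_rate_hz / 1000;  // e.g. 20 * 16000 / 1000 = 320
    return ts;
  }

  // Timestamp actually written into the RTP header.
  uint32_t RtpTimestamp(uint32_t frame_timestamp) const {
    return frame_timestamp + rtp_offset_;  // unsigned wrap-around is the intended RTP behavior
  }

 private:
  uint32_t frame_timestamp_ = 0;  // first frame starts at 0
  const uint32_t rtp_offset_;     // random offset, fixed for the stream's lifetime
};
```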

Generation of video frame timestamps

WebRTC generates video frame timestamps in a completely different way from audio frames. A video frame's timestamp is derived from the system clock: at some point between capture and encoding (this pipeline is long, and frames in different configurations follow different paths, so the exact point varies), the current system time timestamp_us_ is read, the corresponding NTP time ntp_time_ms_ is computed from it, and the original video frame's RTP timestamp timestamp_rtp_ is then derived from that NTP time. A sketch of the idea follows; in WebRTC this calculation lives in the OnFrame function.
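
Here is a simplified sketch of the idea, assuming the NTP clock can be modeled as system time plus a fixed offset; timestamp_us_, ntp_time_ms_ and timestamp_rtp_ are the names used in WebRTC, while the function itself is hypothetical:

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper illustrating how a video frame gets its timestamps from
// the system clock (a sketch of the idea behind OnFrame, not the WebRTC code).
struct VideoFrameTimestamps {
  int64_t timestamp_us;    // system time of the frame, in microseconds
  int64_t ntp_time_ms;     // the same instant expressed as NTP milliseconds
  uint32_t timestamp_rtp;  // RTP timestamp before the random offset is added
};

VideoFrameTimestamps StampVideoFrame(int64_t ntp_offset_ms /* local-to-NTP offset, assumed known */) {
  using namespace std::chrono;
  VideoFrameTimestamps t;
  t.timestamp_us =
      duration_cast<microseconds>(system_clock::now().time_since_epoch()).count();
  // WebRTC derives the NTP time from its clock's NTP estimate; here we simply
  // model it as system time plus a fixed offset.
  t.ntp_time_ms = t.timestamp_us / 1000 + ntp_offset_ms;
  // 90 kHz video clock: 90 ticks per millisecond; truncation to 32 bits wraps.
  t.timestamp_rtp = static_cast<uint32_t>(t.ntp_time_ms * 90);
  return t;
}
```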

Why do video frames use a different timestamping mechanism than audio frames? My understanding is that the sampling interval and clock of an audio capture device are generally quite precise: one frame every 10 ms, 100 frames per second, with little jitter. Video frame intervals are far less regular: at 25 fps a frame should arrive every 40 ms, but capture timing drifts and jitters. If video timestamps were also advanced by a fixed nominal rate, they could fall out of step with the actual clock. So every time a frame is taken, its timestamp is computed from the system clock at that moment, preserving the video frame's true correspondence to wall-clock time.

As with audio, when the frame is encapsulated into an RTP packet, a random offset (a different value from the audio offset) is added to the original video frame timestamp, and the sum is sent as the RTP packet's timestamp. It is worth noting that the NTP timestamp computed here is not sent with the RTP packet at all: there is no NTP field in the RTP header, and WebRTC does not put this value into the header extensions either, not even the video timing-related extension fields.


3. Core basis for audio and video synchronization

As can be seen above, an RTP packet carries only the independent, monotonically increasing timestamp of its own stream. The audio and video timestamps are completely unrelated to each other, so this information alone cannot synchronize them: the two streams' times cannot be correlated. We need a mapping that ties the two independent timestamps together.

This is where the SR (Sender Report) packet, one of the RTCP packet types, comes into play.

One of the functions of the SR packet is to tell us how each stream's RTP timestamps map to NTP time. It does so through the NTP timestamp and RTP timestamp fields it carries (shown below). RFC 3550 tells us that these two timestamps refer to the same instant, namely the moment the SR packet is generated. This is the core basis for audio/video synchronization; every other calculation builds on it.
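
For reference, the sender-info block of an SR packet defined in RFC 3550 carries exactly this NTP/RTP pair (layout only; parsing is omitted):

```cpp
#include <cstdint>

// Sender-info fields of an RTCP Sender Report (RFC 3550, section 6.4.1).
struct RtcpSenderInfo {
  uint32_t ssrc;                 // SSRC of the sender
  uint32_t ntp_timestamp_sec;    // NTP timestamp, most significant 32 bits (seconds)
  uint32_t ntp_timestamp_frac;   // NTP timestamp, least significant 32 bits (fraction)
  uint32_t rtp_timestamp;        // RTP timestamp for the same instant, in the stream's clock units
  uint32_t sender_packet_count;  // total RTP packets sent
  uint32_t sender_octet_count;   // total payload octets sent
};
```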

4. Generation of SR package

As discussed above, NTP time and the RTP timestamp are two representations of the same instant, differing in precision and unit: NTP time is absolute time measured in milliseconds, while the RTP timestamp is a monotonically increasing value tied to the media's sampling frequency. The SR packet is built in the RTCPSender::BuildSR(const RtcpContext& ctx) function. Older versions had a bug here that effectively hard-coded the sampling rate to 8 kHz; newer versions have fixed it.

The calculation idea is as follows

First, we need the NTP time of the current moment (that is, the moment the SR packet is generated). This is available directly from the ctx parameter that is passed in.

Second, we need to compute the RTP timestamp corresponding to that same moment. It can be derived from the timestamp of the last RTP packet sent, last_rtp_timestamp_, the system time at which that frame was captured, last_frame_capture_time_ms_, the per-millisecond growth rate of the current media stream's timestamp, rtp_rate, and the time elapsed from last_frame_capture_time_ms_ to the current moment. Note that last_rtp_timestamp_ is the media stream's original timestamp, not the RTP packet timestamp with the random offset, so the offset timestamp_offset_ is added at the end; a sketch follows. The timing information of the last RTP packet sent is updated whenever a packet goes out.
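
A sketch of that extrapolation, simplified from the idea in RTCPSender::BuildSR (not a copy of the WebRTC source; parameter names mirror the members discussed above):

```cpp
#include <cstdint>

// RTP timestamp to put into the SR: the last frame's original timestamp,
// advanced by the time elapsed since that frame was captured, plus the
// stream's random offset.
uint32_t SrRtpTimestamp(uint32_t last_rtp_timestamp,
                        int64_t last_frame_capture_time_ms,
                        int64_t now_ms,
                        int rtp_rate_khz,  // ticks per ms: 16 for 16 kHz audio, 90 for video
                        uint32_t timestamp_offset) {
  const int64_t elapsed_ms = now_ms - last_frame_capture_time_ms;
  return timestamp_offset + last_rtp_timestamp +
         static_cast<uint32_t>(elapsed_ms * rtp_rate_khz);
}
```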

5. Calculation of audio and video synchronization

Because the audio stream and the video stream on the same machine share the same local system time, the NTP-format time corresponding to that system time is also the same; the two streams live in the same coordinate system. We can therefore put NTP time on the horizontal X axis (in ms) and the RTP timestamp value on the vertical Y axis and plot both streams together. The principle of the synchronization calculation is then very simple: use the two most recent SR points to determine a straight line, after which any RTP timestamp can be mapped to its corresponding NTP time. Because the audio and video NTP times share the same base, the difference between the two can be computed.

Taking two audio SR packets as an example, they determine the straight line relating RTP to NTP; given any audio timestamp rtp_a, the corresponding NTP_a can be computed, and in the same way the NTP_v corresponding to any video packet's rtp_v can be found. The difference between the two is the time offset between the streams.

WebRTC computes the coefficient rate and the offset offset of this straight line; a simplified sketch of the idea is shown below:
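
This sketch shows the same idea under simple assumptions (no RTP wrap-around handling); it is not the WebRTC source, just the straight-line fit the text describes:

```cpp
#include <cstdint>

// Two SR points (rtp, ntp_ms) define a line; any later RTP timestamp can then
// be mapped to an NTP time in milliseconds.
struct RtpNtpLine {
  double rate;    // RTP ticks per millisecond (the slope)
  double offset;  // NTP ms corresponding to RTP timestamp 0
};

RtpNtpLine FitLine(uint32_t rtp1, int64_t ntp1_ms, uint32_t rtp2, int64_t ntp2_ms) {
  RtpNtpLine line;
  line.rate = static_cast<double>(rtp2 - rtp1) / static_cast<double>(ntp2_ms - ntp1_ms);
  line.offset = static_cast<double>(ntp1_ms) - rtp1 / line.rate;
  return line;
}

int64_t RtpToNtpMs(const RtpNtpLine& line, uint32_t rtp) {
  return static_cast<int64_t>(rtp / line.rate + line.offset);
}
```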

What WebRTC computes is the NTP time corresponding to the most recently received audio RTP packet and the most recently received video RTP packet; their difference is the desynchronization introduced by network transmission (StreamSynchronization::ComputeRelativeDelay()). Then, based on the current sizes of the audio and video jitter buffers and playout buffers, the desynchronization introduced by playback is obtained, and the two are combined in StreamSynchronization::ComputeDelays(), which, after some judgment, yields the minimum playout delays used to steer each stream; these are applied to the audio and video playout buffers via syncable_audio_->SetMinimumPlayoutDelay(target_audio_delay_ms) and syncable_video_->SetMinimumPlayoutDelay(target_video_delay_ms). A simplified illustration follows.
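
The following is a deliberately simplified illustration of the combination step, not the real StreamSynchronization logic; the struct, field, and function names are hypothetical, and WebRTC additionally smooths and bounds these values:

```cpp
#include <algorithm>
#include <cstdint>

// State we assume is available per stream.
struct StreamState {
  int64_t latest_receive_time_ms;  // local arrival time of the newest RTP packet
  int64_t latest_capture_ntp_ms;   // its capture time, via the SR-derived RTP->NTP line
  int64_t current_delay_ms;        // current jitter buffer + playout buffer delay
};

// Compute extra playout delay for each stream so that they line up.
void ComputeTargetDelays(const StreamState& audio, const StreamState& video,
                         int64_t* target_audio_delay_ms, int64_t* target_video_delay_ms) {
  // Positive value => video lags behind audio on the network/receive path.
  const int64_t relative_delay_ms =
      (video.latest_receive_time_ms - audio.latest_receive_time_ms) -
      (video.latest_capture_ntp_ms - audio.latest_capture_ntp_ms);

  // Total current skew once buffering on each side is included.
  const int64_t skew_ms =
      relative_delay_ms + video.current_delay_ms - audio.current_delay_ms;

  // Ask the stream that is running ahead to wait for the other one.
  *target_audio_delay_ms = std::max<int64_t>(0, skew_ms);
  *target_video_delay_ms = std::max<int64_t>(0, -skew_ms);
}
```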

This whole sequence is driven by a timer that calls the RtpStreamsSynchronizer::Process() function.

Also note that when the sampling rate is known, a single SR packet is enough to establish the mapping (the slope is known, and one point fixes the line); without any SR packet, accurate audio/video synchronization cannot be performed.

To sum up: WebRTC achieves audio/video synchronization through the SR packet, and the core basis is the NTP time and RTP timestamp it carries. If you can follow the NTP time vs. RTP timestamp coordinate diagrams above (it really is simple: fit a straight line and solve for the NTP time), then you truly understand the principle of audio/video synchronization in WebRTC. If anything is missing or wrong, please point it out!

Origin blog.csdn.net/m0_60259116/article/details/123406334