Use of timing information in webRTC

webRTC is an asynchronous system: the two communicating parties do not need to synchronize their clocks.

This article mainly discusses how webRTC solves two time-related problems: 1. audio and video synchronization, and 2. delay-based bandwidth estimation.

Audio and video synchronization (Lip-sync)

The sending end captures, encodes and sends; the receiving end decodes and renders. The audio and video streams are processed and transmitted independently of each other, and their sampling/playback frequencies also differ, so reproducing the original capture-side experience at the receiver has never been easy. Fortunately, the human auditory/visual system has a certain tolerance. The ITU (International Telecommunication Union) gives a recommendation: if the audio stays within [-125ms, 45ms] of the video, that is, no more than 125ms behind it or 45ms ahead of it, the skew is acceptable to a human viewer, and we consider the audio and video to be in sync.
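As a minimal illustration of that tolerance window (the bounds are the ITU figures above; the helper name and sign convention are just for this sketch), a receiver could classify the measured audio/video skew like this:

```cpp
#include <cstdint>

// Skew convention for this sketch: positive means audio plays ahead of video,
// negative means audio lags behind video.
bool IsLipSyncAcceptable(int64_t audio_lead_ms) {
  // ITU recommendation: audio may lag video by up to 125 ms
  // or lead it by up to 45 ms.
  return audio_lead_ms >= -125 && audio_lead_ms <= 45;
}
```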


The principle webRTC uses to solve this problem is fairly simple: the sending end stamps the packets of the audio and video streams with timestamps that can be aligned to a common time base, and the receiving end uses these timestamps plus buffering to adjust the rendering time of each stream's audio/video frames, ultimately achieving synchronized playback. Let's look at the implementation details, taking the video stream as an example. The figure below shows the video processing pipeline; each rectangular box is a thread instance.

Three kinds of time information appear in the implementation:

1. Local system time: the time elapsed since the operating system started

2. NTP (Network Time Protocol) time: global time information, the time elapsed since 1/1/1900-00:00h

3. RTP time: the frame timestamp; taking a video sampling frequency of 90 kHz as an example, rtp_timestamp = ntp_timestamp * 90 (with ntp_timestamp in ms)

These three time coordinates are all measurements of time, but they describe it in different ways. For example, the absolute time 2020-08-05T06:08:52+00:00 is expressed by each of them as follows.

Local time 1919620051: 1919620051 ms have elapsed since boot, which is about 22.2 days.

NTP time 3805596543795: 3805596543795 ms have elapsed since 1/1/1900-00:00h.

RTP time: the RTP time is computed from the NTP time. Its unit is 1/90000 s and it is stored in a u32, so the calculation overflows and wraps around: ((u32)3805596543795)*90 = 1521922030.
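A minimal sketch of this conversion, reproducing the numbers above (the helper name is just for this example, not an actual webRTC function):

```cpp
#include <cstdint>
#include <cstdio>

// Convert an NTP time in milliseconds to a 90 kHz RTP timestamp.
// Both the truncation to uint32_t and the multiplication by 90
// wrap around modulo 2^32, exactly as described above.
uint32_t NtpMsToRtp90kHz(int64_t ntp_time_ms) {
  return static_cast<uint32_t>(ntp_time_ms) * 90u;
}

int main() {
  const int64_t ntp_ms = 3805596543795;      // 2020-08-05T06:08:52+00:00
  printf("%u\n", NtpMsToRtp90kHz(ntp_ms));   // prints 1521922030
  return 0;
}
```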

Sending end

A frame of video is recorded in the capturer thread and carries all three kinds of time information. The RTP time deserves special attention: in the packet pacer module, an offset chosen in advance is added to this rtp_timestamp before it is sent out as the final RTP time. The offset shifts the whole RTP time coordinate system, so every RTP time that leaves the sender includes it.

The video stream stamps each packet with its own RTP time, and the audio stream likewise stamps each audio packet with its own RTP time, but the time in each stream advances at its own pace and is independent of the other. If the receiving end is to render the two streams synchronously, these times must be aligned to a common time base. The figure below logically describes how the two streams are synchronized; the time numbers are only for reference and are not real data.

One of the functions of the RTCP SR (sender report) is exactly this time alignment: it maps a stream's RTP time to NTP time. All streams are aligned to the sender's NTP time, so the receiver has a unified time base.

RTCP SR format is as follows:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
header |V=2|P|    RC   |   PT=SR=200   |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         SSRC of sender                        |
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
sender |              NTP timestamp, most significant word             |
info   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |             NTP timestamp, least significant word             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         RTP timestamp                         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     sender's packet count                     |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      sender's octet count                     |
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
report |                 SSRC_1 (SSRC of first source)                 |
block  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  1    | fraction lost |       cumulative number of packets lost       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |           extended highest sequence number received           |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      interarrival jitter                      |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         last SR (LSR)                         |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                   delay since last SR (DLSR)                  |
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
report |                 SSRC_2 (SSRC of second source)                |
block  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  2    :                               ...                             :
       +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
       |                  profile-specific extensions                  |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The sender report packet consists of three sections, possibly
   followed by a fourth profile-specific extension section if defined.
   The first section, the header, is 8 octets long.  The fields have the
   following meaning:

   version (V): 2 bits
      Identifies the version of RTP, which is the same in RTCP packets
      as in RTP data packets.  The version defined by this specification
      is two (2).
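The fields that matter for synchronization are the NTP timestamp and the RTP timestamp, which describe the same instant on the sender's clock. Below is a minimal sketch of how a receiver can use one SR to map a frame's RTP timestamp onto the sender's NTP time base, assuming a fixed 90 kHz video clock (the actual webRTC code estimates the RTP clock frequency from successive SRs in RtpToNtpEstimator; the struct and function here are illustrative only):

```cpp
#include <cstdint>

// One (RTP, NTP) measurement taken from a received RTCP Sender Report.
struct SenderReportTiming {
  uint32_t rtp_timestamp;  // "RTP timestamp" field of the SR
  int64_t ntp_time_ms;     // "NTP timestamp" of the SR, converted to milliseconds
};

// Map a frame's RTP timestamp onto the sender's NTP clock, assuming 90 kHz.
int64_t RtpToSenderNtpMs(uint32_t frame_rtp, const SenderReportTiming& sr) {
  // Signed difference handles RTP wrap-around for frames near the SR instant.
  int32_t rtp_diff = static_cast<int32_t>(frame_rtp - sr.rtp_timestamp);
  return sr.ntp_time_ms + rtp_diff / 90;
}
```

Because both the audio and the video stream are mapped onto the same sender NTP clock in this way, the receiver obtains the unified time base mentioned above.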

Receiving end

As can be seen in the figure above, after network transmission the frames arriving at the receiver may have been jittered and reordered, such as frames 2/3/4 of stream1. Through the design of RTCP SR and the buffer, the receiving end works in pull mode: taking the render instant as the end point, it works backwards to derive the delay before fetching each frame from the frame queue. For a single stream this delay consists of rendering + decoding + jitter delay; synchronization between multiple streams must additionally consider the relative transmission delay between streams (see RtpStreamsSynchronizer), which finally yields the frame delay of each stream.

Multi-stream synchronization at the receiving end actually consists of two parts: smooth playback within a single stream and time-synchronized playback across streams. The delay handling for audio and video follows similar principles, but the delays are calculated differently. Below, the delay handling of the video stream is used to illustrate the process.

Smooth playback of the video stream

The video decoding thread runs a loop that keeps taking the next frame from the frame_queue for decoding and rendering. By strictly controlling the moment at which decoding and rendering start for each frame, the intervals between the final rendering of consecutive frames come out roughly equal, and the result looks fluid. We use waitTime to express this start time, that is, how long the thread waits before fetching the next frame for processing.

Notes on some important time quantities:

waitTime = render_systime - current_systime - render_cost - decoder_cost;
render_systime = local_systime + max(min_playout_delay, target_delay); // compute the render time
local_systime = rtpToLocaltime(rtp_time);  // convert the frame's RTP time to local system time
target_delay = jitter_delay + render_cost + decoder_cost; // estimated target delay
min_playout_delay  // delay adjustment parameter used for inter-stream synchronization
jitter_delay       // jitter delay
render_cost        // render delay, fixed at 10ms
decoder_cost       // decode delay

To obtain the waitTime of the frame to be decoded, you must first compute the frame's render time (render_systime): first convert the RTP time attached to the frame into local system time (local_systime; the conversion should be easy to understand, so it is not expanded here), then add a delay computed as max(min_playout_delay, target_delay). min_playout_delay is the adjustment parameter used for inter-stream synchronization; since this part is about delay handling within a stream, it can be skipped for now. target_delay is the sum of three delays estimated by the system (jitter delay + render delay + decode delay). Note that the target_delay computed here is accumulated statistically; when it actually participates in the render_systime calculation there is some extra implementation handling, which has been simplified here to make the principle easier to follow.

With the render time (render_systime), waitTime is easy to work out. Observe that each frame waits an extra jitter_delay before being processed, yet the decode interval between frames stays the same. jitter_delay exists to combat the uncertainty of network transmission, and webRTC computes its value dynamically. On a stable network the value is close to 0 and the delay it causes is small; on a weak network, where latency is unstable, the value comes out larger, trading a longer wait for smooth rendering and striking a balance between delay and smoothness.
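A minimal sketch of the waitTime computation under the simplified formulas above (all names are placeholders for this example, not the actual webRTC classes):

```cpp
#include <algorithm>
#include <cstdint>

// How long the decode thread should wait before fetching the next frame,
// following: waitTime = render_systime - current_systime - render_cost - decoder_cost.
// All times are in milliseconds; frame_local_ms is the frame's RTP time
// already converted to local system time.
int64_t ComputeWaitTimeMs(int64_t frame_local_ms,
                          int64_t now_ms,
                          int64_t jitter_delay_ms,
                          int64_t decoder_cost_ms,
                          int64_t min_playout_delay_ms) {
  const int64_t render_cost_ms = 10;  // render delay, fixed at 10 ms
  int64_t target_delay_ms = jitter_delay_ms + render_cost_ms + decoder_cost_ms;
  int64_t render_ms = frame_local_ms + std::max(min_playout_delay_ms, target_delay_ms);
  return render_ms - now_ms - render_cost_ms - decoder_cost_ms;
}
```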

Audio and video stream synchronization

Beyond smooth playback within a single stream, time alignment between streams is also needed; the figure below illustrates this with an example. Assume that the latest pair of audio and video packets were sampled at the same moment (in practice they can be sampled at different instants; this is a simplification), then we also expect them to be rendered at the receiving end at the same time. The whole path involves two main pieces of time information: the transmission delay (xxx_transfer_delay) and the receiver's processing delay (xxx_current_delay). The two streams, each on its own clock, maintain their own delay information. If the audio and video packets are to be rendered after the same total delay, each stream can be given a delay adjustment parameter (xxx_min_playout_delay), and by increasing or decreasing this parameter the total delays of the two streams are brought close to equal.

In webRTC, these delay adjustment parameters (xxx_min_playout_delay) are recalculated and adjusted periodically (every 1 s). Following the example above, the pseudocode looks roughly like this.

// Time diff: video vs audio
time_diff = (video_transfer_delay + video_current_delay) - (audio_transfer_delay + audio_current_delay);
if (time_diff > 0) {   // video is slower
    down(video_min_playout_delay);
    up(audio_min_playout_delay);
} else {               // video is faster
    up(video_min_playout_delay);
    down(audio_min_playout_delay);
}
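A slightly more concrete sketch of one such periodic adjustment step. The idea of splitting the difference between the two streams follows the pseudocode above; the half-step and the clamp bound are assumptions made for this illustration, not webRTC's actual constants:

```cpp
#include <algorithm>
#include <cstdint>

// One ~1 s adjustment step: distribute the measured audio/video difference
// across the two streams' min playout delays.
// time_diff_ms = (video_transfer_delay + video_current_delay)
//              - (audio_transfer_delay + audio_current_delay)
void AdjustPlayoutDelays(int64_t time_diff_ms,
                         int64_t& video_min_playout_ms,
                         int64_t& audio_min_playout_ms) {
  const int64_t kMaxStepMs = 80;  // illustrative cap so delays do not jump abruptly
  int64_t step = std::clamp(time_diff_ms / 2, -kMaxStepMs, kMaxStepMs);
  // time_diff > 0 means video is slower: pull video forward, hold audio back.
  video_min_playout_ms = std::max<int64_t>(0, video_min_playout_ms - step);
  audio_min_playout_ms = std::max<int64_t>(0, audio_min_playout_ms + step);
}
```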

Delay-based bandwidth estimation

One of the successes of webRTC lies in the design of its congestion control algorithm. The basic data comes from the sender's packet-loss statistics and from statistics of packet reception times. Here we only discuss how RTP packet reception times are recorded and fed back, not the congestion algorithm itself.

The congestion control logic now runs on the sending end by default. Computing the delay requires both the send time T and the receive time t; the sender can keep T for each packet itself, so the receiver only needs to feed back t. Since the algorithm needs the reception time of every packet, the amount and frequency of feedback are considerable, and webRTC has designed this exchange carefully.
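The essence of the delay-based estimate is comparing how packet spacing changes between sending and receiving. A minimal sketch of that per-packet delta (the real estimator works on packet groups and feeds the deltas into a trendline filter; this simplified form only shows the core quantity):

```cpp
#include <cstdint>

// Delay variation between two consecutive packets:
//   d = (t_i - t_{i-1}) - (T_i - T_{i-1})
// T = send time kept by the sender, t = receive time fed back by the receiver.
// A growing positive trend in d suggests queues are building up along the path.
int64_t DelayDeltaMs(int64_t send_prev_ms, int64_t send_curr_ms,
                     int64_t recv_prev_ms, int64_t recv_curr_ms) {
  return (recv_curr_ms - recv_prev_ms) - (send_curr_ms - send_prev_ms);
}
```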

By default, webRTC estimates the transmission bandwidth at the sender. The media flows through the RTP/UDP protocol stack, and the UDP layer has no bandwidth-estimation capability, so webRTC extends the RTP/RTCP formats to allow transport-layer bandwidth estimation at the sender.

RTP format

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers(if mixed)   |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           header extension (optional)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           payload header (format depended)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           payload data                                        |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

The header extension field carries the following extension content:

     0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |       0xBE    |    0xDE       |           length=1            |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |  ID   | L=1   |transport-wide sequence number | zero padding  |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Tip: notice that there are two sequence numbers in an RTP packet.

sequence number: an RTP-layer concept, used for reordering and demultiplexing RTP streams. For example, when multiple streams are multiplexed together, each stream keeps its own self-incrementing sequence.

transport-wide sequence number: a transport-layer concept that identifies packets at the transport layer and is used for transport-layer rate statistics. Its increment is not affected by multi-stream multiplexing, because multiplexing happens at the RTP layer.

When sending, the sender adds a transport-wide sequence number to each RTP packet (PacketRouter::SendPacket), for example seq = 53, 54, 55.

After receiving the packet, the receiving end records the arrival time of the packet. The recorded time is the local internal timestamp (unit ms), that is, how long it has been since the computer was turned on.

packet_arrival_times_[53]=1819746010
packet_arrival_times_[54]=1819746020
packet_arrival_times_[55]=1819746026 
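A minimal sketch of this bookkeeping on the receiver side, keyed by transport-wide sequence number (the real code around RemoteEstimatorProxy additionally handles sequence-number wrap-around and prunes old entries; this only shows the idea):

```cpp
#include <cstdint>
#include <map>

// Arrival time of each transport-wide sequence number, in milliseconds
// of the receiver's local clock (time since the machine started).
std::map<int64_t, int64_t> packet_arrival_times_;

void OnPacketArrival(int64_t transport_wide_seq, int64_t arrival_time_ms) {
  // Keep the first observed arrival time; duplicates are ignored.
  packet_arrival_times_.emplace(transport_wide_seq, arrival_time_ms);
}
```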

The RemoteEstimatorProxy module at the receiving end is responsible for feedback of transport layer statistics, and periodically feeds back packet reception time information to the sending end.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |V=2|P| FMT=CCFB |   PT = 205   |          length               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                  SSRC of packet sender                        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                  SSRC of 1st media source                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |          begin_seq            |             end_seq           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |L|ECN|  Arrival time offset    | ...                           .
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    .                                                               .
    .                                                               .
    .                                                               .
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                  SSRC of nth media source                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |          begin_seq            |             end_seq           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |L|ECN|  Arrival time offset    | ...                           |
    .                                                               .
    .                                                               .
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                 Report Timestamp (32bits)                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

RTCP transport feedback is generally the most frequently transmitted content on the RTCP channel, and webRTC has a special design for its transmission. Note the following parameters:

max_interval = 250ms         // maximum feedback interval
min_interval = 50ms          // minimum feedback interval
rtcp_ratio = 5%              // share of bandwidth allocated to feedback
avg_feedback_size = 68bytes  // average size of one feedback packet

The RTCP transport feedback send interval is kept within [50ms, 250ms]. Within this range it is adjusted dynamically according to the current bandwidth, trying to keep the share of bandwidth used by transport feedback at about 5%. From the bounds we can compute the bandwidth taken by the feedback alone, [2176bps, 10880bps], which is already a considerable expense.
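The boundary figures follow directly from the parameters above: one feedback packet is 68 bytes = 544 bits, so at one packet per 250 ms (4 per second) the cost is 4 * 544 = 2176 bps, and at one per 50 ms (20 per second) it is 20 * 544 = 10880 bps. A tiny check:

```cpp
#include <cstdio>

int main() {
  const double kFeedbackBits = 68 * 8;                   // 544 bits per feedback packet
  printf("%.0f bps\n", kFeedbackBits * (1000.0 / 250));  // 2176 bps at the 250 ms interval
  printf("%.0f bps\n", kFeedbackBits * (1000.0 / 50));   // 10880 bps at the 50 ms interval
  return 0;
}
```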

Summary

This article summarized the three timing types in webRTC (local time, NTP time, and RTP time) and analyzed how time information is used in two topics: audio and video synchronization and delay-based bandwidth estimation.

Appendix

Attached are several key data structures related to timing in the webRTC project.

Capture

class webrtc::VideoFrame{
...
  uint16_t id_;            //picture id
  uint32_t timestamp_rtp_; //rtp timestamp, (u32)ntp_time_ms_ *90
  int64_t ntp_time_ms_;    //ntp timestamp, capture time since 1/1/1900-00:00h
  int64_t timestamp_us_;   //internal timestamp, capture time since system started, round at 49.71days
}

VideoStreamEncoder::OnFrame // calculate capture timing

VideoStreamEncoder::OnEncodedImage // fill capture timing

RtpVideoSender::OnEncodedImage // timestamp_rtp_+random value

class webrtc::EncodedImage{
...
  //RTP Video Timing extension
  //https://webrtc.googlesource.com/src/+/refs/heads/master/docs/native-code/rtp-hdrext/video-timing
  struct Timing {
    uint8_t flags = VideoSendTiming::kInvalid;
    int64_t encode_start_ms = 0;                //frame encoding start time, based on ntp_time_ms_
    int64_t encode_finish_ms = 0;               //frame encoding end time, based on ntp_time_ms_
    int64_t packetization_finish_ms = 0;        //encoded frame packetization time, based on ntp_time_ms_
    int64_t pacer_exit_ms = 0;                  //packet sent time when leaving pacer, based on ntp_time_ms_
    int64_t network_timestamp_ms = 0;           //reserved for network node
    int64_t network2_timestamp_ms = 0;          //reserved for network node
    int64_t receive_start_ms = 0;
    int64_t receive_finish_ms = 0;
  } timing_;
  uint32_t timestamp_rtp_;     //same as capturer.timestamp_rtp_
  int64_t ntp_time_ms_;        //same as capturer.ntp_time_ms_
  int64_t capture_time_ms_;    //same as capturer.capture_time_ms_
}

RTPSenderVideo::SendVideo

class webrtc::RtpPacketToSend{
...
  // RTP Header.
  bool marker_;                //frame end marker
  uint16_t sequence_number_;   //RTP sequence number, start at random(1,32767)
  uint32_t timestamp_;         //capturer timestamp_rtp_ + u32.random()
  uint32_t ssrc_;              //Synchronization Source, specify media source
  
  int64_t capture_time_ms_;    //same as capturer.capture_time_ms_
}

===

Receiver side

RtpTransport::DemuxPacket

class webrtc::RtpPacketReceived{
...
  NtpTime capture_time_;
  int64_t arrival_time_ms_; //RTP packet arrival time, local internal timestamp
  // RTP Header.
  bool marker_;                //frame end marker
  uint16_t sequence_number_;   //RTP sequence number, start at random(1,32767)
  uint32_t timestamp_;         //sender's rtp timestamp maintained by RTPSenderVideo 
  uint32_t ssrc_;              //Synchronization Source, specify media source
}

RtpVideoStreamReceiver::ReceivePacket /OnReceivedPayloadData

struct webrtc::RTPHeader{
  ...
  bool markerBit;
  uint16_t sequenceNumber;                //RTP sequence, set by sender per RTP packet
  uint32_t timestamp;                     //sender's RTP timestamp  
  uint32_t ssrc;
  RTPHeaderExtension extension;           //contains PlayoutDelay&VideoSendTiming if has
}
class webrtc::RtpDepacketizer::ParsedPayload{
    RTPVideoHeader video;

    const uint8_t* payload;
    size_t payload_length;
}

class webrtc::RTPVideoHeader{
  ...
  bool is_first_packet_in_frame;
  bool is_last_packet_in_frame;
  PlayoutDelay playout_delay;     //playout delay extension
  VideoSendTiming video_timing;   //Video Timing extension, align with sender's webrtc::EncodedImage::timing
}

class webrtc::VCMPacket{
...
  uint32_t timestamp;            //sender's RTP timestamp
  int64_t ntp_time_ms_;
  uint16_t seqNum;
  RTPVideoHeader video_header;
  RtpPacketInfo packet_info;
}
class webrtc::RtpPacketInfo{
  ...
  uint32_t ssrc_;
  uint32_t rtp_timestamp_;        //sender's rtp timestamp
  //https://webrtc.googlesource.com/src/+/refs/heads/master/docs/native-code/rtp-hdrext/abs-capture-time
  absl::optional<AbsoluteCaptureTime> absolute_capture_time_; //
  int64_t receive_time_ms_;       //packet receive time, local internal timestamp 
}

PacketBuffer::InsertPacket

class webrtc::video_coding::RtpFrameObject: public EncodedImage{
...
  RTPVideoHeader rtp_video_header_;
  uint16_t first_seq_num_;
  uint16_t last_seq_num_;
  int64_t last_packet_received_time_;
  int64_t _renderTimeMs;
  //inherit from webrtc::EncodedImage
  uint32_t timestamp_rtp_;											
  int64_t ntp_time_ms_;										     
  int64_t capture_time_ms_;										 
}

