WebRTC video bit rate control (1) - CPU usage detection

WebRTC uses CPU usage as one of the inputs to its rate control. When the CPU is overused (overusing), video encoding is downgraded (adapt down); when the CPU has spare capacity (underusing), video encoding is upgraded (adapt up). The goal is to deliver the highest video quality the current device performance allows, where quality is a composite of indicators such as clarity (resolution) and fluency (frame rate).

The CPU usage detection code in WebRTC is mainly implemented in the OveruseFrameDetector class in overuse_frame_detector.cc, and its instance overuse_detector_ is passed in as a parameter when VideoStreamEncoder is constructed. Broadly, OveruseFrameDetector is responsible for processing the detection data, while VideoStreamEncoder is responsible for initialization, feeding data, the feedback callback and the surrounding plumbing.

1. What is "CPU usage"?

First, we need to clarify what "CPU usage" means here. As the name suggests, we tend to assume it is related to actual CPU utilization. But reading OveruseFrameDetector, apart from the word "CPU" in comments, there is no CPU-related code at all. In fact, the so-called "CPU usage" is the ratio of encode time to capture interval. The larger the ratio, the more encoding lags behind capture, meaning the encoder has hit its performance ceiling and encoding should be adapted down; conversely, the smaller the ratio, the more encoding headroom there is, and encoding can be adapted up to provide better video quality. In other words, it is a producer-consumer relationship.
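
As a rough worked example (the numbers here are illustrative, not from the source): at 30 fps a new frame arrives about every 33 ms. If encoding each frame takes about 20 ms, the usage is 100 × 20 / 33 ≈ 61%, comfortably between the default software-encoding thresholds of 42 and 85 introduced below; if encoding slows to 30 ms per frame, usage rises to roughly 91% and the detector starts reporting overuse.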

So "CPU usage" is not directly tied to the CPU; instead, the current runtime performance is measured by how well encoding keeps up with capture. This has several advantages:

It is not bound to specific CPU hardware, so it is decoupled from the hardware platform

It accounts for the current software environment, such as the CPU consumed by other applications and the operating system's scheduling of resources across processes

How this ratio is computed is described below.

2. Initialization and configuration

Like other observers, OveruseFrameDetector is registered when VideoStreamEncoder is constructed. The configuration items are initialized when OveruseFrameDetector itself is constructed; the two most important ones are:

low_encode_usage_threshold_percent: the underuse threshold, default 42. Below it, the state is considered underusing.

high_encode_usage_threshold_percent: the overuse threshold, default 85. Above it, the state is considered overusing.
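
For reference, here is a minimal sketch of the relevant CpuOveruseOptions fields with the values quoted in this article filled in. This is not the full struct; the real defaults live in the CpuOveruseOptions constructor, and additional fields exist.

// Sketch only: just the fields discussed in this article.
struct CpuOveruseOptions {
  int low_encode_usage_threshold_percent = 42;   // below this: underusing
  int high_encode_usage_threshold_percent = 85;  // above this: overusing
  int high_threshold_consecutive_count = 2;      // consecutive checks required for overuse
  int filter_time_ms = 0;  // smoothing window (assumed default here); overridden in GetCpuOveruseOptions() when the experiment is enabled
};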

Changes to configuration items only occur when the encoder is recreated (stream initialization, or encoding format changes):

void VideoStreamEncoder::ReconfigureEncoder() {
  //...
  if (pending_encoder_creation_) {
    overuse_detector_->StopCheckForOveruse();
    overuse_detector_->StartCheckForOveruse(
        &encoder_queue_,
        GetCpuOveruseOptions(
            settings_, encoder_->GetEncoderInfo().is_hardware_accelerated),
        this);
    pending_encoder_creation_ = false;
  }
  //...
}
CpuOveruseOptions GetCpuOveruseOptions(
    const VideoStreamEncoderSettings& settings,
    bool full_overuse_time) {
  CpuOveruseOptions options;
  if (full_overuse_time) {
    options.low_encode_usage_threshold_percent = 150;
    options.high_encode_usage_threshold_percent = 200;
  }
  if (settings.experiment_cpu_load_estimator) {
    options.filter_time_ms = 5 * rtc::kNumMillisecsPerSec;
  }
  return options;
}

As shown above, when the encoder reports is_hardware_accelerated, i.e. hardware encoding is in use, the thresholds are raised to 150 and 200 respectively, far above the 42 and 85 used for software encoding. The reason is that hardware encoding runs on a dedicated video processing unit (collectively called the VPU here) rather than on the CPU, so it can be "squeezed" harder and kept at a high load without dragging down overall system performance. In addition, because hardware encoding runs asynchronously (input and output are not on the same thread), the timing measurements carry some error, and looser thresholds reduce the impact of that error. This issue comes up again later.

3. Detection start and end

Starting and stopping detection is controlled by VideoStreamEncoder and happens when the encoder needs to be recreated:

void VideoStreamEncoder::ReconfigureEncoder() {
  //...
  if (pending_encoder_creation_) {
    overuse_detector_->StopCheckForOveruse(); // Terminate detection
    overuse_detector_->StartCheckForOveruse( // enable detection
        &encoder_queue_,
        GetCpuOveruseOptions(
            settings_, encoder_->GetEncoderInfo().is_hardware_accelerated),
        this);
    pending_encoder_creation_ = false;
  }
  //...
}

The check itself then runs periodically, every 5 seconds, on the task queue that was passed in:

void OveruseFrameDetector::StartCheckForOveruse(...) {
  //...
  check_overuse_task_ = RepeatingTaskHandle::DelayedStart(
      task_queue->Get(), TimeDelta::ms(kTimeToFirstCheckForOveruseMs),
      [this, overuse_observer] {
        CheckForOveruse(overuse_observer); // Complete the judgment and feedback here
        return TimeDelta::ms(kCheckForOveruseIntervalMs); // kCheckForOveruseIntervalMs = 5000 (ms)
      });
  //...
}
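
The stop side is simpler: it essentially just cancels the repeating task. A minimal sketch of what StopCheckForOveruse amounts to (the real function may also include thread checks and other housekeeping):

void OveruseFrameDetector::StopCheckForOveruse() {
  //...
  check_overuse_task_.Stop(); // Cancel the repeating 5-second check
}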

4. Sample data collection

To obtain the ratio of encode time to capture interval, samples of both quantities must be collected. Specifically:

Encode time = encoding end time of this frame - encoding start time of this frame

Capture interval = encoding start time of this frame - encoding start time of the previous frame

So the raw samples that actually need to be collected are the encoding start time and the encoding end time of each frame.

4.1 Encoding start time

To keep the timestamp accurate, it is recorded as soon as the video frame arrives at VideoStreamEncoder:

// video_stream_encoder.cc
void VideoStreamEncoder::OnFrame(const VideoFrame& video_frame) {
  //...
  int64_t post_time_us = rtc::TimeMicros();
  //...
  // post_time_us eventually reaches MaybeEncodeVideoFrame(incoming_frame, post_time_us)
}

Passed to OveruseFrameDetector during encoding:

void VideoStreamEncoder::EncodeVideoFrame(const VideoFrame& video_frame,
                                          int64_t time_when_posted_us) {
  //...
  overuse_detector_->FrameCaptured(out_frame, time_when_posted_us);
  //...
}

4.2 Encoding end time

The encoding end time is recorded in the callback that fires when the encoder outputs an encoded image:

EncodedImageCallback::Result VideoStreamEncoder::OnEncodedImage(...) {
   //...
   RunPostEncode(image_copy, rtc::TimeMicros(), temporal_index);
   //...
}

Passed to OveruseFrameDetector in RunPostEncode:

void VideoStreamEncoder::RunPostEncode(EncodedImage encoded_image,
                                       int64_t time_sent_us,
                                       int temporal_index) {
  //...
  overuse_detector_->FrameSent(
      encoded_image.Timestamp(), time_sent_us,
      encoded_image.capture_time_ms_ * rtc::kNumMicrosecsPerMillisec,
      encode_duration_us);
  //...
}

5. Calculation process

With the data collected, the next step is the calculation.

5.1 Capture interval calculation

The capture interval is simple to compute: it is just the time difference between the current frame and the previous one:

// overuse_frame_detector.cc
void FrameCaptured(const VideoFrame& frame,
                   int64_t time_when_first_seen_us,
                   int64_t last_capture_time_us) override {
  // Compute the interval since the previous frame
  if (last_capture_time_us != -1)
    AddCaptureSample(1e-3 * (time_when_first_seen_us - last_capture_time_us));
  // Save the frame's timestamps for the later encode-time calculation
  frame_timing_.push_back(FrameTiming(frame.timestamp_us(), frame.timestamp(),
                                      time_when_first_seen_us));
}
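
The entries pushed into frame_timing_ look roughly like this; the field names below are inferred from how they are used in FrameCaptured and FrameSent, and the real struct may carry more:

// Sketch of the per-frame record kept in frame_timing_.
struct FrameTiming {
  int64_t capture_time_us;    // frame.timestamp_us(), the capture timestamp
  uint32_t timestamp;         // frame.timestamp(), the RTP timestamp used for matching
  int64_t capture_us;         // time_when_first_seen_us, treated as the encode start time
  int64_t last_send_us = -1;  // filled in by FrameSent; -1 until then
};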

5.2 Encode time calculation

Calculating the encode time is more involved. The main reason is that the end-of-encoding callback, VideoStreamEncoder::OnEncodedImage, and the start-of-encoding path, VideoStreamEncoder::OnFrame, do not run on the same thread, i.e. they are asynchronous. This raises a question: how do we match an encode-end event to the encode-start event of the same frame? The following situations make the matching even harder:

With multiple spatial layers (e.g. simulcast), the end-of-encoding callbacks of the different layers arrive interleaved and out of order

Encoding failures or dropped frames mean some frames never get an encoding end time, so they can never be matched correctly

With hardware encoders (such as MediaCodec on Android), the input and output sides of the API may live on different threads, which further increases the asynchrony between encode start and encode end

Worse, once one match goes wrong, the error accumulates frame after frame ("one step wrong, every step wrong") and the statistics eventually become useless.

To solve this, WebRTC assumes that every frame finishes encoding within 1 second, and matches frames with the following method:

// overuse_frame_detector.cc
absl::optional<int> FrameSent(
      uint32_t timestamp,
      int64_t time_sent_in_us,
      int64_t /* capture_time_us */,
      absl::optional<int> /* encode_duration_us */) {
    absl::optional<int> encode_duration_us;
    // Assume every frame finishes encoding within a 1-second window
    static const int64_t kEncodingTimeMeasureWindowMs = 1000;
    // Find the pending frame with the same timestamp and update its encode end
    // time; with layered encoding this is refreshed by each layer's callback
    for (auto& it : frame_timing_) {
      if (it.timestamp == timestamp) {
        it.last_send_us = time_sent_in_us;
        break;
      }
    }
    // Walk the pending frames from the oldest and report those whose
    // measurement window has expired
    while (!frame_timing_.empty()) {
      FrameTiming timing = frame_timing_.front(); // Oldest entry, i.e. furthest from time_sent_in_us
      // Still inside the 1-second window: stop here, keep the entry, and check
      // again when the next frame is sent
      if (time_sent_in_us - timing.capture_us <
          kEncodingTimeMeasureWindowMs * rtc::kNumMicrosecsPerMillisec) {
        break;
      }
      // Past the 1-second window: this frame's encode time can now be counted.
      // Note that timing.last_send_us is used for the difference, not time_sent_in_us
      if (timing.last_send_us != -1) {
        encode_duration_us.emplace(
            static_cast<int>(timing.last_send_us - timing.capture_us));
        if (last_processed_capture_time_us_ != -1) {
          int64_t diff_us = timing.capture_us - last_processed_capture_time_us_;
          AddSample(1e-3 * (*encode_duration_us), 1e-3 * diff_us);
        }
        last_processed_capture_time_us_ = timing.capture_us;
      }
      // Remove the frame from the pending list
      frame_timing_.pop_front();
    }
    return encode_duration_us;
}

As this shows, a call to FrameSent does not necessarily produce a measurement for the current frame; what gets counted may be one or more older frames whose 1-second window has just expired.

5.3 Data smoothing

To prevent the detection result from jittering and degrading the user experience, the data needs to be smoothed. The exponential filter rtc::ExpFilter is used here.

// overuse_frame_detector.h
std::unique_ptr<rtc::ExpFilter> filtered_processing_ms_;
std::unique_ptr<rtc::ExpFilter> filtered_frame_diff_ms_;
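
Conceptually, an exponential filter blends each new sample into a running estimate, so a single unusually slow or fast frame cannot swing the result. Here is a minimal sketch of the idea; it is illustrative only and not the actual rtc::ExpFilter implementation, which takes an extra exponent argument so the weight can also depend on how much time each sample spans.

// Illustrative exponential smoothing; alpha close to 1.0 means heavier smoothing.
struct SimpleExpFilter {
  float alpha;
  float filtered = -1.0f;  // -1 means "no sample yet"
  float Apply(float sample) {
    filtered = (filtered < 0.0f) ? sample
                                 : alpha * filtered + (1.0f - alpha) * sample;
    return filtered;
  }
};

Both filtered_processing_ms_ (the encode time) and filtered_frame_diff_ms_ (the capture interval) are smoothed this way before the ratio is taken in the next step.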

5.4 Ratio calculation

After each frame's data is recorded, the detection result is updated immediately:

void OveruseFrameDetector::EncodedFrameTimeMeasured(int encode_duration_ms) {
  //...
  encode_usage_percent_ = usage_->Value();
  //...
}
int Value() override {
  //...
  float frame_diff_ms = std::max(filtered_frame_diff_ms_->filtered(), 1.0f);
  frame_diff_ms = std::min(frame_diff_ms, max_sample_diff_ms_); // Note the upper cap at max_sample_diff_ms_
  float encode_usage_percent =
      100.0f * filtered_processing_ms_->filtered() / frame_diff_ms;
  return static_cast<int>(encode_usage_percent + 0.5); // Round to the nearest integer
}
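
As a quick worked example with illustrative numbers: a smoothed encode time of 28 ms and a smoothed capture interval of 33 ms give 100 × 28 / 33 ≈ 84.8, which the +0.5 rounding turns into 85, exactly at the software-encoding overuse threshold.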

6. Feedback regulation

Judging the detection result and feeding the adjustment back are done in OveruseFrameDetector::CheckForOveruse, which is called every 5 seconds:

void OveruseFrameDetector::CheckForOveruse(
    AdaptationObserverInterface* observer) { // observer points to the VideoStreamEncoder object
  if (IsOverusing(*encode_usage_percent_)) {
    //...
    observer->AdaptDown(kScaleReasonCpu); // encoding downgrade
  } else if (IsUnderusing(*encode_usage_percent_, now_ms)) {
    //...
    observer->AdaptUp(kScaleReasonCpu); // Encoding upgrade
  }
}

IsOverusing only returns true after the usage has exceeded the threshold twice in a row:

bool OveruseFrameDetector::IsOverusing(int usage_percent) {
  RTC_DCHECK_RUN_ON(&task_checker_);
  if (usage_percent >= options_.high_encode_usage_threshold_percent) {
    ++checks_above_threshold_;
  } else {
    checks_above_threshold_ = 0;
  }
  return checks_above_threshold_ >= options_.high_threshold_consecutive_count; // high_threshold_consecutive_count is 2
}

IsUnderusing, on the other hand, only needs the usage to be below the threshold once:

bool OveruseFrameDetector::IsUnderusing(int usage_percent, int64_t time_now) {
  //...
  return usage_percent < options_.low_encode_usage_threshold_percent;
}
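
Combined with the 5-second check interval, this means adapt-down only fires after usage stays above the high threshold for two consecutive checks, i.e. roughly 10 seconds of sustained overload, while adapt-up can fire after a single check below the low threshold.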
