"WebRTC Series" actual combat web end supports h265 hard solution

1. Background

The demand for real-time H.265 preview on the web has existed for a long time. However, Chrome previously had no built-in H.265 hardware decoding, and software decoding is so CPU-intensive that only a single stream could be played, so the requirement was shelved.

In September last year, Chrome shipped M106, which enables H.265 hardware decoding by default, making H.265 hardware decoding for real-time preview feasible.

However, WebRTC itself only supports the VP8, VP9, H.264, and AV1 video codecs, not H.265. According to the WebRTC Next Version Use Cases published by the W3C in 2023, there is no sign of H.265 support in the near future, so I decided to implement H.265 support on top of WebRTC myself.

2. DataChannel

As mentioned in the background, Chrome can hardware-decode H.265, but WebRTC cannot carry an H.265 video stream over its media channel directly. This limitation can be bypassed with the DataChannel.

WebRTC's DataChannel is designed to carry arbitrary data other than audio and video (which does not mean audio and video data cannot be transmitted over it; it is essentially a socket-like channel), such as short messages, real-time text chat, file transfer, remote desktop, game control, P2P acceleration, and so on.

1) SCTP protocol

DataChannel uses SCTP (Stream Control Transmission Protocol), a transport protocol on the same level as TCP and UDP that can run directly on top of IP.

In WebRTC, however, SCTP is tunneled through a secure DTLS tunnel, which itself runs on top of UDP, while still providing features such as flow control, congestion control, per-message delivery, and configurable transmission modes. Note that a single message cannot exceed maxMessageSize (read-only, default 65535 bytes).
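A minimal sketch of respecting that limit, assuming frames already arrive as a Uint8Array; the channel label and the idea of chunking a frame before sending are illustrative, not part of the original SDK:

// Read the negotiated SCTP message size limit (falls back to 65535 if not yet available).
const pc = new RTCPeerConnection();
const dc = pc.createDataChannel('video');

function maxMessageSize() {
  return (pc.sctp && pc.sctp.maxMessageSize) || 65535;
}

// Split one encoded frame into chunks no larger than the limit before sending.
function sendFrame(frame /* Uint8Array */) {
  const limit = maxMessageSize();
  for (let offset = 0; offset < frame.byteLength; offset += limit) {
    dc.send(frame.subarray(offset, Math.min(offset + limit, frame.byteLength)));
  }
}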

2) Configurable transmission mode

DataChannel can be configured in different modes. One is the reliable transmission mode (the default), which uses a retransmission mechanism to guarantee that data reaches the peer; the other is the unreliable transmission mode, configured either by setting maxRetransmits to cap the number of retransmissions, or by setting maxPacketLifeTime to cap how long retransmission may continue.

These two options are mutually exclusive and cannot be set at the same time. When both are null, reliable transmission is used; when either one is non-null, unreliable transmission is enabled.
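A minimal sketch of the two modes (option names as defined by RTCDataChannelInit; the channel labels are arbitrary):

const pc = new RTCPeerConnection();

// Reliable, ordered channel (the default): every message is retransmitted until delivered.
const reliable = pc.createDataChannel('fmp4-reliable');

// Unreliable channel: at most 5 retransmissions, out-of-order delivery allowed.
const lossy = pc.createDataChannel('fmp4-lossy', {
  ordered: false,
  maxRetransmits: 5,
});

// Setting both options at once throws a TypeError; they are mutually exclusive.
// pc.createDataChannel('invalid', { maxRetransmits: 5, maxPacketLifeTime: 1000 });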

3) Supported data types

The data channel supports the string type and the ArrayBuffer type, i.e. either string data or a binary stream.
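A minimal receive-side sketch, assuming the sender pushes binary fMP4 data; handleVideoChunk is a hypothetical handler, not part of the original SDK:

// dc: an RTCDataChannel, e.g. from pc.createDataChannel(...) or the 'datachannel' event.
dc.binaryType = 'arraybuffer'; // deliver binary messages as ArrayBuffer instead of Blob

dc.onmessage = (event) => {
  if (typeof event.data === 'string') {
    // control / signalling message
    console.log('text message:', event.data);
  } else {
    // binary payload, e.g. one fMP4 fragment
    handleVideoChunk(new Uint8Array(event.data));
  }
};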

The two solutions below are both built on top of the DataChannel.

3. Solution 1: WebCodecs

Official documentation: github.com/w3c/webcode…

Idea: transmit the raw H.265 stream over the DataChannel + decode with WebCodecs + render with Canvas. In other words, although WebRTC's audio/video transport (PeerConnection) does not support the H.265 codec, its data channel (DataChannel) can carry H.265 data; after receiving it, the front end decodes with WebCodecs and renders with Canvas (a minimal decoding sketch follows the pros and cons list below).

Advantages:

  • Transmits the raw H.265 bitstream directly with no extra container, which is simple and convenient; no redundant data, high transmission efficiency

  • WebCodecs has low decoding latency and good real-time performance

Disadvantages:

  • Audio has to be transmitted, decoded, and played separately, and audio/video synchronization has to be handled

  • The existing SDK is built around the video tag, while the WebCodecs solution relies on canvas, so existing video-related features such as screenshots and recording would have to be rewritten

  • Because many projects are already online, the existing SDK would need major changes, and there is not enough time for that
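A minimal sketch of the WebCodecs + Canvas path described above, assuming H.265 chunks arrive over the DataChannel and that the browser exposes an H.265-capable VideoDecoder; the codec string and the keyframe flag are simplified assumptions:

const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

// Decoded frames are painted straight onto the canvas.
const decoder = new VideoDecoder({
  output: (frame) => {
    ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);
    frame.close();
  },
  error: (e) => console.error('decode error', e),
});

decoder.configure({ codec: 'hvc1.1.6.L120.90' }); // illustrative H.265 codec string

let timestamp = 0;
function onH265Chunk(chunk /* Uint8Array */, isKeyFrame) {
  decoder.decode(new EncodedVideoChunk({
    type: isKeyFrame ? 'key' : 'delta',
    timestamp: (timestamp += 33333), // ~30 fps, in microseconds
    data: chunk,
  }));
}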

4. Solution 2: MSE

Official example: github.com/bitmovin/ms…

Idea: fMP4 muxing + DataChannel transmission + MSE decoding and playback. That is, the H.265 video data is first muxed into the fMP4 (fragmented MP4) format and then transmitted over the WebRTC DataChannel; after receiving it, the front end feeds it to MSE for decoding and playback.

Advantages:

  • Reuses the video tag for playback; no need to implement rendering separately

  • Audio and video are already muxed into fMP4, so the web side does not have to handle audio/video synchronization

  • The overall workload is smaller than the WebCodecs option, so it can be launched quickly

Disadvantages:

  • Muxing fMP4 on the device side may cause performance problems, so the stream has to be remuxed in real time by the cloud forwarding service, or muxed on the front end

  • MSE decoding has mediocre real-time performance (the first slice produced in the cloud introduces a delay of 1-2 seconds)

5. Solution selection

The first version goes live with the MSE solution: the muxing is done in the cloud, the front-end workload is relatively small, and the ROI is high.

The WebCodecs solution is planned for the second version: it not only has lower latency but also avoids cloud traffic consumption and saves cost. And if WebRTC officially supports H.265 while the second version is being built, we can switch to the official solution directly.

5.1 MSE and the first-version SDK changes in detail

MSE stands for Media Source Extensions. Official documentation: developer.mozilla.org/en-US/docs/…

With MSE, an HTML5-capable browser becomes a player that can handle streaming protocols.

Put another way, with MSE the HTML5 video tag can play not only the formats it supports by default (mp4, m3u8, webm, ogg, etc.), but also any video stream format that JS (using MSE) can process. This lets us convert formats that are not natively supported into supported ones (such as H.264 mp4 or H.265 fMP4) through JS.

The open-source flv.js from Bilibili is a typical application: Bilibili's HTML5 player uses MSE, with JS (flv.js) transmuxing the FLV source in real time into a stream format HTML5 supports, and feeds it to the HTML5 player for playback.

// This demo comes from the official example linked below; it runs as-is and is quite intuitive
// https://github.com/bitmovin/mse-demo/blob/main/index.html

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>MSE Demo</title>
</head>
<body>
  <h1>MSE Demo</h1>
  <div>
    <video controls width="80%"></video>
  </div>

  <script type="text/javascript">
    (function() {
      var baseUrl = 'https://bitdash-a.akamaihd.net/content/MI201109210084_1/video/720_2400000/dash/';
      var initUrl = baseUrl + 'init.mp4';
      var templateUrl = baseUrl + 'segment_$Number$.m4s';
      var sourceBuffer;
      var index = 0;
      var numberOfChunks = 52;
      var video = document.querySelector('video');

      if (!window.MediaSource) {
        console.error('No Media Source API available');
        return;
      }

      // Initialize MSE
      var ms = new MediaSource();
      video.src = window.URL.createObjectURL(ms);
      ms.addEventListener('sourceopen', onMediaSourceOpen);

      function onMediaSourceOpen() {
        // codecs: initialize the sourceBuffer
        sourceBuffer = ms.addSourceBuffer('video/mp4; codecs="avc1.4d401f"');
        sourceBuffer.addEventListener('updateend', nextSegment);

        GET(initUrl, appendToBuffer);

        // Start playback
        video.play();
      }

      function nextSegment() {
        var url = templateUrl.replace('$Number$', index);
        GET(url, appendToBuffer);
        index++;
        if (index > numberOfChunks) {
          sourceBuffer.removeEventListener('updateend', nextSegment);
        }
      }

      function appendToBuffer(videoChunk) {
        if (videoChunk) {
          // Convert the binary stream into a Uint8Array for the sourceBuffer to consume
          sourceBuffer.appendBuffer(new Uint8Array(videoChunk));
        }
      }

      function GET(url, callback) {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url);
        xhr.responseType = 'arraybuffer';

        xhr.onload = function(e) {
          if (xhr.status != 200) {
            console.warn('Unexpected status code ' + xhr.status + ' for ' + url);
            return false;
          }
          // Get the mp4 binary stream
          callback(xhr.response);
        };

        xhr.send();
      }
    })();
  </script>
</body>
</html>

With the demo above, after testing (replacing the fMP4 segments in the demo with H.265 segments from our own IPC device, i.e. a camera), Chrome can indeed hardware-decode H.265 fMP4 segments. So things became clear. With the general direction set, all that remains is: take the raw H.265 stream, convert it into fMP4 fragments, and let Chrome hard-decode it underneath.
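A quick way to verify this on a given machine is to ask MSE whether it accepts an HEVC codec string before creating the SourceBuffer; the exact codec strings below are illustrative and must match the actual track:

// Returns a usable MIME/codec string for H.265 fMP4, or null if unsupported.
function probeHevcSupport() {
  const candidates = [
    'video/mp4; codecs="hvc1.1.6.L120.90"',
    'video/mp4; codecs="hev1.1.6.L120.90"',
  ];
  return candidates.find((type) => MediaSource.isTypeSupported(type)) || null;
}

const mime = probeHevcSupport();
if (mime) {
  // Safe to call ms.addSourceBuffer(mime) once the MediaSource is open.
  console.log('H.265 via MSE supported:', mime);
} else {
  console.warn('H.265 via MSE not supported on this browser/hardware');
}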

5.2 Real-time fMP4 muxing on the front end

For muxing the raw H.265 stream into fMP4, research showed that doing it in pure JS would be a heavy workload. I also tried calling a C++ library via wasm, and found that even then the performance was not great. So the front-end muxing approach was dropped.


5.3 Real-time fMP4 muxing in the cloud

This has good performance and zero intrusion on the front end. Having settled on cloud-side muxing, let's walk through how the core pipeline evolved during development and the final process.

6. Stage One

The cloud muxes fMP4 in real time with hard-coded codecs (the audio/video codec string) -> the front end decodes and plays with MSE -> playback fails after a few seconds; MSE throws an exception that roughly means the data is wrong and the segments cannot be stitched together.

Investigation showed that while the SourceBuffer is updating, it cannot consume new data; the incoming data was simply dropped, so subsequent data could no longer be appended continuously. Since nothing may be lost, we cache it. See the code comments below for details.


const updating = this.sourceBuffer?.updating === true;
const bufferQueueEmpty = this.bufferQueue.length === 0;

  if (!updating) {
    if (bufferQueueEmpty) {
      // Cache queue is empty: consume only the current buffer
      this.appendBuffer(curBuffer);
    } else {
      // Cache queue is not empty: consume the queue plus the current buffer
      this.bufferQueue.push(curBuffer);

      // Merge the queued buffers into one
      const totalBufferByteLen = this.bufferQueue.reduce(
        (pre, cur) => pre + cur.byteLength,
        0
      );
      const combinedBuffer = new Uint8Array(totalBufferByteLen);
      let offset = 0;
      this.bufferQueue.forEach((array, index) => {
        offset += index > 0 ? this.bufferQueue[index - 1].length : 0;
        combinedBuffer.set(array, offset);
      });

      this.appendBuffer(combinedBuffer);
      this.bufferQueue = [];
    }
  } else {
    // MSE is still consuming the previous buffer (updating): cache the current buffer,
    // otherwise frames would be dropped
    this.bufferQueue.push(curBuffer);
  }

Since no frame of fMP4 data may be lost, the DataChannel initially used reliable transmission.

But testing revealed a new problem: latency grows cumulatively over time. After packet loss, the transport layer retries, and the retry time accumulates into the delay. In our tests, under poor network conditions the delay could exceed 30 seconds, and in theory it keeps growing the longer the stream is pulled.

7. Stage Two

OK, let's think differently: since the latency caused by no-frame-loss reliable transmission is completely unacceptable, what if we switch to unreliable transmission instead?

Unreliable transmission means frames will be dropped. After investigation, fMP4 can tolerate discarding an entire segment (a segment contains multiple frames). So we can design a frame-dropping algorithm: as soon as a segment is judged incomplete, discard the whole segment.

In that case, in theory the delay will be at most one segment, about 2 seconds, which is acceptable to the business layer.

Frame-dropping algorithm design: prepend 4 bytes to the header of every frame's data to identify that frame.

  • segNum: 2 bytes, big-endian, the fMP4 segment sequence number, starting at 1 and incremented by 1 each time

  • fragCount: 1 byte, the total number of fragments in this fMP4 segment, minimum 1

  • fragSeq: 1 byte, the fragment's sequence number within the fMP4 segment, starting at 1

After the front end receives each frame, it parses the first 4 bytes to get that frame's details. For example, to decide whether the current frame is the last one of its segment, I only need to check whether fragSeq equals fragCount.
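A minimal parsing sketch, assuming each DataChannel message arrives as an ArrayBuffer laid out as described above; the function name is illustrative:

// Parse the 4-byte custom header and return it together with the fMP4 payload.
function parseFrame(buffer /* ArrayBuffer, e.g. event.data from dc.onmessage */) {
  const view = new DataView(buffer);
  return {
    segNum: view.getUint16(0, false),   // 2 bytes, big-endian segment sequence number
    fragCount: view.getUint8(2),        // 1 byte, total fragments in this segment
    fragSeq: view.getUint8(3),          // 1 byte, this fragment's sequence number (from 1)
    value: new Uint8Array(buffer, 4),   // remaining bytes: the fMP4 fragment itself
  };
}

// Example: detect the last fragment of a segment.
// const frameObj = parseFrame(event.data);
// const isLastFrame = frameObj.fragSeq === frameObj.fragCount;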

The general flow chart of the algorithm:

In detail:

  • frameQueue caches each frame's data, so it can be compared with the next incoming frame to determine whether the frames form a complete, continuous segment

  • bufferQueue holds only complete segment data, ensuring there are no gaps in what MSE consumes

  /**
   * fMP4 fragment queue frameQueue: handles frame dropping and produces the bufferQueue content
   *
   * @param frameObj data for one frame
   *      For every incoming frame:
   *      do the queued frames plus the current frame form a continuous run (starting from the first fragment)?
   *        yes
   *          is the current frame the last fragment?
   *            yes: concatenate the queued frames and the current frame into a complete segment
   *                 and move it into the consumable bufferQueue
   *            no: push the current frame into the queue
   *        no: clear the queue, then push the current frame into it
   */

const frameQueueLen = this.frameQueue.length;
const frameQueueEmpty = frameQueueLen === 0;

  // A single-fragment segment is handled separately and consumed directly
  if (frameObj.fragCount === 1) {
    if (!frameQueueEmpty) {
      this.frameQueue = [];
    }
    this.bufferQueue.push(frameObj.value);
    return;
  }

  if (frameQueueEmpty) {
    this.frameQueue.push(frameObj);
    return;
  }

  // Is the first queued frame the first fragment of its segment?
  let isFirstFragSeq = this.frameQueue[0].fragSeq === 1;
  // Are the queued frames plus the current frame continuous?
  let isContinuousFragSeq = true;
  for (let i = 0; i < frameQueueLen; i++) {
    const isLast = i === frameQueueLen - 1;

    const curFragSeq = this.frameQueue[i].fragSeq;
    const nextFragSeq = isLast
      ? frameObj.fragSeq
      : this.frameQueue[i + 1].fragSeq;

    const curSegNum = this.frameQueue[i].segNum;
    const nextSeqNum = isLast
      ? frameObj.segNum
      : this.frameQueue[i + 1].segNum;

    if (curFragSeq + 1 !== nextFragSeq || curSegNum !== nextSeqNum) {
      isContinuousFragSeq = false;
      break;
    }
  }

  if (isFirstFragSeq && isContinuousFragSeq) {
    // Is the current frame the last fragment?
    const isLastFrame = frameObj.fragCount === frameObj.fragSeq;
    if (isLastFrame) {
      this.frameQueue.forEach((item) => {
        this.bufferQueue.push(item.value);
      });
      this.frameQueue = [];
      this.bufferQueue.push(frameObj.value);
    } else {
      this.frameQueue.push(frameObj);
    }
  } else {
    // A frame was lost: clearing frameQueue means discarding the whole segment
    this.emit(EVENTS_ERROR.frameDropError);
    this.frameQueue = [];
    this.frameQueue.push(frameObj);
  }

I thought that was it, but something unexpected happened.

When frame loss occurred, the algorithm above indeed discarded the whole segment, but MSE threw an exception again, complaining that the data sequence was incorrect and parsing failed.

Yet when tested locally with ffplay, playback continued after dropping a whole segment. The investigation was stuck, so I kept digging.

8. Stage Three

ChatGPT has been quite popular lately, so I gave it a try, and it did help me find the reason: when MSE consumes fMP4 data, it indexes fragments by the internal sequence numbers, so even if a whole segment is cleanly dropped, playback still fails. What now? Go back to reliable transport?

After weighing the options, we finally decided that when frame loss occurs, the front end notifies the cloud to start slicing again, and at the same time the front end re-initializes MSE.
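A minimal sketch of this recovery path, assuming a control message format agreed with the cloud; names like controlChannel, mimeCodec, and the 'reslice' message are placeholders, not the actual SDK API:

// Hypothetical recovery handler: ask the cloud for a fresh slice sequence and rebuild MSE.
function onFrameDropError(player) {
  // 1. Tell the cloud to restart slicing from a new init segment (message format is ours to define).
  player.controlChannel.send(JSON.stringify({ type: 'reslice' }));

  // 2. Drop local state and re-create the MediaSource / SourceBuffer.
  player.frameQueue = [];
  player.bufferQueue = [];
  player.mediaSource = new MediaSource();
  player.video.src = URL.createObjectURL(player.mediaSource);
  player.mediaSource.addEventListener('sourceopen', () => {
    player.sourceBuffer = player.mediaSource.addSourceBuffer(player.mimeCodec);
  });
}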

After this change, the results were decent. We configured unreliable transmission with the DataChannel retransmission count (maxRetransmits) set to 5.

The probability of frame loss drops sharply, and even when it happens there is only a loading pause of under 2 seconds before the picture resumes, which the business layer can accept.

Finally, after the three stages above, we get the complete pipeline diagram. Of course, many details are not covered here, such as using mp4box to obtain the codec string and the front end periodically checking the DataChannel state; I won't go into them. Anyone interested can leave a comment to discuss.
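For reference, a minimal sketch of reading the codec string from the fMP4 init segment with mp4box.js, which the text mentions only in passing; the import style depends on your bundler, and how this is wired into the SDK is omitted:

import MP4Box from 'mp4box'; // https://github.com/gpac/mp4box.js

// Feed the init segment to mp4box.js and read each track's codec string
// (e.g. "hvc1.1.6.L120.90"), then build the MIME string for addSourceBuffer.
function getMimeCodec(initSegment /* ArrayBuffer */) {
  return new Promise((resolve, reject) => {
    const file = MP4Box.createFile();
    file.onError = reject;
    file.onReady = (info) => {
      const codecs = info.tracks.map((t) => t.codec).join(', ');
      resolve(`video/mp4; codecs="${codecs}"`);
    };
    initSegment.fileStart = 0; // mp4box.js requires a fileStart offset on the buffer
    file.appendBuffer(initSegment);
    file.flush();
  });
}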

Here is a simple drawing of the complete pipeline.

9. Summary

The DataChannel + MSE solution is now live, and testing shows no performance problems when hardware-decoding 16 channels simultaneously in production.

Next, we will try WebCodecs for H.265, handle issues such as audio/video synchronization, and solve the latency problem completely.

In the next article I plan to write about approaches to day-to-day WebRTC troubleshooting. You are also welcome to share problems you run into in the comments, and I will summarize them together next time.

Original link: "WebRTC Series" in Practice: Supporting H.265 Hardware Decoding on the Web - Juejin


Origin blog.csdn.net/irainsa/article/details/130020347