Architecture Design and Evolution of Cloud Editing, a B-End Online Editing Tool

Tencent Cloud Audio and Video's cloud editing is committed to letting customers quickly integrate editing capabilities into their own applications (Web, applets), and its powerful template capabilities can greatly improve video production efficiency. We ran into many challenges while exploring B-side online editing products: how do we serve both fast integration and deep customization? How do we design a general-purpose, high-performance, and flexibly extensible rendering engine? How do we guarantee the efficiency and quality of cloud video synthesis? LiveVideoStackCon 2022 Beijing invited Mr. Cheng Ruilin to share how his team answered this series of questions.

Text/Cheng Ruilin

Edit/LiveVideoStack

Hello everyone, my name is Cheng Ruilin, and I am in charge of the cloud editing module of the Tencent Cloud Intelligent Creation Platform. Today I would like to share some interesting lessons from building a cloud video front-end editing tool. The sharing covers three topics:

① The first is why cloud editing exists, so that everyone understands the application scenarios of B-side cloud editing;

② The second is the design and evolution of the front-end and back-end architecture of cloud editing. This part focuses on the design of the rendering engine and its main Web applications, as well as the design of the front-end page and the server side built around this engine;

③ The third is the technology outlook for the online video editing business. Practitioners in this space are naturally interested in WebCodecs, officially supported since Chrome 94, so we will also talk about its limitations and prospects in the field of video production.


Cloud editing is a sub-module of the Tencent Smart Creation Platform. It was born to let our customers' users conveniently produce video content on the customer's own platform. Once a video is produced, it can flow into follow-up processes such as video review, live streaming, and sharing.

-01-

Applications of cloud editing


For the application scenarios of cloud editing: on the Web side, we have implemented a powerful online editing tool that lets users complete video editing work right after opening the webpage. We also provide an applet plug-in, which lets you quickly integrate an editing service into your own applet.


It also supports powerful video templates, which come in two main forms. The first is a video template generated from an editing project: after completing a project on the editing page, you can mark elements inside it as slots, and then generate new videos in batches by replacing the slot contents on the Web, in the applet, or on the server. The second is exporting an After Effects (AE) project as a video template through our AE plug-in; the system automatically recognizes the slots in the AE project, and new videos are generated by replacing the slot contents on the Web or on the server.


This is the preview effect of an AE video template on the front-end page. You can replace the video footage or text content on the right to make a new video; at the same time, the open API lets you replace the contents programmatically to render this template in batches.


Recently, we also launched a new digital human editing capability. It has three main characteristics.

First, the digital human is deeply integrated with the editing track, so users can flexibly configure and edit digital human videos;

Second, it supports both text-driven and audio-driven modes for producing digital human videos;

Third, users can quickly customize an exclusive personal avatar: they can provide photos or videos to build an exclusive digital human, and then drive it with text or voice to produce more personal digital human videos.


Customers can use the cloud editing capability in two ways. The first is PaaS access: by following the integration guide on Tencent Cloud's official website, the complete editing feature can be embedded into the customer's own Web application. The picture on the left shows Tencent Conference embedding the front-end editing project through an iFrame. The second is through our front-end components and server-side APIs. The picture on the right shows a scene in Tencent Conference similar to YouTube's simple trimming: after a meeting, customers can quickly trim the key content of the recording to generate a new video. There is also a C-side address; essentially, our C-end site is itself a B-end client of cloud editing, and all of its capabilities and interfaces are the same as those exposed to external customers. Tencent Conference uses the front-end components and server-side APIs to build a simple editing scene, but cloud editing offers much more than that: customers can combine their own Web application's design style and business capabilities to build a completely different front-end editing page.


Customers integrate cloud editing into their own business in three steps. The first step is to create a Tencent Cloud account and activate VOD (video on demand); all media resources are stored in cloud VOD. The second step is to create a project through the API, import the media resources from VOD into this project, and return a signature to the front end. In the third step, the front end initializes the iFrame page with the signature returned in the previous step. The editing project can then be opened immediately with the media resources injected by the server already in place, and the front-end page can interact with the iFrame through the API at any time: modifying the project content, injecting new elements, and allowing users to upload their own media resources.
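As a rough illustration of this three-step flow, here is a minimal sketch, assuming a hypothetical signature endpoint, iFrame URL format, and message protocol (the real Tencent Cloud parameters and SDK differ):

```ts
// Minimal sketch of the three-step integration. Endpoint, URL format, and
// message shape are hypothetical placeholders, not the real Tencent Cloud API.

// Step 2 happens on the customer's own server: it calls the cloud editing API
// to create a project, injects VOD media, and returns a short-lived signature.
async function fetchEditorSignature(userId: string): Promise<{ projectId: string; signature: string }> {
  const res = await fetch(`/api/editor-signature?userId=${encodeURIComponent(userId)}`);
  return res.json();
}

// Step 3 happens on the front end: mount the editor iFrame with the signature,
// then talk to it through postMessage.
export async function mountEditor(container: HTMLElement, userId: string): Promise<HTMLIFrameElement> {
  const { projectId, signature } = await fetchEditorSignature(userId);

  const iframe = document.createElement('iframe');
  iframe.src = `https://editor.example.com/project/${projectId}?sign=${signature}`; // hypothetical URL
  iframe.style.width = '100%';
  iframe.style.height = '100%';
  container.appendChild(iframe);

  // Example of host-page -> editor interaction: inject a new media element.
  iframe.addEventListener('load', () => {
    iframe.contentWindow?.postMessage(
      { type: 'addMedia', payload: { fileId: 'vod-file-id-placeholder' } },
      '*',
    );
  });
  return iframe;
}
```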

-02-

Design and evolution of cloud editing front-end and back-end architecture

The basic capabilities and application scenarios of B-side cloud editing were introduced above; next comes the design and evolution of the front-end and back-end architecture of cloud editing.


There are three main technical requirements for cloud editing. The first is real-time rendering: the canvas must respond to timeline updates in real time. The second is support for complex interactions, including operations on media resources, operations on the timeline, and updates of canvas elements, while keeping the experience smooth and the data stable. The third is multi-end rendering; the figure above shows the three rendering scenarios on the Web side, the applet side, and the server side, and consistency of the rendering results across these ends is achieved by design.


This is the overall architecture of our rendering engine. First of all, the rendering engine is driven by track data and time, and the team has done a lot of work to ensure that the content of every frame is rendered accurately. It is easy to write non-deterministic, non-blocking code on the front end: on many online editing tools you can see certain elements appear late when playing or seeking to a given frame, and rendering uncertainty introduced by the network is unacceptable to us. Second, the engine uses a game-engine-style parent-child hierarchy tree, which greatly improves the extensibility of material types; we call every track element a Clip. The engine also supports PAG materials, so it connects well with the PAG effect and template ecosystem.
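To make the Clip-based hierarchy concrete, here is a simplified sketch; the class and method names are illustrative, not the engine's actual API:

```ts
// Simplified sketch of a track-data/time-driven Clip hierarchy.
// All names are illustrative, not the actual engine API.

interface TrackClipData {
  startTime: number;   // seconds on the timeline
  duration: number;    // seconds
  transform: { x: number; y: number; scale: number; rotation: number };
}

abstract class Clip {
  readonly children: Clip[] = [];

  constructor(protected data: TrackClipData) {}

  addChild(child: Clip): void {
    this.children.push(child);
  }

  // Whether this clip should appear at the given timeline time.
  isActiveAt(time: number): boolean {
    return time >= this.data.startTime && time < this.data.startTime + this.data.duration;
  }

  // Render this clip and propagate to children, game-engine style.
  render(ctx: CanvasRenderingContext2D, time: number): void {
    if (!this.isActiveAt(time)) return;
    this.draw(ctx, time - this.data.startTime);
    for (const child of this.children) child.render(ctx, time);
  }

  protected abstract draw(ctx: CanvasRenderingContext2D, localTime: number): void;
}

class TextClip extends Clip {
  constructor(data: TrackClipData, private text: string) { super(data); }
  protected draw(ctx: CanvasRenderingContext2D): void {
    const { x, y } = this.data.transform;
    ctx.fillText(this.text, x, y);
  }
}
```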


As mentioned earlier, our rendering is driven by data and time. There are four main kinds of updates.

The first is the timer update. Once the data is ready, the timer drives the update of the entire canvas. Screen updates come from two sources: the user's playback behavior and the user's direct operations on the canvas. Each frame requires some preparatory work to work out which timeline elements should be rendered now, which should not be rendered, and which will be rendered soon according to prediction. The second is the cache update: the preloader preloads elements and manages the creation and destruction of caches.

The third is the Clip update. Clip is the base class of all elements and carries basic properties such as width and height, and operations such as dragging, rotating, and scaling. The last is the update of user behavior. Users can perform many operations inside the rendering engine, such as dragging a video or a sticker; we synchronize the updates of on-screen elements back into the track data to keep the data consistent.
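A minimal sketch of what such a time-driven update loop might look like (the preloader and clip interfaces here are hypothetical, not the engine's real API):

```ts
// Hypothetical time-driven update loop: timer -> preload -> render.

interface RenderableClip {
  render(ctx: CanvasRenderingContext2D, time: number): void;
}

interface Preloader {
  // Classify timeline elements for the given time: visible now, soon visible, no longer needed.
  prepare(time: number): { active: RenderableClip[]; upcoming: RenderableClip[]; expired: RenderableClip[] };
  warmUp(clips: RenderableClip[]): void;   // create caches / decode ahead
  release(clips: RenderableClip[]): void;  // destroy caches
}

class RenderLoop {
  private playing = false;
  private startedAt = 0;

  constructor(private ctx: CanvasRenderingContext2D, private preloader: Preloader) {}

  play(): void {
    this.playing = true;
    this.startedAt = performance.now();
    requestAnimationFrame(this.tick);
  }

  private tick = (now: number): void => {
    if (!this.playing) return;
    const time = (now - this.startedAt) / 1000; // timeline time in seconds

    const { active, upcoming, expired } = this.preloader.prepare(time);
    this.preloader.warmUp(upcoming);
    this.preloader.release(expired);

    this.ctx.clearRect(0, 0, this.ctx.canvas.width, this.ctx.canvas.height);
    for (const clip of active) clip.render(this.ctx, time);

    requestAnimationFrame(this.tick);
  };
}
```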


Video editing projects inevitably need all kinds of special effects. In our project, some video effects are implemented with shaders, such as transitions, masks, and entrance animations.

These are effects that are common in short videos. How do we develop and reuse such effects?


This is the source code of a shader. Its core is a main function whose return value is a color, an RGBA value. If it returns (0, 0, 0, 0) without doing anything else, the picture on the right is completely black. If the resolution of the picture on the right is 720×1280, the main function is executed 720×1280 times, once per pixel. Next comes texture input: this program takes two image textures, and inside the main function you can sample the color of the pixel each texture maps to. If you simply return the color sampled from picture 1, the final frame is exactly picture 1; if you return the color from picture 2, the final frame is picture 2.

That is the logic of the animation. If the animation lasts 2 seconds and the current time is 1 second, the shutter effect is in the middle of the animation; through calculation, some pixels show the color from picture 1 and others show the color from picture 2. One more interesting detail: directives such as #iChannel and #iUniform that start with # are not standard shader syntax; they are parsed into standard input parameters by our VSCode plug-in.
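To make the idea concrete, here is a minimal sketch of such a progress-driven transition, written as a GLSL fragment-shader string embedded in TypeScript; the uniform names and the 10-strip shutter are illustrative, not the engine's actual conventions:

```ts
// Illustrative progress-driven "shutter" transition between two textures.
// Uniform names are placeholders, not the engine's real conventions.
export const shutterTransitionFrag = /* glsl */ `
  precision mediump float;

  uniform sampler2D uTexture1;   // picture 1
  uniform sampler2D uTexture2;   // picture 2
  uniform float uProgress;       // 0.0 -> 1.0 over the 2-second animation
  uniform vec2  uResolution;     // e.g. 720 x 1280

  void main() {
    vec2 uv = gl_FragCoord.xy / uResolution;

    // Split the screen into vertical "shutter" strips; each strip reveals
    // picture 2 progressively as uProgress advances.
    float strip = fract(uv.x * 10.0);          // position inside the current strip
    float showSecond = step(strip, uProgress); // 1.0 once this part of the strip has "opened"

    vec4 color1 = texture2D(uTexture1, uv);
    vec4 color2 = texture2D(uTexture2, uv);
    gl_FragColor = mix(color1, color2, showSecond);
  }
`;
```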


After writing the shader, you can debug it in real time by tweaking the controls in the lower right corner of the plug-in. Shader authoring is otherwise quite freeform: you can feed in arbitrary textures and variables when writing an effect, and there is no native import/component mechanism for reuse, which is bad for program design. Tencent Cloud solves these problems through the VSCode plug-in. When designing the Shader Controller module of the rendering engine, we strictly restrict the uniforms that can be passed in, allow only a specified set of input variables, and encapsulate reusable methods as much as possible, so later developers only need to write the part they render and do not need to care about the common logic. The common uniforms are the #-prefixed variables mentioned above. For example, with the progress variable, the current animation progress is available inside the main function, so the frame to display can be computed from the current progress. Similar public inputs include standardized pixel coordinates, UV, canvas scale, and so on, all of which can be adjusted in the lower right corner for real-time preview.

The import mechanism is implemented by merging shaders, because shaders themselves provide no native include mechanism: common methods are concatenated during the tool's compilation phase. At writing time, developers only write the specific logic of the main function and can freely use the shared helpers for computing ratios, positions, colors, progress, and so on. At code-generation time, that shared code is merged with the rendering code written by the developer. After debugging, clicking Export in the upper right corner generates the new shader file required by the editor, and the new effect can be put on the shelf.
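A rough sketch of what this compile-time concatenation could look like (simplified; the real plug-in's handling of #iUniform/#iChannel directives is more involved, and the helper names are made up):

```ts
// Simplified compile-time merge: shared header + helpers + developer-written main.
// The real VSCode plug-in parses #iUniform / #iChannel directives; this sketch
// only illustrates the concatenation idea.

const COMMON_HEADER = `
  precision mediump float;
  uniform sampler2D uTexture1;
  uniform sampler2D uTexture2;
  uniform float uProgress;    // standardized animation progress, 0..1
  uniform vec2  uResolution;  // canvas size in pixels
`;

const COMMON_HELPERS = `
  vec2 pixelToUv(vec2 fragCoord) { return fragCoord / uResolution; }
  vec4 sample1(vec2 uv) { return texture2D(uTexture1, uv); }
  vec4 sample2(vec2 uv) { return texture2D(uTexture2, uv); }
`;

// Developers only write the main() body and rely on the shared names above.
export function buildFragmentShader(developerMain: string): string {
  return [COMMON_HEADER, COMMON_HELPERS, developerMain].join('\n');
}

// Usage: a cross-fade effect expressed only in terms of the shared inputs/helpers.
const crossFade = buildFragmentShader(`
  void main() {
    vec2 uv = pixelToUv(gl_FragCoord.xy);
    gl_FragColor = mix(sample1(uv), sample2(uv), uProgress);
  }
`);
```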


The rendering engine does not exist in isolation: it has to work together with the track data, and the assembly of the track data in turn depends on the editor. The front-end editor has four main modules. The first is the real-time rendering engine already described. The second is the material module; every material type we introduce goes through careful research and thinking, and from the beginning we only allow rendering paths that work in the Web's main mode, without piling environment-dependent code onto the core engine. The third is the editing track. The fourth is the material supplement module, which makes it easy for customers to import general AE templates to build platform-specific stickers, text effects, and so on.


The performance of the first version of the editing track was poor. We redesigned it around a view controller: when an element is clicked, its data is submitted to the view controller and a dragged entity is generated, and all other elements update themselves by observing changes to that entity. During dragging, the controller updates the coordinates, finds regions that allow snapping or automatic alignment, and renders shadow (ghost) elements. The real track data is only updated when the user releases the drag. With this design, the smoothness of track operations improved greatly.
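A simplified sketch of such a drag view-controller, with illustrative names and an arbitrary snap threshold:

```ts
// Illustrative drag view-controller: ghost preview while dragging,
// commit to the real track data only on release.

interface TrackItem { id: string; trackIndex: number; start: number; duration: number; }

type Listener = (ghost: TrackItem | null) => void;

class DragController {
  private ghost: TrackItem | null = null;
  private listeners: Listener[] = [];

  constructor(private commit: (item: TrackItem) => void) {}

  onChange(listener: Listener): void { this.listeners.push(listener); }

  beginDrag(item: TrackItem): void {
    this.ghost = { ...item };                 // work on a copy, not the real data
    this.notify();
  }

  move(deltaTime: number, snapPoints: number[]): void {
    if (!this.ghost) return;
    let start = this.ghost.start + deltaTime;
    // Snap to the nearest alignment point within a small threshold (50 ms here).
    for (const p of snapPoints) {
      if (Math.abs(p - start) < 0.05) { start = p; break; }
    }
    this.ghost = { ...this.ghost, start };
    this.notify();                             // other elements only re-render the ghost
  }

  endDrag(): void {
    if (this.ghost) this.commit(this.ghost);   // the real track update happens here
    this.ghost = null;
    this.notify();
  }

  private notify(): void { this.listeners.forEach(l => l(this.ghost)); }
}
```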


The operation logic is inseparable from media elements. Online editing tools currently manage resources in one of two modes: pure cloud or pure local. At the beginning we used the pure cloud mode, with all resources living in cloud VOD. The pure local mode is similar to Clipchamp abroad; it cannot support cross-device collaboration and carries the risk of losing cached files, while in pure cloud mode users have to wait for upload and transcoding to finish before they can start editing. Tencent Cloud adopts a local/cloud dual mode for the editing workflow. When a file is imported, we parse the video to determine whether the media resource can be edited directly. If it can, we start the local editing workflow, capture the cover image and sprite sheet, and drop the video onto the editing track; uploading and transcoding then proceed behind the editor, and once they finish, the local resource is replaced by the cloud one. From then on, even if the user switches devices, the project data stays stable and available.
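Here is a minimal sketch of that dual-mode import decision; the helper functions are hypothetical placeholders for the actual probing, capture, upload, and track operations:

```ts
// Hypothetical dual-mode import flow: edit locally right away when possible,
// upload/transcode in the background, then swap in the cloud resource.

interface MediaResource { id: string; url: string; origin: 'local' | 'cloud'; }

declare function probeIsDirectlyEditable(file: File): Promise<boolean>;   // e.g. browser-playable MP4
declare function captureCoverAndSprite(file: File): Promise<void>;
declare function addToTrack(resource: MediaResource): void;
declare function replaceOnTrack(localId: string, cloud: MediaResource): void;
declare function uploadAndTranscode(file: File): Promise<MediaResource>;  // returns the cloud VOD resource

export async function importMedia(file: File): Promise<void> {
  const local: MediaResource = { id: crypto.randomUUID(), url: URL.createObjectURL(file), origin: 'local' };
  const editableNow = await probeIsDirectlyEditable(file);

  if (editableNow) {
    // Local workflow: the user can start editing immediately.
    await captureCoverAndSprite(file);
    addToTrack(local);
  }

  // Upload + transcode always run in the background.
  const cloud = await uploadAndTranscode(file);

  if (editableNow) {
    // Silently swap the local resource for the cloud one so the project
    // survives device switches.
    replaceOnTrack(local.id, cloud);
  } else {
    // Not directly editable: it only lands on the track after transcoding.
    addToTrack(cloud);
  }
}
```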


The green part shows the front-end application scenarios. Tencent Cloud's real-time rendering engine already supported rendering on the Web side and the applet side well. For server-side export, the early approach was to agree on a shared rendering protocol with the backend and have the backend assemble the final track data with FFmpeg and OpenGL. Doing so not only led to inconsistencies between the front and back ends, but also consumed a lot of backend manpower.


After verification, it turned out to be feasible to move the rendering engine itself to the server. The whole program is driven by a Node process running the rendering engine; we encapsulated a shared memory Node extension module for quickly passing video and audio frame data, and a codec Node extension module whose bottom layer is based on a modified FFmpeg. On the front end the rendering engine is driven by data and time; thanks to the layered architecture, most of the adaptation only required changing the data-loading part of the preloader. The external rendering API of Clip stays the same, so most of the rendering logic is reused as-is.


To avoid IO overhead, we had to encapsulate a shared memory extension that lets the rendering engine and the codec module exchange audio and video frame data. It is split into two parts: a shared memory write module and a shared memory read module. After FFmpeg receives a preload event, it prefetches video frames and puts them into shared memory. When the rendering engine needs a particular video frame, it fetches that buffer from shared memory through a handle and renders with it; after rendering, it puts the result back into shared memory for the encoder to read.
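A rough sketch of what the frame exchange around such a shared memory extension might look like from the Node side; the native addon and every function name here are hypothetical, since the actual extension is internal:

```ts
// Hypothetical Node-side view of the shared-memory frame exchange.
// The native addon and its API are illustrative only.

interface SharedMemory {
  write(key: string, data: Uint8Array): number;      // returns a handle
  read(handle: number): Uint8Array;                  // view of the buffer, no copy through IO
  release(handle: number): void;
}

declare function loadSharedMemoryAddon(): SharedMemory;                       // hypothetical native addon
declare function decodeNextFrame(): { handle: number; pts: number };          // filled by the FFmpeg-based decoder process
declare function renderFrame(rgba: Uint8Array, pts: number): Uint8Array;      // rendering engine output (RGBA)

const shm = loadSharedMemoryAddon();

export function renderOneFrame(): number {
  // 1. The decoder has already prefetched the frame into shared memory.
  const { handle, pts } = decodeNextFrame();
  const decoded = shm.read(handle);

  // 2. Render with the decoded pixels.
  const rendered = renderFrame(decoded, pts);
  shm.release(handle);

  // 3. Hand the rendered frame back to the encoder via shared memory.
  return shm.write(`rendered:${pts}`, rendered);
}
```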


Next is the Node extension for the codec module. We encapsulated a codec Node extension that is called by the main process of the rendering engine. The rendering engine creates an encoding child process at startup, and during rendering it also creates decoding child processes on demand according to the preload results; information is passed between processes through shared memory. Decoding is frame-rate aligned: for each frame to be decoded, the video frame and the corresponding amount of audio data are returned together, and the rendering engine takes that frame's data for image rendering and audio processing.
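As an example of what frame-rate alignment means in practice (a sketch; 48 kHz audio at 30 fps is just an example combination):

```ts
// Frame-aligned decoding: for timeline frame N at a given fps, return exactly
// one video frame plus the matching slice of audio samples.
// Example numbers: 48 kHz audio at 30 fps -> 1600 samples per video frame.

function audioSamplesForFrame(frameIndex: number, fps: number, sampleRate: number) {
  const start = Math.round((frameIndex * sampleRate) / fps);
  const end = Math.round(((frameIndex + 1) * sampleRate) / fps);
  return { start, count: end - start }; // rounding per-boundary keeps long timelines drift-free
}

console.log(audioSamplesForFrame(0, 30, 48000)); // { start: 0, count: 1600 }
```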


The overall composition and scheduling process of a video is shown in the figure. Thanks to the frame-accurate design of the rendering engine, rendering consistency is maintained regardless of how the timeline is segmented: a 30-second video can be divided into three segments or ten, and no matter where a segment starts rendering from, the final results are exactly the same. This provides solid underlying support for distributed rendering.

At the same time, we analyze the track data frame by frame. Only the content that actually needs rendering enters the rendering logic; everything else is sent straight to encoding or transcoding. After all segment tasks are completed, the segments are concatenated and remuxed to finish the video synthesis.
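A small sketch of how such frame-accurate segmentation can be computed (illustrative only; the real scheduler also classifies frames into render vs. passthrough):

```ts
// Split a timeline into frame-accurate segments for distributed rendering.
// Because boundaries are expressed in whole frames, every worker renders
// exactly the same pixels for its range, regardless of segment count.

interface Segment { index: number; startFrame: number; endFrame: number } // endFrame exclusive

export function splitIntoSegments(durationSec: number, fps: number, segmentCount: number): Segment[] {
  const totalFrames = Math.round(durationSec * fps);
  const segments: Segment[] = [];
  for (let i = 0; i < segmentCount; i++) {
    segments.push({
      index: i,
      startFrame: Math.floor((i * totalFrames) / segmentCount),
      endFrame: Math.floor(((i + 1) * totalFrames) / segmentCount),
    });
  }
  return segments;
}

// A 30-second, 30 fps video split into 3 segments: [0,300), [300,600), [600,900).
console.log(splitIntoSegments(30, 30, 3));
```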


Once the above process is complete, we can go online, but how do we guarantee the consistency of the rendering results? We wrote a set of test cases covering all elements and effects: first generate the MP4 of the expected result, then in every subsequent iteration compare the two videos frame by frame using SSIM (structural similarity), which guarantees that the synthesized video matches the original. However, real editing effects are often more complicated. To guarantee rendering consistency in complex situations, online data is used directly as a test case set: before each release, a shadow environment samples tasks and compares them, again measuring per-frame differences with SSIM. A release is only permitted if all comparisons pass. Error-prone cases are collected as bad cases, and we keep improving on them to guarantee release quality through the iterations. With the local test case set plus the pre-release shadow environment, the backend service release carries the lowest burden in the entire center: if it runs and passes, it can ship.
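FFmpeg's built-in ssim filter can perform this kind of frame-by-frame comparison; here is a minimal sketch of driving it from Node (the 0.98 pass threshold is an arbitrary example, not the team's actual criterion):

```ts
// Frame-by-frame SSIM comparison of an expected MP4 against a newly rendered one,
// using FFmpeg's built-in ssim filter. The 0.98 threshold is just an example.
import { execFile } from 'node:child_process';

export function compareBySsim(expected: string, actual: string): Promise<number> {
  return new Promise((resolve, reject) => {
    execFile(
      'ffmpeg',
      ['-i', expected, '-i', actual, '-lavfi', 'ssim', '-f', 'null', '-'],
      (err, _stdout, stderr) => {
        if (err) return reject(err);
        // The ssim filter prints a summary like "... All:0.9923 (...)" to stderr.
        const match = stderr.match(/All:([\d.]+)/);
        if (!match) return reject(new Error('no SSIM summary found'));
        resolve(parseFloat(match[1]));
      },
    );
  });
}

// Usage: block the release if similarity drops below the threshold.
compareBySsim('expected.mp4', 'actual.mp4').then(score => {
  if (score < 0.98) throw new Error(`SSIM too low: ${score}`);
});
```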

-03-

Browser native codec capability helps cloud editing


Chrome introduced WebCodecs starting from version 86. In those early versions there were always some inexplicable rendering bugs; around version 92 they were fixed and things stabilized, and official support arrived in version 94. WebCodecs aims to provide efficient audio and video codec APIs in the browser. Before WebCodecs, there were already two codec-related APIs, MediaRecorder and MSE, but both had many limitations. The arrival of WebCodecs gives the audio and video business far more room for imagination.


Although a pure-browser editor does not involve the server, it cannot avoid video transcoding, because browsers only support a limited set of video formats and many formats cannot be played directly; unsupported formats have to be transcoded before editing. I believe everyone has heard of, or used, the wasm build of FFmpeg in their business. Wasm's memory limits are already tight for video editing scenarios, and the most critical point is performance: even with SIMD enabled, transcoding a 1080p MOV only reaches about 0.3x real-time speed, which is a very poor user experience. To improve transcoding efficiency, we spin up ten workers in the browser and dispatch work to whichever one is idle: first probe the video's metadata, then transcode the segments in parallel, and finally merge the results. To avoid audio problems, the audio stream is extracted as-is and remuxed directly. In this mode transcoding reaches three to four times real-time speed, and a one-minute video can be transcoded in roughly ten seconds.
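A simplified sketch of such a worker-pool dispatcher follows; the worker script name and its message protocol are hypothetical, and the wasm FFmpeg invocation inside the worker is omitted:

```ts
// Simplified browser-side worker pool: each worker is assumed to run a wasm
// FFmpeg instance and transcode one segment (worker script and message shape
// are hypothetical).

interface SegmentJob { index: number; start: number; duration: number; file: File; }

export function transcodeInParallel(jobs: SegmentJob[], poolSize = 10): Promise<ArrayBuffer[]> {
  const results: ArrayBuffer[] = new Array(jobs.length);
  let next = 0;

  return new Promise((resolve, reject) => {
    let done = 0;
    const workers = Array.from(
      { length: Math.min(poolSize, jobs.length) },
      () => new Worker('transcode-worker.js'), // hypothetical worker script
    );

    const feed = (worker: Worker) => {
      if (next >= jobs.length) { worker.terminate(); return; }
      const job = jobs[next++];
      worker.onmessage = (e: MessageEvent<{ index: number; data: ArrayBuffer }>) => {
        results[e.data.index] = e.data.data;
        if (++done === jobs.length) resolve(results);
        else feed(worker); // whichever worker is free takes the next segment
      };
      worker.onerror = reject;
      worker.postMessage(job);
    };

    workers.forEach(feed);
  });
}
```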


Combining WebCodecs with the rendering engine greatly improves efficiency: a 33-second source clip needs only a little over nine seconds for the whole decode-render-encode pipeline. While building this, I found that the WebCodecs API is only responsible for decoding and encoding; demuxing and muxing you have to do yourself. Both VideoFrame and AudioData are very lightweight objects, but they hold references to heavy memory. When encoding, if you pass in a different width and height, the encoder scales automatically, so some scaling logic can be pushed into the encoder, reducing work in the rendering stage and improving performance.
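A minimal WebCodecs encoding sketch for the render-then-encode step; the codec string, bitrate, and keyframe interval are example values, and muxing is left as a placeholder since WebCodecs does not cover it:

```ts
// Minimal WebCodecs sketch: wrap a rendered canvas into a VideoFrame and encode it.
// Muxing is not part of WebCodecs, so the chunk handler below is a placeholder.

const encodedChunks: EncodedVideoChunk[] = [];

const encoder = new VideoEncoder({
  output: (chunk) => encodedChunks.push(chunk), // a real pipeline would feed a muxer here
  error: (e) => console.error('encode error', e),
});

encoder.configure({
  codec: 'avc1.42001f',   // H.264 baseline; support should be checked before use
  width: 1280,
  height: 720,            // the encoder scales automatically if frames differ in size
  bitrate: 3_000_000,
  framerate: 30,
});

export async function encodeRenderedFrames(canvas: HTMLCanvasElement, frameCount: number, fps: number) {
  for (let i = 0; i < frameCount; i++) {
    // ...render frame i onto the canvas here...
    const frame = new VideoFrame(canvas, { timestamp: (i * 1_000_000) / fps }); // microseconds
    encoder.encode(frame, { keyFrame: i % 60 === 0 });
    frame.close(); // VideoFrame is lightweight but holds heavy memory; close it promptly
  }
  await encoder.flush();
}
```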


WebCodecs performs excellently, but there is still a big gap before it can completely replace FFmpeg on the browser side. Until the end of 2021, its only audio encoding format was Opus, which the native players on macOS and Windows cannot play. You could use the wasm build of FFmpeg to convert the audio to AAC, but that loses the joy of making videos purely in the browser. In May 2022, AAC encoding support was finally finalized. Demand for H.265 has been very strong in the forums, and Chrome 104 finally supports hardware-accelerated H.265 decoding; WebCodecs supports H.265 decoding as well. For demuxing and muxing, the mainstream solutions are still the wasm build of FFmpeg or manually assembling MP4 boxes, so WebCodecs developers are effectively forced to build their own wasm demuxers.

There are also many places in the encoding part that could be optimized. Allowing developers to configure buffer sizes would enable fine-grained control according to the browser's performance.

Thanks to the way our rendering engine is built and its layered design, the loader part can be swapped out for WebCodecs very quickly. It would be very convenient for many developers if audio and video decoding could be returned in a frame-aligned manner, as on the server side, and if richer decoding and encoding configuration options made WebCodecs easier to use.

That’s all for today’s sharing, thank you all!



LiveVideoStackCon 2023 Shanghai lecturer recruitment

LiveVideoStackCon is everyone's stage. If you are in charge of a team or company, have years of practice in a certain field or technology, and are keen on technical exchanges, welcome to apply to be a lecturer at LiveVideoStackCon. Please submit your speech content to the email address: [email protected].
