Exploring pure front-end real-time video frame preview

This article chronicles my exploration of pure front-end real-time video frame preview, and summarizes how I used WebAssembly to bring FFmpeg's video processing capabilities to the Web platform. At a minimum, the article answers the following questions:

  • How do you use FFmpeg to generate a video thumbnail sprite? What is FFmpeg?
  • How do you use WebAssembly to port a C program to the browser?
  • How do you parse an mp4 file in the browser to find the byte data of a particular frame?
  • How do you send HTTP range requests?

Article outline:

  • What is video frame preview?
  • Common implementations
  • Advanced implementation
  • The final implementation
  • Summary

Exploration means facing the unknown. I hope this article can take readers through the process of my exploration and the various unknowns I ran into and learned about along the way. Let's begin.

What is video frame preview?

On the player page of some video-on-demand sites, when the user moves the mouse over the progress bar, a floating window pops up showing a picture: the video frame at the point in time corresponding to the mouse position. In current implementations the user experience is already quite good: the preview appears very quickly, and different points on the timeline show different pictures, simulating a real-time preview, as shown in the figure:

This video frame preview function is what I call video frame preview. What I want to explore is how to implement every step of it with front-end technology, and make the preview truly real-time. Before exploring, let's first look at the current common implementation.

Common implementations

Looking at the major video sites, I found that the picture in the popup is generally a background image, and opening the background image's link reveals a video thumbnail sprite. Open the Elements panel of Chrome DevTools and you can see:

The link opens a picture like this:
As you can see, this picture is a mosaic of thumbnails of different video frames; I call it a video thumbnail sprite. So how is this sprite generated? One way is to use FFmpeg.


FFmpeg is a very powerful audio and video processing tool. Its official website describes it as:

A complete, cross-platform solution to record, convert and stream audio and video.


I wrote a C program that demonstrates how to generate a video thumbnail sprite with FFmpeg. It takes a video file path as a parameter, uses FFmpeg's APIs to read the video file, and then, through a series of steps (demuxing -> frame decoding -> frame transcoding ...), generates a sprite in the current directory. The program's logic can be summed up in the following steps:

  1. Initialize input: do initial work such as reading the arguments, opening the video file, initializing objects and allocating the necessary memory;
  2. Initialize the decoder: find a decoder that fits the video file and open it;
  3. Read frame data at the specified time interval: read frame data from the video file at the interval given as a parameter;
  4. Arrange the data by the specified number of columns: arrange the decoded frame data so that each row contains the number of pictures given as a parameter;
  5. Generate the sprite file: write the arranged byte sequence to the sprite image file.

The above are the logical steps the program performs to generate a video thumbnail sprite. Because this part is not the topic of this article, the code is not pasted here; interested readers can view the full source code at GitHub - VVangChen/video-thumbnail-sprite, and download an executable file to run locally.
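As an aside, a similar sprite can also be produced directly with the stock ffmpeg command-line tool and its tile filter. The following one-liner is only a sketch with illustrative parameters (one frame every 10 seconds, thumbnails scaled to 160px wide, 5 columns by 5 rows):

ffmpeg -i input.mp4 -vf "fps=1/10,scale=160:-1,tile=5x5" -frames:v 1 sprite.jpg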

Advanced implementation

The previous section covered how to generate video thumbnail sprites with FFmpeg; next, let's push the exploration further. In the common implementation, the most important part is generating the video thumbnail sprite, so can this most important part be done in the browser? The answer is yes, and you have probably already thought of the recently popular WebAssembly, because it was born for exactly this.


WebAssembly is a language designed as a compilation target that runs in the browser; it is intended to bring native-application capabilities to the browser through porting. If you want to know more, you can browse its official website WebAssembly or study WebAssembly | MDN. Now let's talk about how to port the sprite-generating C program above to Chrome.


In simple terms, a "simple port" takes just two steps:

  1. Use emconfigure and emmake to configure and compile FFmpeg;
  2. Use emcc to compile the C program above.

emconfigure, emmake and emcc are tools provided by Emscripten's emsdk, with which you can very simply port C/C++ programs to the browser. emcc compiles C into a wasm module and at the same time generates a JS file that exposes a set of utility methods, so that JS can call the methods the C module exports and access the C module's memory. To install emsdk, refer to Download and install. After installing, we start porting our C program:

  1. First enter the previously downloaded FFmpeg directory and run the following command to configure the build:
emconfigure ./configure --prefix=/usr/local/ffmpeg-web --cc="emcc" --enable-cross-compile --target-os=none --arch=x86_64 --cpu=generic \
  --disable-ffplay --disable-ffprobe --disable-asm --disable-doc --disable-devices --disable-pthreads --disable-w32threads --disable-network \
  --disable-hwaccels --disable-parsers --disable-bsfs --disable-debug --disable-protocols --disable-indevs --disable-outdevs --enable-protocol=file
  2. After configuration completes, run emmake make && sudo make install;
  3. Enter the directory of the C program above and run the following command to compile it:
emcc -o web_api.html web_api.c preview.c \
-s ASSERTIONS=1 -s VERBOSE=1 \
-s ALLOW_MEMORY_GROWTH=1 -s TOTAL_MEMORY=67108864 \
-s WASM=1 \
-s EXTRA_EXPORTED_RUNTIME_METHODS='["ccall", "cwrap"]' \
`pkg-config --libs --cflags libavutil libavformat libavcodec libswscale`

As you can see, running the command generates the wasm and js files; with that, the port is complete.


But after this "simple port", the application cannot run directly, because the browser cannot directly access the user's local files. So we need to slightly modify the original C program and add some Web-side code. The application logic after porting looks roughly like this:

  1. Get the video data uploaded by the user;
  2. Pass the video data to the C module;
  3. After the C module receives the video, it generates the video thumbnail sprite;
  4. Return the video thumbnail sprite to the Web side;
  5. After the Web side obtains the video thumbnail sprite, it draws it with Canvas.
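As a minimal sketch of step 1, assuming the page has an <input type="file" id="video-input"> element (generateSprite is the function shown below):

document.getElementById('video-input').addEventListener('change', async (e) => {
  const file = e.target.files[0]
  if (!file) return
  // read the selected video into memory as a byte buffer
  const buffer = await file.arrayBuffer()
  // hand the bytes to the ported module
  generateSprite(new DataView(buffer))
})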

Next, a brief analysis of the ported Web application. Because this part is not today's topic, the complete code is not posted here; interested readers can view the full source code and examples at GitHub - VVangChen/video-thumbnail-sprite.


The biggest difference between the ported Web application and the original C program is how the video data is acquired. Before, the C program could load the local file directly; now the Web side needs to cache the user-uploaded video in memory, and then pass the memory address to the C module by calling the methods the module exposes. After obtaining the memory address, the C module reads the video data from memory and processes it as before. Besides the different way of acquiring the video data, the C module no longer needs to generate an image file; instead it returns the arranged RGB data to the Web side through memory. Let's look at the part of the Web side that interacts with the C module; the key code is as follows:
function generateSprite(data, cols = 5, interval = 10) {
  // Get the getSpriteImage method exposed by the C module
  const getSpriteImage = Module.cwrap('getSpriteImage', 'number',
                  ['number', 'number', 'number', 'number']);
  const uint8Data = new Uint8Array(data.buffer)
  // Allocate memory
  const offset = Module._malloc(uint8Data.length)
  // Write the data into the module's memory
  Module.HEAPU8.set(uint8Data, offset)
  // Call getSpriteImage and get the address of the generated sprite
  const ptr = getSpriteImage(offset, uint8Data.length, cols, interval)

  // Read the sprite's memory address out of the module's memory
  const spriteData = Module.HEAPU32[ptr / 4]
  ...
  // (the elided code also reads the sprite's byte size into `size`)
  // Get the sprite data
  const spriteRawData = Module.HEAPU8.slice(spriteData, spriteData + size)

  // Free the memory
  Module._free(offset)
  Module._free(ptr)
  Module._free(spriteData)

  return ...
}

Additionally, if the Web side wants to call a method of the C module, the method must be marked for exposure with a macro in the C code, like this:

EMSCRIPTEN_KEEPALIVE // marks a method to be exposed to the Web side
SpriteImage *getSpriteImage(uint8_t *buffer, const int buff_size, int cols, int interval);

With this, JS can directly call the C module's getSpriteImage method, wait for the C module to generate the video thumbnail sprite and return it to the Web side, then draw it on a Canvas element and display it. You can view the full source code and examples at GitHub - VVangChen/video-thumbnail-sprite.
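The drawing step itself is straightforward. As a sketch, assuming the module returns tightly packed RGB24 data (3 bytes per pixel) and that the sprite's width and height are known:

// Paint RGB24 bytes onto a canvas by expanding them to the RGBA layout
// that ImageData expects
function drawRGB(canvas, rgb, width, height) {
  canvas.width = width
  canvas.height = height
  const ctx = canvas.getContext('2d')
  const image = ctx.createImageData(width, height)
  for (let i = 0, j = 0; i < rgb.length; i += 3, j += 4) {
    image.data[j] = rgb[i]          // R
    image.data[j + 1] = rgb[i + 1]  // G
    image.data[j + 2] = rgb[i + 2]  // B
    image.data[j + 3] = 255         // fully opaque
  }
  ctx.putImageData(image, 0, 0)
}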

The final implementation

In the previous section, we independently implemented a complete video frame preview feature in the browser. That leaves just one step to the goal of this exploration: generating the preview truly in real time. As said at the beginning, truly real-time means two conditions: first, the picture is not generated ahead of time, but at the moment the mouse hovers over the progress bar; second, the preview differs for every point in time, that is, the picture shown must be the video frame of that very second. The first condition only costs some latency: just trigger sprite generation whenever the mouse moves onto the progress bar. The second condition only requires shortening the sampling interval of the thumbnail sprite to 1 second. The existing scheme is built around sprites, but in fact the requirement no longer calls for pre-generating thumbnails of all frames; we only need to generate the one for that second. Given that we can already generate thumbnails of all frames, generating a single one is certainly achievable. Moreover, since only one thumbnail is needed rather than every video frame, don't we only need to fetch the data of that one frame? The answer is yes. So how to get the thumbnail data for a given point in time of the video is the key to this exploration. First look at the execution logic of the final program:

  1. Get the frame data of the video picture at the point in time the mouse selects;
  2. Pass the frame data to the C module;
  3. The C module uses FFmpeg to decode the frame data and convert it into RGB data;
  4. Return the generated RGB data to the Web side;
  5. Draw the RGB data on a Canvas.

Steps 2 to 5 are implemented the same way as in the previous section and will not be repeated; for the full source code please visit github.com/VVangChen/v... . The rest of this article mainly covers the first step: acquiring the frame data at the point in time the mouse selects. It can be broken down into two steps:

  1. Since the frame data is part of the video file, and a video file is a contiguous sequence of bytes, the first step is to calculate the offset of the frame data within the video file, together with the frame data's length;
  2. The second step is to issue a request to fetch the byte range [offset, offset + length] of the video file.

Here we only consider the more popular mp4 video format. The first step then becomes: given an mp4 video file, how do we compute the offset and size of the frame data corresponding to a point in time? This involves parsing the mp4 file structure. An mp4 file is made up of consecutive structural units called 'boxes'. Each 'box' consists of a header and data: the header contains at least the size and the type, and the data may be the box's own data, or one or more nested 'boxes'. Different 'boxes' have different functions; to calculate the offset of the frame data, we mainly need the following boxes (a minimal box-header parser is sketched after the list):

  • moov: stores the data required to decode the video
  • mdhd: stores media-related metadata, such as the timescale
  • stts: the time-to-sample table, used to find the sample for a given time
  • stss: the sync sample table, an index of all the key frames in the file
  • stsc: the sample-to-chunk table, used to find the chunk a sample belongs to and the sample's index within that chunk
  • stco: the chunk offset table, used to find each chunk's byte offset in the file
  • stsz: the sample size table, used to find each sample's size
  • mdat: stores the actual audio and video stream data
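Here is a minimal sketch of walking the top-level boxes of a file, assuming buf is an ArrayBuffer holding (at least the start of) the mp4 file; a real parser would also recurse into container boxes such as moov:

function listBoxes(buf) {
  const view = new DataView(buf)
  const boxes = []
  let offset = 0
  while (offset + 8 <= buf.byteLength) {
    // 32-bit big-endian size; it includes the 8 header bytes
    let size = view.getUint32(offset)
    const type = String.fromCharCode(
      view.getUint8(offset + 4), view.getUint8(offset + 5),
      view.getUint8(offset + 6), view.getUint8(offset + 7))
    if (size === 1) {
      // a size of 1 means a 64-bit "largesize" follows the type field
      size = Number(view.getBigUint64(offset + 8))
    } else if (size === 0) {
      // a size of 0 means the box extends to the end of the file
      size = buf.byteLength - offset
    }
    boxes.push({ type, offset, size })
    offset += size
  }
  return boxes
}
// e.g. listBoxes(headerBytes).find(b => b.type === 'moov') tells us where
// to read the moov payload from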

Given these boxes, the offset and size of the frame data can be computed with the following algorithm:

  1. First get the root-level structure of the mp4 file; moov may sit at the beginning or at the end of the file, and once its position is known, its data can be read;
  2. Parse moov and extract the data of all the boxes listed above;
  3. Convert the selected point in time into the time representation used in the stream (via the timescale);
  4. From that time, compute the frame's index in the sample sequence (via stts);
  5. Using the frame's sample index, find the chunk it belongs to and its index within that chunk (via stsc);
  6. Compute the byte offset of the chunk the frame belongs to (via stco);
  7. Using the frame's index within the chunk, compute its offset inside the chunk (via stsz);
  8. Add the chunk's offset and the frame's offset within the chunk to get the frame's offset in the file's byte sequence.
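Expressed in code, steps 3 to 8 might look like the following sketch; it assumes the relevant tables have already been parsed out of moov into plain arrays (the names and shapes here are illustrative, not the project's actual API):

// stts: [{ count, delta }]                — time-to-sample entries
// stsc: [{ firstChunk, samplesPerChunk }] — firstChunk is 1-based
// stco: [chunkOffset, ...]                — one absolute file offset per chunk
// stsz: [sampleSize, ...]                 — one size per sample

// Steps 3-4: map a time in seconds to a sample index via the timescale and stts
function timeToSampleIndex(time, timescale, stts) {
  const target = time * timescale
  let sample = 0
  let elapsed = 0
  for (const { count, delta } of stts) {
    if (elapsed + count * delta > target) {
      return sample + Math.floor((target - elapsed) / delta)
    }
    elapsed += count * delta
    sample += count
  }
  return sample - 1 // time past the end: clamp to the last sample
}

// Steps 5-8: find the chunk via stsc, its offset via stco, and the
// sample's offset inside the chunk by summing the preceding stsz sizes
function sampleToByteRange(sample, stsc, stco, stsz) {
  let first = 0 // index of the first sample covered by the current stsc run
  for (let i = 0; i < stsc.length; i++) {
    const runStart = stsc[i].firstChunk - 1
    const runEnd = i + 1 < stsc.length ? stsc[i + 1].firstChunk - 1 : stco.length
    const per = stsc[i].samplesPerChunk
    const inRun = (runEnd - runStart) * per
    if (sample < first + inRun) {
      const chunk = runStart + Math.floor((sample - first) / per)
      let offset = stco[chunk]
      // add the sizes of the samples that precede ours inside the same chunk
      for (let s = first + (chunk - runStart) * per; s < sample; s++) offset += stsz[s]
      return { offset, size: stsz[sample] }
    }
    first += inRun
  }
  throw new Error('sample index out of range')
}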

We have now seen how to compute the offset and length of frame data within an mp4 file. Next comes the second step: fetching the byte range [offset, offset + length] of the video file. This can be turned into the question: how do we fetch part of the data of a URL resource? The answer is HTTP range requests. If the server supports them for the resource, you only need to set a Range header on the HTTP request, whose value is the byte range of the resource data you want. Look at an example:

function fetchRangeData(url, offset, size) {
  return new Promise((resolve) => {
    const xhr = new XMLHttpRequest()
    xhr.onload = (e) => {
      resolve(xhr.response)
    }
    xhr.open('GET', url)
    // Set the Range request header
    xhr.setRequestHeader('Range', `bytes=${offset}-${offset + size - 1}`)
    xhr.responseType = 'arraybuffer'
    xhr.send()
  })
}

By calling the fetchRangeData function with the resource URL, the byte offset, and the number of bytes you want, you get back exactly the byte sequence you asked for.
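For example, combining it with the lookup helpers sketched earlier (the names come from the sketches above, not from the project's actual API; this runs inside an async function):

const sample = timeToSampleIndex(time, timescale, stts)
const { offset, size } = sampleToByteRange(sample, stsc, stco, stsz)
const frameBytes = await fetchRangeData(videoUrl, offset, size)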


So far we can fetch the frame data for any point in time, but that alone does not give the user the preview they want. Looking at the fetched frame data, some of it is very small, only a few dozen bytes, which obviously cannot describe a whole picture. If such frame data is handed directly to FFmpeg, it cannot be decoded successfully. Why? Because in H.264 coding, frames are divided into three types:

  1. I-frame: an independently decodable frame, also known as a key frame (intra frame), whose decoding does not depend on other frames;
  2. P-frame: a forward predicted frame, whose decoding needs the earlier frames in the sequence that it references;
  3. B-frame: a bidirectionally predicted frame, whose decoding needs both earlier and later frames in the sequence.

P and B frames are much smaller than I frames, which is why some frames take only a few dozen bytes. As these descriptions show, frames depend on (reference) one another, and a frame that is not an I-frame cannot be decoded independently. To decode a non-I-frame, you must obtain all the frames it references. In an H.264 stream the frames are arranged according to these reference relationships, which also determine the decoding order, since a reference frame must be decoded before any frame that references it. Because only I-frames can be decoded independently, every reference set starts with an I-frame. So to decode a non-I-frame, you have to fetch all the frames from the I-frame at the head of its reference set up to the selected frame. Such an independently decodable sequence is usually called a group of pictures (GOP), typically the run of frames between two I-frames, as shown in the example figure.
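In code, picking the start of the GOP and turning it into a byte range might look like this sketch. It reuses the helpers above, assumes stss has been parsed into an ascending array of 1-based sync-sample numbers, and assumes the track's samples are laid out contiguously in mdat, which holds for typical non-interleaved, single-video-track files:

function gopByteRange(time, t) {
  const sample = timeToSampleIndex(time, t.timescale, t.stts)
  // find the last keyframe at or before the selected sample
  let key = 0
  for (const k of t.stss) {
    if (k - 1 > sample) break
    key = k - 1
  }
  const start = sampleToByteRange(key, t.stsc, t.stco, t.stsz)
  const end = sampleToByteRange(sample, t.stsc, t.stco, t.stsz)
  // one contiguous range from the keyframe through the selected frame
  return { offset: start.offset, size: end.offset + end.size - start.offset }
}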

For the code that fetches this frame sequence, visit github.com/VVangChen/v... . After the frame data for the point in time the mouse selects has been obtained, it is passed to the C module, which generates the RGB data and returns it to the Web side; the Web side then draws it on a Canvas and displays it, and the user sees the video frame at the selected point in time. At this point, real-time video frame preview implemented with pure front-end technology is achieved.
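Putting the pieces together, the Web side of the final flow might look like the following sketch. All helper names come from the sketches above; progressBar, videoUrl, videoDuration, tables, previewCanvas with its dimensions, and decodeFrameRange (a hypothetical cwrap'd wrapper around the C module's decode export) are assumptions, not the project's actual API:

progressBar.addEventListener('mousemove', async (e) => {
  // map the mouse position to a point in time on the timeline
  const time = (e.offsetX / progressBar.clientWidth) * videoDuration
  // fetch exactly the bytes needed to decode the selected frame
  const { offset, size } = gopByteRange(time, tables)
  const bytes = await fetchRangeData(videoUrl, offset, size)
  // let the C module decode up to the selected frame and hand back RGB data
  const rgb = decodeFrameRange(new Uint8Array(bytes), time)
  // paint it into the floating preview window
  drawRGB(previewCanvas, rgb, previewWidth, previewHeight)
})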

Summary

I'm grateful to the readers who had the patience to read this far. Some will surely ask: what is the point of doing this? My answer: since this is exploration, it should face the unknown; no one knows what lies ahead before walking the road, and the pace of exploration has not stopped. The current implementation still has many problems, for example:

  • Every time a preview is generated, the frame data has to be fetched again;
  • The fetched frame data is only used for the preview; the browser fetches the data again when playing the video;
  • The compiled wasm file is too large;
  • Multi-threading is not used to keep the main thread from being blocked;
  • There are memory leaks.

Next, I will tackle these problems and keep exploring how to apply this in a production environment, to make it more practically useful. The pace of exploration has not stopped, so stay tuned, and let's encourage each other.

If there are errors or questionable points in this article, please point them out or discuss them. Thanks!

Reproduced from: https://juejin.im/post/5cf3f2d7f265da1bb7765273
