FFmpeg tutorial notes (C++ FFmpeg library application development, command-line usage) - Chapter 3: FFmpeg re-encapsulation - converting audio and video to MP4 and FLV - MP4 format standard, MP4 Box list

FFmpeg from beginner to proficient


Chapter 3 FFmpeg Re-encapsulation (Remuxing)

Chapter 2 introduced the functions of FFmpeg, which fall into media format encapsulation (muxing), audio and video encoding and decoding, transport protocol conversion, Filter processing, and so on. This chapter focuses on how to use FFmpeg to re-encapsulate (remux) media formats. The diversity and breadth of the container formats supported by FFmpeg has already been covered, so this chapter does not go through them one by one; instead, it introduces the common container formats in detail.

The main contents introduced in this chapter are as follows.

  • Section 3.1 introduces the MP4 format standard and the corresponding format parsing method, how to obtain the data required for MP4 format file parsing, and briefly introduces MP4 visual analysis tools, how to use FFmpeg to encapsulate MP4 files, etc.
  • Section 3.2 introduces the format standard of FLV and the corresponding format parsing method, how to obtain the data required for parsing FLV format files, and briefly introduces FLV visual analysis tools, how to use FFmpeg to encapsulate FLV files, etc.
  • Section 3.3 introduces the M3U8 format standard and the method of using FFmpeg to encapsulate M3U8.
  • Sections 3.4 and 3.5 introduce the basic slicing operations of FFmpeg; slicing can be done either with the segment muxer or with FFmpeg's general-purpose parameters.
  • Section 3.6 mainly analyzes the resource usage of re-encapsulation.

3.1 Convert audio and video files to MP4 format

Among the common formats on the Internet, MP4 is probably the most cross-platform: MP4 files can be played not only by Flash Player on PC platforms, but also on mobile platforms such as Android and iOS using the system's default player. That is why the MP4 format is considered the most universal multimedia file format.
This section first focuses on the basic structure of the MP4 container.

3.1.1 Introduction to MP4 format standard

Basic components

The MP4 format is specified by ISO/IEC 14496 Part 12 and ISO/IEC 14496 Part 14. The standards are not particularly long; here we focus on the most important information.
If you want to understand the format information of MP4, you must first understand several concepts, as follows.

- MP4 files are composed of many Boxes (also called atoms) and FullBoxes
- Each Box consists of two parts: Header and Data
- FullBox is an extension of Box: on top of the Box structure, its Header also carries an 8-bit version field and a 24-bit flags field
- The Header contains the size and type of the Box; size is the length of the entire Box, Header included

When size equals 0, this Box is the last Box in the file. When size equals 1, the Box length needs more bits to describe, and a 64-bit largesize field that follows the type describes the Box length. When type is uuid, the data in this Box is a user-defined extended type. (A minimal header-reading sketch follows this list.)

- Data is the actual data of the Box, which can be pure data or more sub-Boxes
- When the Data in a Box is a series of sub-Boxes, the Box can also be called a Container Box.
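To make the Box header description concrete, here is a minimal C++ sketch (my own illustration, not from the book) that walks the top-level Boxes of an MP4 file and prints each Box's offset, type and size, handling the size == 1 (largesize) and size == 0 (Box extends to the end of the file) cases described above. The file name and error handling are purely illustrative.

#include <cstdint>
#include <fstream>
#include <iostream>

// Read a 32-bit big-endian integer from the stream.
static uint32_t read_be32(std::istream& in) {
    unsigned char b[4];
    in.read(reinterpret_cast<char*>(b), 4);
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) << 8) | uint32_t(b[3]);
}

// Read a 64-bit big-endian integer (used for largesize).
static uint64_t read_be64(std::istream& in) {
    uint64_t hi = read_be32(in);
    uint64_t lo = read_be32(in);
    return (hi << 32) | lo;
}

int main(int argc, char* argv[]) {
    const char* path = argc > 1 ? argv[1] : "output.mp4";   // hypothetical file name
    std::ifstream in(path, std::ios::binary);
    if (!in) { std::cerr << "cannot open " << path << "\n"; return 1; }

    in.seekg(0, std::ios::end);
    const uint64_t file_size = static_cast<uint64_t>(in.tellg());

    uint64_t offset = 0;
    while (offset + 8 <= file_size) {
        in.seekg(static_cast<std::streamoff>(offset));
        uint64_t size = read_be32(in);      // 32-bit size field
        char type[5] = {0};
        in.read(type, 4);                   // four-character Box type
        uint64_t header = 8;
        if (size == 1) {                    // largesize follows the type field
            size = read_be64(in);
            header = 16;
        } else if (size == 0) {             // Box extends to the end of the file
            size = file_size - offset;
        }
        std::cout << "offset 0x" << std::hex << offset << std::dec
                  << "  type " << type << "  size " << size << " bytes\n";
        if (size < header) break;           // malformed Box, stop
        offset += size;
    }
    return 0;
}

For a typical MP4 file this prints Boxes such as ftyp, moov, free and mdat in file order, which is also a quick way to check whether moov comes before mdat.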
Box list (a Box type consists of four ASCII characters, as shown in the table below)

The Boxes that make up an MP4 file can be arranged as listed in Table 3-1. Boxes marked with "√" in Table 3-1 are mandatory; the others are optional.

[Image: Table 3-1, MP4 Box list]

In actual MP4 files, the Box structure generally does not differ much from what Table 3-1 describes. However, because the MP4 standard does not mandate the relative order of moov (the metadata describing the audio and video data) and mdat (the media data container), the moov Box is sometimes stored after mdat and sometimes before it. For Internet video on demand, if you want an MP4 file to open quickly, moov must be stored before mdat; if it is placed at the end, the whole MP4 file has to be downloaded before playback can start.

Example of parsing MP4 multimedia files

Some key information is required when parsing MP4 multimedia files. The main information is introduced below.

Parse moov container (Movie Box)

As introduced in Table 3-1, the moov container holds the metadata of an MP4 file. Its type is moov and it is a container Atom. It must contain at least one of the following three types of Atom:

  • mvhd, Movie Header Atom: a header container that stores overall, uncompressed movie information
  • cmov, Compressed Movie Atom: a container for compressed movie information; this container is not commonly used
  • rmra, Reference Movie Atom: a container for reference movie information; this container is not commonly used

The moov container can also contain other containers, such as a Clipping Atom (clip) holding movie clipping information, one or more Track Atoms (trak), a Color Table Atom (ctab), and a User Data Atom (udta).

Among them, mvhd defines the timescale, duration and display characteristics of the multimedia file, while trak defines the information of one track in the file. A track is a media unit in the file that can be handled independently; for example, an audio stream is one track and a video stream is another.

Open an MP4 file with a binary viewer and you can see the container information described above:

[Image: hex dump of the MP4 file]

For how to read this moov container, please refer to Table 3-2.

[Image: Table 3-2, moov container parsing]

Parsing the length field of the moov container shows that the container occupies a total of 0x000022bb (8891) bytes and that its type is moov.

(The following is my own test and is not part of the book's text.)

[Image: xxd hex dump of the test MP4 file]

I opened an MP4 file with the xxd command and one line reads 000022bb6d6f6f760000006c6d766864. How does the parser know that the moov container is 8891 bytes?
This question concerns the format and structure of MP4 files. MP4 files are composed of multiple "boxes" (or "atoms"), and each box has a header containing its length and type. We can use the xxd command to display a binary file in hexadecimal and then analyze its contents.

In the example you gave, "000022bb6d6f6f760000006c6d766864" is the hexadecimal encoding. This string can be divided into several parts:

  1. "000022bb": This is the size of the box. Here, it is a 32-bit integer, expressed in hexadecimal. Convert this value from hexadecimal to decimal to get 8891. This is
    the size of the moov box, in bytes.

  2. "6d6f6f76": This is the type of box. If we convert each hex value to ASCII characters, we get 'moov', which is a common MP4 box type.

  3. "0000006c6d766864": This part is the content inside the 'moov' box. What it represents needs to be parsed according to the internal structure of the 'moov' box.

Therefore, the parser can know the size and type of the box by reading the header information of each box.

How does the parser know that the container size comes from "000022bb" rather than from somewhere else?

The parser knows that the size of the container is obtained from "000022bb" because it follows the MP4 file format. In the MP4 file format, each container (also called a "box" or "atom") starts with a header that contains two pieces of information:

  1. A four-byte (32-bit) length field, which represents the size of the entire container (including headers and data).
  2. A four-byte type field indicating the type of container, such as 'moov', 'trak', 'mdia', etc.

So, when the parser reads an MP4 file, it will first read the first four bytes (i.e. 8 hexadecimal digits) and convert them to decimal to get the length of the container. It then reads the next four bytes to determine the container type. In the example you gave, "000022bb" is the length of the moov container, and "6d6f6f76" represents the type of container (i.e. 'moov').

(Test ends)

Parse the moov-mvhd container (Movie Header Box)

Continuing to parse inside this moov container, the next container has a size of 0x0000006c (108) bytes and its type is mvhd. We then keep parsing forward inside moov:

[Image: hex dump of the mvhd header]

Parse moov-trak container (Track Box)

The previous container starts at offset 0x00000028 (where the 0x0000006c size field is located), so the next container starts at 0x00000028 + 0x0000006c = 0x00000094.

[Image: hex dump of the trak header at offset 0x00000094]

After mvhd has been parsed, the output above shows that the next container inside moov is a trak Box. The size of this trak container is 0x000011de (4574) bytes and its type is trak.

Parse next moov-trak container

After parsing this trak, we go back into the moov container to parse the next trak, which is parsed in exactly the same way. From the file content below, the size of this trak is 0x00001007 (4103) bytes:

Offset: 0x00000094 + 0x000011de = 0x00001272

[Image: hex dump of the second trak header at offset 0x00001272]

Parse moov-udta container

After the audio trak has been parsed, we can see that there is one more sub-container inside moov, the udta container. It is parsed in essentially the same way as trak. From the file data below, the udta size is 0x00000062 (98) bytes:

Offset: 0x00001272 + 0x00001007 = 0x00002279

[Image: hex dump of the udta header at offset 0x00002279]

According to the information above, the moov header (8 bytes) + mvhd (108) + video trak (4574) + audio trak (4103) + udta (98) adds up to exactly 8891 bytes, which matches the moov size obtained earlier.

The previous part described parsing the sub-containers directly under moov. Next, we continue parsing the sub-containers inside those moov sub-containers.

Parse the actual content of moov-mvhd

[Image: hex dump of the mvhd container]

As can be seen from the file content, the size of the mvhd container is 0x0000006c bytes. The parsing method of mvhd is shown in Table 3-3.

[Image: Table 3-3, mvhd parsing method]
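Since the table image is not reproduced here, below is a sketch of the version-0 mvhd layout as defined in ISO/IEC 14496-12, written as a commented C++ struct. It is for orientation only: the field names may be labeled slightly differently in Table 3-3, and the struct only documents the field order and widths (in the file the fields are stored big-endian and without padding, so they should be read field by field rather than with a single memcpy).

#include <cstdint>

// Version-0 Movie Header Box (mvhd), per ISO/IEC 14496-12.
struct MvhdV0 {
    uint32_t size;              // 4 bytes: Box size, 0x0000006c = 108 in the sample file
    char     type[4];           // 4 bytes: 'mvhd'
    uint8_t  version;           // 1 byte : 0
    uint8_t  flags[3];          // 3 bytes
    uint32_t creation_time;     // 4 bytes: seconds since 1904-01-01
    uint32_t modification_time; // 4 bytes
    uint32_t timescale;         // 4 bytes: time units per second
    uint32_t duration;          // 4 bytes: duration in timescale units
    uint32_t rate;              // 4 bytes: 16.16 fixed point, 0x00010000 = normal speed
    uint16_t volume;            // 2 bytes: 8.8 fixed point, 0x0100 = full volume
    uint16_t reserved16;        // 2 bytes
    uint32_t reserved32[2];     // 8 bytes
    uint32_t matrix[9];         // 36 bytes: transformation matrix
    uint32_t pre_defined[6];    // 24 bytes
    uint32_t next_track_id;     // 4 bytes: ID to use for the next track added
};
// Total: 8 + 4 + 16 + 4 + 2 + 2 + 8 + 36 + 24 + 4 = 108 bytes,
// which matches the 0x0000006c size seen in the hex dump above.

With this layout, fields such as timescale, duration and next_track_id sit at fixed offsets from the start of mvhd, which is how the values shown in Table 3-4 are read out.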

The information corresponding to the content of mvhd obtained by parsing the file data according to the method shown in Table 3-3 is shown in Table 3-4.

[Image: Table 3-4, parsed mvhd values]

After mvhd has been parsed, we can see that the next track ID is 0x00000003. Next we start parsing trak, which itself contains multiple sub-containers.


Parse the actual content of moov-trak

The trak container defines the information of one track in the media file. A media file can contain multiple traks, each of which is independent and has its own time and space information. Each trak container has an associated media container with descriptive information. The main purposes of the trak container are as follows.

  • Contains references to and descriptions of the media data (media track)
  • Contains modifier track information
  • Contains packaging information for a streaming protocol (hint track); a hint track can reference or copy the corresponding media sample data

A hint track or modifier track must be complete and must exist together with at least one media track.
A trak container must contain a Track Header Atom (tkhd) and a Media Atom (mdia). The other Atoms are optional, for example the following.

  • Track clipping container: Track Clipping Atom (clip)
  • Track matte container: Track Matte Atom (matt)
  • Edit container: Edit Atom (edts)
  • Track reference container: Track Reference Atom (tref)
  • Track load settings container: Track Load Settings Atom (load)
  • Track input map container: Track Input Map Atom (imap)
  • User data container: User Data Atom (udta)

The parsing method is shown in Table 3-5.
[Image: Table 3-5, trak parsing method]
Referring to the layout in Table 3-5, open the MP4 file again and look at the binary data in the file, as follows:
[Image: hex dump of the trak container]
Offset: 0x0000009c + 0x0000005c = 0x000000f8
Offset: 0x000000f8 + 0x00000030 = 0x00000128

[Image: hex dump of the tkhd, edts and mdia headers]

From the file content, the size of this trak is 0x000011de (4574) bytes. The first sub-container that follows has a size of 0x0000005c (92) bytes and its type is tkhd. After skipping 92 bytes, the next trak sub-container read has a size of 0x00000030 (48) bytes and its type is edts. After skipping another 48 bytes, the next trak sub-container read has a size of 0x0000114a (4426) bytes and its type is mdia. Adding them up, the trak header (8 bytes) + tkhd (92) + edts (48) + mdia (4426) is exactly 4574 bytes, so the trak has been read completely.

Parsing moov-trak-tkhd (video)

Please refer to Table 3-6 for how to parse the tkhd container.

[Image: Table 3-6, tkhd parsing method]
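As with mvhd above, and since the table images are not reproduced, here is a sketch of the version-0 tkhd layout per ISO/IEC 14496-12 as a commented C++ struct; the field names may differ slightly from those used in Table 3-6, and again the fields are stored big-endian and unpadded in the file.

#include <cstdint>

// Version-0 Track Header Box (tkhd), per ISO/IEC 14496-12.
struct TkhdV0 {
    uint32_t size;              // 4 bytes: 0x0000005c = 92 in the sample file
    char     type[4];           // 4 bytes: 'tkhd'
    uint8_t  version;           // 1 byte
    uint8_t  flags[3];          // 3 bytes: bit 0 = track enabled, bit 1 = track in movie
    uint32_t creation_time;     // 4 bytes
    uint32_t modification_time; // 4 bytes
    uint32_t track_id;          // 4 bytes: unique, non-zero track ID
    uint32_t reserved0;         // 4 bytes
    uint32_t duration;          // 4 bytes: in the movie timescale declared by mvhd
    uint32_t reserved1[2];      // 8 bytes
    uint16_t layer;             // 2 bytes
    uint16_t alternate_group;   // 2 bytes
    uint16_t volume;            // 2 bytes: 8.8 fixed point; non-zero for audio, 0 for video
    uint16_t reserved2;         // 2 bytes
    uint32_t matrix[9];         // 36 bytes
    uint32_t width;             // 4 bytes: 16.16 fixed point; 0 for an audio track
    uint32_t height;            // 4 bytes: 16.16 fixed point; 0 for an audio track
};
// Total: 8 + 4 + 20 + 8 + 8 + 36 + 8 = 92 bytes, matching the 0x0000005c noted in the text.

This also anticipates the observation made after Table 3-8: the audio and video tkhd boxes have the same size, and only values such as volume, width and height differ between them.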

Let's look at the content of a tkhd in detail and map it field by field against Table 3-6. The resulting values of this tkhd are shown in Table 3-7.

[Image: Table 3-7, parsed video tkhd values]

Parse moov-trak-tkhd (audio)

Table 3-7 analyzed the tkhd of the video trak container. Now let's analyze the tkhd of the audio trak:

Offset: 0x00000094 + 0x000011de = 0x00001272

[Image: hex dump of the audio trak's tkhd]

The method of parsing trak has been mentioned before. Now we focus on parsing the audio tkhd and expressing the data in the form of a table. See Table 3-8 for details.

[Image: Table 3-8, parsed audio tkhd values]

As the two examples show, the tkhd of the audio trak and the tkhd of the video trak have the same size; only the contents differ according to the type of track. At this point, the analysis of trak's tkhd is complete.

Parsing moov-trak-mdia

After tkhd has been parsed, we can analyze the remaining sub-containers of the trak container. The Media Atom has type mdia and must contain the following containers.

  • A media header: Media Header Atom (mdhd)
  • A handler reference: Handler Reference Atom (hdlr)
  • A media information container: Media Information Atom (minf); a User Data Atom (udta) may also be present

The parsing method of this container is shown in Table 3-9.

[Image: Table 3-9, mdia parsing method]

Let’s first refer to the data of MP4 files:

[Image: hex dump of the mdia container]

edts offset: 0x0000009c + 0x0000005c = 0x000000f8
mdia offset: 0x000000f8 + 0x00000030 = 0x00000128
mdhd offset: 0x00000130
hdlr offset: 0x00000130 + 0x00000020 = 0x00000150
minf offset: 0x00000150 + 0x0000002d = 0x0000017d
vmhd offset: 0x00000185

From the file content we can see that the size of this mdia container is 0x0000114a (4426) bytes. The mdia container holds three sub-containers: mdhd, hdlr and minf. The size of mdhd is 0x00000020 (32) bytes, the size of hdlr is 0x0000002d (45) bytes, and the size of minf is 0x000010f5 (4341) bytes. The mdia header (8 bytes) + mdhd + hdlr + minf comes to exactly 4426 bytes, so the mdia container has now been parsed.

Parsing moov-trak-mdia-mdhd

The mdhd container is included in each track and describes the Media Header. The information it contains is shown in Table 3-10.

[Image: Table 3-10, mdhd parsing method]

According to ISO/IEC 14496-12, the parsing differs slightly depending on whether the version field is 0 or 1. The common (version 0) parsing method is introduced here.

The corresponding data is parsed according to the parsing method of the table:

[Image: hex dump of the mdhd container]

The corresponding data can be parsed one by one from the content of the opened file. See Table 3-11 for details.

[Image: Table 3-11, parsed mdhd values]
From Table 3-11 we can see that the size of this Media Header is 32 bytes, the type is mdhd, the version is 0, the creation time and modification time are both 0, the timescale is 25 000, and the duration is 250 000 (that is, 250 000 / 25 000 = 10 seconds). The language code is 0x55C4, which unpacks to "und" (undetermined) in the packed ISO 639-2/T encoding (refer to ISO 639-2/T for the specific language codes). At this point, the mdhd parsing is complete.

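As a quick illustration (my own, not from the book), the packed language code 0x55C4 mentioned above can be unpacked in a few lines of C++: each of the three characters is stored as a 5-bit value equal to its ASCII code minus 0x60, preceded by one pad bit.

#include <cstdint>
#include <iostream>
#include <string>

// Unpack an mdhd language code (packed ISO 639-2/T: 1 pad bit + three 5-bit characters).
std::string unpack_language(uint16_t packed) {
    std::string lang(3, '\0');
    lang[0] = static_cast<char>(((packed >> 10) & 0x1F) + 0x60);
    lang[1] = static_cast<char>(((packed >> 5) & 0x1F) + 0x60);
    lang[2] = static_cast<char>((packed & 0x1F) + 0x60);
    return lang;
}

int main() {
    std::cout << unpack_language(0x55C4) << "\n";   // prints "und" (undetermined)
    return 0;
}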

Parsing moov-trak-mdia-hdlr

The hdlr container describes the playback process of the media stream. The content contained in the container is shown in Table 3-12:

[Image: Table 3-12, hdlr parsing method]
According to the reading method in Table 3-12, read the content data in the sample file. The data is as follows:

[Image: hex dump of the hdlr container]

According to the information in the file content, the data can be read out; the corresponding values are shown in Table 3-13.

[Image: Table 3-13, parsed hdlr values]

From the values parsed in Table 3-13 we can see that this is the data of a video track: the corresponding handler name is VideoHandler, terminated by 0x00. The hdlr container has now been parsed.

Parsing moov-trak-mdia-minf

The minf container holds many important sub-containers, such as those related to audio and video sample information. The information in minf acts as a mapping of the audio and video data, and its contents are as follows.

  • Video information header: Video Media Information Header (vmhd sub-container)
  • Audio information header: Sound Media Information Header (smhd sub-container)
  • Data information: Data Information (dinf sub-container)
  • Sample table: Sample Table (stbl sub-container)

The method of parsing minf has been introduced before. The following is a detailed introduction to the method of parsing vmhd, smhd, dinf and stbl containers.

Parsing moov-trak-mdia-minf-vmhd

The format of the vmhd container content is shown in Table 3-14.

[Image: Table 3-14, vmhd parsing method]
Read and parse the contents of the container according to this table. The data is as follows:

[Image: hex dump of the vmhd container]

The data is parsed according to the file content, and the corresponding values are shown in Table 3-15.

[Image: Table 3-15, parsed vmhd values]

Parse moov-trak-mdia-minf-smhd

Table 3-15 shows the analysis of the video Header. Let's take a look at the analysis of the audio Header.

The format of the smhd container is shown in Table 3-16.

[Image: Table 3-16, smhd parsing method]
According to Table 3-16, parse the data corresponding to the audio in the file. The parsed data is as follows:

[Image: hex dump of the smhd container]
After parsing the data according to the file content, the corresponding values are shown in Table 3-17.

[Image: Table 3-17, parsed smhd values]

Parse moov-trak-mdia-minf-dinf

The dinf container describes data information: it defines the information of the audio and video data and contains the sub-container dref. The following is an example of parsing dinf and its sub-container dref; the parsing method of dref is shown in Table 3-18.

[Image: Table 3-18, dref parsing method]

Parse moov-trak-mdia-minf-stbl

The stbl container is also called the sample table container (Sample Table Atom). It holds the information needed to convert media time into actual samples, and it also describes how the samples are to be interpreted, for example whether the video data needs to be decompressed and which decompression algorithm to use. The sub-containers it contains are as follows.

  • Sample description container: Sample Description Atom (stsd)
  • Sample time container: Time To Sample Atom (stts)
  • Sample synchronization container: Sync Sample Atom (stss)
  • Chunk sampling container: Sample To Chunk Atom (stsc)
  • Sample size container: Sample Size Atom (stsz)
  • Chunk offset container: Chunk Offset Atom (stco)
  • Shadow sync container: Shadow Sync Atom (stsh)

stbl contains all of the time and data indexing of the media samples in the track. Using the sample information in this container, you can locate a sample in media time, determine its type and size, and find where adjacent samples live in other containers. If the track containing the Sample Table Atom does not reference any data, it is not a useful media track and does not need to contain any sub-Atoms.

If the track where the Sample Table Atom is located references data, it must contain the following sub-atoms.

  • Sample description container
  • Sample size container
  • Chunk sampling container
  • Chunk offset container

All subtables have the same number of samples.

stbl is an essential Atom and must contain at least one entry, because it holds the directory information of the data reference Atom used to retrieve the media samples. Without the sample description it is impossible to work out where the media samples are stored. The Sync Sample Atom is optional; if it is absent, all samples are sync samples.
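To give a feel for what the sample-table data looks like, here is a hedged sketch of parsing one of the simpler sub-containers, the Sample Size Atom (stsz). Per ISO/IEC 14496-12, its payload after the 8-byte size/type header consists of version/flags, a default sample_size, a sample_count, and, only when sample_size is 0, one 32-bit size per sample. The buffer handling is illustrative.

#include <cstddef>
#include <cstdint>
#include <vector>

// Read a 32-bit big-endian integer from a byte buffer.
static uint32_t be32(const uint8_t* p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Parse the payload of an stsz Box (the bytes after the 8-byte size/type header)
// and return one size per sample.
std::vector<uint32_t> parse_stsz(const uint8_t* data, size_t len) {
    std::vector<uint32_t> sizes;
    if (len < 12) return sizes;                  // version/flags + sample_size + sample_count
    // data[0] is the version, data[1..3] are the flags.
    const uint32_t sample_size  = be32(data + 4);
    const uint32_t sample_count = be32(data + 8);
    if (sample_size != 0) {
        // All samples share the same size; no per-sample table follows.
        sizes.assign(sample_count, sample_size);
        return sizes;
    }
    // Otherwise a table of sample_count 32-bit entries follows.
    size_t off = 12;
    for (uint32_t i = 0; i < sample_count && off + 4 <= len; ++i, off += 4)
        sizes.push_back(be32(data + off));
    return sizes;
}

Combined with stsc (sample-to-chunk) and stco (chunk offsets), these sizes are what allow a player to locate the bytes of each sample inside mdat.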

Parsing moov-trak-edts

The edts container defines the portions of the media that are used to build a track in the movie file. All edts data is kept in a table that contains the time offset and length of each portion. If this table is absent, the track starts playing immediately; an empty edts entry is used to offset the start time of the track, as shown in Table 3-19.

[Image: Table 3-19, edts/elst parsing method]

The edts data in Trak is as follows:

[Image: hex dump of the edts container]

The size of this edts Atom is 0x00000030 (48) bytes and its type is edts. It contains the elst sub-container, whose size is 0x00000028 (40) bytes; the edts header (8 bytes) plus the elst sub-container comes to 48 bytes. At this point, the edts container has been parsed.


So far, the parsing of the MP4 file format has been introduced. Following the methods above, readers can parse MP4 files themselves and then read out the audio and video data and the corresponding media information. Since parsing MP4 files with a binary viewer has to be done byte by byte, which is time-consuming and tedious, analysis tools can be used to help. Next, we introduce commonly used MP4 viewing tools and FFmpeg's support for MP4 files.

3.1.2 MP4 analysis tools

There are many tools for analyzing the MP4 container format. Besides FFmpeg, commonly used tools include Elecard StreamEye, mp4box, mp4info, etc. Here is a brief introduction to these common tools.

1. Elecard StreamEye

Elecard StreamEye is a very powerful video information viewing tool. It can show the frame arrangement, displaying I, P and B frames as columns of different colors, with the column height proportional to the frame size. It can also analyze the MP4 container information, including stream information, macroblock information, file header information, image information and file information, and it can step through the file frame by frame to show the detailed information and state of each frame. The MP4 content as viewed with Elecard StreamEye is shown in Figure 3-1.

[Image: Figure 3-1, MP4 content viewed in Elecard StreamEye]

2. mp4box

mp4box is a component of the GPAC project. It can be used to mux and demux media files. Its help output is roughly as follows:

[Image: mp4box help output]
As you can see from the help information above, mp4box also has many sub-help topics, such as DASH slicing, encoding, metadata, BIFS streams, ISMA and SWF-related help. Let's use mp4box to analyze the information of output.mp4. The output is as follows:

[Image: mp4box analysis output for output.mp4]

From the output we can see that the parsed information, such as Timescale and Duration, matches the data obtained by hand-parsing the MP4 file in the earlier section on the MP4 format.

3. mp4info

mp4info is also a good MP4 analysis tool, and it is a visual tool (see Figure 3-2). It can parse out each Box in an MP4 file and display its data. Using mp4info to analyze the content of an MP4 file is therefore very convenient.

[Image: Figure 3-2, mp4info view of an MP4 file]

As shown in Figure 3-2, mp4info parses the MP4 containers and displays the Atom structure directly, which is considerably more convenient than the byte-by-byte reading and parsing done earlier.

3.1.3 MP4 Demuxer (decapsulation) in FFmpeg

Using the method for viewing FFmpeg Demuxer information introduced earlier, run ffmpeg -h demuxer=mp4 on the command line to view the Demuxer information for MP4 files:

[Image: output of ffmpeg -h demuxer=mp4]
As the output shows, the help information of FFmpeg reveals that the MP4 Demuxer is shared with mov, 3gp, m4a, 3g2 and mj2. The parameters for parsing MP4 files are shown in Table 3-20.

[Image: Table 3-20, MP4 Demuxer parameters]

When parsing MP4 files, the ignore_editlist parameter can be used to ignore the Edit List Atom. For MP4 Demuxer operations the default configuration is usually sufficient, so not many explanations and examples are given here.
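For readers who use the libavformat API rather than the command line, demuxer options such as ignore_editlist can be passed through an AVDictionary when opening the input. Below is a minimal sketch under that assumption; the file name is hypothetical and error handling is reduced to a bare minimum.

extern "C" {
#include <libavformat/avformat.h>
#include <libavutil/dict.h>
}
#include <cstdio>

int main() {
    AVFormatContext* ic = nullptr;
    AVDictionary* opts = nullptr;

    // Roughly equivalent to "ffmpeg -ignore_editlist 1 -i input.mp4 ..." on the command line.
    av_dict_set(&opts, "ignore_editlist", "1", 0);

    if (avformat_open_input(&ic, "input.mp4", nullptr, &opts) < 0) {
        std::fprintf(stderr, "failed to open input\n");
        av_dict_free(&opts);
        return 1;
    }
    av_dict_free(&opts);   // options not consumed by the demuxer remain in the dictionary

    avformat_find_stream_info(ic, nullptr);
    std::printf("streams: %u, duration: %lld us\n",
                ic->nb_streams, static_cast<long long>(ic->duration));

    avformat_close_input(&ic);
    return 0;
}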

3.1.4 MP4 Muxer (encapsulation) in FFmpeg

As mentioned in Section 3.1.3, MP4 shares its Demuxer with mov, 3gp, m4a, 3g2 and mj2, and their Muxers do not differ much either, although they are registered as separate Muxers in ffmpeg even though the same container family is used for muxing and demuxing. Muxing MP4 is slightly more complex than demuxing it because there are more options available; the relevant parameters are listed in Table 3-21.

[Image: Table 3-21, MP4 Muxer parameters]

As can be seen from the parameter list, the MP4 muxer supports relatively complex parameters, such as slicing at video key frames, setting the maximum moov container size, and setting encryption. Below are examples of common parameters.

1. faststart parameter use case

Under normal circumstances, ffmpeg generates moov after mdat is written. You can move the moov container to the front of mdat through the parameter faststart. Here is an example:

./ffmpeg -i input.flv -c copy -f mp4 output.mp4

Then use mp4info to view the order in which the containers of output.mp4 appear, as shown in Figure 3-3.

[Image: Figure 3-3, container order of output.mp4: moov after mdat]

As Figure 3-3 shows, the moov container appears after mdat. If the faststart parameter is used, moov is moved in front of mdat after the above structure has been generated:

./ffmpeg -i input.flv -c copy -f mp4 -movflags faststart output.mp4 

Then use mp4info to view the container order of MP4. You can see that moov has been moved to the front of mdat, as shown in Figure 3-4.

[Image: Figure 3-4, container order with faststart: moov before mdat]
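The same faststart behavior can also be requested from the libavformat API by passing movflags through an AVDictionary to avformat_write_header. Below is a minimal remuxing sketch that follows the standard libavformat remuxing pattern; it is a simplified illustration (stream copy only, very little error handling), not production code, and the default file names are just placeholders.

extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <libavutil/dict.h>
}

int main(int argc, char* argv[]) {
    const char* in_name  = argc > 1 ? argv[1] : "input.flv";    // illustrative defaults
    const char* out_name = argc > 2 ? argv[2] : "output.mp4";

    AVFormatContext* ic = nullptr;
    if (avformat_open_input(&ic, in_name, nullptr, nullptr) < 0) return 1;
    if (avformat_find_stream_info(ic, nullptr) < 0) return 1;

    AVFormatContext* oc = nullptr;
    if (avformat_alloc_output_context2(&oc, nullptr, "mp4", out_name) < 0) return 1;

    // Copy every input stream's codec parameters (stream copy, like "-c copy").
    for (unsigned i = 0; i < ic->nb_streams; ++i) {
        AVStream* os = avformat_new_stream(oc, nullptr);
        avcodec_parameters_copy(os->codecpar, ic->streams[i]->codecpar);
        os->codecpar->codec_tag = 0;
    }

    if (!(oc->oformat->flags & AVFMT_NOFILE) &&
        avio_open(&oc->pb, out_name, AVIO_FLAG_WRITE) < 0) return 1;

    // Equivalent of "-movflags faststart": moov is moved before mdat when the trailer is written.
    AVDictionary* opts = nullptr;
    av_dict_set(&opts, "movflags", "faststart", 0);
    if (avformat_write_header(oc, &opts) < 0) return 1;
    av_dict_free(&opts);

    AVPacket* pkt = av_packet_alloc();
    while (av_read_frame(ic, pkt) >= 0) {
        AVStream* is = ic->streams[pkt->stream_index];
        AVStream* os = oc->streams[pkt->stream_index];
        av_packet_rescale_ts(pkt, is->time_base, os->time_base);  // input -> output time base
        pkt->pos = -1;
        av_interleaved_write_frame(oc, pkt);
        av_packet_unref(pkt);
    }
    av_packet_free(&pkt);

    av_write_trailer(oc);
    avformat_close_input(&ic);
    if (!(oc->oformat->flags & AVFMT_NOFILE)) avio_closep(&oc->pb);
    avformat_free_context(oc);
    return 0;
}

Note that faststart makes the muxer rewrite the output when av_write_trailer is called, so the output must be a seekable regular file rather than a pipe.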

2. dash parameter use case

When the DASH format is used, the special MP4 layout it requires can be generated with the dash parameter:

./ffmpeg -i input.flv -c copy -f mp4 -movflags dash output.mp4

Use mp4info to view the container layout, which is slightly unusual. The details were introduced earlier; see Figure 3-5.

As can be seen from Figure 3-5, the container layout of this DASH-style MP4 file is somewhat different from a regular MP4: it mainly consists of three kinds of containers, sidx, moof and mdat.

[Image: Figure 3-5, container layout of a DASH-style MP4: sidx, moof and mdat]

3. isml parameter use case

ISMV is a streaming media format released by Microsoft (used by IIS Smooth Streaming). With the isml parameter, an ISMV live stream can be pushed to an IIS server and published:

./ffmpeg -re -i input.mp4 -c copy -movflags isml+frag_keyframe -f ismv Stream 

Observe the format of the stream, which is roughly as follows:

[Image: ISMV/ISML stream manifest output]

The principle of the generated format is similar to HLS: an XML index is used, and the index mainly contains key information about the audio and video streams, such as width, height and bitrate; the sliced content is then refreshed continuously for the live broadcast.

3.2 Convert video files to FLV (omitted) (FLV playback relies on the Flash Player plug-in, and Adobe has ended support for Flash. I think things will move toward the HTML5-supported MP4 and WebM formats, so I am skipping FLV for now because I have no time.)

FLV is also a common format in Internet live-streaming and on-demand scenarios. FLV is a container format released by Adobe that can be used for live or on-demand delivery. Its structure is very simple: the data exists in the form of FLV TAGs, and each TAG is independent. Next, the FLV standard is introduced in detail.


Origin blog.csdn.net/Dontla/article/details/135188338