Video Encapsulation Format: Detailed Explanation of MP4 Format

1. Overview of MP4 format

1.1 Introduction

  MP4, or MPEG-4 Part 14, is a standard digital multimedia container format with the extension .mp4. While .mp4 is the only extension defined by the official standard, third parties often use other extensions to indicate the content of the file:

  • MPEG-4 files with both audio and video usually use the standard extension .mp4;
  • Audio-only MPEG-4 files use the .m4a extension.

  Almost any kind of data can be embedded in an MP4 file through dedicated streams, so MP4 files carry separate tracks that store stream information. The widely supported codecs and stream formats are:

  • Video formats: H.264/AVC, H.265/HEVC, VP8/VP9, etc.
  • Audio formats: AAC, MP3, Opus, etc.


1.2 Terminology

  To discuss this file format precisely, it is necessary to first understand the following concepts and terms. They are the key to understanding the MP4 packaging format and how it works.

(1) Box
  The concept of a Box originates from the atom in QuickTime; an MP4 file is composed of Boxes. A Box can be understood as a data block made up of Header + Data, where Data can hold media metadata or the actual audio/video stream data. A Box may store a data block directly, but it may also contain other Boxes; this kind of Box is called a container box.

(2) Sample
  A sample is, simply put, a unit of sampling. For video, a sample can be understood as one frame of data. For audio, one frame covers a fixed period of time and may be composed of multiple samples. In short: the sample is the unit in which media data is stored.

(3) Track
  A track is a collection of samples. For media data, it is a video sequence or an audio sequence; the "audio track" and "video track" we commonly speak of map to this concept. Besides Video Tracks and Audio Tracks there can also be non-media tracks, such as Hint Tracks. A Hint Track contains no media data; instead it carries instruction information for packaging other data into media data, or subtitle information. In short: a track is a media unit in a movie that can be operated on independently.

(4) Chunk
  A unit made up of several consecutive samples of a track is called a chunk. Each chunk has an offset in the file, measured from the start of the file, and within a chunk the samples are stored contiguously.
  So an MP4 file contains multiple Tracks, a Track is composed of multiple Chunks, and each Chunk contains a set of continuous Samples. It is precisely because of these definitions that the MP4 encapsulation format achieves its flexible, efficient, and open character, so they are worth understanding carefully.

2. The overall structure of MP4

2.1 Overview of MP4 structure

  The MP4 format is a box-based format: a box container can hold child boxes, which can in turn hold their own child boxes, and so on.

A box consists of two parts: box header and box body.

  • box header: the metadata of the box, such as the box type and box size.
  • box body: the data part of the box. What is actually stored depends on the box type; for example, the body of mdat stores the media data.
      In the box header, only type and size are mandatory fields. When size == 1, a largesize field is present; when size == 0, the box is the last box in the file and extends to the end of the file. Some boxes additionally carry version and flags fields; such boxes are called Full Boxes. When other boxes are nested inside the box body, the box is called a container box.
    The MP4 box diagram is as follows:

  Where:
  • ftyp (file type box): the file header, recording compatibility information
  • moov (movie box): records media information (metadata)
  • mdat (media data box): the media payload

Complete Box structure:

The data content carried by each Box is as follows:

2.2 Box structure

  The MP4 packaging format uses a structure called a box to organize data, laid out as follows:

  +-+-+-+-+-+-+-+-+-+-+
  |  header  |  body  |
  +-+-+-+-+-+-+-+-+-+-+

All other boxes syntactically inherit from this basic box structure.

2.2.1 box header

  Boxes come in two flavors: the normal box and the fullbox.
(1) The general box header structure is as follows:

  field       type     description
  ----------  -------  ----------------------------------------------------------
  size        4 Bytes  size of the entire box, including the box header
  type        4 Bytes  four ASCII characters; the value "uuid" marks a
                       user-defined box, which can be skipped
  largesize   8 Bytes  present only when size == 1, used for extension
                       (a large mdat box, for example, needs this field)

(2) A fullbox adds two fields on top of the above:

  field     type     description
  --------  -------  --------------
  version   1 Byte   version number
  flags     3 Bytes  flags
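As a minimal sketch of the field layout described above (plain Python with big-endian unpacking; the helper names are illustrative, not a real library API), a box header and the extra fullbox fields can be read like this:

```python
import struct

def parse_box_header(buf, offset=0):
    """Read size/type (and largesize when size == 1) of a box."""
    size, = struct.unpack_from(">I", buf, offset)
    box_type = buf[offset + 4:offset + 8].decode("ascii")
    header_len = 8
    if size == 1:
        # the 8-byte largesize immediately follows the 4-byte type
        size, = struct.unpack_from(">Q", buf, offset + 8)
        header_len = 16
    return box_type, size, header_len

def parse_fullbox_header(buf, offset=0):
    """A fullbox adds 1 byte of version and 3 bytes of flags."""
    box_type, size, header_len = parse_box_header(buf, offset)
    version = buf[offset + header_len]
    flags = int.from_bytes(buf[offset + header_len + 1:offset + header_len + 4], "big")
    return box_type, size, version, flags

# a 16-byte 'free' box, and an mvhd-like fullbox with version 0, flags 0
free_box = struct.pack(">I4s8s", 16, b"free", b"\x00" * 8)
print(parse_box_header(free_box))      # ('free', 16, 8)
full_box = struct.pack(">I4sB3s", 12, b"mvhd", 0, b"\x00\x00\x00")
print(parse_fullbox_header(full_box))  # ('mvhd', 12, 0, 0)
```

Note that the returned header length matters: with largesize the body starts 16 bytes in rather than 8.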

2.2.2 box body

  A box may contain multiple other boxes; such a box is called a container box. A box body may therefore hold type-specific data, other boxes, or both.
  Although there are many box types (around 70), not all of them are required. A typical MP4 file contains the necessary boxes plus some optional ones. Analyzing an MP4 file with the tool MP4Info shows the following boxes:

  From the tool's analysis we can summarize the following characteristics of MP4:

  1. MP4 files are composed of Boxes, and Boxes can be nested within each other, so the data is packed tightly with no redundancy;
  2. There are not many Box types in practice: mainly the required ftyp, moov, and mdat, plus optional boxes such as free and udta; removing the optional boxes has no effect on audio/video playback.
  3. moov stores the media metadata; it is relatively complex and deeply nested. The meaning and composition of the fields of each box are explained in detail below.
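These characteristics mean that enumerating the top-level boxes of a file needs nothing but the header fields. A hedged Python sketch (illustrative helpers, not a real library API; the demo bytes stand in for a real file):

```python
import struct

def list_top_level_boxes(data):
    """Return (type, size) for each top-level box in an MP4 byte string."""
    boxes, offset = [], 0
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)
        box_type = data[offset + 4:offset + 8].decode("ascii")
        if size == 1:    # 64-bit largesize follows the type
            size, = struct.unpack_from(">Q", data, offset + 8)
        elif size == 0:  # box extends to the end of the file
            size = len(data) - offset
        boxes.append((box_type, size))
        offset += size
    return boxes

def make_box(box_type, payload=b""):
    """Build a minimal box: 4-byte size + 4-byte type + payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

demo = (make_box(b"ftyp", b"isom\x00\x00\x02\x00isomiso2")
        + make_box(b"moov")
        + make_box(b"mdat", b"\x00" * 4))
print(list_top_level_boxes(demo))  # [('ftyp', 24), ('moov', 8), ('mdat', 12)]
```

The same walk, applied recursively to a container box's body, is how tools like MP4Info produce the box tree.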

2.3 ftyp(File Type Box)

  ftyp generally appears at the beginning of the file and declares the standard specifications the MP4 file follows:

  field                 type          description
  --------------------  ------------  ------------------------------------------------
  major_brand           4 bytes       major version
  minor_version         4 bytes       minor version
  compatible_brands[]   4 bytes each  compatible versions; note this field is a list
                                      and may contain multiple 4-byte brand codes
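A minimal sketch of serializing and reading these fields (illustrative helper names; the brand values are typical ones, not mandated by this file):

```python
import struct

def build_ftyp(major_brand, minor_version, compatible_brands):
    """Serialize an ftyp box: size + 'ftyp' + major_brand +
    minor_version + N compatible brands of 4 bytes each."""
    payload = major_brand + struct.pack(">I", minor_version) + b"".join(compatible_brands)
    return struct.pack(">I4s", 8 + len(payload), b"ftyp") + payload

def parse_ftyp(box):
    """Read the fields back out of a serialized ftyp box."""
    size, = struct.unpack_from(">I", box, 0)
    assert box[4:8] == b"ftyp"
    major = box[8:12].decode("ascii")
    minor, = struct.unpack_from(">I", box, 12)
    compat = [box[i:i + 4].decode("ascii") for i in range(16, size, 4)]
    return major, minor, compat

box = build_ftyp(b"isom", 512, [b"isom", b"iso2", b"avc1", b"mp41"])
print(parse_ftyp(box))  # ('isom', 512, ['isom', 'iso2', 'avc1', 'mp41'])
```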

An example is as follows:

2.4 moov(Movie Box)

  1. moov is a container box; there is only one per file, and all the boxes it contains describe the media information (metadata).
  2. moov may appear immediately after ftyp, or at the end of the file.
  3. Since it is a container box, its box body (everything after the box header) consists of other boxes.

Child Box:

  • mvhd (movie header box): briefly describes information shared by all the media.
  • trak (track box): describes an audio or video stream; there can be multiple traks. In the example above it appears twice, once for the audio stream and once for the video stream.
  • udta (user data box): user-defined data; can be ignored.

An example is as follows:
(1) Structure

(2) Data

(3) Composition

Sub-Box: mvhd
  Briefly describes information shared by all the media.

Sub-Box: trak
  Describes an audio or video stream; there can be multiple traks. In the example above it appears twice, once for the audio stream and once for the video stream.

2.5 mvhd(Movie Header Box)

  mvhd serves as the header of the movie's media information (note: not a box header, but the header of the moov metadata) and describes basic information shared by all the media.
  The mvhd syntax inherits from fullbox; note that the version and flags fields in the examples below belong to the fullbox header.
Box Body:


2.6 trak(track)

  1. The trak box is a container box whose child boxes carry the track's media information.
  2. An MP4 file can contain multiple independent tracks; each media stream is described by its own trak box.
  3. There are generally two traks, corresponding to the audio stream and the video stream.

An example is as follows:


Where:

  • tkhd (track header box): briefly describes the media stream, e.g. its duration and width.
  • mdia (media box): describes the media stream in detail.
  • edts (edit box): its child box elst (edit list box) offsets the timestamps of the track.

2.7 tkhd(track header box)

  1. tkhd serves as the header of the track's media information (again, not a box header, but the header of the track metadata) and describes basic information about the track.
  2. The tkhd syntax inherits from fullbox; the version and flags fields in the examples below belong to the fullbox header.
    Box Body:


2.8 edts (edit box)

  edts contains an elst (edit list box), whose function is to offset the timestamps of a track. Its main fields are:

  • segment_duration: the duration of the edit segment, in units of the timescale in the movie header box (mvhd); that is, segment_duration / timescale = actual duration in seconds.
  • media_time: the start time of the edit segment, in units of the timescale in the track's media header box (mdhd). A value of -1 (0xFFFFFFFF) denotes an empty edit; the last edit in a track must not be empty.
  • media_rate: if the rate of the edit segment is 0, the segment is a "dwell": the picture stops at the media_time point for segment_duration. Otherwise this value is always 1.

  Issues to be aware of:

  To make the PTS start from 0, media_time is generally set to the value of the first CTTS entry; when computing PTS and DTS, subtracting media_time from each adjusts them to start from 0.
  If media_time is a relatively large value, the picture is only displayed once the PTS exceeds that value. In that case the first PTS greater than or equal to media_time should be mapped to 0, and the other PTS and DTS values adjusted accordingly.
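The common single-edit adjustment above can be sketched in a few lines of Python (an illustrative helper, not a real demuxer API; timestamps are in mdhd timescale units):

```python
def adjust_timestamps(pts_list, dts_list, media_time):
    """Shift pts/dts by the edit-list media_time so that presentation
    starts at zero, as described above for the single-edit case."""
    pts = [p - media_time for p in pts_list]
    dts = [d - media_time for d in dts_list]
    return pts, dts

# Suppose the first CTTS value is 2 (in timescale units), so the muxer
# wrote media_time = 2 to make the first displayed frame start at pts 0.
pts, dts = adjust_timestamps([2, 5, 3, 4], [0, 1, 2, 3], media_time=2)
print(pts)  # [0, 3, 1, 2]
print(dts)  # [-2, -1, 0, 1]
```

Note that shifted dts values may go negative; players typically tolerate this at the start of a stream.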



2.9 mdia(media box)

  1. Defines the track's media type and sample data, and describes the sample information.
  2. It is a container box and must contain mdhd, hdlr and minf.

An example is as follows:



Where:

  • mdhd (Media Header Box): briefly describes the media stream.
  • hdlr (Handler Reference Box): mainly declares the track type.
  • minf (Media Information Box): describes the media stream's decoding-related information and the positions of the audio/video data (largely via its stbl child).

2.10 mdhd(Media Header Box)

  1. mdhd serves as the header of the media information (not a box header, but the header of the media metadata) and describes basic information about the media.
  2. The contents of mdhd and tkhd are roughly the same; however, tkhd sets attributes for the specified track, while mdhd describes the media itself.
  3. The mdhd syntax inherits from fullbox; the version and flags fields in the examples below belong to the fullbox header.
    Box Body:


Note: the timescale here has the same meaning as the timescale in mvhd, but the values may differ. The timestamp calculations in stts, ctts, and similar boxes below are all based on the timescale in mdhd.
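Converting between ticks and seconds is the same arithmetic either way; what matters is using the right box's timescale. A small sketch (illustrative helper; the 1000 and 90000 values are typical choices, not mandated):

```python
from fractions import Fraction

def ticks_to_seconds(ticks, timescale):
    """Convert a timestamp in timescale units to seconds (exact)."""
    return Fraction(ticks, timescale)

# mvhd might use timescale 1000 while a video track's mdhd uses 90000;
# the same instant has different tick values but the same duration.
print(ticks_to_seconds(2000, 1000))     # 2
print(ticks_to_seconds(180000, 90000))  # 2
```

Using Fraction avoids the rounding drift that floating-point accumulation would introduce over long tracks.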

2.11 hdlr(Handler Reference Box)

  1. hdlr mainly describes how the media is to be played: it declares the type of the current track and the corresponding handler.
  2. The hdlr syntax inherits from fullbox; the version and flags fields in the examples below belong to the fullbox header.
    Box Body:




2.12 minf(Media Information box)

  1. minf carries the handler-specific information for the track's media data; the media handler uses it to map media time to media data and process it. minf is itself a container box; the part to pay attention to inside it is stbl, which is also the most complicated part of moov.
  2. In general, minf contains a header box (one of vmhd, smhd, hmhd, or nmhd), a dinf (data information box), and an stbl (sample table box).

2.13 *mhd (Media Info Header Box)

  Depending on the media type this is vmhd, smhd, hmhd, or nmhd; for example, a video track uses vmhd and an audio track uses smhd.
(1) vmhd

  • graphicsmode: the video composition mode; 0 means copy the original image, otherwise the image is composited with opcolor.
  • opcolor: a (red, green, blue) triple used by the graphics modes.
    (2) smhd
  • balance: the stereo balance, a fixed-point [8.8] value; 0 is the middle, -1.0 is fully left, and 1.0 is fully right.

2.14 dinf(Data Information Box)

  1. Describes how to locate the media information; it is a container box.
  2. dinf generally contains a dref (data reference box).
  3. Under dref there are several "url" or "urn" boxes. They form a table for locating the track data: a track can be divided into segments, each segment's data is fetched from the address pointed to by its "url" or "urn", and the segment numbers are used in the sample descriptions to assemble the complete track. In the common case where the data is entirely contained in the current file, the location string in "url" or "urn" is empty.

2.15 stbl(Sample Table Box)

  Before introducing the stbl box, you need to introduce the sample and chunk defined in mp4:

  • sample: according to ISO/IEC 14496-12, samples cannot share the same timestamp, so in an audio or video track a sample represents one video or audio frame.
  • chunk: a set of consecutive samples. In practice, in many audio and video tracks each chunk contains exactly one sample.



  The stbl box is a container box and the most important box in the entire track: its sub-boxes describe the media stream's decoding-related information, audio/video position information, timestamp information, and so on.

  The media data itself lives in the mdat box; stbl holds the index and timing information for that media data.
An example is as follows:

Where:

  • stsd (sample description box): stores the encoding type and the information needed to initialize the decoder; its contents depend on the codec.
  • stts (time to sample box): stores the mapping from each sample of the track to its dts.
  • stss (sync sample box): for a video track, the sequence numbers of the samples that are key frames.
  • ctts (composition time to sample box): stores the difference between cts and dts for each sample in the track.
  • stsc (sample to chunk box): stores the mapping between samples and chunks in the track.
  • stsz/stz2 (sample size box): stores the byte size of each sample in the track.
  • stco/co64 (chunk offset box): stores the file offset of each chunk in the track.

2.16 stsd(sample description box)

  stsd mainly stores the encoding type and the information needed to initialize the decoder. Taking video as an example, it includes the sub-box avc1, indicating H.264 video.

2.16.1 h264 stsd

  For h264 video, the typical structure is as follows:

Within it (only avc1 and avcC are listed; the other boxes can be ignored):

  • avc1 is the box name for the AVC/H.264/MPEG-4 Part 10 video coding format. It is a container box, but its box body also carries fields of its own.
    Box Body:

avcC (AVC video stream definition box) stores the SPS and PPS; it is the AVCDecoderConfigurationRecord structure defined in ISO/IEC 14496-15.


Note: in srs, see the srs/trunk/src/kernel/srs_kerner_codec.cpp::SrsFormat::avc_demux_sps_pps() function for parsing the avcC/AVCDecoderConfigurationRecord structure.

2.16.2 aac stsd

  For aac audio, the typical structure is as follows:


As can be seen, the aac stsd structure is relatively complicated and contains many boxes. In fact, ISO/IEC 14496-3 defines the AudioSpecificConfig type, and most of the information in the aac stsd structure comes from AudioSpecificConfig.
For specific analysis, please refer to srs:

  • srs/trunk/src/kernel/srs_kerner_codec.cpp::SrsFormat::audio_aac_sequence_header_demux(), which parses the AudioSpecificConfig structure
  • srs/trunk/src/kernel/srs_kernel_mp4.cpp::SrsMp4Encoder::flush(), which writes the aac stsd structure

2.17 stts(time to sample box)

  1. Stores the mapping between each sample of the track and its dts.
  2. It contains a compressed version of a table that maps decoding time to sample number. Each entry gives a count of consecutive samples sharing the same decode delta, together with that delta. By accumulating the deltas, the complete time-to-sample table can be reconstructed.



  To save entries, a compressed representation is used: if sample_count consecutive samples share the same sample_delta, one entry can represent all of them.
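The accumulation described above can be sketched directly (an illustrative helper; entry values here are typical, not from a specific file):

```python
def expand_stts(entries):
    """Expand compressed (sample_count, sample_delta) stts entries into
    a per-sample dts list (in timescale units) by accumulating deltas."""
    dts, current = [], 0
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            dts.append(current)
            current += sample_delta
    return dts

# 3 samples with delta 1024, then 2 samples with delta 512
print(expand_stts([(3, 1024), (2, 512)]))
# [0, 1024, 2048, 3072, 3584]
```

A constant-frame-rate track compresses to a single entry, which is why stts is usually tiny for video.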
An example of an audio track is as follows:

An example of a video track is as follows:

2.18 ctts(composition time to sample box)

  1. Stores the difference between pts and dts for each sample in the track (cts = pts - dts).
  2. If a video has only I frames and P frames, the ctts table is unnecessary, because decoding order and display order coincide; if the video contains B frames, ctts is required.

Notes:

  • This box must exist when dts and pts differ; if they are always equal, the box is omitted.
  • If the box's version is 0, all samples satisfy pts >= dts and the differences are stored as unsigned values. If any sample has pts < dts, version 1 with signed differences must be used.
  • For how ctts is generated, see srs/trunk/src/kernel/srs_kernel_mp4.cpp::SrsMp4SampleManager::write_track(). pts, dts, and cts satisfy the formula pts - dts = cts.
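Applying ctts to a dts list is mechanical; a small sketch (illustrative helper; the sample values model an I P B B pattern and are not from a real file):

```python
def compute_pts(dts_list, ctts_entries):
    """Apply compressed (sample_count, sample_offset) ctts entries to a
    dts list: pts = dts + cts. Offsets may be negative with version 1."""
    offsets = []
    for sample_count, sample_offset in ctts_entries:
        offsets.extend([sample_offset] * sample_count)
    return [d + c for d, c in zip(dts_list, offsets)]

# Decode order I P B B: the P frame decodes early but displays last
dts = [0, 1024, 2048, 3072]
print(compute_pts(dts, [(1, 1024), (1, 3072), (2, 0)]))
# [1024, 4096, 2048, 3072]
```

Sorting the resulting pts values gives the display order, which is exactly how B-frame reordering shows up in the tables.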

2.19 stss(sync sample box)

  It contains a table of the media's key-frame samples. Key frames exist to support random access. If this table is absent, every sample is a key frame.


A video example follows:

2.20 stsc (sample to chunk box)

  The mapping relationship between each sample and chunk in the track is stored.

An audio example is as follows:

  • The first group of chunks has first_chunk = 1 and 1 sample per chunk. Since the second group's first_chunk is 2, the first group contains only one chunk.
  • The second group has first_chunk = 2 and 2 samples per chunk. Since the third group's first_chunk is 24, the second group contains 22 chunks and thus 44 samples.
  • This does not mean the stream has only a few samples. Each entry applies to every chunk from its first_chunk up to (but not including) the next entry's first_chunk, so runs of chunks with the same sample count are compressed into one entry rather than listed individually. In the video example, the stream has 240 frames, with 2 frames in the first chunk and 1 frame in each following chunk, giving 239 chunks in total.
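The range interpretation above can be sketched as follows (an illustrative helper; the real stsc entry also carries a sample_description_index, omitted here, and the total chunk count comes from stco):

```python
def expand_stsc(entries, total_chunks):
    """Expand (first_chunk, samples_per_chunk) stsc entries into a
    samples-per-chunk list: each entry applies from its first_chunk up
    to (but not including) the next entry's first_chunk."""
    per_chunk = []
    for i, (first_chunk, samples) in enumerate(entries):
        last = entries[i + 1][0] if i + 1 < len(entries) else total_chunks + 1
        per_chunk.extend([samples] * (last - first_chunk))
    return per_chunk

# Entry table shaped like the audio example above: chunk 1 has 1 sample,
# chunks 2-23 have 2 samples each, chunk 24 has 1 sample
per = expand_stsc([(1, 1), (2, 2), (24, 1)], total_chunks=24)
print(len(per), sum(per))  # 24 46
```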

2.21 stsz(sample size box)

  Contains the number of samples and the byte size of each sample, so this box is comparatively large. It records the size of each video or audio frame; in FFmpeg, the size of an AVPacket comes from this box.

2.22 stco/co64(chunk offset box)

  1. The chunk offset table stores the position of each chunk in the file, so media data can be located directly without parsing other boxes.
  2. Note that if anything changes in the preceding boxes, this table must be rebuilt.

There are two forms of this box. If the file is very large, a chunk offset may exceed the 32-bit limit, so co64 exists for large files: it serves the same purpose as stco, locating samples inside the mdat box, except that its chunk_offset fields are 64-bit.

  • Note that stco only gives the file offset of each chunk, not the offset of each sample. To obtain a sample's offset, the sample size box (stsz) and the sample-to-chunk box (stsc) must also be consulted.
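Combining the three tables is straightforward once stsc has been expanded into a per-chunk sample count; a hedged sketch (illustrative helper and made-up offsets, not a real file):

```python
def sample_offsets(chunk_offsets, samples_per_chunk, sample_sizes):
    """Compute each sample's absolute file offset: within a chunk,
    sample N starts at chunk_offset plus the sizes of the samples
    that precede it in that chunk."""
    offsets, sample_idx = [], 0
    for chunk_off, n_samples in zip(chunk_offsets, samples_per_chunk):
        pos = chunk_off
        for _ in range(n_samples):
            offsets.append(pos)
            pos += sample_sizes[sample_idx]
            sample_idx += 1
    return offsets

# 2 chunks: chunk 1 at offset 48 holds samples 0-1, chunk 2 at 200 holds sample 2
print(sample_offsets([48, 200], [2, 1], [100, 52, 64]))  # [48, 148, 200]
```

chunk_offsets comes from stco/co64, samples_per_chunk from stsc, and sample_sizes from stsz; with the offsets in hand a player can seek into mdat directly.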

2.23 udta(user data box)

  User-defined data.

2.24 free(free space box)

  1. The contents of free are irrelevant and can be ignored; deleting this box has no impact on playback.
  2. Its box type can be either free or skip.

2.25 mdat(media data box)

  1. mdat holds the actual encoded data.
  2. mdat is also a box, with a box header and box body.
  3. mdat can reference external data (see moov --> udta --> meta); that case is not discussed here, only data stored in the file itself.
  4. The box body stores samples one after another, i.e. audio frames or video frames back to back.
  5. For H.264, the stream is organized in avcc format, i.e. length-prefixed NAL units (NALU size followed by NALU data) rather than Annex-B start codes.
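Splitting one avcc-format sample into its NAL units is then a matter of walking the length prefixes (an illustrative helper; the NALU bytes below are made up, and the prefix width really comes from avcC's lengthSizeMinusOne field, usually 4):

```python
import struct

def split_avcc_nalus(sample, length_size=4):
    """Split an avcc-format sample into NAL units: each NALU is
    preceded by its big-endian length (length_size bytes)."""
    nalus, offset = [], 0
    while offset + length_size <= len(sample):
        nalu_len = int.from_bytes(sample[offset:offset + length_size], "big")
        offset += length_size
        nalus.append(sample[offset:offset + nalu_len])
        offset += nalu_len
    return nalus

# two made-up NALUs of 3 and 2 bytes, each with a 4-byte length prefix
sample = struct.pack(">I", 3) + b"\x65\x88\x84" + struct.pack(">I", 2) + b"\x41\x9a"
print([n.hex() for n in split_avcc_nalus(sample)])  # ['658884', '419a']
```

This is also why an MP4-stored H.264 stream cannot be fed to an Annex-B decoder as-is: the length prefixes must first be rewritten as start codes.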



Origin blog.csdn.net/m0_60259116/article/details/132714706