Audio and video codec: Notes on MP4 encapsulation format

1. Introduction:
The MP4 encapsulation (container) format has become one of the most common media container formats thanks to its cross-platform support. An MP4 file is made up of boxes; each box stores a different kind of information, and boxes can be nested inside one another. There are many box types, but the most important top-level boxes are:

ftyp : File Type Box, describes the MP4 specification and version that the file complies with
moov : Movie Box, holds the metadata of the media; there is only one per file
mdat : Media Data Box, stores the actual media data; a file may contain more than one

Each box consists of two parts: a box header and a box body.
box header: metadata of the box, such as the box type and the box size.
box body: the data part of the box. What it stores depends on the box type; for example, the body of mdat stores the media data.
When the box body contains other nested boxes, the box is called a container box.
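
The header layout can be illustrated with a minimal sketch in C. It assumes the standard ISO base media layout: a 32-bit big-endian size followed by a four-character type, where a size of 1 means a 64-bit size follows the type and a size of 0 means the box runs to the end of the file. The BoxHeader type and the parse_box_header helper are hypothetical names for this illustration, not FFmpeg code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type for this sketch; not an FFmpeg structure. */
typedef struct BoxHeader {
    char     type[5];     /* four-character code plus terminating NUL */
    uint64_t size;        /* total box size, header included; 0 = runs to end of file */
    size_t   header_len;  /* 8 for a compact header, 16 when a 64-bit size is present */
} BoxHeader;

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static uint64_t read_be64(const uint8_t *p)
{
    return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Parse one box header from buf. Returns 0 on success, -1 if buf is too short. */
static int parse_box_header(const uint8_t *buf, size_t len, BoxHeader *out)
{
    if (len < 8)
        return -1;
    out->size = read_be32(buf);
    memcpy(out->type, buf + 4, 4);
    out->type[4] = '\0';
    out->header_len = 8;
    if (out->size == 1) {            /* 64-bit size stored right after the type */
        if (len < 16)
            return -1;
        out->size = read_be64(buf + 8);
        out->header_len = 16;
    }
    return 0;
}
```

The box body then starts header_len bytes after the start of the box and is size - header_len bytes long.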

2. Important boxes:

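ftyp
mdat
moov
	-mvhd
		-(time_scale): number of time units in one second
		-(duration): movie duration, equal to the duration of the longest trak
	-trak
		-tkhd: metadata of a single track
			-(id): unique identifier of the current track
			-(duration): duration of the current track; FFmpeg ignores this value
			-(width): video width
			-(height): video height
		-mdia: describes the media carried by the current track
			-hdlr: declares the type of the current track
				*vide: video track
				*soun: audio track
				*m1a : MP2
				*subp/clcp: subtitles
			-minf: media information container (parent of stbl)
				-stbl: index and timing information of the media data (very important)
					-stsd: identifies the format of the current trak; maps to codec_id, codec_type, etc. in FFmpeg
					-stts: duration of each sample (frame)
					-stss: number and indices of the keyframes in the trak
					-ctts: difference between dts and pts; only needed for streams that contain B-frames
					-stsc: number of samples in each chunk
					-stsz: size of each sample and the number of samples in the trak
					-stco: offsets of the chunks in the file
						-chunk_offsets: offset of each chunk relative to the start of the file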
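
To show how the stbl tables work together, here is a minimal sketch in C that locates one sample in the file: stsc says which chunk the sample falls in, stco gives the offset of that chunk, and stsz gives the sizes of the samples that precede it inside the chunk. The StscEntry and SampleTable types and the sample_file_offset function are hypothetical names for this illustration. The sketch assumes the tables have already been read into arrays, that stsz holds one size per sample, and that offsets are widened to 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one stsc entry. */
typedef struct StscEntry {
    uint32_t first_chunk;        /* 1-based index of the first chunk this entry applies to */
    uint32_t samples_per_chunk;  /* samples in each chunk of this run */
} StscEntry;

/* Hypothetical bundle of the three tables needed to locate a sample. */
typedef struct SampleTable {
    const StscEntry *stsc; size_t stsc_count;
    const uint32_t  *stsz; size_t sample_count;  /* size of every sample */
    const uint64_t  *stco; size_t chunk_count;   /* file offset of every chunk */
} SampleTable;

/* Compute the file offset of a 0-based sample index.
 * Returns 0 on success, -1 if the index or the tables are invalid. */
static int sample_file_offset(const SampleTable *t, size_t sample, uint64_t *offset)
{
    size_t first_sample = 0;  /* index of the first sample covered by the current stsc run */

    if (sample >= t->sample_count)
        return -1;
    for (size_t i = 0; i < t->stsc_count; i++) {
        size_t spc   = t->stsc[i].samples_per_chunk;
        size_t first = t->stsc[i].first_chunk;
        /* a run lasts until the next entry's first_chunk, or to the last chunk */
        size_t next  = (i + 1 < t->stsc_count) ? t->stsc[i + 1].first_chunk
                                               : t->chunk_count + 1;
        if (spc == 0 || first == 0 || next <= first || next > t->chunk_count + 1)
            return -1;
        size_t run_samples = (next - first) * spc;
        if (sample < first_sample + run_samples) {
            size_t chunks_in = (sample - first_sample) / spc;
            size_t chunk     = first + chunks_in;             /* 1-based chunk index */
            size_t s         = first_sample + chunks_in * spc; /* first sample of that chunk */
            uint64_t pos     = t->stco[chunk - 1];
            for (; s < sample; s++)  /* skip the earlier samples in the same chunk */
                pos += t->stsz[s];
            *offset = pos;
            return 0;
        }
        first_sample += run_samples;
    }
    return -1;
}
```

A demuxer would normally build this mapping once into an index rather than recompute it for every sample; the linear walk here is only meant to make the relationship between the three tables visible.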

3. Analysis of MP4-related boxes in FFmpeg:
In the FFmpeg source code, the demuxer for parsing the MP4 format is mov, and the file path is:

libavformat/mov.c

The demuxer is declared with the following members:

const AVInputFormat ff_mov_demuxer = {
    .name           = "mov,mp4,m4a,3gp,3g2,mj2",
    .long_name      = NULL_IF_CONFIG_SMALL("QuickTime / MOV"),
    .priv_class     = &mov_class,
    .priv_data_size = sizeof(MOVContext),
    .extensions     = "mov,mp4,m4a,3gp,3g2,mj2,psp,m4b,ism,ismv,isma,f4v",
    .flags_internal = FF_FMT_INIT_CLEANUP,
    .read_probe     = mov_probe,
    .read_header    = mov_read_header,
    .read_packet    = mov_read_packet,
    .read_close     = mov_read_close,
    .read_seek      = mov_read_seek,
    .flags          = AVFMT_NO_BYTE_SEEK | AVFMT_SEEK_TO_PTS | AVFMT_SHOW_IDS,
};

The parsing of each box is done in mov_read_header:

static int mov_read_header(AVFormatContext *s)
{
    MOVContext *mov = s->priv_data;
    AVIOContext *pb = s->pb;
    int j, err;
    /* an atom is the smallest unit of box parsing */
    MOVAtom atom = { AV_RL32("root") };
    ...
    /* check MOV header */
    do {
        if (mov->moov_retry)
            avio_seek(pb, 0, SEEK_SET);
        /* read the box contents, descending into nested boxes */
        if ((err = mov_read_default(mov, pb, atom)) < 0) {
            av_log(s, AV_LOG_ERROR, "error reading header\n");
            return err;
        }
    } while ((pb->seekable & AVIO_SEEKABLE_NORMAL) && !mov->found_moov && !mov->moov_retry++);
    if (!mov->found_moov) {
        /* found_moov flags whether the moov box has been read */
        av_log(s, AV_LOG_ERROR, "moov atom not found\n");
        return AVERROR_INVALIDDATA;
    }
    ...
}

Take a look at mov_read_default:

static int mov_read_default(MOVContext *c, AVIOContext *pb, MOVAtom atom)
{
    int64_t total_size = 0;
    MOVAtom a;
    int i;
    /* atom_depth tracks how deeply the atoms are nested */
    if (c->atom_depth > 10) {
        av_log(c->fc, AV_LOG_ERROR, "Atoms too deeply nested\n");
        return AVERROR_INVALIDDATA;
    }
    c->atom_depth ++;

    if (atom.size < 0)
        atom.size = INT64_MAX;
    while (total_size <= atom.size - 8 && !avio_feof(pb)) {
        /* parse points to the parsing function for the current box */
        int (*parse)(MOVContext*, AVIOContext*, MOVAtom) = NULL;
        ...
        /* walk the table and pick the parsing function that matches the box type */
        for (i = 0; mov_default_parse_table[i].type; i++)
            if (mov_default_parse_table[i].type == a.type) {
                parse = mov_default_parse_table[i].parse;
                break;
            }
        ...
        if (!parse) { /* skip leaf atoms data */
            avio_skip(pb, a.size);
        } else {
            int64_t start_pos = avio_tell(pb);
            int64_t left;
            /* call the parsing function for this box */
            int err = parse(c, pb, a);
            if (err < 0) {
                c->atom_depth --;
                return err;
            }
            ...
        }
        ...
    }
    ...
}
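
The same recursive idea can be shown outside FFmpeg with a minimal sketch in C that walks boxes held in a memory buffer, descends into container boxes, and refuses to recurse past a fixed depth. The names count_boxes, is_container and MAX_DEPTH are hypothetical, the list of container types is a small assumed subset, and boxes that use a 64-bit size or a size of 0 are simply rejected to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEPTH 10  /* mirrors the nesting limit in the excerpt above */

static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Assumed subset of container box types, enough to reach stbl. */
static int is_container(const uint8_t *type)
{
    static const char *const names[] = { "moov", "trak", "mdia", "minf", "stbl" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (memcmp(type, names[i], 4) == 0)
            return 1;
    return 0;
}

/* Count every box in buf, descending into container boxes.
 * Returns -1 on malformed input or when the nesting is too deep. */
static int count_boxes(const uint8_t *buf, size_t len, int depth)
{
    int total = 0;
    size_t pos = 0;

    if (depth > MAX_DEPTH)
        return -1;
    while (len - pos >= 8) {
        uint32_t size = rd32(buf + pos);
        if (size < 8 || size > len - pos)
            return -1;  /* sketch: sizes 0 and 1 (64-bit size) are not handled */
        total++;
        if (is_container(buf + pos + 4)) {
            /* the body of a container box is itself a sequence of boxes */
            int inner = count_boxes(buf + pos + 8, size - 8, depth + 1);
            if (inner < 0)
                return -1;
            total += inner;
        }
        pos += size;
    }
    return total;
}
```

Calling count_boxes(buf, len, 0) on a buffer that holds a moov box containing a trak box containing a tkhd box, followed by a free box, visits four boxes in total.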

The default box parsing array for MP4 files is mov_default_parse_table:

static const MOVParseTableEntry mov_default_parse_table[] = {
{ MKTAG('A','C','L','R'), mov_read_aclr },
{ MKTAG('A','P','R','G'), mov_read_avid },
{ MKTAG('A','A','L','P'), mov_read_avid },
{ MKTAG('A','R','E','S'), mov_read_ares },
{ MKTAG('a','v','s','s'), mov_read_avss },
{ MKTAG('a','v','1','C'), mov_read_glbl },
...
};
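
The table-driven dispatch can be reproduced in a few lines. The following is a minimal sketch in C, not FFmpeg code: the FOURCC macro, the ParseTableEntry type, the two dummy parsers and find_parser are all hypothetical names. It packs a four-character code into a 32-bit integer, stores (type, function) pairs in a table terminated by a zero type, and searches the table with the same loop shape as the excerpt from mov_read_default.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack four characters into one 32-bit tag, first character in the low byte. */
#define FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

typedef int (*BoxParser)(const uint8_t *body, size_t size);

typedef struct ParseTableEntry {
    uint32_t  type;   /* four-character box type */
    BoxParser parse;  /* function that handles this box type */
} ParseTableEntry;

/* Dummy parsers: they only return a distinct value so the dispatch can be observed. */
static int parse_ftyp(const uint8_t *body, size_t size) { (void)body; (void)size; return 1; }
static int parse_moov(const uint8_t *body, size_t size) { (void)body; (void)size; return 2; }

static const ParseTableEntry parse_table[] = {
    { FOURCC('f','t','y','p'), parse_ftyp },
    { FOURCC('m','o','o','v'), parse_moov },
    { 0, NULL }  /* sentinel: a zero type ends the search loop */
};

/* Return the parser registered for type, or NULL if the box type is unknown. */
static BoxParser find_parser(uint32_t type)
{
    for (size_t i = 0; parse_table[i].type; i++)
        if (parse_table[i].type == type)
            return parse_table[i].parse;
    return NULL;
}
```

A NULL result corresponds to the branch in mov_read_default where no parser is found and the box data is skipped. Supporting a new box type then only requires adding one row to the table.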

4. MP4 box analysis tools:
1. mp4info:
It is convenient for getting basic information about each box, but it does not show all of the key fields inside a box.

2. MP4 explorer:
It lists the key fields of each box in more detail, which helps when analyzing the bitstream.

5. Analysis of some common problems of MP4:
1. Playback over Samba and other network shares takes a long time to start:
This is mainly because the moov box, which records the metadata, is at the end of the file. The player has to download the whole file before it can reach the metadata and parse out the information needed for decoding.
The solution is to move moov to the beginning of the file. For local files, you can remux with FFmpeg (the streams are copied, not re-encoded):

ffmpeg -i xxx.mp4 -codec copy -movflags faststart output.mp4

For short-video services such as Douyin, it is recommended to move moov to the front of the file at upload time. For online streams whose moov is already at the end of the file, consider cloud transcoding to achieve near-instant start-up.
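
To tell whether a file needs this treatment, it is enough to look at the order of the top-level boxes. The following minimal sketch in C scans the beginning of a file that has been read into a buffer and reports whether moov or mdat comes first. The function name moov_before_mdat is hypothetical. The sketch assumes that the boxes preceding moov or mdat (typically ftyp and free) use the compact 32-bit size and fit entirely in the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Scan top-level boxes at the start of a file.
 * Returns 1 if moov appears before mdat, 0 if mdat appears first,
 * and -1 if neither is reached within buf or the data is malformed. */
static int moov_before_mdat(const uint8_t *buf, size_t len)
{
    size_t pos = 0;

    while (len - pos >= 8) {
        uint32_t size = be32(buf + pos);
        const uint8_t *type = buf + pos + 4;

        /* the type is checked first, so the body of mdat never has to be in buf */
        if (memcmp(type, "moov", 4) == 0)
            return 1;
        if (memcmp(type, "mdat", 4) == 0)
            return 0;
        if (size < 8 || size > len - pos)
            return -1;  /* sketch: sizes 0 and 1 (64-bit size) are not handled */
        pos += size;    /* skip the whole box, header included */
    }
    return -1;
}
```

A result of 0 means the file would benefit from being remuxed with moov at the front, as in the command above.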
