Detailed explanation of FLV format

Table of contents

1. Overview of file structure

2. Content description

2.1 Field description

2.2 Tag

2.2.1 Tag Header

2.2.2 Tag Data

3. Analysis process

4. Summary of Knowledge Points

5. FLV usage scenarios and advantages and disadvantages

6. QA

1. Is the previous tag size necessary?

2. How can you tell that a piece of data is the previous tag size? Is there any marker that identifies it?

3. What is the function of AAC/AVC sequence header and what content does it contain?

4. Under what circumstances will there be more than one script tag and AAC/AVC sequence header?

5. Why do we still need to pass AudioSpecificConfig when the audio tag header already defines the audio parameters?

6. What is the use of "relative timestamp" in video tag?

7. Is the data in the tag data a frame of data?

1. Overview of file structure

1. A note before starting: when studying the format, read it alongside the official specification and inspect real files with an analysis tool; this is the way to understand it deeply.

(1) Official document: http://download.macromedia.com/f4v/video_file_format_spec_v10_1.pdf

(2) Recommended analysis tools for FLV:

2. The FLV (Flash Video) container format consists of a file header (FLV header) and a file body (FLV body).

3. The FLV body consists of a sequence of (previous tag size + tag) pairs. The previous tag size records the size of the preceding tag and is used when reading the file backwards; it is always 4 bytes. Typically an FLV file consists of a header, one script tag, and a number of video and audio tags.
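
A minimal sketch of reading the FLV header and locating the first tag, assuming the whole file is already in memory as bytes. The layout (3-byte signature "FLV", 1-byte version, a flags byte where 0x04 means audio present and 0x01 means video present, then a UI32 DataOffset) follows the official specification; the function and class names are hypothetical.

```python
import struct
from typing import NamedTuple


class FlvHeader(NamedTuple):
    version: int
    has_audio: bool
    has_video: bool
    data_offset: int  # size of the FLV header in bytes, normally 9


def parse_flv_header(buf: bytes) -> FlvHeader:
    """Parse Signature ("FLV"), Version, TypeFlags and DataOffset."""
    if len(buf) < 9 or buf[:3] != b"FLV":
        raise ValueError("not an FLV file")
    flags = buf[4]
    (data_offset,) = struct.unpack(">I", buf[5:9])  # UI32, big-endian
    return FlvHeader(
        version=buf[3],
        has_audio=bool(flags & 0x04),  # TypeFlagsAudio
        has_video=bool(flags & 0x01),  # TypeFlagsVideo
        data_offset=data_offset,
    )


def first_tag_offset(buf: bytes) -> int:
    """The body starts with PreviousTagSize0 (always 0); the first tag follows it."""
    hdr = parse_flv_header(buf)
    (prev0,) = struct.unpack(">I", buf[hdr.data_offset:hdr.data_offset + 4])
    if prev0 != 0:
        raise ValueError("PreviousTagSize0 should be 0")
    return hdr.data_offset + 4
```

For a typical file with both audio and video, the header bytes are 46 4C 56 01 05 00 00 00 09, and the first tag starts at offset 13.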

4. Tags are generally divided into three types: script data (script), audio data (audio) and video data (video). Each tag consists of a header and data; for example, for a script tag the tag header is the script header and the tag data is the script data. A tag holds exactly one of the three types, never several at once. All tags share the same header fields, but the content of the tag data differs by type, and the type is determined from the tag header. (For the contents of each tag type, see: Detailed explanation of FLV format, dogdanerl's blog, CSDN.)

5. Data relationship diagram

Explanation:

A tag has one of three types: audio, video or script. A tag can only be of one type, which is distinguished by the tag header. Different tag types carry different tag data, which is why the figure marks the tag data with "or".

The Video Tag Body and Audio Tag Body each come in two kinds, distinguished by the Video Tag Header and the Audio Tag Header respectively, hence the "or" after them. They are grayed out because the video tag body is split into AVC sequence header and AVC NALU only when the video codec is H.264; other video codecs have neither. Likewise, the audio tag body is split into AAC sequence header and AAC raw only when the audio codec is AAC; other audio codecs have neither.

6. File structure diagram: the actual order in which the data is laid out in an FLV file

2. Content description

2.1 Field description

2.2 Tag

1. There are three types of tags: audio, video and script.

2. Each tag is also composed of tag header and tag data. The tag header stores information such as the type of the current tag and the length of the data area, and the content stored in the tag data varies according to the type of the tag.

2.2.1 Tag Header

1. The tag header contains the basic information of the tag

2. TagType gives the type of the tag: audio (8), video (9), script (18)

3. DataSize gives the size of the tag data only, not of the entire tag (the whole tag is 11 + DataSize bytes)
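
A minimal sketch of decoding the 11-byte tag header, assuming it starts at offset pos in an in-memory buffer. Field sizes follow the specification: one byte holding TagType (in spec v10.1 its upper 3 bits are Reserved and Filter, so only the low 5 bits are used here), DataSize (UI24), Timestamp (UI24), TimestampExtended (UI8, the upper 8 bits of a 32-bit millisecond timestamp) and StreamID (UI24, always 0). The names parse_tag_header and TagHeader are hypothetical.

```python
from typing import NamedTuple

TAG_AUDIO, TAG_VIDEO, TAG_SCRIPT = 8, 9, 18


class TagHeader(NamedTuple):
    tag_type: int   # 8 = audio, 9 = video, 18 = script
    data_size: int  # size of the tag data only, excluding these 11 bytes
    timestamp: int  # milliseconds, including TimestampExtended
    stream_id: int  # always 0


def parse_tag_header(buf: bytes, pos: int = 0) -> TagHeader:
    h = buf[pos:pos + 11]
    if len(h) < 11:
        raise ValueError("truncated tag header")
    tag_type = h[0] & 0x1F                  # low 5 bits of the first byte
    data_size = int.from_bytes(h[1:4], "big")
    ts_low = int.from_bytes(h[4:7], "big")  # lower 24 bits of the timestamp
    ts_ext = h[7]                           # upper 8 bits of the timestamp
    stream_id = int.from_bytes(h[8:11], "big")
    return TagHeader(tag_type, data_size, (ts_ext << 24) | ts_low, stream_id)
```

The total size of the tag is then 11 + data_size, which is also the value its following PreviousTagSize should hold.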

2.2.2 Tag Data

1. Tag data contains either metadata or the actual audio and video data

2.2.2.1 Script Tag Data

1. When TagType == 0x12 in the tag header, the tag is a script tag

2. The first tag is usually a script tag

3. It stores the parameters (metadata) of the media file as key-value pairs of different types ("FFmpeg audio and video development basics and actual combat", page 78)

4. The script tag data usually consists of two AMF packets. AMF (Action Message Format) is a general-purpose data encapsulation format designed by Adobe and used in many Adobe products; simply put, AMF describes data of different types in a uniform format. (FLV format analysis - Jianshu) The first byte of an AMF packet indicates the packet type, as follows:

The first AMF packet: encapsulates string data.
        The first byte is the AMF packet type, usually 0x02, meaning a string. Bytes 2-3 are a UI16 value giving the length of the string, usually 0x000A (the length of the string "onMetaData"). The remaining bytes are the string itself, usually "onMetaData" (6F 6E 4D 65 74 61 44 61 74 61).

The second AMF packet: encapsulates an array, which contains the names and values of the audio and video information items.

        The first byte is the AMF packet type, usually 0x08, meaning an (ECMA) array. Bytes 2-5 are a UI32 value giving the number of array elements. The encoded array elements follow. Common array elements are shown below.

        Each array element is a name/value pair. The first 2 bytes give the length of the element name, say L, followed by the L-byte name string. Byte L+3 gives the type of the element value, followed by the value itself, whose size in bytes depends on its type. The metadata can be inspected conveniently with MediaInfo.

In the figure above, each entry is described by four fields, two for the key and two for the value. Taking duration as an example:

        StringLength: the length of the key, i.e. the length of the string "duration"

        StringData: the content of the key, i.e. "duration"

        Type: the type of the value, i.e. double

        Value: the value itself, i.e. 5.120 (seconds)

When FFmpeg reads the metadata it reports a duration of 5.120 s; inside the FLV script tag, that single entry is described by these four fields.
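
A deliberately small AMF0 reader for the script tag data, handling only Number (0x00, a big-endian double), Boolean (0x01), String (0x02), Object (0x03) and ECMA array (0x08), which covers the usual onMetaData fields. The type markers and the 00 00 09 object-end marker come from the AMF0 format; the function names are hypothetical, and other AMF types (dates, strict arrays, and so on) raise an error instead of being parsed.

```python
import struct


def _read_string(buf: bytes, pos: int):
    """UI16 length followed by that many bytes of UTF-8."""
    (n,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    return buf[pos:pos + n].decode("utf-8"), pos + n


def read_amf0_value(buf: bytes, pos: int):
    """Read one AMF0 value at pos; return (value, new_pos)."""
    marker = buf[pos]
    pos += 1
    if marker == 0x00:                      # Number: IEEE-754 double
        (v,) = struct.unpack_from(">d", buf, pos)
        return v, pos + 8
    if marker == 0x01:                      # Boolean
        return buf[pos] != 0, pos + 1
    if marker == 0x02:                      # String
        return _read_string(buf, pos)
    if marker in (0x03, 0x08):              # Object / ECMA array of key-value pairs
        if marker == 0x08:
            pos += 4                        # skip the UI32 element count; stop at the end marker
        obj = {}
        while True:
            key, pos = _read_string(buf, pos)
            if key == "" and buf[pos] == 0x09:  # object-end marker 00 00 09
                return obj, pos + 1
            obj[key], pos = read_amf0_value(buf, pos)
    raise ValueError(f"unsupported AMF0 type 0x{marker:02x}")


def parse_script_tag_data(data: bytes):
    """First AMF packet: the name (usually 'onMetaData'); second: the metadata array."""
    name, pos = read_amf0_value(data, 0)
    meta, _ = read_amf0_value(data, pos)
    return name, meta
```

For the duration example above, the entry decodes to {"duration": 5.12}; the four fields map to _read_string (StringLength, StringData) and read_amf0_value (Type, Value).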

2.2.2.3 Audio Tag Data

1. Audio tag data is divided into an audio tag header and an audio tag body (the data area). If the codec is not AAC, the audio tag header is 1 byte; if it is AAC, the audio tag header is 2 bytes, the extra byte being AACPacketType, which indicates the type of the audio tag body;

2. The audio tag structure is as follows (field, type, comment):

Audio Tag Header

SoundFormat (audio format), UB4:
0 = Linear PCM, platform endian
1 = ADPCM
2 = MP3
3 = Linear PCM, little endian
4 = Nellymoser 16-kHz mono
5 = Nellymoser 8-kHz mono
6 = Nellymoser
7 = G.711 A-law logarithmic PCM
8 = G.711 mu-law logarithmic PCM
9 = reserved
10 = AAC
11 = Speex
14 = MP3 8-kHz
15 = Device-specific sound
In practice FLV does not support G.711 A-law; to carry such audio you may need to convert it to linear PCM.

SoundRate (sampling rate), UB2:
0 = 5.5 kHz
1 = 11 kHz
2 = 22 kHz
3 = 44 kHz
Always 3 for AAC.

SoundSize (sample size), UB1:
0 = snd8Bit (8-bit samples)
1 = snd16Bit (16-bit samples)

SoundType (channels), UB1:
0 = sndMono, mono
1 = sndStereo, stereo (two channels)
Always 1 for AAC.

AACPacketType (AAC packet type), UI8:
Present only when SoundFormat is AAC.
0 = AAC sequence header
1 = AAC raw

Audio Tag Body

AudioData (audio data), UI8[n]:
For linear PCM data, each 16-bit sample is stored signed and little-endian.
If the audio format is AAC, the data is AAC audio data; otherwise it is a linear array.

 

3. Both kinds of AAC tag body, AAC sequence header and AAC raw, use the same storage structure, as shown in the table above.
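
A sketch of unpacking the audio tag header according to the table: the first byte packs SoundFormat (4 bits), SoundRate (2), SoundSize (1) and SoundType (1), and only AAC (SoundFormat 10) adds the AACPacketType byte. The names are hypothetical, and SoundRate is returned as the raw 2-bit index rather than converted to Hz.

```python
from typing import NamedTuple, Optional

SOUND_FORMAT_AAC = 10


class AudioTagHeader(NamedTuple):
    sound_format: int               # 10 = AAC, 2 = MP3, ...
    sound_rate: int                 # 2-bit index: 0 = 5.5 kHz ... 3 = 44 kHz
    sound_size: int                 # 0 = 8-bit, 1 = 16-bit
    sound_type: int                 # 0 = mono, 1 = stereo
    aac_packet_type: Optional[int]  # AAC only: 0 = sequence header, 1 = raw
    header_len: int                 # 1 byte, or 2 for AAC


def parse_audio_tag_header(data: bytes) -> AudioTagHeader:
    b = data[0]
    sound_format = b >> 4
    sound_rate = (b >> 2) & 0x03
    sound_size = (b >> 1) & 0x01
    sound_type = b & 0x01
    if sound_format == SOUND_FORMAT_AAC:
        return AudioTagHeader(sound_format, sound_rate, sound_size, sound_type, data[1], 2)
    return AudioTagHeader(sound_format, sound_rate, sound_size, sound_type, None, 1)
```

An AAC tag usually starts with 0xAF: format 10, rate index 3, 16-bit, stereo, exactly the fixed values the table requires for AAC.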

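4. The AAC sequence header (AACPacketType = 0) stores the AudioSpecificConfig, which holds the detailed information needed to decode the AAC audio, such as the sample rate and channel count, and is used to initialize the decoder. For the full definition of AudioSpecificConfig, see section 1.6.2.1 of ISO/IEC 14496-3 (Audio). FFmpeg also has a function that parses AudioSpecificConfig, ff_mpeg4audio_get_config(); reading it side by side with the specification deepens the understanding. (Detailed explanation of the FLV video container format, 51CTO blog)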
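
A sketch of reading the first fields of AudioSpecificConfig, per ISO/IEC 14496-3: audioObjectType (5 bits), samplingFrequencyIndex (4 bits) and channelConfiguration (4 bits). This is not a port of ff_mpeg4audio_get_config(); it is a hypothetical helper that rejects the escape and reserved values (object type 31, frequency index 13 to 15) and ignores everything after the first 13 bits.

```python
AAC_SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                    22050, 16000, 12000, 11025, 8000, 7350]


def parse_audio_specific_config(asc: bytes):
    """Return (audio_object_type, sample_rate_hz, channel_configuration)."""
    if len(asc) < 2:
        raise ValueError("AudioSpecificConfig too short")
    bits = (asc[0] << 8) | asc[1]     # first 16 bits, big-endian
    object_type = bits >> 11          # 5 bits, e.g. 2 = AAC LC
    freq_index = (bits >> 7) & 0x0F   # 4 bits, index into AAC_SAMPLE_RATES
    channels = (bits >> 3) & 0x0F     # 4 bits
    if object_type == 31 or freq_index >= len(AAC_SAMPLE_RATES):
        raise NotImplementedError("escape/reserved values are not handled in this sketch")
    return object_type, AAC_SAMPLE_RATES[freq_index], channels
```

For example, the two bytes 12 10 decode to AAC LC, 44100 Hz, 2 channels; this is the information a player should trust instead of SoundRate and SoundType.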

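5. Normally an AAC sequence header tag appears only once in an FLV file, as the first audio tag. If the sample rate, channel count or similar information changes, a new AAC sequence header must be sent.

Why do we still need to pass AudioSpecificConfig when the audio tag header already defines the audio parameters?

Because when SoundFormat is AAC, SoundType in the audio tag header must be set to 1 (stereo) and SoundRate must be set to 3 (44 kHz), but this does not mean the AAC audio in the FLV file has to be 44 kHz stereo. When playing AAC audio, the player should ignore these parameters in the audio tag header and configure the decoder from the AudioSpecificConfig. (FFmpeg code guide, basics - Tencent Cloud Developer Community)

6. AAC raw stores the actual audio data.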

2.2.2.4 Video Tag Data

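1. Video tag data is divided into a video tag header and a video tag body (the data area). If the codec is not H.264, the video tag header is 1 byte; if it is H.264, the video tag header is 5 bytes, the extra 4 bytes being AVCPacketType (1 byte) and CompositionTime (3 bytes):

  • AVCPacketType indicates the content of VIDEODATA
  • CompositionTime is the relative timestamp: if AVCPacketType == 0x01 it holds the composition time offset; otherwise it is 0.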

2. The video tag structure is as follows (field, type, comment):

Video Tag Header

FrameType (frame type), UB4:
1 = keyframe (for AVC, a seekable frame): an H.264 IDR frame, a keyframe from which decoding can start
2 = inter frame (for AVC, a non-seekable frame): an ordinary H.264 frame
3 = disposable inter frame (H.263 only)
4 = generated keyframe (reserved for server use only)
5 = video info/command frame

CodecID (codec ID), UB4, which codec is used:
1 = JPEG (currently unused)
2 = Sorenson H.263
3 = Screen video
4 = On2 VP6
5 = On2 VP6 with alpha channel
6 = Screen video version 2
7 = AVC

AVCPacketType (AVC packet type), UI8:
Present only for H.264.
0 = AVC sequence header
1 = AVC NALU
2 = AVC end of sequence (lower level NALU sequence ender is not required or supported)

CompositionTime (relative timestamp), SI24:
Present only for H.264.
If AVCPacketType == 0x01, the composition time offset; otherwise 0.

Video Tag Body

VideoData (video data), UI8[n]:
AVCPacketType = 0: the data is an AVCDecoderConfigurationRecord
AVCPacketType = 1: the data is one or more NALUs
AVCPacketType = 2: the data is empty
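
A sketch of unpacking the video tag header: FrameType and CodecID share the first byte, and only AVC (CodecID 7) adds AVCPacketType and the 3-byte CompositionTime, read here as a signed 24-bit value (SI24). Function and field names are hypothetical.

```python
from typing import NamedTuple, Optional

CODEC_AVC = 7


class VideoTagHeader(NamedTuple):
    frame_type: int                 # 1 = keyframe, 2 = inter frame, ...
    codec_id: int                   # 7 = AVC
    avc_packet_type: Optional[int]  # AVC only: 0 = seq header, 1 = NALU, 2 = end of seq
    composition_time: int           # AVC only: CTS in ms (SI24); 0 otherwise
    header_len: int                 # 1 byte, or 5 for AVC


def parse_video_tag_header(data: bytes) -> VideoTagHeader:
    frame_type = data[0] >> 4
    codec_id = data[0] & 0x0F
    if codec_id != CODEC_AVC:
        return VideoTagHeader(frame_type, codec_id, None, 0, 1)
    avc_packet_type = data[1]
    cts = int.from_bytes(data[2:5], "big", signed=True)  # signed 24-bit
    return VideoTagHeader(frame_type, codec_id, avc_packet_type, cts, 5)
```

H.264 keyframes typically start with 0x17 and inter frames with 0x27; the body proper begins at header_len.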

 

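3. The AVC sequence header stores the AVCDecoderConfigurationRecord, which contains the important information for H.264 decoding, such as the SPS and PPS, and is used to initialize the decoder; for details see ISO/IEC 14496-15 (AVC file format). These AVC coding parameters must be passed to the decoder before the stream can be decoded correctly.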
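
A sketch of pulling the SPS and PPS out of an AVCDecoderConfigurationRecord, following the ISO/IEC 14496-15 layout: configurationVersion, AVCProfileIndication, profile_compatibility, AVCLevelIndication, lengthSizeMinusOne in the low 2 bits of byte 4, the SPS count in the low 5 bits of byte 5, then length-prefixed SPS entries, a PPS count byte and length-prefixed PPS entries. The helper names are hypothetical, and any extra fields that High-profile records may carry after the PPS are ignored.

```python
from typing import List, Tuple


def _read_param_sets(rec: bytes, count: int, pos: int) -> Tuple[List[bytes], int]:
    """Read `count` entries of (UI16 length, parameter-set bytes) starting at pos."""
    sets = []
    for _ in range(count):
        n = int.from_bytes(rec[pos:pos + 2], "big")
        sets.append(rec[pos + 2:pos + 2 + n])
        pos += 2 + n
    return sets, pos


def parse_avc_decoder_config(rec: bytes):
    """Return (profile, level, nalu_length_size, sps_list, pps_list)."""
    if len(rec) < 7 or rec[0] != 1:          # configurationVersion must be 1
        raise ValueError("not an AVCDecoderConfigurationRecord")
    profile, level = rec[1], rec[3]          # rec[2] is profile_compatibility
    nalu_length_size = (rec[4] & 0x03) + 1   # lengthSizeMinusOne + 1
    sps, pos = _read_param_sets(rec, rec[5] & 0x1F, 6)  # numOfSequenceParameterSets
    pps, pos = _read_param_sets(rec, rec[pos], pos + 1)  # numOfPictureParameterSets
    return profile, level, nalu_length_size, sps, pps
```

nalu_length_size is what a reader needs afterwards to split AVC NALU tag bodies; the SPS and PPS are what get handed to the decoder.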

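4. Normally an AVC sequence header tag appears only once in an FLV file, as the first video tag. If the bitrate, resolution or similar information changes, a new AVC sequence header must be sent.

5. AVC NALU tags store the actual video data.

Both types, AVC sequence header and AVC NALU, use the same storage structure, as shown in the table above.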
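
When AVCPacketType is 1, the body holds one or more NALUs. In FLV they are stored length-prefixed: each NALU is preceded by a big-endian length whose size is lengthSizeMinusOne + 1 from the AVCDecoderConfigurationRecord (usually 4), rather than being separated by 0x00000001 start codes. A hypothetical splitter, assuming the length size is already known:

```python
from typing import List


def split_avcc_nalus(data: bytes, length_size: int = 4) -> List[bytes]:
    """Split an AVC NALU tag body into NAL units (each prefixed by its length)."""
    nalus, pos = [], 0
    while pos + length_size <= len(data):
        n = int.from_bytes(data[pos:pos + length_size], "big")
        pos += length_size
        if pos + n > len(data):
            raise ValueError("truncated NAL unit")
        nalus.append(data[pos:pos + n])
        pos += n
    return nalus


def nal_unit_type(nalu: bytes) -> int:
    """Low 5 bits of the first NALU byte: 5 = IDR slice, 6 = SEI, 7 = SPS, 8 = PPS."""
    return nalu[0] & 0x1F
```

To feed a decoder that expects Annex B input, each NALU would be prefixed with 00 00 00 01 instead.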

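6. CompositionTime (relative timestamp)

The relative timestamp should be understood together with PTS and DTS (FFmpeg code guide, basics - Tencent Cloud Developer Community):

  • DTS: Decode Time Stamp, tells the decoder when to decode the video frame;
  • PTS: Presentation Time Stamp, tells the player when to display the video frame;
  • CTS: Composition Time Stamp, the relative timestamp, i.e. the difference between PTS and DTS.

If the frames of a video are encoded strictly in input order, decode time and display time coincide. But if the encoded video contains B-frames, the input order differs from the coding order, which is why two timestamps, PTS and DTS, are needed. A frame is always decoded before it is displayed, so its PTS is always greater than or equal to its DTS, and therefore CTS = PTS - DTS.

The Timestamp in an FLV video tag is not the PTS but the DTS; the frame's PTS must be computed as DTS + CTS.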
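
A worked example of PTS = DTS + CTS for a hypothetical 25 fps stream (40 ms per frame) whose display order is I B B P. The tags are listed in decode order, as they would appear in the FLV file; the timestamp and CTS values are made up for illustration.

```python
# (frame, tag Timestamp = DTS in ms, CompositionTime = CTS in ms), in file/decode order.
tags = [("I", 0, 40), ("P", 40, 120), ("B1", 80, 0), ("B2", 120, 0)]


def pts(dts_ms: int, cts_ms: int) -> int:
    """PTS = DTS + CTS."""
    return dts_ms + cts_ms


with_pts = [(name, dts, pts(dts, cts)) for name, dts, cts in tags]

# Sorting by PTS recovers the display order I B B P.
display_order = [name for name, _, p in sorted(with_pts, key=lambda t: t[2])]
```

The P frame must be decoded before the two B frames that reference it, so it is stored second but displayed last; its CTS of 120 ms is what pushes its PTS (160 ms) past theirs (80 ms and 120 ms).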

Why does the audio tag not need CompositionTime?

Because audio is encoded in the same order as it is input, i.e. PTS = DTS, so there is no notion of CompositionTime for audio.

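3. Analysis process

1. Parse TagType in the tag header to get the type of the tag, and DataSize to get the size of the tag data

2. Read the Audio/Video Tag Header in the tag data to get the audio/video codec information

3. Read the actual audio and video data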
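
The three steps can be sketched as a forward walk over an in-memory FLV file: skip the header and PreviousTagSize0, read each 11-byte tag header for TagType, DataSize and timestamp (step 1), peek at the first byte of the audio/video tag header for the codec (step 2), and hand the tag data on (step 3). The function names are hypothetical and error handling is minimal.

```python
import struct


def iter_flv_tags(buf: bytes):
    """Yield (tag_type, timestamp_ms, tag_data) for each tag in an in-memory FLV file."""
    if buf[:3] != b"FLV":
        raise ValueError("not an FLV file")
    (data_offset,) = struct.unpack(">I", buf[5:9])
    pos = data_offset + 4                          # skip PreviousTagSize0
    while pos + 11 <= len(buf):
        tag_type = buf[pos] & 0x1F                 # step 1: TagType
        data_size = int.from_bytes(buf[pos + 1:pos + 4], "big")  # step 1: DataSize
        timestamp = (buf[pos + 7] << 24) | int.from_bytes(buf[pos + 4:pos + 7], "big")
        data = buf[pos + 11:pos + 11 + data_size]
        if len(data) < data_size:
            break                                  # truncated final tag
        yield tag_type, timestamp, data            # step 3: the tag data
        pos += 11 + data_size + 4                  # tag header + tag data + PreviousTagSize


def codec_info(tag_type: int, data: bytes):
    """Step 2: read the codec from the first byte of the audio/video tag header."""
    if tag_type == 8:
        return "audio", (data[0] >> 4) if data else None    # SoundFormat
    if tag_type == 9:
        return "video", (data[0] & 0x0F) if data else None  # CodecID
    return ("script" if tag_type == 18 else "unknown"), None
```

Typical use: for tag_type, ts, data in iter_flv_tags(buf), call codec_info(tag_type, data) and dispatch the data to the matching decoder.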

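4. Summary of Knowledge Points

1. The FLV header is usually 9 bytes.

2. The previous tag size is always 4 bytes, and the first previous tag size after the FLV header has the value 0.

3. The tag header is always 11 bytes, while the size of the tag data varies: the three tag types (audio, video, script) carry data of different sizes, and even different video tags are generally not the same size.

4. TagType in the tag header tells which type the tag is: audio (TagType == 8), video (TagType == 9), script (TagType == 18).

5. The script tag is usually the first tag of the FLV file, there is only one, and it follows the FLV header.

When there is more than one script tag: if there are subtitles, there is more than one script tag, because subtitle tags are also script tags. Like audio and video tags, the number of subtitle tags is not fixed and they run through the whole FLV file; the AMF packet name for subtitles is "onTextData". See the FFmpeg code for how they are read and written.

6. The AAC sequence header is one type of AAC tag body. Normally an AAC sequence header tag appears only once in an FLV file, as the first audio tag, and it stores the detailed information needed to decode the AAC audio.

When there is more than one AAC sequence header: if the sample rate, channel count or similar information changes, FFmpeg writes a new AAC sequence header, because the decoder must be re-initialized; when reading the file, an AAC sequence header encountered mid-stream is used to reconfigure the decoder.

7. The AVC sequence header is one type of AVC tag body. Normally an AVC sequence header tag appears only once in an FLV file, as the first video tag, and it stores the AVC coding parameters, which must be passed to the decoder before the stream can be decoded correctly.

When there is more than one AVC sequence header: if the resolution, bitrate or similar information changes, FFmpeg writes a new AVC sequence header, because the decoder must be re-initialized; when reading the file, an AVC sequence header encountered mid-stream is used to reconfigure the decoder.

8. The Timestamp in an FLV video tag is not the PTS but the DTS; the frame's PTS must be computed as DTS + CTS.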

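5. FLV usage scenarios and advantages and disadvantages

1. Usage scenarios: essentially all mainstream video sites currently support FLV, and it is widely used for both live streaming and video on demand.

2. Advantages:

(1) Muxed audio/video files are small and the container is simple, which makes it well suited to use on the Internet.

(2) It effectively solved the problem that importing video into Flash produced huge SWF files that could not be used well over the network.

3. Disadvantages:

(1) The FLV reference specification does not define how to store H.265 video. If we store H.265 video data in an FLV container ourselves, other players may not be able to play it properly. So before writing video and audio streams into a container, we need to make sure the container supports those streams. (03 | How to mux and transcode audio and video? - Geek Time)

(2) Not suitable for multiple audio tracks: FLV can carry only one audio track and one video track.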

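6. QA

1. Is the previous tag size necessary? The tag header is always 11 bytes and contains a DataSize field describing the size of TagData, so the size of a tag = 11 (TagHeader size) + TagHeader->DataSize, which seems to make the previous tag size unnecessary. Moreover, the previous tag size describes the size of the preceding tag, and it is only read after that whole tag has been read, too late to tell us the tag's size.

An answer from another user (to the question "Why flv file body use PreviousTagSize rather than NextTagSize?"):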

I think flv tag list in flv body is designed as double linked list. The PreviousTagSize represents the back node point. Actually, the 'NextTagSize' is already included in current node (DataSize syntax in tag header), and it can point to the next node.

With double direction linked points, it will be easy and fast to seek previous or latter media packets.

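2. How can you tell that a piece of data is the previous tag size? Is there any marker that identifies it?

There is no marker: no field identifies this data as the previous tag size. It can only be located by position, as the 4 bytes immediately following the end of each tag.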
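
Since it is found purely by position, for a tag starting at offset p the PreviousTagSize sits at p + 11 + DataSize. Reading it from the end of the file gives the "back pointer" described in the quoted answer. A hypothetical backward walk, assuming the whole file is in memory and ends with the last tag's PreviousTagSize:

```python
import struct


def iter_tags_backward(buf: bytes):
    """Yield (tag_offset, tag_type, data_size), starting from the last tag."""
    pos = len(buf)                       # just past the final PreviousTagSize
    while pos >= 4:
        (prev_size,) = struct.unpack(">I", buf[pos - 4:pos])
        if prev_size == 0:               # PreviousTagSize0: reached the FLV header
            return
        tag_start = pos - 4 - prev_size
        tag_type = buf[tag_start] & 0x1F
        data_size = int.from_bytes(buf[tag_start + 1:tag_start + 4], "big")
        if prev_size != 11 + data_size:  # cross-check against the tag header
            raise ValueError("PreviousTagSize does not match the tag header")
        yield tag_start, tag_type, data_size
        pos = tag_start
```

The forward link is DataSize in the tag header and the backward link is PreviousTagSize, so a reader can step through tags in either direction without an index.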

3. What is the function of the AAC/AVC sequence header and what does it contain?

AAC sequence header: see "2.2.2.3 Audio Tag Data -> 4" for details

AVC sequence header: see "2.2.2.4 Video Tag Data -> 3" for details

4. Under what circumstances will there be more than one script tag or AAC/AVC sequence header?

See "4. Summary of Knowledge Points -> 5, 6, 7" for details

5. Why do we still need to pass AudioSpecificConfig when the audio tag header already defines the audio parameters?

See "2.2.2.3 Audio Tag Data -> 5" for details

6. What is the "relative timestamp" in the video tag used for?

See "2.2.2.4 Video Tag Data -> 6" for details

7. Is the data in the tag data a frame of data?

In most cases, one tag in a normal file holds one frame, but a tag may also contain several frames, less than one frame, or simply a continuous run of arbitrary data. That last case only works for audio/video formats whose frame headers can be searched for, such as MPV (not used in FLV), H.264 (0x00000001 start code) and MP3; an extra step is then needed before decoding to search for the frames (common MPV and MP3 decoders do this internally, so you only need to keep feeding them data). (About the tag in flv)

FFmpeg's tag parsing points the same way: the data of one tag becomes one packet (AVPacket), and a packet normally holds one frame of data, so in practice one tag carries one frame.


Origin blog.csdn.net/weixin_39399492/article/details/129986667