A brief introduction to the WAVE audio file format and its 64-bit extension

Main text

There are plenty of introductions to the WAVE file format on the Internet, but almost none cover its 64-bit extended formats.

This article therefore briefly introduces the standard WAVE format together with its two main extensions.

All code in this article is written in C. C is not the most convenient language, but it is very widely understood. I am more comfortable with Pascal, and it might describe the code here more conveniently, but few people use Pascal today.

If you want the complete and detailed standard documents, there is a network disk link at the end of the article. (Reading the documentation is strongly recommended if you want to know more.)

My own knowledge is limited, so mistakes are inevitable. If you find any, please point them out.

Besides introducing this material, the article also aims to guide readers in building their own wheel (at least for something as basic as WAVE).

Before we start

Let's first define the data types used in the rest of the article.

#include <stdint.h>

// Fixed-width types are used because plain long is 64 bits on many 64-bit systems
typedef int8_t Int8;
typedef int16_t Int16;
typedef int32_t Int32;
typedef int64_t Int64;
typedef uint8_t UInt8;
typedef uint16_t UInt16;
typedef uint32_t UInt32;
typedef uint64_t UInt64;
typedef UInt8 Byte;
typedef UInt16 Word;
typedef UInt32 DWord;
typedef UInt64 QWord;

typedef struct
{
    DWord D1;
    Word D2;
    Word D3;
    Byte D4[8];
} Guid;

typedef union
{
    DWord dw;
    char chr[4];
} FourCC;

The integer types above are well known and easy to use. The names Byte through QWord follow the naming commonly used for memory access sizes in assembly language.

A GUID (or UUID) is a 16-byte structure generated by a specific algorithm so that its binary value is unique (on Windows, ole32.dll provides CoCreateGuid for this). If you look at the Windows registry, you will find a large number of keys in the string form described below. Anyone who knows COM programming will also be very familiar with GUIDs. In short, they are an important part of Windows.

Here is a brief look at how the string form corresponds to the binary form.

We assign the following values to D1~D4:

D1=0x12345678;
D2=0x9ABC;
D3=0xDEF0;
D4={'A','B','C','D','E','F','G','H'};

Its string form then looks like this, with each byte written in hexadecimal:

{12345678-9ABC-DEF0-4142-434445464748}

Note that this is the little-endian case. The x86 processors we commonly use (mainly from Intel and AMD) store data in little-endian order.

In the big-endian case, the corresponding values would be:

D1=0x78563412;
D2=0xBC9A;
D3=0xF0DE;
D4={'A','B','C','D','E','F','G','H'};
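To make the mapping between the stored bytes and the string form concrete, here is a minimal sketch. The function names guid_from_file_bytes and guid_to_string are my own and purely illustrative. The sketch assumes the 16 bytes are stored the way the little-endian example above describes, and it decodes them byte by byte so that the result does not depend on the byte order of the machine running the code.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t D1;
    uint16_t D2;
    uint16_t D3;
    uint8_t  D4[8];
} Guid;

/* Decode 16 bytes in file order: D1, D2 and D3 are little-endian,
   D4 is a plain byte array. Independent of the host byte order. */
Guid guid_from_file_bytes(const uint8_t b[16])
{
    Guid g;
    g.D1 = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    g.D2 = (uint16_t)(b[4] | (b[5] << 8));
    g.D3 = (uint16_t)(b[6] | (b[7] << 8));
    memcpy(g.D4, b + 8, 8);
    return g;
}

/* Format as {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.
   The output needs 38 characters plus the terminating NUL. */
void guid_to_string(const Guid *g, char out[39])
{
    snprintf(out, 39,
             "{%08lX-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X}",
             (unsigned long)g->D1, (unsigned)g->D2, (unsigned)g->D3,
             (unsigned)g->D4[0], (unsigned)g->D4[1],
             (unsigned)g->D4[2], (unsigned)g->D4[3],
             (unsigned)g->D4[4], (unsigned)g->D4[5],
             (unsigned)g->D4[6], (unsigned)g->D4[7]);
}
```

Feeding it the bytes 78 56 34 12 BC 9A F0 DE followed by 'A' to 'H' gives back the values and the string shown above.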

We will come back to the conversion between a GUID's string form and binary form later.

Note: the type is named Guid here to distinguish it from the GUID definition in windows.h.


The FourCC (four-character code) type is used to tag blocks. In practice a DWord or a char[4] can be used instead (char must be a single byte).

Its main use is to identify the type of a block with 4 characters, such as the "RIFF" block mentioned later.

The DWord and the char[4] are combined in a union for convenient reading and writing. In C++ this matters little, because operator overloading is available. In C it is more cumbersome, so outside of C the DWord member is not recommended.

Note that reading and writing through the DWord requires attention to byte order. On a little-endian machine the multi-character constant 'FFIR' is equivalent to "RIFF", whereas on a big-endian machine it is 'RIFF' that matches "RIFF". Because of this, the DWord type is not recommended.

In addition, using a multi-character constant such as 'FFIR' as an integer makes the compiler emit a warning. A macro avoids the problem:

// For big-endian, just reverse the order of a, b, c, d
#define MAKE_DWORD(a,b,c,d) \
    ((DWord)((a)&0xff) | ((DWord)((b)&0xff)<<8) | ((DWord)((c)&0xff)<<16) | ((DWord)((d)&0xff)<<24))
#define RIFF_CHUNK_DWID MAKE_DWORD('R','I','F','F')
// ...

In practice this is still not as simple and direct as writing "RIFF", so in C, if you do not mind the extra work, you can use char[4] with the strncmp function (or write a macro for the comparison). In other languages it is more convenient to overload the comparison operator directly.

Note: The macro MAKEFOURCC in windows.h has the same effect as MAKE_DWORD here.
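As a rough sketch of the char[4] approach, the two helpers below compare a block id byte by byte and assemble a DWord value without depending on the host byte order. The names fourcc_equals and read_le32 are hypothetical, and I use memcmp here instead of strncmp because the id is exactly 4 bytes and has no terminating NUL.

```c
#include <stdint.h>
#include <string.h>

/* Compare a 4-byte block id read from a file with a 4-character tag.
   memcmp looks at the bytes in file order, so byte order does not matter. */
int fourcc_equals(const uint8_t id[4], const char *tag)
{
    return memcmp(id, tag, 4) == 0;
}

/* Assemble a 32-bit value from 4 bytes stored in little-endian order. */
uint32_t read_le32(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

With the bytes 'R','I','F','F', read_le32 yields 0x46464952, which is the same value MAKE_DWORD('R','I','F','F') produces.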


Enough preamble. I trust everyone can work out the specific details, so let's get to the main topic.

RIFF/WAVE standard format

For an introduction to the standard format and access to related documents, click this link .

WAVE is one of the RIFF formats, and a RIFF file is structured as a sequence of blocks (chunks). For details, search online or consult the documents.

A brief introduction to block headers:

typedef struct
{
    FourCC id; // block type
    DWord size; // block size (not including the id and size fields)
} RIFFChunkHeader;
  • The size field may be odd, but the space a block actually occupies must be even; blocks are aligned to 2 bytes, so a 0 byte is appended when size is odd
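The header layout and the padding rule are enough to walk through the blocks of a file. Below is a minimal sketch that searches an in-memory file image for a block with a given id. The function name riff_find_chunk and its return convention are my own choices, and it assumes the blocks start right after a 12-byte file header (id, size and a 4-byte type), as in the WAVE files described in this article.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a little-endian 32-bit value. */
uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walk the blocks that follow the 12-byte file header and look for id.
   Returns the offset of the block's payload and stores its size field
   in *size_out, or returns 0 if the block is not found. */
size_t riff_find_chunk(const uint8_t *buf, size_t len,
                       const char *id, uint32_t *size_out)
{
    size_t pos = 12;
    while (pos + 8 <= len) {
        uint32_t size = rd32le(buf + pos + 4);
        if (memcmp(buf + pos, id, 4) == 0) {
            *size_out = size;
            return pos + 8;
        }
        /* Skip header, payload and the pad byte that follows an odd size. */
        uint64_t next = (uint64_t)pos + 8 + size + (size & 1u);
        if (next > len)
            break;
        pos = (size_t)next;
    }
    return 0;
}
```

The (size & 1u) term is the 2-byte alignment rule: a block whose size field is 3 still occupies 4 bytes of payload space.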

Generally, each block can be defined in the following way:

typedef struct
{
    RIFFChunkHeader header;
    // ...
} SomeChunkName;

However, for ease of reading and writing, the definitions below do not use this nested form.


The definition of RIFF file header is as follows:

typedef struct
{
    FourCC id; // must be "RIFF"
    DWord size; // file size in bytes - 8
    FourCC type; // must be "WAVE"
} RIFFHeader;

Once the block header is understood, the size field needs little explanation, and the type field works the same way as the id field.


Immediately after it comes the very important fmt block.

typedef struct
{
    FourCC id; // must be "fmt " (note the trailing space)
    DWord size; // must be 16
    Word FormatTag;
    Word Channels;
    DWord SampleRate;
    DWord BytesRate;
    Word BlockAlign;
    Word BitsPerSample;
} WaveChunkFormat;

Briefly:

  • FormatTag is generally 1 or 3, where 1 means PCM and 3 means IEEE floating point
  • Channels is the number of channels, generally 1 or 2, for mono and stereo respectively
  • SampleRate is the sampling rate, generally 8000, 44100, 48000, etc.
  • BytesRate is the number of bytes played per second, equal to BlockAlign*SampleRate
  • BlockAlign is the number of bytes per audio frame, equal to BitsPerSample*Channels/8
  • BitsPerSample is the number of bits per sample, generally 8, 16, 24, 32 or 64

Since the channels are stored interleaved [1], the size of each audio frame is the size of one sample multiplied by the number of channels.
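The two derived fields can be computed directly from the others. The following sketch fills in the fmt fields for plain PCM. The struct PcmFormat and the function make_pcm_format are illustrative names of mine, and the struct leaves out the id and size fields for brevity.

```c
#include <stdint.h>

typedef struct {
    uint16_t FormatTag;
    uint16_t Channels;
    uint32_t SampleRate;
    uint32_t BytesRate;
    uint16_t BlockAlign;
    uint16_t BitsPerSample;
} PcmFormat;

/* Fill in the fmt fields for integer PCM (FormatTag 1). */
PcmFormat make_pcm_format(uint16_t channels, uint32_t sample_rate,
                          uint16_t bits)
{
    PcmFormat f;
    f.FormatTag = 1;
    f.Channels = channels;
    f.SampleRate = sample_rate;
    f.BitsPerSample = bits;
    /* Bytes per audio frame: one sample for every channel. */
    f.BlockAlign = (uint16_t)(bits * channels / 8);
    /* Bytes per second: one audio frame for every sampling period. */
    f.BytesRate = (uint32_t)f.BlockAlign * sample_rate;
    return f;
}
```

For CD-quality audio (2 channels, 44100 Hz, 16 bits) this gives a BlockAlign of 4 and a BytesRate of 176400.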


There are generally two extended forms of the fmt block.

The first:

typedef struct
{
    FourCC id; // must be "fmt "
    DWord size; // generally 18
    Word FormatTag;
    Word Channels;
    DWord SampleRate;
    DWord BytesRate;
    Word BlockAlign;
    Word BitsPerSample;
    Word ExSize; // generally 0
} WaveChunkNonPCMFormat;
  • The size field is generally 18; it depends on ExSize, the actual value being ExSize+18
  • The ExSize field is generally 0; if it is not 0, the corresponding extra structure must follow

This structure is usually used for non-PCM encodings (often compressed formats), which are uncommon nowadays; the specific encoding is given by the FormatTag field. To be honest, I have seen some files in this form, but in most of them FormatTag is 3, meaning IEEE floating-point samples [2]. Although IEEE floating point is not PCM [3], the two behave similarly [4], so when IEEE samples are needed we can simply use the standard WaveChunkFormat rather than this awkward WaveChunkNonPCMFormat, unless you have an extended format of your own.

Files in this form generally contain a fact block (for IEEE floating point it may be absent):

typedef struct
{
    FourCC id; // "fact"
    DWord size; // 4 (the whole block, including id and size, occupies 12 bytes)
    DWord FactSize; // total number of samples per channel
} WaveChunkFact;

The main purpose of this block is to provide the total number of samples (that is, the total number of audio frames) so that the decompressed size of a compressed format can be estimated. It is not used for uncompressed formats.


The second:

typedef struct
{
    FourCC id; // must be "fmt "
    DWord size; // must be 40
    Word FormatTag; // must be 0xFFFE
    Word Channels;
    DWord SampleRate;
    DWord BytesRate;
    Word BlockAlign;
    Word BitsPerSample;
    Word ExSize; // must be 22
    Word ValidBitsPerSample;
    DWord ChannelMask;
    Guid SubFormat;
} WaveChunkFormatExtensible;

This is the form Microsoft uses most. If you know WASAPI, you will recognize it as the format used internally by the Windows mixer.

  • ValidBitsPerSample is the number of bits actually used per sample. For example, if BitsPerSample is 24, this field can be 17 to 24, meaning that some or all of the 24 bits are used. Common combinations are 12/16 and 20/24, but the field usually equals BitsPerSample because such cases are rare
  • ChannelMask describes the speaker layout for multi-channel audio such as 5.1 and 7.1. See the documents for details
  • SubFormat is a 16-byte Guid. Since the FormatTag field must be 0xFFFE, the actual format has to be given here instead. The first two bytes (that is, one Word) hold the original FormatTag and the remaining bytes are fixed; in practice bytes 3 to 6 are all 0. For example, {00000001-0000-0010-8000-00AA00389B71} represents PCM and {00000003-0000-0010-8000-00AA00389B71} represents IEEE floating point
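The SubFormat rule can be checked mechanically. The sketch below recovers the original FormatTag from a SubFormat that follows the fixed pattern shown above. The function name subformat_to_tag is hypothetical, and returning 0 for a GUID that does not match the pattern is simply a convention chosen for this example.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t D1;
    uint16_t D2;
    uint16_t D3;
    uint8_t  D4[8];
} Guid;

/* Return the original FormatTag if g has the form
   {0000XXXX-0000-0010-8000-00AA00389B71}, otherwise return 0. */
uint16_t subformat_to_tag(const Guid *g)
{
    static const uint8_t tail[8] =
        {0x80, 0x00, 0x00, 0xAA, 0x00, 0x38, 0x9B, 0x71};

    /* Bytes 3 to 6 (upper half of D1 and all of D2) must be 0. */
    if ((g->D1 >> 16) != 0 || g->D2 != 0x0000 || g->D3 != 0x0010)
        return 0;
    if (memcmp(g->D4, tail, 8) != 0)
        return 0;
    return (uint16_t)(g->D1 & 0xFFFF);
}
```

A reader can then treat a WaveChunkFormatExtensible block whose SubFormat maps to 1 or 3 exactly like a plain WaveChunkFormat with that FormatTag.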

Various other blocks may follow, such as author information, playlist, instrument type and sample blocks, but they are rarely encountered unless there is a special need.

To learn more about the various other blocks, you can look them up in the standard documentation.


Then comes the data block, which holds the actual audio data (and is what we were after when reading all the preceding information).

The definition of a data block is simple:

typedef struct
{
    FourCC id; // must be "data"
    DWord size; // actual data size
    // the data follows, stored in units of audio frames
} WaveChunkData;

At this point, the simplest and at the same time most important part is complete. With it, a standard WAVE file can be created or read.
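Putting the three structures together, the smallest possible PCM file header is 44 bytes long: the RIFF header (12), the fmt block (8+16) and the header of the data block (8). Here is a sketch that writes those 44 bytes into a buffer field by field in little-endian order, so that it does not depend on struct packing or host byte order. The function name wav_write_pcm_header is my own, and the sketch assumes data_size is small enough that the RIFF size still fits in 32 bits.

```c
#include <stdint.h>
#include <string.h>

void put16le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

void put32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)((v >> 8) & 0xFF);
    p[2] = (uint8_t)((v >> 16) & 0xFF);
    p[3] = (uint8_t)(v >> 24);
}

/* Fill a 44-byte header for integer PCM audio.
   Assumes 36 + data_size (+1 if odd) does not overflow 32 bits. */
void wav_write_pcm_header(uint8_t out[44], uint16_t channels,
                          uint32_t sample_rate, uint16_t bits,
                          uint32_t data_size)
{
    uint16_t block_align = (uint16_t)(bits * channels / 8);

    memcpy(out, "RIFF", 4);
    /* File size - 8, counting the pad byte after an odd data size. */
    put32le(out + 4, 36 + data_size + (data_size & 1u));
    memcpy(out + 8, "WAVE", 4);

    memcpy(out + 12, "fmt ", 4);
    put32le(out + 16, 16);                 /* size of the fmt payload */
    put16le(out + 20, 1);                  /* FormatTag: PCM */
    put16le(out + 22, channels);
    put32le(out + 24, sample_rate);
    put32le(out + 28, (uint32_t)block_align * sample_rate);
    put16le(out + 32, block_align);
    put16le(out + 34, bits);

    memcpy(out + 36, "data", 4);
    put32le(out + 40, data_size);
}
```

Writing these 44 bytes followed by data_size bytes of interleaved samples (and one 0 byte if data_size is odd) should give a file that ordinary players accept.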

Extended WAVE format

When you see a WAVE file, you will generally think of it as a lossless audio file. In fact, WAVE is a container that can also hold compressed encodings such as ADPCM, ALaw and MuLaw, although in practice lossless PCM is what is mostly used.

As you have seen, the size field in the headers of a standard WAVE file is a DWord, which can represent at most 4 GiB of data. For multi-channel lossless storage this is too small, enough for only a few tens of minutes of data, so the standard WAVE format needs to be extended.
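The "few tens of minutes" figure is easy to check with a little arithmetic. The helper below is a rough estimate only: the name max_seconds_in_riff is made up for this example, and it ignores the few dozen bytes taken by the headers.

```c
#include <assert.h>
#include <stdint.h>

/* Roughly how many seconds of uncompressed audio fit below the
   32-bit size limit, ignoring the space taken by the headers. */
uint32_t max_seconds_in_riff(uint16_t channels, uint32_t sample_rate,
                             uint16_t bits)
{
    uint64_t byte_rate = (uint64_t)channels * sample_rate * (bits / 8u);
    assert(byte_rate > 0);
    return (uint32_t)(UINT32_MAX / byte_rate);
}
```

For 6 channels at 96 kHz and 24 bits the byte rate is 1,728,000 bytes per second, so the limit is reached after 2485 seconds, about 41 minutes. Ordinary 16-bit stereo at 44.1 kHz lasts much longer, 24347 seconds or nearly 7 hours.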

How should it be extended? Microsoft did not give an answer, so the broadcasting industry and the recording industry each developed their own standard.

Hence the two protagonists below.

But before introducing them, a little prerequisite knowledge is needed.


About JUNK (garbage) blocks

JUNK chunks are part of the RIFF standard and apply to all file formats that use RIFF, and are not specific to WAVE.

First, a question: why do we need JUNK blocks?

Audio data in a WAVE file is stored contiguously in the data block, so once writing has started, the offset of the data block in the file is fixed. What if we later want to add other blocks before the data block? The answer is to reserve the space with a garbage block. This block carries no meaningful data and is skipped when reading, but after the file has been written we can go back, rewrite part of it as other blocks, and shrink the JUNK block accordingly. Another use is padding for file alignment. I have seen WAVE files whose data block starts at offset 4088, so that the actual audio data starts at offset 4096 (exactly 4K), which presumably helps sequential reading of the file.

So the definition of a garbage block is very simple:

typedef struct
{
    FourCC id; // "JUNK"
    DWord size;
    // garbage data
} RIFFChunkJunk;
  • A lowercase id, "junk", is also used, but rarely
  • The block is usually filled with 0, though it can also hold garbage (whatever happens to be in memory)
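For the alignment use case, the size of the JUNK block follows from simple arithmetic. The sketch below computes the value of the JUNK size field so that the audio data begins on a multiple of a given alignment. The function name junk_size_for_alignment is hypothetical, and it assumes the JUNK block is placed directly before the data block and that both the offset and the alignment are even.

```c
#include <assert.h>
#include <stdint.h>

/* offset: file position where the JUNK block header will start.
   Returns the JUNK size field that makes the audio data, which begins
   after the 8-byte header of the following data block, start on a
   multiple of align. */
uint32_t junk_size_for_alignment(uint32_t offset, uint32_t align)
{
    assert(align > 0 && align % 2 == 0 && offset % 2 == 0);
    /* Earliest possible data start: empty JUNK block plus data header. */
    uint32_t min_start = offset + 8 + 8;
    uint32_t data_start = (min_start + align - 1) / align * align;
    return data_start - min_start;
}
```

With a 12-byte RIFF header and a 24-byte fmt block in front, the JUNK header starts at offset 36, and a JUNK size of 4044 puts the data block header at 4088 and the audio data at 4096, which matches the files mentioned above.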

Why do the extended formats need JUNK blocks? That is answered later. For now, let's look at what the two formats look like.

RF64/WAVE Extended Format

As the name suggests, RF64 is a 64-bit extension of the RIFF format, but it applies only to WAVE.

RF64 uses the same file header as RIFF, with two differences: the id field changes from "RIFF" to "RF64", and the size field is filled with 0xFFFFFFFF (in fact its value does not matter).

Unlike RIFF, RF64 requires the file header to be followed immediately by a ds64 block, defined as follows:

typedef struct
{
    FourCC id; // must be "ds64"
    DWord size; // generally 28
    UInt64 RIFFSize; // actual RIFF size (file size - 8)
    UInt64 DataSize; // actual size of the data block
    UInt64 FactSize; // actual amount of data (after decompression)
    DWord TableLen; // generally 0
    // followed by an array of RIFFChunkHeader64
} RIFFChunkDS64;
  • size is generally 28; if TableLen is not 0, add 12*TableLen
  • RIFFSize holds the real value of the size field in the RIFF header
  • DataSize holds the real value of the size field in the data block
  • FactSize is meant for compressed formats and is the 64-bit counterpart of the fact block's FactSize field (in the EBU specification this field is the sample count)
  • TableLen is generally 0, because very few blocks need a 64-bit size; if it is not 0, size must be recalculated and an array of RIFFChunkHeader64 must follow

where RIFFChunkHeader64 is defined as follows:

typedef struct
{
    FourCC id;
    UInt64 size;
} RIFFChunkHeader64;

Then come the fmt block and the data block, which are the same as in RIFF (except that the size of the data block is set to 0xFFFFFFFF). The fmt block generally uses WaveChunkFormatExtensible, because the broadcasting industry needs to support surround sound, but using either of the other two forms is also perfectly possible.

As you can see, RF64 changes very little relative to RIFF, so it remains highly compatible with the original format, and many programs can support it without much extra code.
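On the reading side, the extra work really is small. The sketch below parses a ds64 block from memory and picks the right data size. The names Ds64Info, ds64_parse and rf64_data_size are illustrative, the table of extra 64-bit sizes is ignored, and treating 0xFFFFFFFF in the 32-bit field as "use the 64-bit value" is how I read the rule described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t riff_size;
    uint64_t data_size;
    uint64_t fact_size;
    uint32_t table_len;
} Ds64Info;

uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

uint64_t rd64le(const uint8_t *p)
{
    return (uint64_t)rd32le(p) | ((uint64_t)rd32le(p + 4) << 32);
}

/* p points at the "ds64" id. Returns 1 on success, 0 otherwise. */
int ds64_parse(const uint8_t *p, size_t len, Ds64Info *out)
{
    if (len < 36 || memcmp(p, "ds64", 4) != 0 || rd32le(p + 4) < 28)
        return 0;
    out->riff_size = rd64le(p + 8);
    out->data_size = rd64le(p + 16);
    out->fact_size = rd64le(p + 24);
    out->table_len = rd32le(p + 32);
    return 1;
}

/* Effective size of the data block: the 32-bit field unless it holds
   the 0xFFFFFFFF marker, in which case the ds64 value applies. */
uint64_t rf64_data_size(uint32_t size32, const Ds64Info *ds)
{
    return size32 == 0xFFFFFFFFu ? ds->data_size : size32;
}
```

An existing RIFF reader only needs to accept "RF64" as the file id, call something like ds64_parse on the first block, and pass the data block's size field through rf64_data_size.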

Sony Wave64 Extended Format

Sonic Foundry proposed its own scheme for extending the WAVE format. Its software products were later acquired by Sony, so the standard became known as Sony Wave64.

Sony Wave64 changes much more. First of all, it changes the block header:

typedef struct
{
    Guid id;
    Int64 size;
} SonyWave64Header;
  • id becomes a 16-byte Guid; the specific values are defined in the documents
  • size becomes an 8-byte signed integer whose value includes the size of the id and size fields themselves, which is very different from RIFF and RF64

The file header becomes:

typedef struct
{
    Guid id;    // must be {66666972-912E-11CF-A5D6-28DB04C10000}
    Int64 size; // equal to the file size
    Guid type;  // must be {65766177-ACF3-11D3-8CD1-00C04F8EDB8A}
} SonyWave64Wave;

If you look carefully, you will find that their first 4 bytes are exactly the characters "riff" and "wave".

However, it makes no changes to the contents of the individual WAVE blocks; it only replaces each block's header with SonyWave64Header.

Of course, each block still has its own id. The ids differ in the first 4 bytes, while the last 12 bytes are the same as in the "wave" GUID above. The two main blocks are listed below:

  • 'fmt' block: {20746D66-ACF3-11D3-8CD1-00C04F8EDB8A}
  • 'data' block: {61746164-ACF3-11D3-8CD1-00C04F8EDB8A}
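Because these ids share the same last 12 bytes, they can be derived from the familiar four-character codes. The helper below is a sketch of that observation only. The name w64_guid_from_fourcc is my own, and it applies to the "wave", "fmt " and "data" ids listed in this article; the "riff" id of the file header has a different tail and is not covered.

```c
#include <stdint.h>

typedef struct {
    uint32_t D1;
    uint16_t D2;
    uint16_t D3;
    uint8_t  D4[8];
} Guid;

/* Build {XXXXXXXX-ACF3-11D3-8CD1-00C04F8EDB8A}, where the first
   4 bytes in file order are the characters of the given tag. */
Guid w64_guid_from_fourcc(const char *tag)
{
    Guid g = {0, 0xACF3, 0x11D3,
              {0x8C, 0xD1, 0x00, 0xC0, 0x4F, 0x8E, 0xDB, 0x8A}};
    g.D1 = (uint32_t)(uint8_t)tag[0] |
           ((uint32_t)(uint8_t)tag[1] << 8) |
           ((uint32_t)(uint8_t)tag[2] << 16) |
           ((uint32_t)(uint8_t)tag[3] << 24);
    return g;
}
```

Calling it with "fmt " gives a D1 of 0x20746D66 and with "data" a D1 of 0x61746164, the same values as in the list above.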

It also imposes a stricter requirement on the file structure: all blocks must be aligned to 8 bytes instead of the original 2 bytes.
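The two rules that differ from RIFF, the size that includes the header and the 8-byte alignment, can be sketched as two small helpers. The names w64_chunk_size and w64_next_offset are hypothetical. I use unsigned 64-bit integers for the arithmetic even though the field itself is signed, and I assume the size field does not count the padding bytes.

```c
#include <stdint.h>

/* Value of the size field for a block with the given payload size:
   16 bytes of Guid plus 8 bytes of size plus the payload. */
uint64_t w64_chunk_size(uint64_t payload_size)
{
    return payload_size + 24;
}

/* Offset of the next block: the end of this block rounded up
   to a multiple of 8. */
uint64_t w64_next_offset(uint64_t chunk_offset, uint64_t chunk_size)
{
    return (chunk_offset + chunk_size + 7) & ~(uint64_t)7;
}
```

For example, a 16-byte fmt payload gives a size field of 40, and if that block starts at offset 40 (right after the 40-byte file header) the next block starts at offset 80. An 18-byte payload gives a size of 42, and the next block then starts at 88.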

Back to JUNK

As far as extending WAVE is concerned, if RF64 is a small patch, Sony Wave64 is a drastic overhaul.

But the essence stays the same: both extensions are still built on the original block mechanism, so we can reserve space in advance with a JUNK block and thereby upgrade a WAVE file to RF64 or Sony Wave64 dynamically.

Of course, the JUNK block can also be used to align the start of the data to 4K, but that is not the main point here.

Dynamic expansion works like this: calculate the size of the JUNK block in advance and write it out, then write the data, and check how much has been written when finishing. If less than 4 GiB was written, we can finish directly and leave the garbage block alone; otherwise, we use the space occupied by the garbage block to rewrite the header information into a form that conforms to the extended format.

Generally, the basic JUNK size needed to expand WAVE to RF64 is 28 bytes, while expanding to Sony Wave64 needs 52 bytes, plus 16 bytes for each additional block.
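For the RF64 case, the finishing step might look roughly like the sketch below. It is only an illustration under strong assumptions: the function name wav_finalize and its parameters are my own, the header is held in memory, and the file was written with a 28-byte JUNK block directly after the 12-byte RIFF header. A real implementation would seek back in the file and handle many more details.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void put32le(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

void put64le(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* hdr: the beginning of the file, laid out as
   "RIFF" size "WAVE" "JUNK" 28 <28 bytes> ... "data" size.
   data_hdr_off: offset of the data block header inside hdr.
   Returns 0 if the file stays plain RIFF, 1 if it became RF64. */
int wav_finalize(uint8_t *hdr, size_t data_hdr_off, uint64_t data_size,
                 uint64_t sample_count, uint64_t file_size)
{
    assert(file_size >= 8);
    uint64_t riff_size = file_size - 8;

    if (riff_size < 0xFFFFFFFFu) {
        /* Small enough: fill in the sizes and leave the JUNK block alone. */
        put32le(hdr + 4, (uint32_t)riff_size);
        put32le(hdr + data_hdr_off + 4, (uint32_t)data_size);
        return 0;
    }

    /* Too large: turn RIFF into RF64 and the JUNK block into ds64. */
    memcpy(hdr, "RF64", 4);
    put32le(hdr + 4, 0xFFFFFFFFu);
    memcpy(hdr + 12, "ds64", 4);
    put32le(hdr + 16, 28);
    put64le(hdr + 20, riff_size);
    put64le(hdr + 28, data_size);
    put64le(hdr + 36, sample_count);
    put32le(hdr + 44, 0);                       /* TableLen */
    put32le(hdr + data_hdr_off + 4, 0xFFFFFFFFu);
    return 1;
}
```

The 28 reserved bytes are exactly the payload of a ds64 block with an empty table, which is where the 28-byte figure above comes from.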

Of course, an actual implementation is quite complicated and takes a lot of time to write. One program that implements this feature is Reaper [5].

More on GUIDs

Because of the limitations of C, converting between a GUID's string form and its binary form is cumbersome. Other high-level languages do not have this problem, but in C you cannot simply use the string.

For example, Pascal (including Free Pascal and Delphi) can use this form to define a certain GUID constant:

const
  WavSubFmtPCM:TGUID       = '{00000001-0000-0010-8000-00AA00389B71}';
  WavSubFmtIEEEFloat:TGUID = '{00000003-0000-0010-8000-00AA00389B71}';

I believe most mainstream languages today support something similar.

Microsoft C++, for example, has a uuid extension that can be used.

In C, if you need GUID constants, you can define them with the following macros:

#define MAKE_GUID(uid,dw1,w1,w2,b1,b2,b3,b4,b5,b6,b7,b8) \
    const Guid uid = {dw1,w1,w2,{b1,b2,b3,b4,b5,b6,b7,b8}}
#define MAKE_WAVE_SUBFORMAT(uid,d) \
    MAKE_GUID(uid,d,0,0x10,0x80,0,0,0xAA,0x00,0x38,0x9B,0x71)
MAKE_WAVE_SUBFORMAT(WAV_SUB_FMT_IEEE, 3);
MAKE_WAVE_SUBFORMAT(WAV_SUB_FMT_FLAC, 0xF1AC);
MAKE_GUID(SONY_WAVE64_RIFF,0x66666972,0x912e,0x11cf,0xa5,0xd6,0x28,0xdb,0x04,0xc1,0,0);
// ...
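If you do want to start from the string form in C, one possible approach is to parse it at run time with sscanf. The sketch below is just that, a sketch: the function name guid_from_string is made up, it relies on the C99 scan macros from inttypes.h, and it does not verify the closing brace or reject trailing characters.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t D1;
    uint16_t D2;
    uint16_t D3;
    uint8_t  D4[8];
} Guid;

/* Parse "{XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}" into g.
   Returns 1 if all 11 fields were read, 0 otherwise. */
int guid_from_string(const char *s, Guid *g)
{
    int n = sscanf(s,
                   "{%8" SCNx32 "-%4" SCNx16 "-%4" SCNx16
                   "-%2" SCNx8 "%2" SCNx8
                   "-%2" SCNx8 "%2" SCNx8 "%2" SCNx8
                   "%2" SCNx8 "%2" SCNx8 "%2" SCNx8 "}",
                   &g->D1, &g->D2, &g->D3,
                   &g->D4[0], &g->D4[1],
                   &g->D4[2], &g->D4[3], &g->D4[4],
                   &g->D4[5], &g->D4[6], &g->D4[7]);
    return n == 11;
}
```

Compared with the macros above this costs a little work at startup, but it lets the constants be written exactly as they appear in the documents.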

Postscript

In fact, I wanted to write this WAV article right after finishing the AIFF one, but I kept putting it off (the workload is much larger than for AIFF) and even stopped halfway through. In the end I did finish it (it took about a month from start to finish).

In the future, if I have the opportunity, I will write about how to read WAV files and play them with DirectSound or WASAPI, especially WASAPI, since there is still relatively little material on it.

If there is a chance, I will also add some simple sound processing content.

Network disk

https://lanzoui.com/b011tcx9g Password: fz7c

If the link cannot be opened, try changing lanzoui to lanzoux or another variant, or search for the keywords 蓝奏云打不开 ("Lanzou cloud cannot be opened").

It contains several files; download what you need. If you need all of them, download waveformat.zip, which packs everything together.


  1. There is plenty of information online about how the sound samples are actually stored, so I won't go into details here.

  2. The purpose of using IEEE floating-point numbers is to obtain a greater dynamic range and avoid distortion.

  3. PCM stands for pulse code modulation, which quantizes the sampled analog signal with integers; floating-point samples do not.

  4. Both are uncompressed encodings.

  5. This is a very geeky piece of audio software; the official website is https://www.reaper.fm/index.php

Origin blog.csdn.net/PeaZomboss/article/details/126311968