FFmpeg audio decoding - audio visualization

        Recently I have been working on an audio visualization feature. Most examples on the internet implement it in the Java layer, but this project needed it in C. The principle is actually very simple: first decode the audio, then compute decibel values from the samples. It's easier than putting an elephant in a refrigerator. Driven by this audio visualization need, this article uses FFmpeg as the foundation to cover decoding, calculation, and drawing.

1. Decoding process

The decoding process breaks down into roughly three parts, using ffmpeg/doc/examples/decode_audio.c from the FFmpeg source tree as a reference.

1.1. Analyze audio information

        avformat_open_input opens the audio file to be decoded; on success it initializes the AVFormatContext. avformat_find_stream_info then probes the streams, and av_find_best_stream picks the most suitable audio stream for parsing. From the returned stream index we get the AVStream, whose codec parameters carry the decoder id, channel count, sample rate, bit depth, duration, and sample layout. With the decoder id we look up the decoder via avcodec_find_decoder. Some decoders are not built into FFmpeg and must be added at compile time; my earlier article covered third-party AAC and MP3 codec libraries. Once the decoder is found, avcodec_alloc_context3 allocates the decoder context AVCodecContext, and avcodec_parameters_to_context copies the stream's parameters into that context: channel count, sample rate, sample bit depth, and so on. If this step is skipped, an "invalid argument" error may occur and decoding cannot continue. Finally, open the decoder with avcodec_open2; once it opens successfully, we can start reading audio data.

1.2. From raw data packet to frame

        The goal of decoding is PCM data that the underlying player can play. Now that we have the decoder, we read and decode the stream frame by frame. First, av_packet_alloc initializes the packet object AVPacket. A packet holds undecoded data: the raw audio is split into packets one by one and sent to the decoder, which unpacks each into frames. Correspondingly, av_frame_alloc initializes the frame object AVFrame, which we hand to the decoder to be filled with decoded data. av_read_frame reads one packet from the opened file; for AAC/MP3 this is still undecoded, compressed data. avcodec_send_packet then sends the packet to the decoder; a return value of 0 means the packet was accepted, and we can read decoded data back. The data now exists as frames, read with avcodec_receive_frame. Because one packet may contain several frames, this must be done in a loop. When avcodec_receive_frame returns 0, a frame was read successfully and can be processed. AVERROR_EOF means the stream is fully drained and we can exit the loop. AVERROR(EAGAIN) means the decoder's output is currently unavailable and a new packet must be sent to produce more, so we also break out of the frame-reading loop and feed the next packet.

1.3. From frame to PCM byte

        The PCM data lives inside the frame's data buffers, but we cannot grab it blindly: we first need to know how much to read and how to read it. How much depends on the sample rate, the channel count, and the number of samples in the frame. At 44,100 Hz, for example, there are 44100 * channels samples per second, and a frame holds nb_samples * channels * bits_per_sample / 8 bytes of data. How to read it depends on the storage layout, of which there are two:

  • planar: the left and right channel data are stored in separate planes, laid out as

        data[0]:LLLLLLLLLLLLLL

        data[1]:RRRRRRRRRRRRR

  • packed (interleaved): the left and right channel data alternate in one plane, laid out as

        data[0]:LRLRLRLRLRLRLRLR

        What we ultimately want is data arranged as LRLRLRLRLR, so planar data must be interleaved first. At this point we can send it to the player, or run our own audio algorithms on it before playback. When all decoding is done, remember to free the related resources at the end. Here we keep things simple: compute the decibel values and build the audio visualization from them.
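When the decoder hands back a planar frame, the planes have to be merged before playback. A minimal sketch, assuming stereo signed 16-bit planar data; the function name interleave_s16p is my own (with an FFmpeg frame, the planes would be frame->data[0] and frame->data[1], and the count frame->nb_samples):

```c
#include <stdint.h>

/* Interleave planar stereo samples (LLLL... / RRRR...) into packed
 * LRLR... order, which is what most players expect. */
static void interleave_s16p(const int16_t *left, const int16_t *right,
                            int nb_samples, int16_t *out) {
    for (int i = 0; i < nb_samples; i++) {
        out[2 * i]     = left[i];   // left channel sample
        out[2 * i + 1] = right[i];  // right channel sample
    }
}
```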

2. Calculation of decibels

        For visualization we usually do not need the decibel value of every single sample: the density would be far beyond what the eye can follow. Because sound is fairly continuous, we can average over a short window and produce one value per range, which both cuts the workload and gives a reasonable visual result. First, the average: suppose we want 10 decibel values per second; then each value averages sample_rate * channels * bits_per_sample / 8 / 10 bytes of data. Let's call this the dB interval. In 16-bit audio each sample is two bytes; sum the samples in the interval and divide by the sample count to get the average. Next, the dB calculation. Strictly speaking, dB does not specifically mean "decibels of loudness": it describes a ratio, a gain relationship, and only in the audio field is it used this way. If you want to know more about what dB is, read "What is dB". The formula for calculating decibels is
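The interval arithmetic above can be written as a tiny helper (the name db_interval_bytes is my own):

```c
/* Bytes of PCM covered by one dB value, given the stream parameters. */
static int db_interval_bytes(int sample_rate, int channels,
                             int bits_per_sample, int values_per_second) {
    return sample_rate * channels * (bits_per_sample / 8) / values_per_second;
}
```

For 44,100 Hz stereo 16-bit audio at 10 values per second this gives 44100 * 2 * 2 / 10 = 17640 bytes per interval.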

        20*log10(value)

        So decibels describe an exponential rather than a linear relationship: every 20 dB corresponds to a 10x amplitude ratio, so a 70 dB sound has 10 times the amplitude of a 50 dB one. Taking the full 16-bit range of 0-65535, the maximum dB value is about 96.3; the full 32-bit range of 0-4294967295 gives a maximum of about 192.7. (With signed 16-bit samples and abs(), the peak amplitude is 32768, or about 90.3 dB.) Plug the average we just computed into value to get the decibel level of that interval, then store the values and return them all after parsing, or deliver them one by one via callback, depending on your needs. Below is the calculation for 16-bit samples; the 32-bit version is similar, the main differences being the value range and the byte stride of each step. Once we have the decibel values, we can turn them into bars and blocks and draw them on the screen.

// Assumes 16-bit signed PCM; wave_buffer and wave_index are globals
// that collect one dB byte per interval.
void getPcmDB16(const unsigned char *pcmdata, size_t size) {
    int db = 0;
    short value = 0;
    double sum = 0;
    for (size_t i = 0; i + sizeof(value) <= size; i += sizeof(value)) {
        memcpy(&value, pcmdata + i, sizeof(value)); // read one 2-byte sample
        sum += abs(value);                          // sum of absolute amplitudes
    }
    sum = sum / (size / sizeof(value)); // average (2 bytes per amplitude, so size/2 amplitudes)
    if (sum > 0) {
        db = (int)(20.0 * log10(sum));
    }
    wave_buffer[wave_index] = (unsigned char)db; // a dB value fits in one byte
    wave_index++;
}

        Note that besides the packed/planar split, FFmpeg's sample formats distinguish 32-bit integer from 32-bit floating-point audio.

enum AVSampleFormat {
    AV_SAMPLE_FMT_NONE = -1,
    AV_SAMPLE_FMT_U8,          ///< unsigned 8 bits
    AV_SAMPLE_FMT_S16,         ///< signed 16 bits
    AV_SAMPLE_FMT_S32,         ///< signed 32 bits
    AV_SAMPLE_FMT_FLT,         ///< float
    AV_SAMPLE_FMT_DBL,         ///< double

    AV_SAMPLE_FMT_U8P,         ///< unsigned 8 bits, planar
    AV_SAMPLE_FMT_S16P,        ///< signed 16 bits, planar
    AV_SAMPLE_FMT_S32P,        ///< signed 32 bits, planar
    AV_SAMPLE_FMT_FLTP,        ///< float, planar
    AV_SAMPLE_FMT_DBLP,        ///< double, planar
    AV_SAMPLE_FMT_S64,         ///< signed 64 bits
    AV_SAMPLE_FMT_S64P,        ///< signed 64 bits, planar

    AV_SAMPLE_FMT_NB           ///< Number of sample formats. DO NOT USE if linking dynamically
};

        Floating-point samples range from -1.0 to 1.0, so we multiply by 0x7fff to scale them to the same range as 16-bit data and get the same display effect.

// Assumes 32-bit float PCM scaled to [-1.0, 1.0].
void getPcmDBFloat(const unsigned char *pcmdata, size_t size) {
    int db = 0;
    float value = 0;
    double sum = 0;
    for (size_t i = 0; i + sizeof(value) <= size; i += sizeof(value)) {
        memcpy(&value, pcmdata + i, sizeof(value)); // read one 4-byte sample
        sum += fabsf(value * 0x7fff);               // scale to the 16-bit range, sum absolute values
    }
    sum = sum / (size / sizeof(value)); // average amplitude over the interval
    if (sum > 0) {
        db = (int)(20.0 * log10(sum));
    }
    wave_buffer[wave_index] = (unsigned char)db;
    wave_index++;
}

3. The final effect


        Comments, corrections, and discussion are all welcome. I will push the complete code to my Audio project on GitHub in stages. If you need the single C file that does the audio analysis, message me privately and I will send it to you.

        Update: the resource has been uploaded to CSDN as a free (0-point) download; take a look if you are interested. The decoding part is FFmpeg-specific, but the dB calculation method can be used anywhere. Audio decoding and decibel calculation resource download: https://download.csdn.net/download/qq_37841321/85203912?spm=1001.2014.3001.5503


Origin blog.csdn.net/qq_37841321/article/details/124355274