Brief Introduction of Audio Basics & Esp-sr Getting Started Guide

This blog first briefly describes the basics of audio, and then helps readers get started with the esp-sr SDK.

1 Basic Concepts of Audio

1.1 The nature of sound

Sound is the phenomenon of a wave propagating through a medium; a sound wave is the wave itself, a physical quantity. The two are different: sound is the abstract phenomenon of sound-wave propagation, while the sound wave is the measurable physical quantity.

1.2 Three elements of sound

  • Loudness: the perceived size of the sound (commonly called volume), determined by the amplitude of the wave and the distance between the listener and the sound source.
  • Pitch: determined by frequency; the higher the frequency, the higher the pitch (frequency is measured in Hz, Hertz). The human hearing range is 20 Hz to 20,000 Hz; below 20 Hz is infrasound, above 20,000 Hz is ultrasound.
  • Timbre: the characteristics of different objects and materials give each sound its own character. Timbre itself is abstract, but the waveform is its concrete, intuitive expression; waveforms differ from timbre to timbre, and different timbres can be distinguished by their waveforms.


1.3 Several basic concepts of digital audio

1.3.1 Sampling

Sampling digitizes the signal along the time axis only.

  • According to the Nyquist sampling theorem, the sampling rate must be at least twice the highest frequency in the signal. Since the frequency (pitch) range of human hearing is 20 Hz to 20 kHz, the sampling rate must be greater than 40 kHz. A sampling rate of 44.1 kHz is typically used, which ensures that sounds up to 20 kHz can still be digitized; 44.1 kHz means 44,100 samples are taken per second.

Espressif AI voice uses a sampling rate of 16 kHz. Half of 16 kHz corresponds to the upper limit of the frequency band most used by human speech, roughly 8 kHz. Another common sampling rate is 44.1 kHz; half of 44.1 kHz corresponds to the upper audible limit of the human ear, about 20 kHz. Over the same length of time, a higher sampling rate produces more data. For this reason, instant-messaging audio usually uses a sampling rate of 16 kHz or even lower to keep transmission timely, at the cost of some audio quality (for example, a duller sound), while recordings that aim for high sound quality use 44.1 kHz or even 48 kHz to keep playback high-fidelity, at the cost of more storage.

This part therefore mainly involves the following three parameters:

  • Bit rate: the number of bits transmitted per second, measured in bits per second (bps).
  • Sampling: converting a continuous-time signal into a discrete digital signal.
  • Sampling rate: the number of samples taken per second.

1.3.2 Quantization

Quantization digitizes the signal along the amplitude axis. If each sample is represented by a 16-bit signed binary value, the representable range of a sample is [-32768, 32767].

Espressif AI voice uses 16-bit quantization.
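
As a small illustration (not part of esp-sr), the hypothetical snippet below quantizes a normalized floating-point sample in [-1.0, 1.0] into that 16-bit range:

#include <stdint.h>

// Quantize one normalized sample (-1.0 .. 1.0) to a signed 16-bit value.
// Out-of-range input is clipped, the way a real ADC/codec saturates.
static int16_t quantize_16bit(float sample)
{
    if (sample > 1.0f) sample = 1.0f;
    if (sample < -1.0f) sample = -1.0f;
    return (int16_t)(sample * 32767.0f);
}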

1.3.3 Number of channels

The number of channels is the number of audio channels in the sound; the common configurations are mono and dual-channel (stereo).

  • Mono sound is produced by a single channel; it can be played by one speaker, or the same channel can be sent to two speakers. When mono audio is played back through two speakers, we perceive the sound as coming from a point between the two speakers, but we cannot determine a specific position for the sound source.

  • Dual-channel (stereo) means there are two sound channels. The principle is that listeners judge the position of a sound source from the phase difference between what the left ear and the right ear hear. During recording the sound is split into two independent channels, which gives much better sound localization.
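
To make the channel layout concrete, here is a minimal sketch (also not part of esp-sr) that interleaves two mono 16-bit buffers into one dual-channel buffer in the L0 R0 L1 R1 ... order commonly used for PCM data:

#include <stdint.h>
#include <stddef.h>

// Interleave two mono 16-bit buffers (left, right) into one stereo buffer.
// The stereo buffer must hold 2 * num_samples values.
static void interleave_stereo(const int16_t *left, const int16_t *right,
                              int16_t *stereo, size_t num_samples)
{
    for (size_t i = 0; i < num_samples; i++) {
        stereo[2 * i]     = left[i];
        stereo[2 * i + 1] = right[i];
    }
}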

1.3.4 Calculation of audio size

For example, to record 1 s of audio with a sampling rate of 16000 Hz, a 16-bit sample size, and 2 channels, the space occupied is: 16000 × 16 × 2 × 1 s = 512000 bits = 500 Kbit ≈ 62.5 KB.
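
The same arithmetic as a small C helper (the function and parameter names are hypothetical):

#include <stdint.h>
#include <stdio.h>

// Size in bytes of raw PCM audio: sample_rate * (bits_per_sample / 8) * channels * seconds
static uint32_t pcm_size_bytes(uint32_t sample_rate_hz, uint32_t bits_per_sample,
                               uint32_t channels, uint32_t seconds)
{
    return sample_rate_hz * (bits_per_sample / 8) * channels * seconds;
}

int main(void)
{
    uint32_t bytes = pcm_size_bytes(16000, 16, 2, 1);
    printf("%u bytes (%.1f KB, %u Kbit)\n", bytes, bytes / 1024.0, bytes * 8 / 1024);
    // prints: 64000 bytes (62.5 KB, 500 Kbit)
    return 0;
}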

2 Acoustic front-end (Audio Front-End, AFE)

Based on the powerful ESP32 and ESP32-S3 SoCs, the Espressif AFE algorithm framework performs acoustic front-end processing, enabling users to obtain high-quality and stable audio data and to build intelligent voice products with excellent performance at a competitive cost.

2.1 Acoustic Echo Cancellation (AEC)

The acoustic echo cancellation algorithm uses adaptive filtering to remove the echo that the microphone picks up. It is suitable for scenarios in which a voice device plays audio through its own speaker.

The algorithm supports processing of up to two microphones and can effectively remove the device's own playback from the microphone input, so that applications such as speech recognition still work well while the device is playing music.

2.2 Blind Source Separation (BSS)

The Blind Source Separation algorithm uses multiple microphones to detect the direction of incoming audio and emphasizes audio input from a certain direction. This algorithm improves the sound quality of the desired audio source in noisy environments.

2.3 Noise Suppression (NS)

The noise suppression algorithm supports single-channel audio and can effectively suppress non-speech noise (such as a vacuum cleaner or an air conditioner), improving the signal passed on for further processing.

3 Scenarios supported by Espressif AFE

Espressif AFE targets the following two scenarios:

  1. Speech recognition scenario

  2. Voice call scenario

3.1 Speech Recognition Scenario

Model steps:

  1. Audio input

  2. AEC echo cancellation (removes the audio the device itself is playing back, which requires an echo reference channel)

    • Hardware loopback: the data written to the speaker is read back directly over I2S (the reference can share one I2S bus with the microphones)
    • Software loopback: the data written to the speaker is copied in software (not yet supported, under development)
  3. BSS/NS

    • The BSS (Blind Source Separation) algorithm supports dual-channel processing: it blindly separates the target source from other interfering sounds, extracting a useful audio signal and safeguarding the quality of subsequent speech processing.
    • The NS (Noise Suppression) algorithm supports single-channel processing and suppresses non-speech noise in a single-channel audio signal; it is particularly effective against stationary noise.
    • Which of the two algorithms is used depends on the number of microphones configured.
  4. VAD

    • The VAD (Voice Activity Detection) algorithm reports the voice activity state of the current frame in real time.
  5. WakeNet

    Wake word detection

[Flow chart: speech recognition scenario pipeline]

3.2 Voice Call Scenario

Model steps:

  1. Audio input
  2. AEC echo cancellation (removes the audio the device itself is playing back, which requires an echo reference channel)
    • Hardware loopback: the data written to the speaker is read back directly over I2S (the reference can share one I2S bus with the microphones)
    • Software loopback: the data written to the speaker is copied in software (not yet supported, under development)
  3. BSS/NS
    • The BSS (Blind Source Separation) algorithm supports dual-channel processing: it blindly separates the target source from other interfering sounds, extracting a useful audio signal and safeguarding the quality of subsequent speech processing.
    • The NS (Noise Suppression) algorithm supports single-channel processing and suppresses non-speech noise in a single-channel audio signal; it is particularly effective against stationary noise.
    • Which of the two algorithms is used depends on the number of microphones configured.
  4. MISO
    • The MISO (Multi Input Single Output) algorithm takes dual-channel input and produces single-channel output. It selects the audio stream with the higher signal-to-noise ratio in a dual-mic setup when wake word detection is not enabled.
  5. AGC
    • AGC (Automatic Gain Control) dynamically adjusts the amplitude of the output audio. When a weak signal is input, the output amplitude is amplified; when the input signal reaches a certain strength, the output amplitude is compressed.

[Flow chart: voice call scenario pipeline]

3.3 Configuration code reference

#define AFE_CONFIG_DEFAULT() { \
    .aec_init = true,                                   /* whether the AEC algorithm is enabled */ \
    .se_init = true,                                    /* whether the BSS/NS algorithm is enabled */ \
    .vad_init = true,                                   /* whether VAD is enabled (only used in the speech recognition scenario) */ \
    .wakenet_init = true,                               /* whether wake word detection is enabled */ \
    .voice_communication_init = false,                  /* whether voice communication is enabled; cannot be enabled together with wakenet_init */ \
    .voice_communication_agc_init = false,              /* whether AGC is enabled for voice communication */ \
    .voice_communication_agc_gain = 15,                 /* AGC gain, in dB */ \
    .vad_mode = VAD_MODE_3,                             /* VAD operating mode; the larger the value, the more aggressive the detection */ \
    .wakenet_model_name = NULL,                         /* wake word model to use */ \
    .wakenet_mode = DET_MODE_2CH_90,                    /* wake-up mode; choose according to the number of mic channels */ \
    .afe_mode = SR_MODE_LOW_COST,                       /* SR_MODE_LOW_COST: quantized version, uses fewer resources; */ \
                                                        /* SR_MODE_HIGH_PERF: non-quantized version, uses more resources */ \
    .afe_perferred_core = 0,                            /* CPU core on which the internal AFE BSS/NS/MISO algorithms run */ \
    .afe_perferred_priority = 5,                        /* task priority of the internal AFE BSS/NS/MISO algorithms */ \
    .afe_ringbuf_size = 50,                             /* size of the internal ring buffer */ \
    .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,   /* allocate most memory from external PSRAM */ \
    .agc_mode = AFE_MN_PEAK_AGC_MODE_2,                 /* linearly amplify the audio fed to MultiNet, with the peak at -4 dB */ \
    .pcm_config.total_ch_num = 3,                       /* total_ch_num = mic_num + ref_num */ \
    .pcm_config.mic_num = 2,                            /* number of microphone channels; currently only 1 or 2 is supported */ \
    .pcm_config.ref_num = 1,                            /* number of echo reference channels; currently only 0 or 1 is supported */ \
}
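
The default configuration above targets the speech recognition scenario. Below is a minimal sketch of how it could be adjusted for the voice call scenario, based purely on the field comments above (wakenet_init and voice_communication_init cannot be enabled at the same time); field availability and the AFE handle to use should be checked against your esp-sr version:

afe_config_t afe_config = AFE_CONFIG_DEFAULT();

// Voice call scenario: disable wake word detection and VAD, enable voice communication and its AGC.
afe_config.wakenet_init = false;
afe_config.vad_init = false;                        // VAD is only used in the speech recognition scenario
afe_config.voice_communication_init = true;
afe_config.voice_communication_agc_init = true;
afe_config.voice_communication_agc_gain = 15;       // AGC gain in dB

// Note: depending on the esp-sr version, the voice communication pipeline may be created
// through a dedicated handle (e.g. ESP_AFE_VC_HANDLE) rather than ESP_AFE_SR_HANDLE.
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);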

4 AI voice model

4.1 WakeNet

4.1.1 Select model through menuconfig

wn9_hiesp (the latest wn9 uses 8-bit quantization by default): version 9, wake word "Hi, ESP"


4.2 MultiNet

4.2.1 Select model through menuconfig

mn4q8_cn: version 4, 8-bit quantization, Chinese command words

4.3 Add command words

4.3.1 Add command words through menuconfig

  • Chinese command words are added directly as pinyin, e.g. "turn on the air conditioner" → da kai kong tiao. Multiple phrases can also map to the same COMMAND ID, e.g. two wordings of "maximum fan speed".

    Dialect command words can be added by entering the corresponding dialect pronunciation.

  • English command words require the corresponding phonemes, which are generated with a Python script.

4.3.2 Dynamically add command words in the code

esp_mn_commands_add(i, token);

Command words can also be added dynamically at run time by calling this API, as in the sketch below.
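
A minimal sketch of that flow, assuming the esp_mn_commands_* API of recent esp-sr releases (the header name and the exact signatures of esp_mn_commands_clear / esp_mn_commands_update may differ between versions):

#include "esp_mn_speech_commands.h"

// Hypothetical helper: replace the command list registered via menuconfig at run time.
void register_commands(void)
{
    esp_mn_commands_clear();                        // drop the commands compiled in via sdkconfig
    esp_mn_commands_add(0, "da kai kong tiao");     // command ID 0: "turn on the air conditioner"
    esp_mn_commands_add(1, "guan bi kong tiao");    // command ID 1: "turn off the air conditioner"
    esp_mn_commands_update();                       // apply the new command list to MultiNet
}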

4.4 Algorithm performance

Only about 20% CPU, 30 KB of SRAM, and 500 KB of PSRAM are consumed.

5 Microphone Design

5.1 Microphone performance recommendation

  1. Microphone type: omnidirectional MEMS microphone.

  2. Sensitivity:

    • Under 1 Pa of sound pressure, the analog sensitivity should be no lower than -38 dBV, and the digital sensitivity no lower than -26 dB
    • Sensitivity tolerance should be within ±2 dB; ±1 dB is recommended for microphone arrays
  3. SNR

    The signal-to-noise ratio should be no lower than 62 dB; >64 dB is recommended.

    The higher the signal-to-noise ratio, the higher the fidelity of the sound.

  4. Frequency response: the frequency response fluctuation over the 50 Hz ~ 16 kHz range should be within ±3 dB.

  5. Power supply rejection ratio (PSRR): >55 dB (MEMS MIC).

6 Structural Design Suggestions

  1. The hole diameter or width of the microphone opening should be greater than 1 mm, the pickup duct should be as short as possible, and the cavity as small as possible, to ensure that the resonance frequency of the microphone plus structural parts stays above 9 kHz.

  2. The depth to diameter ratio of the pickup hole is less than 2:1, and the shell thickness is recommended to be 1 mm. If the shell is too thick, the hole area needs to be increased.

  3. The acoustic holes in the enclosure should be protected with a dust-proof mesh.

  4. A silicone sleeve or foam must be added between the microphone and the device shell for sealing and shockproof, and an interference fit design is required to ensure the microphone’s tightness.

  5. The microphone hole must not be blocked; a bottom-facing microphone hole needs to be raised structurally so it is not blocked by the surface the device rests on.

  6. The microphone should be placed away from speakers and other objects that generate noise or vibration, and should be isolated and buffered from the speaker cavity by rubber pads.

7 Code explanation (CN_SPEECH_COMMANDS_RECOGNITION)

7.1 Header files

#include "esp_wn_iface.h"                   //唤醒词模型的一系列API
#include "esp_wn_models.h"					//根据输入的模型名称得到具体的唤醒词模型
#include "esp_afe_sr_iface.h"				//语音识别的音频前端算法的一系列API
#include "esp_afe_sr_models.h"              //语音前端模型的声明
#include "esp_mn_iface.h"                   //命令词模型的一系列API
#include "esp_mn_models.h"                  //命令词模型的声明
#include "esp_board_init.h"                 //开发板硬件初始化
#include "driver/i2s.h"                     //i2s 驱动
#include "speech_commands_action.h"         //根据识别到的 command 进行语音播报/闪烁 LED
#include "model_path.h"                     //从 spiffs 文件管理中返回模型路径等 API

7.2 app_main

void app_main()
{
    models = esp_srmodel_init("model");                                // load all available models from the "model" partition (read from flash by default)
    ESP_ERROR_CHECK(esp_board_init(AUDIO_HAL_08K_SAMPLES, 1, 16));     // special config for the dev board
    // ESP_ERROR_CHECK(esp_sdcard_init("/sdcard", 10));                // initialize the SD card
#if defined CONFIG_ESP32_KORVO_V1_1_BOARD
    led_init();                                                        // LED initialization
#endif

    afe_handle = &ESP_AFE_SR_HANDLE;
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();                    // audio front-end configuration

    afe_config.wakenet_model_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL);  // find the wake word model name among all available models
#if defined CONFIG_ESP32_S3_BOX_BOARD || defined CONFIG_ESP32_S3_EYE_BOARD
    afe_config.aec_init = false;
#endif
    //afe_config.aec_init = false;                                       // disable AEC
    //afe_config.se_init = false;                                        // disable BSS/NS
    //afe_config.vad_init = false;                                       // disable VAD
    //afe_config.pcm_config.total_ch_num = 2;                            // single mic plus single echo reference
    //afe_config.pcm_config.mic_num = 1;                                 // one microphone channel
    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);

    xTaskCreatePinnedToCore(&feed_Task, "feed", 4 * 1024, (void*)afe_data, 5, NULL, 0);        // feed: fetch audio data from I2S
    xTaskCreatePinnedToCore(&detect_Task, "detect", 8 * 1024, (void*)afe_data, 5, NULL, 1);    // detect: feed the audio data to the models and get detection results

#if defined CONFIG_ESP32_S3_KORVO_1_V4_0_BOARD || defined CONFIG_ESP32_KORVO_V1_1_BOARD
    xTaskCreatePinnedToCore(&led_Task, "led", 2 * 1024, NULL, 5, NULL, 0);                     // start the LED task
#endif
#if defined CONFIG_ESP32_S3_KORVO_1_V4_0_BOARD || defined CONFIG_ESP32_S3_KORVO_2_V3_0_BOARD || defined CONFIG_ESP32_KORVO_V1_1_BOARD
    xTaskCreatePinnedToCore(&play_music, "play", 2 * 1024, NULL, 5, NULL, 1);                  // start voice feedback playback
#endif
}

7.3 Feed operation

void feed_Task(void *arg)
{
    esp_afe_sr_data_t *afe_data = arg;
    int audio_chunksize = afe_handle->get_feed_chunksize(afe_data);
    int nch = afe_handle->get_channel_num(afe_data);
    int feed_channel = esp_get_feed_channel();         // 3
    int16_t *i2s_buff = malloc(audio_chunksize * sizeof(int16_t) * feed_channel);
    assert(i2s_buff);
    size_t bytes_read;

    while (1) {
        // Method 1
        // audio_chunksize: chunk length, 512 samples -> 32 ms, 256 samples -> 16 ms
        // int16_t: 16-bit quantization
        // feed_channel: two microphone channels plus one echo reference channel
        esp_get_feed_data(i2s_buff, audio_chunksize * sizeof(int16_t) * feed_channel);
        // Method 2 (alternative to Method 1: read directly from I2S instead)
        // i2s_read(I2S_NUM_1, i2s_buff, audio_chunksize * sizeof(int16_t) * feed_channel, &bytes_read, portMAX_DELAY);
        afe_handle->feed(afe_data, i2s_buff);
    }
    afe_handle->destroy(afe_data);
    vTaskDelete(NULL);
}

7.4 Detect operation

void detect_Task(void *arg)
{
    esp_afe_sr_data_t *afe_data = arg;
    int afe_chunksize = afe_handle->get_fetch_chunksize(afe_data);
    int nch = afe_handle->get_channel_num(afe_data);
    char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_CHINESE);       // get the command word model name from the model list
    printf("multinet:%s\n", mn_name);
    esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);                     // get the command word model
    model_iface_data_t *model_data = multinet->create(mn_name, 5760);                // create the model instance
    esp_mn_commands_update_from_sdkconfig(multinet, model_data);                     // add speech commands from sdkconfig
    int mu_chunksize = multinet->get_samp_chunksize(model_data);
    int chunk_num = multinet->get_samp_chunknum(model_data);
    assert(mu_chunksize == afe_chunksize);
    printf("------------detect start------------\n");
    // FILE *fp = fopen("/sdcard/out1", "w");
    // if (fp == NULL) printf("can not open file\n");
    while (1) {
        afe_fetch_result_t* res = afe_handle->fetch(afe_data);                       // get the AFE processing result
        if (!res || res->ret_value == ESP_FAIL) {
            printf("fetch error!\n");
            break;
        }
#if CONFIG_IDF_TARGET_ESP32
        if (res->wakeup_state == WAKENET_DETECTED) {
            printf("wakeword detected\n");
            play_voice = -1;
            detect_flag = 1;
            afe_handle->disable_wakenet(afe_data);
            printf("-----------listening-----------\n");
        }
#elif CONFIG_IDF_TARGET_ESP32S3
        if (res->wakeup_state == WAKENET_DETECTED) {
            printf("WAKEWORD DETECTED\n");                                           // wake word detected
        } else if (res->wakeup_state == WAKENET_CHANNEL_VERIFIED) {
            play_voice = -1;
            detect_flag = 1;                                                         // set the detection flag once the wake-up channel is verified
            printf("AFE_FETCH_CHANNEL_VERIFIED, channel index: %d\n", res->trigger_channel_id);
        }
#endif

        if (detect_flag == 1) {
            esp_mn_state_t mn_state = multinet->detect(model_data, res->data);       // feed the AFE-processed audio to the command word model

            if (mn_state == ESP_MN_STATE_DETECTING) {
                continue;
            }

            if (mn_state == ESP_MN_STATE_DETECTED) {
                esp_mn_results_t *mn_result = multinet->get_results(model_data);     // get the recognition results
                for (int i = 0; i < mn_result->num; i++) {
                    printf("TOP %d, command_id: %d, phrase_id: %d, prob: %f\n",
                    i+1, mn_result->command_id[i], mn_result->phrase_id[i], mn_result->prob[i]);
                }
                printf("\n-----------listening-----------\n");
            }

            if (mn_state == ESP_MN_STATE_TIMEOUT) {                                  // timeout: re-enable wake word detection
                afe_handle->enable_wakenet(afe_data);
                detect_flag = 0;
                printf("\n-----------awaits to be waken up-----------\n");
                continue;
            }
        }
    }
    afe_handle->destroy(afe_data);
    vTaskDelete(NULL);
}

8 Espressif AI related GitHub references

Original post: blog.csdn.net/Marchtwentytwo/article/details/129370026