【iOS】Implementing Acoustic Echo Cancellation (AEC) for Voice Calls on iOS

1. Introduction

In voice calls, interactive live broadcasts, voice-to-text applications, and games, audio has to be captured from the user's microphone and then sent to other endpoints or to a speech recognition service. If the raw microphone data is used directly, there will be an echo problem. Echo arises when, during a voice call with the loudspeaker turned on, the user's own voice and the remote party's voice (played through the speaker) are picked up together by the microphone. If the remote party's voice is not removed from the captured signal, the remote party hears their own voice coming back as an echo. Therefore, after the microphone data is captured, the echo must be cancelled in order to provide a good user experience.

The technique is called Acoustic Echo Cancellation, or AEC for short. In terms of technical detail, echo cancellation is a fairly complicated mathematical problem. In practice, device manufacturers provide the underlying echo cancellation implementation, and the app only needs to call the relevant API. It is not a trivial call, however: you need a basic understanding of the audio frameworks available on the device, what each framework is for, which frameworks can perform AEC, how to call the AEC-related APIs, and what pitfalls to watch out for. This article organizes the relevant knowledge on how to implement echo cancellation on iOS devices.

2. Overview of iOS audio framework

The iOS system provides five audio frameworks; from the upper layer to the lower layer they are Media Player, AV Foundation, OpenAL, Audio Toolbox, and AudioUnit, as shown in the figure below. All audio technologies in iOS are built on top of the audio unit (AudioUnit): Media Player, AV Foundation, OpenAL, and Audio Toolbox are all wrappers around AudioUnit that provide dedicated, simplified APIs for specific tasks.
[Figure: the layered iOS audio frameworks, with AudioUnit at the bottom]
Basic audio needs, such as recording, playback, and sound effects, can be handled with the upper-layer frameworks. Use AudioUnit directly only when you need the highest degree of control, performance, or flexibility, or when you need functionality (such as AEC) that is only available by using an audio unit directly.

When we need one of the following capabilities, use the low-level AudioUnit directly instead of going through the higher-level APIs:

  • Simultaneous audio I/O (input and output) with low latency, such as VoIP (Voice over Internet Protocol) applications
  • Responsive playback of synthetic sounds, such as music-type games or synthetic instruments
  • Use specific Audio Unit functions such as acoustic echo cancellation, mixing or tonal equalization
  • A processing chain architecture that lets you assemble audio processing modules into flexible networks. This is the only audio API in iOS that provides this functionality.

In the current game scenario, microphone data must be captured through the iOS audio framework. The iOS APIs that provide recording are AVAudioRecorder, AudioQueue, and AudioUnit. AVAudioRecorder can only record microphone audio to a file, while AudioQueue can deliver microphone audio in real time but provides no AEC. Therefore, AudioUnit is the only choice for recording audio in the game scene while eliminating the echo.

3. Classification of AudioUnit

| Purpose | Audio Units |
| --- | --- |
| Effect | iPod Equalizer |
| Mixing | 3D Mixer; Multichannel Mixer |
| I/O (input/output) | Remote I/O (connects to the microphone and speaker); Voice-Processing I/O (same as Remote I/O, plus echo suppression and other features for network voice); Generic Output (outputs to the application) |
| Format conversion | Format Converter |

(1) Effect unit

iOS provides one effect unit, the iPod Equalizer, the same equalizer used by the built-in iPod application. When using this audio unit, you must provide your own UI. It offers a set of preset equalization curves such as Bass Booster, Pop, and Spoken Word.

(2) Mixer unit

iOS provides two mixer units: the 3D Mixer and the Multichannel Mixer. The 3D Mixer unit is the basis upon which OpenAL is built; in most cases, if you need its functionality, your best option is to use OpenAL, which provides a higher-level API well suited to games. The Multichannel Mixer unit mixes any number of mono or stereo streams into a stereo output. You can turn each input on or off, set its input gain, and set its stereo pan position.
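
For illustration, here is a minimal sketch of adjusting those per-input settings with AudioUnitSetParameter; mixerUnit is an assumed, already-created Multichannel Mixer instance and bus 0 one of its inputs.

// Sketch: adjusting one input bus of a Multichannel Mixer unit.
// Assumes "mixerUnit" is an instantiated Multichannel Mixer audio unit.
UInt32 inputBus = 0;

// Turn the input bus on (1 = on, 0 = off)
AudioUnitSetParameter(mixerUnit, kMultiChannelMixerParam_Enable,
                      kAudioUnitScope_Input, inputBus, 1, 0);

// Set the input gain (linear, 0.0 to 1.0)
AudioUnitSetParameter(mixerUnit, kMultiChannelMixerParam_Volume,
                      kAudioUnitScope_Input, inputBus, 0.8, 0);

// Set the stereo pan position (-1.0 = left, 0.0 = center, 1.0 = right)
AudioUnitSetParameter(mixerUnit, kMultiChannelMixerParam_Pan,
                      kAudioUnitScope_Input, inputBus, 0.0, 0);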

(3) Input/output units

iOS provides three I/O units, namely Remote I/O, Voice-Processing I/O, and Generic Output.
1. The Remote I/O unit is the most commonly used. It connects to the device's input and output audio hardware (such as the microphone and speaker), giving you low-latency access to the incoming and outgoing audio sample values. It provides format conversion between the hardware audio format and your application's audio format via an included format converter unit.
2. The Voice-Processing I/O unit extends the Remote I/O unit by adding acoustic echo cancellation for VoIP or voice chat applications. It also provides automatic gain correction, adjustment of voice-processing quality, and muting (a short sketch of these switches follows this list).
3. The Generic Output unit is not connected to the audio hardware; instead, it provides a mechanism for sending the output of a processing chain to your application. You would typically use it for offline audio processing.
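
As referenced in item 2 above, the Voice-Processing I/O unit exposes properties for toggling its voice-processing features. A minimal sketch, assuming vpioUnit is an instantiated Voice-Processing I/O unit:

// Sketch: toggling voice-processing features on a Voice-Processing I/O unit.
UInt32 enableAGC = 1;   // 1 = automatic gain control on
AudioUnitSetProperty(vpioUnit, kAUVoiceIOProperty_VoiceProcessingEnableAGC,
                     kAudioUnitScope_Global, 0, &enableAGC, sizeof(enableAGC));

UInt32 bypass = 0;      // 0 = keep echo cancellation active, 1 = bypass voice processing
AudioUnitSetProperty(vpioUnit, kAUVoiceIOProperty_BypassVoiceProcessing,
                     kAudioUnitScope_Global, 0, &bypass, sizeof(bypass));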

(4) Format converter unit

iOS provides a Format Converter unit, which is usually used indirectly through an I/O unit.

An AudioUnit is uniquely identified by three parts: type, subtype, and manufacturer ID, as shown in the table below.

| Name and description | Identifier keys (type / subtype / manufacturer) | Four-char codes |
| --- | --- | --- |
| Format Converter unit (audio format conversion to and from linear PCM) | kAudioUnitType_FormatConverter / kAudioUnitSubType_AUConverter / kAudioUnitManufacturer_Apple | aufc / conv / appl |
| iPod Equalizer unit (provides the iPod equalizer) | kAudioUnitType_Effect / kAudioUnitSubType_AUiPodEQ / kAudioUnitManufacturer_Apple | aufx / ipeq / appl |
| 3D Mixer unit (mixes multiple streams with output panning, sample rate conversion, etc.) | kAudioUnitType_Mixer / kAudioUnitSubType_AU3DMixerEmbedded / kAudioUnitManufacturer_Apple | aumx / 3dem / appl |
| Multichannel Mixer unit (mixes multiple streams into one stereo stream) | kAudioUnitType_Mixer / kAudioUnitSubType_MultiChannelMixer / kAudioUnitManufacturer_Apple | aumx / mcmx / appl |
| Generic Output unit (supports conversion to and from linear PCM; can be used to start and stop a graph) | kAudioUnitType_Output / kAudioUnitSubType_GenericOutput / kAudioUnitManufacturer_Apple | auou / genr / appl |
| Remote I/O unit (connects to device hardware for input, output, or simultaneous input and output) | kAudioUnitType_Output / kAudioUnitSubType_RemoteIO / kAudioUnitManufacturer_Apple | auou / rioc / appl |
| Voice-Processing I/O unit (has the characteristics of the Remote I/O unit and adds echo suppression for two-way communication) | kAudioUnitType_Output / kAudioUnitSubType_VoiceProcessingIO / kAudioUnitManufacturer_Apple | auou / vpio / appl |

From the classification above, it is clear that echo cancellation requires the Voice-Processing I/O type of AudioUnit.

4. AudioUnit internal architecture

The various parts of the audio unit are organized into Scope and Element, as shown in the figure. When calling a function to configure an AudioUnit, you must specify a Scope and Element to identify the specific target of the function.
[Figure: the scopes of an audio unit (input, output, global) and the elements within them]


For example, the following call configures an audio unit (here, setting the number of input elements on a mixer unit):

UInt32 busCount = 2;
 
OSStatus result = AudioUnitSetProperty (
    mixerUnit,
    kAudioUnitProperty_ElementCount,   // the property key
    kAudioUnitScope_Input,             // the scope to set the property on
    0,                                 // the element to set the property on
    &busCount,                         // the property value
    sizeof (busCount)
);

A scope is a programming context within an audio unit. Despite what the name global scope might suggest, these contexts are never nested.
An element is a programming context nested within a scope. When an element is part of the input or output scope, it is analogous to a signal bus in physical audio hardware, which is why elements are sometimes called buses. These two terms, element and bus, refer to exactly the same thing in Audio Unit programming; "bus" is used when emphasizing signal flow and "element" when emphasizing a specific functional aspect of an audio unit, such as the input and output elements of an I/O unit.
You specify an element (or bus) by its zero-indexed integer value. When setting a property or parameter that applies to an entire scope, specify an element value of 0.
The diagram above illustrates a common architecture for an audio unit, where the number of elements on the input and output are the same. However, various audio units use various architectures. For example, a mixer unit may have multiple input elements but only one output element.
The global scope, as shown at the bottom of the figure above, applies to the entire audio unit and is not associated with any specific audio stream. It has only one element, element 0. Some properties, such as the maximum number of frames per slice ( kAudioUnitProperty_MaximumFramesPerSlice ), are only available at the global scope.
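
For example, a minimal sketch of setting that global-scope property (assuming ioUnit is an instantiated audio unit):

// Sketch: setting a global-scope property; the global scope always uses element 0.
UInt32 maxFrames = 4096;
AudioUnitSetProperty(ioUnit,
                     kAudioUnitProperty_MaximumFramesPerSlice,
                     kAudioUnitScope_Global,
                     0,                    // element 0 of the global scope
                     &maxFrames,
                     sizeof(maxFrames));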

The architecture of an I/O unit is shown below:

[Figure: I/O unit architecture, with element 1 connected to the microphone and element 0 connected to the speaker]


An I/O unit contains exactly two elements, and although they are part of one audio unit, your application treats them largely as separate entities. For example, using the kAudioOutputUnitProperty_EnableIO property, you can enable or disable each element independently, depending on the needs of your application.
Element 1 of the I/O unit connects directly to the device's audio input hardware, represented in the figure by a microphone. This hardware connection (at the input scope of element 1) is opaque to you. Your first access to the incoming audio data is at the output scope of element 1.
Likewise, element 0 of the I/O unit connects directly to the device's audio output hardware, represented in the figure by a speaker. You can route audio to the input scope of element 0, but its output scope is opaque.

When working with audio units, you will often hear the two elements of an I/O unit described not by number but by name:
  • The input element is element 1 (mnemonic: the letter "I" in "Input" looks like the number 1).
  • The output element is element 0 (mnemonic: the letter "O" in "Output" looks like the number 0).
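
The per-element independence mentioned above can be illustrated with the kAudioOutputUnitProperty_EnableIO property. A minimal sketch, assuming ioUnit is an instantiated Remote I/O or Voice-Processing I/O unit (on the Remote I/O unit, input is disabled and output enabled by default):

// Sketch: enabling input on element 1 and output on element 0 of an I/O unit.
UInt32 one = 1;

// Enable input: element 1, input scope (the side connected to the microphone)
AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_EnableIO,
                     kAudioUnitScope_Input, 1, &one, sizeof(one));

// Enable output: element 0, output scope (the side connected to the speaker)
AudioUnitSetProperty(ioUnit, kAudioOutputUnitProperty_EnableIO,
                     kAudioUnitScope_Output, 0, &one, sizeof(one));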

5. AudioUnit workflow

Audio units typically work within the context of an enclosing object called an audio processing graph, as shown in the figure below. In this example, your application sends audio to the first audio units in the graph through one or more callback functions and controls each audio unit individually. The output of the I/O unit, the last audio unit in any audio processing graph, connects directly to the output hardware.

[Figure: an audio processing graph in which application callbacks feed EQ units and the final I/O unit drives the output hardware]
(EQ is short for Equalizer Unit.)

The figure below shows an audio processing graph built from a classic Multichannel Mixer unit and a Remote I/O unit, used to mix and play two synthesized sounds. The sounds are first routed to the mixer's two input buses; the mixer output goes to the output element of the I/O unit, which then delivers the sound to the hardware.

[Figure: two synthesized sounds feeding the Multichannel Mixer's input buses; the mixer output feeds the Remote I/O unit's output element]
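
A hedged sketch of wiring such a graph with the AUGraph API; processingGraph, mixerNode, and ioNode are hypothetical names for an opened graph and its two nodes:

// Sketch: connecting the mixer node's output bus 0 to the I/O node's element 0 (speaker side).
AUGraphConnectNodeInput(processingGraph,
                        mixerNode, 0,   // source node and its output bus
                        ioNode,    0);  // destination node and its input bus

// Initialize and start the whole graph
AUGraphInitialize(processingGraph);
AUGraphStart(processingGraph);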

6. Using AudioUnit to implement echo cancellation

Using an AudioUnit roughly involves five steps: obtain an AudioUnit instance, set the audio parameters (such as sampling rate and number of channels), set the capture or output callback, initialize the AudioUnit, and start the AudioUnit.
The following uses the Voice-Processing I/O unit as an example to show how to eliminate echo while recording.

6.1 Get an AudioUnit instance

In some cases a single AudioUnit is enough for our purposes; in others, several AudioUnits need to process audio together.
iOS provides one API for working with audio units directly, and another for building audio processing graphs in which multiple AudioUnits cooperate.
To work with audio units directly (configure and control them), use the functions described in the Audio Unit Component Services Reference.
To create and configure an audio processing graph (a processing chain of audio units), use the functions described in the Audio Unit Processing Graph Services Reference.

6.1.1 Create AudioComponentDescription to identify AudioUnit
// AudioUnit description
AudioComponentDescription ioUnitDescription;
 
// Main type of the AudioUnit
ioUnitDescription.componentType = kAudioUnitType_Output;

// Subtype of the AudioUnit: Voice-Processing I/O, which supports echo cancellation
ioUnitDescription.componentSubType = kAudioUnitSubType_VoiceProcessingIO;

// Manufacturer of the AudioUnit; currently only Apple
ioUnitDescription.componentManufacturer = kAudioUnitManufacturer_Apple;

// The following two fields are always 0
ioUnitDescription.componentFlags = 0;
ioUnitDescription.componentFlagsMask = 0;

The description above uniquely identifies an AudioUnit type; the next step is to obtain an instance.

6.1.2 Create AudioComponent to get AudioUnit instance
// Find the AudioComponent
// Passing NULL as the first parameter tells this function to search in the system-defined order and return the first matching audio unit
AudioComponent foundIoUnitReference = AudioComponentFindNext (NULL,&ioUnitDescription);

// The AudioUnit to be instantiated
AudioUnit ioUnitInstance;

// Instantiate the AudioUnit
AudioComponentInstanceNew(foundIoUnitReference,&ioUnitInstance);

An AudioComponent represents a type of AudioUnit and is used to instantiate it; one AudioComponent can be used to create multiple AudioUnit instances, much like the relationship between a class and its objects in object-oriented programming.

6.1.3 You can also use the audio processing graph (AUGraph) to obtain AudioUnit instances

When you need multiple AudioUnits to work together, use an AUGraph. If only one AudioUnit is needed, the approach described above is recommended.

// Declare and instantiate an audio processing graph
AUGraph processingGraph;
NewAUGraph (&processingGraph);
 
// Add an audio unit node to the graph, then instantiate the audio unit
AUNode ioNode;
AUGraphAddNode (
    processingGraph,
    &ioUnitDescription,
    &ioNode
);
AUGraphOpen (processingGraph); // indirectly performs audio unit instantiation
 
// Obtain a reference to the newly-instantiated I/O unit
AudioUnit ioUnit;
AUGraphNodeInfo (
    processingGraph,
    ioNode,
    NULL,
    &ioUnit
);

6.2 Set the basic parameters of AudioUnit

Next, you need to set the basic parameters of AudioUnit, such as sampling rate, number of channels, sampling depth, etc.

    AudioStreamBasicDescription mAudioFormat;
    mAudioFormat.mSampleRate = 16000; // Set the sampling rate as needed; higher means finer detail, and 16000 is enough for voice
    mAudioFormat.mFormatID = kAudioFormatLinearPCM;
    mAudioFormat.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked;
    mAudioFormat.mReserved = 0;
    mAudioFormat.mChannelsPerFrame = 1; // Number of channels
    mAudioFormat.mBitsPerChannel = 16; // Sampling depth
    mAudioFormat.mFramesPerPacket = 1; // Frames per packet
    // Bytes per frame (2 here)
    mAudioFormat.mBytesPerFrame = (mAudioFormat.mBitsPerChannel / 8) * mAudioFormat.mChannelsPerFrame;
    // Bytes per packet (2 here)
    mAudioFormat.mBytesPerPacket =  mAudioFormat.mFramesPerPacket*mAudioFormat.mBytesPerFrame;
    
    UInt32 size = sizeof(mAudioFormat);
    
    // CheckError checks the returned status code and prints the message that follows if there is an error
    CheckError(AudioUnitSetProperty(remoteIOUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Output,
                                    1,
                                    &mAudioFormat,
                                    size),
               "kAudioUnitProperty_StreamFormat of bus 1 failed");
    
    
    CheckError(AudioUnitSetProperty(remoteIOUnit,
                                    kAudioUnitProperty_StreamFormat,
                                    kAudioUnitScope_Input,
                                    0,
                                    &mAudioFormat,
                                    size),
               "kAudioUnitProperty_StreamFormat of bus 0 failed");

6.3 Set the callback of AudioUnit

6.3.1 Define an audio input callback
// Audio input callback
OSStatus AudioInputCallback(void *inRefCon,
                            AudioUnitRenderActionFlags *ioActionFlags,
                            const AudioTimeStamp *inTimeStamp,
                            UInt32 inBusNumber,
                            UInt32 inNumberFrames,
                            AudioBufferList *__nullable ioData) {
    
    AudioUnitRecorder *recorder = (__bridge AudioUnitRecorder *)inRefCon;
    
    AudioBuffer buffer; // Create an audio buffer
    
    UInt32 size = inNumberFrames * recorder->mAudioFormat.mBytesPerFrame;
    buffer.mDataByteSize = size; // Buffer size
    buffer.mNumberChannels = 1;  // Number of channels
    buffer.mData = malloc(size); // Allocate memory
    
    AudioBufferList bufferList;    // Create the buffer list
    bufferList.mNumberBuffers = 1; // Only one audio buffer is needed
    bufferList.mBuffers[0] = buffer; // Assign the buffer to the list
    
    OSStatus status = noErr;
    
    // Call AudioUnitRender to pull the microphone data into the AudioBufferList created above
    status = AudioUnitRender(recorder->remoteIOUnit, ioActionFlags, inTimeStamp, 1, inNumberFrames, &bufferList);
 
    // Abort on error
    if (status != noErr) {
        printf("AudioUnitRender %d \n", (int)status);
        return status;
    }
    // Buffer the microphone audio here and hand it to the Alibaba Cloud speech recognition SDK
    if (recorder.isStarted) {
        NSData *frame = [recorder _bufferPCMFrame:&buffer]; // Buffer the PCM audio
        if (frame) {
            // Once the buffer is full, pass the data to the Alibaba Cloud SDK
            [recorder _handleVoiceFrame:frame];
        }
    } else {
        NSLog(@"WARN: audio, - recorder is stopped, ignoring the callback data %d bytes",(int)buffer.mDataByteSize);
    }
    // Free the memory
    free(buffer.mData);
    return status;
}
6.3.2 Set input callback for AudioUnit

To capture microphone data, set the input callback via the kAudioOutputUnitProperty_SetInputCallback property; the microphone audio is then obtained inside that callback. Since this scenario captures microphone audio, this is the callback used here.

    AURenderCallbackStruct callbackStruct;
    callbackStruct.inputProc = AudioInputCallback; // AudioInputCallback is the callback function defined above
    callbackStruct.inputProcRefCon = (__bridge void *)(self);
    OSStatus status = AudioUnitSetProperty(remoteIOUnit, kAudioOutputUnitProperty_SetInputCallback, kAudioUnitScope_Output, 0, &callbackStruct, sizeof(callbackStruct));
    // CheckError checks the returned status code and prints the message that follows if there is an error
    CheckError(status, "SetInputCallback error");

If instead you are playing an audio file or URL, set the render callback via kAudioUnitProperty_SetRenderCallback and supply the audio stream inside that callback.
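
For reference, a hedged sketch of such a render callback and its registration; FillFromMyPCMSource is a hypothetical application function that copies PCM data into the supplied buffer:

// Hypothetical application function: copies up to maxBytes of PCM data into dest,
// returning the number of bytes actually copied.
extern UInt32 FillFromMyPCMSource(void *dest, UInt32 maxBytes);

// Sketch: a render callback used with kAudioUnitProperty_SetRenderCallback.
OSStatus AudioRenderCallback(void *inRefCon,
                             AudioUnitRenderActionFlags *ioActionFlags,
                             const AudioTimeStamp *inTimeStamp,
                             UInt32 inBusNumber,
                             UInt32 inNumberFrames,
                             AudioBufferList *ioData) {
    for (UInt32 i = 0; i < ioData->mNumberBuffers; i++) {
        AudioBuffer *buffer = &ioData->mBuffers[i];
        // Copy the next chunk of PCM data; zero-fill (silence) if not enough is available
        UInt32 copied = FillFromMyPCMSource(buffer->mData, buffer->mDataByteSize);
        if (copied < buffer->mDataByteSize) {
            memset((char *)buffer->mData + copied, 0, buffer->mDataByteSize - copied);
        }
    }
    return noErr;
}

// Registering the render callback on the input scope of bus 0 (the speaker-side element):
AURenderCallbackStruct renderCallback = { AudioRenderCallback, NULL };
AudioUnitSetProperty(remoteIOUnit, kAudioUnitProperty_SetRenderCallback,
                     kAudioUnitScope_Input, 0, &renderCallback, sizeof(renderCallback));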

6.4 Initialize AudioUnit

 CheckError(AudioUnitInitialize(remoteIOUnit),"AudioUnitInitialize error");

6.5 Start AudioUnit

CheckError(AudioOutputUnitStart(remoteIOUnit),"AudioOutputUnitStart error");

After starting this AudioUnit, the callback delivers microphone data with the device's echo already removed.
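
When recording ends, the unit should be stopped and released. A minimal teardown sketch:

// Sketch: stopping and releasing the audio unit when recording is finished.
CheckError(AudioOutputUnitStop(remoteIOUnit), "AudioOutputUnitStop error");
CheckError(AudioUnitUninitialize(remoteIOUnit), "AudioUnitUninitialize error");
AudioComponentInstanceDispose(remoteIOUnit);
remoteIOUnit = NULL;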

7. About AudioSession (audio session)

Before using AudioUnit, the AudioSession must be configured correctly; otherwise AudioUnit cannot work properly.

7.1 AudioSession working mechanism

iOS uses AudioSession to manage audio behavior within an application, between applications, and at the device level, as shown below.

[Figure: AudioSession mediating between the application and the operating system / audio hardware]

Before using system audio features, you need to tell the system how you intend to use audio in your application. AudioSession acts as an intermediary between your application and the operating system, and in turn, the underlying audio hardware. You can use it to communicate the nature of your app's audio to the operating system without specifying specific behavior or required interaction with the audio hardware. Delegating the management of these details to the audio session ensures optimal management of the user's audio experience.

7.2 AudioSession usage steps

Interact with the application's AudioSession through the AVAudioSession instance.
The steps to use are as follows:

  1. Configure the AudioSession category and mode to tell the system how you intend to use audio in your application
  2. Activate the audio session to put the category and mode configuration into effect
  3. Subscribe and respond to important audio session notifications, such as interruptions and route changes (e.g., headphones being plugged in or unplugged)
  4. Perform advanced audio device configuration, such as setting the sampling rate, I/O buffer duration, and number of channels.

For example, the current scenario needs to record while playing audio, so the AudioSession is configured as follows.
// Obtain the AudioSession instance
AVAudioSession *audioSession = [AVAudioSession sharedInstance];
     
// Allow playback and recording at the same time; by default only playback is allowed.
[audioSession setCategory:AVAudioSessionCategoryPlayAndRecord withOptions:AVAudioSessionCategoryOptionDefaultToSpeaker|AVAudioSessionCategoryOptionAllowBluetooth error:nil];
    
// An I/O buffer duration of 0.02 s; this determines inNumberFrames in the AudioUnit callback.
[audioSession setPreferredIOBufferDuration:0.02 error:nil];

// Activate the audio session
[audioSession setActive:YES error:nil];
    
// Listen for headphone/route changes
[[NSNotificationCenter defaultCenter] addObserver:self selector:@selector(audioRouteChangeListenerCallback:)   name:AVAudioSessionRouteChangeNotification object:audioSession];

When recording and playing at the same time, you need the AVAudioSessionCategoryPlayAndRecord category. With this category, audio is routed to the earpiece by default, so add the AVAudioSessionCategoryOptionDefaultToSpeaker option to force the speaker on, and add the AVAudioSessionCategoryOptionAllowBluetooth option so that a Bluetooth headset can be connected mid-session.
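
The audioRouteChangeListenerCallback: selector registered above is application code, not a system callback. A hedged sketch of what it might do, following the headphone-removal workaround described in the next section (stopRecording and startRecording are hypothetical methods):

// Sketch: a possible route-change handler.
- (void)audioRouteChangeListenerCallback:(NSNotification *)notification {
    NSInteger reason = [notification.userInfo[AVAudioSessionRouteChangeReasonKey] integerValue];
    if (reason == AVAudioSessionRouteChangeReasonOldDeviceUnavailable) {
        // Headphones were unplugged: stop recording, then restart a moment later
        // so that audio output can return to the speaker (see section 8).
        [self stopRecording];
        dispatch_after(dispatch_time(DISPATCH_TIME_NOW, (int64_t)(2 * NSEC_PER_SEC)),
                       dispatch_get_main_queue(), ^{
            [self startRecording];
        });
    }
}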

For the types and behavioral characteristics of each AudioSession category, refer to Apple's AVAudioSession documentation.

8. Some experience and precautions

  • To record and play audio at the same time, the AudioSession category must be set to AVAudioSessionCategoryPlayAndRecord.
  • With AVAudioSessionCategoryPlayAndRecord, audio is played from the earpiece by default; add the AVAudioSessionCategoryOptionDefaultToSpeaker option to force the speaker on.
  • With AVAudioSessionCategoryOptionDefaultToSpeaker alone, sound is still output from the speaker while a Bluetooth headset is worn (wired headsets are not affected); add the AVAudioSessionCategoryOptionAllowBluetooth option as well.
  • When AEC is enabled while recording and playing at the same time, start playback before recording; otherwise the sound is routed to the earpiece and AEC does not take effect. Without AEC the sound is normal.
  • If the headphones are removed while recording and playback continue, audio switches to the earpiece and the speaker can no longer be enabled. A workaround is to stop recording when headphone removal is detected and restart it after a short delay, for example 2 seconds. If there is a better solution, please leave a comment.
  • AVAudioSessionCategoryPlayback supports background playback and lock-screen playback. In addition, "Audio, AirPlay, and Picture in Picture" must be checked under Capabilities -> Background Modes.
  • The same applies to recording: if this is not checked, the app automatically stops recording when it switches to the background.

References:
https://developer.apple.com/library/archive/documentation/MusicAudio/Conceptual/AudioUnitHostingGuide_iOS/Introduction/Introduction.html
