Voiceprint recognition and sound source localization (2)

I. Introduction

What is sound source localization (SSL) technology? Sound source localization uses multiple microphones to measure the sound signal at different points in the environment. Because the signal arrives at each microphone with a different delay, the measured signals can be processed by an algorithm to obtain the direction of arrival (azimuth and elevation) and the distance of the source relative to the microphone array.

When it comes to sound source localization, it is natural to think of localization by the human ear. Both one ear and two ears can localize sound. In monaural localization, the different parts of the pinna reflect incoming sound waves before they enter the ear canal. Because these reflections differ in phase from the direct sound, the two interfere in the ear canal and produce a special auditory effect known as the pinna effect. Combined with head rotation, this allows the sound source to be localized. In binaural localization, the signals received by the left and right ears differ in arrival time (Interaural Time Difference, ITD) and in level (Interaural Level Difference, ILD). A sound is localized according to the ITD and ILD, and the determination of the azimuth angle can be expressed mathematically as a two-dimensional direction-of-arrival estimation problem, as shown in Figure 1 below. ITD information works better for direction estimation at low and medium frequencies, while ILD information works better at high frequencies. Combined with the pinna effect, head rotation, the precedence effect, and so on, we obtain a more accurate perception of angle and distance.

What is a Microphone Array?
        A microphone array consists of a certain number of microphones and samples and filters the spatial characteristics of the sound field. Commonly used microphone arrays can be classified by layout into linear arrays, planar arrays, and three-dimensional arrays. The array geometry is known by design, the frequency responses of all microphones are assumed identical, and the sampling clocks of the microphones are synchronized.
Microphone arrays are generally used for sound source localization (including estimation of angle and distance), suppression of background noise, interference, reverberation, and echo, signal extraction, and signal separation. Among these, sound source localization uses the microphone array to calculate the angle and distance between the sound source and the array, so as to track the target sound source.

Figure: ring 6-mic array and USB 4-mic array.

        Speech separation based on a microphone array uses an array (or multiple microphones) to mimic the human ear and applies a speech separation algorithm to separate the mixed, interfering signals collected by the microphones, so as to obtain the signal of interest. Sound source localization based on a microphone array first collects the speech signal with the array, then analyzes and processes the collected signals with digital signal processing techniques, and finally determines and tracks the spatial position of the sound source (that is, its coordinates in the plane or in space).

II. Sound source localization technology

 Sound source localization technology mainly consists of the following two parts:

  • Direction-of-arrival (DOA) estimation, including azimuth and elevation.
  • Distance estimation.

1. End-to-end model

In the end-to-end view, sound source localization extracts features from the collected sound signals and then maps these features to a localization output; this mapping relies heavily on the assumed acoustic propagation model.

Propagation model. The most common acoustic propagation models for sound source localization are the free-field model and the far-field model. In a free field, the sound reaches the microphone along a single direct path: there is no obstruction between the sound source and the microphone and no reflection of sound (no room reverberation), as in an open outdoor space or an anechoic chamber. In the far field, the distance from the sound source to the microphone array is much larger than the spacing between the microphones, so the sound waves can be treated as plane waves.
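
As a rough illustration of the far-field (plane-wave) assumption, the sketch below computes the relative delays seen by a uniform linear array for a given arrival angle; the array spacing, number of microphones, and speed of sound are assumed values.

```python
import numpy as np

# Far-field (plane-wave) delay model for a uniform linear array.
# Geometry and speed of sound are illustrative assumptions.
C = 343.0                     # speed of sound in air (m/s)
MIC_SPACING = 0.05            # spacing between adjacent microphones (m)
N_MICS = 4
mic_x = np.arange(N_MICS) * MIC_SPACING   # microphones placed along the x-axis

def far_field_delays(azimuth_deg):
    """Delay of a plane wave at each microphone relative to mic 0 (seconds)."""
    theta = np.deg2rad(azimuth_deg)
    # For a plane wave arriving from angle theta (measured from broadside),
    # the extra path length at a microphone located at x is x * sin(theta).
    return mic_x * np.sin(theta) / C

print(far_field_delays(30.0))  # delays for a source 30 degrees off broadside
```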

Features. Acoustic localization methods typically use the following acoustic features: the time difference of arrival (TDOA), the inter-microphone intensity difference (IID), spectral notches, the MUSIC pseudo-spectrum, and the beamforming steered response, among others.

Mapping procedures. The mapping step in sound source localization maps the features extracted from the array signals to position information.

 2. Implementation method

(1) Estimation of direction of arrival

Methods based on relative delay estimation. Because of the geometry of the array, the signals received by the individual microphones are delayed by different amounts. Methods based on relative delay estimation use cross-correlation, generalized cross-correlation (GCC), or phase differences to estimate the time delay differences between the array signals, and then combine them with the array geometry to estimate the azimuth of the sound source.
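
A minimal sketch of relative delay estimation with GCC-PHAT for a single microphone pair is shown below; the sampling rate, microphone spacing, and the simulated 3-sample delay are made-up values, and the noise signal merely stands in for a real recording.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` with GCC-PHAT (seconds)."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Hypothetical two-microphone far-field example: spacing d, speed of sound c.
fs, d, c = 16000, 0.1, 343.0
rng = np.random.default_rng(0)
mic1 = rng.standard_normal(fs)                 # 1 s of noise as a stand-in signal
mic2 = np.roll(mic1, 3)                        # mic 2 receives a 3-sample delayed copy
tau = gcc_phat(mic2, mic1, fs, max_tau=d / c)  # positive: mic 2 lags mic 1
azimuth = np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
print(f"estimated delay {tau * 1e3:.3f} ms, azimuth {azimuth:.1f} deg (toward mic 1)")
```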

Beamforming-based methods. These algorithms usually apply phase compensation for every candidate angle to each array element in order to scan the target area, then weight and sum the channel signals, and take the direction with the maximum beam output power as the direction of the target sound source. Common beamforming-based DOA estimation algorithms include the delay-and-sum (DS) algorithm, the minimum variance distortionless response (MVDR) algorithm, and the steered response power with phase transform (SRP-PHAT).
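
As one example from this family, the sketch below computes a narrowband MVDR (Capon) spatial spectrum for a uniform linear array and takes the angle with maximum output power; the array geometry, operating frequency, and the simulated source at +20 degrees are all assumptions.

```python
import numpy as np

# Narrowband MVDR (Capon) spatial spectrum for a uniform linear array.
c, f, d, n_mics = 343.0, 1000.0, 0.05, 6       # illustrative values
angles = np.linspace(-90, 90, 181)

def steering_vector(theta_deg):
    delays = np.arange(n_mics) * d * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * f * delays)

# Simulate snapshots of one narrowband source at +20 degrees plus noise.
rng = np.random.default_rng(0)
n_snap = 200
s = rng.standard_normal(n_snap) + 1j * rng.standard_normal(n_snap)
X = np.outer(steering_vector(20.0), s)
X += 0.1 * (rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape))

R = X @ X.conj().T / n_snap                       # spatial covariance matrix
R_inv = np.linalg.inv(R + 1e-6 * np.eye(n_mics))  # diagonal loading for stability

spectrum = []
for a in angles:
    v = steering_vector(a)
    spectrum.append(1.0 / np.real(v.conj() @ R_inv @ v))
print("MVDR estimate:", angles[int(np.argmax(spectrum))], "degrees")
```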

Methods based on signal subspaces. Such algorithms can generally be divided into incoherent and coherent subspace methods. Among the incoherent subspace algorithms, the most classic is the multiple signal classification (MUSIC) algorithm: the covariance matrix of the received signals is eigendecomposed, the eigenvectors are used to construct the signal subspace and the noise subspace, and the noise subspace is then used to build a high-resolution spatial spectrum. Since a sound source signal is broadband, it can be decomposed into multiple narrowband signals with the Fourier transform; each narrowband component is localized with the MUSIC algorithm, and the narrowband estimates are weighted and combined to obtain a wideband direction estimate. The coherent subspace method instead focuses the narrowband signals onto a reference frequency, so that narrowband subspace processing can be used for direction estimation.
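
A hedged narrowband MUSIC sketch for a uniform linear array follows: the covariance matrix is eigendecomposed, the noise subspace is formed from the eigenvectors of the smallest eigenvalues, and the pseudo-spectrum is scanned over candidate angles. The geometry, frequency, and the single simulated source at -30 degrees are illustrative; a wideband version would repeat this per narrowband bin and combine the results.

```python
import numpy as np

# Narrowband MUSIC pseudo-spectrum for a uniform linear array (one source).
c, f, d, n_mics, n_sources = 343.0, 1000.0, 0.05, 6, 1
angles = np.linspace(-90, 90, 361)

def steering_vector(theta_deg):
    tau = np.arange(n_mics) * d * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * f * tau)

# Simulated covariance of a source at -30 degrees in white noise.
rng = np.random.default_rng(1)
s = rng.standard_normal(500) + 1j * rng.standard_normal(500)
X = np.outer(steering_vector(-30.0), s)
X += 0.1 * (rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape))
R = X @ X.conj().T / X.shape[1]

# Eigendecomposition: the eigenvectors of the smallest eigenvalues span the
# noise subspace, which is (nearly) orthogonal to the true steering vector.
eigvals, eigvecs = np.linalg.eigh(R)
En = eigvecs[:, : n_mics - n_sources]          # noise-subspace eigenvectors

pseudo = []
for a in angles:
    v = steering_vector(a)
    pseudo.append(1.0 / np.real(v.conj() @ En @ En.conj().T @ v))
print("MUSIC estimate:", angles[int(np.argmax(pseudo))], "degrees")
```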

Modal-domain methods. The methods above all operate in the element domain. A major characteristic of the modal domain is that its beam pattern is independent of the frequency of the steering vector. Based on this, beamformers with good low-frequency directivity can be designed, and the number of frequency points that an element-domain beamformer must scan can be reduced. Compared with element-domain processing, modal-domain processing adds a modal expansion step before beamforming. The modal expansion can be realized with a Fourier transform; after expansion, each mode order has a corresponding spatial characteristic beam, i.e., a particular beam response, and these can be viewed as a set of basis functions that are combined into the desired beam response. In theory, as long as the order of the modal expansion is high enough, they can be combined to approximate any beam. Modal-domain methods are currently applied to spherical and circular (ring) arrays with fairly good results.
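
For a uniform circular (ring) array, the modal expansion mentioned above amounts to a spatial DFT across the array elements (phase-mode or circular-harmonic decomposition). The toy sketch below, with a made-up narrowband snapshot, only shows that expansion step, not a full modal-domain beamformer.

```python
import numpy as np

# Phase-mode (modal) expansion of a ring array: a spatial DFT over the elements.
n_mics = 8
phi = 2 * np.pi * np.arange(n_mics) / n_mics     # microphone angles on the ring

# Hypothetical narrowband snapshot on the ring (toy plane-wave-like field).
x = np.exp(1j * 2.0 * np.cos(phi - np.pi / 4))

# Modal expansion: mode m is the m-th circular harmonic coefficient.
modes = np.fft.fft(x) / n_mics
for m, coeff in enumerate(modes):
    print(f"mode {m}: |c| = {abs(coeff):.3f}")
```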

Methods based on machine learning (or deep learning). In contrast to traditional model-based methods, machine-learning-based methods are data-driven and do not even require an explicit propagation model. They treat sound source localization as a multi-class classification or regression problem and use the strong nonlinear fitting ability of the model to map multi-channel data features directly to a localization result. Machine-learning-based methods have mainly developed in two directions, grid-based methods and grid-free methods, which have different strengths in localization accuracy and in estimating the number of sound sources.
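
The toy grid-based sketch below, using synthetic data and a single made-up delay-style feature, shows the general idea of mapping features to a discrete grid of candidate azimuths with a small classifier; real systems use richer multi-channel features and larger networks.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Grid-based DOA as classification over a few candidate azimuths (toy data).
rng = np.random.default_rng(0)
grid = np.array([-60, -30, 0, 30, 60])            # candidate azimuths (degrees)
labels = rng.integers(0, len(grid), size=2000)
# Pretend feature: noisy normalized delay proportional to sin(azimuth).
features = np.sin(np.deg2rad(grid[labels]))[:, None]
features += 0.05 * rng.standard_normal((2000, 1))

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(features, labels)
pred = clf.predict([[np.sin(np.deg2rad(30))]])[0]
print("predicted azimuth:", grid[pred], "degrees")
```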

(2) Distance estimation

Compared with DOA estimation, research on sound source distance estimation started later. Given a delay estimate for a pair of microphones, the sound source is constrained to lie on a hyperbola (a hyperboloid in 3D) whose foci are the two microphones. If several microphone arrays are used to estimate the source DOA, the intersection of the hyperbolas from the different arrays can be used to localize the source. However, this approach is not well suited to long-range distance estimation, and much of the research remains limited to indoor, short-range sound source ranging.
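
A hedged sketch of the hyperbola-intersection idea: several TDOA measurements, each constraining the source to a hyperbola whose foci are a microphone pair, can be combined by nonlinear least squares to estimate the source position. The microphone layout, the true source position, and the noise level below are all made up.

```python
import numpy as np
from scipy.optimize import least_squares

# Localization from TDOAs: intersect the hyperbolic constraints.
c = 343.0
mics = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5]])  # assumed layout (m)
source_true = np.array([2.0, 1.0])                                 # assumed true position

def tdoa(p):
    """Delays of mics 1..3 relative to mic 0 for a source at position p."""
    dist = np.linalg.norm(mics - p, axis=1)
    return (dist[1:] - dist[0]) / c

measured = tdoa(source_true) + 1e-6 * np.random.default_rng(0).standard_normal(3)

sol = least_squares(lambda p: tdoa(p) - measured, x0=np.array([1.0, 1.0]))
print("estimated source position:", sol.x)
```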

Under room conditions, the energy of the reflected sound (the reverberant, diffuse sound field in the room) can be assumed to remain roughly constant, while the energy of the direct sound varies with the distance to the sound source. The ratio of these two energies is called the direct-to-reverberant ratio (DRR), and it is closely related to the estimation of the distance to the sound source. In theory, the DRR of a signal can be computed directly from the room impulse response (RIR) from the sound source to the microphone. In practice, however, distance estimation is affected by many factors (unknown RIRs, mismatch between near-field and far-field models, reverberant energy that itself changes with distance, and so on), so these methods are still immature and cannot yet be applied reliably.
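
Assuming the RIR is available, the DRR can be approximated by splitting the impulse response into a short window around the direct-path peak and the remaining tail, as in the sketch below; the synthetic RIR and the 2.5 ms window are arbitrary illustrative choices.

```python
import numpy as np

# Direct-to-reverberant ratio (DRR) from a (synthetic) room impulse response.
fs = 16000
rng = np.random.default_rng(0)
rir = np.zeros(fs // 2)
rir[800] = 1.0                                   # direct-path peak
t = np.arange(800, rir.size)
rir[800:] += 0.3 * rng.standard_normal(t.size) * np.exp(-(t - 800) / (0.15 * fs))

peak = np.argmax(np.abs(rir))
win = int(0.0025 * fs)                           # +/- 2.5 ms around the direct path
direct_energy = np.sum(rir[max(0, peak - win): peak + win] ** 2)
reverb_energy = np.sum(rir ** 2) - direct_energy
drr_db = 10 * np.log10(direct_energy / (reverb_energy + 1e-12))
print(f"DRR = {drr_db:.1f} dB")
```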

3. Evaluation indicators

For DOA estimation and distance estimation methods, it is necessary to rely on some indicators to measure the performance of sound source localization. The common evaluation indicators are as follows:

Average error. This measures the error of an estimate, usually by comparing the estimated value with the true value and reporting the average difference between them. Concrete variants include the absolute error, the mean squared error, the root mean squared error, and the maximum error.

Accuracy. This indicator is usually used for DOA estimation. If the estimated value is within a certain error range of the true value, the estimate is considered correct; otherwise it is considered wrong. Accuracy measures the proportion of estimates that are correct.
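
The sketch below computes the two metrics above (average error and accuracy within a tolerance) for a few made-up azimuth estimates, using circular wrap-around so that 355 degrees versus 2 degrees counts as a 7-degree error.

```python
import numpy as np

# Made-up DOA estimates and ground-truth azimuths (degrees).
est = np.array([10.0, 95.0, 181.0, 355.0])
ref = np.array([12.0, 90.0, 175.0, 2.0])

err = np.abs((est - ref + 180.0) % 360.0 - 180.0)   # circular angular error
mae = err.mean()
rmse = np.sqrt((err ** 2).mean())
accuracy = (err <= 5.0).mean()                      # correct if within 5 degrees

print(f"MAE={mae:.2f} deg, RMSE={rmse:.2f} deg, accuracy={accuracy:.0%}")
```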

Precision, recall and F1-score. These metrics are common in machine learning classification tasks. For a position where a sound source is actually present, a correct estimate is a true positive and a missed or wrong estimate is a false negative; for a position with no sound source, correctly reporting nothing is a true negative and reporting a source there is a false positive. Recall measures the proportion of all sound sources whose positions are detected correctly; precision measures, among all estimated source positions, the proportion that are correct. Generally, precision and recall are negatively correlated, and the F1-score, the harmonic mean of the two, provides a balance between them.
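
Given hypothetical counts of correctly localized sources (true positives), spurious detections (false positives), and missed sources (false negatives), the three metrics are computed as follows.

```python
# Hypothetical counts: 18 sources localized within tolerance, 4 spurious
# detections, 6 sources missed.
tp, fp, fn = 18, 4, 6
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```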

Number of sources. This indicator measures how many sound sources a method can estimate, regardless of their specific locations.

There are also other performance indicators. For example, when a sound source localization method is used as preprocessing for speech recognition, source separation, or speech pickup, those tasks depend on the quality of the localization, so the performance of sound source localization can be evaluated indirectly through the performance of these downstream tasks.

III. Speech separation and sound source localization: Steered Response Power with Phase Transform (SRP-PHAT) + Degenerate Unmixing Estimation Technique (DUET)

The steered response power with phase transform (SRP-PHAT), a steered response power algorithm weighted by the phase transform, is an important algorithm for locating sound sources. For the multi-source case, the degenerate unmixing estimation technique (DUET) can be used to separate the individual sources, and each separated source can then be passed to the SRP-PHAT algorithm for multi-source tracking.
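
Below is a minimal SRP-PHAT sketch for a single microphone pair scanning an azimuth grid; it is not the implementation from the repository linked below, and the sampling rate, spacing, and the simulated 3-sample delay are assumptions. A multi-source system would first separate the sources (e.g., with DUET) and run such a scan per separated source.

```python
import numpy as np

# SRP-PHAT for one microphone pair over an azimuth grid (illustrative values).
fs, d, c = 16000, 0.1, 343.0
rng = np.random.default_rng(0)
x1 = rng.standard_normal(4096)
x2 = np.roll(x1, 3)                            # mic 2 receives a 3-sample delayed copy of mic 1

n = x1.size
X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
cross = X1 * np.conj(X2)
cross /= np.abs(cross) + 1e-12                 # PHAT weighting: keep only the phase
freqs = np.fft.rfftfreq(n, 1.0 / fs)

angles = np.linspace(-90, 90, 181)
power = []
for a in angles:
    tau = d * np.sin(np.deg2rad(a)) / c        # hypothesized far-field pair delay
    power.append(np.real(np.sum(cross * np.exp(2j * np.pi * freqs * tau))))

print("SRP-PHAT estimate:", angles[int(np.argmax(power))], "degrees")
```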

3D Multiple Sound Sources Localization (SSL)

GitHub - BrownsugarZeer/Multi_SSL: Combine sound source separation with SRP-PHAT to achieve multi-source localization. https://github.com/BrownsugarZeer/Multi_SSL


Origin blog.csdn.net/shadowismine/article/details/128646993