Summary of audio feature extraction methods and tools

This article first appeared in: Walker AI

Most audio features originate from speech recognition tasks. They compactly represent the raw sampled waveform, which accelerates a machine's understanding of the semantic content of audio. Since the late 1990s, these audio features have also been applied to music information retrieval tasks such as instrument recognition, and more features designed specifically for music have emerged.

1. Audio feature categories

Understanding the different categories of audio features is less about assigning each feature to exactly one class than about deepening our understanding of its physical meaning. In general, audio features can be distinguished along the following dimensions:

(1) Whether the feature is extracted directly from the signal by a model, or is a statistic (such as mean or variance) computed over the model's output;

(2) Whether the feature describes a transient state or a global value: transient features are usually computed per frame, while global features cover a longer time span;

(3) The level of abstraction: low-level features are the easiest to extract from the raw audio signal and can be further processed into mid-level features that represent common musical elements in a score, such as pitch and note onset times; high-level features are the most abstract and are mostly used for tasks such as music style and emotion recognition;

(4) The feature extraction process: features extracted directly from the raw signal (such as zero-crossing rate), features obtained after transforming the signal into the frequency domain (such as spectral centroid), features obtained through a specific model (such as melody), and features whose scale is adjusted according to human auditory perception (such as MFCCs).

We use the difference in the feature extraction process as the main classification criterion and list the more common features under each category:

(Figure: common audio features grouped by extraction process)

At the same time, some features do not belong entirely to one category. MFCCs, for example, are obtained by transforming the signal from the time domain to the frequency domain and then filtering it with a Mel-scale filter bank that mimics the human auditory response, so they are both frequency-domain features and perceptual features.
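As a minimal illustration of these categories, the sketch below uses librosa and a hypothetical file name audio.wav to extract one time-domain feature (zero-crossing rate), one frequency-domain feature (spectral centroid), and one perceptual feature (MFCCs), and then reduces a frame-level feature to a global statistic:

  import librosa

  # Load an audio file (hypothetical path); librosa resamples to 22050 Hz by default
  y, sr = librosa.load('audio.wav', sr=22050)

  # Time-domain feature, extracted directly from the raw signal
  zcr = librosa.feature.zero_crossing_rate(y)

  # Frequency-domain feature, computed from each frame's spectrum
  centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

  # Perceptual feature, based on the Mel scale of human hearing
  mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

  # A frame-level ("transient") feature can be reduced to global statistics
  zcr_mean, zcr_var = zcr.mean(), zcr.var()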

2. Common extraction tools

Below is a list of some commonly used tools and platforms for extracting audio features.

Name               URL                                           Language(s)
Aubio              https://aubio.org                             C / Python
Essentia           https://essentia.upf.edu                      C++ / Python
Librosa            https://librosa.org                           Python
Madmom             http://madmom.readthedocs.org                 Python
pyAudioAnalysis    https://github.com/tyiannak/pyAudioAnalysis   Python
Vamp plugins       https://www.vamp-plugins.org                  C++ / Python
Yaafe              http://yaafe.sourceforge.net                  Python / MATLAB

3. Audio signal processing

A digital audio signal is a sequence of numbers representing samples of a continuously varying signal in the time domain, which is commonly referred to as the "waveform". To obtain such a digital signal, the continuous signal must be sampled and quantized.

Sampling is the process of discretizing the signal in time. Uniform sampling takes one sample per fixed time interval, and the number of samples collected per second is called the sampling frequency. The 44.1 kHz and 11 kHz often seen in audio files refer to this sampling frequency.

Quantization converts the continuous amplitude values into discrete numbers. The full amplitude range is first divided into a finite set of quantization steps (with equal or unequal spacing), and every sample value falling within a given step is assigned the same quantized value. The bit depth of an audio file describes the quantization resolution: a 16-bit depth means the amplitude is quantized into 2^16 levels.

The Nyquist sampling theorem states that a signal can be accurately reconstructed from its samples if the sampling frequency is at least twice the highest frequency component in the signal. In practice, the sampling frequency is usually chosen to be significantly higher than this minimum.
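As a sketch of these two steps (with illustrative parameters, not values from the text), the snippet below generates a 440 Hz sine wave, samples it at 44.1 kHz, and quantizes it to 16-bit integers, i.e. 2^16 amplitude levels:

  import numpy as np

  sr = 44100                                   # sampling frequency: 44.1 kHz
  t = np.arange(sr) / sr                       # one second of discrete sample times (sampling)
  x = 0.8 * np.sin(2 * np.pi * 440.0 * t)      # 440 Hz sine wave, amplitude within [-1, 1]

  # Quantization: map continuous amplitudes onto 2**16 equally spaced levels (16-bit depth)
  x_int16 = np.round(x * 32767).astype(np.int16)

  # By the Nyquist criterion, sr = 44100 can represent components up to 22050 Hz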

4. Common transformations

4.1 Short-time Fourier transform

The Short-Time Fourier Transform (STFT) is suited to spectral analysis of slowly time-varying signals and is widely used in audio and image analysis and processing. The method is to first divide the signal into frames and then apply a Fourier transform to each frame. Each frame of a speech signal can be regarded as a segment cut from a stationary waveform, so the short-time spectrum of each frame approximates the spectrum of that stationary waveform.

Because a speech signal is approximately stationary over short intervals, it can be divided into frames and the Fourier transform computed frame by frame; the result is the short-time Fourier transform.

The Fourier transform (computed efficiently with the FFT) converts a signal from the time domain to the frequency domain, and the inverse Fourier transform (IFFT) converts it back; transforming from the time domain to the frequency domain is the most common operation in audio signal processing. The time-frequency representation obtained by the STFT is called a spectrogram.

(Figure: spectrogram obtained via the STFT)
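A minimal sketch of computing an STFT spectrogram with librosa follows; the frame length, hop length, and the file name audio.wav are illustrative assumptions:

  import numpy as np
  import librosa

  y, sr = librosa.load('audio.wav', sr=22050)      # hypothetical input file

  # Frame the signal (2048-sample windows, 512-sample hop) and Fourier-transform each frame
  D = librosa.stft(y, n_fft=2048, hop_length=512)

  # The magnitude in dB is the spectrogram that is usually plotted
  S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)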

4.2 Discrete Cosine Transform

The Discrete Cosine Transform (DCT) is a transform related to the Fourier transform and is similar to the Discrete Fourier Transform (DFT), but uses only real numbers. The DCT is equivalent to a DFT of roughly twice the length applied to a real, even-symmetric function (the Fourier transform of a real even function is itself real and even); in some variants, the input or output is shifted by half a sample.
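This equivalence can be checked numerically. The sketch below uses the type-I DCT from SciPy, which matches the DFT of an even-symmetric extension of length 2(N-1):

  import numpy as np
  from scipy.fft import dct

  x = np.random.rand(8)

  # Even-symmetric extension of length 2*(N-1): [x0 ... x7, x6 ... x1]
  ext = np.concatenate([x, x[-2:0:-1]])

  # The DCT-I of x equals the (real-valued) DFT of the extended signal
  print(np.allclose(dct(x, type=1), np.fft.fft(ext)[:len(x)].real))   # True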

4.3 Discrete wavelet transform

The Discrete Wavelet Transform (DWT) is very useful in numerical analysis and time-frequency analysis. It is obtained by discretizing the scale and translation parameters of the basic (mother) wavelet.
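A minimal sketch of a single-level DWT, assuming the PyWavelets package (not one of the tools listed above) is available:

  import numpy as np
  import pywt

  x = np.random.rand(1024)

  # One DWT level with the Daubechies-4 wavelet:
  # cA holds the low-frequency approximation, cD the high-frequency detail
  cA, cD = pywt.dwt(x, 'db4')

  # The original signal is recovered (up to floating-point error) by the inverse DWT
  x_rec = pywt.idwt(cA, cD, 'db4')
  print(np.allclose(x, x_rec))   # True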

4.4 Mel spectrum and Mel cepstrum

A spectrogram is often a very large matrix. To obtain sound features of a suitable size, the spectrogram is commonly passed through Mel-scale filter banks, which produces the Mel spectrum.

The human ear's perception of pitch is roughly linear in the logarithm of a sound's fundamental frequency. On the Mel scale, if the Mel frequencies of two sounds differ by a factor of two, the pitch perceived by the human ear also differs by roughly a factor of two. At low frequencies, Mel values change quickly with Hz; at high frequencies, they rise slowly and the slope of the curve is small. This reflects the fact that the human ear is more sensitive to low-frequency tones and much duller at high frequencies, which is the observation that inspired the Mel-scale filter bank.
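One common formula for the Mel scale (the HTK-style variant also available in librosa) makes this relationship concrete; it is a sketch of one definition, and other variants exist:

  import numpy as np

  def hz_to_mel(f):
      # HTK-style formula: roughly linear below ~1 kHz, logarithmic above
      return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

  # The same 100 Hz step spans far more Mels at low frequency than at high frequency
  print(hz_to_mel(200.0) - hz_to_mel(100.0))     # ~133 mel
  print(hz_to_mel(8100.0) - hz_to_mel(8000.0))   # ~13 mel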

The Mel-scale filter bank consists of multiple triangular filters. The filters are dense and have high amplitude at low frequencies, and are sparse with low amplitude at high frequencies, matching the fact that the ear becomes duller as frequency increases. The equal-area form (Mel filter bank with the same band area, see the figures below) is widely used for human voice applications (speech recognition, speaker recognition), but outside the voice domain it loses a lot of high-frequency information; in that case the Mel filter bank with the same band height may be preferable.

(Figure: Mel filter bank with equal area)
(Figure: Mel filter bank with equal height)
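In librosa, the two variants correspond roughly to the norm argument of librosa.filters.mel; the sketch below builds both filter banks (parameters are illustrative):

  import librosa

  # Area-normalized ("Slaney") filters: narrow, tall triangles at low frequency
  mel_area = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=40, norm='slaney')

  # Unnormalized filters: every triangle has the same peak height
  mel_height = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=40, norm=None)

  print(mel_area.shape)   # (40, 1025): n_mels x (1 + n_fft // 2)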

Mel spectrogram implementation in librosa (simplified excerpt):

  import numpy as np
  from librosa import filters
  # _spectrogram is librosa's internal helper that computes (or passes through)
  # a magnitude/power spectrogram for the given signal
  from librosa.core.spectrum import _spectrogram

  def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,
                     power=2.0, **kwargs):
      # Compute the power spectrogram (or reuse a precomputed one)
      S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)

      # Build a Mel filter bank and project the spectrogram onto the Mel bands
      mel_basis = filters.mel(sr, n_fft, **kwargs)

      return np.dot(mel_basis, S)

Cepstral analysis of the Mel spectrum (taking the logarithm and then applying a DCT) yields the Mel cepstrum, i.e. the Mel-frequency cepstral coefficients (MFCCs).

  # -- Mel spectrogram and MFCCs (simplified excerpt from librosa) -- #
  import scipy.fftpack
  from librosa import power_to_db

  def mfcc(y=None, sr=22050, S=None, n_mfcc=20, dct_type=2, norm='ortho', **kwargs):
      if S is None:
          # Log-scaled (dB) Mel spectrogram
          S = power_to_db(melspectrogram(y=y, sr=sr, **kwargs))

      # DCT along the frequency axis; keep the first n_mfcc coefficients
      return scipy.fftpack.dct(S, axis=0, type=dct_type, norm=norm)[:n_mfcc]
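In practice, one would normally call the public librosa API rather than this simplified excerpt; a usage sketch with a hypothetical file name:

  import librosa

  y, sr = librosa.load('audio.wav', sr=22050)            # hypothetical input file
  mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
  print(mfccs.shape)                                     # (20, number of frames)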

4.5 Constant-Q transform

In music, all notes come from the twelve-tone equal temperament repeated over several octaves, corresponding to the twelve semitones within one octave on a piano. The frequency ratio between adjacent semitones is 2^(1/12), and a note one octave higher has exactly twice the frequency of the same note an octave lower. Musical pitches are therefore distributed exponentially in frequency, whereas the spectrum obtained from the Fourier transform is linearly spaced, so its frequency bins cannot be put into one-to-one correspondence with note frequencies, which introduces errors in the estimated frequencies of some notes. For this reason, modern analysis of musical audio generally uses a time-frequency transform whose bins follow the same exponential distribution: the Constant-Q Transform (CQT).

The CQT uses a filter bank whose center frequencies are exponentially spaced and whose bandwidths differ, while the ratio of center frequency to bandwidth remains a constant Q. It differs from the Fourier transform in that the frequency axis of its spectrum is not linear but logarithmic (base 2), and the length of the filter window varies with frequency to obtain better performance. Because the CQT bins and the musical scale frequencies follow the same distribution, computing the CQT spectrum of a music signal directly yields the amplitude of the signal at each note frequency.

(Figure: CQT spectrum)
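A minimal sketch of computing a CQT with librosa, using 12 bins per octave so that each bin aligns with one semitone (the file name and parameters are illustrative):

  import numpy as np
  import librosa

  y, sr = librosa.load('audio.wav', sr=22050)      # hypothetical input file

  # 84 bins at 12 bins per octave = 7 octaves, one bin per semitone
  C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)
  C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)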

