Understanding the mel spectrogram

Introduction

Speech processing often relies on the mel spectrogram. For example, in speech classification the signal is typically converted into a spectrogram image, and an image-classification algorithm (such as a CNN) is then used to classify the speech. This article explains what a mel spectrogram is and how to compute both a spectrogram and a mel spectrogram with librosa.

The signal

When we say a signal is so many hertz, we mean how many sample values it contains per second. A 44.1 kHz sound has 44100 sample values per second.

Read sound:

import librosa
import matplotlib.pyplot as plt
%matplotlib inline

y, sr = librosa.load('./sample.wav')

plt.plot(y)
plt.title('Signal')
plt.xlabel('Samples')
plt.ylabel('Amplitude')
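As a sanity check of the sample-rate idea, a signal's duration is simply its number of samples divided by the sample rate. A minimal sketch with a synthetic sine wave (the 440 Hz tone and 3-second duration are arbitrary choices for illustration; 22050 Hz is librosa's default resampling rate):

```python
import numpy as np

sr = 22050                           # samples per second
t = np.arange(3 * sr) / sr           # one time value per sample, 3 seconds total
y = 0.5 * np.sin(2 * np.pi * 440 * t)  # a 440 Hz sine tone

print(len(y))       # 66150 samples
print(len(y) / sr)  # 3.0 seconds
```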


Fourier Transform

Any signal can be regarded as a superposition of sine and cosine components of different frequencies, and the Fast Fourier Transform (FFT) decomposes a signal into those frequency components:

import numpy as np

n_fft = 2048
# FFT of the first n_fft samples; hop_length > n_fft yields a single frame
ft = np.abs(librosa.stft(y[:n_fft], n_fft=n_fft, hop_length=n_fft + 1))

plt.plot(ft[:, 0])
plt.title('Spectrum')
plt.xlabel('Frequency Bin')
plt.ylabel('Amplitude')


Short Time Fourier Transform

The frequency content of a sound can change over time, so for long signals it is not appropriate to apply the FFT to the entire signal at once. Instead, the short-time Fourier transform (STFT) divides the signal into many short segments and performs an FFT on each one.

The window length is the length of each segment, and the hop length is the stride between consecutive segments. The resulting STFT stacks the FFTs of these segments, so each segment carries amplitude and frequency information as well as the time at which it occurs. Plotting all of this information together yields the spectrogram.
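The slicing described above can be sketched directly in NumPy. This is a simplified version of what librosa.stft does internally (no centering or padding, and a Hann window is assumed):

```python
import numpy as np

def stft_frames(y, n_fft=2048, hop_length=512):
    """Minimal STFT sketch: slice the signal into overlapping windows,
    apply a Hann window, and FFT each segment (no centering/padding,
    unlike librosa.stft, which pads the signal by default)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([window * y[i * hop_length : i * hop_length + n_fft]
                       for i in range(n_frames)])
    # rows = frequency bins, columns = time frames
    return np.abs(np.fft.rfft(frames, axis=1)).T

y = np.random.randn(22050)  # one second of noise at 22050 Hz
S = stft_frames(y)
print(S.shape)              # (1025, 40): n_fft//2 + 1 bins, 40 frames
```

Each column of `S` is the spectrum of one segment, so the second dimension is the time axis of the spectrogram.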

Spectrogram

Since humans perceive both loudness and pitch roughly logarithmically, the amplitudes obtained from the FFT are converted to decibels (a log operation) and the frequency axis is displayed on a log scale, which compresses the high-amplitude and high-frequency parts:

import librosa.display

spec = np.abs(librosa.stft(y, hop_length=512))
spec = librosa.amplitude_to_db(spec, ref=np.max)  # convert amplitude to decibels (log operation)

librosa.display.specshow(spec, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')


Colors represent decibels. 

The Mel Scale

Humans are more sensitive to differences between low-frequency signals: you can easily tell a 500 Hz sound from a 1000 Hz one, but distinguishing 9000 Hz from 9500 Hz is much harder, even though the physical gap is the same. The mel scale applies a nonlinear mapping to frequency to account for this.

After this mapping, the numerical difference between two frequencies reflects how different they sound to a human listener.
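A commonly used version of this mapping is the HTK-style mel formula m = 2595 · log10(1 + f / 700) (librosa exposes it as librosa.hz_to_mel with htk=True). A quick sketch showing how it compresses high frequencies:

```python
import numpy as np

def hz_to_mel_htk(f):
    # HTK-style mel formula: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

low = hz_to_mel_htk(1000) - hz_to_mel_htk(500)    # ~393 mels apart
high = hz_to_mel_htk(9500) - hz_to_mel_htk(9000)  # ~57 mels apart
print(round(low), round(high))                    # 393 57
```

The same 500 Hz physical gap spans far fewer mels at the high end of the spectrum, matching the perceptual observation above.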

Mel spectrogram 

The difference between a mel spectrogram and an ordinary spectrogram is that its frequency axis uses mel-scaled frequencies (you can picture the whole spectrogram being compressed downward):

mel_spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024)
mel_spect = librosa.power_to_db(mel_spect, ref=np.max)  # convert power to decibels

librosa.display.specshow(mel_spect, sr=sr, y_axis='mel', fmax=8000, x_axis='time')
plt.title('Mel Spectrogram')
plt.colorbar(format='%+2.0f dB')


Summary

Spectrograms and mel spectrograms are easy to compute with the librosa library. As for which representation works better (there is also the MFCC), that depends on your own experimental results.



Origin blog.csdn.net/bo17244504/article/details/124707265