The key relation to understand: number of mel frames * frame shift = audio length in samples (dividing the number of samples by the sampling rate gives the audio duration in seconds).
So at a sampling rate of 22050 Hz with hop_size set to 256, the corresponding mel-spectrogram has to be upsampled by a factor of 256 to reach the audio length.
What about a sampling rate of 16000 Hz? With a frame length of 50 ms and a frame shift of 12.5 ms, the hop_size is 200 (16000*12.5/1000=200), so the upsampling factor is 200.
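The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names hop_size_from_ms and audio_length are hypothetical, and the frame count of 100 is an arbitrary example value. It also assumes the frame count times hop_size matches the audio length exactly, which in practice depends on how the feature extractor pads the signal.

```python
def hop_size_from_ms(sample_rate, frame_shift_ms):
    """Convert a frame shift in milliseconds to a hop size in samples."""
    return round(sample_rate * frame_shift_ms / 1000)


def audio_length(n_mel_frames, hop_size, sample_rate):
    """Return (length in samples, duration in seconds) for a mel-spectrogram."""
    n_samples = n_mel_frames * hop_size
    return n_samples, n_samples / sample_rate


# 16000 Hz with a 12.5 ms frame shift: 16000 * 12.5 / 1000 = 200 samples
hop_16k = hop_size_from_ms(16000, 12.5)

# 22050 Hz with hop_size 256: 100 mel frames (arbitrary example) cover 25600 samples
samples, seconds = audio_length(100, 256, 22050)

print(hop_16k, samples, seconds)
```

The hop size is also the upsampling factor a vocoder needs, since each mel frame has to expand into hop_size audio samples.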
1. Sampling rate (sampling frequency)
The sampling rate is the number of samples taken per second. Its symbol is f_s and its unit is Hz. The higher the sampling rate, the closer the shape of the digital waveform is to the original analog waveform, and the more faithful the sound reproduction will be.
According to the Nyquist–Shannon sampling theorem, the original signal can be perfectly reconstructed only when the sampling frequency is more than twice the highest frequency in the original analog signal. Commonly used sampling rates are shown in the figure below.
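As a rough illustration of the sampling theorem, the check below tests whether a given sampling rate is high enough for a signal with a known highest frequency. The function names nyquist_frequency and can_represent are hypothetical, and the 8000 Hz figure is just an example value.

```python
def nyquist_frequency(sample_rate):
    """Half the sampling rate: the upper bound on representable frequencies."""
    return sample_rate / 2


def can_represent(sample_rate, highest_freq_hz):
    """True if the sampling rate is more than twice the highest signal frequency."""
    return sample_rate > 2 * highest_freq_hz


# Example: a signal whose highest frequency is 8000 Hz
print(nyquist_frequency(22050))      # half of 22050 Hz
print(can_represent(22050, 8000))    # 22050 > 16000
print(can_represent(16000, 8000))    # 16000 is not strictly greater than 16000
```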
2. Frame length
3. Frame shift
4. hop_size
5. nb_samples
nb_samples is the number of samples in one frame of audio data, nb_sample