I have recently been evaluating open source speech recognition models, starting with Baidu's PaddlePaddle.
Before testing, it helps to understand some basics of audio analysis, hence this introductory article.

The audio libraries I looked at mainly include:

soundfile
ffmpy
librosa

1 librosa

Installation code:

!pip install librosa  -i https://mirror.baidu.com/pypi/simple
!pip install soundfile  -i https://mirror.baidu.com/pypi/simple

Reference documentation: librosa

1.1 Audio reading

Document location: https://librosa.org/doc/latest/core.html#audio-loading

signal, sr = librosa.load(path, sr=None)

The parameters of load include:

librosa.load(path, *, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')

With sr=None the original sampling rate is kept; if any other rate is given, the audio is resampled, which takes some extra time.
librosa.load can read both .wav and .mp3.
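
The effect of mono, offset and duration can be pictured with plain NumPy. This is only a sketch of the documented semantics (offset and duration are given in seconds, mono averages the channels), not librosa's actual implementation; the helper name trim_like_load is made up for illustration.

```python
import numpy as np

def trim_like_load(y, sr, mono=True, offset=0.0, duration=None):
    """Mimic the mono/offset/duration semantics on an array shaped (channels, samples)."""
    y = np.asarray(y, dtype=np.float32)
    if mono and y.ndim > 1:
        y = y.mean(axis=0)              # average the channels into one
    start = int(round(offset * sr))     # seconds -> sample index
    end = None if duration is None else start + int(round(duration * sr))
    return y[..., start:end]

sr = 8000
stereo = np.stack([np.ones(sr * 3), np.zeros(sr * 3)])   # 3 s, 2 channels
clip = trim_like_load(stereo, sr, offset=1.0, duration=0.5)
print(clip.shape, clip[0])
```

Skipping one second and keeping half a second at 8000 Hz leaves 4000 samples, each the average of the two channels.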

1.2 Audio write out

Several articles online cover this, such as "python audio sampling rate conversion" and "python audio file sampling rate conversion", but their code fails when exporting audio files. Their code is reproduced below.

Code snippet one:

def resample_rate(path,new_sample_rate = 16000):

    signal, sr = librosa.load(path, sr=None)
    wavfile = path.split('/')[-1]
    wavfile = wavfile.split('.')[0]
    file_name = wavfile + '_new.wav'
    new_signal = librosa.resample(signal, sr, new_sample_rate) # 
    librosa.output.write_wav(file_name, new_signal , new_sample_rate)

Code snippet two:

import librosa
import os

noise_name="/media/dfy/fc0b6513-c379-4548-b391-876575f1493f/home/dfy/PycharmProjects/noise_data/"
noise_name_list=os.listdir(noise_name)

for one_name in noise_name_list:

    data=librosa.load(noise_name+one_name,16000)
    librosa.output.write_wav(noise_name+one_name,data[0],16000,norm=False)

if __name__ == '__main__':
    pass

Both snippets use librosa.output to export, but recent versions of librosa have removed this module, so an error occurs:

AttributeError: module librosa has no attribute output

Related article: No module named numba.decorators error resolution

librosa 0.8.0 removed the output API, so either downgrade librosa (for example to 0.7.2) or use another method.

According to the official documentation, the way librosa recommends writing audio is with this library: PySoundFile

1.3 librosa read in + PySoundFile write out

If an error occurs:

Input audio file has sample rate [44100], but decoder expects [16000]

This just means the audio sampling rate is wrong and needs to be changed.
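
One way to catch this before calling the recognizer is to read the rate from the WAV header first. A minimal sketch with the standard-library wave module; the file name and the 16000 Hz target are assumptions for illustration.

```python
import wave

def wav_sample_rate(path):
    """Return the sample rate stored in a WAV file's header."""
    with wave.open(path, 'rb') as fp:
        return fp.getframerate()

# Create a short silent 44100 Hz file so the example is self-contained.
demo_path = 'demo_44100.wav'   # hypothetical file name
with wave.open(demo_path, 'wb') as fp:
    fp.setnchannels(1)
    fp.setsampwidth(2)                     # 16-bit samples
    fp.setframerate(44100)
    fp.writeframes(b'\x00\x00' * 4410)     # 0.1 s of silence

expected_rate = 16000                      # rate the decoder expects (assumed)
actual_rate = wav_sample_rate(demo_path)
needs_resampling = actual_rate != expected_rate
print(actual_rate, needs_resampling)
```

If needs_resampling is true, the file should be resampled before it is passed on.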

Combining the two libraries above and adapting the code from the two articles, I arrived at the following function for converting the audio sampling rate:

import librosa
import os
import numpy as np
import soundfile as sf

def resample_rate(path, new_sample_rate=16000):
    # Load at the file's native sampling rate
    signal, sr = librosa.load(path, sr=None)
    # Build the output name from the input name, e.g. xxx.wav -> xxx_new.wav
    wavfile = os.path.basename(path).split('.')[0]
    file_name = wavfile + '_new.wav'
    # Newer librosa versions require orig_sr/target_sr as keyword arguments
    new_signal = librosa.resample(signal, orig_sr=sr, target_sr=new_sample_rate)
    # librosa.output.write_wav was removed; write with soundfile instead
    sf.write(file_name, new_signal, new_sample_rate, subtype='PCM_24')
    print(f'{file_name} has been saved.')

wav_file = 'video/xxx.wav'
resample_rate(wav_file, new_sample_rate=16000)

This produces an audio file with a sample rate of 16000.

1.4 Convert from other libraries to librosa format

Reference: https://librosa.org/doc/latest/generated/librosa.load.html#librosa.load

The first:

# Load using an already open SoundFile object
import soundfile
sfo = soundfile.SoundFile(librosa.ex('brahms'))
y, sr = librosa.load(sfo)

The second:

# Load using an already open audioread object
import audioread.ffdec  # Use ffmpeg decoder
aro = audioread.ffdec.FFmpegAudioFile(librosa.ex('brahms'))
y, sr = librosa.load(aro)

2 PySoundFile

python-soundfile is an audio library based on libsndfile, CFFI and NumPy.

You can directly use the functions read() and write() to read and write sound files. To read a sound file in blocks, use blocks(). Alternatively, sound files can also be opened as SoundFile objects.

Official documentation of PySoundFile: readthedocs

Installation:

!pip install soundfile  -i https://mirror.baidu.com/pypi/simple

2.1 Read in audio

Read files from a zip archive:

import zipfile as zf
import soundfile as sf
import io

with zf.ZipFile('test.zip') as myzip:
    with myzip.open('stereo_file.wav') as myfile:
        tmp = io.BytesIO(myfile.read())
        data, samplerate = sf.read(tmp)

Download and read from URL:

import soundfile as sf
import io
from urllib.request import urlopen  # standard library in Python 3; six is not needed
url = "https://raw.githubusercontent.com/librosa/librosa/master/tests/data/test1_44100.wav"
data, samplerate = sf.read(io.BytesIO(urlopen(url).read()))

2.2 Export audio

To export audio:

import numpy as np
import soundfile as sf

samplerate = 44100
data = np.random.uniform(-1, 1, size=(samplerate * 10, 2))  # 10 s of stereo noise

# Write out audio as 24bit PCM WAV
sf.write('stereo_file.wav', data, samplerate, subtype='PCM_24')

# Write out audio as 24bit Flac
sf.write('stereo_file.flac', data, samplerate, format='flac', subtype='PCM_24')

# Write out audio as Ogg Vorbis
sf.write('stereo_file.ogg', data, samplerate, format='ogg', subtype='vorbis')

3 ffmpy

Reference article: Python batch conversion of video and audio sampling rates (with code) | Python tool

Installation:

pip install ffmpy -i https://pypi.douban.com/simple

See the original article for the full code; only one excerpt is shown here:

import os
from ffmpy import FFmpeg

def transfor(video_path: str, tmp_dir: str, result_dir: str):
    file_name = os.path.basename(video_path)
    base_name = file_name.split('.')[0]
    file_ext = file_name.split('.')[-1]
    ext = 'wav'

    # Step 1: extract the audio track as mono 16 kHz WAV
    audio_path = os.path.join(tmp_dir, '{}.{}'.format(base_name, ext))
    print('File name: {}, extracting audio'.format(audio_path))
    ff = FFmpeg(
        inputs={
            video_path: None}, outputs={
            audio_path: '-f {} -vn -ac 1 -ar 16000 -y'.format('wav')})
    print(ff.cmd)
    ff.run()
 
    if os.path.exists(audio_path) is False:
        return None
 
    video_tmp_path = os.path.join(
        tmp_dir, '{}_1.{}'.format(
            base_name, file_ext))
    ff_video = FFmpeg(inputs={video_path: None},
                      outputs={video_tmp_path: '-an'})
    print(ff_video.cmd)
    ff_video.run()
 
    result_video_path = os.path.join(result_dir, file_name)
    ff_fuse = FFmpeg(inputs={video_tmp_path: None, audio_path: None}, outputs={
        result_video_path: '-map 0:v -map 1:a -c:v copy -c:a aac -shortest'})
    print(ff_fuse.cmd)
    ff_fuse.run()
    return result_video_path

4 AudioSegment / pydub

Reference article:
Python | Speech Processing | Comparison of reading audio files with librosa / AudioSegment / soundfile

Another introduction to the parameters of pydub:
a brief introduction to pydub

Official website address: pydub

import time
import numpy as np
from pydub import AudioSegment  # third-party library; install it before first use

audio_path = './data/example.mp3'

t = time.time()
song = AudioSegment.from_file(audio_path, format='mp3')
# print(len(song))          # duration in milliseconds
# print(song.frame_rate)    # sampling rate in Hz
# print(song.sample_width)  # sample width (quantization) in bytes
# print(song.channels)      # number of channels; MP3s are commonly stereo, and more channels mean larger files
wav = np.array(song.get_array_of_samples())
sr = song.frame_rate
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav

The output is:

sr=16000, len=64320, elapsed: 0.04667925834655762
(min, max, mean) = (-872, 740, -0.6079446517412935)
array([ 1, -1, -2, ..., -1,  1, -2], dtype=int16)

5 paddleaudio

Install:

! pip install paddleaudio -i https://mirror.baidu.com/pypi/simple

paddleaudio is one of Paddle's official packages; its basic audio operations appear to be based on librosa.
For details, see:
https://paddleaudio-doc.readthedocs.io/en/latest/index.html

import paddleaudio
audio_file = 'XXX.wav'
paddleaudio.load(audio_file, sr=None, mono=True, normal=False)

Output:

(array([-3.9100647e-04, -3.0159950e-05,  1.1110306e-04, ...,
         1.4603138e-04,  2.5625229e-03, -7.6780319e-03], dtype=float32),
 16000)

That is, the audio samples plus the sample rate.

6 Audio segmentation - auditok

The reference is: [Super Simple] Building Personal Voice Dictation Service Based on PaddleSpeech

!pip install auditok

Segmentation is needed because PaddleSpeech can recognize at most 50 s of speech at a time, so longer audio has to be split first. auditok can be called directly for this.

from paddlespeech.cli.asr.infer import ASRExecutor
import csv
import moviepy.editor as mp
import auditok
import os
import paddle
from paddlespeech.cli import ASRExecutor, TextExecutor
import soundfile
import librosa
import warnings

warnings.filterwarnings('ignore')

# Split the audio with auditok (imported above); the input type is audio
def qiefen(path, ty='audio', mmin_dur=1, mmax_dur=100000, mmax_silence=1, menergy_threshold=55):
    audio_file = path
    audio, audio_sample_rate = soundfile.read(
        audio_file, dtype="int16", always_2d=True)

    audio_regions = auditok.split(
        audio_file,
        min_dur=mmin_dur,  # minimum duration of a valid audio event in seconds
        max_dur=mmax_dur,  # maximum duration of an event
        # maximum duration of tolerated continuous silence within an event
        max_silence=mmax_silence,
        energy_threshold=menergy_threshold  # threshold of detection
    )

    for i, r in enumerate(audio_regions):
        # Regions returned by `split` have 'start' and 'end' metadata fields
        print(
            "Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))

        epath = ''
        file_pre = str(epath.join(audio_file.split('.')[0].split('/')[-1]))

        mk = 'change'
        if (os.path.exists(mk) == False):
            os.mkdir(mk)
        if (os.path.exists(mk + '/' + ty) == False):
            os.mkdir(mk + '/' + ty)
        if (os.path.exists(mk + '/' + ty + '/' + file_pre) == False):
            os.mkdir(mk + '/' + ty + '/' + file_pre)
        num = i
        # Zero-pad the index so the files sort by a three-digit number
        s = '000000' + str(num)

        file_save = mk + '/' + ty + '/' + file_pre + '/' + \
                    s[-3:] + '-' + '{meta.start:.3f}-{meta.end:.3f}' + '.wav'
        filename = r.save(file_save)
        print("region saved as: {}".format(filename))
    return mk + '/' + ty + '/' + file_pre

The core call is auditok.split; its parameters are detailed in auditok.core.split. Note that the input here is the audio file name, not the audio data.
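
auditok splits on silence, so region lengths depend on the audio. Where a hard upper bound is needed, such as the 50 s limit mentioned above, a fixed-length fallback is easy to sketch. This is a different and much simpler technique than auditok.split, and the helper name chunk_bounds is hypothetical.

```python
def chunk_bounds(n_samples, sr, max_seconds=50):
    """Return (start, end) sample indices of consecutive chunks of at most max_seconds."""
    step = int(max_seconds * sr)
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]

sr = 16000
bounds = chunk_bounds(n_samples=sr * 120, sr=sr)   # a 120 s recording
for start, end in bounds:
    print(f'{start / sr:.1f}s -- {end / sr:.1f}s')
```

Fixed-length cuts can fall in the middle of a word, which is why silence-based splitting is usually the better first choice.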

7 A more difficult error to solve

AudioParameterError: Sample width must be one of: 1, 2 or 4 (bytes)

I ran into the above error when running the speech recognition model,
and searching online did not turn up a working solution.
Just as I was about to give up, I came across a handy feature of the AudioSegment library.

What is sample width?
Reference: Sampling and quantization bit width (sampwidth)

import wave
file ='asr_example.wav'
with wave.open(file) as fp:
    channels = fp.getnchannels()
    srate = fp.getframerate()
    swidth = fp.getsampwidth()
    data = fp.readframes(-1)
swidth,srate

Several important parameters of an audio file can be queried through wave.
They are:

nchannels: number of channels
sampwidth: sample width in bytes
framerate: sampling frequency
nframes: number of audio frames
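
To make these fields concrete, the sketch below writes a short 24-bit WAV with the standard-library wave module and reads the parameters back. A 24-bit file (for example one written with subtype='PCM_24') has a sample width of 3 bytes, which is outside the 1, 2 or 4 bytes the error message asks for. The file name is hypothetical.

```python
import wave

path = 'demo_24bit.wav'   # hypothetical file name
with wave.open(path, 'wb') as fp:
    fp.setnchannels(1)
    fp.setsampwidth(3)                        # 3 bytes = 24-bit samples
    fp.setframerate(16000)
    fp.writeframes(b'\x00\x00\x00' * 8000)    # 0.5 s of silence

with wave.open(path, 'rb') as fp:
    params = fp.getparams()

duration = params.nframes / params.framerate  # length in seconds
bit_depth = params.sampwidth * 8              # bytes -> bits
width_ok = params.sampwidth in (1, 2, 4)      # widths named in the error message
print(params.nchannels, params.sampwidth, params.framerate, params.nframes)
print(duration, bit_depth, width_ok)
```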

If you encounter the above error, the sample width needs to be adjusted. The AudioSegment library can do this directly:

from pydub import AudioSegment


file_in = 'asr_example.wav'     # input audio file name
file_out = 'asr_example_3.wav'  # output audio file name

sound = AudioSegment.from_file(file_in)
sound = sound.set_frame_rate(48000)  # optionally change the sampling rate
sound = sound.set_sample_width(4)    # reset the sample width in bytes
sound.export(file_out, format="wav")

This resolves the error.

8 Download audio from URL

Several read-in methods:

8.1 soundfile

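import io
from urllib.request import urlopen

import soundfile as sf

def save_audio_func(video_url, save_samplerate=16000):
    '''
    Export audio downloaded from a URL
    '''
    save_name = video_url.split('/')[-1]

    data, samplerate = sf.read(io.BytesIO(urlopen(video_url).read()))
    # Write out audio as 24bit PCM WAV.
    # Note: sf.write does not resample; save_samplerate is only stored in the
    # header, so it should match samplerate unless data is resampled first.
    sf.write(save_name, data, save_samplerate, subtype='PCM_24')
    return save_name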

Both reading and writing go through soundfile.
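
One caveat when the rate passed at write time differs from the rate of the data: the value is only stored as metadata, so writing 44100 Hz samples with a 16000 Hz label stretches the playback time instead of resampling. The NumPy sketch below contrasts relabelling with a crude linear-interpolation resample; np.interp stands in here for a proper band-limited resampler such as librosa.resample.

```python
import numpy as np

orig_sr, target_sr = 44100, 16000
t = np.arange(orig_sr) / orig_sr                 # timestamps for 1 second
data = np.sin(2 * np.pi * 440 * t)               # 440 Hz tone, 44100 samples

# Relabelling: same samples, different rate label, so the duration changes.
relabelled_duration = len(data) / target_sr

# Resampling: new samples are computed so the duration is preserved.
n_out = int(round(len(data) * target_sr / orig_sr))
t_out = np.arange(n_out) / target_sr
resampled = np.interp(t_out, t, data)
resampled_duration = len(resampled) / target_sr

print(relabelled_duration, resampled_duration)
```

The relabelled signal would play for about 2.76 s at a lower pitch, while the resampled one still lasts 1 s.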

9 How to read mp3

Reference: https://blog.csdn.net/qq_37851620/article/details/127149729

soundfile.read:
can only read .wav, not .mp3 (according to the referenced comparison);
default dtype = 'float64', output data between (-1, 1) (normalized by 32768); with dtype = 'int16', the output is between (-2**15, 2**15-1);
keeps the original sampling frequency.

librosa.load:
can read .wav and .mp3;
the output is between (-1, 1);
sr=None retains the original sampling frequency; if another sampling frequency is set, resampling is performed, which is a bit time-consuming.

pydub.AudioSegment.from_file:
can read .wav and .mp3;
the output is between (-2**15, 2**15-1); manually dividing by 32768 (=2**15) gives the same result as librosa.load;
keeps the original sampling frequency; resampling can be done with librosa.resample.
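
The relationship between the two value ranges is a plain scaling by 2**15. A minimal NumPy sketch, with a made-up sample array standing in for decoded audio:

```python
import numpy as np

pcm = np.array([-32768, -872, 0, 740, 32767], dtype=np.int16)   # made-up int16 samples

# int16 -> float32 in [-1, 1): the scaling described in the list above
as_float = pcm.astype(np.float32) / 32768.0

# float -> int16: scale back, then clip to the valid int16 range
back = np.clip(np.round(as_float * 32768.0), -32768, 32767).astype(np.int16)

print(as_float.min(), as_float.max())
print(np.array_equal(back, pcm))
```

The clip step matters when the float data reaches exactly 1.0, which would otherwise overflow the int16 range.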

Speech Recognition Series︱Audio Analysis with Python (1)