I have recently been choosing an open-source speech recognition model, and the first one I tested is Baidu's PaddlePaddle.
Before testing, it helps to understand some basic technical points of audio processing, hence this preliminary article.
The audio libraries I looked at are mainly:
- soundfile
- ffmpy
- librosa
1 librosa
Installation code:
!pip install librosa -i https://mirror.baidu.com/pypi/simple
!pip install soundfile -i https://mirror.baidu.com/pypi/simple
Reference documentation: librosa
1.1 Audio reading
Document location: https://librosa.org/doc/latest/core.html#audio-loading
signal, sr = librosa.load(path, sr=None)
The parameters of load include:
librosa.load(path, *, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')
Here sr=None keeps the file's original sampling rate; if any other sampling rate is given, the audio is resampled on load, which takes some extra time.
librosa.load can read both .wav and .mp3 files.
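To get a feel for what resampling does to the signal length, here is a small arithmetic sketch. It assumes the output length is the input length scaled by target_sr / orig_sr and rounded up, which is my understanding of how librosa sizes the resampled signal; the function name resampled_length is hypothetical and not part of librosa.

```python
import math

def resampled_length(n_samples: int, orig_sr: int, target_sr: int) -> int:
    """Approximate sample count after resampling (assumed: ceil of the scaled length)."""
    return math.ceil(n_samples * target_sr / orig_sr)

# 10 seconds of audio recorded at 44100 Hz, resampled to 16000 Hz
n_in = 10 * 44100
n_out = resampled_length(n_in, 44100, 16000)
print(n_in, "->", n_out)         # 441000 -> 160000
print(n_out / 16000, "seconds")  # the duration stays at 10.0 seconds
```

The number of samples changes, but the duration does not, which is why resampling on load costs time without changing what you hear.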
1.2 Audio write out
Several articles online, such as "python audio sampling rate conversion" and "python audio file sampling rate conversion", export audio with code that now raises an error. Their code is reproduced below.
Code snippet one:
def resample_rate(path,new_sample_rate = 16000):
    signal, sr = librosa.load(path, sr=None)
    wavfile = path.split('/')[-1]
    wavfile = wavfile.split('.')[0]
    file_name = wavfile + '_new.wav'
    new_signal = librosa.resample(signal, sr, new_sample_rate)
    librosa.output.write_wav(file_name, new_signal , new_sample_rate)
Code snippet two:
import librosa
import os
noise_name="/media/dfy/fc0b6513-c379-4548-b391-876575f1493f/home/dfy/PycharmProjects/noise_data/"
noise_name_list=os.listdir(noise_name)
for one_name in noise_name_list:
    data=librosa.load(noise_name+one_name,16000)
    librosa.output.write_wav(noise_name+one_name,data[0],16000,norm=False)
if __name__ == '__main__':
    pass
Both snippets export with librosa.output, which recent versions of librosa no longer provide, so an error occurs:
AttributeError: module librosa has no attribute output
(Related write-up: resolving the "No module named numba.decorators" error.)
librosa 0.8.0 removed the output API, so either downgrade librosa, for example to 0.7.2, or use another way to write files.
So I went to the official documentation: the way librosa recommends writing audio is to use this library: PySoundFile
1.3 librosa read in + PySoundFile write out
If this error occurs:
Input audio file has sample rate [44100], but decoder expects [16000]
it simply means the audio sampling rate does not match what the decoder expects, so the file needs to be resampled.
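Before resampling, it can be useful to confirm what sampling rate a file actually has. The following is a minimal sketch using only the standard-library wave module (which handles plain PCM WAV files); the helper names wav_sample_rate and needs_resampling are hypothetical, and the demo file is generated on the spot rather than taken from the article.

```python
import os
import tempfile
import wave

def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as fp:
        return fp.getframerate()

def needs_resampling(path: str, expected_sr: int = 16000) -> bool:
    """True when the file's sampling rate differs from what the model expects."""
    return wav_sample_rate(path) != expected_sr

# Demo: write one second of 16-bit mono silence at 44100 Hz, then check it
demo_path = os.path.join(tempfile.mkdtemp(), "demo_44100.wav")
with wave.open(demo_path, "wb") as fp:
    fp.setnchannels(1)
    fp.setsampwidth(2)       # 2 bytes per sample = 16 bit
    fp.setframerate(44100)
    fp.writeframes(b"\x00\x00" * 44100)

print(wav_sample_rate(demo_path))    # 44100
print(needs_resampling(demo_path))   # True
```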
Combining the two libraries above (librosa and PySoundFile) and lightly adapting the code from "python audio sampling rate conversion" and "python audio file sampling rate conversion", I arrived at the following function for converting the audio sampling rate:
import librosa
import os
import numpy as np
import soundfile as sf
def resample_rate(path,new_sample_rate = 16000):
    # sr=None keeps the original sampling rate when loading
    signal, sr = librosa.load(path, sr=None)
    # Output name: <input file name without extension>_new.wav
    wavfile = os.path.splitext(os.path.basename(path))[0]
    file_name = wavfile + '_new.wav'
    # Recent librosa releases require orig_sr and target_sr as keyword arguments
    new_signal = librosa.resample(signal, orig_sr=sr, target_sr=new_sample_rate)
    #librosa.output.write_wav(file_name, new_signal , new_sample_rate)
    sf.write(file_name, new_signal, new_sample_rate, subtype='PCM_24')
    print(f'{file_name} has been saved.')
wav_file = 'video/xxx.wav'  # replace with the path to your own audio file
resample_rate(wav_file,new_sample_rate = 16000)
This converts the audio file to a sample_rate of 16000.
1.4 Convert from other libraries to librosa format
Reference: https://librosa.org/doc/latest/generated/librosa.load.html#librosa.load
The first way:
# Load using an already open SoundFile object
import librosa
import soundfile
sfo = soundfile.SoundFile(librosa.ex('brahms'))
y, sr = librosa.load(sfo)
The second way:
# Load using an already open audioread object
import librosa
import audioread.ffdec # Use ffmpeg decoder
aro = audioread.ffdec.FFmpegAudioFile(librosa.ex('brahms'))
y, sr = librosa.load(aro)
2 PySoundFile
python-soundfile is an audio library based on libsndfile, CFFI and NumPy.
You can directly use the functions read() and write() to read and write sound files. To read a sound file in blocks, use blocks(). Alternatively, sound files can also be opened as SoundFile objects.
The official documentation of PySoundFile: readthedocs
Installation:
!pip install soundfile -i https://mirror.baidu.com/pypi/simple
2.1 Read in audio
Read a file from a zip archive:
import zipfile as zf
import soundfile as sf
import io
with zf.ZipFile('test.zip') as myzip:
    with myzip.open('stereo_file.wav') as myfile:
        tmp = io.BytesIO(myfile.read())
        data, samplerate = sf.read(tmp)
Download and read from a URL:
import soundfile as sf
import io
from urllib.request import urlopen  # Python 3 standard library
url = "https://raw.githubusercontent.com/librosa/librosa/master/tests/data/test1_44100.wav"
data, samplerate = sf.read(io.BytesIO(urlopen(url).read()))
2.2 Export audio
To export audio:
import numpy as np
import soundfile as sf
samplerate = 44100
# 10 seconds of random stereo noise
data = np.random.uniform(-1, 1, size=(samplerate * 10, 2))
# Write out audio as 24bit PCM WAV
sf.write('stereo_file.wav', data, samplerate, subtype='PCM_24')
# Write out audio as 24bit Flac
sf.write('stereo_file.flac', data, samplerate, format='flac', subtype='PCM_24')
# Write out audio as Ogg Vorbis
sf.write('stereo_file.ogg', data, samplerate, format='ogg', subtype='vorbis')
3 ffmpy
Reference: Python batch conversion of video and audio sampling rates (with code) | Python tool
Installation:
pip install ffmpy -i https://pypi.douban.com/simple
See the original article for the full code; only one excerpt is shown here:
import os
from ffmpy import FFmpeg
def transfor(video_path: str, tmp_dir: str, result_dir: str):
    file_name = os.path.basename(video_path)
    base_name = file_name.split('.')[0]
    file_ext = file_name.split('.')[-1]
    ext = 'wav'
    # Step 1: extract the audio track as mono 16 kHz WAV
    audio_path = os.path.join(tmp_dir, '{}.{}'.format(base_name, ext))
    print('File name: {}, extracting audio'.format(audio_path))
    ff = FFmpeg(
        inputs={video_path: None},
        outputs={audio_path: '-f {} -vn -ac 1 -ar 16000 -y'.format('wav')})
    print(ff.cmd)
    ff.run()
    if os.path.exists(audio_path) is False:
        return None
    # Step 2: write a copy of the video without its audio track
    video_tmp_path = os.path.join(
        tmp_dir, '{}_1.{}'.format(base_name, file_ext))
    ff_video = FFmpeg(inputs={video_path: None},
                      outputs={video_tmp_path: '-an'})
    print(ff_video.cmd)
    ff_video.run()
    # Step 3: merge the silent video with the converted audio
    result_video_path = os.path.join(result_dir, file_name)
    ff_fuse = FFmpeg(inputs={video_tmp_path: None, audio_path: None}, outputs={
        result_video_path: '-map 0:v -map 1:a -c:v copy -c:a aac -shortest'})
    print(ff_fuse.cmd)
    ff_fuse.run()
    return result_video_path
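The first FFmpeg call in the function above corresponds to a plain ffmpeg command line. The sketch below builds that argument list with the standard library only, using the same flags as the excerpt; the helper name extract_audio_cmd and the example paths are hypothetical.

```python
import shlex

def extract_audio_cmd(video_path: str, audio_path: str, sample_rate: int = 16000) -> list:
    """Build the ffmpeg argument list for extracting mono WAV audio from a video."""
    return [
        "ffmpeg",
        "-i", video_path,          # input file
        "-f", "wav",               # force the WAV output format
        "-vn",                     # drop the video stream
        "-ac", "1",                # one audio channel (mono)
        "-ar", str(sample_rate),   # audio sampling rate in Hz
        "-y",                      # overwrite the output without asking
        audio_path,
    ]

cmd = extract_audio_cmd("input.mp4", "tmp/input.wav")
print(shlex.join(cmd))
# ffmpeg -i input.mp4 -f wav -vn -ac 1 -ar 16000 -y tmp/input.wav
```

If ffmpeg is installed, the list can be passed to subprocess.run(cmd, check=True) instead of going through ffmpy.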
4 AudioSegment / pydub
Reference article: Python | Speech Processing | Comparison of reading audio files with librosa / AudioSegment / soundfile
For an introduction to pydub's parameters, see: a brief introduction to pydub
Official website: pydub
import time
import numpy as np
from pydub import AudioSegment  # third-party library pydub; install it before first use
audio_path = './data/example.mp3'
t = time.time()
song = AudioSegment.from_file(audio_path, format='mp3')
# print(len(song)) # duration, in milliseconds
# print(song.frame_rate) # sampling rate, in Hz
# print(song.sample_width) # sample width (quantization), in bytes
# print(song.channels) # number of channels; MP3 files are commonly stereo, and more channels mean a larger file
wav = np.array(song.get_array_of_samples())
sr = song.frame_rate
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav
The output is:
sr=16000, len=64320, elapsed: 0.04667925834655762
(min, max, mean) = (-872, 740, -0.6079446517412935)
array([ 1, -1, -2, ..., -1, 1, -2], dtype=int16)
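One thing to watch with stereo files: as far as I understand pydub, get_array_of_samples() returns the channels interleaved (left, right, left, right, ...), so the flat array has to be split per channel before analysis. The sketch below shows the idea with made-up data and the standard-library array module; split_channels is a hypothetical helper, not a pydub function.

```python
from array import array

def split_channels(samples, channels: int):
    """Split an interleaved sample sequence into one list per channel."""
    return [list(samples[ch::channels]) for ch in range(channels)]

# Hypothetical interleaved stereo data: left = 1, 2, 3 and right = -1, -2, -3
interleaved = array("h", [1, -1, 2, -2, 3, -3])
left, right = split_channels(interleaved, channels=2)
print(left)   # [1, 2, 3]
print(right)  # [-1, -2, -3]
```

With NumPy the same split is usually written as np.array(samples).reshape(-1, channels), which gives one column per channel.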
5 paddleaudio
Install:
! pip install paddleaudio -i https://mirror.baidu.com/pypi/simple
paddleaudio is one of Paddle's official packages; its basic audio operations appear to be built on the librosa library.
Specific reference:
https://paddleaudio-doc.readthedocs.io/en/latest/index.html
import paddleaudio
audio_file = 'XXX.wav'
paddleaudio.load(audio_file, sr=None, mono=True, normal=False)
Output:
(array([-3.9100647e-04, -3.0159950e-05, 1.1110306e-04, ...,
1.4603138e-04, 2.5625229e-03, -7.6780319e-03], dtype=float32),
16000)
That is, the audio samples plus the sample rate.
6 Audio segmentation - auditok
The reference is: [Super Simple] Building Personal Voice Dictation Service Based on PaddleSpeech
!pip install auditok
Segmentation is needed because the longest speech PaddleSpeech can recognize in one pass is 50 s, so longer audio has to be split first; auditok can be called directly for this.
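auditok splits on silence, so the length of each segment depends on the audio itself. If a segment could still exceed the limit, a fixed-length split is one possible fallback. The sketch below only computes the cut points; the function name fixed_chunks is hypothetical, and the 50 s default is simply the figure quoted above.

```python
def fixed_chunks(total_seconds: float, max_len: float = 50.0):
    """Return (start, end) pairs in seconds, each at most max_len long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_len, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks

print(fixed_chunks(120.0))
# [(0.0, 50.0), (50.0, 100.0), (100.0, 120.0)]
```

Cutting at fixed times can split a word in half, which is why silence-based splitting is preferred. The auditok-based function from the referenced article follows.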
from paddlespeech.cli.asr.infer import ASRExecutor
import csv
import moviepy.editor as mp
import auditok
import os
import paddle
from paddlespeech.cli import ASRExecutor, TextExecutor
import soundfile
import librosa
import warnings
warnings.filterwarnings('ignore')
# Input type is audio
def qiefen(path, ty='audio', mmin_dur=1, mmax_dur=100000, mmax_silence=1, menergy_threshold=55):
    audio_file = path
    audio, audio_sample_rate = soundfile.read(
        audio_file, dtype="int16", always_2d=True)
    audio_regions = auditok.split(
        audio_file,
        min_dur=mmin_dur,  # minimum duration of a valid audio event in seconds
        max_dur=mmax_dur,  # maximum duration of an event
        # maximum duration of tolerated continuous silence within an event
        max_silence=mmax_silence,
        energy_threshold=menergy_threshold  # threshold of detection
    )
    # Output directory: change/<ty>/<input file name without extension>
    file_pre = os.path.splitext(os.path.basename(audio_file))[0]
    out_dir = 'change' + '/' + ty + '/' + file_pre
    os.makedirs(out_dir, exist_ok=True)
    for i, r in enumerate(audio_regions):
        # Regions returned by `split` have 'start' and 'end' metadata fields
        print(
            "Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))
        # Zero-pad the index to three digits so the files sort in order
        s = '000000' + str(i)
        # The {meta.start} / {meta.end} placeholders are filled in by r.save
        file_save = out_dir + '/' + \
            s[-3:] + '-' + '{meta.start:.3f}-{meta.end:.3f}' + '.wav'
        filename = r.save(file_save)
        print("region saved as: {}".format(filename))
    return out_dir
The core call is auditok.split; its parameters are documented in detail under auditok.core.split. Note that the input passed here is the audio file name, not the decoded audio data.
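To make the roles of energy_threshold, min_dur and max_silence more concrete, here is a deliberately simplified toy detector. It is not auditok's implementation: it works on a made-up list of per-frame energy values, ignores max_dur, and trims trailing silence, all of which are assumptions for illustration only. The function name split_by_energy is hypothetical.

```python
def split_by_energy(energies, frame_dur, energy_threshold=55.0,
                    min_dur=1.0, max_silence=1.0):
    """Toy detector: return (start, end) times in seconds of loud events.

    energies: one energy value per frame; frame_dur: frame length in seconds.
    """
    regions = []
    start = None      # index of the first frame of the current event
    last_loud = None  # index of the most recent frame above the threshold

    def close_event():
        # Keep the event only if it lasts at least min_dur
        if (last_loud - start + 1) * frame_dur >= min_dur:
            regions.append((start * frame_dur, (last_loud + 1) * frame_dur))

    for i, energy in enumerate(energies):
        if energy >= energy_threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and (i - last_loud) * frame_dur > max_silence:
            # Silence has lasted longer than max_silence: the event is over
            close_event()
            start = None
    if start is not None:
        close_event()
    return regions

# Made-up frame energies, one value per 0.5 s frame
energies = [0, 60, 60, 60, 0, 60, 0, 0, 0, 60, 0]
print(split_by_energy(energies, frame_dur=0.5))
# [(0.5, 3.0)]
```

In this run the short pause at 2.0 s is tolerated because it is shorter than max_silence, while the isolated loud frame at 4.5 s is dropped because it is shorter than min_dur.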
7 A more difficult error to solve
AudioParameterError: Sample width must be one of: 1, 2 or 4 (bytes)
I ran into the error above when running speech recognition with the model, and searching around the Internet did not turn up a working solution.
Just as I was about to give up, I happened to notice a very handy feature of the AudioSegment library.
What is sample width? It is the quantization bit width of each sample (sampwidth), expressed in bytes.
import wave
file ='asr_example.wav'
with wave.open(file) as fp:
    channels = fp.getnchannels()
    srate = fp.getframerate()
    swidth = fp.getsampwidth()
    data = fp.readframes(-1)
swidth,srate
Several important parameters of an audio file can be queried through the wave module.
They are:
- nchannels: number of channels
- sampwidth: sample width in bytes
- framerate: sampling frequency
- nframes: number of audio frames (sampling points per channel)
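These four values are enough to derive the duration, bit depth and raw data size of uncompressed PCM audio. The sketch below does the arithmetic; describe_pcm is a hypothetical helper, and the example numbers assume a mono, 16-bit, 16000 Hz file with 64320 frames.

```python
def describe_pcm(nchannels: int, sampwidth: int, framerate: int, nframes: int) -> dict:
    """Derive basic facts about uncompressed PCM audio from its header values."""
    return {
        "duration_s": nframes / framerate,               # seconds of audio
        "bit_depth": sampwidth * 8,                      # bits per sample
        "data_bytes": nframes * nchannels * sampwidth,   # size of the raw samples
    }

info = describe_pcm(nchannels=1, sampwidth=2, framerate=16000, nframes=64320)
print(info)
# {'duration_s': 4.02, 'bit_depth': 16, 'data_bytes': 128640}
```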
If you hit the error above, the sample width needs to be readjusted. The AudioSegment library supports this directly:
from pydub import AudioSegment
file_in ='asr_example.wav' # input audio file name
file_out = 'asr_example_3.wav' # output audio file name
sound = AudioSegment.from_file(file_in)
sound = sound.set_frame_rate(48000) # optionally change the sampling rate
sound = sound.set_sample_width(4) # reset the sample width in bytes
sound.export(file_out, format="wav")
This resolves the error.
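A quick way to check whether a file is likely to trigger this error is to compare its sample width with the values listed in the message. One plausible cause, which I have not confirmed, is a file written with subtype='PCM_24' as in section 1.3: 24-bit audio has a sample width of 3 bytes, which is not among 1, 2 and 4. The sketch below uses only the standard-library wave module; the helper names are hypothetical and the demo file is generated on the spot.

```python
import os
import tempfile
import wave

ACCEPTED_WIDTHS = (1, 2, 4)  # the values listed in the error message

def sample_width(path: str) -> int:
    """Return the sample width in bytes from a PCM WAV header."""
    with wave.open(path, "rb") as fp:
        return fp.getsampwidth()

def width_is_accepted(path: str) -> bool:
    """True when the sample width is one of the accepted values."""
    return sample_width(path) in ACCEPTED_WIDTHS

# Demo: write 100 frames of 24-bit (3-byte) mono silence and inspect it
pcm24_path = os.path.join(tempfile.mkdtemp(), "pcm24.wav")
with wave.open(pcm24_path, "wb") as fp:
    fp.setnchannels(1)
    fp.setsampwidth(3)
    fp.setframerate(16000)
    fp.writeframes(b"\x00\x00\x00" * 100)

print(sample_width(pcm24_path))       # 3
print(width_is_accepted(pcm24_path))  # False
```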
8 Download audio from URL
Several read-in methods:
8.1 soundfile
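import io
from urllib.request import urlopen
import soundfile as sf
def save_audio_func(video_url,save_samplerate = 16000):
    '''
    Export audio
    '''
    save_name = video_url.split('/')[-1]
    data, samplerate = sf.read(io.BytesIO(urlopen(video_url).read()))
    # Write out audio as 24bit PCM WAV
    # Note: sf.write does not resample; if save_samplerate differs from samplerate,
    # resample data first (for example with librosa.resample)
    sf.write(save_name, data, save_samplerate, subtype='PCM_24')
    #print('')
    return save_name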
Both reading and writing go through soundfile.
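Taking the last path component of the URL as the file name works for simple links but breaks when the URL carries a query string. A slightly more careful sketch, using only the standard library, is shown below; filename_from_url, its default value, and the example.com URLs are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse

def filename_from_url(url: str, default: str = "audio.wav") -> str:
    """Derive a local file name from a URL, ignoring any query string or fragment."""
    name = posixpath.basename(unquote(urlparse(url).path))
    return name or default

print(filename_from_url("https://example.com/audio/test1_44100.wav?token=abc"))
# test1_44100.wav
print(filename_from_url("https://example.com/"))
# audio.wav
```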
9 How to read mp3
Reference: https://blog.csdn.net/qq_37851620/article/details/127149729
soundfile.read:
- could only read .wav, not .mp3, in the version tested (MP3 support depends on the underlying libsndfile version);
- default dtype='float64', with output values in (-1, 1) (normalized by 32768); with dtype='int16' the output lies in (-2**15, 2**15-1);
- keeps the original sampling frequency.
librosa.load:
- can read .wav and .mp3;
- the output is in (-1, 1);
- sr=None keeps the original sampling frequency; setting another sampling frequency triggers resampling, which takes some time.
pydub.AudioSegment.from_file:
- can read .wav and .mp3;
- the output is in (-2**15, 2**15-1); dividing manually by 32768 (=2**15) gives the same result as librosa.load;
- keeps the original sampling frequency; resampling can be done with librosa.resample.
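The division by 32768 mentioned above can be sketched in a few lines. This uses made-up 16-bit samples and only the standard library; int16_to_float is a hypothetical helper.

```python
from array import array

def int16_to_float(samples):
    """Scale 16-bit integer samples into the range [-1.0, 1.0)."""
    return [s / 32768 for s in samples]

pcm = array("h", [-32768, -16384, 0, 16384, 32767])
print(int16_to_float(pcm))
# [-1.0, -0.5, 0.0, 0.5, 0.999969482421875]
```

With a NumPy array the equivalent is wav.astype(np.float32) / 32768.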