This post compares three ways to read audio files (.wav, .mp3) in Python:

soundfile.read
librosa.load
pydub.AudioSegment.from_file

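In summary:

soundfile.read:
- Reads .wav; reading .mp3 failed in the environment used here (MP3 support depends on the underlying libsndfile build and is only available from libsndfile 1.1.0 onward);
- Default dtype = 'float64', which returns values in [-1, 1) (16-bit samples are divided by 32768); with dtype = 'int16' the values lie in [-2**15, 2**15-1];
- Keeps the original sampling rate.
librosa.load:
- Reads both .wav and .mp3;
- Returns float32 values in [-1, 1];
- sr=None keeps the original sampling rate (the default is sr=22050); any rate that differs from the file's rate triggers resampling, which is noticeably slower;
pydub.AudioSegment.from_file:
- Reads both .wav and .mp3;
- Returns integers in [-2**15, 2**15-1] for 16-bit audio; dividing by 32768 (=2**15) gives the same result as librosa.load;
- Keeps the original sampling rate; resampling can be done with librosa.resample.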

1) soundfile.read

In the environment used here, soundfile.read could read .wav only; reading .mp3 raised an error.

Basic usage:

import soundfile as sf
import time

audio_path = './data/example.wav'

t = time.time()
wav, sr = sf.read(audio_path)
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()}), dtype={wav.dtype}")
wav

sr=16000, len=64320, elapsed: 0.08734607696533203
(min, max, mean) = (-0.027984619140625, 0.032623291015625, -2.017282134857937e-05), dtype=float64
array([ 0.00000000e+00,  0.00000000e+00, -6.10351562e-05, ...,
       -3.05175781e-05,  6.10351562e-05, -1.22070312e-04])
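The timings in this post come from a single time.time() measurement, so they can include one-time costs such as imports or a cold file cache. A more stable comparison repeats the call and keeps the fastest run. The helper below is a hypothetical sketch (best_of is not part of any library used here), and it times a stand-in workload so that it runs without an audio file.

```python
import time

def best_of(fn, *args, repeat=5, **kwargs):
    """Call fn `repeat` times; return (last result, fastest time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in workload; replace with e.g. best_of(sf.read, audio_path)
# to time a real loader.
total, seconds = best_of(sum, range(100_000))
print(f"result={total}, best of 5: {seconds:.6f} s")
```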

sf.read takes a dtype parameter, documented as follows:

    dtype : {'float64', 'float32', 'int32', 'int16'}, optional
        Data type of the returned array, by default ``'float64'``.
        Floating point audio data is typically in the range from
        ``-1.0`` to ``1.0``.  Integer data is in the range from
        ``-2**15`` to ``2**15-1`` for ``'int16'`` and from ``-2**31`` to
        ``2**31-1`` for ``'int32'``.

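With 'float64' or 'float32', the samples are normalized to [-1, 1); 'float64' is the default;
With 'int16', the samples lie in [-32768, 32767]; dividing by 32768 (=2**15) reproduces the float result above;
With 'int32', the samples lie in [-2147483648, 2147483647]; dividing by 2147483648 (=2**31) reproduces the float result above.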

t = time.time()
wav, sr = sf.read(audio_path, dtype='int16')
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()}), dtype={wav.dtype}")
wav

sr=16000, len=64320, elapsed: 0.0010836124420166016
(min, max, mean) = (-917, 1069, -0.6610230099502488), dtype=int16
array([ 0,  0, -2, ..., -1,  2, -4], dtype=int16)

wav / (2**15)

array([ 0.00000000e+00,  0.00000000e+00, -6.10351562e-05, ...,
       -3.05175781e-05,  6.10351562e-05, -1.22070312e-04])

t = time.time()
wav, sr = sf.read(audio_path, dtype='int32')
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()}), dtype={wav.dtype}")
wav

sr=16000, len=64320, elapsed: 0.0007898807525634766
(min, max, mean) = (-60096512, 70057984, -43320.8039800995), dtype=int32
array([      0,       0, -131072, ...,  -65536,  131072, -262144],
      dtype=int32)

wav / (2**31)

array([ 0.00000000e+00,  0.00000000e+00, -6.10351562e-05, ...,
       -3.05175781e-05,  6.10351562e-05, -1.22070312e-04])
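The two divisions above follow one rule: divide by 2 raised to (bit depth - 1). A minimal sketch of that rule, using only NumPy and the first three samples from the int16 and int32 outputs shown earlier (pcm_to_float is a hypothetical helper name):

```python
import numpy as np

def pcm_to_float(samples):
    """Scale signed-integer PCM samples to float64 in [-1.0, 1.0)."""
    samples = np.asarray(samples)
    if not np.issubdtype(samples.dtype, np.signedinteger):
        raise TypeError("expected a signed integer array")
    # 2**15 for int16, 2**31 for int32; a float avoids integer overflow.
    full_scale = float(2 ** (8 * samples.dtype.itemsize - 1))
    return samples / full_scale

as_int16 = np.array([0, 0, -2], dtype=np.int16)
as_int32 = np.array([0, 0, -131072], dtype=np.int32)
print(pcm_to_float(as_int16))
print(pcm_to_float(as_int32))
```

Both calls give the same floats, which is why the int16 and int32 reads agree with the default float64 read after normalization.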

2) librosa.load

Basic usage:

import librosa
import numpy as np
import time

# audio_path = './data/example.mp3'
audio_path = './data/example.wav'

# read with librosa.load
t = time.time()
wav, sr = librosa.load(audio_path, sr=None)
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav

sr=16000, len=64320, elapsed: 0.0007445812225341797
(min, max, mean) = (-0.027984619140625, 0.032623291015625, -2.0172821677988395e-05)
array([ 0.0000000e+00,  0.0000000e+00, -6.1035156e-05, ...,
       -3.0517578e-05,  6.1035156e-05, -1.2207031e-04], dtype=float32)

The source code of librosa.load shows that it also uses soundfile.read under the hood.

Note, however:

librosa.load tries soundfile.read first. In this environment soundfile could not read .mp3, so passing an .mp3 file produces the following warning:

  /data/miniconda3/lib/python3.9/site-packages/librosa/core/audio.py:162: UserWarning: PySoundFile failed. Trying audioread instead.
    warnings.warn("PySoundFile failed. Trying audioread instead.")

librosa then falls back to its __audioread_load function, which the source describes as follows:

  Open an audio file using a library that is available on this system.

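librosa.load also has a dtype parameter, with default dtype=np.float32, and returns values in [-1, 1]. However, when a .wav file is passed with dtype set to int16 or int32, the call fails (readers can try this themselves). The error is: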
```
ParameterError: Audio data must be floating-point
```
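Since librosa.load only accepts floating-point dtypes, one workaround is to load floats and convert them to integers afterwards. The function below is a minimal sketch of that conversion using only NumPy; float_to_int16 is a hypothetical helper, and the rounding and clipping policy is an assumption rather than anything librosa prescribes.

```python
import numpy as np

def float_to_int16(wav):
    """Convert floats in [-1.0, 1.0] to int16, clipping values that overflow."""
    scaled = np.round(np.asarray(wav, dtype=np.float64) * 32768)
    # +1.0 would map to 32768, which does not fit in int16, so clip it.
    return np.clip(scaled, -32768, 32767).astype(np.int16)

print(float_to_int16(np.array([0.0, -2 / 32768, 1.0, -1.0])))
```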
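Passing sr=None keeps the file's native sampling rate; note that the default is sr=22050, not None. If sr differs from the native rate, librosa resamples the signal (librosa.resample), which takes noticeably longer, and the length of wav changes accordingly.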

# audio_path = './data/example.wav'
audio_path = './data/example.mp3'

# read with librosa.load
print("sr=None (uses the native sampling rate)")
t = time.time()
wav, sr = librosa.load(audio_path, sr=None)
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean)=({wav.min()}, {wav.max()}, {wav.mean()})")
print(wav)

print("====================================================================")

print("sr differs from the native rate, so librosa resamples (librosa.resample)")
t = time.time()
wav, sr = librosa.load(audio_path, sr=22050)
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav

sr=None (uses the native sampling rate)
sr=16000, len=64320, elapsed: 0.08281397819519043
(min, max, mean)=(-0.026611328125, 0.0225830078125, -1.8552998881204985e-05)
[ 3.0517578e-05 -3.0517578e-05 -6.1035156e-05 ... -3.0517578e-05
  3.0517578e-05 -6.1035156e-05]
====================================================================
sr differs from the native rate, so librosa resamples (librosa.resample)
/data/miniconda3/lib/python3.9/site-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
sr=22050, len=88641, elapsed: 0.6544580459594727
(min, max, mean) = (-0.02733371965587139, 0.028146227821707726, -1.8553111658548005e-05)
array([ 2.7023052e-05, -6.1948767e-06, -5.5695422e-05, ...,
        2.9699471e-05, -2.1948585e-05, -5.6789420e-05], dtype=float32)
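The new length after resampling can be predicted from the two rates. The sketch below assumes the sample count scales by target_sr / orig_sr and is rounded up, which matches the lengths printed above; resampled_length is a hypothetical helper, and integer arithmetic is used to avoid floating-point rounding surprises.

```python
def resampled_length(n_samples, orig_sr, target_sr):
    """Expected sample count after resampling, rounded up."""
    # Ceiling division on integers: ceil(n_samples * target_sr / orig_sr).
    return -(-n_samples * target_sr // orig_sr)

print(resampled_length(64320, 16000, 22050))  # prints 88641
```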

3) pydub.AudioSegment.from_file

Basic usage:

from pydub import AudioSegment  # third-party package; install it before first use
import numpy as np
import time

audio_path = './data/example.mp3'

t = time.time()
song = AudioSegment.from_file(audio_path, format='mp3')
# print(len(song))  # duration in milliseconds
# print(song.frame_rate)  # sampling rate in Hz
# print(song.sample_width)  # sample width in bytes
# print(song.channels)  # number of channels; MP3 files are often stereo, and more channels mean larger files
wav = np.array(song.get_array_of_samples())
sr = song.frame_rate
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav

sr=16000, len=64320, elapsed: 0.04667925834655762
(min, max, mean) = (-872, 740, -0.6079446517412935)
array([ 1, -1, -2, ..., -1,  1, -2], dtype=int16)

The output is int16 in [-2**15, 2**15-1]. Dividing by 32768 gives the same result as librosa.load:

wav = wav / 32768
wav

array([ 3.05175781e-05, -3.05175781e-05, -6.10351562e-05, ...,
       -3.05175781e-05,  3.05175781e-05, -6.10351562e-05])
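The example file here is mono with 2-byte samples, so dividing by 32768 is enough. For other files, the divisor depends on song.sample_width, and for multi-channel audio get_array_of_samples() returns the channels interleaved. The sketch below handles both cases with NumPy only; samples_to_float is a hypothetical helper, the input array is made-up stereo data, and signed integer samples are assumed.

```python
import numpy as np

def samples_to_float(samples, sample_width, channels):
    """Normalize interleaved signed PCM to float32, shape (frames, channels)."""
    # 32768.0 for 2-byte samples, 2147483648.0 for 4-byte samples.
    full_scale = float(2 ** (8 * sample_width - 1))
    data = np.asarray(samples, dtype=np.float32) / full_scale
    return data.reshape(-1, channels)

# Made-up stereo data laid out as L, R, L, R, ...
interleaved = np.array([1, -1, -2, 2, 16384, -32768], dtype=np.int16)
stereo = samples_to_float(interleaved, sample_width=2, channels=2)
print(stereo.shape)
print(stereo[:, 0])  # left channel
```

With a real file, the arguments would come from song.get_array_of_samples(), song.sample_width and song.channels.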

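The result keeps the original sampling rate. To change it, resample with librosa (recent librosa versions require orig_sr and target_sr to be passed as keyword arguments):

new_wav = librosa.resample(wav.astype(np.float32), orig_sr=16000, target_sr=22050)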
print(new_wav.shape)
new_wav

(88641,)
array([ 2.7023052e-05, -6.1948767e-06, -5.5695422e-05, ...,
        2.9699471e-05, -2.1948585e-05, -5.6789420e-05], dtype=float32)

