Call Baidu Cloud Speech to Text

1. Create an application

https://console.bce.baidu.com/ai/?_=1665651969389#/ai/speech/overview/index

Open the application list to see the application you created, along with its app_id, api_key and secret_key.

2. Calling method 1

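Install the package

pip install baidu-aip

Then call the SDK:

from aip import AipSpeech

# Your APP_ID, API_KEY (AK) and SECRET_KEY (SK)
APP_ID = ''
API_KEY = ''
SECRET_KEY = ''

client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)

# Prepare a WAV file (16000 sample rate, 16-bit, mono, shorter than 60 seconds)
# and point the path below at it.

# Requirements for audio files sent to Baidu AIP speech recognition:
# 1. Supported audio formats: pcm, wav, amr, m4a
# 2. Audio encoding: sample rate 16000 or 8000, 16-bit depth, mono
# 3. Recording length: no more than 60 seconds


# Read the audio file
with open('data/record/nls-sample-16k.wav', 'rb') as fp:
    au = fp.read()

# 'wav' format, 16000 sample rate, dev_pid 1537 = Mandarin.
# All three must match the actual audio file.
res = client.asr(au, 'wav', 16000, {'dev_pid': 1537})
print(res)
# 'result' is only present when recognition succeeds
if 'result' in res:
    print('Recognition result: ' + ''.join(res['result']))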
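Before uploading, it can help to check a WAV file locally against the limits listed in the comments above. The following is a minimal sketch using Python's standard wave module; the helper name check_wav is hypothetical and is not part of the baidu-aip SDK.

```python
import wave


def check_wav(path, max_seconds=60):
    """Return a list of problems; an empty list means the file meets the limits."""
    problems = []
    with wave.open(path, 'rb') as wf:
        rate = wf.getframerate()
        width = wf.getsampwidth()      # bytes per sample
        channels = wf.getnchannels()
        seconds = wf.getnframes() / float(rate)
    if rate not in (16000, 8000):
        problems.append('sample rate is %d, expected 16000 or 8000' % rate)
    if width != 2:                     # 2 bytes = 16 bits
        problems.append('bit depth is %d bits, expected 16' % (width * 8))
    if channels != 1:
        problems.append('%d channels, expected mono' % channels)
    if seconds > max_seconds:
        problems.append('duration is %.1f s, limit is %d s' % (seconds, max_seconds))
    return problems
```

Calling check_wav on the file you are about to send and printing the returned list gives a quick hint about whether it needs transcoding first.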

3. Calling method 2

import sys
import json
import base64
import time

IS_PY3 = sys.version_info.major == 3

if IS_PY3:
    from urllib.request import urlopen
    from urllib.request import Request
    from urllib.error import URLError
    from urllib.parse import urlencode
    timer = time.perf_counter
else:
    from urllib2 import urlopen
    from urllib2 import Request
    from urllib2 import URLError
    from urllib import urlencode
    if sys.platform == "win32":
        timer = time.clock
    else:
        # On most other platforms the best timer is time.time()
        timer = time.time

API_KEY = ''
SECRET_KEY = ''

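# File to recognize
AUDIO_FILE = './audio/16k.pcm'  # only pcm/wav/amr are supported; the fast version (极速版) also supports m4a
# File format
FORMAT = AUDIO_FILE[-3:]  # taken from the file suffix: pcm/wav/amr, plus m4a for the fast version

CUID = '123456PYTHON'
# Sample rate
RATE = 16000  # fixed value

# Standard version

DEV_PID = 1537  # 1537 = Mandarin, input-method model. Choose the PID for your language and model from the documentation
ASR_URL = 'http://vop.baidu.com/server_api'
SCOPE = 'audio_voice_assistant_get'  # this scope means the app has ASR capability; if it is missing, enable it on the web console (very old apps may not have it)

# To test the self-training platform, enable the settings below. Once your model is online, the platform shows
# step 2, "get the dedicated model parameters pid:8001, modelid:1234"; use those values as dev_pid=8001, lm_id=1234
# DEV_PID = 8001
# LM_ID = 1234

# Fast version: if you uncomment these, fill in your own API key and secret key, and enable the fast version on the web console (it may be billed once enabled)

# DEV_PID = 80001
# ASR_URL = 'http://vop.baidu.com/pro_api'
# SCOPE = 'brain_enhanced_asr'  # this scope means the app has the fast version; if it is missing, enable the fast version on the web console

# Ignore the scope check (very old apps may have no scope)
# SCOPE = False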

class DemoError(Exception):
    pass


"""  TOKEN start """

TOKEN_URL = 'http://aip.baidubce.com/oauth/2.0/token'


def fetch_token():
    params = {'grant_type': 'client_credentials',
              'client_id': API_KEY,
              'client_secret': SECRET_KEY}
    post_data = urlencode(params)
    if (IS_PY3):
        post_data = post_data.encode( 'utf-8')
    req = Request(TOKEN_URL, post_data)
    try:
        f = urlopen(req)
        result_str = f.read()
    except URLError as err:
        print('token http response http code : ' + str(err.code))
        result_str = err.read()
    if (IS_PY3):
        result_str =  result_str.decode()

    print(result_str)
    result = json.loads(result_str)
    print(result)
    if ('access_token' in result.keys() and 'scope' in result.keys()):
        print(SCOPE)
        if SCOPE and SCOPE not in result['scope'].split(' '):  # SCOPE = False skips the check
            raise DemoError('scope is not correct')
        print('SUCCESS WITH TOKEN: %s  EXPIRES IN SECONDS: %s' % (result['access_token'], result['expires_in']))
        return result['access_token']
    else:
        raise DemoError('MAYBE API_KEY or SECRET_KEY not correct: access_token or scope not found in token response')

"""  TOKEN end """

if __name__ == '__main__':
    token = fetch_token()

    speech_data = []
    with open(AUDIO_FILE, 'rb') as speech_file:
        speech_data = speech_file.read()

    length = len(speech_data)
    if length == 0:
        raise DemoError('file %s length read 0 bytes' % AUDIO_FILE)
    speech = base64.b64encode(speech_data)
    if (IS_PY3):
        speech = str(speech, 'utf-8')
    params = {'dev_pid': DEV_PID,
             #"lm_id" : LM_ID,    # enable this when testing the self-training platform
              'format': FORMAT,
              'rate': RATE,
              'token': token,
              'cuid': CUID,
              'channel': 1,
              'speech': speech,
              'len': length
              }
    post_data = json.dumps(params, sort_keys=False)
    # print post_data
    req = Request(ASR_URL, post_data.encode('utf-8'))
    req.add_header('Content-Type', 'application/json')
    try:
        begin = timer()
        f = urlopen(req)
        result_str = f.read()
        print ("Request time cost %f" % (timer() - begin))
    except URLError as err:
        print('asr http response http code : ' + str(err.code))
        result_str = err.read()

    if (IS_PY3):
        result_str = str(result_str, 'utf-8')
    print(result_str)
    with open("result.txt","w") as of:
        of.write(result_str)
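The script above only saves the raw JSON response. A small helper for pulling the recognized text out of it might look like the sketch below. It assumes the response carries err_no (0 on success) and err_msg fields alongside result; result is the key used in method 1, while the other two field names are an assumption to verify against the API documentation. The name parse_asr_result is hypothetical.

```python
import json


def parse_asr_result(result_str):
    """Return the recognized text, or raise ValueError carrying the server's error."""
    data = json.loads(result_str)
    # Assumption: success is reported as err_no == 0 with the text in 'result'.
    if data.get('err_no', 0) != 0 or 'result' not in data:
        raise ValueError('ASR failed: err_no=%s err_msg=%s'
                         % (data.get('err_no'), data.get('err_msg')))
    # 'result' is a list of strings; join them as method 1 does.
    return ''.join(data['result'])
```

It could be called on result_str right after the response has been decoded.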

4. Audio transcoding tools

The short speech recognition interface does not support the MP3 format.
You can convert audio with the ffmpeg tool and inspect an audio file's format with the ffprobe tool. Reference:
https://ai.baidu.com/ai-doc/SPEECH/7k38lxpwf#%E6%9F%A5%E7%9C%8B%E9%9F%B3%E9%A2%91%E6%A0%BC%E5%BC%8Fffprobe%E4%BD%BF%E7%94%A8

This section describes how to convert audio in other formats into files that meet the input requirements of speech recognition, that is, one of these 4 formats:

pcm (uncompressed), also called raw format. The most basic form of audio input; no decoding is needed.
wav (uncompressed, pcm encoding): a pcm file with a header added at the start that describes the sample rate, encoding and similar information.
amr (lossy compression): compresses the audio data lossily, similar to mp3.
m4a (lossy compression, AAC encoding): compresses the audio data lossily; normally only used by WeChat mini programs. Converting to it yourself is relatively complicated.

Because recognition works on pcm internally, uploading pcm files directly is recommended. Other formats are transcoded to pcm on the server, which increases the time the API call takes.
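Since a wav (pcm encoding) file is just pcm data behind a header, the raw samples can be extracted without ffmpeg. The sketch below uses Python's standard wave module and does no resampling, so it only helps when the WAV file is already 16000 sample rate, 16-bit, mono. The function name wav_to_pcm is hypothetical.

```python
import wave


def wav_to_pcm(wav_path, pcm_path):
    """Copy the raw sample data of a PCM-encoded WAV file into a headerless .pcm file."""
    with wave.open(wav_path, 'rb') as wf:
        # readframes returns the sample bytes only, without the WAV header
        frames = wf.readframes(wf.getnframes())
    with open(pcm_path, 'wb') as out:
        out.write(frames)
    return len(frames)  # number of bytes written
```

For any other sample rate, bit depth or channel count, use the ffmpeg commands described later in this section.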
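
Audio parameter concepts

Sample rate: Baidu speech recognition generally supports only a 16000 sample rate, that is, 16000 samples per second.
Bit depth: configurable for the lossless formats pcm and wav. Baidu speech recognition uses 16 bits, little-endian, so each sample (1/16000 s of audio) takes 2 bytes.
Channels: Baidu speech recognition supports mono only.

For example, in a pcm file with a 16000 sample rate and 16-bit encoding, each 16-bit (2-byte) sample records 1/16000 s of audio, so 1 s of audio is 2 bytes * 16000 = 32000 bytes.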

The format requirements table in the original post (an image, not reproduced here) is for reference only. For details, refer to the documentation of the API or SDK you are calling.

Audio Player

pcm: play with AudioAudition; select 16000 sample rate, 16-bit PCM, little-endian (the default byte order)
wav, m4a: play with AudioAudition or 完美解码 (PureCodec)
amr: play with 完美解码 (PureCodec)

Conversion command example

wav file to 16k 16bits bit depth mono pcm file

ffmpeg -y  -i 16k.wav  -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm

44100 sampling rate mono 16bits pcm file to 16000 sampling rate 16bits bit depth mono pcm file

ffmpeg -y -f s16le -ac 1 -ar 44100 -i test44.pcm  -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm

Convert mp3 files to 16K 16bits bit depth mono pcm files

ffmpeg -y  -i aidemo.mp3  -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm

// -acodec pcm_s16le  use the pcm_s16le (16-bit) encoder
// -f s16le           save as 16-bit pcm
// -ac 1              mono
// -ar 16000          16000 sample rate

The normal output is as follows:

Input #0, mp3, from 'aaa.mp3':
  Metadata:
    encoded_by      : Lavf52.24.1
  Duration: 00:02:33.05, start: 0.000000, bitrate: 128 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 128 kb/s
// Input audio: MP3 format, 44100 sample rate, stereo (two channels), 16-bit encoding

Output #0, s16le, to '16k.pcm':
  Metadata:
    encoded_by      : Lavf52.24.1
    encoder         : Lavf57.71.100
    Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s

// Output audio: pcm format, 16000 sample rate, mono (one channel), 16 bits
// 256 kb/s = 32 KB/s = 32 B/ms

5. ffmpeg installation

1. Go to the official website: https://ffmpeg.org/download.html
2. Find the Windows version.
3. Click to download, or download directly from this address:
https://www.gyan.dev/ffmpeg/builds/ffmpeg-git-full.7z
4. Add ffmpeg to the environment variables (PATH), then open cmd to check that the ffmpeg command is found.

6. ffmpeg usage instructions

Introduction

One of ffmpeg's functions is converting between audio formats. For everything else, see http://ffmpeg.org/

Linux version: http://www.ffmpeg.org/download.html#build-linux
Linux statically compiled version: https://www.johnvansickle.com/ffmpeg/
Windows version: http://ffmpeg.org/download.html#build-windows

Official ffmpeg documentation: http://ffmpeg.org/ffmpeg.html

Compilation parameters and supported formats

ffmpeg supports pcm and wav (pcm encoding) formats by default. Additional compilation parameters are as follows:

--enable-libopencore-amrnb  read and write amr-nb (8000 sample rate)
--enable-libopencore-amrwb  read amr-wb (16000 sample rate)
--enable-libvo-amrwbenc     write amr-wb (16000 sample rate)
--enable-libmp3lame         write mp3
--enable-libfdk-aac         use libfdk as the aac encoder and decoder

ffmpeg -codecs lists all supported formats:

D..... = Decoding supported    # read
.E.... = Encoding supported    # write
..A... = Audio codec
....L. = Lossy compression
.....S = Lossless compression
DEA..S pcm_s16le            PCM signed 16-bit little-endian
DEA.LS wavpack              WavPack
DEA.L. mp3  MP3 (MPEG audio layer 3)
DEA.L. amr_nb       AMR-NB (Adaptive Multi-Rate NarrowBand)
DEA.L. amr_wb       AMR-WB (Adaptive Multi-Rate WideBand)
DEA.L. aac          AAC (Advanced Audio Coding) (decoders: aac aac_fixed )
DEA.L. aac          AAC (Advanced Audio Coding) (decoders: aac aac_fixed libfdk_aac ) (encoders: aac libfdk_aac )
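To check from a script whether a particular build can read or write a codec, the flag column above can be decoded directly. This is a sketch that assumes each line has the six-character flag field followed by the codec name, as in the listing above; the helper name parse_codec_line is hypothetical.

```python
def parse_codec_line(line):
    """Decode one codec line of `ffmpeg -codecs` output into a dict of capabilities."""
    parts = line.split(None, 2)          # flags, codec name, free-text description
    if len(parts) < 2 or len(parts[0]) != 6:
        raise ValueError('not a codec line: %r' % line)
    flags, name = parts[0], parts[1]
    return {
        'name': name,
        'decode': flags[0] == 'D',       # can read
        'encode': flags[1] == 'E',       # can write
        'audio': flags[2] == 'A',
        'lossy': flags[4] == 'L',
        'lossless': flags[5] == 'S',
        'description': parts[2].strip() if len(parts) > 2 else '',
    }
```

For example, a build without libvo-amrwbenc would be expected to report encode as False for amr_wb.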

7. ffmpeg command

ffmpeg {common parameters} {input audio parameters} {output audio parameters}

ffmpeg -y  -i aidemo.mp3  -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm

Common parameters: -y
Input parameters (an mp3 file): -i aidemo.mp3
Output parameters (16000 sample rate, mono, pcm format): -acodec pcm_s16le -f s16le -ac 1 -ar 16000 16k.pcm

Example: the input is a 32000 Hz mono 16-bit pcm file, so the input parameters (explained below) are "-f s16le -ac 1 -ar 32000 -i test32.pcm". The output is a 16000 Hz mono 16-bit pcm file, so the output parameters (explained below) are "-f s16le -ac 1 -ar 16000 16k.pcm". The common parameter is -y.

Combined as follows:

ffmpeg  -y  -f s16le -ac 1 -ar 32000  -i test32.pcm -f s16le -ac 1 -ar 16000 16k.pcm
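The same three-part structure (common, input, output parameters) can be assembled from Python when conversions need to be scripted. The sketch below only uses the flags shown in this article; the function names build_ffmpeg_cmd and convert are hypothetical, and convert assumes ffmpeg is on the PATH.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, src_is_raw_pcm=False, src_rate=None, dst_rate=16000):
    """Build an ffmpeg command that writes 16-bit mono pcm at dst_rate."""
    cmd = ['ffmpeg', '-y']                                   # common parameters
    if src_is_raw_pcm:
        # Raw pcm has no header, so format, channels and rate must be given.
        if src_rate is None:
            raise ValueError('src_rate is required for raw pcm input')
        cmd += ['-f', 's16le', '-ac', '1', '-ar', str(src_rate)]
    cmd += ['-i', src]                                       # input parameters
    cmd += ['-acodec', 'pcm_s16le', '-f', 's16le',
            '-ac', '1', '-ar', str(dst_rate), dst]           # output parameters
    return cmd


def convert(src, dst, **kwargs):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.check_call(build_ffmpeg_cmd(src, dst, **kwargs))
```

For instance, convert('aidemo.mp3', '16k.pcm') corresponds to the mp3 example above, and convert('test32.pcm', '16k.pcm', src_is_raw_pcm=True, src_rate=32000) to the 32000 Hz example.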

Input audio parameters

The wav, amr, mp3 and m4a formats all have their own headers, which carry information such as the sample rate, encoding and number of channels. pcm is the raw audio data and has no such header. A wav (pcm encoding) file is just a pcm file with the same parameters plus a short file header.

Input in wav, amr, mp3 or m4a format:

-i  test.wav # or test.mp3, or test.amr

Input in pcm format: for pcm you must additionally specify the encoding, sample rate and channel count

-acodec pcm_s16le -f s16le -ac 1 -ar 16000 -i 8k.pcm
// mono, 16000 sample rate, 16-bit encoded pcm file

// s16le = s (signed) 16 (16 bits) le (little-endian)
-acodec pcm_s16le : use the s16le codec
-f s16le : the file format is s16le pcm
-ac 1 : mono
-ar 16000 : 16000 sample rate

Output audio parameters

When the original sampling rate is greater than or close to 16000, it is recommended to use a sampling rate of 16000. A sampling rate of 8000 will reduce the recognition effect. When outputting wav and amr formats, if the output encoder is not specified, ffmpeg will select the default encoder.

Output pcm audio:

-f s16le -ac 1 -ar 16000 16k.pcm
// mono, 16000 sample rate, 16-bit encoded pcm file

Output wav audio:

-ac 1 -ar 16000 16k.wav
// mono, 16000 sample rate, 16-bit pcm-encoded wav file

Output amr-nb audio:

AMR stands for Adaptive Multi-Rate, an audio coding format designed to compress speech efficiently. When bandwidth is not a bottleneck, this format is not recommended, because decompression takes extra time on the Baidu server. The amr-nb format supports only an 8000 sample rate. The higher the bit rate, the better the sound quality, but the larger the file. Available bit rates: 4.75k, 5.15k, 5.9k, 6.7k, 7.4k, 7.95k, 10.2k or 12.2k.

Both the 8000 sample rate and lossy compression reduce recognition accuracy. If the original sample rate is greater than 16000, use the amr-wb format instead.

-ac 1 -ar 8000 -ab 12.2k 8k-122.amr

// 8000 sample rate, 12.2k bit rate

Output amr-wb audio:

The sample rate is 16000. The higher the bit rate, the better the sound quality, but the larger the file. Available bit rates: 6600, 8850, 12650, 14250, 15850, 18250, 19850, 23050 or 23850.

-acodec amr_wb -ac 1 -ar 16000 -ab 23850 16k-23850.amr
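Because AMR only accepts the fixed bit rates listed above, a script that takes an arbitrary target may want to snap it to the nearest allowed value. The sketch below does that with the two lists from this section and builds the matching output parameters; the function names are hypothetical.

```python
# Allowed bit rates in bits per second, taken from the lists in this section.
AMR_NB_BITRATES = [4750, 5150, 5900, 6700, 7400, 7950, 10200, 12200]
AMR_WB_BITRATES = [6600, 8850, 12650, 14250, 15850, 18250, 19850, 23050, 23850]


def nearest_amr_bitrate(target, wideband=False):
    """Return the allowed AMR bit rate closest to target."""
    rates = AMR_WB_BITRATES if wideband else AMR_NB_BITRATES
    return min(rates, key=lambda r: abs(r - target))


def amr_output_args(target_bitrate, wideband=False):
    """Build ffmpeg output parameters for amr-nb (8000) or amr-wb (16000)."""
    rate = nearest_amr_bitrate(target_bitrate, wideband)
    if wideband:
        return ['-acodec', 'amr_wb', '-ac', '1', '-ar', '16000', '-ab', str(rate)]
    return ['-ac', '1', '-ar', '8000', '-ab', str(rate)]
```

The returned list still needs the output file name appended, for example 16k-23850.amr.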

Output m4a file

m4a files generally come from WeChat mini program recordings (see the WeChat mini program recording parameters in the REST API documentation). Converting other formats to m4a is not recommended; convert to the amr lossy format instead and call the REST API.

If you must convert to the m4a format that Baidu supports, see section 10, "Convert to m4a format (AAC encoding)".

Common parameters

-y  overwrite an existing file with the same name
-v  log output level, e.g. -v error or -v quiet

8. Checking the audio format with ffprobe

View the MP3 format information generated by speech synthesis:

ffprobe -v quiet -print_format json -show_streams  aidemo.mp3

It returns the following:

 {
    "streams": [
        {
            "index": 0,
            "codec_name": "mp3", // mp3 format
            "codec_long_name": "MP3 (MPEG audio layer 3)",
            "codec_type": "audio",
            "codec_time_base": "1/16000",
            "codec_tag_string": "[0][0][0][0]",
            "codec_tag": "0x0000",
            "sample_fmt": "s16p",
            "sample_rate": "16000", // 16000 sample rate
            "channels": 1, // mono
            "channel_layout": "mono",
            "bits_per_sample": 0,
            "r_frame_rate": "0/0",
            "avg_frame_rate": "0/0",
            "time_base": "1/14112000",
            "start_pts": 0,
            "start_time": "0.000000",
            "duration_ts": 259096320,
            "duration": "18.360000",
            "bit_rate": "16000",
            "disposition": {
                "default": 0,
                "dub": 0,
                "original": 0,
                "comment": 0,
                "lyrics": 0,
                "karaoke": 0,
                "forced": 0,
                "hearing_impaired": 0,
                "visual_impaired": 0,
                "clean_effects": 0,
                "attached_pic": 0,
                "timed_thumbnails": 0
            }
        }
    ]
}
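The JSON output makes it straightforward to check a file from a script. The sketch below runs the same ffprobe command and keeps only the fields that matter for speech recognition; the function names are hypothetical, probe assumes ffprobe is on the PATH, and real ffprobe output does not contain the // annotations shown above.

```python
import json
import subprocess


def summarize_streams(ffprobe_json):
    """Return (codec_name, sample_rate, channels) for each audio stream."""
    info = json.loads(ffprobe_json)
    return [
        (s.get('codec_name'), int(s.get('sample_rate', 0)), s.get('channels'))
        for s in info.get('streams', [])
        if s.get('codec_type') == 'audio'
    ]


def probe(path):
    """Run ffprobe on path and summarize its audio streams."""
    out = subprocess.check_output(
        ['ffprobe', '-v', 'quiet', '-print_format', 'json', '-show_streams', path])
    return summarize_streams(out.decode('utf-8'))
```

A file that is ready for upload as pcm-encoded wav would be expected to summarize to a single stream with a 16000 sample rate and 1 channel.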

9. Calculation of audio duration of pcm files

Like bmp image files, pcm files store uncompressed data. 16-bit encoding means each audio sample is stored in 2 bytes, much as a bmp file might use 2 bytes to store the RGB color of each pixel. A 16000 sample rate means 16000 samples per second; common audio uses 44100 Hz, that is, 44100 samples per second. Mono means there is only one channel.

From this we can calculate:
1 second of audio at a 16000 sample rate is 2 * 16000 = 32000 bytes, about 32 KB.
1 second of audio at an 8000 sample rate is 2 * 8000 = 16000 bytes, about 16 KB.

If the recording duration is known, you can calculate whether the sampling rate is normal based on the file size.
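The calculation above can be written as a small helper that turns a pcm file size into a duration. This is a sketch; the function name pcm_duration_seconds is hypothetical, and the defaults are the 16000 sample rate, 16-bit, mono settings used throughout this article.

```python
def pcm_duration_seconds(num_bytes, sample_rate=16000, bits=16, channels=1):
    """Duration of raw pcm audio, given its size in bytes."""
    bytes_per_second = sample_rate * (bits // 8) * channels
    return num_bytes / float(bytes_per_second)
```

Passing os.path.getsize('16k.pcm') as num_bytes gives the duration of a file on disk. If the result is about twice the known recording length, the file was probably recorded at 8000 rather than 16000.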

10. Convert to m4a format (AAC encoding)

The amr lossy compression format is recommended; the m4a format is intended for recordings from WeChat mini programs.

You need to download MP4Box to set the brand to mp42:0 (brand mp42, minor version 0); the REST API does not support the M4A brand. Download MP4Box from the link under "Current release" on its download page.

ffmpeg officially recommends the libfdk_aac library, but you have to compile ffmpeg with it yourself following the official documentation. If you use a static build, you can use the aac encoder built into ffmpeg instead.

Compiled libfdk_aac ffmpeg example

ffmpeg -y -f s16le -ac 1 -ar 16000 -i 16k_57test.pcm -c libfdk_aac  -profile:a aac_low -b:a 48000 -ar 16000 -ac 1 16k.m4a
MP4Box -brand mp42:0 16k.m4a  # this step must not be skipped

Example of the aac library that comes with the static version

ffmpeg -y -f s16le -ac 1 -ar 16000 -i 16k_57test.pcm -c aac  -profile:a aac_low -b:a 48000 -ar 16000 -ac 1 16k.m4a 
MP4Box -brand mp42:0 16k.m4a  # this step must not be skipped

Output parameters

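-c          encoder to use: libfdk_aac or aac
-profile:a  always aac_low (AAC-LC); the REST API does not support others such as HE-AAC, LD or ELD
-b:a        bit rate; for a 16000 sample rate the CBR range is 24000-96000. Higher values mean less distortion but larger files
-ar         sample rate, normally fixed at 16000
-ac         fixed at 1 (mono)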

View m4a format

> ffprobe 16k.m4a

Origin blog.csdn.net/qq236237606/article/details/127305795