FastASR+FFmpeg (audio and video development + speech recognition)

        To use these tools well, you need to know not only how to operate them, but also some basic concepts.

1. Overview of audio and video processing

        1. Understanding of multimedia files

        

        1.1 Structural Analysis

        A multimedia file can essentially be understood as a container.

        A container holds multiple streams.

        Each stream is produced by a different encoder.

        A stream is divided into many packets, and packets carry frames (the frame is the smallest unit in audio and video processing).

        1.2 Package format

        An encapsulation format (also called a container) packs the encoded and compressed video, audio and subtitle streams into one file according to a defined scheme, so that player software can play it. Generally, a video file's suffix indicates its encapsulation format: different formats use different suffixes (xxx.mp4, xxx.flv).

        1.3 Audio and video synchronization

        Audio Master: Sync video to audio

        Video Master: Sync Audio to Video

        External Clock Master: Synchronize audio and video to an external clock

        1.4 Principles of audio and video recording

        

        1.5 Principles of audio and video playback

         


     2. Basic concepts of audio and video

       2.1 Sound

         Sampling digitizes natural sound along the time axis: instantaneous values of the analog signal x(t) are taken point by point at a fixed time interval T. The higher the sampling rate, the more faithfully the sound is reproduced, the better the quality, and the more space it occupies.

        Quantization: approximate the original continuous amplitude with a finite set of values, turning the analog signal's continuous amplitude into discrete values at fixed intervals. [How finely a sample value is expressed is the quantization precision. For example, 8-bit quantization can represent 256 different values, while 16-bit CD quality can represent 65536 values, ranging from -32768 to 32767.]

Let's calculate this value:

 

        Coding: according to a certain rule, express the quantized values as binary numbers and convert them into a binary or multi-valued digital signal stream. The digital signal obtained this way can be transmitted over digital channels such as cable or satellite links; the receiving end reverses the process.

How to understand coding:

        At school you probably heard a teacher explain Huffman coding; the principle here is the same. Some scheme is used to make each value unique, and effective coding can improve security, compress data, and so on.

        PCM: the digitization process above is also called pulse code modulation, and the raw data format of audio is usually PCM data. Describing a piece of PCM data requires four parameters: sampling rate, bit depth, byte order, and number of channels.

        Sampling rate: How many samples are taken per second, in Hz.

Wireless broadcast: 22050 (22.05 kHz)

CD quality: 44100 (44.1 kHz)

Digital TV (DVD): 48000 (48 kHz)

Blu-ray (HD DVD): 96000 (96 kHz)

Blu-ray (HD DVD): 192000 (192 kHz)

        Bit depth: how many binary bits are used to describe each sample, commonly 16.

        Byte order: whether the audio PCM data is stored big-endian or little-endian. For efficient processing, little-endian storage is usually used.

        Number of channels: how many channels the PCM data contains, e.g. mono or stereo.

        Bit rate: the number of bits transmitted per second, in bps (bits per second); an indirect measure of sound quality. Bit rate of uncompressed audio = sampling rate * bit depth * number of channels.

        Bit rate (compressed): the bit rate of compressed audio data. The higher the bit rate, the lower the compression ratio, the better the sound quality, and the larger the compressed file. Bit rate = audio file size / duration.

FM quality: 96 kbps

Normal quality audio: 128-160 kbps

CD quality: 192 kbps

High quality audio: 256-320 kbps

        

        Frame: the number of samples per encoded unit. For example, MP3 usually encodes 1152 samples per frame, and AAC usually encodes 1024 samples per frame.

        Frame length: can refer to the playback duration of each frame: duration (s) = samples per frame / sampling rate (Hz). It can also refer to the byte length of each compressed frame.

        Audio coding: the main function is to compress audio sample data (PCM, etc.) into an audio stream, thereby reducing the amount of audio data, which is biased towards storage and transmission.

MP3

A digital audio encoding and lossy compression format used to drastically reduce the amount of audio.

AAC

AAC has a higher compression ratio than MP3, and the same size audio file, AAC has higher sound quality.

WMA

Contains both lossy and lossless compression formats natively

2.2 Image

        An image is a vivid depiction or representation of an objective thing and the most commonly used information carrier in human activity. In other words, an image represents an objective object and contains information about the object being described; it is people's primary source of information.

        Pixel: a screen divides its effective area into many small grids, each displaying a single color; this smallest element of imaging is called a "pixel".

        Resolution: the number of pixels along the screen's width and height, generally written A x B. The higher the resolution, the smaller each pixel's area, and the smoother and finer the display.

        RGB representation: 8 bits represent one sub-pixel, with values in [0~255] or [00~FF]. The image format RGBA_8888 uses four 8-bit components per pixel, while RGB_565 uses 5+6+5 bits per pixel. A 1280x720 picture in RGBA_8888 format takes 1280 x 720 x 32 bits, so the raw data of an image is very large. A 90-minute movie at 25 frames per second: 90 * 60 * 25 * 1280 * 720 * 32 bit = 463.48 GB.

        YUV representation: YUV is another color encoding method, and the raw data of video is generally expressed in a YUV format. Y represents the luminance, also known as the gray value; U and V represent the chrominance, describing the color and saturation of the image and specifying a pixel's color.

        Luminance: built from the RGB input signal by summing weighted portions of the R, G and B components.

        Chroma: defines the hue and saturation of the color, represented by Cr and Cb (C is short for "component"). Cr reflects the difference between the red part of the RGB input signal and the luminance; Cb reflects the difference between the blue part of the RGB input signal and the luminance.

        2.3 Video

        Because of the structure of the human eye, when pictures switch quickly the previous picture lingers (persistence of vision), creating the feeling of continuous motion. A video is therefore a sequence of pictures.

        Video bit rate: the amount of data a video file uses per unit time, also called the code rate. The higher the bit rate, the more data per unit time and the higher the fidelity of the stream.

        Video frame rate: "25 frames" usually refers to the frame rate, i.e. 25 frames are displayed per second. The higher the frame rate, the smoother the viewing experience.

        Video resolution: the familiar 640x480 or 1920x1080; resolution determines the size of the video image.

        I-frame (key frame): a frame generated without reference to other pictures; during decoding the complete image is reconstructed from it alone.

        Video encoding: the purpose of encoding is compression, making videos smaller for storage and transmission. Two main international bodies define video codec standards: ITU-T, which produced H.261, H.263, H.263+, H.264, etc., and the International Organization for Standardization (ISO), which produced MPEG-1, MPEG-2, MPEG-4, etc.

WMV: a streaming media format from Microsoft, an upgrade and extension of its ASF format. At the same video quality, WMV files can be played while downloading, so the format suits playback and transmission over the Internet.

VP8: WebM / VPx (VP6, VP7, VP8, VP9) from On2; this codec was designed for web video.

WebRTC: in May 2010 Google acquired the VoIP software developer Global IP Solutions for approximately $68.2 million, thereby obtaining the company's WebRTC technology. WebRTC integrates VP8 and VP9.

AV1: an open, royalty-free video coding format designed for transmitting video over the Internet.

AVS: China's second-generation source coding standard with independent intellectual property rights, short for the "Advanced Audio and Video Coding for Information Technology" series of standards. It comprises four main technical standards (system, video, audio, and digital rights management) plus conformance tests and other supporting standards.

H.265: HEVC compresses significantly better than the H.264 codec, roughly twice as efficiently; a video of the same visual quality takes about half the space.

VP9: an open, royalty-free video coding standard developed by Google, regarded as the next-generation successor to VP8.

3. Commonly used audio and video processing third-party libraries

3.2.1 Basic concepts

        FFmpeg (Fast Forward MPEG) is the world's leading multimedia framework, capable of decoding (decode), encoding (encode), transcoding (transcode), muxing (mux), demuxing (demux), streaming (stream), filtering (filter) and playing almost all multimedia files created by humans and machines.

3.2.2 The main basic components of FFmpeg

        FFmpeg's encapsulation module AVFormat: AVFormat implements most media encapsulation formats, including muxing and demuxing, such as the MP4, FLV, MKV and TS file formats and the RTMP, RTSP, MMS and HLS network protocol formats. Whether FFmpeg supports a given format depends on whether the library for that format was included at compile time.

       FFmpeg's codec module AVCodec: AVCodec includes most commonly used codecs, supporting both encoding and decoding. Besides built-in formats such as MPEG4, AAC and MJPEG, it also supports third-party codecs such as H.264 (x264 encoder), H.265 (x265 encoder) and MP3 (libmp3lame encoder).

        FFmpeg's filter module AVFilter: the AVFilter library provides a general filtering framework for audio, video, subtitles and more. In AVFilter, a filter can have multiple inputs and multiple outputs.

        FFmpeg's video image conversion calculation module swscale: The swscale module provides a high-level image conversion API, which can scale images and convert pixel formats.

        FFmpeg's audio conversion module swresample: swresample provides an audio resampling API that supports audio resampling and channel layout adjustment.

3.2.3 Advantages and disadvantages of FFmpeg

  • High portability: it can be compiled, run, and tested via FATE (the FFmpeg automated test environment) on Linux, macOS, Windows and other systems.
  • High performance: assembly-level optimized implementations are provided for most mainstream processors such as x86, ARM, MIPS and PPC.
  • High security: FFmpeg reviewers always consider security, and security bugs in released versions are fixed and updated as quickly as possible.
  • Ease of use: the APIs provided by FFmpeg are commented, and official documentation is available.
  • Format diversity: FFmpeg supports decoding, encoding, muxing, and demuxing of a great many media formats, whether very old or relatively new.

Disadvantages:

  • Filenames with spaces are not recognized.
  • When encoding with FFmpeg, only the pts field of AVFrame needs to be specified for timestamps.

3.2.4 FFmpeg installation and configuration

Windows

Easy.

Linux

Method 1 (yum install; note this version is somewhat old)

  1. Update the system

sudo yum install epel-release -y

sudo yum update -y

  2. Import the key and set up the repository

sudo rpm --import http://li.nux.ro/download/nux/RPM-GPG-KEY-nux.ro

sudo rpm -Uvh http://li.nux.ro/download/nux/dextop/el7/x86_64/nux-dextop-release-0-5.el7.nux.noarch.rpm

  3. Install ffmpeg

yum -y install ffmpeg ffmpeg-devel

  4. Check the version

ffmpeg -version

Method 2 (build from source)

First download the source:

git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg

Enter the ffmpeg directory and run the following in order:

cd ffmpeg

./configure

make

make install

Copy the built ffmpeg to the bin directory:

cp ffmpeg /usr/bin/ffmpeg

Check the version:

ffmpeg -version

 3.2.5 Using FFmpeg from the command line

Get media file information

ffmpeg -i <file path> -hide_banner

Example: ffmpeg -i video_file.mp4 -hide_banner

Note: -hide_banner hides FFmpeg's own banner and shows only file-related information (codecs, streams, etc.).

Convert media files (convert freely between media formats)

Note: ffmpeg guesses the format from the file suffix.

ffmpeg -i <input path> <output path>

Example: ffmpeg -i video_input.mp4 video_output.avi

ffmpeg -i video_input.webm video_output.flv

ffmpeg -i audio_input.mp3 audio_output.ogg

ffmpeg -i audio_input.wav audio_output.flac

Parameter: -qscale 0 preserves the original video quality.

Extract audio from a video

ffmpeg -i <video path> -vn <output audio path> -hide_banner

Parameters:

-vn drop the video, keeping only audio

-ab set the encoding bit rate (common rates: 96k, 128k, 192k, 256k, 320k)

-ar sampling rate (22050, 44100, 48000)

-ac number of channels

-f audio format (usually detected automatically)

Mute a video (video only)

ffmpeg -i video_input.mp4 -an video_output.mp4

Note: the -an flag discards all audio parameters, since no audio is produced.

Extract screenshots from a video file

ffmpeg -i <video file> -r <frame rate> -f <output format> <output file>

Example: ffmpeg -i video.mp4 -r 1 -f image2 image-%3d.png

Parameters:

-r frame rate (how many images to export per second, default 25)

-f output format (image2 means an image2 sequence)

Change video resolution or aspect ratio

ffmpeg -i <video file> -s <resolution> -c:a copy -aspect <width:height> <output file>

Example: ffmpeg -i video_input.mov -s 1024x576 -c:a copy video_output.mp4

Parameters:

-s scale the video

-c:a copy keeps the audio encoding unchanged

-aspect change the aspect ratio

Add a cover image to audio (audio to video)

Very useful when you want to upload audio to a site that only accepts video.

Example: ffmpeg -loop 1 -i image.jpg -i audio.wav -c:v libx264 -c:a aac -strict experimental -b:a 192k -shortest output.mp4

-c:v video codec

-c:a audio codec

Note: with FFmpeg 4.x and later, -strict experimental is no longer needed.

Add subtitles to a video

ffmpeg -i video.mp4 -i subtitles.srt -c:v copy -c:a copy -preset veryfast -c:s mov_text -map 0 -map 1 output.mp4

II. Extracting audio from a video

1. FFmpeg

Via the command line

ffmpeg -i <video path> -vn <output audio path> -hide_banner

Parameters:

-vn drop the video, keeping only audio

-ab set the encoding bit rate (common rates: 96k, 128k, 192k, 256k, 320k)

-ar sampling rate (22050, 44100, 48000)

-ac number of channels

-f audio format (usually detected automatically)

Example:

Via the provided API


bool AVInterface::extractAudio(const char* src, const char* dstDir)
{
	if (NULL == src || NULL == dstDir)
	{
		printf("Ffmpeg::extractAudio[ERROR]::invalid argument, check the file paths\n");
		return false;
	}

	int ret = 0;

	// keep the source file path
	const char* src_fileName = src;

	// 1. Get the media file's global context

	// 1.1 Define the AVFormatContext container
	AVFormatContext* pFormatCtx = NULL;      // AVFormatContext describes the basic layout of a media file or stream
	pFormatCtx = avformat_alloc_context();   // allocate memory for pFormatCtx

	// 1.2 Open the media file and read its header into pFormatCtx
	ret = avformat_open_input(&pFormatCtx, src_fileName, NULL, NULL);
	if (ret < 0)
	{
		printf("Ffmpeg::extractAudio[ERROR]::failed to open the media file\n");
		return false;
	}

	// 2. Probe the stream information

	// 2.1 Probe for streams; if found, their info is stored in pFormatCtx
	ret = avformat_find_stream_info(pFormatCtx, NULL);
	if (ret < 0)
	{
		printf("Ffmpeg::extractAudio[ERROR]::no stream information in the file\n");
		return false;
	}

	av_dump_format(pFormatCtx, 0, src_fileName, 0);    // print container format and stream info

	// 2.2 Look for an audio stream (we only extract audio) and get its index
	int audio_stream_index = -1;
	audio_stream_index = av_find_best_stream(pFormatCtx, AVMEDIA_TYPE_AUDIO, -1, -1, NULL, 0);
	if (audio_stream_index < 0)
	{
		printf("Ffmpeg::extractAudio[ERROR]::no audio stream in the file\n");
		return false;
	}

	// 3. Define the output container
	AVFormatContext* pFormatCtx_out = NULL;    // output format context
	const AVOutputFormat* pFormatOut = NULL;   // output container format
	AVPacket packet;

	// output file path
	char szFilename[256] = { 0 };
	snprintf(szFilename, sizeof(szFilename), "%s/ffmpeg-music.aac", dstDir);

	// 3.1 Initialize the container

	// initialize the packet's fields
	av_init_packet(&packet);

	// allocate pFormatCtx_out and initialize it from the output file name
	avformat_alloc_output_context2(&pFormatCtx_out, NULL, NULL, szFilename);

	// the resulting container format: AAC
	pFormatOut = pFormatCtx_out->oformat;

	// 4. Find the audio stream and copy its parameters to the output stream
	for (unsigned int i = 0; i < pFormatCtx->nb_streams; ++i)   // nb_streams: number of streams
	{
		// stream structures holding stream-related information
		AVStream* out_stream = NULL;                           // output stream
		AVStream* in_stream  = pFormatCtx->streams[i];         // input stream
		AVCodecParameters* in_codecpar = in_stream->codecpar;  // codec parameters

		// only take the audio stream
		if (in_codecpar->codec_type == AVMEDIA_TYPE_AUDIO)
		{
			// create the output stream
			out_stream = avformat_new_stream(pFormatCtx_out, NULL);
			if (NULL == out_stream)
			{
				printf("Ffmpeg::extractAudio::[ERROR]failed to create the output stream\n");
				return false;
			}

			// Copy the codec parameters; do not copy directly if you need to transcode.
			// Here we only extract audio, so a straight copy is fine.
			ret = avcodec_parameters_copy(out_stream->codecpar, in_codecpar);
			if (ret < 0)
			{
				printf("Ffmpeg::extractAudio::[ERROR]failed to copy codec parameters\n");
				return false;
			}

			out_stream->codecpar->codec_tag = 0;
			break;  // we only need the audio stream, so stop here
		}
	}

	av_dump_format(pFormatCtx_out, 0, szFilename, 1);

	// open the output file unless the muxer works without one
	if (!(pFormatOut->flags & AVFMT_NOFILE))
	{
		ret = avio_open(&pFormatCtx_out->pb, szFilename, AVIO_FLAG_WRITE);
		if (ret < 0)
		{
			printf("Ffmpeg::extractAudio::[ERROR]avio_open: failed to open the output file\n");
			return false;
		}
	}

	// write the media file header
	ret = avformat_write_header(pFormatCtx_out, NULL);
	if (ret < 0)
	{
		printf("Ffmpeg::extractAudio::[ERROR]failed to write the file header\n");
		return false;
	}

	// extract the audio packet by packet
	while (av_read_frame(pFormatCtx, &packet) >= 0)
	{
		if (packet.stream_index == audio_stream_index)
		{
			AVStream* in_stream  = pFormatCtx->streams[packet.stream_index];
			AVStream* out_stream = pFormatCtx_out->streams[0];

			// rescale timestamps from the input time base to the output time base
			packet.pts = av_rescale_q_rnd(packet.pts, in_stream->time_base, out_stream->time_base, (AVRounding)(AV_ROUND_INF|AV_ROUND_PASS_MINMAX));
			packet.dts = packet.pts;
			packet.duration = av_rescale_q(packet.duration, in_stream->time_base, out_stream->time_base);
			packet.pos = -1;
			packet.stream_index = 0;

			// write the packet to the output media file
			av_interleaved_write_frame(pFormatCtx_out, &packet);
		}
		// drop the reference to avoid a memory leak
		av_packet_unref(&packet);
	}

	// write the trailer
	av_write_trailer(pFormatCtx_out);

	// release resources
	avio_close(pFormatCtx_out->pb);
	avformat_free_context(pFormatCtx_out);
	avformat_close_input(&pFormatCtx);

	return true;
}

3. Performance comparison

Video length: 5 s / 5 min / 30 min

Extraction time: 0.087017 s / 0.138014 s / 0.875926 s

III. Extracting images from a video file

1. FFmpeg

Via the command line

ffmpeg -i <video file> -r <frame rate> -f <output format> <output file>

Example: ffmpeg -i video.mp4 -r 1 -f image2 image-%3d.png

Parameters:

-r frame rate (how many images to export per second, default 25)

-f output format (image2 means an image2 sequence)

Example:

 

Via the provided API


bool AVInterface::extracPictrue(const char* src, const char* dstDir, int num)
{
    if (NULL == src || NULL == dstDir)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::invalid argument, check the file paths\n");
        return false;
    }

    int ret = 0;

    // keep the source file path
    const char* src_fileName = src;

    // 1. Get the media file's global context

    // 1.1 Define the AVFormatContext container
    AVFormatContext* pFormatCtx = NULL;       // AVFormatContext describes the basic layout of a media file or stream
    pFormatCtx = avformat_alloc_context();    // allocate memory for pFormatCtx

    // 1.2 Open the media file and read its header into pFormatCtx
    ret = avformat_open_input(&pFormatCtx, src_fileName, NULL, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::failed to open the media file\n");
        return false;
    }

    // 2. Probe the stream information

    // 2.1 Probe for streams; if found, their info is stored in pFormatCtx
    ret = avformat_find_stream_info(pFormatCtx, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::no stream information in the file\n");
        return false;
    }

    av_dump_format(pFormatCtx, 0, src_fileName, 0);      // print for inspection

    // 2.2 Look for a video stream (we extract pictures here) and get its index
    int video_stream_index = -1;
    video_stream_index = av_find_best_stream(pFormatCtx, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
    if (video_stream_index < 0)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::no video stream in the file\n");
        return false;
    }

    // 3. Find the matching decoder: media files are compressed, so we must decode before processing the content

    // 3.1 Define the decoder containers
    AVCodecContext* pCodeCtx = NULL;          // AVCodecContext describes the codec configuration
    const AVCodec* pCodec = NULL;             // AVCodec stores the decoder information

	pCodeCtx = avcodec_alloc_context3(NULL);  // allocate the decoder context

    // 3.2 Look up the decoder
    AVStream* pStream = pFormatCtx->streams[video_stream_index];  // the video stream
    pCodec = avcodec_find_decoder(pStream->codecpar->codec_id);   // find the decoder for the video stream
    if (NULL == pCodec)
    {
        printf("video decoder not found\n");
        return false;
    }

	// fill the decoder context from the stream parameters
	ret = avcodec_parameters_to_context(pCodeCtx, pStream->codecpar);
	if (ret < 0)
	{
		printf("failed to initialize the decoder context\n");
		return false;
	}

    // 3.3 Open the decoder
    ret = avcodec_open2(pCodeCtx, pCodec, NULL);
    if (ret < 0)
    {
        printf("failed to open the codec\n");
        return false;
    }

	AVFrame* pFrame = NULL;
	pFrame = av_frame_alloc();
	if (NULL == pFrame)
	{
		printf("av_frame_alloc is error\n");
		return false;
	}

	int index = 0;

	AVPacket avpkt;

	while (av_read_frame(pFormatCtx, &avpkt) >= 0)
	{
		if (avpkt.stream_index == video_stream_index)
		{
			ret = avcodec_send_packet(pCodeCtx, &avpkt);
			if (ret < 0)
			{
				av_packet_unref(&avpkt);
				continue;
			}

			while (avcodec_receive_frame(pCodeCtx, pFrame) == 0)
			{
				SaveFramePicture(pFrame, dstDir, index);
			}
			index++;

			if (index == num)
			{
				av_packet_unref(&avpkt);
				break;
			}
		}

		av_packet_unref(&avpkt);
	}

	av_frame_free(&pFrame);
    avcodec_close(pCodeCtx);
    avformat_close_input(&pFormatCtx);

    return true;
}


bool AVInterface::SaveFramePicture(AVFrame* pFrame, const char* dstDir, int index)
{
    char szFilename[256] = {0};
    snprintf(szFilename, sizeof(szFilename), "%s/ffmpeg-%d.jpg", dstDir, index);

    int ret = 0;

	int width  = pFrame->width;
	int height = pFrame->height;

    // 1. Set up the output format context for the picture
    AVCodecContext*  pCodeCtx = NULL;
    AVFormatContext* pFormatCtx = NULL;
    pFormatCtx = avformat_alloc_context();

    // 2. Choose the container format
	// MJPEG compresses each frame of the video signal with the JPEG algorithm
    pFormatCtx->oformat = av_guess_format("mjpeg", NULL, NULL);  // find the best match among registered output formats

    // 3. Create the AVIOContext: open the output file
    ret = avio_open(&pFormatCtx->pb, szFilename, AVIO_FLAG_READ_WRITE); // read-write mode
    if (ret < 0)
    {
        printf("avio_open is error");
        return false;
    }

    // create a new stream
    AVStream* pAVStream = NULL;
    pAVStream = avformat_new_stream(pFormatCtx, 0);
    if (pAVStream == NULL)
    {
        printf("avformat_new_stream\n");
        return false;
    }

    AVCodecParameters* parameters = NULL;                    // encoder parameter structure
    parameters = pAVStream->codecpar;                        // configure for mjpeg
    parameters->codec_id = pFormatCtx->oformat->video_codec; // the MJPEG video codec
    parameters->codec_type = AVMEDIA_TYPE_VIDEO;             // media type
    //parameters->format = AV_PIX_FMT_BGR24;                 // alternative pixel format
	parameters->format = AV_PIX_FMT_YUVJ420P;                // MJPEG expects full-range YUV
    parameters->width  = pFrame->width;                      // picture width
    parameters->height = pFrame->height;                     // picture height

    // find the matching encoder
    const AVCodec* pCodec = avcodec_find_encoder(pAVStream->codecpar->codec_id);
    if (NULL == pCodec)
    {
        printf("avcodec_find_encoder is error\n");
        return false;
    }

    // allocate the encoder context
    pCodeCtx = avcodec_alloc_context3(pCodec);
    if (NULL == pCodeCtx)
    {
        printf("avcodec_alloc_context3 is error\n");
        return false;
    }

    // fill the encoder context from the parameters
	ret = avcodec_parameters_to_context(pCodeCtx, parameters);
	if (ret < 0)
    {
        printf("avcodec_parameters_to_context is error\n");
        return false;
    }

	AVRational avrational = {1, 25};
	pCodeCtx->time_base = avrational;

	// open the encoder
    ret = avcodec_open2(pCodeCtx, pCodec, NULL);
    if (ret < 0)
    {
        printf("avcodec_open2 is error\n");
        return false;
    }

    // write the container header
    ret = avformat_write_header(pFormatCtx, NULL);
    if (ret < 0)
    {
        printf("avformat_write_header is error\n");
        return false;
    }

    // allocate a sufficiently large AVPacket
    int y_size = width * height;    // pixel count
    AVPacket pkt;
    av_new_packet(&pkt, y_size * 3);

    // encode the frame
    ret = avcodec_send_frame(pCodeCtx, pFrame);
    if (ret < 0)
    {
        printf("avcodec_send_frame is error\n");
        return false;
    }

    // fetch the encoded packet
    ret = avcodec_receive_packet(pCodeCtx, &pkt);
    if (ret < 0)
    {
        printf("avcodec_receive_packet is error\n");
        return false;
    }

    ret = av_write_frame(pFormatCtx, &pkt);
    if (ret < 0)
    {
        printf("av_write_frame is error\n");
        return false;
    }

	av_packet_unref(&pkt);
	av_write_trailer(pFormatCtx);
	avcodec_close(pCodeCtx);
	avio_close(pFormatCtx->pb);
	avformat_free_context(pFormatCtx);

    return true;
}

 3. Performance comparison

Video length: 5 s / 5 min / 30 min

10 images: 0.295322 s / 0.146283 s / 0.151467 s

100 images: 1.263546 s / 1.226884 s / 1.190490 s

All frames: 2.670444 s (170 frames) / 96.951886 s (7514 frames) / 119.161211 s (10000 frames)

 IV. Extracting text from an audio file

1. Baidu Intelligent Cloud speech recognition

Baidu's speech service currently supports speech recognition, speech synthesis and voice wake-up. It accepts the pcm, wav and amr formats and audio up to 60 seconds long; it is completely free, with no limit on the number of calls.

1) Offline speech recognition

Baidu's offline recognition currently only supports Android and iOS: an integrated offline/online solution shipped as a JAR package + SO library on Android, and as a static library on iOS.

2) Online speech recognition

Called through an API from Android, iOS, C#, Java, Node, PHP, Python, C++ and so on; since it is an API, every development language is effectively supported.

1.1 Pros and cons of Baidu Intelligent Cloud

  • Supports Mandarin, English, Cantonese, Sichuanese, and far-field Mandarin
  • Only recognizes clips up to 60 seconds long
  • All development languages supported
  • Baidu's Linux offline SDK supports CentOS and Ubuntu 14/16
  • Requires a Baidu Cloud console account

1.2 Installing and configuring Baidu Intelligent Cloud

Install the necessary dependencies: curl (with SSL support), jsoncpp, openssl.

# install libcurl

sudo apt-get install libcurl4-openssl-dev

# install jsoncpp

sudo apt-get install libjsoncpp-dev

To use the development package directly:

  • Download the C++ SDK archive from the official site: SDK下载_文字识别SDK_语音识别SDK-百度AI开放平台 (baidu.com)
  • Unzip the downloaded aip-cpp-sdk-version.zip; the files are header files containing the implementation code.
  • Install the dependency libraries libcurl (with https support), openssl, and jsoncpp (version > 1.6.2; 0.x versions are not supported).
  • Compile with C++11 support (add -std=c++11 for gcc/clang) and link against -lcurl, -lcrypto, -ljsoncpp.
  • #include "speech.h" in your source to use the classes and methods in the aip namespace.

1.4 Baidu Intelligent Cloud usage example

You can create a client as follows:

#include "speech.h"

    // set the APPID/AK/SK

    std::string app_id = "XXX";

    std::string api_key = "XXX";

    std::string secret_key = "XXX";

    aip::Speech client(app_id, api_key, secret_key);

        In the code above, APP_ID is created in the Baidu Cloud console; API_KEY and SECRET_KEY are assigned by the system after the application is created. All are strings, used to identify the user and sign requests, and can be found in the application list of the AI service console.

Upload a whole voice clip to the remote service for recognition:

void asr(aip::Speech client)
{
    // call the interface without optional parameters
    std::string file_content;
    aip::get_file_content("./assets/voice/16k_test.pcm", &file_content);
    Json::Value result = client.recognize(file_content, "pcm", 16000, aip::null);

    // high-speed ("pro") version:
    // Json::Value result = client.recognize_pro(file_content, "pcm", 16000, aip::null);

    // to override or add parameters:
    std::map<std::string, std::string> options;
    options["dev_pid"] = "1537";
    result = client.recognize(file_content, "pcm", 16000, options);
}

Sample responses:

// success
{
    "err_no": 0,
    "err_msg": "success.",
    "corpus_no": "15984125203285346378",
    "sn": "481D633F-73BA-726F-49EF-8659ACCC2F3D",
    "result": ["北京天气"]
}

// failure
{
    "err_no": 2000,
    "err_msg": "data empty.",
    "sn": null
}

SpeechRecognition open-source offline speech recognition

       SpeechRecognition is a Python library focused on converting speech to text; among the engines it wraps is the Google Web Speech API. Engines such as wit and api.ai provide built-in features beyond basic recognition, like natural language processing to identify a speaker's intent.

Pros and cons of SpeechRecognition

  • Wraps several mainstream speech APIs, so it is very flexible
  • The Google Web Speech API works with a default API key hard-coded into the SpeechRecognition library, so it can be used without registering
  • Very easy to use
  • It is a Python speech recognition library
  • Recognition of Chinese is not especially good

Installing and configuring SpeechRecognition

pip install SpeechRecognition (or: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple SpeechRecognition)

yum install python3-devel

yum install pulseaudio-libs-devel

yum install alsa-lib-devel

pip install pocketsphinx

Configuring Chinese speech recognition data

Download from:

https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/

Choose:

Mandarin -> cmusphinx-zh-cn-5.2.tar.gz

Install the Chinese language pack:

cd /usr/local/python3.6.8/lib/python3.6/site-packages/speech_recognition/pocketsphinx-data

tar zxvf cmusphinx-zh-cn-5.2.tar.gz

mv cmusphinx-zh-cn-5.2 zh-cn

cd zh-cn

mv zh_cn.cd_cont_5000 acoustic-model

mv zh_cn.lm.bin language-model.lm.bin

mv zh_cn.dic pronounciation-dictionary.dict

Set up the environment:

cd /usr/local/python3.6.8/lib/python3.6/site-packages/speech_recognition/pocketsphinx-data

tar zxvf py36asr.tar.gz

source ./py36asr/bin/activate

SpeechRecognition usage example

Recognition example:

[root@localhost pocketsphinx-data]# pwd

/usr/local/python3.6.8/lib/python3.6/site-packages/speech_recognition/pocketsphinx-data

[root@localhost pocketsphinx-data]# ls

cmusphinx-zh-cn-5.2.tar.gz  py36asr         test1.py   test2.wav  zh-cn.tar.gz

en-US                       py36asr.tar.gz  test1.wav  zh-cn

Program example:

# -*- coding: utf-8 -*-
# /usr/bin/python

import speech_recognition as sr

r = sr.Recognizer()

test = sr.AudioFile("test1.wav")

with test as source:
    audio = r.record(source)

type(audio)

c = r.recognize_sphinx(audio, language='zh-cn')

print(c)

FastASR speech recognition

        FastASR is a project implementing ASR inference in C++. It has few dependencies, installs easily, and infers quickly. The supported models are optimized from Google's Transformer model, and the datasets are open-source WenetSpeech (10000+ hours) or Alibaba's private dataset (60000+ hours), so recognition quality is very good, comparable to many commercial ASR products.

  • Streaming models: the input is simulated as a speech stream and recognition results are returned in real time, at some cost in accuracy.

Name: conformer_online; Source: paddlespeech; Dataset: WenetSpeech (10000 h); Model: conformer_online_wenetspeech-zh-16k

  • Non-streaming models: recognition happens sentence by sentence, so latency is higher but accuracy is better.

Name: paraformer; Source: Alibaba DAMO Academy; Dataset: private (60000 h); Model: Paraformer-large; Language: en+zh

Name: k2_rnnt2; Source: kaldi2; Dataset: WenetSpeech (10000 h); Model: Pruned_transducer_stateless2; Language: zh

Name: Conformer_online; Source: paddlespeech; Dataset: WenetSpeech (10000 h); Model: Conformer_online_wenetspeech-zh-16k; Language: zh

        The models above are implemented on deep learning frameworks (PaddlePaddle and PyTorch) and perform well; even on a personal computer they can meet real-time requirements (for 10 s of speech, inference takes less than 10 s).

Pros and cons of FastASR

  • Language advantage: unlike Python, C++ is a compiled language, and the compiler can optimize for each CPU platform according to the build options, making deployment on different CPUs more efficient and making full use of CPU resources.
  • Independent implementation: does not depend on existing deep learning frameworks such as pytorch, paddle or tensorflow.
  • Few dependencies: only two third-party libraries are used, libfftw and libopenblas, so the project is very portable across platforms.

  • No quantized or compressed models yet.
  • Only C++ and Python are supported.

Installing and configuring FastASR

  1. Install the dependency library libfftw3

sudo apt-get install libfftw3-dev libfftw3-single3

  2. Install the dependency library libopenblas

sudo apt-get install libopenblas-dev

  3. Install the Python environment

sudo apt-get install python3 python3-dev

  4. Download the latest source

git clone https://github.com/chenkui164/FastASR.git

  5. Build the latest source

cd FastASR/

mkdir build

cd build

cmake -DCMAKE_BUILD_TYPE=Release ..

make

  6. Build the Python whl package

cd FastASR

python -m build

  7. Download the pretrained models

Downloading the paraformer pretrained model

cd ../models/paraformer_cli

1. Download the pretrained model from the modelscope site:

wget --user-agent="Mozilla/5.0" -c "https://www.modelscope.cn/api/v1/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/repo?Revision=v1.0.4&FilePath=model.pb"

2. Rename it:

mv repo\?Revision\=v1.0.4\&FilePath\=model.pb model.pb

3. Convert the Python model for C++:

../scripts/paraformer_convert.py model.pb

4. Check that the md5 equals c77bce5758ebdc28a9024460e48602

md5sum -b wenet_params.bin

Downloading the K2_rnnt2 pretrained model

cd ../models/k2_rnnt2_cli

1. Download the pretrained model from huggingface:

wget -c https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless2/resolve/main/exp/pretrained_epoch_10_avg_2.pt

2. Convert the Python model for C++:

../scripts/k2_rnnt2_convert.py pretrained_epoch_10_avg_2.pt

3. Check that the md5 equals 33a941f3c1a20a5adfb6f18006c11513

 md5sum -b wenet_params.bin

Downloading the PaddleSpeech pretrained model

1. Download the pretrained model from the PaddleSpeech site:

wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz

2. Extract the archive into the wenetspeech directory:

mkdir wenetspeech

tar -xzvf asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz -C wenetspeech

3. Convert the Python model for C++:

../scripts/paddlespeech_convert.py wenetspeech/exp/conformer/checkpoints/wenetspeech.pdparams

4. Check that the md5 equals 9cfcf11ee70cb9423528b1f66a87eafd

md5sum -b wenet_params.bin

Downloading the streaming pretrained model

cd ../models/paddlespeech_stream

1. Download the pretrained model from the PaddleSpeech site:

wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz

2. Extract the archive into the wenetspeech directory:

mkdir wenetspeech

tar -xzvf asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz -C wenetspeech

3. Convert the Python model for C++:

../scripts/paddlespeech_convert.py wenetspeech/exp/chunk_conformer/checkpoints/avg_10.pdparams

4. Check that the md5 equals 367a285d43442ecfd9c9e5f5e1145b84

md5sum -b wenet_params.bin

 FastASR usage example

#include <iostream>
#include <win_func.h>
#include <Audio.h>
#include <Model.h>
#include <string.h>

using namespace std;

bool externContext(const char* src, const char* dst)
{
    Audio audio(0);           // create an audio processing object
    audio.loadwav(src);       // load the file
    audio.disp();             // inspect the format

    // Model* mm = create_model("/home/chen/FastASR/models/k2_rnnt2_cli", 2); // create a pretrained model
    Model* mm = create_model("/home/chen/FastASR/models/paraformer_cli", 3);
    audio.split();            // segment the audio

    float* buff = NULL;       // buffer for fftw3 analysis
    int len = 0;
    int flag = false;
    char buf[1024];

    // write the recognized text out line by line
    FILE* fp = NULL;
    fp = fopen(dst, "w+");
    if (NULL == fp)
    {
        printf("failed to open the output file\n");
        return false;
    }

    while (audio.fetch(buff, len, flag) > 0)
    {
        mm->reset();
        string msg = mm->forward(buff, len, flag);

        memset(buf, 0, sizeof(buf));
        snprintf(buf, sizeof(buf), "%s", msg.c_str());
        fseek(fp, 0, SEEK_END);
        fprintf(fp, "%s\n", buf);
        fflush(fp);
    }

    fclose(fp);

    return true;
}

int main(void)
{
    externContext("./long.wav", "./Context.txt");

    return 0;
}

A sample Makefile:

flags:= -I ./include
flags+= -L ./lib -lfastasr -lfftw3 -lfftw3f -lblas -lwebrtcvad
src_cpp=$(wildcard ./*.cpp)

debug:
	g++ -g $(src_cpp) -omain $(flags) -std=c++11

 It's late at night. This article was pasted together from documents I wrote earlier, and some of the formatting is rough. My apologies...


Origin blog.csdn.net/weixin_46120107/article/details/129210284