If you want to do something well, you need not only to know how to use the tools, but also to understand some basic concepts.
1. A basic overview of audio and video processing
1. Understanding multimedia files
1.1 Structural Analysis
A multimedia file is essentially a container
The container holds several streams
Each stream is produced by a different encoder
Each stream is split into many packets, and packets contain frames (the frame is the smallest unit in audio and video processing)
1.2 Package format
An encapsulation format (also called a container format) packs the encoded and compressed video, audio and subtitle streams into one file according to a defined scheme, so that playback software can play it. Generally, a video file's suffix reflects its encapsulation format: different containers use different suffixes (xxx.mp4, xxx.flv).
1.3 Audio and video synchronization
Audio master: sync the video to the audio
Video master: sync the audio to the video
External clock master: sync both audio and video to an external clock
1.4 Principles of audio and video recording
1.5 Principles of audio and video playback
2. Basic concepts of audio and video
2.1 Sound
Sampling digitizes natural sound along the time axis: instantaneous values are taken point by point from the analog signal x(t) at a fixed time interval t. The higher the sampling rate, the more faithfully the sound is reproduced and the better the quality, but the more space it occupies.
Quantization: approximate the original continuous amplitude with a finite set of amplitudes, turning the analog signal's continuous amplitude into a finite number of discrete values at fixed intervals. [How finely each sampled value is expressed is the quantization precision. For example, 8-bit quantization can represent 256 different values, while 16-bit CD quality can represent 65536 values, in the range -32768 to 32767.]
Let's calculate this value:
Coding: following a certain rule, express the quantized values as binary numbers, then convert them into a binary or multi-valued digital signal stream. The digital signal obtained this way can be transmitted over digital channels such as cable or satellite links. The receiving end reverses the above process.
How the encoding is understood:
You have probably heard a teacher explain Huffman coding in school; the principle here is the same. Encoding gives each value a unique representation, and a well-chosen encoding can compress the data, improve robustness, and serve other useful ends.
PCM: The above digitization process is also called pulse code modulation. Usually we say that the raw data format of audio is pulse code modulation data. Describing a piece of PCM data requires 4 quantitative indicators: sampling rate, bit depth, byte order, and number of channels.
Sampling rate: How many samples are taken per second, in Hz.
Use case | Sampling rate
Radio broadcast | 22000 (22 kHz)
CD quality | 44100 (44.1 kHz)
Digital TV (DVD) | 48000 (48 kHz)
Blu-ray (HD DVD) | 96000 (96 kHz)
Blu-ray (HD DVD) | 192000 (192 kHz)
Bit depth (bit-depth): how many binary bits are used to describe each sample, commonly 16 bits.
Byte order: whether the audio PCM data is stored big-endian or little-endian. For efficient processing, little-endian storage is usually used.
Number of channels: how many channels the PCM data contains, e.g. mono or stereo.
Bit rate: the number of bits transmitted per second, in bps (bits per second); an indirect measure of sound quality. Bit rate of uncompressed audio = sampling rate × bit depth × number of channels.
Bit rate (compressed): the bit rate of the compressed audio data. A higher bit rate means less aggressive compression, better sound quality, and larger compressed data. Bit rate = audio file size / duration.
Quality | Bit rate
FM quality | 96 kbps
Normal-quality audio | 128-160 kbps
CD quality | 192 kbps
High-quality audio | 256-320 kbps
Frame: the number of sample units per encoding step. For example, MP3 usually encodes 1152 samples as one frame, and AAC usually encodes 1024 samples as one frame.
Frame length: either the playback duration of each frame, where duration (s) = samples per frame / sampling rate (Hz), or the compressed data length of each frame.
Audio coding: the main function is to compress audio sample data (PCM, etc.) into an audio stream, thereby reducing the amount of audio data, which is biased towards storage and transmission.
Format | Description
MP3 | A lossy digital audio coding format used to drastically reduce the amount of audio data.
AAC | A higher compression ratio than MP3; at the same file size, AAC gives higher sound quality.
WMA | Natively includes both lossy and lossless compression formats.
2.2 Image
An image is a faithful, vivid depiction or portrait of an objective thing, and it is the most commonly used information carrier in human social activity. In other words, an image is a representation of an object that contains information about it; it is people's primary source of information.
Pixel: Screen display is to divide the effective area into many small grids, and each grid only displays one color, which is the smallest element of imaging, so it is called "pixel".
Resolution: The number of pixels in the length and width directions of the screen is called resolution, which is generally represented by A x B. The higher the resolution, the smaller the area of each pixel, and the smoother and more delicate the display effect will be.
RGB image representation: 8 bits represent one sub-pixel, with values in [0-255] or [0x00-0xFF]. For example, the RGBA_8888 image format uses four 8-bit values per pixel, while RGB_565 uses 5+6+5 bits per pixel. A 1280x720 RGBA_8888 picture occupies 1280 × 720 × 32 bit. Raw image data is therefore very large: a 90-minute movie at 25 frames per second takes 90 × 60 × 25 × 1280 × 720 × 32 bit = 463.48 GiB.
YUV image representation: YUV is another color-encoding scheme; the raw data of video is generally expressed in a YUV format. Y is the luminance, also called the gray value. U and V together are the chrominance: they describe the image's hue and saturation and are used to specify a pixel's color.
Luminance: derived from the RGB input signal by summing weighted parts of the RGB components (dominated by the G component).
Chrominance: defines the hue and saturation of the color, represented by Cr and Cb (C stands for "component"). Cr reflects the difference between the red part of the RGB input signal and the RGB luminance; Cb reflects the difference between the blue part and the luminance.
2.3 Video
Because of the special structure of the human eye, when pictures switch quickly the previous picture lingers (persistence of vision), so the sequence feels like continuous motion. A video is therefore a sequence of pictures.
Video bit rate: the amount of data a video file uses per unit time, also called the code rate. The higher the bit rate, the more data per unit time and the higher the fidelity of the stream.
Video frame rate: "25 frames" in a video generally refers to the frame rate, i.e. 25 frames displayed per second. The higher the frame rate, the smoother the viewing experience.
Video resolution: the familiar 640x480 or 1920x1080; resolution determines the size of the video image.
I-frame: a frame encoded without reference to other pictures; during decoding, the complete image is reconstructed from it alone.
Video encoding: The purpose of encoding is to compress, so that the size of various videos becomes smaller, which is conducive to storage and transmission. There are two mainstream international organizations that formulate video codec technology, one is "ITU-T", which formulates standards such as H.261, H.263, H.263+, H.264, etc. The other is the "International Organization for Standardization (ISO)", which has established standards such as MPEG-1, MPEG-2, and MPEG-4.
Format | Description
WMV | A streaming media format from Microsoft, an upgrade and extension of its sibling ASF format. At the same video quality, WMV files can be played while downloading, so it is well suited to playback and transmission over the Internet.
VP8 | WebM and VPx (VP6, VP7, VP8, VP9) come from On2; this codec family was designed for web video.
WebRTC | In May 2010 Google acquired VoIP software developer Global IP Solutions for approximately $68.2 million, thereby gaining the company's WebRTC technology. WebRTC integrates VP8 and VP9.
AV1 | An open, royalty-free video coding format designed for transmitting video over the Internet.
AVS | China's second-generation source coding standard with independent intellectual property rights, short for the "Advanced Audio and Video Coding for Information Technology" series of standards. It includes four main technical standards (system, video, audio, and digital rights management) plus conformance tests and other supporting standards.
H.265 | HEVC offers significant compression improvements over the H.264 codec: roughly twice the efficiency, so a video of the same visual quality takes about half the space.
VP9 | An open, royalty-free video coding standard developed by Google, regarded as the successor to VP8.
3. Commonly used audio and video processing third-party libraries
3.2.1 Basic concepts
FFmpeg (Fast Forward MPEG) is the world's leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter, and play almost any multimedia file created by humans or machines.
3.2.2 The main basic components of FFmpeg
FFmpeg's encapsulation module AVFormat: AVFormat implements most media encapsulation formats in the multimedia field, both muxing and demuxing, including file formats such as MP4, FLV, MKV and TS, and network protocol formats such as RTMP, RTSP, MMS and HLS. Whether FFmpeg supports a given format depends on whether the corresponding library was included at compile time.
FFmpeg's codec module AVCodec: AVCodec covers most commonly used codec formats, supporting both encoding and decoding. Besides built-in formats such as MPEG4, AAC and MJPEG, it also supports H.264 (the x264 encoder), H.265 (the x265 encoder) and MP3 (the libmp3lame encoder).
FFmpeg's filter module AVFilter: the AVFilter library provides a generic filtering framework for audio, video, subtitles and more. In AVFilter, a filter can have multiple inputs and multiple outputs.
FFmpeg's image conversion module swscale: the swscale module provides a high-level image conversion API that can scale images and convert pixel formats.
FFmpeg's audio conversion module swresample: swresample provides a high-level audio resampling API, supporting sample-rate conversion, sample-format conversion, and channel-layout adjustment.
3.2.3 Advantages and disadvantages of FFmpeg
Advantages:
- High portability: it can be compiled, run and tested via FATE (the FFmpeg Automated Testing Environment) on Linux, Mac, Windows and other systems.
- High performance: assembly-level optimizations are provided for most mainstream processors, such as x86, ARM, MIPS and PPC.
- High security: the FFmpeg maintainers review code with security in mind, and security bugs found in released versions are fixed and shipped as soon as possible.
- Ease of use: the APIs FFmpeg provides carry comments, and the official documentation covers them as well.
- Format diversity: FFmpeg supports decoding, encoding, muxing and demuxing of a huge range of media formats, whether very old or relatively new.
Disadvantages:
- File names containing spaces are not handled reliably.
- When encoding, timestamps require only the pts field of AVFrame to be specified.
3.2.4 FFmpeg installation and configuration
Windows | easier
Linux, method 1 (install via yum; this version is somewhat old):
sudo yum install epel-release -y
sudo yum update -y
sudo rpm --import http://li.nux.ro/download/nux/RPM-GPG-KEY-nux.ro
sudo rpm -Uvh http://li.nux.ro/download/nux/dextop/el7/x86_64/nux-dextop-release-0-5.el7.nux.noarch.rpm
yum -y install ffmpeg ffmpeg-devel
ffmpeg -version
Linux, method 2 (build from source). First download the sources:
git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg
Enter the ffmpeg folder and run the following in order:
cd ffmpeg
./configure
make
make install
Copy the built ffmpeg into the bin directory:
cp ffmpeg /usr/bin/ffmpeg
Check the version:
ffmpeg -version
3.2.5 Using FFmpeg from the command line
Get media file information:
ffmpeg -i <file path> -hide_banner
Example: ffmpeg -i video_file.mp4 -hide_banner
Note: -hide_banner hides FFmpeg's own banner and prints only file-related information (codecs, streams, etc.).

Convert media files (freely between formats). Note: ffmpeg guesses the format from the suffix.
ffmpeg -i <input path> <output path>
Examples:
ffmpeg -i video_input.mp4 video_output.avi
ffmpeg -i video_input.webm video_output.flv
ffmpeg -i audio_input.mp3 audio_output.ogg
ffmpeg -i audio_input.wav audio_output.flac
Parameter: -qscale 0 preserves the original video quality.

Extract audio from a video:
ffmpeg -i <video path> -vn <audio output path> -hide_banner
Parameters: -vn drops the video (extracts audio only); -ab sets the encoding bit rate (common values: 96k, 128k, 192k, 256k, 320k); -ar sets the sampling rate (22050, 44100, 48000); -ac sets the number of channels; -f sets the audio format (usually detected automatically).

Mute a video (video only):
ffmpeg -i video_input.mp4 -an video_output.mp4
Note: -an removes all audio streams, so no audio is produced in the output.

Extract screenshots from a video file:
ffmpeg -i <video file> -r <frame rate> -f <output format> <output name>
Example: ffmpeg -i video.mp4 -r 1 -f image2 image-%3d.png
Parameters: -r frame rate (how many images are exported per second, default 25); -f output format (image2 means an image2 sequence).

Change the video resolution or aspect ratio:
ffmpeg -i <video file> -s <resolution> -c:a copy -aspect <ratio> <output name>
Example: ffmpeg -i video_input.mov -s 1024x576 -c:a copy video_output.mp4
Parameters: -s scales the video; -c:a copy keeps the audio encoding unchanged; -aspect changes the aspect ratio.

Add a cover image to audio (audio to video). Handy when you want to upload audio to a site that only accepts video:
ffmpeg -loop 1 -i image.jpg -i audio.wav -c:v libx264 -c:a aac -strict experimental -b:a 192k -shortest output.mp4
Parameters: -c:v video codec; -c:a audio codec.
Note: on FFmpeg 4.x and later, -strict experimental is no longer needed.

Add subtitles to a video:
ffmpeg -i video.mp4 -i subtitles.srt -c:v copy -c:a copy -preset veryfast -c:s mov_text -map 0 -map 1 output.mp4
II. Extracting audio from a video
1. FFmpeg
Via the command line:
ffmpeg -i <video path> -vn <audio output path> -hide_banner
Parameters: -vn drops the video (extracts audio only); -ab encoding bit rate (96k, 128k, 192k, 256k, 320k); -ar sampling rate (22050, 44100, 48000); -ac number of channels; -f audio format (usually detected automatically).
Via the provided API:
bool AVInterface::extractAudio(const char* src, const char* dstDir)
{
    if (NULL == src || NULL == dstDir)
    {
        printf("Ffmpeg::extractAudio[ERROR]::invalid argument, check the file paths\n");
        return false;
    }
    int ret = 0;
    // keep the source file path
    const char* src_fileName = src;
    // 1. Get the global context of the media file
    // 1.1 Define the AVFormatContext container
    AVFormatContext* pFormatCtx = NULL;    // AVFormatContext describes the basic composition of a media file or stream
    pFormatCtx = avformat_alloc_context(); // allocate pFormatCtx
    // 1.2 Open the media file and read its header into pFormatCtx
    ret = avformat_open_input(&pFormatCtx, src_fileName, NULL, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extractAudio[ERROR]::failed to open the media file\n");
        return false;
    }
    // 2. Probe the stream information
    // 2.1 Probe for streams; if present, their info is stored in pFormatCtx
    ret = avformat_find_stream_info(pFormatCtx, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extractAudio[ERROR]::no stream information in the file\n");
        return false;
    }
    av_dump_format(pFormatCtx, 0, src_fileName, 0); // print container format and stream info
    // 2.2 Look for an audio stream (we only extract audio) and get its index
    int audio_stream_index = -1;
    audio_stream_index = av_find_best_stream(pFormatCtx, AVMEDIA_TYPE_AUDIO, -1, -1, NULL, 0);
    if (audio_stream_index < 0)
    {
        printf("Ffmpeg::extractAudio[ERROR]::no audio stream in the file\n");
        return false;
    }
    // 3. Define the output container
    AVFormatContext* pFormatCtx_out = NULL;  // output format context
    const AVOutputFormat* pFormatOut = NULL; // output container format
    AVPacket packet;
    // output file path
    char szFilename[256] = { 0 };
    snprintf(szFilename, sizeof(szFilename), "%s/ffmpeg-music.aac", dstDir);
    // 3.1 Initialize the container
    av_init_packet(&packet);
    // allocate pFormatCtx_out; basic fields are initialized from the file name
    avformat_alloc_output_context2(&pFormatCtx_out, NULL, NULL, szFilename);
    // the container format (AAC, guessed from the suffix)
    pFormatOut = pFormatCtx_out->oformat;
    // 4. Find the audio stream and copy its parameters into the output stream
    for (unsigned int i = 0; i < pFormatCtx->nb_streams; ++i) // nb_streams: number of streams
    {
        // AVStream wraps the per-stream information
        AVStream* out_stream = NULL;                          // output stream
        AVStream* in_stream = pFormatCtx->streams[i];         // input stream
        AVCodecParameters* in_codecpar = in_stream->codecpar; // codec parameters
        // audio stream only
        if (in_codecpar->codec_type == AVMEDIA_TYPE_AUDIO)
        {
            // create the output stream
            out_stream = avformat_new_stream(pFormatCtx_out, NULL);
            if (NULL == out_stream)
            {
                printf("Ffmpeg::extractAudio::[ERROR]failed to create the output stream\n");
                return false;
            }
            // Copy the codec parameters. Do not copy directly if you need to
            // transcode; here we only extract audio, so a copy is enough.
            ret = avcodec_parameters_copy(out_stream->codecpar, in_codecpar);
            if (ret < 0)
            {
                printf("Ffmpeg::extractAudio::[ERROR]failed to copy codec parameters\n");
                return false;
            }
            out_stream->codecpar->codec_tag = 0;
            break; // we only need the audio stream, so stop here
        }
    }
    av_dump_format(pFormatCtx_out, 0, szFilename, 1);
    // open the output file, unless the muxer needs no file
    if (!(pFormatOut->flags & AVFMT_NOFILE))
    {
        ret = avio_open(&pFormatCtx_out->pb, szFilename, AVIO_FLAG_WRITE);
        if (ret < 0)
        {
            printf("Ffmpeg::extractAudio::[ERROR]avio_open: failed to open the output file\n");
            return false;
        }
    }
    // write the media file header
    ret = avformat_write_header(pFormatCtx_out, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extractAudio::[ERROR]failed to write the header\n");
        return false;
    }
    // extract the audio packet by packet
    while (av_read_frame(pFormatCtx, &packet) >= 0)
    {
        if (packet.stream_index == audio_stream_index)
        {
            AVStream* in_stream = pFormatCtx->streams[packet.stream_index];
            AVStream* out_stream = pFormatCtx_out->streams[0]; // the output has a single stream
            // rescale the timestamps from the input to the output time base
            packet.pts = av_rescale_q_rnd(packet.pts, in_stream->time_base, out_stream->time_base, (AVRounding)(AV_ROUND_INF | AV_ROUND_PASS_MINMAX));
            packet.dts = packet.pts;
            packet.duration = av_rescale_q(packet.duration, in_stream->time_base, out_stream->time_base);
            packet.pos = -1;
            packet.stream_index = 0;
            // write the packet to the output media file
            av_interleaved_write_frame(pFormatCtx_out, &packet);
        }
        // drop the reference on every packet to avoid leaking memory
        av_packet_unref(&packet);
    }
    // write the trailer
    av_write_trailer(pFormatCtx_out);
    // clean up
    avio_close(pFormatCtx_out->pb);
    avformat_free_context(pFormatCtx_out);
    avformat_close_input(&pFormatCtx);
    return true;
}
3. Performance comparison
Video length | 5s | 5min | 30min
Extraction time | 0.087017s | 0.138014s | 0.875926s
III. Extracting images from a video file
1. FFmpeg
Via the command line:
ffmpeg -i <video file> -r <frame rate> -f <output format> <output name>
Example: ffmpeg -i video.mp4 -r 1 -f image2 image-%3d.png
Parameters: -r frame rate (how many images are exported per second, default 25); -f output format (image2 means an image2 sequence).
Via the provided API:
bool AVInterface::extracPictrue(const char* src, const char* dstDir, int num)
{
    if (NULL == src || NULL == dstDir)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::invalid argument, check the file paths\n");
        return false;
    }
    int ret = 0;
    // keep the source file path
    const char* src_fileName = src;
    // 1. Get the global context of the media file
    // 1.1 Define the AVFormatContext container
    AVFormatContext* pFormatCtx = NULL;    // AVFormatContext describes the basic composition of a media file or stream
    pFormatCtx = avformat_alloc_context(); // allocate pFormatCtx
    // 1.2 Open the media file and read its header into pFormatCtx
    ret = avformat_open_input(&pFormatCtx, src_fileName, NULL, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::failed to open the media file\n");
        return false;
    }
    // 2. Probe the stream information
    // 2.1 Probe for streams; if present, their info is stored in pFormatCtx
    ret = avformat_find_stream_info(pFormatCtx, NULL);
    if (ret < 0)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::no stream information in the file\n");
        return false;
    }
    av_dump_format(pFormatCtx, 0, src_fileName, 0); // print for inspection
    // 2.2 Look for a video stream (we extract pictures here) and get its index
    int video_stream_index = -1;
    video_stream_index = av_find_best_stream(pFormatCtx, AVMEDIA_TYPE_VIDEO, -1, -1, NULL, 0);
    if (video_stream_index < 0)
    {
        printf("Ffmpeg::extracPictrue[ERROR]::no video stream in the file\n");
        return false;
    }
    // 3. Find the decoder: the streams are compressed, so we must decode before processing
    // 3.1 Define the decoder containers
    AVCodecContext* pCodeCtx = NULL; // AVCodecContext describes a codec and holds its state
    const AVCodec* pCodec = NULL;    // AVCodec stores the decoder information
    pCodeCtx = avcodec_alloc_context3(NULL); // allocate the decoder context
    // 3.2 Look up the decoder
    AVStream* pStream = pFormatCtx->streams[video_stream_index]; // the video stream
    pCodec = avcodec_find_decoder(pStream->codecpar->codec_id);  // decoder matching the video stream
    if (NULL == pCodec)
    {
        printf("no video decoder found\n");
        return false;
    }
    // initialize the decoder context from the stream parameters
    ret = avcodec_parameters_to_context(pCodeCtx, pStream->codecpar);
    if (ret < 0)
    {
        printf("failed to initialize the decoder context\n");
        return false;
    }
    // 3.3 Open the decoder
    ret = avcodec_open2(pCodeCtx, pCodec, NULL);
    if (ret < 0)
    {
        printf("failed to open the decoder\n");
        return false;
    }
    AVFrame* pFrame = NULL;
    pFrame = av_frame_alloc();
    if (NULL == pFrame)
    {
        printf("av_frame_alloc failed\n");
        return false;
    }
    int index = 0;
    AVPacket avpkt;
    while (av_read_frame(pFormatCtx, &avpkt) >= 0)
    {
        if (avpkt.stream_index == video_stream_index)
        {
            ret = avcodec_send_packet(pCodeCtx, &avpkt);
            if (ret < 0)
            {
                av_packet_unref(&avpkt);
                continue;
            }
            // one packet may yield several frames; count each saved picture
            while (avcodec_receive_frame(pCodeCtx, pFrame) == 0)
            {
                SaveFramePicture(pFrame, dstDir, index);
                ++index;
            }
            if (num > 0 && index >= num) // num <= 0 means extract everything
            {
                av_packet_unref(&avpkt);
                break;
            }
        }
        av_packet_unref(&avpkt);
    }
    av_frame_free(&pFrame);
    avcodec_free_context(&pCodeCtx);
    avformat_close_input(&pFormatCtx);
    return true;
}
bool AVInterface::SaveFramePicture(AVFrame* pFrame, const char* dstDir, int index)
{
    // MJPEG produces JPEG pictures, so name the output .jpg
    char szFilename[256] = { 0 };
    snprintf(szFilename, sizeof(szFilename), "%s/ffmpeg-%d.jpg", dstDir, index);
    int ret = 0;
    int width = pFrame->width;
    int height = pFrame->height;
    // 1. Initialize the output container for the picture
    AVCodecContext* pCodeCtx = NULL;
    AVFormatContext* pFormatCtx = NULL;
    pFormatCtx = avformat_alloc_context();
    // 2. Set the container format
    // MJPEG compresses the video signal frame by frame with the JPEG algorithm
    pFormatCtx->oformat = av_guess_format("mjpeg", NULL, NULL); // find the best match among the registered output formats
    // 3. Create the AVIOContext: open the output file
    ret = avio_open(&pFormatCtx->pb, szFilename, AVIO_FLAG_READ_WRITE); // read-write mode
    if (ret < 0)
    {
        printf("avio_open failed\n");
        return false;
    }
    // create a new stream
    AVStream* pAVStream = NULL;
    pAVStream = avformat_new_stream(pFormatCtx, NULL);
    if (pAVStream == NULL)
    {
        printf("avformat_new_stream failed\n");
        return false;
    }
    AVCodecParameters* parameters = NULL; // encoder parameter structure
    parameters = pAVStream->codecpar;     // configure the mjpeg encoder
    parameters->codec_id = pFormatCtx->oformat->video_codec; // video stream
    parameters->codec_type = AVMEDIA_TYPE_VIDEO;             // codec type
    //parameters->format = AV_PIX_FMT_BGR24;                 // pixel format of the picture
    parameters->format = AV_PIX_FMT_YUVJ420P; // the decompressed frames are in YUV
    parameters->width = pFrame->width;        // picture width
    parameters->height = pFrame->height;      // picture height
    // find the matching encoder
    const AVCodec* pCodec = avcodec_find_encoder(pAVStream->codecpar->codec_id);
    if (NULL == pCodec)
    {
        printf("avcodec_find_encoder failed\n");
        return false;
    }
    // allocate the encoder context
    pCodeCtx = avcodec_alloc_context3(pCodec);
    if (NULL == pCodeCtx)
    {
        printf("avcodec_alloc_context3 failed\n");
        return false;
    }
    // set the encoder parameters
    ret = avcodec_parameters_to_context(pCodeCtx, parameters);
    if (ret < 0)
    {
        printf("avcodec_parameters_to_context failed\n");
        return false;
    }
    AVRational avrational = { 1, 25 };
    pCodeCtx->time_base = avrational;
    // open the encoder
    ret = avcodec_open2(pCodeCtx, pCodec, NULL);
    if (ret < 0)
    {
        printf("avcodec_open2 failed\n");
        return false;
    }
    // write the container header
    ret = avformat_write_header(pFormatCtx, NULL);
    if (ret < 0)
    {
        printf("avformat_write_header failed\n");
        return false;
    }
    // give the AVPacket a buffer large enough for one picture
    int y_size = width * height; // number of pixels
    AVPacket pkt;
    av_new_packet(&pkt, y_size * 3);
    // encode the frame
    ret = avcodec_send_frame(pCodeCtx, pFrame);
    if (ret < 0)
    {
        printf("avcodec_send_frame failed\n");
        return false;
    }
    // fetch the encoded data
    ret = avcodec_receive_packet(pCodeCtx, &pkt);
    if (ret < 0)
    {
        printf("avcodec_receive_packet failed\n");
        return false;
    }
    ret = av_write_frame(pFormatCtx, &pkt);
    if (ret < 0)
    {
        printf("av_write_frame failed\n");
        return false;
    }
    av_packet_unref(&pkt);
    av_write_trailer(pFormatCtx);
    avcodec_free_context(&pCodeCtx);
    avio_close(pFormatCtx->pb);
    avformat_free_context(pFormatCtx);
    return true;
}
3. Performance comparison
Video length | 5s | 5min | 30min
10 images | 0.295322s | 0.146283s | 0.151467s
100 images | 1.263546s | 1.226884s | 1.190490s
All images | 2.670444s (170) | 96.951886s (7514) | 119.161211s (10000)
IV. Extracting text from an audio file
1. Baidu AI Cloud speech recognition
Baidu Speech currently supports speech recognition, speech synthesis and voice wake-up. It accepts the pcm, wav and amr formats, audio up to 60 seconds long, is completely free, and has no call-volume limit.
1. Offline speech recognition: Baidu's offline recognition currently supports only Android and iOS. The Android solution integrates offline and online recognition and ships as a JAR package plus SO libraries; the iOS solution ships as a static library.
2. Online speech recognition: called through an API from Android, iOS, C#, Java, Node, PHP, Python or C++; since it is an API, effectively every development language is supported.
1.1 Pros and cons of Baidu AI Cloud
1.2 Baidu AI Cloud installation and configuration
Install the necessary dependencies: curl (built with SSL), jsoncpp, openssl.
# install libcurl
sudo apt-get install libcurl4-openssl-dev
# install jsoncpp
sudo apt-get install libjsoncpp-dev
To use the development package directly, follow these steps:
1.4 Baidu AI Cloud usage example
You can create a client with code like the following:
#include "speech.h"
// set the APPID/AK/SK
std::string app_id = "XXX";
std::string api_key = "XXX";
std::string secret_key = "XXX";
aip::Speech client(app_id, api_key, secret_key);
In the code above, the constant APP_ID is created in the Baidu Cloud console; the constants API_KEY and SECRET_KEY are assigned by the system once the application has been created. All are strings, identify the user, and sign requests; they can be viewed in the application list of the AI service console.
Upload a complete utterance to the remote service for recognition:
void asr(aip::Speech client)
{
    // call the API without optional parameters
    std::string file_content;
    aip::get_file_content("./assets/voice/16k_test.pcm", &file_content);
    Json::Value result = client.recognize(file_content, "pcm", 16000, aip::null);
    // extreme-speed edition call
    // Json::Value result = client.recognize_pro(file_content, "pcm", 16000, aip::null);
    // to override or add parameters
    std::map<std::string, std::string> options;
    options["dev_pid"] = "1537";
    Json::Value result2 = client.recognize(file_content, "pcm", 16000, options);
}
Sample responses:
// success
{ "err_no": 0, "err_msg": "success.", "corpus_no": "15984125203285346378", "sn": "481D633F-73BA-726F-49EF-8659ACCC2F3D", "result": ["北京天气"] }
// failure
{ "err_no": 2000, "err_msg": "data empty.", "sn": null }
SpeechRecognition: open-source offline speech recognition
SpeechRecognition is a Python library focused on converting speech to text. Engines such as wit and apiai provide built-in features beyond basic speech recognition, such as natural-language processing to identify a speaker's intent.
Pros and cons of SpeechRecognition
SpeechRecognition installation and configuration
pip install SpeechRecognition
(or: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple SpeechRecognition)
yum install python3-devel
yum install pulseaudio-libs-devel
yum install alsa-lib-devel
pip install pocketsphinx
Configure the Chinese speech-recognition data. Download it from https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/ and choose Mandarin -> cmusphinx-zh-cn-5.2.tar.gz. Install the Chinese language pack:
cd /usr/local/python3.6.8/lib/python3.6/site-packages/speech_recognition/pocketsphinx-data
tar zxvf cmusphinx-zh-cn-5.2.tar.gz
mv cmusphinx-zh-cn-5.2 zh-cn
cd zh-cn
mv zh_cn.cd_cont_5000 acoustic-model
mv zh_cn.lm.bin language-model.lm.bin
mv zh_cn.dic pronounciation-dictionary.dict
Configure the environment:
cd /usr/local/python3.6.8/lib/python3.6/site-packages/speech_recognition/pocketsphinx-data
tar zxvf py36asr.tar.gz
source ./py36asr/bin/activate
SpeechRecognition usage example
[root@localhost pocketsphinx-data]# pwd
/usr/local/python3.6.8/lib/python3.6/site-packages/speech_recognition/pocketsphinx-data
[root@localhost pocketsphinx-data]# ls
cmusphinx-zh-cn-5.2.tar.gz  py36asr  test1.py  test2.wav  zh-cn.tar.gz  en-US  py36asr.tar.gz  test1.wav  zh-cn
Program example:
# -*- coding: utf-8 -*-
# /usr/bin/python
import speech_recognition as sr

r = sr.Recognizer()
test = sr.AudioFile("test1.wav")
with test as source:
    audio = r.record(source)
c = r.recognize_sphinx(audio, language='zh-cn')
print(c)
FastASR speech recognition
FastASR is a project that implements ASR inference in C++. It has few dependencies, installs easily, and infers quickly. The supported models are optimized from Transformer-style models and trained on the open WenetSpeech dataset (10000+ hours) or Alibaba's private dataset (60000+ hours), so recognition quality is good, comparable to many commercial ASR products.
- Streaming model: the input is treated as an audio stream and recognition results are returned in real time, at some cost in accuracy.
Name | Source | Dataset | Model
conformer_online | paddlespeech | WenetSpeech (1000h) | conformer_online_wenetspeech-zh-16k
- Non-streaming model: recognition works sentence by sentence, so latency is higher but accuracy is better.
Name | Source | Dataset | Model | Language
paraformer | Alibaba DAMO Academy | private dataset (6000h) | Paraformer-large | en+zh
k2_rnnt2 | kaldi2 | WenetSpeech (10000h) | Pruned_transducer_stateless2 | zh
conformer_online | paddlespeech | WenetSpeech (10000h) | Conformer_online_wenetspeech-zh-16k | zh
The models above are implemented on deep-learning frameworks (PaddlePaddle and PyTorch) and perform well; even on a personal computer they can meet real-time requirements (for 10 s of speech, inference finishing within 10 s counts as real time).
Pros and cons of FastASR
FastASR installation and configuration
- Install the dependency library libfftw3:
sudo apt-get install libfftw3-dev libfftw3-single3
- Install the dependency library libopenblas:
sudo apt-get install libopenblas-dev
- Install the Python environment:
sudo apt-get install python3 python3-dev
- Download the latest source:
git clone https://github.com/chenkui164/FastASR.git
- Build the latest source:
cd FastASR/
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
- Build the Python whl package:
cd FastASR
python -m build
- Download the pretrained models:
paraformer pretrained model:
cd ../models/paraformer_cli
1. Download the pretrained model from the modelscope site:
wget --user-agent="Mozilla/5.0" -c "https://www.modelscope.cn/api/v1/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/repo?Revision=v1.0.4&FilePath=model.pb"
mv repo\?Revision\=v1.0.4\&FilePath\=model.pb model.pb
2. Convert the Python model to a C++ one:
../scripts/paraformer_convert.py model.pb
3. Verify with md5:
md5sum -b wenet_params.bin
k2_rnnt2 pretrained model:
cd ../models/k2_rnnt2_cli
1. Download the pretrained model from huggingface:
wget -c https://huggingface.co/luomingshuang/icefall_asr_wenetspeech_pruned_transducer_stateless2/resolve/main/exp/pretrained_epoch_10_avg_2.pt
2. Convert the Python model to a C++ one:
../scripts/k2_rnnt2_convert.py pretrained_epoch_10_avg_2.pt
3. Check that the md5 equals 33a941f3c1a20a5adfb6f18006c11513:
md5sum -b wenet_params.bin
PaddleSpeech pretrained model:
1. Download the pretrained model from the PaddleSpeech site:
wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz
2. Unpack the archive into the wenetspeech directory:
mkdir wenetspeech
tar -xzvf asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz -C wenetspeech
3. Convert the Python model to a C++ one:
../scripts/paddlespeech_convert.py wenetspeech/exp/conformer/checkpoints/wenetspeech.pdparams
4. Check that the md5 equals 9cfcf11ee70cb9423528b1f66a87eafd:
md5sum -b wenet_params.bin
Streaming-mode pretrained model:
cd ../models/paddlespeech_stream
1. Download the pretrained model:
wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz
2. Unpack the archive into the wenetspeech directory:
mkdir wenetspeech
tar -xzvf asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz -C wenetspeech
3. Convert the Python model to a C++ one:
../scripts/paddlespeech_convert.py wenetspeech/exp/chunk_conformer/checkpoints/avg_10.pdparams
4. Check that the md5 equals 367a285d43442ecfd9c9e5f5e1145b84:
md5sum -b wenet_params.bin
FastASR usage example
#include <iostream>
#include <win_func.h>
#include <Audio.h>
#include <Model.h>
#include <string.h>
using namespace std;
bool externContext(const char* src, const char* dst)
{
    Audio audio(0);      // create an audio-processing object
    audio.loadwav(src);  // load the file
    audio.disp();        // print the format
    // Model* mm = create_model("/home/chen/FastASR/models/k2_rnnt2_cli", 2); // create a pretrained model
    Model* mm = create_model("/home/chen/FastASR/models/paraformer_cli", 3);
    audio.split();       // split the audio into segments
    float* buff = NULL;  // buffer for fftw3 analysis
    int len = 0;
    int flag = false;
    char buf[1024];
    // write the recognized text line by line
    FILE* fp = fopen(dst, "w+");
    if (NULL == fp)
    {
        printf("failed to open the output file\n");
        return false;
    }
    while (audio.fetch(buff, len, flag) > 0)
    {
        mm->reset();
        string msg = mm->forward(buff, len, flag);
        memset(buf, 0, sizeof(buf));
        snprintf(buf, sizeof(buf), "%s", msg.c_str());
        fseek(fp, 0, SEEK_END);
        fprintf(fp, "%s\n", buf);
        fflush(fp);
    }
    fclose(fp);
    return true;
}
int main(void)
{
    externContext("./long.wav", "./Context.txt");
    return 0;
}
flags:= -I ./include
flags+= -L ./lib -lfastasr -lfftw3 -lfftw3f -lblas -lwebrtcvad
src_cpp=$(wildcard ./*.cpp)
debug:
	g++ -g $(src_cpp) -o main $(flags) -std=c++11
It is late at night; this article was pasted together from documents I wrote earlier, so the formatting is rough in places. My apologies...