Integrating Vosk into Spring Boot for simple speech recognition

vosk open source speech recognition

Vosk is an open-source speech recognition toolkit. Its main capabilities:

  1. Supports nineteen languages - Chinese, English, Indian English, German, French, Spanish, Portuguese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Persian, Filipino, Ukrainian, Kazakh.

  2. Work offline on mobile devices - Raspberry Pi, Android, iOS.

  3. Installs with a simple pip3 install vosk.

  4. Portable models are only about 50 MB per language, with larger server models also available.

  5. Provides a streaming API for the best user experience (unlike popular Python speech recognition packages).

  6. Wrappers for different programming languages - Java, C#, JavaScript, etc.

  7. Vocabulary can be quickly reconfigured for optimal accuracy.

  8. Support for speaker recognition.

vosk-api

Offline Speech Recognition API for Android, iOS, Raspberry Pi and with Python, Java, C#, etc.

Link: vosk-api github address

The repository includes usage examples for each language.

vosk server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries

Link: vosk-server github address

The repository includes usage examples for each language.
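The server speaks a simple protocol: the client sends a JSON config text frame, then binary PCM chunks, then an EOF frame, and receives JSON results back. A minimal sketch of the two text frames, based on the example clients in the vosk-server repository (the class name is my own; verify the message shapes against the server version you deploy):

```java
// Sketch of the text frames vosk-server expects over its WebSocket endpoint.
public class VoskServerMessages {

    /** First text frame: tells the server the sample rate of the audio that follows. */
    public static String configMessage(int sampleRate) {
        return "{ \"config\" : { \"sample_rate\" : " + sampleRate + " } }";
    }

    /** Final text frame: tells the server the audio stream is finished. */
    public static String eofMessage() {
        return "{\"eof\" : 1}";
    }
}
```

Between these two frames, each binary frame carries a chunk of 16-bit mono PCM audio; the server answers every chunk with a partial or final JSON result.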

Using vosk-api (Java) in Spring Boot

Import the dependencies

  <!-- Speech recognition -->
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>5.13.0</version>
        </dependency>
        <dependency>
            <groupId>com.alphacephei</groupId>
            <artifactId>vosk</artifactId>
            <version>0.3.45</version>
        </dependency>

        <!-- The JAVE2 (Java Audio Video Encoder) library is a Java wrapper around the ffmpeg project. -->
        <dependency>
            <groupId>ws.schild</groupId>
            <artifactId>jave-core</artifactId>
            <version>3.1.1</version>
        </dependency>

        <!-- Native ffmpeg binaries for developing on Windows (32-bit and 64-bit) -->
        <dependency>
            <groupId>ws.schild</groupId>
            <artifactId>jave-nativebin-win32</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>ws.schild</groupId>
            <artifactId>jave-nativebin-win64</artifactId>
            <version>3.1.1</version>
        </dependency>

VoskResult

public class VoskResult {

    private String text;

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }
}
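The utility class later in this article calls JacksonMapperUtils.json2pojo to turn the recognizer's JSON into a VoskResult, but that helper is not shown. As a dependency-free stand-in for illustration (a real project would use a full JSON library such as Jackson's ObjectMapper.readValue), a Vosk result like {"text" : "hello world"} can be reduced to its text field with a small parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified, dependency-free stand-in for the undefined JacksonMapperUtils helper:
// extracts only the "text" field from a Vosk JSON result such as {"text" : "hello world"}.
public class VoskResultParser {

    private static final Pattern TEXT_FIELD =
            Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    /** Returns the value of the "text" field, or "" if the field is absent. */
    public static String extractText(String json) {
        Matcher m = TEXT_FIELD.matcher(json);
        return m.find() ? m.group(1) : "";
    }
}
```

Note this sketch does not handle escaped quotes inside the recognized text; that is one reason to prefer a real JSON library in production.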

vosk model loading

package com.fjdci.vosk;

import org.vosk.LibVosk;
import org.vosk.LogLevel;
import org.vosk.Model;

import java.io.IOException;

/**
 * Vosk model loading (lazy singleton)
 * @author zhou
 */
public class VoskModel {

    /**
     * 3. volatile keeps the singleton thread-safe:
     * it forbids instruction reordering and guarantees visibility
     * (it does not guarantee atomicity).
     */
    private static volatile VoskModel instance;

    private Model voskModel;

    public Model getVoskModel() {
        return voskModel;
    }

    /**
     * 1. Private constructor
     */
    private VoskModel() {
        System.out.println("SingleLazyPattern instantiated");
        //String modelStr = "D:\\work\\project\\fjdci-vosk\\src\\main\\resources\\vosk-model-small-cn-0.22";
        String modelStr = "D:\\work\\fjdci\\docker\\vosk\\vosk-model-cn-0.22";
        try {
            voskModel = new Model(modelStr);
            LibVosk.setLogLevel(LogLevel.INFO);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * 2. Obtain the unique instance through a static method.
     * DCL (double-checked locking) keeps performance high
     * under multi-threaded access.
     */
    public static VoskModel getInstance() {
        if (instance == null) {
            synchronized (VoskModel.class) {
                if (instance == null) {
                    // 1. allocate memory  2. run the constructor  3. point the reference at the memory
                    instance = new VoskModel();
                }
            }
        }
        return instance;
    }

    /**
     * Multi-threaded loading test
     * @param args
     */
    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            new Thread(VoskModel::getInstance).start();
        }
    }
}

Speech recognition utility class

package com.fjdci.vosk;

import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import org.vosk.Model;
import org.vosk.Recognizer;
import ws.schild.jave.EncoderException;
import ws.schild.jave.MultimediaObject;
import ws.schild.jave.info.AudioInfo;
import ws.schild.jave.info.MultimediaInfo;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

@Slf4j
@Component
public class VoiceUtil {

    public static void main(String[] args) throws EncoderException {
        String wavFilePath = "D:\\fjFile\\annex\\xwbl\\tem_2.wav";
        // chunk length in seconds
        long cutDuration = 20;
        String waveForm = acceptWaveForm(wavFilePath, cutDuration);
        System.out.println(waveForm);
    }

    /**
     * Run speech recognition on a WAV audio file.
     *
     * @param wavFilePath path of the WAV file
     * @param cutDuration chunk length in seconds
     * @return recognized text
     * @throws EncoderException
     */
    private static String acceptWaveForm(String wavFilePath, long cutDuration) throws EncoderException {
        // Inspect the audio length
        long startTime = System.currentTimeMillis();
        MultimediaObject multimediaObject = new MultimediaObject(new File(wavFilePath));
        MultimediaInfo info = multimediaObject.getInfo();
        // duration in milliseconds
        long duration = info.getDuration();
        AudioInfo audio = info.getAudio();
        // number of channels
        int channels = audio.getChannels();
        // starting offset in seconds
        long offset = 0;
        long forNum = (duration / 1000) / cutDuration;
        if (duration % (cutDuration * 1000) > 0) {
            forNum = forNum + 1;
        }
        // Split the file into chunks
        List<String> strings = cutWavFile(wavFilePath, cutDuration, offset, forNum);
        // Recognize each chunk in turn
        StringBuilder result = new StringBuilder();
        for (String string : strings) {
            File f = new File(string);
            result.append(VoiceUtil.getRecognizerResult(f, channels));
        }
        long endTime = System.currentTimeMillis();
        String msg = "Elapsed: " + (endTime - startTime) + "ms";
        System.out.println(msg);
        return result.toString();
    }

    /**
     * Split a WAV file into fixed-length chunks.
     *
     * @param wavFilePath path of the WAV file to split
     * @param cutDuration fixed chunk length in seconds
     * @param offset      starting offset in seconds
     * @param forNum      number of chunks
     * @return paths of the chunk files
     * @throws EncoderException
     */
    private static List<String> cutWavFile(String wavFilePath, long cutDuration, long offset, long forNum) throws EncoderException {
        UUID uuid = UUID.randomUUID();
        // Split the large file into small files of fixed duration
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < forNum; i++) {
            String target = "D:\\fjFile\\annex\\xwbl\\" + uuid + "\\" + i + ".wav";
            Jave2Util.cut(wavFilePath, target, (float) offset, (float) cutDuration);
            offset = offset + cutDuration;
            strings.add(target);
        }
        return strings;
    }

    /**
     * Run the recognizer on a single chunk.
     *
     * @param f        chunk file
     * @param channels number of audio channels
     */
    public static String getRecognizerResult(File f, int channels) {
        StringBuilder result = new StringBuilder();
        Model voskModel = VoskModel.getInstance().getVoskModel();
        // The recognizer sample rate is the audio sample rate times the channel count
        log.info("==== model loaded, starting analysis ====");
        try (
                Recognizer recognizer = new Recognizer(voskModel, 16000 * channels);
                InputStream ais = new FileInputStream(f)
        ) {
            int nbytes;
            byte[] b = new byte[4096];
            while ((nbytes = ais.read(b)) >= 0) {
                if (recognizer.acceptWaveForm(b, nbytes)) {
                    // Append an intermediate recognition result
                    result.append(getResult(recognizer.getResult()));
                }
            }
            // getFinalResult() is like getResult() but does not wait for silence;
            // call it at the end of the stream to flush the pipeline and process the remaining audio.
            result.append(getResult(recognizer.getFinalResult()));
            log.info("Recognition result: {}", result.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }

    /**
     * Extract the text field from a recognizer JSON result.
     *
     * @param result JSON string returned by the recognizer
     * @return
     */
    private static String getResult(String result) {
        VoskResult voskResult = JacksonMapperUtils.json2pojo(result, VoskResult.class);
        return Optional.ofNullable(voskResult).map(VoskResult::getText).orElse("");
    }
}
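The chunk count in acceptWaveForm is a ceiling division: a file of durationMs milliseconds split into cutDuration-second pieces needs ceil(durationMs / (cutDuration * 1000)) chunks. Isolated as a pure function (the class name is mine, for illustration):

```java
// The chunk-count arithmetic from acceptWaveForm, extracted as a pure function.
public class ChunkMath {

    /**
     * Number of fixed-length chunks needed to cover an audio file.
     * Equivalent to ceil(durationMs / (cutDurationSeconds * 1000)).
     */
    public static long chunkCount(long durationMs, long cutDurationSeconds) {
        long full = (durationMs / 1000) / cutDurationSeconds;
        if (durationMs % (cutDurationSeconds * 1000) > 0) {
            full = full + 1;
        }
        return full;
    }
}
```

For example, a 61-second file cut into 20-second pieces yields 4 chunks (three full ones plus a 1-second remainder).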

JAVE2 audio processing utility class

package com.fjdci.vosk;

import ws.schild.jave.Encoder;
import ws.schild.jave.EncoderException;
import ws.schild.jave.InputFormatException;
import ws.schild.jave.MultimediaObject;
import ws.schild.jave.encode.AudioAttributes;
import ws.schild.jave.encode.EncodingAttributes;
import ws.schild.jave.info.AudioInfo;
import ws.schild.jave.info.MultimediaInfo;

import java.io.File;

public class Jave2Util {

    /**
     * @param src      source file path
     * @param target   target file path
     * @param offset   starting offset in seconds
     * @param duration length of the audio slice in seconds
     * @throws EncoderException
     */
    public static void cut(String src, String target, Float offset, Float duration) throws EncoderException {
        File targetFile = new File(target);
        if (targetFile.exists()) {
            targetFile.delete();
        }

        File srcFile = new File(src);
        MultimediaObject srcMultiObj = new MultimediaObject(srcFile);
        MultimediaInfo srcMediaInfo = srcMultiObj.getInfo();

        Encoder encoder = new Encoder();

        EncodingAttributes encodingAttributes = new EncodingAttributes();
        // starting offset in seconds
        encodingAttributes.setOffset(offset);
        // length of the audio slice in seconds
        encodingAttributes.setDuration(duration);
        // input format
        encodingAttributes.setInputFormat("wav");

        // audio attributes
        AudioAttributes audio = new AudioAttributes();
        audio.setBitRate(srcMediaInfo.getAudio().getBitRate());
        //audio.setSamplingRate(srcMediaInfo.getAudio().getSamplingRate());
        // Resample to 16 kHz to meet Vosk's recognition requirements
        audio.setSamplingRate(16000);
        audio.setChannels(srcMediaInfo.getAudio().getChannels());
        // To transcode while cutting, set a different codec here
//        audio.setCodec("pcm_u8");
        //audio.setCodec(srcMediaInfo.getAudio().getDecoder().split(" ")[0]);
        encodingAttributes.setAudioAttributes(audio);
        // write the file
        encoder.encode(srcMultiObj, new File(target), encodingAttributes);
    }

    /**
     * Convert the audio format.
     *
     * @param oldFormatPath : source audio path
     * @param newFormatPath : target audio path
     * @return
     */
    public static boolean transforMusicFormat(String oldFormatPath, String newFormatPath) {
        File source = new File(oldFormatPath);
        File target = new File(newFormatPath);
        // audio format conversion class
        Encoder encoder = new Encoder();
        // audio attributes
        AudioAttributes audio = new AudioAttributes();
        audio.setCodec(null);
        // transcoding attributes
        EncodingAttributes attrs = new EncodingAttributes();
        attrs.setInputFormat("wav");
        attrs.setAudioAttributes(audio);
        try {
            encoder.encode(new MultimediaObject(source), target, attrs);
            System.out.println("Conversion complete...");
            return true;
        } catch (IllegalArgumentException e) {
            e.printStackTrace();
        } catch (InputFormatException e) {
            e.printStackTrace();
        } catch (EncoderException e) {
            e.printStackTrace();
        }
        return false;
    }

    public static void main(String[] args) throws EncoderException {
        String src = "D:\\fjFile\\annex\\xwbl\\ly8603f22f24e0409fa9747d50a78ff7e5.wav";
        String target = "D:\\fjFile\\annex\\xwbl\\tem_2.wav";

        Jave2Util.cut(src, target, 0.0F, 60.0F);

        String inputFormatPath = "D:\\fjFile\\annex\\xwbl\\ly8603f22f24e0409fa9747d50a78ff7e5.m4a";
        String outputFormatPath = "D:\\fjFile\\annex\\xwbl\\ly8603f22f24e0409fa9747d50a78ff7e5.wav";

        info(inputFormatPath);

        // audioEncode(inputFormatPath, outputFormatPath);
    }

    /**
     * Print the encoding information of an audio file.
     *
     * @param filePath
     * @throws EncoderException
     */
    private static void info(String filePath) throws EncoderException {
        File file = new File(filePath);
        MultimediaObject multimediaObject = new MultimediaObject(file);
        MultimediaInfo info = multimediaObject.getInfo();
        // duration in milliseconds
        long duration = info.getDuration();
        String format = info.getFormat();
        // e.g. format:mov
        System.out.println("format:" + format);
        AudioInfo audio = info.getAudio();
        // Number of audio channels (1 = mono, 2 = stereo); the encoder picks a default if unset.
        int channels = audio.getChannels();
        // Bit rate of the audio stream, in bits per second; e.g. 128 kb/s means setBitRate(128000).
        // The encoder picks a default if unset.
        int bitRate = audio.getBitRate();
        // Sampling rate in Hz; 16000 = 16 kHz. The encoder picks a default if unset.
        int samplingRate = audio.getSamplingRate();

        // Volume can be adjusted with setVolume(Integer volume):
        // 256 leaves the volume unchanged; lower values reduce it, higher values increase it.

        String decoder = audio.getDecoder();

        System.out.println("duration (ms): " + duration);
        System.out.println("channels: " + channels);
        System.out.println("bitRate: " + bitRate);
        System.out.println("samplingRate (16000 = 16 kHz): " + samplingRate);
        // e.g. aac (LC) (mp4a / 0x6134706D)
        System.out.println("decoder: " + decoder);
    }

    /**
     * Convert between audio formats.
     * @param inputFormatPath
     * @param outputFormatPath
     * @return
     */
    public static boolean audioEncode(String inputFormatPath, String outputFormatPath) {
        String outputFormat = getSuffix(outputFormatPath);
        String inputFormat = getSuffix(inputFormatPath);
        File source = new File(inputFormatPath);
        File target = new File(outputFormatPath);
        try {
            MultimediaObject multimediaObject = new MultimediaObject(source);
            // read the encoding information of the source file
            MultimediaInfo info = multimediaObject.getInfo();
            AudioInfo audioInfo = info.getAudio();
            // audio attributes
            AudioAttributes audio = new AudioAttributes();
            audio.setBitRate(audioInfo.getBitRate());
            audio.setSamplingRate(audioInfo.getSamplingRate());
            audio.setChannels(audioInfo.getChannels());
            // transcoding attributes
            EncodingAttributes attrs = new EncodingAttributes();
            attrs.setInputFormat(inputFormat);
            attrs.setOutputFormat(outputFormat);
            attrs.setAudioAttributes(audio);
            // audio format conversion class
            Encoder encoder = new Encoder();
            // perform the conversion
            encoder.encode(new MultimediaObject(source), target, attrs);
            return true;
        } catch (IllegalArgumentException | EncoderException e) {
            e.printStackTrace();
        }
        return false;
    }

    /**
     * Get the extension after the last "." in a file path.
     * @param outputFormatPath
     * @return
     */
    private static String getSuffix(String outputFormatPath) {
        return outputFormatPath.substring(outputFormatPath.lastIndexOf(".") + 1);
    }
}

reference links

Link: vosk open source speech recognition

Link: Summary of Whisper-based audio transcription services

Link: Recommended free speech-to-text tools

Link: java offline Chinese speech and text recognition

Link: Asr - python uses vosk for Chinese speech recognition

Link: NeMo is very powerful, covering ASR, NLP, TTS, providing pre-trained models and complete training modules. Its commercial version is RIVA.

Link: ASRT speech recognition document
ASRT is a deep-learning-based speech recognition tool that can be used to develop state-of-the-art speech recognition systems. The open-source project started in 2016; its baseline is 85% recognition accuracy, and under certain conditions it can reach about 95%. ASRT includes a speech recognition server (for training or for deploying API services) and client SDKs for multiple platforms and programming languages. It supports both single-utterance recognition and real-time streaming recognition. The code is open-sourced on GitHub and Gitee.
The ASRT API provides speech recognition for the search engine on the AI Lemon site, powering its voice search feature.

Building an offline speech recognition system with Web API access

Some directions and ideas:

  1. Choose a speech recognition engine

First, select a suitable speech recognition engine. Common options include CMU Sphinx, Kaldi, Baidu Speech, and the iFlytek Open Platform. Once selected, the engine needs to be configured and trained to fit your application scenario.

  2. Build the offline speech recognition system

Next, build the offline system itself. It can be installed and configured on a Linux distribution such as Ubuntu, with the engine chosen in the previous step and its dependencies installed on that system.

  3. Provide Web API access

To make the offline speech recognition system convenient to access and use, expose a corresponding Web API. A framework such as Flask can host the web service and call the speech recognition engine to do the actual recognition work.

Finally, to ensure recognition accuracy and fluency, a round of optimization and debugging is needed: audio noise reduction, speech-rate handling, model tuning, and so on.
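Staying in this article's language, the Web API step can also be sketched in plain Java with the JDK's built-in com.sun.net.httpserver, no Spring or Flask required. The recognize() hook below is hypothetical; it stands in for a call into a Vosk recognizer like the utility class shown earlier:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal HTTP endpoint for speech recognition using only the JDK.
public class AsrHttpApi {

    /** Hypothetical hook: a real service would feed the bytes to a Vosk Recognizer. */
    static String recognize(byte[] audio) {
        return "{\"text\":\"demo\"}";
    }

    /** Start a server exposing POST /asr; pass port 0 to bind a free port. */
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/asr", exchange -> {
            // Read the uploaded audio body and return the recognition result as JSON
            byte[] audio = exchange.getRequestBody().readAllBytes();
            byte[] body = recognize(audio).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

A client would POST raw WAV bytes to /asr and receive a JSON body with the recognized text.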

2 Whisper

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web.

Whisper is a speech recognition engine that can be used to build voice-controlled applications, and it can also provide offline recognition on mobile and embedded devices. If you want to build offline speech recognition in Java, however, consider other engines such as CMU Sphinx or Kaldi; both support offline recognition and provide Java APIs for developers.

3 Other projects

Open Source Chinese Speech Recognition Project Introduction: ASRFrame

https://blog.csdn.net/sailist/article/details/95751825

Tencent AI Lab Open Source Lightweight Speech Processing Toolkit PIKA

Focus on E2E speech recognition, Tencent AI Lab open source lightweight speech processing toolkit PIKA-CSDN community

Are there any open source python Chinese speech-to-text projects?

https://blog.csdn.net/devid008/article/details/129656356

Offline speech recognition third-party service provider

1 iFLYTEK

https://www.xfyun.cn/service/offline_iat

iFlytek's offline package targets Android only; there is no offline Java version.

Offline recognition there likewise relies on calling a local DLL.

2 Baidu Speech Recognition

https://ai.baidu.com/tech/speech/realtime_asr

Does not support offline recognition.

3 Alibaba Cloud Speech Recognition

https://ai.aliyun.com/nls/trans


Origin blog.csdn.net/qq_41604890/article/details/130546401