
Traveling to the West Lake after ten years of marriage

Ten li of spring breeze, and the shepherd's purse and wheat are all green. Spring always lifts the spirits, and this March was special: my wife and I have now been married for ten years. The two of us took the luxury of a day off, left the children behind, and went back to the West Lake. We looked for the popsicle shop from 13 years ago (where I bought her the most expensive ice cream, 8 yuan, when she was still just a colleague), looked for the old man who sold red bean keychains 13 years ago (she gave me a mung bean keychain: pure friendship), and sat on the same bench I sat on 13 years ago... Just as I was immersed in romantic memories, a friend I had not been in touch with for a long time suddenly sent a message, and we arranged to meet at Dazhuhai (the Great Bamboo Sea) in Anji. I used to think the bamboo in front of and behind my house back home was wonderfully peaceful. It turns out that bamboo covering whole mountains and plains has a charm of its own. A group of children were playing football on the grass, having a wonderful time.

Curiosity starts from now on

While we were chatting, my friend showed me a mini program called Qingdou. Its feature for extracting video copy (the spoken script of a video) caught my attention: paste the share link of a short video from Douyin, Kuaishou, or a similar platform, and it extracts the text. Driven by curiosity, I set off on a journey of exploration. I never expected that it would be so easy to start and so hard to put down.

After a little thought, I settled on an overall flow with three steps:

Extract the video file -> separate the audio -> convert the audio to text.

Then I happily started coding. Reality soon hit me hard and proved an old Sichuan proverb that has stayed with me for 30 years: when you say it, it sounds as light as a lampwick rush (it is fun to hear in Sichuan dialect). In other words, easier said than done. The first difficulty was how to download a video from a share link while supporting the common platforms. After trying for a while, I gave up; after all, that was not what I was after. Later I happened to discover that many services offer an interface for downloading a video from its URL, so I simply used a third-party interface.
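The three-step flow can be sketched as a small pipeline of functions. This is only an illustration of the shape of the design: the class name `CopyExtractionPipeline` and the stub stages are hypothetical, and the real stages (the third-party download interface, JavaCV, and the speech recognition call) would replace the lambdas.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical sketch of the flow: share link -> video -> audio -> text. */
final class CopyExtractionPipeline {
    private final Function<String, byte[]> videoFetcher;   // step 1: share link -> video bytes
    private final Function<byte[], byte[]> audioExtractor; // step 2: video bytes -> audio bytes
    private final Function<byte[], String> speechToText;   // step 3: audio bytes -> text

    CopyExtractionPipeline(Function<String, byte[]> videoFetcher,
                           Function<byte[], byte[]> audioExtractor,
                           Function<byte[], String> speechToText) {
        this.videoFetcher = videoFetcher;
        this.audioExtractor = audioExtractor;
        this.speechToText = speechToText;
    }

    /** Runs the three stages in order on one share link. */
    String extract(String shareLink) {
        return videoFetcher.andThen(audioExtractor).andThen(speechToText).apply(shareLink);
    }

    public static void main(String[] args) {
        // Stub stages stand in for the real download, audio separation, and ASR calls.
        CopyExtractionPipeline pipeline = new CopyExtractionPipeline(
                link -> link.getBytes(StandardCharsets.UTF_8),
                video -> video,
                audio -> "recognized " + audio.length + " bytes");
        System.out.println(pipeline.extract("https://example.com/v/1"));
    }
}
```

Keeping each stage behind a plain function makes it easy to swap one out, which is what ended up happening to the download step later on.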

With the video URL in hand, downloading the file locally is simple (although the simple places are where pitfalls tend to hide). Here is the code; it returns an InputStream opened on the downloaded file.

public InputStream run(MediaDownloadReq req) {
    // Fetch the video stream for the given URL
    InputStream videoInputStream = null;
    try {
        String newName = "video-" + String.format("%s-%s", System.currentTimeMillis(), UUID.randomUUID()) + "." + req.getTargetFileSuffix();

        // Make sure the temp directory exists (including missing parents)
        File folder = new File(tempPath);
        if (!folder.exists()) {
            folder.mkdirs();
        }
        File file = HttpUtil.downloadFileFromUrl(req.getUrl(), new File(tempPath + newName), new StreamProgress() {
            // Download started
            @Override
            public void start() {
                log.info("Start download file...");
            }
            // Progress callback (logging is disabled here)
            @Override
            public void progress(long total) {
                //log.info("Download file progress: {} ", total);
            }
            @Override
            public void finish() {
                log.info("Download file success!");
            }
        });
        videoInputStream = new FileInputStream(file);
        // Remove the temp file right away. The open stream stays readable on POSIX systems;
        // on Windows the delete fails while the stream is still open.
        file.delete();
    } catch (Exception e) {
        log.error("Failed to fetch video stream, req = {}", req.getUrl(), e);
        throw new BusinessException(ErrorCodeEnum.DOWNLOAD_VIDEO_ERROR.code(), "Failed to fetch video stream");
    }
    return videoInputStream;
}
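For comparison, here is a minimal sketch of the same "download to a temp file, hand back a stream" idea using only the JDK. It is not the code used in the project: the class name `TempDownload` is made up, and `StandardOpenOption.DELETE_ON_CLOSE` is swapped in so the temp file's lifetime is tied to the stream instead of calling `file.delete()` by hand.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical JDK-only variant of the download step. */
final class TempDownload {

    /** Copies the resource at {@code uri} into a temp file and returns a stream over that file. */
    static InputStream open(URI uri, String suffix) throws IOException {
        Path tmp = Files.createTempFile("video-", "." + suffix);
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a half-written temp file behind
            Files.deleteIfExists(tmp);
            throw e;
        }
        // The JDK makes a best-effort attempt to delete the file when the stream is closed
        return Files.newInputStream(tmp, StandardOpenOption.DELETE_ON_CLOSE);
    }
}
```

A caller would wrap the returned stream in try-with-resources, for example `try (InputStream in = TempDownload.open(URI.create(url), "mp4")) { ... }`, so the temp file is cleaned up as soon as reading finishes.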

Next, JavaCV separates the audio. There is nothing special here: an FFmpegFrameGrabber reads the audio samples and an FFmpegFrameRecorder collects them. Again, here is the code.

public ExtractAudioRes run(ExtractAudioReq req) throws Exception {

    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    // Audio recorder writing to the in-memory stream; 2 = two channels
    FFmpegFrameRecorder recorder = new FFmpegFrameRecorder(outputStream, 2);

    recorder.setAudioOption("crf", "0");
    recorder.setAudioQuality(0);
    // Bit rate
    recorder.setAudioBitrate(256000);
    // Sample rate
    //recorder.setSampleRate(16000);
    recorder.setSampleRate(8000);
    recorder.setFormat(req.getAudioFormat());
    // Audio codec
    recorder.setAudioCodec(avcodec.AV_CODEC_ID_PCM_S16LE);
    // Start recording
    recorder.start();

    // Read the video
    FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(req.getVideoInputStream());
    grabber.setSampleRate(8000);
    //FFmpegLogCallback.set(); // enable FFmpeg debug logging
    // Grabber timeout (in microseconds; 1 second = 1,000,000 microseconds)
    grabber.setOption("stimeout", String.valueOf(TimeUnit.MINUTES.toMicros(30L)));
    grabber.start();
    recorder.setAudioChannels(grabber.getAudioChannels());
    Frame f;
    // getLengthInTime() is in microseconds; convert to seconds
    Long audioTime = grabber.getLengthInTime() / 1000 / 1000;
    // Grab the audio samples and record them
    while ((f = grabber.grabSamples()) != null) {
        recorder.record(f);
    }
    grabber.stop();
    recorder.close();

    // Audio bytes, duration in seconds, size in KB
    ExtractAudioRes extractAudioRes = new ExtractAudioRes(outputStream.toByteArray(), audioTime, outputStream.size() / 1024);
    extractAudioRes.setFormat(req.getAudioFormat());

    return extractAudioRes;
}
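The duration and size returned above matter later, when choosing which recognition interface to call. As a rough sanity check, the size of 16-bit PCM audio can be estimated from the sample rate, channel count, and duration. The helper below is a hypothetical sketch (the class `PcmMath` is not part of the project) and ignores container headers, so real files come out slightly larger.

```java
/** Hypothetical helper: back-of-the-envelope size of uncompressed 16-bit PCM audio. */
final class PcmMath {

    /** Bytes per second of 16-bit PCM: 2 bytes per sample, per channel. */
    static long bytesPerSecond(int sampleRate, int channels) {
        return (long) sampleRate * channels * 2;
    }

    /** Estimated size in KB for a duration given in microseconds (whole seconds only). */
    static long estimatedKb(int sampleRate, int channels, long durationMicros) {
        long seconds = durationMicros / 1_000_000;
        return bytesPerSecond(sampleRate, channels) * seconds / 1024;
    }

    public static void main(String[] args) {
        // 8000 Hz, two channels, 60 seconds
        System.out.println(estimatedKb(8000, 2, 60_000_000L) + " KB");
    }
}
```

With the settings used above (8000 Hz, two channels), one minute of audio is 8000 × 2 × 2 × 60 = 1,920,000 bytes, or 1875 KB.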

The storm before dawn

At this point I thought victory was like the red sun about to break through the rosy clouds in the east: infinitely close. The first test case passed, and so did the second. Just as I was getting ready for the speech-to-text stage, the last unit test failed. That kicked off a long, drawn-out round of debugging.

1. Downloading over HTTP and saving the file: parsing failed with avformat_find_stream_info() error: Could not find stream information;

2. Saving the file from the browser: parsing also failed;

3. Downloading with Xunlei (Thunder): parsing also failed;

...

I began to suspect that something was wrong with the encoding of the video returned by the third-party interface, and when a file saved from the Douyin app parsed successfully, that suspicion grew stronger. But files saved through the WeChat mini program API saveVideoToPhotosAlbum also parsed successfully... so I started to doubt myself. I began tweaking all sorts of parameters at random. After countless failures, a bold idea came to me: if JavaCV cannot parse a file that I downloaded, surely it can parse one that it downloads itself. Sure enough, it could. Only one line of the code above changed.


//FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(req.getVideoInputStream());
// Pass the URL directly and let the grabber fetch the video itself
FFmpegFrameGrabber grabber = new FFmpegFrameGrabber(req.getUrl());
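When a saved file fails with "Could not find stream information", one quick check that can narrow things down is whether the bytes look like an MP4 at all, rather than, say, an HTML error page. The sketch below is not something the project does; it is a hypothetical diagnostic (`Mp4Sniffer` is a made-up name) based on the fact that MP4 files normally begin with an `ftyp` box, so bytes 4 to 7 read `ftyp`. A file can pass this check and still fail to parse.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical diagnostic: does a downloaded file start like an MP4? */
final class Mp4Sniffer {

    /** True when bytes 4..7 spell "ftyp", the box type that normally opens an MP4 file. */
    static boolean looksLikeMp4(byte[] head) {
        if (head == null || head.length < 8) {
            return false;
        }
        return "ftyp".equals(new String(head, 4, 4, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        byte[] mp4Head = {0, 0, 0, 0x20, 'f', 't', 'y', 'p', 'i', 's', 'o', 'm'};
        byte[] htmlHead = "<html><body>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeMp4(mp4Head));
        System.out.println(looksLikeMp4(htmlHead));
    }
}
```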

The next step is to call Tencent Cloud's ASR interface on the extracted audio. When I built an internal finance bot on OpenAI's interface a while ago, I wrote a wrapper for converting voice input to text, so I simply dropped it in and it worked. The one-sentence recognition interface is called as follows. If the audio is longer than one minute, call the long-audio interface instead. (Note: the one-sentence interface returns synchronously, while long-audio recognition uses an asynchronous callback.)

    /**
     * One-sentence speech-to-text.
     *
     * @param audioRecognitionReq request carrying the audio bytes and format
     * @return the recognized text
     * @author jijunjian
     * @date 11/21/23 09:48
     */
    @Override
    public String run(AudioRecognitionReq audioRecognitionReq) {

        log.info("One-sentence speech-to-text started");
        AsrClient client = new AsrClient(cred, "");
        SentenceRecognitionRequest req = new SentenceRecognitionRequest();
        // SourceType 1: the audio data is sent in the request body
        req.setSourceType(1L);
        req.setVoiceFormat(audioRecognitionReq.getFormat());
        req.setEngSerViceType("16k_zh");
        // The audio is sent Base64-encoded; DataLen is the length of the raw bytes
        String base64Data = BaseEncoding.base64().encode(audioRecognitionReq.getBytes());
        req.setData(base64Data);
        req.setDataLen((long) audioRecognitionReq.getBytes().length);

        try {
            SentenceRecognitionResponse resp = client.SentenceRecognition(req);
            log.info("Speech-to-text result: {}", JSONUtil.toJsonStr(resp));
            String text = resp.getResult();
            if (Strings.isNotBlank(text)) {
                return text;
            }
            return "No content";
        } catch (TencentCloudSDKException e) {
            // Pass the exception as the last argument so the stack trace is logged
            log.error("Speech-to-text failed", e);
            throw new BusinessException(AUDIO_RECOGNIZE_ERROR.code(), "Speech-to-text error, please retry");
        }
    }
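The payload preparation in that method is easy to get subtly wrong, so here it is in isolation. This is a sketch, not the project's code: `SentencePayload` is a hypothetical holder class, and the JDK's `java.util.Base64` is swapped in for the Guava `BaseEncoding` call used above. As in the method above, the length sent alongside the data is the length of the raw audio bytes, not of the Base64 text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical holder for the two audio fields of a one-sentence recognition request. */
final class SentencePayload {
    final String data;   // Base64 text of the audio bytes
    final long dataLen;  // length of the raw audio in bytes, before encoding

    SentencePayload(byte[] audio) {
        this.data = Base64.getEncoder().encodeToString(audio);
        this.dataLen = audio.length;
    }

    public static void main(String[] args) {
        SentencePayload payload = new SentencePayload("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(payload.data + " (" + payload.dataLen + " raw bytes)");
    }
}
```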

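Speech-to-text for long audio is similar. The code below uses the flash recognition interface (FlashRecognizer), which takes the audio data and returns the result synchronously.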

    /**
     * Flash speech-to-text, used for longer audio.
     *
     * @param audioRecognitionReq request carrying the audio bytes and format
     * @return the recognized text
     * @author jijunjian
     * @date 11/21/23 09:48
     */
    @Override
    public String run(AudioRecognitionReq audioRecognitionReq) {

        log.info("Flash speech-to-text started");
        Credential credential = Credential.builder().secretId(AppConstant.Tencent.asrSecretId).secretKey(AppConstant.Tencent.asrSecretKey).build();
        try {

            FlashRecognizer recognizer = SpeechClient.newFlashRecognizer(AppConstant.Tencent.arsAppId, credential);
            // Loading the audio from a file path when no bytes are supplied is left for later
            byte[] data = audioRecognitionReq.getBytes();

            // Send the audio data and get the result synchronously
            FlashRecognitionRequest recognitionRequest = FlashRecognitionRequest.initialize();
            recognitionRequest.setEngineType("16k_zh");
            recognitionRequest.setFirstChannelOnly(1);
            recognitionRequest.setVoiceFormat(audioRecognitionReq.getFormat());
            recognitionRequest.setSpeakerDiarization(0);
            recognitionRequest.setFilterDirty(0);
            recognitionRequest.setFilterModal(0);
            recognitionRequest.setFilterPunc(0);
            recognitionRequest.setConvertNumMode(1);
            recognitionRequest.setWordInfo(1);
            FlashRecognitionResponse response = recognizer.recognize(recognitionRequest, data);

            if (SuccessCode.equals(response.getCode())) {
                return response.getFlashResult().get(0).getText();
            }
            log.info("Flash speech-to-text failed: {}", JSONUtil.toJsonStr(response));
            throw new BusinessException(AUDIO_RECOGNIZE_ERROR.code(), "Flash speech-to-text failed, please retry");
        } catch (BusinessException e) {
            // Rethrow as-is so the generic handler below does not replace the message
            throw e;
        } catch (Exception e) {
            log.error("Speech-to-text failed", e);
            throw new BusinessException(AUDIO_RECOGNIZE_ERROR.code(), "Flash speech-to-text error, please retry");
        }
    }

    /**
     * Decides from the request whether this (long-audio) recognizer should handle it.
     *
     * @param req the recognition request
     * @return true when the duration is unknown or the audio reaches the duration or size limit
     * @author jijunjian
     * @date 3/3/24 18:54
     */
    @Override
    public Boolean filter(AudioRecognitionReq req) {
        return req.getAudioTime() == null
                || req.getAudioTime() >= AppConstant.Tencent.Max_Audio_Len
                || req.getAudioSize() >= AppConstant.Tencent.Max_Audio_Size;
    }
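Putting the filter together with the two recognizers, the selection rule boils down to something like the sketch below. The class `RecognizerChooser` and both limits are assumptions for illustration: 60 seconds follows the "longer than one minute" rule mentioned earlier, while the 3 MB size limit is a placeholder, since the actual values of `Max_Audio_Len` and `Max_Audio_Size` are not shown here.

```java
/** Hypothetical sketch of choosing between the one-sentence and flash recognizers. */
final class RecognizerChooser {

    enum Route { SENTENCE, FLASH }

    // Assumed limits; the real constants live in AppConstant.Tencent and are not shown
    static final long MAX_SENTENCE_SECONDS = 60;
    static final long MAX_SENTENCE_KB = 3 * 1024;

    /** Picks FLASH when the duration or size is unknown or reaches either limit. */
    static Route choose(Long audioSeconds, Long audioSizeKb) {
        boolean unknown = audioSeconds == null || audioSizeKb == null;
        if (unknown || audioSeconds >= MAX_SENTENCE_SECONDS || audioSizeKb >= MAX_SENTENCE_KB) {
            return Route.FLASH;
        }
        return Route.SENTENCE;
    }

    public static void main(String[] args) {
        System.out.println(choose(30L, 500L));
        System.out.println(choose(90L, 500L));
    }
}
```

Treating an unknown duration as "long" is the conservative choice, and it mirrors the null check in the filter above.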

An ending without an end

At first I was only curious about how copy extraction worked, but once I started writing I could not stop. With the backend done, it felt like a pity to have no front end to show it off. So I asked my wife to help design a UI, and I put together a simple mini program... After a while, it finally went live. Interested readers are welcome to try it.

Mini program name: Text-to-speech utility;

Mini program QR code: QR codes are not allowed here, so here is a link for a quick try

Extracting short video copy turns out to be so easy
