AI technology practice | Using Tencent Cloud recording file recognition to automatically generate subtitles for videos without subtitles

Just imagine: when we watch a video and the subtitles are missing, isn't the viewing experience greatly compromised?

In recent years, online entertainment formats such as short videos and live streaming have developed rapidly, directly driving new trends in industries such as tourism, e-commerce, and film and television creation. Presenting a polished video takes not only good shooting technique but also careful post-production. Take subtitles as an example: a subtitled video can be watched smoothly, while a video without subtitles always feels like something is missing. Adding subtitles entirely by hand, however, is time-consuming and labor-intensive, and doing it in batches for long videos is fairly painful. So is there a smarter way?

Next, this article will share how to use the recording file recognition service to automatically generate subtitles for videos without subtitles.

1. Analysis and research

Automatically generating subtitles for a video without subtitles boils down to this: first export the audio track from the video file and run speech recognition on it to obtain the recognized text, then use the timing information of the recognized text and its short sentences to build an srt subtitle file, and finally load the srt file alongside the video to get the subtitled result.

The implementation idea is as follows:

1. Extract audio from video with ffmpeg

2. Call the recording file recognition service to recognize the audio file

3. Process the time information of the recognized text and short sentences to obtain the video srt subtitle file

4. Place the srt file in the same directory as the video file, give them the same name, and open the video with Baofengyingyin (Baofeng Player) or another player to get the video with subtitles.
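
Before diving into each step, here is a minimal sketch of how the whole pipeline could be wired together. It is only an illustration: it assumes the helper functions extract_audio, upload_file, create_rec, query_rec_task and to_srt described in the next section are available, that the audio is exported as a .wav file, and that cos_base_url is the access address of your own COS bucket.

import os
from tencent.config import Config

def generate_srt(video_path, cos_base_url):
    # Sketch of the overall flow: video -> audio -> ASR -> srt file
    name = os.path.splitext(os.path.basename(video_path))[0]
    tmp_audio = os.path.join(Config.OUTPUT_PATH, name + '.wav')

    extract_audio(video_path, tmp_audio)          # step 1: extract audio with ffmpeg
    upload_file(tmp_audio)                        # step 2a: upload the audio to COS
    file_url = cos_base_url + os.path.basename(tmp_audio)
    _, task_id = create_rec('16k_zh', file_url)   # step 2b: create the recognition task ('16k_zh' = 16 kHz Mandarin engine)
    ok, result_detail = query_rec_task(task_id)   # step 2c: poll until the result is ready
    if not ok:
        return None

    srt_txt = to_srt(result_detail)               # step 3: build the srt text
    srt_path = os.path.join(Config.OUTPUT_PATH, name + '.srt')
    with open(srt_path, 'w', encoding='utf-8') as f:
        f.write(srt_txt)                          # step 4: name the srt like the video
    return srt_path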

2. Code development

1. Extract audio from video with ffmpeg

The project depends on ffmpeg, which needs to be downloaded and installed first, with its path added to the environment variables. After that, you can import the subprocess library and launch ffmpeg in a new process to complete the audio extraction.

import subprocess

def extract_audio(video, tmpAudio):
    # Make sure ffmpeg is installed and available on the PATH
    ret = subprocess.run('ffmpeg -version', shell=True)
    if ret.returncode != 0:
        print("Please install the ffmpeg dependency first and set the environment variables")
        return
    # Drop the video stream (-vn) and resample the audio to 16 kHz for recognition
    ret = subprocess.run(['ffmpeg', '-i', video, '-vn', '-ar', '16000', tmpAudio], shell=False)
    if ret.returncode != 0:
        print("error:", ret)

2. Identify audio files

The recording file recognition service chosen here is the one offered by Tencent Cloud ASR. From my research, Tencent Cloud's recording file recognition can intelligently segment sentences and add punctuation based on the pauses between sentences in a single call, without needing any other interfaces. When splitting sentences and returning result data, it also offers several options for different needs, such as whether to filter profanity and whether to filter filler words.

The specific details of the service will not be covered here; for details, please refer to the official documentation of Tencent Cloud ASR.

(1) To access Tencent Cloud services, you need a SecretId and SecretKey. The API keys can be created and viewed on the API key management page, and will be written into the config file later.

The author's project configuration is in tencent/config.py

class Config(object):
    OUTPUT_PATH = '/XXX/video-srt/audio/'  # output file directory
    APP_ID = '******'      # the APPID mentioned above
    SECRET_ID = '******'   # the SecretId mentioned above
    SECRET_KEY = '******'  # the SecretKey mentioned above

(2) Use the officially provided SDK

Find the API documentation for recording file recognition under the Tencent Cloud Speech Recognition service, scroll to the bottom, and find the developer resources. Here I chose to call the Python SDK.

Recording file recognition is an asynchronous service: you submit a recognition request through the CreateRecTask interface, and then query the recognition result through the DescribeTaskStatus interface.

In the author's project, the functions create_rec and query_rec_task encapsulate the CreateRecTask and DescribeTaskStatus interfaces respectively. The details are as follows:

CreateRecTask:

In addition to required parameters such as EngineModelType (engine model type), ChannelNum (number of audio channels), ResTextFormat (format of the recognition result), and SourceType (source of the speech data), the request can also carry optional parameters such as FilterDirty (whether to filter profanity) and FilterModal (whether to filter filler words).

After the request is successful, RequestId, TaskId and other information will be returned.

import traceback
from tencentcloud.asr.v20190614 import models
from tencent.config import Config

def create_rec(engine_type, file_url):
    client = create_client(Config.SECRET_ID, Config.SECRET_KEY)
    req = models.CreateRecTaskRequest()
    # 1 channel, word-level detailed results, speech passed in as a URL
    params = {"ChannelNum": 1, "ResTextFormat": 2, "SourceType": 0, "ConvertNumMode": 1}
    req._deserialize(params)
    req.EngineModelType = engine_type
    req.Url = file_url
    try:
        resp = client.CreateRecTask(req)
        logger.info(resp)
        request_id = resp.RequestId
        task_id = resp.Data.TaskId
        return request_id, task_id
    except Exception as err:
        logger.info(traceback.format_exc())
        return None, None
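
Both create_rec above and query_rec_task below call a create_client helper that the article does not show. The sketch below is one plausible implementation based on the standard Tencent Cloud Python SDK (tencentcloud-sdk-python); the endpoint and the empty region are assumptions, and logger is likewise assumed to be a logging.Logger configured elsewhere in the project.

from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.asr.v20190614 import asr_client

def create_client(secret_id, secret_key):
    # Build an ASR client from the SecretId / SecretKey in the config
    cred = credential.Credential(secret_id, secret_key)
    http_profile = HttpProfile()
    http_profile.endpoint = "asr.tencentcloudapi.com"  # assumed default endpoint
    client_profile = ClientProfile()
    client_profile.httpProfile = http_profile
    return asr_client.AsrClient(cred, "", client_profile)  # region left empty here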

There are two parameters to note here:

The first is ResTextFormat. The recognition result can be returned in three forms. Because the author later splits long sentences again at the punctuation inside each recognized sentence when generating the srt file, the "word-level detailed recognition results (including punctuation and speaking-speed values)" form is chosen here. If you do not need that extra layer of splitting, you can use the "recognition result text (including segment timestamps)" form directly.

The second is SourceType. There are two possible sources for the speech data: a speech URL or the speech data itself (in the POST body). The author chooses the speech URL here. The concrete implementation is to upload the local audio to a Tencent Cloud COS bucket, so the speech URL is simply a fixed address plus the audio file name. The audio URL can of course also be obtained in other ways.

import subprocess

def upload_file(tmpAudio):
    # Requires the coscmd tool to be installed and configured with the target COS bucket
    objectName = tmpAudio.split('/')[-1]
    ret = subprocess.run(['coscmd', '-s', 'upload', tmpAudio, objectName], shell=False)
    if ret.returncode != 0:
        print("error:", ret)

DescribeTaskStatus:

The TaskId needs to be passed in when requesting.

After the request succeeds, the RequestId and the recognition result will be returned.

import json
import time
from tencentcloud.asr.v20190614 import models
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencent.config import Config

def query_rec_task(taskid):
    client = create_client(Config.SECRET_ID, Config.SECRET_KEY)
    req = models.DescribeTaskStatusRequest()
    params = '{"TaskId":' + str(taskid) + '}'
    req.from_json_string(params)
    result = ""
    # Poll until the asynchronous task either succeeds or fails
    while True:
        try:
            resp = client.DescribeTaskStatus(req)
            resp_json = resp.to_json_string()
            logger.info(resp_json)
            resp_obj = json.loads(resp_json)
            if resp_obj["Data"]["StatusStr"] == "success":
                result = resp_obj["Data"]["ResultDetail"]
                break
            if resp_obj["Data"]["Status"] == 3:  # 3 means the task failed
                return False, ""
            time.sleep(1)
        except TencentCloudSDKException as err:
            logger.info(err)
            return False, ""
    return True, result

The author generates the srt file from the ResultDetail information, so the return value of query_rec_task is the ResultDetail field of the data returned by the DescribeTaskStatus interface.

3. Process the recognition results to generate srt subtitle files

When generating the srt file, the author not only timestamps the sentences that the interface has already segmented automatically, but also, when an automatically segmented sentence is too long, splits it again at the punctuation inside the sentence using the OffsetEndMs, StartMs, EndMs and other fields in ResultDetail, so that a single subtitle line does not carry too much text.
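
For orientation, each element of ResultDetail is a per-sentence record. The sketch below shows only the fields that to_srt consumes, with invented example values; see the official API documentation for the full structure.

# Illustrative shape only; the values are invented for demonstration
result_detail_example = [
    {
        "FinalSentence": "大家好，欢迎收看本期视频。",  # recognized text of one sentence
        "StartMs": 1200,   # sentence start time in the audio, in milliseconds
        "EndMs": 4800,     # sentence end time, in milliseconds
        "Words": [         # word-level details returned when ResTextFormat is 2
            {"Word": "大", "OffsetStartMs": 0, "OffsetEndMs": 300},  # offsets relative to StartMs
            # ... one entry per character/word ...
        ],
    },
]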

def to_srt(src_txt):
    flag_word = ["。", "?", "!", ","]  # punctuation used as split points
    basic_line = 15                    # maximum characters per subtitle line
    srt_txt = ""
    count = 1
    for i in range(len(src_txt)):
        current_sentence = src_txt[i]["FinalSentence"]
        last_time = ms_to_hours(src_txt[i]["StartMs"])
        len_rec = len(current_sentence)
        if len_rec > basic_line:
            # The sentence is too long: split it again at punctuation (or every basic_line characters)
            start_rec = 0
            last_time = ms_to_hours(src_txt[i]["StartMs"])
            while len_rec > basic_line:
                flag = True
                for j in flag_word:
                    if j in current_sentence[start_rec:start_rec + basic_line]:
                        loc_rec = current_sentence.index(j, start_rec, start_rec + basic_line) + 1
                        flag = False
                        break
                if flag:
                    # No punctuation found in this window, cut hard at basic_line
                    loc_rec = start_rec + basic_line
                current_txt = current_sentence[start_rec:loc_rec] + "\n"
                start_time = last_time
                # End time = word offset within the sentence + sentence start time
                end_time = ms_to_hours(src_txt[i]["Words"][loc_rec]["OffsetEndMs"] + src_txt[i]["StartMs"])
                if current_sentence[start_rec:] != "" and current_sentence[start_rec:] is not None:
                    srt_txt = srt_txt + str(count) + "\n" + start_time + "-->" + end_time + "\n" + current_txt + "\n"
                    count += 1
                start_rec = loc_rec
                last_time = end_time
                len_rec = len(current_sentence[loc_rec:])
            # Emit whatever remains of the sentence
            current_txt = current_sentence[start_rec:] + "\n"
            start_time = last_time
            end_time = ms_to_hours(src_txt[i]["EndMs"])
            if current_sentence[start_rec:] != "" and current_sentence[start_rec:] is not None:
                srt_txt = srt_txt + str(count) + "\n" + start_time + "-->" + end_time + "\n" + current_txt + "\n"
                count += 1
        else:
            # Short sentence: use the interface's own segmentation directly
            start_time = last_time
            end_time = ms_to_hours(src_txt[i]["EndMs"])
            srt_txt = srt_txt + str(count) + "\n" + start_time + "-->" + end_time + "\n" + current_sentence + "\n" + "\n"
            count += 1
    return srt_txt
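
The ms_to_hours helper used throughout to_srt is not shown in the article; it converts a millisecond position in the audio into an srt timestamp. A minimal sketch of what it could look like:

def ms_to_hours(millis):
    # Convert milliseconds to an srt timestamp of the form HH:MM:SS,mmm
    milliseconds = int(millis) % 1000
    seconds = (int(millis) // 1000) % 60
    minutes = (int(millis) // (1000 * 60)) % 60
    hours = int(millis) // (1000 * 60 * 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, milliseconds)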

The srt file is finally written to the directory specified by OUTPUT_PATH in the Config file.

4. Get the video with subtitles

(1) The original video file must have the same name as the srt file

(2) Select the player to open the video with

(3) The video now plays with subtitles

At this point, automatically generating subtitles for videos without subtitles has been implemented. The complete project code is in the appendix; apart from modifying a few configuration items, it is quite simple to use. Interested readers are welcome to try it!

Appendix

Project code: GitHub - ForestSkyzzx/video-srt: Use Tencent Cloud AI recording file recognition to automatically generate subtitles for videos without subtitles
