The task is as follows.
We have a text file whose lines look like this:
ws0012cs3d4 this, there. 3 is A!? Some a text Z this ...
Each line mixes English punctuation, English letters, digits, Chinese characters, spaces, and so on. We need to read the file line by line, keep the leading label (ws0012cs3d4) unchanged, filter the rest of the line so that only the Chinese characters remain, and then join the label and the filtered text back together, in the following form:
ws0012cs3d4 这里是一些文本 (that is, "Here is some text")
The result is saved to a new file.
The code is as follows:
# -*- coding: utf-8 -*-
'''
Get a txt file, remove all numbers, symbols, tabs and prosody marks from
each line, and save the result in a new txt file that can be used in MTTS.
There are two functions, each doing a different job; use whichever you need.
'''
from __future__ import unicode_literals
import io
import re


def remove_lines(txtfile):
    '''Delete every second line of the text (keep lines at index 0, 2, 4, ...).'''
    with io.open(txtfile, encoding='utf-8') as reader, \
            io.open('newfile.txt', 'w', encoding='utf-8') as writer:
        for index, line in enumerate(reader):
            if index % 2 == 0:
                writer.write(line)
    return 'newfile.txt'


def _txt_preprocess(txtfile):
    with io.open(txtfile, encoding='utf-8') as reader, \
            io.open('newfile2.txt', 'w', encoding='utf-8') as writer:
        txtlines = [x.strip() for x in reader.readlines()]
        for line in txtlines:
            if ' ' not in line:
                continue  # skip blank lines and lines that hold only a label
            # split on the first space only: label on the left, text on the right
            num, txt = line.split(' ', 1)
            # [] lists every symbol to filter out; the last two ranges are the
            # full-width lower-case and upper-case English letters
            txt = re.sub('[,.:;?!"#,。、:;?!…“”0-9a-zA-Za-zA-Z]', '', txt)
            space = ' '
            changeline = '\n'
            tmp = num + space + txt + changeline  # reassemble label and text
            writer.write(tmp)


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        description="convert mandarin_txt and wav to label for merlin.")
    parser.add_argument(
        "txtfile",
        help="Full path to the txtfile whose lines contain a line num and txt "
             "(separated by a white space)")
    args = parser.parse_args()
    _txt_preprocess(args.txtfile)