Python text processing with re.sub: read a text file, remove specified characters, and save the result

Here is the task.
We have a text file whose lines look like this:

 

ws0012cs3d4 this, there. 3 is A!? Some a text Z this ...

 

The text contains English punctuation, English letters, digits, Chinese characters, spaces, and so on. We need to read it line by line, keep the leading label (ws0012cs3d4) unchanged, filter the text after the label so that only the Chinese characters remain, and then join the label and the filtered text back together, in the following form:

ws0012cs3d4 Here is some text

  

The result is saved to a new file.

The code is as follows:

 

# -*- coding: utf-8 -*-
'''
Get a txt file,
remove all numbers, symbols, tabs, and prosody marks in each txt and save it in a new txt;
save the txt in a new file which can be used in mtts.
There are two functions, each implementing a different task; use whichever you need.
'''

from __future__ import unicode_literals
import io
import re
import os


def remove_lines(txtfile):
    '''Keep every other line, starting with the first, and drop the rest.'''
    # io.open with an explicit encoding behaves the same on Python 2 and 3.
    with io.open(txtfile, encoding='utf-8') as reader, \
            io.open('newfile.txt', 'w', encoding='utf-8') as writer:
        for index, line in enumerate(reader):
            if index % 2 == 0:
                writer.write(line)
    return 'newfile.txt'


def _txt_preprocess(txtfile):
    with io.open(txtfile, encoding='utf-8') as reader, \
            io.open('newfile2.txt', 'w', encoding='utf-8') as writer:
        txtlines = [x.strip() for x in reader.readlines()]
        for line in txtlines:
            # Split on the first space only, so the label is kept separate from the text.
            num, txt = line.split(' ', 1)
            # [] lists every character to filter out; the last two ranges are
            # full-width lowercase and uppercase English letters.
            txt = re.sub('[,.!?:;,。!?:;、…“”"#0-9a-zA-Za-zA-Z]', '', txt)
            space = ' '
            changeline = '\n'
            tmp = num + space + txt + changeline  # reassemble the line
            writer.write(tmp)


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        description="Convert mandarin_txt and wav to label for merlin.")
    parser.add_argument(
        "txtfile",
        help="Full path to the txtfile, each line of which contains a line num "
             "and txt (separated by a white space)")
    args = parser.parse_args()

    _txt_preprocess(args.txtfile)
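The blacklist in `_txt_preprocess` removes only the characters it lists, so anything unlisted (for example full-width digits or brackets) survives. An alternative, shown here as a sketch rather than the author's method, is a whitelist that deletes everything outside the CJK Unified Ideographs range U+4E00 to U+9FFF. The names `keep_chinese` and `preprocess_line` are hypothetical, and the range deliberately ignores the rarer CJK extension blocks.

```python
# -*- coding: utf-8 -*-
import re

# Matches any character that is NOT in the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fff]')


def keep_chinese(text):
    """Hypothetical helper: drop every character outside U+4E00..U+9FFF."""
    return NON_CHINESE.sub('', text)


def preprocess_line(line):
    """Hypothetical helper: keep the label and only the Chinese part of the text."""
    label, text = line.strip().split(' ', 1)
    return label + ' ' + keep_chinese(text)
```

With this approach the pattern does not need updating each time a new kind of symbol shows up in the input, at the cost of also discarding any non-Chinese characters you might have wanted to keep.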

  

 


Origin www.cnblogs.com/gstblog/p/11649165.html