- Problem Description
In natural language processing, we often need to read strings from a text file and delete unwanted content. This case presents a Python-based method for removing data that follows a regular pattern but is not needed. The basic process is: read a text file (each line of text is stored as one string), delete the substring at the beginning of each string, then remove a piece from the middle of the string, and finally write the cleaned strings to a new text file.
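The two deletion steps can be illustrated on a single string. The following is a minimal sketch, not the exact code of this case: the helper name `clean_line`, the sample line, and the choice to cut exactly through the end of the closing marker are assumptions made for illustration.

```python
def clean_line(line, keep_from, cut_start, cut_end):
    """Drop everything before keep_from, then drop cut_start..cut_end (inclusive)."""
    # Step 1: delete the beginning of the string, up to the first keep_from marker.
    start = line.find(keep_from)
    if start != -1:
        line = line[start:]
    # Step 2: delete the middle piece that runs from cut_start through cut_end.
    a = line.find(cut_start)
    if a == -1:
        return line  # nothing to cut
    b = line.find(cut_end, a)  # look for the closing marker only after cut_start
    if b == -1:
        return line  # no closing marker: leave the line as it is
    return line[:a] + line[b + len(cut_end):]


# Hypothetical sample line; the real input format is not shown in this case.
sample = "id=17 English: see http://example.com/x end for details\n"
print(repr(clean_line(sample, "English", "http", "end")))
```

Checking each `find` result against -1 matters because a negative index silently slices from the end of the string instead of raising an error.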
- Algorithm
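# Read a text file
def ReadTxtFile( fileName ):
    with open( fileName, mode = 'r', encoding = 'utf-8' ) as fp:
        lineNum = 0
        dataTxt = []
        for line in fp:
            if lineNum < 5:  # read only the first 5 lines
                lineNum += 1
                dataTxt.append( line )
            else:
                break
    return dataTxt, lineNum

# Delete the unwanted substrings
def DeletePartTxtData( dataTxt, lineNum ):
    data = []
    for i in range( lineNum ):
        txt = str( dataTxt[i] )
        # Drop the prefix: keep the text starting one character before 'English'
        ind = txt.find( 'English' )
        # max() guards against ind <= 0 (marker at the very start, or not found)
        txt1 = txt[ max( ind - 1, 0 ) : ]
        # Drop the middle piece: from 'http' up to 6 characters past the start of 'end'
        ind1 = txt1.find( 'http' )
        ind2 = txt1.find( 'end', ind1 ) if ind1 != -1 else -1
        if ind1 != -1 and ind2 != -1:
            txt2 = txt1[ : ind1 ] + txt1[ ind2 + 6 : ]
        else:
            txt2 = txt1  # markers missing: leave the line unchanged
        data.append( txt2 )
    return data

# Read a text file, delete the unwanted substrings, and store the result in a new text file
def ReadWriteTxtFile( fReadName, fWriteName ):
    data, lineNum = ReadTxtFile( fReadName )
    dataTxt = DeletePartTxtData( data, lineNum )
    with open( fWriteName, mode = 'w', encoding = 'utf-8' ) as fp:
        for i in range( lineNum ):
            # Prefix each cleaned line with its 1-based number, e.g. "[1]."
            strdata = '[' + str(i+1) + ']' + '.' + dataTxt[i]
            fp.write( strdata )

def main():
    ReadWriteTxtFile( 'InitialTxtData.txt', 'TxtData.txt' )
    print( 'over' )

if __name__ == '__main__':
    main()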
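As a possible variation on the read, clean, and write steps above, the file can be processed as a stream with `itertools.islice` instead of a manual line counter. This is only a sketch under assumptions: `process_file`, its `clean` callback, and the `max_lines` parameter are hypothetical names that do not appear in the original code, and `clean` stands in for whatever per-line cleaning is needed.

```python
from itertools import islice


def process_file(src, dst, clean, max_lines=5):
    """Copy up to max_lines cleaned lines from src to dst, numbering each one."""
    count = 0
    with open(src, encoding='utf-8') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        # islice stops after max_lines without reading the rest of the file.
        for count, line in enumerate(islice(fin, max_lines), start=1):
            # Same "[n]." prefix as in the code above.
            fout.write(f'[{count}].{clean(line)}')
    return count  # number of lines actually written
```

Passing the cleaning step as a function keeps the file handling separate from the string handling, so either part can be changed or tested on its own.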
- Annex
1.InitialTxtData.txt
2.TxtData.txt
Author: YangYF