In Chinese natural language processing, English letters, digits, and special characters (punctuation and other symbols) cannot be matched against a Chinese dictionary, so they need to be removed from the text first.
The method is as follows:
First, import the re library:
import re
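Then use the sub() function to remove letters and digits. sub() returns a new string, so assign the result back:
data = re.sub(r'[a-zA-Z0-9]', '', data)
# The first argument is the pattern to search for: a-z, A-Z, 0-9
# The second argument, '', is the replacement for each match
# The third argument is the text that was read in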
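Then remove the special characters:
data = re.sub(r'\W', '', data)
# Replaces special characters with '', i.e. anything that is not a letter, digit, Chinese character, or _ (whitespace is removed as well)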
Text length before processing: 660331
Text length after processing: 334183
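
The two steps above can be combined into one small helper. This is a minimal sketch, not the original script: the function name clean_chinese_text and the sample sentence are made up for illustration, and it assumes Python 3, where \W applied to a str is Unicode-aware and therefore keeps Chinese characters.

```python
import re


def clean_chinese_text(data):
    """Remove ASCII letters, ASCII digits, and non-word characters."""
    # Step 1: drop ASCII letters and digits.
    data = re.sub(r'[a-zA-Z0-9]', '', data)
    # Step 2: drop every non-word character (punctuation, symbols,
    # whitespace). Chinese characters and the underscore count as
    # word characters, so they are kept.
    data = re.sub(r'\W', '', data)
    return data


# Hypothetical sample text, used only to show the effect of each step.
sample = "我在2023年学习Python,很有趣!"
cleaned = clean_chinese_text(sample)

print("Text length before processing:", len(sample))
print("Text length after processing:", len(cleaned))
print(cleaned)
```

Because the first pattern lists only ASCII ranges, full-width letters and digits would survive both steps; if the corpus contains them, the character class would need to be extended.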