Removing English letters, digits, and special characters from text in Python

In Chinese natural language processing, English letters, digits, and special characters cannot be matched against a Chinese dictionary, so they need to be removed from the text first.

The method is as follows:

First import the re module:

import re

Then use the sub() function to remove letters and digits first:

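data = re.sub(r'[a-zA-Z0-9]', '', data)
# First argument: the pattern to search for (a-z, A-Z, 0-9)
# Second argument: '' is the replacement for every match of the first argument
# Third argument: the text that was read in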
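Note that re.sub() returns a new string and does not modify its input, so the result has to be assigned to a variable. A minimal sketch of this step, using a made-up sample string (the variable names here are illustrative, not from the original post):

```python
import re

# Hypothetical sample text mixing Chinese, English letters, and digits.
sample = "我爱Python3编程2022"

# re.sub returns a new string; the original string is left unchanged.
cleaned = re.sub(r'[a-zA-Z0-9]', '', sample)

print(cleaned)  # 我爱编程
```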

Then delete the special characters:

data = re.sub(r'\W', '', data)
# Replaces special characters, i.e. anything that is not a letter, digit, Chinese character, or underscore
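The two substitutions can be wrapped in one helper. The sketch below assumes Python 3 str input, where \W treats Chinese characters as word characters and therefore keeps them; the function name and sample text are hypothetical. Whitespace and punctuation (both ASCII and Chinese) are removed, while the underscore survives because \W does not match it.

```python
import re

def remove_non_chinese(text):
    """Apply the two steps in order: drop ASCII letters/digits, then non-word characters."""
    text = re.sub(r'[a-zA-Z0-9]', '', text)  # step 1: letters and digits
    text = re.sub(r'\W', '', text)           # step 2: punctuation, whitespace, symbols
    return text

# Hypothetical sample input.
result = remove_non_chinese("你好, world! 2022年_快乐。")
print(result)  # 你好年_快乐
```

If underscores should also go, one option is to add `_` to the first character class, for example `[a-zA-Z0-9_]`.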

Text length before processing: 660331

Text length after processing: 334183
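The post does not show how these lengths were obtained; one plausible way is to call len() before and after cleaning. The sketch below is an assumption along those lines, and it also merges both substitutions into a single compiled pattern, which removes the same characters in one pass. The names and the sample string are hypothetical.

```python
import re

# One pattern covering both steps: ASCII letters/digits, or any non-word character.
PATTERN = re.compile(r'[a-zA-Z0-9]|\W')

def clean_and_measure(text):
    """Return (length before, length after, cleaned text)."""
    cleaned = PATTERN.sub('', text)
    return len(text), len(cleaned), cleaned

# Hypothetical sample input; a real run would pass the text read from the corpus.
before, after, cleaned = clean_and_measure("NLP自然语言处理 v2.0 很有趣!")
print("Text length before processing:", before)
print("Text length after processing:", after)
```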

 

 


Origin blog.csdn.net/Wuyeyu2001/article/details/127324156