Over the past two days I was writing a crawler that processes English text and has to normalize English punctuation. Normally a period . must be followed by exactly one space, but there are exceptions such as i.e., e.g. and P.S. Because the letter case of these abbreviations cannot be predicted, I used a flag in the regular expression, but the flag did not take effect.
In the beginning, my function was written like this:
import re

def punctuate(s):
    # ---- rest of the code omitted for now
    s = re.sub(' e. g. ', 'e.g.', s, re.I)
    return s
The intent of this line is as follows: an originally correct e.g. gets wrongly changed to e. g. by the code in the first half of the function, so it has to be repaired by deleting the space after the first period. But this re.sub() call has three main problems:
- The characters before and after e. g. are not necessarily spaces, so a pattern written this way skips cases such as "e. g.," or "(e. g. xxx".
- The English period . is not escaped.
- The flag re.I does not take effect.
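The first two problems can be reproduced with a quick check. This is only a small sketch; the sample strings are made up for illustration and do not come from the crawler.

```python
import re

first_attempt = ' e. g. '  # the pattern from the first version of punctuate()

# Problem 1: the pattern demands a space on both sides, so these inputs are skipped.
skipped_comma = re.sub(first_attempt, 'e.g.', 'fruit, e. g., apples')
skipped_paren = re.sub(first_attempt, 'e.g.', 'fruit (e. g. apples)')

# Problem 2: an unescaped . matches any character, not only a period,
# so unrelated text is rewritten (and the surrounding spaces are swallowed).
false_hit = re.sub(first_attempt, 'e.g.', 'an ex gy one')

print(skipped_comma)  # fruit, e. g., apples   (unchanged)
print(skipped_paren)  # fruit (e. g. apples)   (unchanged)
print(false_hit)      # ane.g.one
```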
The first 2 problems are easy to solve. The improved code is as follows:
def punctuate(s):
    # ---- rest of the code omitted for now
    s = re.sub(r'([^a-zA-Z]e\.) (g\.[^a-zA-Z])', r'\g<1>\g<2>', s, re.I)
    return s
The rule is: e. g. must have a non-letter character (a space counts) both before and after it, and there must be a space between e. and g. That middle space is deleted, while the non-letter characters before and after are kept (\g<1> stands for the text matched by the first pair of parentheses, \g<2> for the text matched by the second pair). But the problem with the flag re.I is still unresolved.
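To see the two groups at work, here is a minimal sketch of the improved pattern applied to a made-up sentence. The flag is left out on purpose, so only lowercase input is covered. Note one limitation that follows from the rule itself: because [^a-zA-Z] has to consume a character, an abbreviation at the very start of the string is not matched.

```python
import re

pattern = r'([^a-zA-Z]e\.) (g\.[^a-zA-Z])'

# Group 1 keeps the leading non-letter plus "e.", group 2 keeps "g." plus the
# trailing non-letter; only the space between the two groups is dropped.
text = 'fruit (e. g. apples), e. g., pears'
fixed = re.sub(pattern, r'\g<1>\g<2>', text)

# No character precedes "e." here, so the first group cannot match.
at_start = re.sub(pattern, r'\g<1>\g<2>', 'e. g. at the start')

print(fixed)     # fruit (e.g. apples), e.g., pears
print(at_start)  # e. g. at the start   (unchanged)
```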
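The reason lies in the signature of re.sub():
re.sub(pattern, repl, string, count=0, flags=0)
The re.I that I passed as the fourth positional argument is treated as count. Since re.I has the integer value 2, the call quietly limits itself to at most two replacements and still matches case-sensitively. The correct approach is to pass the flag explicitly as a keyword argument: flags=re.I.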
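The difference between the two call styles can be checked directly. The following is a small sketch with made-up input; as far as I know, recent Python versions (3.13 and later) additionally emit a DeprecationWarning when count or flags are passed positionally.

```python
import re

text = 'E. g. one, E. g. two, E. g. three'
pattern = r'e\. g\.'

# re.I lands in the count parameter: matching stays case-sensitive,
# so the uppercase "E. g." is never found.
wrong = re.sub(pattern, 'e.g.', text, re.I)

# Passed by keyword, the flag really enables case-insensitive matching.
right = re.sub(pattern, 'e.g.', text, flags=re.I)

# The positional re.I acts as count=2: only the first two matches are replaced.
limited = re.sub('a', 'b', 'aaaa', re.I)

print(wrong)    # E. g. one, E. g. two, E. g. three   (unchanged)
print(right)    # e.g. one, e.g. two, e.g. three
print(limited)  # bbaa
```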
The entire punctuation normalization function also includes other substitutions. The complete code is as follows:
import re

def punctuate(s):
    s = re.sub(r'([,:;?!\.”\)])', r'\g<1> ', s)   # add a space after
    s = re.sub(r'([“\(])', r' \g<1>', s)          # add a space before
    s = re.sub(r'([“\(]) ', r'\g<1>', s)          # remove the space after
    s = re.sub(r' ([,:;?!\.”\)])', r'\g<1>', s)   # remove the space before
    s = re.sub(r'([,\.?!;\)]) ”', r'\g<1>”', s)   # remove the space before a closing quote
    s = re.sub(r'\) ([,:;?!\.”])', r')\g<1>', s)  # remove the space after a closing parenthesis
    s = re.sub(r'(\d)\. (\d)', r'\g<1>.\g<2>', s) # remove the space after a decimal point
    s = re.sub(r' +', ' ', s)                     # collapse multiple spaces into one
    # Latin dotted abbreviations: remove the space after the first period
    s = re.sub(r'([^a-zA-Z]e\.) (g\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]i\.) (e\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]q\.) (v\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]v\.) (s\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]n\.) (b\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]p\.) (s\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'\. ,', '.,', s)
    return s
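The six abbreviation lines differ only in their two letters, so they could also be generated in a loop. The sketch below is one possible variant, not the function I actually used: join_latin_abbreviations and LATIN_ABBREVIATIONS are hypothetical names, and the sketch swaps [^a-zA-Z] for lookarounds, which changes the behavior slightly because abbreviations at the very start or end of the string are then handled too.

```python
import re

# Letter pairs of the abbreviations handled in punctuate(): e.g., i.e., q.v., v.s., n.b., p.s.
LATIN_ABBREVIATIONS = ['eg', 'ie', 'qv', 'vs', 'nb', 'ps']

def join_latin_abbreviations(s):
    """Remove the space inside dotted abbreviations such as 'e. g.' (hypothetical helper)."""
    for first, second in LATIN_ABBREVIATIONS:
        # (?<![a-zA-Z]) and (?![a-zA-Z]) assert that no letter sits directly
        # before or after, without consuming a character, so the start and
        # end of the string are accepted as well.
        pattern = rf'(?<![a-zA-Z])({first}\.) ({second}\.)(?![a-zA-Z])'
        s = re.sub(pattern, r'\g<1>\g<2>', s, flags=re.I)
    return s

result = join_latin_abbreviations('E. g. apples; i. e. fruit (P. S. done)')
print(result)  # E.g. apples; i.e. fruit (P.S. done)
```

Passing flags=re.I by keyword is what keeps the uppercase forms covered here.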