Python's re.sub() function: the flags pitfall and the number of parameters

Over the past two days I was writing a crawler that processes English text and needs to normalize English punctuation. Normally a period . should be followed by exactly one space, but there are exceptions such as i.e., e.g., and P.S. Because the letter case of these abbreviations cannot be predicted, I used flags in the regular expressions, but they did not take effect.

In the beginning, my function was written like this:

 def punctuate(s):
     # ---- rest of the code omitted for now
     s = re.sub(' e. g. ', 'e.g.', s, re.I)
     return s

The intent of this line: a correct e.g. gets mangled into e. g. by the code in the first half of the function, so it has to be repaired by deleting the space after the first period. But this re.sub() call has three main problems:

  • e. g. is not necessarily surrounded by spaces, so a pattern written this way skips cases such as e. g., or (e. g. xxx.
  • The English period . is not escaped, so it matches any character.
  • The flag re.I does not take effect.
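To make the second problem concrete, here is a small sketch of what an unescaped period does. The sample sentence is made up for illustration and does not come from the crawler.

```python
import re

sample = "the web gets slow"  # hypothetical text containing no abbreviation

# Unescaped: each "." matches any character, so unrelated text is matched.
loose = re.search("e. g.", sample)

# Escaped: "\." matches only a literal period, so nothing is found.
strict = re.search(r"e\. g\.", sample)

print(loose.group() if loose else None)
print(strict)
```

The unescaped pattern finds a match inside "web gets", while the escaped one finds nothing.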

The first two problems are easy to solve. The improved code is as follows:

 def punctuate(s):
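     # ---- rest of the code omitted for now
     # re.I is still passed positionally here, so it still does not act as a flag
     s = re.sub(r'([^a-zA-Z]e\.) (g\.[^a-zA-Z])', r'\g<1>\g<2>', s, re.I)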
     return s

The rule is: e. g. must be preceded and followed by a non-letter character (which includes spaces), and there is a single space between e. and g. That space is deleted, while the non-letter characters on either side are kept (\g<1> stands for the text matched by the first pair of parentheses, \g<2> for the text matched by the second). But the problem with the re.I flag is still unresolved.
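As a quick check of this rule, the sketch below applies the pattern to a few made-up strings (the sample inputs are assumptions for illustration, and the flag is left out on purpose). It also shows one limitation that follows from the pattern itself: because a non-letter character is required in front, an e. g. at the very start of the string is not touched.

```python
import re

pattern = r'([^a-zA-Z]e\.) (g\.[^a-zA-Z])'
repl = r'\g<1>\g<2>'  # keep both groups, drop the space between them

# Hypothetical sample inputs
in_parens = re.sub(pattern, repl, "see (e. g. this)")
with_comma = re.sub(pattern, repl, "fruit, e. g., apples")
at_start = re.sub(pattern, repl, "e. g. first")  # no character before "e."

print(in_parens)
print(with_comma)
print(at_start)
```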

re.sub(pattern, repl, string, count=0, flags=0)

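The fourth positional parameter is count, so the re.I that I pass in fourth position is treated as count rather than as flags. Since re.I has the integer value 2, the call silently means "replace at most 2 matches, case-sensitively". The correct approach is to pass it explicitly by keyword: flags=re.I.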
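The difference can be seen in a minimal sketch like the following. The sample text is invented for the demonstration. Newer Python versions may emit a DeprecationWarning for a positional count, so the warning is silenced here as a precaution.

```python
import re
import warnings

text = "E. G. one, e. g. two, e. g. three, e. g. four"  # hypothetical input
pattern = r"e\. g\."

with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    # re.I lands in the count slot: at most 2 replacements, case-sensitive.
    wrong = re.sub(pattern, "e.g.", text, re.I)

# Keyword form: case-insensitive, no limit on the number of replacements.
right = re.sub(pattern, "e.g.", text, flags=re.I)

print(int(re.I))
print(wrong)
print(right)
```

In the positional call the uppercase E. G. is left alone and the fourth occurrence is never reached, which is exactly the "flag does not take effect" symptom.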

The entire punctuation normalization function also includes other substitutions. The complete code is as follows:

import re

def punctuate(s):
    s = re.sub(r'([,:;?!\.”\)])', r'\g<1> ', s)  # add a space after
    s = re.sub(r'([“\(])', r' \g<1>', s)  # add a space before
    s = re.sub(r'([“\(]) ', r'\g<1>', s)  # remove the space after
    s = re.sub(r' ([,:;?!\.”\)])', r'\g<1>', s)  # remove the space before
    s = re.sub(r'([,\.?!;\)]) ”', r'\g<1>”', s)  # remove the space before a closing quote
    s = re.sub(r'\) ([,:;?!\.”])', r')\g<1>', s)  # remove the space after a closing parenthesis
    s = re.sub(r'(\d)\. (\d)', r'\g<1>.\g<2>', s)  # remove the space after a decimal point
    s = re.sub(r' +', ' ', s)  # collapse multiple spaces into one
    # Latin dotted abbreviations: remove the space after the first period
    s = re.sub(r'([^a-zA-Z]e\.) (g\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]i\.) (e\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]q\.) (v\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]v\.) (s\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]n\.) (b\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'([^a-zA-Z]p\.) (s\.[^a-zA-Z])', r'\g<1>\g<2>', s, flags=re.I)
    s = re.sub(r'\. ,', '.,', s)
    return s
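The six abbreviation lines differ only in their two letters, so one possible variation is to build the patterns in a loop. This is only a sketch of that idea, not part of the original function: the helper name fix_abbreviations and the ABBREVIATIONS list are made up here, and the sample sentence is hypothetical.

```python
import re

# Hypothetical list: the letter pairs mirror the abbreviations handled above.
ABBREVIATIONS = [("e", "g"), ("i", "e"), ("q", "v"),
                 ("v", "s"), ("n", "b"), ("p", "s")]

def fix_abbreviations(s):
    """Remove the space inside dotted Latin abbreviations such as 'e. g.'."""
    for first, second in ABBREVIATIONS:
        pattern = rf'([^a-zA-Z]{first}\.) ({second}\.[^a-zA-Z])'
        # flags must be passed by keyword, otherwise it is taken as count
        s = re.sub(pattern, r'\g<1>\g<2>', s, flags=re.I)
    return s

fixed = fix_abbreviations("Note (P. S. later) and, i. e., now")
print(fixed)
```

Keeping the pairs in one list means a new abbreviation only needs a new entry, and the flags=re.I keyword is written in a single place.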


Origin blog.csdn.net/qdPython/article/details/112572409