A detailed guide to the common functions of Python's re regular expression module

        This article introduces Python's re module for extracting text with regular expressions. It covers the main functions re.findall(), re.sub(), re.match(), and re.search(), along with the common pattern symbols, grouping, and greedy vs. non-greedy matching.

Table of contents

1.  re.findall(pattern, string, flags=0)

(1) Extract any character []

(2) Extract numbers/non-numbers

(3) Extract Chinese, underscore, numbers, English

(4) Extract blank characters

(5) Limit the number of characters

(6) Continuous matching

(7) Boundary matching

(8) Ignore case

(9) Match any character

(10) Grouping

(11) Greedy and non-greedy

2.  re.sub(pattern, repl, string, count=0, flags=0)

3.  re.match(pattern, string, flags=0)

4.  re.search(pattern, string, flags=0)


        To test the regular expressions, first generate some fake data to work with. If the code below is unfamiliar, see my article on faker.

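import faker  # third-party package (pip install faker)

f = faker.Faker('zh_CN')


def data_line():
    """Return one fake record (name, phone, job, address, postcode) as a string."""
    name = f.name()
    phone = f.phone_number()
    job = f.job()
    # This code assumes the address is the street address, a space, then the postcode
    address_yb = f.address()
    address = address_yb.split(' ')[0]
    bm = address_yb.split(' ')[1]
    data = f'姓名:{name} 电话:{phone} 工作:{job} 地址:{address} 邮政编码:{bm} '
    return data


# Concatenate five records into one test string
data = ''
for i in range(5):
    data = data + data_line()

print(data)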
Figure 1 Fake data generated with faker

        There is no need to memorize the regular expression symbols specially, although they are easy to forget; with regular use you will remember them naturally. For convenience, symbols with opposite or similar meanings are listed together, separated by commas.

Symbols (related ones grouped together): meaning
[],[^]: [] is a character class and matches any one character listed inside it. [^] is the opposite: it matches any one character that is not listed.
[0-9a-zA-Z],[^0-9a-zA-Z]: Any one digit or English letter. The latter is the opposite: any one character that is not a digit or English letter.
\d,\D: \d matches a digit (equivalent to [0-9] for ASCII text). \D matches a non-digit (equivalent to [^0-9] for ASCII text).
\w,\W: \w matches word characters: Chinese characters, underscores, digits, and English letters (with the re.ASCII flag it is equivalent to [0-9a-zA-Z_]). \W is the opposite of \w: special characters such as $, &, space, \n, \t, etc.
\s,\S: \s matches whitespace characters (such as space, \t, \n, \r, \f, \v). \S is the opposite: any non-whitespace character.
{min,max},{num}: {min,max} matches between min and max consecutive repetitions of the preceding element. {num} matches exactly num consecutive repetitions.
[]+,\d+: Both are continuous matches. The former matches a run of characters from the [] class; the latter matches a run of digits.
'excel*','excel+','excel?': * means the preceding character l matches 0 or more times. + means the preceding character l matches 1 or more times. ? means the preceding character l matches 0 or 1 times.
^,$: ^ anchors the match at the beginning of the string. $ anchors the match at the end. (Unlike the ^ in [^], which is written inside the brackets, these are written at the very start or very end of the pattern.)
re.I: Ignore case (re.IGNORECASE).
.,.{num},.*,.+: . matches any single character except \n (each character is taken separately). .{num} matches exactly num consecutive characters. .* matches 0 or more characters and .+ matches 1 or more characters, each taken as a whole.
(),?: () defines a group. A ? placed directly after a quantifier (as in .*?) switches it to non-greedy mode.

        Words alone can't make this clear, so let's practice with the re.findall() function.

1.  re.findall(pattern, string, flags=0)

      Finds all non-overlapping matches of pattern in string and returns them as a list.

        pattern: the regular expression to match

        string: the string to extract from

        flags: controls how the regular expression is matched, such as case-insensitive matching, multi-line matching, etc.

(1) Extract any character []

import re
a = re.findall('[0-9]',data)
b = re.findall('[0-9]+',data)
print('Output a:\n',a)
print('Output b:\n',b)
Figure 2 [0-9] and [0-9]+

        Here we use re.findall() to work through the table above: first the pattern syntax, then the other functions. (A reminder for readers who scroll quickly: data was defined above.)

        [] matches any single character listed inside the brackets. [0-9] matches any one character from 0 to 9 (inclusive), and each matched character becomes a separate element of the list, as in output a. Adding + after the [] makes the match continuous: a run of consecutive digits is returned as a single element, as in output b.
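        Because data is random, the difference is easier to check on a fixed string. A minimal sketch, assuming a made-up sample in the same layout as data (the phone number and postcode are invented for illustration):

```python
import re

# Hypothetical fixed sample in the same "label:value" layout as data above
sample = '电话:13812345678 邮政编码:510000'

singles = re.findall('[0-9]', sample)   # one list element per digit
runs = re.findall('[0-9]+', sample)     # one list element per run of digits

print(len(singles))  # 17
print(runs)          # ['13812345678', '510000']
```

        The 11 + 6 digits come back as 17 separate elements without +, and as two whole numbers with it.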

        This is not limited to 0-9. Anything you want to extract can be written directly inside [], for example 0-9, A-Z, a-z, and the character 省:

a = re.findall('[0-9A-Za-z省]',data)
print(a)
Figure 3 Extract 0-9, A-Z, a-z, and the character 省

        In this way, every character that is 0-9, A-Z, a-z, or 省 is extracted one by one. For continuous runs, add + after the [] in the same way.

        To extract, one by one, every character other than 0-9, A-Z, a-z, and 省, add ^ at the start inside the [], such as:

a = re.findall('[^0-9A-Za-z省]',data)
print(a)
Figure 4 Extract everything except 0-9, A-Z, a-z, and 省

        For continuous matching, again add + at the end, as with b in Figure 2.

(2) Extract numbers/non-numbers

        \d extracts digits and \D extracts non-digits.

import re
a = re.findall(r'\d',data)    # raw strings keep \d from being read as a string escape
b = re.findall(r'\d+',data)
c = re.findall(r'\D+',data)
print('Output a:\n',a)
print('Output b:\n',b)
print('Output c:\n',c)
Figure 5 \d and \D

         \d extracts digits one at a time, as in a. With + it extracts runs of consecutive digits, as in b, and \D+ extracts runs of non-digits, as in c. For now just note that + means continuous matching; it is covered in detail later.

(3) Extract Chinese, underscore, numbers, English

        \w extracts word characters: Chinese characters, underscores, digits, and English letters. Uppercase \W is the opposite.

por = '十多年了,I still love you li_wei.$#% *&'
a = re.findall(r'\w+',por)
b = re.findall(r'\W+',por)
print(a)
print(b)
Figure 6 \w and \W
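
        Whether \w includes Chinese characters depends on the flags. A small sketch of the difference, using the same por string: for str patterns Python's \w is Unicode-aware by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_].

```python
import re

por = '十多年了,I still love you li_wei.$#% *&'

unicode_words = re.findall(r'\w+', por)                 # default: Unicode-aware \w
ascii_words = re.findall(r'\w+', por, flags=re.ASCII)   # \w limited to [a-zA-Z0-9_]

print(unicode_words)  # ['十多年了', 'I', 'still', 'love', 'you', 'li_wei']
print(ascii_words)    # ['I', 'still', 'love', 'you', 'li_wei']
```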

(4) Extract blank characters

        \s extracts all whitespace characters, including \t, \n, \r, \f, \v, and spaces. Uppercase \S is the opposite.

por = '十多年了,I still love you li_wei.$#% *& \n \r \t'
a = re.findall(r'\s+',por)
b = re.findall(r'\S+',por)
print(a)
print(b)
Figure 7 Blank characters

        Blank (whitespace) characters are those that appear as empty space when printed, such as \t, \n, \r, \f, \v, and spaces. Print por to see whether they really show up as blank:

por = '十多年了,I still love you li_wei.$#% *& \n \r \t'
print(por)
Figure 8 Output statements containing blank characters

        As the figure shows, the \n, \r, \t, and spaces in por are all displayed as blanks, and \s is meant for extracting exactly these whitespace characters.

(5) Limit the number of characters

        {min,max} limits how many times the preceding expression may repeat, from min to max. {num} requires exactly num repetitions.

por = '湖&南省&潮州县&静安韦街&座'
a = re.findall(r'\w{2,3}',por)
print(a)
Figure 9 Limited quantity

         \w matches Chinese characters, underscores, digits, and English letters, and {2,3} requires 2 to 3 of them in a row. In por, "湖" is followed by "&". Since "&" is not a \w character, the run ends at "湖"; one character is too short for {2,3}, so it is not extracted. "南省" has two characters and "潮州县" has three, so both are extracted. "静安韦街" qualifies but has four characters, more than the limit, so only the first three, "静安韦", are taken. The leftover "街" and the final "座" are single characters and are skipped.
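
        The walkthrough above can be checked directly, and comparing it with an exact count {2} shows how the limit changes where each match ends. A short sketch on the same por string:

```python
import re

por = '湖&南省&潮州县&静安韦街&座'

two_to_three = re.findall(r'\w{2,3}', por)
exactly_two = re.findall(r'\w{2}', por)

print(two_to_three)  # ['南省', '潮州县', '静安韦']
print(exactly_two)   # ['南省', '潮州', '静安', '韦街']
```

        With {2}, "静安韦街" splits into two matches of two characters each, while the "县" left over from "潮州县" is dropped because it stands alone.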

(6) Continuous matching

        + means the preceding element matches 1 or more times. * means the preceding element matches 0 or more times. ? means the preceding element matches 0 or 1 times. (The example makes this clear.)

por = 'wor word wordd worddd'
a = re.findall('word+',por)
b = re.findall('word*',por)
c = re.findall('word?',por)
print(a)
print(b)
print(c)
Figure 10 Continuous matching

        + means the preceding element matches 1 or more times. In a, the d before + matches one or more times, so word, wordd, worddd, wordddd, and so on can all be extracted.

        * means the preceding element matches 0 or more times. In b, the d before * may be absent entirely, so wor is extracted as well as word, wordd, worddd, wordddd, and so on.

        ? means the preceding element matches 0 or 1 times. In c, the d before ? matches at most once, so only wor and word can be extracted. For wordd and worddd only the leading word is extracted, and the extra d characters are left out.
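
        The three symbols are shorthand for the {min,max} form from (5): + is {1,}, * is {0,}, and ? is {0,1}. A brief sketch that prints each result and confirms the equivalence on the same por string:

```python
import re

por = 'wor word wordd worddd'

# Each shorthand quantifier paired with its equivalent {min,max} form
pairs = [('word+', 'word{1,}'), ('word*', 'word{0,}'), ('word?', 'word{0,1}')]

results = {}
for short, braces in pairs:
    results[short] = re.findall(short, por)
    # Both spellings return the same list
    assert results[short] == re.findall(braces, por)
    print(short, results[short])

# word+ ['word', 'wordd', 'worddd']
# word* ['wor', 'word', 'wordd', 'worddd']
# word? ['wor', 'word', 'word', 'word']
```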

(7) Boundary matching

        ^ anchors the match at the beginning and $ anchors it at the end. (This is easier to show than to explain, so go straight to the example; the numbers are made up.)

edge_1 = '16915158890abcdef'
edge_2 = 'abcdef16915158890'

a = re.findall(r'^\d{11}',edge_1)
b = re.findall(r'^\d{11}',edge_2)
print(a)
print(b)
Figure 11 Boundary matching: beginning

         As shown in Figure 11, ^\d{11} matches 11 consecutive digits starting at the very beginning of the string. In a, edge_1 begins with 11 digits, so they are extracted. In b, edge_2 begins with the letter a, which does not match \d, so the match fails and an empty list is returned.

        The same is true for $, specifying the end.

edge_1 = '16915158890abcdef'
edge_2 = 'abcdef16915158890'

a = re.findall(r'\d{11}$',edge_1)
b = re.findall(r'\d{11}$',edge_2)
print(a)
print(b)
Figure 11 Boundary matching: end

         As shown in the figure, $ anchors the match at the end. In a, edge_1 does not end with 11 digits, so nothing is extracted. In b, edge_2 ends with 11 digits, so they are extracted.

        Remember: ^ marks the beginning and $ marks the end, not the other way around. Also distinguish this ^ from the one in point (1): ^ as the first character inside [] negates the character class, while ^ at the start of the whole pattern is a boundary match.
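
        The two roles of ^ can appear in the same pattern, and the multi-line flag mentioned under flags changes what "beginning" means. A small sketch, assuming a made-up two-line string: with re.M (re.MULTILINE), ^ also matches right after each newline.

```python
import re

lines = '16915158890abc\n13800000000xyz'   # hypothetical two-line input

start_only = re.findall(r'^\d{11}', lines)               # only the start of the string
each_line = re.findall(r'^\d{11}', lines, flags=re.M)    # the start of every line

# First ^ is the boundary; the ^ inside [] negates the class
non_digits = re.findall(r'^[^0-9]+', 'abcdef16915158890')

print(start_only)  # ['16915158890']
print(each_line)   # ['16915158890', '13800000000']
print(non_digits)  # ['abcdef']
```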

(8) Ignore case

        flags=re.I (short for re.IGNORECASE) ignores case. It is passed as an argument in the function call.

fi = 'excel Excel excEL word EXCEL WORD'
a = re.findall('excel',fi,flags=re.I)
print(a)
Figure 12 ignore case

(9) Match any character

        In regular expressions, . matches any character except the newline character \n, so it acts as a wildcard.

fi = '今天我吃了鱼呀,\n今天我吃了鸡肉呀,\n今天我吃了鸡扒呀,\n今天我什么都没有吃'
a = re.findall('今天我吃了.*呀',fi)
print(a)
Figure 13 Wildcard

         . here matches any character except the newline \n. Combined with the continuous-matching symbol *, .* matches 0 or more characters. In this example the pattern has "今天我吃了" in front and "呀" at the end, with any number of characters allowed in between, so every sentence of the form "今天我吃了...呀" is extracted.

        You may wonder how to extract only the part in the middle (the ...). That requires grouping; see (10).

(10) Grouping

        In regular expressions, () defines a group. When the pattern contains one group, re.findall() returns only the content matched inside the ().

fi = '今天我吃了鱼呀,\n今天我吃了鸡肉呀,\n今天我吃了鸡扒呀,\n今天我什么都没有吃'
a = re.findall('今天我吃了(.*)呀',fi)
print(a)
Figure 14 Grouping

         Just put () around .*. (Note that the original sentence in this example contains the newline character \n. Without the newlines, grouping runs into another problem.)

        If the original sentence contains no newlines, let's see what problem arises in grouping:

Figure 15 Grouping problem

         Here only one result is extracted! Everything is run together, which is not what we want (we want only the foods, as in Figure 14). The cause is that .* cannot match \n. In the sentences of Figure 14, which contain \n, each match stops at the \n and yields one result, then the search resumes after the \n, which gives the expected output. In a sentence without \n, as in Figure 15, .* keeps matching as far as it can, up to the last "呀", so the results stick together. This problem comes up often in web scraping, so it is important to master the non-greedy mode.
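
        The code behind Figure 15 is not included in the text. A likely reconstruction, assuming the same sentence as Figure 14 with the \n characters removed and the same grouped pattern:

```python
import re

# Assumed input: the Figure 14 sentence without the newline characters
fi = '今天我吃了鱼呀,今天我吃了鸡肉呀,今天我吃了鸡扒呀,今天我什么都没有吃'

a = re.findall('今天我吃了(.*)呀', fi)
print(len(a))  # 1
print(a)       # one long string running from '鱼' through '鸡扒'
```

        The greedy .* runs to the last "呀" in the string, so the single result swallows the two sentences in between.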

(11) Greedy and non-greedy

        The behavior in Figure 15 is greedy matching. To switch to non-greedy mode, add a ? right after the quantifier (here, after .*).

fi = '今天我吃了鱼呀,今天我吃了鸡肉呀,今天我吃了鸡扒呀,今天我什么都没有吃'
a = re.findall('今天我吃了(.*?)呀',fi)
print(a)
Figure 16 Non-greedy mode

        In non-greedy mode, the match stops at the first "呀" it reaches and yields one result, then the search continues for the next match until the whole string has been scanned. This solves the grouping problem in Figure 15.

        Note that the meaning of ? depends on what it follows, not on whether it is inside a () group. Directly after a quantifier (*, +, ?, or {min,max}) it switches that quantifier to non-greedy mode; after any other element it is itself the quantifier meaning 0 or 1 times.
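
        Both roles of ? can be seen side by side. A minimal sketch with shortened versions of the strings used above: the first two calls involve no group at all, and the last one puts the 0-or-1 ? inside a group.

```python
import re

fi = '今天我吃了鱼呀,今天我吃了鸡肉呀'

lazy = re.findall('吃了.*?呀', fi)    # ? after * : non-greedy, stops at the first 呀
greedy = re.findall('吃了.*呀', fi)   # no ? : greedy, runs to the last 呀

# ? after an ordinary character is the 0-or-1 quantifier, even inside ()
optional = re.findall('(words?)', 'word words')

print(lazy)         # ['吃了鱼呀', '吃了鸡肉呀']
print(len(greedy))  # 1
print(optional)     # ['word', 'words']
```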

        That covers the basic pattern syntax, but we are not finished yet. So far we have only studied the pattern argument of re.findall(pattern, string, flags=0), and re.sub(), re.match(), and re.search() remain. Since those functions take the same arguments we have just learned, they are quick to pick up. Let's continue.

2.  re.sub(pattern, repl, string, count=0, flags=0)

      Regular expression replacement; repl may also be a custom function.

        pattern: the regular expression whose matches will be replaced

        repl: the replacement string, or a function that is called for each match and returns the replacement

        string: the string to operate on

        count: the maximum number of replacements; the default 0 replaces all matches

        flags: controls how the regular expression is matched, such as case-insensitive matching, multi-line matching, etc.

import re
st = 'word word word excel ppt office Word WORD '
a = re.sub('word','1111',st)
b = re.sub('word','1111',st,count=1)
c = re.sub('word','1111',st,flags=re.I)
print('Replace every word:')
print(a)
print('\nReplace word once:')
print(b)
print('\nReplace every word, ignoring case:')
print(c)
Figure 17 Regular replacement re.sub()

         word is replaced with 1111. In the figure, a, b, and c show replacing all occurrences, replacing only the first occurrence, and replacing all occurrences while ignoring case.

        What makes regular expression replacement more powerful than str.replace() is that it can call a custom function for each match, which str.replace() cannot do. An example of this advanced replacement (.group() extracts the matched text and is explained later):

ss = '孙杨 98 李红 65 郭靖 85 扬程 43 彭凯文 70'

def rep_judge(x):
    # x is the match object for one run of digits; convert its text to a number
    score = int(x.group())
    if score >= 90:
        return '优秀'
    elif score >= 80:
        return '良好'
    elif score >= 60:
        return '及格'
    else:
        return '不及格'

a = re.sub(r'\d+',rep_judge,ss)
print(a)
Figure 18 Regular advanced replacement

         int() converts the matched digits, which are a string, into a number so that it can be compared with the thresholds. Advanced replacement can do much more than this, which I won't list one by one here.
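
        As one more illustration of what a replacement function can do, it can also read the individual groups of each match. A hedged sketch, assuming a made-up phone number and a hypothetical helper named mask_phone; the second call gets the same result with the backreferences \1 and \2 in a replacement string.

```python
import re

text = '电话:13812345678'   # made-up number for illustration


def mask_phone(m):
    """Hypothetical helper: keep the first 3 and last 4 digits, hide the middle 4."""
    return m.group(1) + '****' + m.group(2)


by_function = re.sub(r'(\d{3})\d{4}(\d{4})', mask_phone, text)
by_backref = re.sub(r'(\d{3})\d{4}(\d{4})', r'\1****\2', text)

print(by_function)  # 电话:138****5678
print(by_backref)   # 电话:138****5678
```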

3.  re.match(pattern, string, flags=0)

      Tries to match pattern at the beginning of string. Returns a match object if it succeeds and None if it does not (if the beginning does not match, the result is None right away, whatever follows).

        pattern: the regular expression to match

        string: the string to extract from

        flags: controls how the regular expression is matched, such as case-insensitive matching, multi-line matching, etc.

ss_1 = 'A83C72D1D8E67'
ss_2 = '83C72D1D8E67'
a = re.match(r'\d',ss_1)
b = re.match(r'\d',ss_2)
print(a)
print(b)
Figure 19 re.match

         a tries to match \d, a digit, at the beginning of ss_1. Since ss_1 does not start with a digit, the match fails and None is returned. b tries to match \d at the beginning of ss_2; the match succeeds and a match object is returned. If the beginning does not satisfy the pattern, None is returned no matter what comes later, so re.match() is the tool for checking the beginning of a string!

print(b.group())
Figure 20

         Since b matched successfully, .group() extracts the matched text from the object, as shown in Figure 20. (a is None, so calling a.group() would raise an AttributeError.) A number can also be passed to .group() to select which group to extract. Since re.match() here matches only a single element, .group() with no argument is enough, so the argument is not covered here and is discussed under re.search().
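
        Because a failed match returns None, it is safer to test the result before calling .group(). A minimal sketch of that check, using the two strings from Figure 19:

```python
import re

results = {}
for s in ['A83C72D1D8E67', '83C72D1D8E67']:
    m = re.match(r'\d', s)
    # m is None when the string does not start with a digit
    results[s] = m.group() if m is not None else None
    print(s, '->', results[s])

# A83C72D1D8E67 -> None
# 83C72D1D8E67 -> 8
```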

4.  re.search(pattern, string, flags=0)

      Scans through string and returns a match object for the first location where pattern matches, or None if there is no match anywhere. (Unlike re.match(), it does not matter if the beginning does not match: the search moves on through the string. Only the first match is returned, with its groups packed into the match object.)

        The parameters are the same as for re.match() and are not repeated here.

ss = '123abc¥456'
a = re.search(r'([0-9]*)([a-z]*)(\W)([0-9]*)',ss)
print(a.groups())                     # all groups, as a tuple
print(a.group())                      # the whole match as one string (no argument, or 0)
print(a.group(1))
print(a.group(2))
print(a.group(3))
print(a.group(4))
Figure 21 re.search

        The pattern splits the string into four groups (digits, a-z, \W, digits), which are then extracted one by one. The difference from Figure 14 is that re.search() returns a match object whereas re.findall() returns a list, and the contents of the groups must be extracted from the object with .group().

        For the numbers passed to .group(), refer to Figure 21. .groups() extracts all groups and returns them as a tuple; .group(0) or .group() extracts the entire match as one string; .group(1) extracts the first group, and so on.

        Note that if the number passed exceeds the number of groups, such as .group(5) here, the error "no such group" is raised.
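
        To round off the comparison, a short sketch on the string from Figure 19: re.search() stops at the first match, re.findall() collects them all, and asking for a group that does not exist raises IndexError.

```python
import re

ss = 'A83C72D1D8E67'

first = re.search(r'\d+', ss)        # first match only, even though more follow
every = re.findall(r'\d+', ss)       # all matches, as a list of strings

print(first.group(), first.span())   # 83 (1, 3)
print(every)                         # ['83', '72', '1', '8', '67']

try:
    first.group(1)                   # this pattern defines no group 1
    error = None
except IndexError as err:
    error = str(err)
print(error)                         # no such group
```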

Origin blog.csdn.net/m0_71559726/article/details/130178250