Data Structure Mastery of Regular Expressions in Five Minutes (Python)

PAY ATTENTION!

⭐1. re.search() and re.match() do not return the matched text or its position directly; they return a match object (or None when nothing matches). re.finditer() returns an iterator of match objects, while re.findall() returns a list of the matched strings.

⭐2. The span() method of a match object returns the start and end positions of the match as a half-open interval: the start is included and the end is excluded, so (3, 5) means [3, 5).

⭐3. The group() method of a match object returns the matched text.

PAY ATTENTION!

  • 0. Regular expression rule table

  • 1. re.search() ----- search for a match

  • 2. re.match() ----- match from the start

  • 3. re.findall() ----- find all

  • 4. re.finditer() ----- return an iterator

  • 5. re.split() ----- split by a pattern

  • 6. Expression matching rules (quick reference)

  • 7. Expression matching examples (quick understanding)

0. Regular expression rule table

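0) Regex modifiers (optional flags)
Modifier    Description
re.I        Makes matching case-insensitive.
re.L        Makes \w, \W, \b, \B and case-insensitive matching depend on the current locale. Works only with bytes patterns and is rarely needed.
re.M        Multiline matching: ^ and $ also match at the start and end of each line.
re.S        Makes . match every character, including newlines.
re.U        Matches according to the Unicode character set; affects \w, \W, \b, \B. This is already the default for str patterns in Python 3.
re.X        Verbose mode: whitespace and # comments inside the pattern are ignored, so a pattern can be spread over several lines to make it easier to read.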
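
The flags above are easiest to compare side by side on one small string. The following is a minimal sketch; the sample text and the variable names are made up for illustration.

```python
import re

text = "Hello\nworld"

# re.I: ignore case
ignore_case = re.findall(r"hello", text, re.I)          # ['Hello']

# re.M: ^ also matches right after each newline
first_word = re.findall(r"^\w+", text)                   # ['Hello']
each_line_word = re.findall(r"^\w+", text, re.M)         # ['Hello', 'world']

# re.S: "." also matches the newline
without_s = re.search(r"Hello.world", text)              # None
with_s = re.search(r"Hello.world", text, re.S).group()   # 'Hello\nworld'

# re.X: whitespace and "#" comments inside the pattern are ignored
decimal = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
""", re.X)
pi = decimal.search("pi is 3.14").group()                # '3.14'

# Several flags are combined with |
combined = re.findall(r"^WORLD", text, re.I | re.M)      # ['world']

print(ignore_case, first_word, each_line_word, without_s, repr(with_s), pi, combined)
```
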
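1) Special characters
Pattern     Description
.           Matches any single character except "\n". To match any character including "\n", pass the re.S flag or use a pattern like '[\s\S]'. Inside [...] a dot is literal, so '[.\n]' matches only a dot or a newline.
\d          Matches a digit. Equivalent to [0-9] for bytes patterns or with re.ASCII; for str patterns it also matches other Unicode decimal digits.
\D          Matches a non-digit character. Equivalent to [^0-9] under the same conditions.
\s          Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v] in ASCII mode.
\S          Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] in ASCII mode.
\w          Matches any word character, including the underscore. Equivalent to [A-Za-z0-9_] in ASCII mode; for str patterns it also matches Unicode letters and digits.
\W          Matches any non-word character. Equivalent to [^A-Za-z0-9_] in ASCII mode.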
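
A short sketch of the character classes above, using an invented sample string. It also shows why '[.\n]' is not a way to match "any character": inside a character class the dot loses its special meaning.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d", sample)            # ['4', '2', '3']
non_digits = re.findall(r"\D+", "a1b22c")     # ['a', 'b', 'c']
words = re.findall(r"\w+", sample)            # ['Room', '42', 'floor_3']
spaces = re.findall(r"\s", sample)            # [' ', '\t', '\n']

# "." stops at a newline unless re.S is given; [\s\S] works without a flag
dot_default = re.findall(r"a.b", "a\nb")      # []
dot_all = re.findall(r"a.b", "a\nb", re.S)    # ['a\nb']
any_char = re.findall(r"a[\s\S]b", "a\nb")    # ['a\nb']

# Inside [...] the dot is literal: only "." and "\n" are matched here
literal_dot = re.findall(r"[.\n]", "a.b\nc")  # ['.', '\n']

# For str patterns \d also matches non-ASCII digits unless re.ASCII is set
arabic_three = "\u0663"                       # ARABIC-INDIC DIGIT THREE
unicode_digit = re.findall(r"\d", arabic_three)           # one match
ascii_digit = re.findall(r"\d", arabic_three, re.ASCII)   # []

print(digits, non_digits, words, spaces)
```
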
2) Must remember
Pattern         Description
^               Matches the beginning of the string (with re.M, also the beginning of each line).
$               Matches the end of the string or just before a trailing newline (with re.M, also the end of each line).
.               Matches any character except a newline; when the re.DOTALL flag is specified, matches any character including a newline.
[…]             A set of characters, listed individually: [amk] matches 'a', 'm' or 'k'.
[^…]            Characters not in the set: [^abc] matches any character other than a, b, c.
re*             Matches 0 or more repetitions of the preceding expression, greedily.
re+             Matches 1 or more repetitions of the preceding expression, greedily.
re?             Matches 0 or 1 repetition of the preceding expression, greedily. Appending ? to a quantifier (*?, +?, ??, {n,m}?) makes it non-greedy.
re{n}           Matches exactly n repetitions of the preceding expression. For example, o{2} does not match the "o" in "Bob", but matches the two o's in "food".
re{n,}          Matches at least n repetitions of the preceding expression. For example, o{2,} does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
re{n,m}         Matches from n to m repetitions of the preceding expression, greedily.
a|b             Matches a or b.
(re)            Groups the expression and remembers the matched text.
(?imx)          Turns on the i, m, or x flags inline. In Python these flags apply to the whole pattern and must appear at its start.
(?-imx)         Turns off the i, m, or x flags. Python's re module supports this only in the scoped form (?-imx:re).
(?:re)          Like (re), but does not create a capturing group.
(?imx:re)       Turns on the i, m, or x flags for the expression inside the parentheses only.
(?-imx:re)      Turns off the i, m, or x flags for the expression inside the parentheses only.
(?#…)           A comment; its contents are ignored.
(?=re)          Positive lookahead. Succeeds if re matches at the current position and fails otherwise. No characters are consumed: the rest of the pattern is tried from the same position.
(?!re)          Negative lookahead. The opposite of positive lookahead: succeeds when re cannot be matched at the current position in the string.
(?>re)          Atomic group: once re has matched, the engine does not backtrack into it. Supported by Python's re module from version 3.11.
\w              Matches letters, digits and the underscore.
\W              Matches any character that is not a letter, digit or underscore.
\s              Matches any whitespace character; equivalent to [ \t\n\r\f\v] in ASCII mode.
\S              Matches any non-whitespace character.
\d              Matches any digit; equivalent to [0-9] in ASCII mode.
\D              Matches any non-digit character.
\A              Matches only at the start of the string.
\Z              Matches at the end of the string. In Python's re module it matches only at the very end; in Perl-style engines it also matches just before a final newline.
\z              Matches only at the very end of the string in Perl-style engines. Python's re module has traditionally not accepted \z, so use \Z there.
\G              Matches at the position where the previous match ended. Not supported by Python's re module.
\b              Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B              Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.    Match a newline, a tab, and so on.
\1…\9           Matches the content of the n-th group.
\10             Matches the content of group 10 if that group exists. In Python's re module a number is read as an octal character code only when it starts with 0 or is three octal digits long.
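
Several rows of this table are easier to remember with a concrete string. The sketch below covers greedy versus non-greedy quantifiers, {n} and {n,}, capturing versus non-capturing groups, lookahead, word boundaries and backreferences; all sample strings and variable names are invented for illustration.

```python
import re

# Greedy vs non-greedy: ".*" grabs as much as possible, ".*?" as little
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)     # ['<b>one</b><b>two</b>']
lazy = re.findall(r"<b>.*?</b>", html)      # ['<b>one</b>', '<b>two</b>']

# {n} means exactly n, {n,} means at least n
exact = re.findall(r"o{2}", "Bob food")         # ['oo']
at_least = re.findall(r"o{2,}", "Bob foooood")  # ['ooooo']

# (...) captures, (?:...) only groups
date = re.search(r"(\d{4})-(?:\d{2})-(\d{2})", "date: 2022-10-19")
year_day = date.groups()                    # ('2022', '19')

# Lookahead checks what follows without consuming it
before_com = re.findall(r"\w+(?=\.com)", "google.com baidu.cn")  # ['google']
not_px = re.findall(r"\d+(?!\d|px)", "10px 20em")                # ['20']

# \b and \B: positions of "er" at and away from a word boundary
at_boundary = [m.start() for m in re.finditer(r"er\b", "never verb")]   # [3]
not_boundary = [m.start() for m in re.finditer(r"er\B", "never verb")]  # [7]

# \1 repeats whatever group 1 matched: find a doubled word
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group()  # 'is is'

print(greedy, lazy, exact, at_least, year_day, before_com, not_px, doubled)
```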

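1. re.search() ----- search for a match

  • Purpose

    Scan the whole string and return the first successful match.

  • Usage

    re.search(pattern, string, flags=0)

    pattern: the regular expression to match

    string: the string to search

    flags: optional modifiers from section 0

  • Example

    import re
    re.search('go*gle', 'www.google.com')
    # <re.Match object; span=(4, 10), match='google'>
    # (older Python versions show <_sre.SRE_Match object; ...> instead)
    re.search('go*gle', 'www.google.com').span()
    # (4, 10)
    re.search('go*gle', 'www.google.com').span()[0]
    # 4
    re.search('go*gle', 'www.google.com').group()
    # 'google'

  • Notes

    1. The return value is a match object, not the matched text itself; use methods such as span() and group() to get the position or the content.

    2. If search() finds nothing it returns None. None has no span() or group() method, so calling them directly raises an AttributeError. Always check for None before using the result.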
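
The None check from note 2 can be sketched as follows. find_span is a hypothetical helper written for this example, not part of the re module.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    m = re.search(pattern, text)
    if m is None:          # nothing matched: do not call m.span()
        return None
    return m.span()

hit = find_span(r"go*gle", "www.google.com")
miss = find_span(r"go*gle", "www.baidu.com")

print(hit)    # (4, 10)
print(miss)   # None
```

Without the check, the second call would fail with AttributeError: 'NoneType' object has no attribute 'span'.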

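2. re.match() ----- match from the start

  • Purpose

    Try to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

  • Usage

    re.match(pattern, string, flags=0)

    pattern: the regular expression to match

    string: the string to search

    flags: optional modifiers from section 0

  • Example

    import re
    re.match('google.', 'www.google.com')
    # None
    re.match('w.', 'www.google.com')
    # <re.Match object; span=(0, 2), match='ww'>
    re.match('w.', 'www.google.com').span()
    # (0, 2)
    re.match('w.', 'www.google.com').span()[0]
    # 0
    re.match('w.', 'www.google.com').group()
    # 'ww'

  • Notes

    match() can only match starting from the first character; if the pattern does not fit there, it returns None. For this reason it is used less often than search().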
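
To make the difference concrete, here is a small comparison of match(), search() and re.fullmatch() (a related function that requires the whole string to match) on the same URL. This is only a sketch; the variable names are arbitrary.

```python
import re

url = "www.google.com"

# match() is anchored at position 0, search() looks anywhere
match_google = re.match(r"google", url)            # None
search_google = re.search(r"google", url).span()   # (4, 10)

# search() with ^ behaves like match() on a single-line string
anchored_search = re.search(r"^google", url)       # None

# match() only needs a prefix to fit; fullmatch() needs the whole string
prefix = re.match(r"www", url).group()             # 'www'
full_fail = re.fullmatch(r"www", url)              # None
full_ok = re.fullmatch(r"www\.\w+\.com", url).group()  # 'www.google.com'

print(match_google, search_google, anchored_search, prefix, full_fail, full_ok)
```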

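3. re.findall() ----- find all

  • Purpose

    Find every substring matched by the regular expression and return them as a list.

  • Usage

    re.findall(pattern, string, flags=0)
    pattern.findall(string[, pos[, endpos]])

    pattern: the regular expression to match (in the second form, a pattern compiled with re.compile())

    string: the string to search

    pos: position in the string where the search starts, default 0 (optional)

    endpos: position in the string where the search stops, default the length of the string (optional)

  • Example

    import re

    pattern = re.compile(r'\d+')
    text = 'asd123qwe456opi789mnb012'   # avoid the name "str", which shadows the built-in type
    pattern.findall(text)
    # ['123', '456', '789', '012']
    pattern.findall(text, 0, 12)
    # ['123', '456']
    re.findall(r'(\w+)=(\d+)', 'set w=30 and h=40')    # several groups: returns a list of tuples
    # [('w', '30'), ('h', '40')]

  • Notes

    findall() returns the matched substrings themselves, not their positions.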
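
What findall() puts in the list depends on how many capturing groups the pattern has. A brief sketch, reusing the 'set w=30 and h=40' string from the example above (the variable names are made up):

```python
import re

text = "set w=30 and h=40"

# No group: each item is the whole match
no_groups = re.findall(r"\w+=\d+", text)        # ['w=30', 'h=40']

# One group: each item is the content of that group
one_group = re.findall(r"\w+=(\d+)", text)      # ['30', '40']

# Two or more groups: each item is a tuple of the group contents
two_groups = re.findall(r"(\w+)=(\d+)", text)   # [('w', '30'), ('h', '40')]
as_dict = dict(two_groups)                      # {'w': '30', 'h': '40'}

# A non-capturing group does not change the result shape
non_capturing = re.findall(r"(?:\w+)=\d+", text)  # ['w=30', 'h=40']

print(no_groups, one_group, two_groups, as_dict, non_capturing)
```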

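4. re.finditer() ----- return an iterator

  • Purpose

    Find every match of the regular expression in the string, like findall(), but return them as an iterator of match objects.

  • Usage

    re.finditer(pattern, string, flags=0)

  • Example

    import re

    it = re.finditer(r"\d+", "78a32bc43jf3")
    for m in it:            # avoid the name "iter", which shadows the built-in function
        print(m.group())    # prints 78, 32, 43 and 3, one per line

  • Notes

    The return value is an iterator and cannot be printed directly; loop over it (or pass it to list()) to get the match objects.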
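
Because finditer() yields match objects, it gives the positions that findall() leaves out. The sketch below reuses the string from the example above and also shows that the iterator can be consumed only once.

```python
import re

text = "78a32bc43jf3"
matches = re.finditer(r"\d+", text)

# Each item is a match object, so both the text and its span are available
found = [(m.group(), m.span()) for m in matches]
print(found)
# [('78', (0, 2)), ('32', (3, 5)), ('43', (7, 9)), ('3', (11, 12))]

# The iterator is now exhausted; call finditer() again to start over
leftover = list(matches)
print(leftover)   # []
```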

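5. re.split() ----- split by a pattern

  • Purpose

    Split the string at every substring that matches the pattern and return the resulting pieces as a list.

  • Usage

    re.split(pattern, string, maxsplit=0, flags=0)

  • Example

    import re
    re.split(r'\W+', 'baidu, google, sogo.')
    # ['baidu', 'google', 'sogo', '']

  • Notes

    The pieces between the matches are stored in a list, which is returned. If the pattern matches at the very end of the string, the list ends with an empty string, as in the example above.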
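
A few common variations on the example above, as a rough sketch: dropping the empty trailing piece, limiting the number of splits with maxsplit, and keeping the separators by wrapping the pattern in a capturing group.

```python
import re

text = "baidu, google, sogo."

parts = re.split(r"\W+", text)                  # ['baidu', 'google', 'sogo', '']

# Drop empty strings produced by a separator at the start or end
non_empty = [p for p in parts if p]             # ['baidu', 'google', 'sogo']

# maxsplit limits how many splits are made; the rest is left untouched
limited = re.split(r"\W+", text, maxsplit=1)    # ['baidu', 'google, sogo.']

# With a capturing group the separators are kept in the result
with_seps = re.split(r"(\W+)", text)
# ['baidu', ', ', 'google', ', ', 'sogo', '.', '']

print(parts, non_empty, limited, with_seps)
```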

6. Expression matching rules

7. Expression matching examples

Origin blog.csdn.net/un_lock/article/details/127400634