Python 3 Regular Expressions

Regular expression patterns

Pattern strings use a special syntax to represent a regular expression:

Letters and digits stand for themselves: a letter or digit in a pattern matches that same character in the string.

Most letters and digits take on a special meaning when preceded by a backslash (for example, \d or \w).

Many punctuation characters have a special meaning in a pattern and match themselves only when escaped with a backslash.

A backslash itself must be escaped with another backslash.

Because regular expressions usually contain backslashes, it is best to write them as raw strings. A pattern element such as r'\t' (equivalent to '\\t') matches the corresponding special character, here a tab.
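As a minimal sketch of the point about raw strings (the sample strings and variable names are illustrative, not from the original text), the following compares the two spellings and then matches a literal backslash:

```python
import re

# In a normal string literal, "\\t" is the two characters backslash + t;
# the raw literal r"\t" spells the same two characters without doubling.
escaped = "\\t"
raw = r"\t"
same_pattern = (escaped == raw)
print(same_pattern)  # True

# The regex engine reads backslash + t as "match a tab character".
tab_match = re.search(raw, "name\tvalue")
print(tab_match.span())  # (4, 5)

# Matching a literal backslash needs an escaped backslash in the pattern.
backslash_match = re.search(r"\\", r"C:\temp")
print(len(backslash_match.group()))  # 1
```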

The following table lists the special elements of the regular expression pattern syntax. When you pass the optional flags argument together with a pattern, the meaning of some pattern elements changes.

Common operators

Pattern Description
^ Matches the beginning of the string.
$ Matches the end of the string.
. Matches any single character except a newline; when the re.DOTALL flag is specified, it matches any character, including a newline.
[...] Matches any one character from the set: [amk] matches 'a', 'm' or 'k'.
[^...] Matches any one character not in the set: [^abc] matches any character other than a, b and c.
re* Matches 0 or more repetitions of the preceding expression.
re+ Matches 1 or more repetitions of the preceding expression.
re? Matches 0 or 1 repetition of the preceding expression.
re{n} Matches exactly n repetitions of the preceding expression. For example, "o{2}" cannot match the "o" in "Bob", but can match the two o's in "food".
re{n,} Matches n or more repetitions of the preceding expression. For example, "o{2,}" would not match the "o" in "Bob", but would match all o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
re{n,m} Matches from n to m repetitions (inclusive) of the preceding expression, greedily.
a|b Matches a or b.
(re) Groups the enclosed expression and captures the text it matches.
(?imx) Turns on the optional i, m or x flags. In Python the flags apply to the entire pattern, and recent versions require the construct to appear at the start of the pattern.
(?-imx) Turns off the optional i, m or x flags. Python's re module does not accept this standalone form; use the scoped form (?-imx: re) instead.
(?: re) Like (...), but does not create a capturing group.
(?imx: re) Turns on the i, m or x flags inside the parentheses only.
(?-imx: re) Turns off the i, m or x flags inside the parentheses only.
(?#...) A comment; its contents are ignored.
(?= re) Positive lookahead. Succeeds if re matches at the current position, without consuming any of the string; the rest of the pattern then continues matching from that same position.
(?! re) Negative lookahead. The opposite of the positive lookahead: succeeds when re cannot be matched at the current position in the string.
(?> re) Atomic group: once re has matched, the engine does not backtrack into it. Supported by Python's re module from version 3.11.
\w Matches a word character (a letter, digit or underscore). Equivalent to [A-Za-z0-9_] for ASCII text; by default, str patterns in Python 3 also match Unicode word characters.
\W Matches any non-word character.
\s Matches any whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text.
\S Matches any non-whitespace character.
\d Matches any digit, equivalent to [0-9] for ASCII text.
\D Matches any non-digit.
\A Matches the start of the string.
\Z Matches the end of the string. In Python, \Z matches only at the very end of the string; in Perl-style engines it also matches just before a trailing newline.
\z Matches the end of the string. Python versions before 3.14 do not accept \z; use \Z instead.
\G Matches the position where the last match ended. Not supported by Python's re module.
\b Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, 'er\b' can match the 'er' in "never", but not the 'er' in "verb".
\B Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\n, \t, etc. Match a newline, a tab, and so on.
\1...\9 Matches the content of the nth captured group.
\10 Matches the content of group 10 if that group exists. Some engines otherwise read it as an octal character code; Python raises an error instead and treats a number as octal only when it starts with 0 or has three octal digits.
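A few of the rows above can be checked directly. The following is a minimal sketch; the sample strings other than "Bob", "foooood", "never" and "verb" are made up for illustration:

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
anchored = [bool(re.search(r"^foo$", s)) for s in ("foo", "seafood")]
print(anchored)  # [True, False]

# [amk] matches one listed character; [^abc] matches any other character.
in_set = re.findall(r"[amk]", "mask")
not_in_set = re.findall(r"[^abc]", "cabin")
print(in_set, not_in_set)  # ['m', 'a', 'k'] ['i', 'n']

# o{2} needs two consecutive o's; o{2,} takes two or more.
no_double_o = re.search(r"o{2}", "Bob")
run_of_o = re.search(r"o{2,}", "foooood").group()
print(no_double_o, run_of_o)  # None ooooo

# \b is a word boundary: 'er\b' matches in "never" but not in "verb".
boundary = [bool(re.search(r"er\b", s)) for s in ("never", "verb")]
print(boundary)  # [True, False]

# \1 matches the same text that group 1 captured (here, a repeated word).
repeated = re.search(r"\b(\w+) \1\b", "it is is fine").group()
print(repeated)  # is is
```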

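Non-greedy (minimal) quantifiers

Operator Description
*? Matches the preceding expression 0 or more times, as few times as possible.
+? Matches the preceding expression 1 or more times, as few times as possible.
?? Matches the preceding expression 0 or 1 time, preferring 0.
{m,n}? Matches the preceding expression m to n times (inclusive), as few times as possible.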
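The difference between greedy and non-greedy matching is easiest to see side by side. This is a small sketch; the HTML-like sample string is an assumption chosen for illustration:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs as far as it can, so the match extends to the last '>'.
greedy = re.search(r"<.*>", html).group()
# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<.*?>", html).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

# The same idea with a bounded quantifier on a run of digits.
greedy_digits = re.search(r"\d{2,4}", "123456").group()
lazy_digits = re.search(r"\d{2,4}?", "123456").group()
print(greedy_digits, lazy_digits)  # 1234 12
```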

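Capturing groups

Operator Description
(exp) Matches exp and captures the text into an automatically numbered group.
(?P<name>exp) Matches exp and captures the text into a group called name. Python requires the (?P<name>...) spelling; the (?<name>exp) form used by some other engines is not accepted.
(?:exp) Matches exp without capturing the matched text and without assigning a group number.
(?=exp) Matches the position just before exp.
(?<=exp) Matches the position just after exp.
(?!exp) Matches a position that is not followed by exp.
(?<!exp) Matches a position that is not preceded by exp.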
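The group and lookaround forms above can be sketched as follows. The sample strings are invented for illustration, and the named group uses Python's (?P<name>...) spelling:

```python
import re

# Named group: Python spells it (?P<name>...).
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-05")
print(m.group("year"), m.group("month"))  # 2024 05
print(m.groupdict())  # {'year': '2024', 'month': '05'}

# (?:...) groups without capturing, so only (c) receives a group number.
only_c = re.search(r"(?:ab)+(c)", "ababc").groups()
print(only_c)  # ('c',)

# (?=exp): digits that are followed by "px"; the "px" itself is not consumed.
px_sizes = re.findall(r"\d+(?=px)", "10px 20em 30px")
print(px_sizes)  # ['10', '30']

# (?<=exp): digits that come right after a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $42, id 17")
print(prices)  # ['42']

# (?!exp): "foo" not followed by "bar" is first found at index 7.
foo_start = re.search(r"foo(?!bar)", "foobar foobaz").start()
print(foo_start)  # 7

# (?<!exp): whole numbers that are not preceded by a dollar sign.
non_prices = re.findall(r"(?<!\$)\b\d+", "cost $42, id 17")
print(non_prices)  # ['17']
```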

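Functions

The re.match function

re.match(pattern, string, flags=0)

re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

On success re.match returns a Match object; otherwise it returns None.

Parameter Description
pattern The regular expression to match.
string The string to match against.
flags Flags that control how the regular expression is matched, such as case sensitivity or multi-line matching.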
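A short sketch of re.match with each of the three parameters in play; the URL-like sample string is an assumption for illustration:

```python
import re

url = "www.example.com"  # illustrative sample string

# The pattern matches at index 0, so a Match object comes back.
at_start = re.match(r"www", url)
print(at_start.span())  # (0, 3)

# "com" occurs later in the string but not at the start, so the result is None.
not_at_start = re.match(r"com", url)
print(not_at_start)  # None

# flags changes how the pattern is applied; re.I ignores case.
ignore_case = re.match(r"WWW", url, flags=re.I)
print(ignore_case.group())  # www
```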

The re.search method

re.search(pattern, string, flags=0)

re.search scans the whole string and returns the first successful match. On success it returns a Match object; otherwise it returns None.
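Unlike re.match, re.search can find a match anywhere in the string. A minimal sketch, with sample strings invented for illustration:

```python
import re

url = "www.example.com"

# The match does not have to start at index 0.
found = re.search(r"com", url)
print(found.span())  # (12, 15)

# Only the first match is returned, even if later ones exist.
first_only = re.search(r"\d+", "a1b22c333")
print(first_only.group())  # 1

# With no match the result is None, so check before calling group().
missing = re.search(r"\d+", "no digits here")
if missing is None:
    print("no match")
```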

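The re.sub method

re.sub(pattern, repl, string, count=0, flags=0)

Replaces the parts of a string that match the pattern.

  • pattern : the regular expression pattern string.
  • repl : the replacement string; it can also be a function.
  • string : the original string to search and replace in.
  • count : the maximum number of replacements to make; the default, 0, replaces every match.
#!/usr/bin/python3
import re
 
phone = "2004-959-559 # This is a phone number"
 
# Remove the trailing comment
num = re.sub(r'#.*$', "", phone)
print("Phone number : ", num)
Phone number :  2004-959-559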

Passing a function as repl

#!/usr/bin/python3
 
import re
 
# Multiply each matched number by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)
 
s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))
A46G8HFD1134
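The count parameter and group references in a replacement string are worth a quick look as well. This is a sketch only; the second sample string is made up:

```python
import re

s = "2004-959-559"

# count=1 replaces only the first match; the default 0 replaces all of them.
first_dash_removed = re.sub(r"-", "", s, count=1)
all_dashes_removed = re.sub(r"-", "", s)
print(first_dash_removed)  # 2004959-559
print(all_dashes_removed)  # 2004959559

# In a replacement string, \1 and \2 insert the text of the captured groups.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "12-34")
print(swapped)  # 34-12
```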

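The re.compile method

re.compile(pattern[, flags])

The compile function compiles a regular expression into a Pattern object, whose methods, such as match() and search(), can then be called.

  • pattern : the regular expression, as a string
  • flags : optional; selects the matching mode, such as ignoring case or multi-line mode. The available values are:
    • re.I ignores case
    • re.L makes \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3, for bytes patterns only)
    • re.M multi-line mode
    • re.S makes '.' match any character, including a newline (by default '.' does not match a newline)
    • re.U makes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database (the default for str patterns in Python 3)
    • re.X for readability, ignores whitespace and comments after '#' in the pattern
>>> import re
>>> pattern = re.compile(r'\d+')                    # matches one or more digits
>>> m = pattern.match('one12twothree34four')        # tries at the start of the string: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # starts matching at 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # starts matching at '1': match
>>> print(m)                                        # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted
3
>>> m.end(0)     # the 0 can be omitted
5
>>> m.span(0)    # the 0 can be omitted
(3, 5)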
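The flags listed for re.compile can also be passed to the module-level functions. The following sketch shows the effect of the most common ones; the two-line sample text and the `number` pattern are assumptions for illustration:

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching.
any_case = re.findall(r"line", text, re.I)
print(any_case)  # ['line', 'LINE']

# re.M: ^ and $ also match at the start and end of each line.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, re.M)
print(first_words_default, first_words_multiline)  # ['First'] ['First', 'second']

# re.S: '.' also matches the newline.
without_dotall = re.search(r"First.*second", text)
with_dotall = re.search(r"First.*second", text, re.S)
print(without_dotall, with_dotall is not None)  # None True

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
decimal = number.search("pi is 3.14").group()
print(decimal)  # 3.14

# Flags are combined with the | operator.
combined = re.findall(r"^[a-z]+", text, re.I | re.M)
print(combined)  # ['First', 'second']
```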

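In the example above, a successful match returns a Match object, where:

  • group([group1, …]) returns the string matched by one or more groups; to get the whole matched substring, call group() or group(0);
  • start([group]) returns the index in the whole string where the group's match starts (the index of its first character); the argument defaults to 0;
  • end([group]) returns the index in the whole string where the group's match ends (the index of its last character + 1); the argument defaults to 0;
  • span([group]) returns (start(group), end(group)).
>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                              # match succeeded; a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # the whole matched substring
'Hello World'
>>> m.span(0)                             # the indices of the whole matched substring
(0, 11)
>>> m.group(1)                            # the substring matched by the first group
'Hello'
>>> m.span(1)                             # the indices of the first group's match
(0, 5)
>>> m.group(2)                            # the substring matched by the second group
'World'
>>> m.span(2)                             # the indices of the second group's match
(6, 11)
>>> m.groups()                            # same as (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)                            # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group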

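The re.findall method

findall(string[, pos[, endpos]])

Finds every substring of the string that the regular expression matches and returns them as a list; if nothing matches, an empty list is returned. The signature above is the method of a compiled pattern object; the module-level form is re.findall(pattern, string, flags=0).

  • string the string to search.
  • pos optional; the index in the string at which to start, 0 by default.
  • endpos optional; the index in the string at which to stop, the length of the string by default.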
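Both forms of findall can be sketched as follows; the sample strings are illustrative assumptions:

```python
import re

# Module-level form: re.findall(pattern, string, flags=0).
numbers = re.findall(r"\d+", "runoob 123 google 456")
print(numbers)  # ['123', '456']

# Pattern method form with pos and endpos: only string[0:10] is searched.
pattern = re.compile(r"\d+")
limited = pattern.findall("run88oob123google456", 0, 10)
print(limited)  # ['88', '12']

# With several groups in the pattern, each result is a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", "a=1 b=2")
print(pairs)  # [('a', '1'), ('b', '2')]

# No match gives an empty list rather than None.
nothing = re.findall(r"\d+", "no digits")
print(nothing)  # []
```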

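The re.finditer method

re.finditer(pattern, string, flags=0)

Like findall, it finds every substring of the string that the regular expression matches, but returns the matches as an iterator of Match objects.

import re
 
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())
12
32
43
3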

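The re.split method

re.split(pattern, string, maxsplit=0, flags=0)

The split method splits the string at each substring the pattern matches and returns the pieces as a list. Its parameters are:

pattern The regular expression to match.
string The string to split.
maxsplit The maximum number of splits; maxsplit=1 splits once. The default, 0, means no limit.
flags Flags that control how the regular expression is matched, such as case sensitivity or multi-line matching.
>>> import re
>>> re.split(r'\W+', 'runoob, runoob, runoob.')
['runoob', 'runoob', 'runoob', '']
>>> re.split(r'(\W+)', ' runoob, runoob, runoob.')
['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']
>>> re.split(r'\W+', ' runoob, runoob, runoob.', maxsplit=1)
['', 'runoob, runoob, runoob.']
 
>>> re.split(r'a+', 'hello world')   # the pattern never matches, so the string is returned unsplit
['hello world']
>>> # Note: a pattern such as 'a*' can match the empty string, and Python 3.7+ also splits on empty matches.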

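Regular expression objects

re.Pattern

re.compile() returns a Pattern object (called RegexObject in the Python 2 documentation).

re.Match

A Match object (called MatchObject in the Python 2 documentation) provides:

  • group() returns the string matched by the RE
  • start() returns the position where the match starts
  • end() returns the position where the match ends
  • span() returns a tuple containing the (start, end) positions of the match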
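In current Python 3 releases these two object types are exposed as re.Pattern and re.Match. A minimal sketch, assuming Python 3.8 or later and an invented sample string:

```python
import re

# re.compile() returns a compiled pattern object.
pattern = re.compile(r"[a-z]+")
is_pattern = isinstance(pattern, re.Pattern)
print(is_pattern)  # True

# Its search() method returns a match object (or None).
m = pattern.search("123 abc 456")
is_match = isinstance(m, re.Match)
print(is_match)  # True

# The match object reports what was matched and where.
print(m.group(), m.start(), m.end(), m.span())  # abc 4 7 (4, 7)
```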
