第一章：文本-re:正则表达式-用组解析匹配

1.3.6 用组解析匹配
搜索模式匹配是正则表达式强大能力的基础。为模式提供组可以隔离匹配文本的各个部分，以扩展这些功能来创建一个解析器。可以用小括号包围模式来定义组。

# re_test_patterns.py
import re

def test_patterns(text,patterns):
    """Given source text and a list of patterns,look for
    matches for each pattern within the text and print
    them to stdout.
    """

    # Look for each pattern in the text and print the results.
    for pattern,desc in patterns:
        print("'{}' ({})\n".format(pattern,desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern,text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix,substr))
        print()
    return

if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa',[('ab',"'a' followed by 'b'")])

from re_test_patterns import test_patterns

test_patterns(
    'abbaaabbbbaaaaa',
    [('a(ab)','a followed by literal ab'),
     ('a(a*b*)','a followed by 0-n a and 0-n b'),
     ('a(ab)*','a followed by 0-n ab'),
     ('a(ab)+','a followed by 1-n ab')
        ],
    )

可以把完整的正则表达式转换为一个组，并嵌入到一个更大的表达式中。所有重复修饰符都可以应用到整个组，要求整个组模式重复。
运行结果：
在这里插入图片描述

要访问与模式中各个组匹配的子串，可以使用match对象的groups()方法。

import re

text = 'This is some text -- with punctuation.'

print(text)
print()

patterns = [
    (r'^(\w+)','word at start of string'),
    (r'(\w+)\S*$','word at end,with optional punctuation'),
    (r'(\bt\w+)\W+(\w+)','word starting with t,another word'),
    (r'(\w+t)\b','word ending with t'),
    ]

for pattern,desc in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print("'{}' ({})\n".format(pattern,desc))
    print(' ',match.groups())
    print()

运行结果：

This is some text – with punctuation.

‘^(\w+)’ (word at start of string)

(‘This’,)

‘(\w+)\S*$’ (word at end,with optional punctuation)

(‘punctuation’,)

‘(\bt\w+)\W+(\w+)’ (word starting with t,another word)

(‘text’, ‘with’)

‘(\w+t)\b’ (word ending with t)

(‘text’,)

要访问单个组的匹配，可以使用group()方法。当使用组查找字符串的各个部分时，有些部分尽管与组匹配但在结果中并不需要，此时group()方法就很有用。

import re

text = 'This is sone text -- with punctuation.'

print('Input text    :',text)

# Word starting with 't' then another word
regex = re.compile(r'(\bt\w+)\W+(\w+)')
print('Pattern       :',regex.pattern)

match = regex.search(text)
print('Entire match   :',match.group(0))
print('Word starting with "t"：',match.group(1))
print('Word after "t" word ：',match.group(2))

组0表示与整个表达式匹配的字符串，子组按其左括号在表达式中出现的顺序编号，从1开始。
运行结果：

Input text : This is sone text – with punctuation.
Pattern : (\bt\w+)\W+(\w+)
Entire match : text – with
Word starting with “t”: text
Word after “t” word : with

python扩展了基本组语法，还增加了命名组，通过使用名字来指示组可以更容易地修改模式，而不必同时修改使用了匹配结果的代码。要设置一个组的名字，可以使用语法(?Ppattern).

import re

text = 'This is some text -- with punctuation.'

print(text)
print()

patterns = [
    r'^(?P<first_word>\w+)',
    r'(?P<last_word>\w+)\S*$',
    r'(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)',
    r'(?P<ends_with_t>\w+t)\b',
    ]

for pattern in patterns:
    regex = re.compile(pattern)
    match = regex.search(text)
    print("'{}'".format(pattern))
    print(' ',match.groups())
    print(' ',match.groupdict())
    print()

可以使用groupdict()获取一个字典，它将组名映射为匹配的子串。命名模式也包含在groups()返回的有序序列中。
运行结果：

This is some text – with punctuation.

‘^(?P<first_word>\w+)’
(‘This’,)
{‘first_word’: ‘This’}

‘(?P<last_word>\w+)\S*$’
(‘punctuation’,)
{‘last_word’: ‘punctuation’}

‘(?P<t_word>\bt\w+)\W+(?P<other_word>\w+)’
(‘text’, ‘with’)
{‘t_word’: ‘text’, ‘other_word’: ‘with’}

‘(?P<ends_with_t>\w+t)\b’
(‘text’,)
{‘ends_with_t’: ‘text’}

更新的test_patterns()会显示一个模式匹配的编号组和命名组，使后面的例子更容易理解。

# re_test_patterns_groups.py
import re

def test_patterns(text,patterns):
    """Given source text and a list of patterns,look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results.

    for pattern,desc in patterns:
        print('{!r} ({})\n'.format(pattern,desc))
        print(' {!r}'.format(text))
        for match in re.finditer(pattern,text):
            s = match.start()
            e = match.end()
            prefix = ' ' * (s)
            print(
                ' {}{!r}{} '.format(prefix,text[s:e],
                                    ' ' * (len(text) - e)),
                                    end=' ',
                )
            print(match.groups())
            if match.groupdict():
                print('{}{}'.format(' ' * (len(text) - s),
                      match.groupdict()),
                      )
            print()
        return

组本身是一个完整的正则表达式，可以嵌套在其他组中来创建更复杂的表达式。

from re_test_patterns_groups import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a*)(b*))','a followed by 0-n a and 0-n b')]
    )

在这里，组(a*)匹配一个空串，所以groups()的返回值包含空串作为匹配值。
运行结果：
在这里插入图片描述
组还可以用于指定替代模式。可以使用管道符号（|）指示应当匹配某一个模式。不过，要仔细考虑管道符号的放置位置。下面这个例子中的第一个表达式匹配一个a序列，该序列后面跟着一个完全由某一个字母（a或b）组成的序列。第二个表达式匹配a，其后面跟着一个可能包含a或b的序列。模式很相似，但是得到的匹配完全不同。

from re_test_patterns_groups import test_patterns

test_patterns(
    'abbaabbba',
    [(r'a((a+)|(b+))','a then seq. of a or seq. of b'),
     (r'a((a|b)+)','a then seq. of [ab]')  
     ],
    )

如果一个替代组不匹配，但是整个模式确实匹配，那么groups()的返回值会在序列中本应该出现替代组的位置包含一个None值。
运行结果：
在这里插入图片描述

如果匹配子模式的字符串不必从整个文本中抽取出来，那么在这种情况下，定义包含子模式的组也很有用。这些组被称为非捕获组（non-capturing）.非捕获组可以用来描述重复模式或替代，而不会隔离返回值中字符串的匹配部分。可以使用语法（?:pattern）创建一个非捕获组。

from re_test_patterns_groups import test_patterns

test_patterns(
        'abbaabbba',
        [(r'a((a+)|(b+))','capturing form'),
         (r'a((?:a+)|(?:b+))','noncapturing')],

尽管一个模式的捕获和非捕获形式会匹配得到相同的结果，但它们会返回不同的组，下面来做一个比较。
运行结果：
在这里插入图片描述

第一章：文本-re:正则表达式-用组解析匹配

猜你喜欢