Chapter 1: Text - re: Regular Expressions - Pattern Syntax (2)

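1.3.4.2 Character Sets
A character set is a group of characters, any one of which can match at the current position in the pattern. For example, [ab] matches either a or b.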

# re_test_patterns.py
import re

def test_patterns(text, patterns):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """

    # Look for each pattern in the text and print the results.
    for pattern, desc in patterns:
        print("'{}' ({})\n".format(pattern, desc))
        print(" '{}'".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashes = text[:s].count('\\')
            prefix = '.' * (s + n_backslashes)
            print(" {}'{}'".format(prefix, substr))
        print()

if __name__ == '__main__':
    test_patterns('abbaaabbbbaaaaa', [('ab', "'a' followed by 'b'")])

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('[ab]','either a or b'),
     ('a[ab]+','a followed by 1 or more a or b'),
     ('a[ab]+?','a followed by 1 or more a or b,not greedy')
        ],
    )

The greedy form of the expression (a[ab]+) consumes the entire string, because the first letter is a and every subsequent character is either a or b.
Output:

'[ab]' (either a or b)

 'abbaabbba'
 'a'
 .'b'
 ..'b'
 ...'a'
 ....'a'
 .....'b'
 ......'b'
 .......'b'
 ........'a'

'a[ab]+' (a followed by 1 or more a or b)

 'abbaabbba'
 'abbaabbba'

'a[ab]+?' (a followed by 1 or more a or b,not greedy)

 'abbaabbba'
 'ab'
 ...'aa'
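
The greedy and non-greedy results can also be checked directly with re.findall(), without the test_patterns() helper. The following is only a minimal sketch; the variable names greedy and non_greedy are chosen here for illustration.

```python
import re

text = 'abbaabbba'

# Greedy: '+' takes as many characters from the set as it can.
greedy = re.findall(r'a[ab]+', text)

# Non-greedy: '+?' stops after the first character from the set.
non_greedy = re.findall(r'a[ab]+?', text)

print(greedy)      # ['abbaabbba']
print(non_greedy)  # ['ab', 'aa']
```

The trailing a at the end of the string is not matched by the non-greedy pattern, because at least one more character from the set would have to follow it.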

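A character set can also be used to exclude specific characters. The caret (^) means to look for characters that are not in the set following the caret.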

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+','sequences without -, ., or space')],
    )

Output:

'[^-. ]+' (sequences without -, ., or space)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'
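
One detail worth checking is that the caret only negates a set when it is the first character inside the brackets. The sketch below assumes nothing beyond the standard re module; the sample string 'a^b' and the variable names are made up for illustration.

```python
import re

text = 'This is some text -- with punctuation.'

# '^' as the first character of the set negates it.
words = re.findall(r'[^-. ]+', text)
print(words)
# ['This', 'is', 'some', 'text', 'with', 'punctuation']

# Anywhere else in the set, '^' is an ordinary character.
carets = re.findall(r'[a^]', 'a^b')
print(carets)  # ['a', '^']
```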

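As character sets grow larger, typing every character that should or should not match becomes tedious. A more compact format uses character ranges to define a character set that includes all of the contiguous characters between the specified start and end points.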

from re_test_patterns import test_patterns

test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+','sequences of lowercase letters'),
     ('[A-Z]+','sequences of uppercase letters'),
     ('[a-zA-Z]+','sequences of lower- or uppercase letters'),
     ('[A-Z][a-z]+','one uppercase followed by lowercase')
        ],
    )

Output:

'[a-z]+' (sequences of lowercase letters)

 'This is some text -- with punctuation.'
 .'his'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z]+' (sequences of uppercase letters)

 'This is some text -- with punctuation.'
 'T'

'[a-zA-Z]+' (sequences of lower- or uppercase letters)

 'This is some text -- with punctuation.'
 'This'
 .....'is'
 ........'some'
 .............'text'
 .....................'with'
 ..........................'punctuation'

'[A-Z][a-z]+' (one uppercase followed by lowercase)

 'This is some text -- with punctuation.'
 'This'
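
To make the shorthand concrete, the following sketch compares the range [a-z] with a set that lists every lowercase ASCII letter explicitly. It is an illustration only; the names explicit, by_range, and by_list are assumptions introduced here, and string.ascii_lowercase comes from the standard library.

```python
import re
import string

text = 'This is some text -- with punctuation.'

# Build '[abcdefghijklmnopqrstuvwxyz]+' by listing every letter.
explicit = '[' + string.ascii_lowercase + ']+'

by_range = re.findall(r'[a-z]+', text)
by_list = re.findall(explicit, text)

print(by_range)
# ['his', 'is', 'some', 'text', 'with', 'punctuation']
print(by_range == by_list)  # True
```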

As a special case of a character set, the metacharacter dot (.) indicates that the pattern should match any single character in that position.

from re_test_patterns import test_patterns

test_patterns(
    'abbaabbba',
    [('a.','a followed by any one character'),
     ('b.','b followed by any one character'),
     ('a.*b','a followed by anything,ending in b'),
     ('a.*?b','a followed by anything,ending in b')
        ],
    )

Output:

'a.' (a followed by any one character)

 'abbaabbba'
 'ab'
 ...'aa'

'b.' (b followed by any one character)

 'abbaabbba'
 .'bb'
 .....'bb'
 .......'ba'

'a.*b' (a followed by anything,ending in b)

 'abbaabbba'
 'abbaabbb'

'a.*?b' (a followed by anything,ending in b)

 'abbaabbba'
 'ab'
 ...'aab'
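
One caveat that the examples above do not exercise: by default the dot does not match a newline, and the re.DOTALL flag changes that. The sketch below is a small illustration with a made-up two-line string, not part of the original examples.

```python
import re

text = 'ab\ncd'

# By default '.' matches any character except a newline.
default = re.findall(r'b.c', text)

# With re.DOTALL, '.' matches a newline as well.
dotall = re.findall(r'b.c', text, re.DOTALL)

print(default)  # []
print(dotall)   # ['b\nc']
```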
