Online Regular Expression Test: http://tool.oschina.net/regex/#
1. Summary of commonly used matching rules:
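Pattern Description
\w Matches a letter, digit, or underscore
\W Matches any character that is not a letter, digit, or underscore
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S Matches any non-whitespace character
\d Matches any digit; equivalent to [0-9] for ASCII text
\D Matches any non-digit character
\A Matches the start of the string
\Z Matches the end of the string (in Perl-style engines it also matches just before a final newline; in Python's re it matches only at the very end)
\z Matches the very end of the string (Perl-style; not available in older versions of Python's re)
\G Matches the position where the previous match ended (not supported by Python's re)
\n Matches a newline character
\t Matches a tab character
^ Matches the start of the string
$ Matches the end of the string
. Matches any character except a newline; when the re.DOTALL flag is given, it matches any character, including a newline
[...] Matches any one character from the set; for example, [amk] matches 'a', 'm', or 'k'
[^...] Matches any one character not in the set; for example, [^abc] matches any character other than a, b, or c
* Matches 0 or more repetitions of the preceding expression
+ Matches 1 or more repetitions of the preceding expression
? Matches 0 or 1 repetition of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy
{n} Matches exactly n repetitions of the preceding expression
{n,m} Matches n to m repetitions of the preceding expression, greedily (write it without spaces)
a|b Matches either a or b
( ) Matches the expression inside the parentheses and captures it as a group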
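To make a few rows of the table concrete, here is a minimal sketch using Python's re module. The sample string and the variable names are invented for illustration and are not part of the original notes.

```python
import re

# Sample text invented for this illustration.
text = 'user_01 paid 42 USD on 2016-12-15'

# \w+ : a run of letters, digits, and underscores at the start of the string.
first_word = re.match(r'\w+', text).group()

# \d+ : every run of digits in the string.
numbers = re.findall(r'\d+', text)

# ( ), {n}, and $ : capture year, month, and day at the end of the string.
date = re.search(r'(\d{4})-(\d{2})-(\d{2})$', text)

# a|b : either alternative.
currency = re.search(r'USD|EUR', text).group()

# [...] : any one character from the set; here each vowel is removed.
no_vowels = re.sub(r'[aeiou]', '', 'regex demo')

print(first_word)      # user_01
print(numbers)         # ['01', '42', '2016', '12', '15']
print(date.groups())   # ('2016', '12', '15')
print(currency)        # USD
print(no_vowels)       # rgx dm
```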
2. Universal matching
A dot (.) matches any character except a newline, and a star (*) matches the preceding element zero or more times. Combined as .*, they match any run of characters, so we do not have to spell out a pattern character by character.
Greedy matching: .*
By default .* is greedy: it matches as many characters as possible while still letting the rest of the pattern succeed. For example:
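import re
content = 'Hello 1234567 World_This is a Regex Demo'
# Raw string (r'...') so that \d reaches the regex engine unchanged.
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result.group(1))
# Output: 7
# The greedy .* takes as much as it can ('llo 123456') while the overall pattern still matches, so only '7' is left for (\d+).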
Non-greedy matching: .*?
To capture 1234567 instead, make the wildcard non-greedy by writing .*?. For example:
import re
content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result.group(1))
# Output: 1234567
Greedy matching consumes as many characters as possible; non-greedy matching consumes as few as possible. Here .*? is followed by \d+, which matches digits. As soon as .*? reaches the first digit, \d+ can match from that point, so .*? stops expanding and leaves the digits to \d+. As a result, .*? matches as few characters as possible and \d+ captures 1234567.
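One consequence of "as few characters as possible" is worth seeing directly: when a non-greedy group is the last thing in the pattern, it can match nothing at all. The sketch below assumes a made-up URL (example.com is a placeholder) and compares the two forms.

```python
import re

# Hypothetical URL, used only for illustration.
content = 'http://example.com/comment/kEraCN'

# Non-greedy group at the end: nothing forces it to expand, so it matches ''.
lazy = re.match(r'http.*?comment/(.*?)', content)

# Greedy group at the end: it runs to the end of the string.
greedy = re.match(r'http.*?comment/(.*)', content)

print(repr(lazy.group(1)))    # ''
print(repr(greedy.group(1)))  # 'kEraCN'
```

So non-greedy matching suits the middle of a pattern, while a trailing capture usually wants the greedy form or an explicit end anchor.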
3. Modifiers
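Modifier Description
re.I Makes matching case-insensitive
re.L Makes matching locale-aware
re.M Multi-line matching; affects ^ and $
re.S Makes . match every character, including newlines
re.U Interprets characters according to the Unicode character set; affects \w, \W, \b, \B (the default for str patterns in Python 3)
re.X Allows a more flexible layout (whitespace and comments inside the pattern) so that the regular expression is easier to read
re.S and re.I are the ones used most often when matching web page content.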
Example:
import re
content = '''Hello 1234567 World_This
is a Regex Demo
'''
result = re.match(r'^He.*?(\d+).*?Demo$', content)
print(result.group(1))
# AttributeError: 'NoneType' object has no attribute 'group' (the match failed, so there is no group to read)
The text to match contains a newline, and . matches any character except a newline, so the match fails and match() returns None. Adding the re.S modifier fixes this:
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result.group(1))
# Output: 1234567
Passing re.S as the third argument of match() makes . match every character, including newlines.
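The other modifiers are passed the same way, and several can be combined with the | operator. The following is a small sketch of re.I and re.M; the sample text and variable names are invented for illustration.

```python
import re

# Sample text invented for this illustration.
content = 'Hello World\nhello regex\nHELLO again'

# re.I alone: ^ still matches only at the start of the whole string.
only_first = re.findall(r'^hello', content, re.I)

# re.I | re.M: ^ now also matches at the start of every line.
every_line = re.findall(r'^hello', content, re.I | re.M)

# re.M: $ also matches at the end of every line.
line_ends = re.findall(r'\w+$', content, re.M)

print(only_first)  # ['Hello']
print(every_line)  # ['Hello', 'hello', 'HELLO']
print(line_ends)   # ['World', 'regex', 'again']
```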
4. re library functions
- re.match(): Tries to match only at the beginning of the string; if the pattern does not match there, it returns None;
import re
content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)
# Output: None (the string does not start with 'Hello')
- re.search(): Scans the entire string and returns the first successful match;
import re
content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)
# Output: <re.Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
# (older Python versions print _sre.SRE_Match instead of re.Match)
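Because both match() and search() return None when nothing matches, calling group() on the result without a check raises the AttributeError seen earlier. A minimal way to guard against that is sketched below; first_number is a hypothetical helper name, not something from the re module.

```python
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'

def first_number(pattern, text):
    """Return the first captured group, or None if the pattern does not match.

    Hypothetical helper, written only for this illustration.
    """
    result = re.search(pattern, text)
    if result is None:
        return None
    return result.group(1)

print(first_number(r'Hello.*?(\d+)', content))    # 1234567
print(first_number(r'Goodbye.*?(\d+)', content))  # None
```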
- re.findall(): Scans the entire string and returns a list of all non-overlapping matches of the regular expression;
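A short sketch of findall() follows, reusing the date strings from the re.compile() example joined into one string (the joined string itself is an assumption made for illustration). Without groups the result is a list of strings; with several groups it is a list of tuples.

```python
import re

# The three timestamps from these notes, joined into one string for illustration.
content = '2016-12-15 12:00, 2016-12-17 12:55, 2016-12-22 13:21'

# No capturing groups: each item is the whole matched text.
times = re.findall(r'\d{2}:\d{2}', content)

# Several capturing groups: each item is a tuple of the groups.
dates = re.findall(r'(\d{4})-(\d{2})-(\d{2})', content)

print(times)  # ['12:00', '12:55', '13:21']
print(dates)  # [('2016', '12', '15'), ('2016', '12', '17'), ('2016', '12', '22')]
```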
- re.sub(): Replaces every match of the pattern in the string with the given replacement;
import re
content = '54aK54yr5oiR54ix5L2g'
# Remove every run of digits.
content = re.sub(r'\d+', '', content)
print(content)
# Output: aKyroiRixLg
- re.compile(): Compiles a pattern string into a regular expression object that can be reused in later matches;
import re
content1 = '2016-12-15 12:00'
content2 = '2016-12-17 12:55'
content3 = '2016-12-22 13:21'
# Compile the pattern string once into a regular expression object.
pattern = re.compile(r'\d{2}:\d{2}')
# Reuse the compiled object by calling its own sub() method.
result1 = pattern.sub('', content1)
result2 = pattern.sub('', content2)
result3 = pattern.sub('', content3)
print(result1, result2, result3)
# Output: 2016-12-15  2016-12-17  2016-12-22 (each result keeps the space that preceded the time)
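A compiled pattern object offers the same operations as the module-level functions (match, search, findall, sub), so it can be reused in a loop. The sketch below assumes a small list of sample lines, the last of which is invented to show the no-match case.

```python
import re

# Compile once; the two groups capture hours and minutes.
time_pattern = re.compile(r'(\d{2}):(\d{2})')

# Sample lines; the last one is invented to show a failed search.
lines = ['2016-12-15 12:00', '2016-12-17 12:55', 'no time here']

results = []
for line in lines:
    found = time_pattern.search(line)
    # group(1, 2) returns both groups as a tuple; keep None when nothing matched.
    results.append(found.group(1, 2) if found else None)

print(results)  # [('12', '00'), ('12', '55'), None]
```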
Reference:
https://germey.gitbooks.io/python3webspider/3.3-%E6%AD%A3%E5%88%99%E8%A1%A8%E8%BE%BE%E5%BC%8F.html