Python library walkthrough: regular expressions (the re library)

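Character	Meaning
.	In the default mode, matches any character except a newline. If the DOTALL flag is specified, it matches any character, including a newline.
^	Matches the start of the string; in MULTILINE mode it also matches immediately after each newline.
$	Matches the end of the string or just before a newline at the end of the string; in MULTILINE mode it also matches just before each newline.
*	Matches 0 or more repetitions of the preceding expression, as many as possible (greedy).
+	Matches 1 or more repetitions of the preceding expression, as many as possible (greedy).
?	Matches 0 or 1 repetition of the preceding expression.
{m}	Matches exactly m repetitions of the preceding expression.
{m,n}	Matches from m to n repetitions of the preceding expression, as many as possible.
{m,n}?	The non-greedy version of the previous quantifier: matches as few repetitions as possible.
|	A|B, where A and B can be arbitrary regular expressions, creates a regular expression that matches either A or B. Any number of expressions can be joined with '|'. The '|' operator is never greedy: once A matches, B is not tried.
(…)	(Group) Matches whatever regular expression is inside the parentheses and marks the start and end of a group. The contents of the group can be retrieved after the match.
(?…)	Extension notation. The first character after '?' determines the syntax of the construct. Extensions usually do not create a new group; (?P<name>…) is the only exception. The extensions currently supported follow.
(?aiLmsux)	(One or more letters from 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags for the whole regular expression: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode matching), and re.X (verbose).
(?:…)	The non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after the match or referenced later in the pattern.
(?aiLmsux-imsx:…)	(Zero or more letters from 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' and one or more letters from 'i', 'm', 's', 'x'.) The letters set or remove the corresponding flags for the part of the expression inside the group only.
(?P<name>…)	(Named group) Like a regular group, but the matched substring is also accessible through the group name name. Group names must be valid Python identifiers, and each group name may be defined only once within a regular expression.
(?P=name)	A backreference to a named group; it matches whatever text was matched by the earlier group named name.
(?#…)	A comment; the contents are ignored.
(?=…)	Matches if … matches next, but does not consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) matches 'Isaac ' only if it is followed by 'Asimov'.
(?!…)	Matches if … does not match next. This is called a negative lookahead assertion. For example, Isaac (?!Asimov) matches 'Isaac ' only if it is not followed by 'Asimov'.
(?<=…)	Matches if the current position in the string is preceded by a match for … that ends at the current position. This is called a positive lookbehind assertion. The contained pattern must match strings of a fixed length.
(?<!…)	Matches if the current position in the string is not preceded by a match for …. This is called a negative lookbehind assertion. As with positive lookbehind assertions, the contained pattern must match strings of a fixed length. Patterns that start with a negative lookbehind assertion may match at the beginning of the string being searched.
(?(id/name)yes-pattern|no-pattern)	Tries to match yes-pattern if the group with the given id or name exists (that is, it took part in the match), and no-pattern otherwise. no-pattern is optional and can be omitted.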
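
The difference between the greedy quantifiers and their '?' variants in the table above is easiest to see side by side. The following is a minimal sketch; the sample strings and variable names are illustrative and do not come from the original post.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' takes as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", text).group()

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
lazy = re.search(r"<.*?>", text).group()

# '{m,n}' is greedy as well; '{m,n}?' takes as few repetitions as possible.
greedy_count = re.search(r"a{2,4}", "aaaaa").group()
lazy_count = re.search(r"a{2,4}?", "aaaaa").group()

print(greedy)        # <b>bold</b> and <i>italic</i>
print(lazy)          # <b>
print(greedy_count)  # aaaa
print(lazy_count)    # aa
```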

Examples

import re

# Inline flag group (?i:...): only [l]{2} is case-insensitive, so it can match the 'Ll' in 'heLlo'
str_config = 'print my self world and heLlo world' \
             ' heLlo my wife'
rule_config = ".*(?i:[l]{2}).*"
results = re.match(rule_config, str_config)
print(results)
print('-'*30, 'result', '-'*30)
# Named group (?P<name>...) with backreference (?P=name): the same character ('l', '+' or 'o') must occur twice
str_config_1 = 'print my self world and hello world' \
             ' heLlo my wife'
rule_config_1 = ".*(?P<name>[l+o]).*?(?P=name)"
results = re.match(rule_config_1, str_config_1)
print(results)
print('-'*30, 'result', '-'*30)
# Same pattern with an inline comment (?#...), which the regex engine ignores
str_config_2 = 'print my self world and hello world' \
             ' heLlo my wife'
rule_config_2 = ".*(?P<name>[l+o]).*?(?P=name)(?#: this is a regex comment)"
results = re.match(rule_config_2, str_config_2)
print(results)
print('-'*30, 'result', '-'*30)
# Positive lookahead: match as little as possible up to, but not including, 'world'
str_config_3 = 'print my self world and hello world' \
             ' hello my wife'
rule_config_3 = ".*?(?=world)"
results = re.match(rule_config_3, str_config_3)
print(results)
print('-'*30, 'result', '-'*30)
# Negative lookahead: holds at the end of the string, so the greedy '.*' matches the whole string
str_config_4 = 'print my self world and hello world' \
             ' hello my wife'
rule_config_4 = ".*(?!world)"
results = re.match(rule_config_4, str_config_4)
print(results)
print('-'*30, 'result', '-'*30)
# Positive lookbehind: 'girl' matches only when it is preceded by 'for '
str_config_5 = 'a boy can do everything for girl, he is just kidding'
rule_config_5 = '.*(?<=for )girl'
results = re.match(rule_config_5, str_config_5)
print(results)
print('-'*30, 'result', '-'*30)
# Negative lookbehind: 'girl' must not be preceded by 'for ', so this prints None
str_config_6 = 'a boy can do everything for girl, he is just kidding'
rule_config_6 = '(.*?)(?<!for )girl'
results = re.match(rule_config_6, str_config_6)
print(results)
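
The examples above do not exercise the conditional construct (?(id/name)yes-pattern|no-pattern) from the table. Below is a small sketch of how it can be used, assuming a hypothetical requirement that an address is either bare or fully wrapped in angle brackets; the pattern and the sample strings are illustrative only.

```python
import re

# Group 1 captures an optional '<'. The conditional (?(1)>|$) then requires
# '>' if group 1 took part in the match, and end-of-string otherwise.
address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

samples = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
accepted = [s for s in samples if address.match(s)]

print(accepted)  # ['<user@host.com>', 'user@host.com']
```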

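Character	Meaning
\number	Matches the contents of the group with the same number. Each pair of parentheses is a group; groups are numbered starting from 1.
\A	Matches only at the start of the string.
\b	Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Formally, \b is the boundary between a \w and a \W character, or between \w and the start or end of the string, so r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. By default Unicode alphanumerics are used in Unicode patterns, but this can be changed with the ASCII flag. If the LOCALE flag is set, word boundaries are determined by the current locale. Inside a character class, \b represents the backspace character, for compatibility with Python string literals.
\B	Matches the empty string, but only when it is not at the beginning or end of a word. So r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is the opposite of \b, so word characters in Unicode patterns are Unicode alphanumerics or the underscore, although this can be changed with the ASCII flag. If the LOCALE flag is used, word boundaries are determined by the current locale.
\d	For Unicode (str) patterns: matches any Unicode decimal digit (any character in Unicode category [Nd]). This includes [0-9] and many other digit characters. If the ASCII flag is used, only [0-9] is matched.
\D	Matches any character that is not a decimal digit; the opposite of \d.
\s	For Unicode (str) patterns: matches Unicode whitespace characters (this includes [ \t\n\r\f\v] and many others, such as the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
\S	Matches any character that is not a whitespace character; the opposite of \s. If the ASCII flag is used, this is equivalent to [^ \t\n\r\f\v].
\w	For Unicode (str) patterns: matches Unicode word characters, which covers most characters that can be part of a word in any language, as well as digits and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
\W	Matches any character that is not a word character; the opposite of \w. If the ASCII flag is used, this is equivalent to [^a-zA-Z0-9_]. If the LOCALE flag is used, it matches characters that are neither alphanumeric in the current locale nor the underscore.
\Z	Matches only at the end of the string.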
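
The \b and \B rows list which strings match and which do not; the sketch below simply checks those claims with re.search. The list and variable names are illustrative.

```python
import re

# \bfoo\b: 'foo' must be delimited by non-word characters or the string edges.
word = re.compile(r"\bfoo\b")
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
whole_word_hits = [s for s in candidates if word.search(s)]

# py\B: 'py' must be followed by another word character.
inside = re.compile(r"py\B")
py_candidates = ["python", "py3", "py2", "py", "py.", "py!"]
non_boundary_hits = [s for s in py_candidates if inside.search(s)]

# Inside a character class, \b means the backspace character (\x08).
backspace = re.search(r"[\b]", "a\x08b")

print(whole_word_hits)    # ['foo', 'foo.', '(foo)', 'bar foo baz']
print(non_boundary_hits)  # ['python', 'py3', 'py2']
print(backspace.span())   # (1, 2)
```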

# \1 refers back to group 1: the captured text must be repeated immediately
str_config_7 = 'thethe boy I see may be only 15 years old'
rule_config_7 = r'(.*)\1'
results = re.match(rule_config_7, str_config_7)
print(results)
print('-'*30, 'result', '-'*30)
# \A and \Z anchor the pattern to the start and end of the string
str_config_8 = 'the, emm...the boy I see may be only 15 years old'
rule_config_8 = r'\Athe.*old\Z'
results = re.search(rule_config_8, str_config_8)
print(results)
print('-'*30, 'result', '-'*30)
# \d+ matches a run of digits; raw strings keep the backslashes intact
str_config_9 = 'the phone number is 0000-123456789'
rule_config_9 = r'\d+\-\d+'
results = re.search(rule_config_9, str_config_9)
print(results)
print('-'*30, 'result', '-'*30)
# \s+ matches the first run of whitespace, including the newline
str_config_10 = 'the \n phone number    is 0000-123456789'
rule_config_10 = r'\s+'
results = re.search(rule_config_10, str_config_10)
print(results)
print('-'*30, 'result', '-'*30)
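
Two details from the table are worth checking in code: how the ASCII flag narrows \d and \w, and how \Z differs from $ when the string ends with a newline. This is a minimal sketch with made-up sample strings; the non-ASCII characters are written as escapes so the source stays plain ASCII.

```python
import re

# Full-width digits (U+FF11..U+FF13) belong to Unicode category Nd.
digits_text = "\uff11\uff12\uff13 456"
unicode_digits = re.findall(r"\d+", digits_text)
ascii_digits = re.findall(r"\d+", digits_text, re.ASCII)

# Accented letters count as word characters only in Unicode mode.
words_text = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", words_text)
ascii_words = re.findall(r"\w+", words_text, re.ASCII)

# '$' also matches just before a trailing newline; '\Z' does not.
dollar = re.search(r"old$", "years old\n")
strict_end = re.search(r"old\Z", "years old\n")

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words)
print(dollar is not None, strict_end is None)  # True True
```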

Some functions in the re library

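import re

str_config = "<A class='go_to_another' href='http://www.baidu.com'>"
# re.compile(pattern, flags=0) builds a regular expression object
rule_config = re.compile("<a .* href='(.*)'>", re.IGNORECASE)
result = rule_config.findall(str_config)
'''
Equivalent to
result = re.findall("<a .* href='(.*)'>", str_config, re.IGNORECASE)
'''
print(result)
r'''
re.A
re.ASCII
Make \w, \W, \b, \B, \d, \D, \s and \S match ASCII only instead of full Unicode

re.DEBUG
Display debug information about the compiled expression. There is no corresponding inline flag.

re.I
re.IGNORECASE
Perform case-insensitive matching; expressions such as [A-Z] also match lowercase letters

re.L
re.LOCALE
Make \w, \W, \b, \B and case-insensitive matching depend on the current locale. Only usable with bytes patterns.

re.M
re.MULTILINE
When set, '^' matches at the start of the string and at the start of each line (immediately after each newline); '$' matches at the end of the string and at the end of each line (immediately before each newline)

re.S
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' matches anything except a newline. Corresponds to the inline flag (?s).

re.X
re.VERBOSE
This flag lets you write regular expressions that are easier to read, by visually separating logical sections of the pattern and adding comments

'''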
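
The flag descriptions above can be checked with a few one-liners. The sketch below is illustrative only: the three-line sample text and the `phone` pattern are assumptions, not part of the original script.

```python
import re

text = "first line\nsecond line\nthird line"

# Without MULTILINE, '^' matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)
each_line = re.findall(r"^\w+", text, re.MULTILINE)

# Without DOTALL, '.' stops at the first newline.
one_line = re.match(r".*", text).group()
everything = re.match(r".*", text, re.DOTALL).group()

# VERBOSE ignores whitespace in the pattern and allows '#' comments.
phone = re.compile(r"""
    (\d{4})   # first block of digits
    -         # separator
    (\d{9})   # second block of digits
""", re.VERBOSE)
parts = phone.search("call 0000-123456789").groups()

print(first_only)  # ['first']
print(each_line)   # ['first', 'second', 'third']
print(one_line)    # first line
print(parts)       # ('0000', '123456789')
```

Flags can be combined with the bitwise OR operator, for example re.MULTILINE | re.IGNORECASE.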

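print('-'*45, 'result showing', '-'*45)

# re.match, re.search, re.fullmatch: matching text
str_config_1 = 'a boy can do everything for girl when she call 0000-123456789, he is just kidding'
rule_config_1 = r'\d{4}-\d{9}'
result_match = re.match(rule_config_1, str_config_1)
result_search = re.search(rule_config_1, str_config_1)
print("the result of match is {}\nthe result of search is {}\n".format(result_match, result_search))
print('-'*45, 'result showing', '-'*45)
'''
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the pattern produces a match, and return a corresponding match object
Return None if no position in the string matches
Note that this is different from finding a zero-length match

re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the pattern, return a corresponding match object
Return None if the string does not match
Note that this is different from a zero-length match

re.fullmatch(pattern, string, flags=0)
If the whole string matches the pattern, return a corresponding match object
Otherwise return None
Note that this is different from a zero-length match
'''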
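
To make the difference between the three functions concrete, here is a minimal sketch on a shortened, made-up input string. It also shows why a zero-length match is not the same as None.

```python
import re

pattern = r"\d{4}-\d{9}"
text = "call 0000-123456789 now"

at_start = re.match(pattern, text)          # the digits are not at index 0
anywhere = re.search(pattern, text)         # scans forward until a match
whole = re.fullmatch(pattern, text)         # the entire string must match
whole_ok = re.fullmatch(pattern, "0000-123456789")

# A zero-length match is still a match object, not None.
empty = re.match(r"\d*", text)

print(at_start)          # None
print(anywhere.group())  # 0000-123456789
print(whole)             # None
print(repr(empty.group()))  # ''
```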

# re.split: splitting a string
str_config_2 = 'a boy can do everything for girl when she call 0000-123456789, he is just kidding'
rule_config_2 = r'\W+'
print(re.split(rule_config_2, str_config_2))
print('-'*45, 'result showing', '-'*45)
'''
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern
If capturing parentheses are used in pattern, the text of every group is also included in the resulting list
If maxsplit is nonzero, at most maxsplit splits are made and the remainder of the string is returned as the last element of the list
'''
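
The effect of capturing parentheses and of maxsplit is easier to see on a short input. A minimal sketch with an illustrative string:

```python
import re

text = "one, two; three"

plain = re.split(r"\W+", text)
# With a capturing group, the separators themselves are kept in the list.
with_separators = re.split(r"(\W+)", text)
# maxsplit=1: split once, the rest stays in the last element.
limited = re.split(r"\W+", text, maxsplit=1)

print(plain)            # ['one', 'two', 'three']
print(with_separators)  # ['one', ', ', 'two', '; ', 'three']
print(limited)          # ['one', 'two; three']
```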

# re.findall: find all matches in a string
str_config_3 = "my name is Mike, my age is 17, my height is 178, and my weight is 50kg. " \
               "one of my website URL is <a href='https://www.baidu.com'></a> and " \
               "another one is <a href='http://www.baidu.com/news/index.html'></a>"
rule_config_3 = re.compile(r"<a href='(.*?)'></a>", re.S)
result = rule_config_3.findall(str_config_3)
print(result)
# re.finditer(pattern, string, flags=0): find all matches, returning an iterator of match objects
rule_config_3 = re.compile(r"<a href='(.*?)'></a>", re.S)
results = rule_config_3.finditer(str_config_3)
for result in results:
    print(result)
print('-'*45, 'result showing', '-'*45)
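
What findall returns depends on the number of groups in the pattern, while finditer always yields match objects. The following sketch uses a made-up key=value string to show the difference.

```python
import re

text = "age=17 height=178"

# No groups: a list of the matched substrings.
numbers = re.findall(r"\d+", text)
# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)
# finditer yields match objects, so positions are available too.
located = [(m.group(1), m.span()) for m in re.finditer(r"(\w+)=(\d+)", text)]

print(numbers)  # ['17', '178']
print(pairs)    # [('age', '17'), ('height', '178')]
print(located)  # [('age', (0, 6)), ('height', (7, 17))]
```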

# re.sub(pattern, repl, string, count=0, flags=0): replace matches in a string
rule_config_4 = ' '
str_config_4 = 'I am a boy!'
result = re.sub(rule_config_4, '\n', str_config_4, count=2)
print(result)
'''
count is the maximum number of replacements
'''
print('-'*45, 'result showing', '-'*45)

# re.subn(pattern, repl, string, count=0, flags=0): replace matches and return a tuple (new_string, number_of_substitutions)
rule_config_5 = ' '
str_config_5 = 'I am a boy!'
result = re.subn(rule_config_5, '\n', str_config_5)
print(result)
print('-'*45, 'result showing', '-'*45)
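
A short sketch of count, the tuple returned by subn, and group references in the replacement string. The underscore replacement and the digit-swapping pattern are illustrative choices, not taken from the script above.

```python
import re

sentence = "I am a boy!"

# count=2 replaces only the first two runs of whitespace.
first_two = re.sub(r"\s+", "_", sentence, count=2)
# subn returns the new string together with the number of replacements.
replaced, how_many = re.subn(r"\s+", "_", sentence)
# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{9})", r"\2/\1", "0000-123456789")

print(first_two)           # I_am_a boy!
print(replaced, how_many)  # I_am_a_boy! 3
print(swapped)             # 123456789/0000
```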

# re.escape(pattern): escape the special characters in pattern
print(re.escape('http://www.python.org'))
print('-'*45, 'result showing', '-'*45)

# re.purge(): clear the regular expression cache
re.purge()
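
re.escape is mainly useful when literal text has to be embedded in a pattern. A minimal sketch, assuming a hypothetical literal that happens to contain the metacharacters '+' and '?':

```python
import re

literal = "1+1=2?"
text = "is 1+1=2? yes"

# Used directly, '+' and '?' act as quantifiers, so the literal is not found.
unescaped = re.search(literal, text)
# After escaping, every character is matched literally.
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2?
```

The exact set of characters that re.escape prefixes with a backslash has changed between Python versions, so it is safer to rely on the matching behaviour than on the escaped string itself.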

Regular expression objects and match objects

# Regular expression (Pattern) objects
str_config_5 = 'there is a error in message. Please help me to check out it'
rule_config_5 = re.compile('(?<=error)(.*?)[e]+')
result = rule_config_5.findall(str_config_5)
print(result)
# Pattern.flags: the regex matching flags
print(rule_config_5.flags)
# Pattern.groups: the number of capturing groups in the pattern
print(rule_config_5.groups)
# Pattern.groupindex: a dictionary mapping the group names defined by (?P<id>) to group numbers. It is empty if the pattern has no named groups.
print(rule_config_5.groupindex)
# Pattern.pattern: the pattern string from which the object was compiled
print(rule_config_5.pattern)
print('-'*45, 'result showing', '-'*45)

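# Match objects
str_config = 'the 12 boy in my home is my little brother who came home last month'
rule_config = r'(?P<first_name>\w+\s)(?P<last_name>\d+)'
result_search = re.search(rule_config, str_config)
# Match.expand(template): perform backslash substitution on template and return the result
print(result_search)
print(result_search.expand('\n'))
r'''
Match.expand(template)
Return the string obtained by doing backslash substitution on template, as done by the sub() method
Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group
'''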
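
Since expand('\n') only shows the escape conversion, here is a small sketch of the group substitution part. The name string and group names are illustrative assumptions.

```python
import re

m = re.search(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Asimov")

# Numeric references \1 and \2 are replaced by the group contents.
by_number = m.expand(r"\2, \1")
# \g<name> refers to a named group.
by_name = m.expand(r"\g<last>, \g<first>")

print(by_number)  # Asimov, Isaac
print(by_name)    # Asimov, Isaac
```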

# Match.group([group1, ...]): return one or more subgroups of the match
print(result_search.group(0))
print(result_search.group(1))
print(result_search.group(2))
print(result_search.group('first_name'))
print(result_search.group('last_name'))
'''
 If a group argument is 0, the corresponding return value is the entire matching string
'''

# Match.groups(default=None): return a tuple containing all the subgroups of the match
print(result_search.groups())

# Match.__getitem__(g): identical to m.group(g)
'''
This allows easier access to an individual group of a match
'''
print(result_search[1])

# Match.groupdict(default=None): return a dictionary containing all the named subgroups
print(result_search.groupdict())

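# Match.pos
# The value of pos that was passed to the search() or match() method of a regex object
# This is the index into the string at which the regex engine started looking for a match
print(result_search.pos)

# Match.endpos
# The value of endpos that was passed to the search() or match() method of a regex object
# This is the index into the string beyond which the regex engine will not go
print(result_search.endpos)

# Match.lastindex
# The integer index of the last matched capturing group,
# or None if no group was matched at all. For example, applied to the string 'ab', the expressions (a)b, ((a)(b)) and ((ab)) give lastindex == 1,
# while (a)(b) gives lastindex == 2.
print(result_search.lastindex)

# Match.lastgroup: the name of the last matched capturing group, or None if the group has no name or no group was matched at all.
print(result_search.lastgroup)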
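
The lastindex values quoted in the comment above can be verified directly. The sketch below also adds one illustrative lastgroup case with an optional named group; the variable names are assumptions.

```python
import re

# Patterns from the comment above, all applied to 'ab'.
last_indexes = {
    pattern: re.match(pattern, "ab").lastindex
    for pattern in ["(a)b", "((a)(b))", "((ab))", "(a)(b)"]
}

# No capturing groups at all: lastindex is None.
no_groups = re.match(r"ab", "ab").lastindex

# The optional group 'y' does not take part, so the last matched group is 'x'.
last_name = re.match(r"(?P<x>a)(?P<y>b)?", "a").lastgroup

print(last_indexes)  # {'(a)b': 1, '((a)(b))': 1, '((ab))': 1, '(a)(b)': 2}
print(no_groups)     # None
print(last_name)     # x
```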

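# Match.re: the regular expression object (re.compile) whose match() or search() produced this match
print(result_search.re)

# Match.string: the string passed to match() or search()
print(result_search.string)

# Match.start([group]) Match.end([group])
# Return the indices of the start and end of the substring matched by group
# group defaults to 0 (meaning the whole matched substring)
# Return -1 if group exists but did not contribute to the match
# For a match object m and a group g that did contribute to the match,
# the substring matched by group g (equivalent to m.group(g)) is
# m.string[m.start(g):m.end(g)]
# Note that m.start(group) equals m.end(group)
# if group matched an empty string
print(result_search.start())

# Match.span([group])
# For a match m, return the 2-tuple (m.start(group), m.end(group))
# Note that if group did not contribute to the match, this is (-1, -1)
# group defaults to 0, the entire match
print(result_search.span())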
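
The -1 and (-1, -1) cases, the empty-group case, and pos/endpos are not reached by the script above, so the following sketch exercises them on small made-up inputs. Note that pos and endpos can only be passed to the methods of a compiled pattern, not to the module-level re.search().

```python
import re

# Group 2 exists in the pattern but does not take part in this match.
m = re.match(r"(a)|(b)", "a")
missing_start = m.start(2)
missing_span = m.span(2)

# Group 1 matches an empty string, so start and end coincide.
e = re.search(r"(\d*)x", "x")
empty_span = e.span(1)

# pos and endpos restrict the search window of a compiled pattern.
digits = re.compile(r"\d+")
w = digits.search("abc 123 456", 5, 9)
recovered = w.string[w.start():w.end()]

print(missing_start, missing_span)  # -1 (-1, -1)
print(empty_span)                   # (0, 0)
print(w.pos, w.endpos, recovered)   # 5 9 23
```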
