21. re module

First, regular expressions

1. Metacharacters

Metacharacter	Matches
.	Any character except a newline
\w	A letter, digit, or underscore
\s	Any whitespace character
\d	A digit
\n	A newline
\t	A tab
\b	A word boundary (the start or end of a word)
\B	A position that is not a word boundary
^	The start of the string
$	The end of the string
\W	Any character that is not a letter, digit, or underscore
\D	Any non-digit character
\S	Any non-whitespace character
a|b	Either a or b
()	The expression in parentheses, treated as a group
[...]	Any one character in the set
[^...]	Any one character not in the set

2. Quantifiers

Quantifier	Matches
*	0 or more repetitions
+	1 or more repetitions
?	0 or 1 repetition
{n}	Exactly n repetitions
{n,}	n or more repetitions
{,m}	From 0 to m repetitions
{n,m}	From n to m repetitions

3. Usage

import re
string_test = "hello"

(1) .
string_1 = re.findall('.', string_test)
print(string_1)
>['h', 'e', 'l', 'l', 'o']

(2) ^
string_2 = re.findall('^h', string_test)
print(string_2)
>['h']

(3) $
string_3 = re.findall('o$', string_test)
print(string_3)
>['o']

(4) * - with a single-character pattern, matches of zero repetitions produce empty strings
string_test = "hello llo lw"
string_1 = re.findall('l*', string_test)
print(string_1)
>['', '', 'll', '', '', 'll', '', '', 'l', '', '']

string_test = "hello llo lw"
string_1 = re.findall('ll*', string_test)
print(string_1)
>['ll', 'll', 'l']

(5) + 
string_test = "hello llo lw"
string_1 = re.findall('l+', string_test)
print(string_1)
>['ll', 'll', 'l']

string_test = "hello llo lw"
string_1 = re.findall('ll+', string_test)
print(string_1)
>['ll', 'll']

(6) ?
string_test = "hello llo lw"
string_1 = re.findall('l?', string_test)
print(string_1)
>['', '', 'l', 'l', '', '', 'l', 'l', '', '', 'l', '', '']

string_test = "hello llo lw"
string_1 = re.findall('ll?', string_test)
print(string_1)
>['ll', 'll', 'l']

(7) {n,}
str1='iii amiiii ssdii iihjf iiifgfdgi '
str2=re.findall('ii{2,}',str1)
print(str2)
>['iii', 'iiii', 'iii']

(8) {,m}
str1='iii amiiii ssdii iihjf iiifgfdgi '
str2=re.findall('ii{,2}',str1)
print(str2)
>['iii', 'iii', 'i', 'ii', 'ii', 'iii', 'i']

(9) {n,m}
str1='iii amiiii ssdii iihjf iiifgfdgi '
str2=re.findall('ii{1,2}',str1)
print(str2)
>['iii', 'iii', 'ii', 'ii', 'iii']

(10) \d
str1='iii amiiii 123er45vg44 '
str2=re.findall(r'\d',str1)
print(str2)
>['1', '2', '3', '4', '5', '4', '4']

(11) \w
str1='iii am_你好iiii 123er45vg44 '
str2=re.findall(r'\w',str1)
print(str2)
>['i', 'i', 'i', 'a', 'm', '_', '你', '好', 'i', 'i', 'i', 'i', '1', '2', '3', 'e', 'r', '4', '5', 'v', 'g', '4', '4']

(12) \s
str1='iii am_你好iiii 123\ner\t45vg44 '
str2=re.findall(r'\s',str1)
print(str2)
>[' ', ' ', '\n', '\t', ' ']

(13) \b
str1='i love python '
str2=re.findall(r'\bon',str1)
str3=re.findall(r'on\b',str1)
print(str2)
print(str3)
>[]
>['on']

(14) \D
str1='i love python12456'
str2=re.findall(r'\D',str1)
print(str2)
>['i', ' ', 'l', 'o', 'v', 'e', ' ', 'p', 'y', 't', 'h', 'o', 'n']

(15) \S
str1='i love python12456\n\t'
str2=re.findall(r'\S',str1)
print(str2)
>['i', 'l', 'o', 'v', 'e', 'p', 'y', 't', 'h', 'o', 'n', '1', '2', '4', '5', '6']

(16) \W
str1='i love ￥%*python12456\n\t'
str2=re.findall(r'\W',str1)
print(str2)
>[' ', ' ', '￥', '%', '*', '\n', '\t']

(17) \B
str1='i love python12456'
str2=re.findall(r'ov\B',str1)
print(str2)
>['ov']

(18) [] character set
str1='i love python12456'
str2=re.findall(r'[a-z]',str1)
print(str2)
>['i', 'l', 'o', 'v', 'e', 'p', 'y', 't', 'h', 'o', 'n']

str1='i love python12456'
str2=re.findall(r'[^\d]',str1)
print(str2)
>['i', ' ', 'l', 'o', 'v', 'e', ' ', 'p', 'y', 't', 'h', 'o', 'n']

(19) |
str1='i love python12456\n\n'
str2=re.findall(r'\d|\s',str1)  # | means "or" only outside []; inside [] it is a literal |
print(str2)
>[' ', ' ', '1', '2', '4', '5', '6', '\n', '\n']
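The difference between | as alternation and | inside a character class can be seen with a string that actually contains a | character. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

sample = "a|b 1"  # hypothetical sample text containing a literal '|'

# Outside a character class, | separates alternatives: a digit OR a whitespace character.
alternation = re.findall(r'\d|\s', sample)

# Inside [...], | is an ordinary character, so the literal '|' is matched as well.
char_class = re.findall(r'[\d|\s]', sample)

print(alternation)
print(char_class)
```

On strings without a | character both patterns give the same result, which is why the two forms are easy to confuse.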

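(20) () grouping - marks the start and end of a subexpression. The subexpression is captured for later use. With findall, only the content inside the parentheses is returned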
str1='i love python12456\n\n'
str2=re.findall(r'n(\d)',str1)
print(str2)
>['1']
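How findall treats groups depends on how many capturing groups the pattern has. The sketch below reuses the string from the example above; the variable names are only illustrative.

```python
import re

text = 'i love python12456'

# One capturing group: findall returns only the text of that group.
one_group = re.findall(r'n(\d)', text)

# Two capturing groups: findall returns a list of tuples, one entry per group.
two_groups = re.findall(r'(n)(\d)', text)

# A non-capturing group (?:...) groups without capturing,
# so findall returns the whole match again.
no_capture = re.findall(r'n(?:\d)', text)

print(one_group)
print(two_groups)
print(no_capture)
```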

Second, greedy and non-greedy matching

Regular expressions are typically used to find matches in a text string. In Python, quantifiers are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and try to match as few characters as possible. Appending ? to "*", "?", "+", or "{m,n}" turns a greedy quantifier into a non-greedy one.

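1. Greedy mode

In greedy mode the quantifier first consumes as much of the string as it can, up to the end of the string. If the rest of the pattern then fails to match, the engine gives characters back one at a time and retries. This process is called backtracking.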

import re
str1='<table><td><th>贪婪</th><th>贪婪</th><th>贪婪</th></td></table>贪婪'
str2=re.findall(r'<.*>',str1)
print(str2)

>['<table><td><th>贪婪</th><th>贪婪</th><th>贪婪</th></td></table>']

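2. Non-greedy mode

In non-greedy mode the quantifier starts with the fewest characters it is allowed to match and, working from left to right, extends the match one character at a time only until the rest of the pattern succeeds. The match therefore stops at the first possible end instead of backing off from the end of the string.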

import re
str1='<table><td><th>贪婪</th><th>贪婪</th><th>贪婪</th></td></table>贪婪'
str2=re.findall(r'<.*?>',str1)
print(str2)
>['<table>', '<td>', '<th>', '</th>', '<th>', '</th>', '<th>', '</th>', '</td>', '</table>']
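The same ? suffix works on the other quantifiers as well. The following sketch, using a made-up digit string, compares greedy and non-greedy forms of + and {m,n}.

```python
import re

digits = "12345"  # hypothetical sample string

# Greedy +: takes all the digits in one match.
greedy_plus = re.findall(r'\d+', digits)

# Non-greedy +?: stops after the minimum of one digit each time.
lazy_plus = re.findall(r'\d+?', digits)

# Greedy {2,3}: prefers three digits, then matches the remaining two.
greedy_range = re.findall(r'\d{2,3}', digits)

# Non-greedy {2,3}?: prefers two digits; the last single digit cannot match.
lazy_range = re.findall(r'\d{2,3}?', digits)

print(greedy_plus, lazy_plus, greedy_range, lazy_range)
```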

Third, re module

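1. re.A (re.ASCII)
    Makes \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode (str) patterns and is ignored for bytes patterns.

2. re.I (re.IGNORECASE)
    Performs case-insensitive matching; expressions such as [A-Z] will also match lowercase letters.

3. re.L (re.LOCALE)
    Makes \w, \W, \b, \B and case-insensitive matching depend on the current locale. This flag can only be used with bytes patterns. Its use is discouraged because the locale mechanism is very unreliable, handles only one "culture" at a time, and works only with 8-bit locales. In Python 3, Unicode matching is already enabled by default for Unicode (str) patterns and can handle different locales and languages.

4. re.M (re.MULTILINE)
    When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately after each newline); the pattern character '$' matches at the end of the string and at the end of each line (immediately before each newline). By default, '^' matches only at the beginning of the string, and '$' matches only at the end of the string and immediately before the newline (if any) at the end of the string.

5. re.S (re.DOTALL)
    Makes the '.' special character match any character at all, including a newline; without this flag, '.' matches anything except a newline.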
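The effect of these flags is easiest to see side by side. This is a small sketch on a made-up three-line string; re.L is left out because it applies only to bytes patterns and is discouraged.

```python
import re

text = "First line\nsecond LINE\n你好 123"  # hypothetical sample text

# Default (Unicode) \w matches the Chinese characters; with re.A it does not.
unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.A)

# re.I: 'line' also matches 'LINE'.
ignore_case = re.findall(r'line', text, flags=re.I)

# re.M: ^ matches at the start of every line, not just the start of the string.
first_only = re.findall(r'^\w+', text)
line_starts = re.findall(r'^\w+', text, flags=re.M)

# re.S: '.' also matches the newline between the first two lines.
dot_default = re.findall(r'First.*second', text)
dot_all = re.findall(r'First.*second', text, flags=re.S)

print(ascii_words, ignore_case, line_starts, dot_all)
```

Several flags can be combined with the | operator, for example flags=re.I | re.M.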

1. re.findall(pattern, string, flags=0)

findall finds every non-overlapping match of the pattern in the string and returns the matched strings as a list. If nothing in the string matches the pattern, it returns an empty list; otherwise the list contains one element per match. Either way, the result of findall can be iterated over directly without error.

import re

re_str = "hello this is python 2.7.13 and python 3.4.5"
pattern = r"python [0-9]\.[0-9]\.[0-9]"
result = re.findall(pattern=pattern, string=re_str)
print(result)
>['python 2.7.1', 'python 3.4.5']

# Ignore case
re_str = "hello this is python 2.7.13 and Python 3.4.5"
pattern = r"python [0-9]\.[0-9]\.[0-9]"
result = re.findall(pattern=pattern, string=re_str, flags=re.IGNORECASE)
print(result)
>['python 2.7.1', 'Python 3.4.5']

2. re.compile(pattern, flags=0)

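re.compile compiles a pattern into a regular expression object whose methods (findall, match, search, and so on) can be called repeatedly. When the same pattern is applied to a large amount of data, compiling it once up front can improve performance.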

import re

re_str = "hello this is python 2.7.13 and Python 3.4.5"
re_obj = re.compile(pattern=r"python [0-9]\.[0-9]\.[0-9]", flags=re.IGNORECASE)
res = re_obj.findall(re_str)
print(res)
>['python 2.7.1', 'Python 3.4.5']
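A typical use of a compiled pattern is to apply it to many strings in a loop. The sketch below assumes a made-up list of lines and a hypothetical name version_re; it also uses [0-9]+ so that multi-digit parts such as 13 are matched in full.

```python
import re

# Compile once, reuse for every line.
version_re = re.compile(r"python [0-9]+\.[0-9]+\.[0-9]+", flags=re.IGNORECASE)

lines = [  # hypothetical input data
    "hello this is python 2.7.13",
    "and Python 3.4.5",
    "no version here",
]

# Each call reuses the same compiled object; lines without a match give [].
found = [version_re.findall(line) for line in lines]
print(found)
```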

3. re.match(pattern, string, flags=0)

match is similar to the string method startswith, but it takes a regular expression and is therefore more powerful and expressive. It tries to match at the beginning of the string: if the pattern matches, it returns a Match object (re.Match; shown as _sre.SRE_Match before Python 3.7); if the match fails, it returns None. It is in effect a prefix test with a regular expression.

1. Check whether a string starts with "what"

import re

s_true = "what is a boy"
s_false = "What is a boy"
re_obj = re.compile("what")
print(re_obj.match(string=s_true))
><re.Match object; span=(0, 4), match='what'>
print(re_obj.match(string=s_false))
>None

2. Check whether a string starts with digits
s_true = "123what is a boy"
s_false = "what is a boy"
re_obj = re.compile(r"\d+")
print(re_obj.match(s_true))
><re.Match object; span=(0, 3), match='123'>
print(re_obj.match(s_true).start())
>0
print(re_obj.match(s_true).end())
>3
print(re_obj.match(s_true).string)
>123what is a boy
print(re_obj.match(s_true).group())
>123
print(re_obj.match(s_false))
>None
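Because match returns None on failure, calling .group() directly on the result raises AttributeError when there is no match. A minimal sketch of a safe wrapper follows; leading_digits is a hypothetical helper name, not part of the re module.

```python
import re

digits_re = re.compile(r"\d+")

def leading_digits(text):
    """Return the digits at the start of text, or None if there are none."""
    m = digits_re.match(text)
    if m is None:          # no match at the beginning of the string
        return None
    return m.group()

print(leading_digits("123what is a boy"))
print(leading_digits("what is a boy"))
```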

4. re.search(pattern, string, flags=0)

search also returns a Match object when the pattern matches. The difference from match is that match only matches at the beginning of the string, while search can match starting anywhere in the string. What they have in common is that both return a Match object on success and None on failure. Note that search finds only the first match: if the string contains several matches of the pattern, only the first one is returned. To get all of them, the simplest way is findall; finditer can also be used.

print(re.search(r'\dcom','www.4comrunoob.5com').group())
>4com
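The three behaviours described above can be compared on the same input. This sketch reuses the string from the example; the variable names are illustrative.

```python
import re

text = "www.4comrunoob.5com"
pattern = r'\dcom'

# match only tries the beginning of the string, so it fails here.
at_start = re.match(pattern, text)

# search scans forward and stops at the first match.
first_hit = re.search(pattern, text)

# findall returns every match.
all_hits = re.findall(pattern, text)

print(at_start)
print(first_hit.group(), first_hit.span())
print(all_hits)
```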

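*Note: when match or search succeeds, the result is a Match object, which has the following methods:
* group() returns the string matched by the RE
* start() returns the position where the match starts
* end() returns the position where the match ends
* span() returns a tuple containing the (start, end) positions of the match
* group(n1, n2, ...) accepts one or more group numbers and returns the strings matched by those groups; with no argument or with 0 it returns the whole match.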

import re

a = "123abc456"
m = re.search("([0-9]*)([a-z]*)([0-9]*)", a)
print(m.group(0))   # 123abc456, the whole match
print(m.group(1))   # 123
print(m.group(2))   # abc
print(m.group(3))   # 456
Note: group(1) returns the part matched by the first pair of parentheses, group(2) the second, and group(3) the third.
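Passing several group numbers at once, and asking for positions of a single group, can be sketched as follows. groups() is an additional Match method that returns all groups as a tuple; it is shown here only for comparison.

```python
import re

m = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "123abc456")

# Several group numbers at once give a tuple of the corresponding strings.
first_and_third = m.group(1, 3)

# groups() returns every capturing group as a tuple.
all_groups = m.groups()

# span() with no argument covers the whole match;
# start(n) and end(n) give the positions of group n.
whole_span = m.span()
group2_pos = (m.start(2), m.end(2))

print(first_and_third, all_groups, whole_span, group2_pos)
```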

5. re.finditer(pattern, string, flags=0)

finditer scans the string and returns an iterator that yields a Match object for each non-overlapping match of the pattern, in the order found.

import re

re_str = "what is a different between python 2.7.14 and python 3.5.4"
re_obj = re.compile(r"\d{1,}\.\d{1,}\.\d{1,}")

for i in re_obj.finditer(re_str):
    print(i)
><re.Match object; span=(35, 41), match='2.7.14'>
><re.Match object; span=(53, 58), match='3.5.4'>
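Since every item from finditer is a Match object, both the matched text and its position are available in one pass, which findall alone does not provide. A short sketch, reusing the string above with a hypothetical name version_re:

```python
import re

re_str = "what is a different between python 2.7.14 and python 3.5.4"
version_re = re.compile(r"\d+\.\d+\.\d+")

# Collect (matched text, (start, end)) for every match.
versions = [(m.group(), m.span()) for m in version_re.finditer(re_str)]
print(versions)
```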

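6. re.sub(pattern, repl, string, count=0, flags=0)

re.sub is similar to the string method replace, except that the text to be replaced is described by a regular expression. count limits the number of replacements; the default 0 replaces every match.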

import re

re_str = "what is a different between python 2.7.14 and python 3.5.4"
re_obj = re.compile(r"\d{1,}\.\d{1,}\.\d{1,}")
print(re_obj.sub("a.b.c",re_str,count=1))
>what is a different between python a.b.c and python 3.5.4

print(re_obj.sub("a.b.c",re_str,count=2))
>what is a different between python a.b.c and python a.b.c

print(re_obj.sub("a.b.c",re_str))
>what is a different between python a.b.c and python a.b.c
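Beyond a fixed replacement string, repl may refer to captured groups or be a function that receives each Match object. The sketch below is only an illustration; the shortened input string and the helper bump_patch are made up.

```python
import re

re_str = "python 2.7.14 and python 3.5.4"  # hypothetical, shortened input
version_re = re.compile(r"(\d+)\.(\d+)\.(\d+)")

# \1 and \2 in the replacement refer to the first and second groups,
# so only "major.minor" is kept.
major_minor = version_re.sub(r"\1.\2", re_str)

def bump_patch(m):
    """Hypothetical helper: rebuild the version with the patch number + 1."""
    major, minor, patch = m.groups()
    return f"{major}.{minor}.{int(patch) + 1}"

# When repl is a function, it is called once per match with the Match object.
bumped = version_re.sub(bump_patch, re_str)

print(major_minor)
print(bumped)
```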

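7. re.split(pattern, string, maxsplit=0, flags=0)

The split function of the re module does the same job as the split method of Python strings: both break a string into a list of substrings. The difference is that re.split can split on a regular expression. maxsplit sets the maximum number of splits; if it is not given, the string is split at every match.

print(re.split(r'\d+','one1two2three3four4five5'))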
>['one', 'two', 'three', 'four', 'five', '']
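The effect of maxsplit, and two common variations, can be sketched with the same input string. The variable names are illustrative.

```python
import re

text = 'one1two2three3four4five5'

# maxsplit=2: split at the first two matches only; the rest stays in one piece.
limited = re.split(r'\d+', text, maxsplit=2)

# With a capturing group in the pattern, the delimiters are kept in the result.
with_delims = re.split(r'(\d+)', text, maxsplit=2)

# The trailing '' comes from the delimiter at the very end; filter it out if unwanted.
no_empty = [part for part in re.split(r'\d+', text) if part]

print(limited)
print(with_delims)
print(no_empty)
```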