Python programming - regex (with examples)

Python Regular Expressions

String handling comes up in almost every program, so we constantly need to check whether a string fits some pattern. Regular expressions are a powerful tool for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that complies with the rule is considered a "match"; otherwise, the string is not valid.

Python's re module provides all the regular expression features. To use regular expressions, first import the module with import re.
The compile function takes a pattern string and an optional flags parameter and generates a regular expression object. This object has a series of methods for regular expression matching and replacement.

First, the basics

In a regular expression, a literal string is an exact match. \d matches one digit, and \w matches one letter, digit, or underscore (for str patterns this includes non-ASCII letters such as Chinese characters);
. matches any single character (except a newline, by default);
To match a variable number of characters, use + for at least one character, * for any number of characters (including zero), ? for zero or one character, {n} for exactly n characters, and {n,m} for n to m characters.

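import re  # import the library

# Overview of the main functions (arguments omitted here)
re.match()        # matches from the start of the string; if the pattern does not match at the start, the result is None
re.search()       # searches the whole string and returns the first successful match
re.findall()      # searches the whole string and returns all matches as a list (the most commonly used)
re.compile()      # compiles a regular expression into a regular expression (pattern) object
re.split()        # splits a string at every match of the regular expression and returns a list
re.sub()          # replaces every substring that matches the regular expression and returns the new string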

Example:

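import re  # import the library
astr = "s34份额是否as586c河1源..is市9.1d3防4H不0防h7b不仅4.r5cd"

print(re.findall(r"\d", astr))         # findall() finds every digit in the string and returns them as a list
print(re.search(r"\w", astr))          # matches one word character and returns the first match
print(re.findall(r"\w\d", astr))       # a word character (letter, digit, or Chinese character) followed by a digit; all matches as a list
print(re.findall(r"\d.\d", astr))      # three characters: the first and third are digits, the middle one is any character
print(re.findall(r"\d+", astr))        # runs of one or more digits
print(re.findall(r"\d\d+", astr))      # runs of two or more digits
print(re.findall(r"\d\d*", astr))      # runs of one or more digits (same as \d+)
print(re.findall(r"\d\d?", astr))      # ? means zero or one, so this matches one or two digits
print(re.findall(r"\d{3}", astr))      # exactly three consecutive digits
print(re.findall(r"\d{1,3}", astr))    # greedy: tries three digits first, then two, then one
print(re.findall(r"\d{2,3}?", astr))   # a trailing ? makes the match non-greedy: two digits are taken whenever two are available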

Output (Python 3.7 and later display the match object as re.Match instead of _sre.SRE_Match):
['3', '4', '5', '8', '6', '1', '9', '1', '3', '4', '0', '7', '4', '5']
<_sre.SRE_Match object; span=(0, 1), match='s'>
['s3', 's5', '86', '河1', '市9', 'd3', '防4', '不0', 'h7', '仅4', 'r5']
['586', '9.1', '3防4']
['34', '586', '1', '9', '1', '3', '4', '0', '7', '4', '5']
['34', '586']
['34', '586', '1', '9', '1', '3', '4', '0', '7', '4', '5']
['34', '58', '6', '1', '9', '1', '3', '4', '0', '7', '4', '5']
['586']
['34', '586', '1', '9', '1', '3', '4', '0', '7', '4', '5']
['34', '58']
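
The difference between greedy and non-greedy quantifiers is easier to see on a shorter string. The following is a minimal sketch; the sample string text and the variable names are made up for illustration and are not part of the original example.

```python
import re

# Hypothetical sample string, chosen only to illustrate greedy vs. non-greedy.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", text)

# Non-greedy: .*? stops at the first '>' it can reach, giving one match per tag.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```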

Second, more precise matching

For more precise matching, use [] to represent a range, for example:

[0-9a-zA-Z_] matches one digit, letter, or underscore;
[0-9a-zA-Z_]+ matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'py3000', and the like;
[a-zA-Z_][0-9a-zA-Z_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, which is the form of a valid Python variable name;
[a-zA-Z_][0-9a-zA-Z_]{0,19} limits the variable name more precisely to 1-20 characters (1 leading character plus 0-19 following characters).

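import re
astr = "A_c8fd33jd9_k0ja3"
print(re.findall(r"[0-9a-zA-Z_]", astr))                  # one character at a time
print(re.findall(r"[0-9a-zA-Z_]+", astr))                 # one or more characters
print(re.findall(r"[a-zA-Z_][0-9a-zA-Z_]*", astr))        # a letter or underscore, then any number of characters
print(re.findall(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}", astr))   # at most 20 characters in total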

Output:
['A', '_', 'c', '8', 'f', 'd', '3', '3', 'j', 'd', '9', '_', 'k', '0', 'j', 'a', '3']
['A_c8fd33jd9_k0ja3']       # greedy match
['A_c8fd33jd9_k0ja3']
['A_c8fd33jd9_k0ja3']
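
findall looks for matching substrings, so it does not tell us whether a whole string is a valid name. A minimal sketch of such a check follows, assuming re.fullmatch is used for validation; the helper is_valid_name and the constant NAME_PATTERN are hypothetical names introduced here. It covers only the ASCII rule described above, not every name Python itself accepts.

```python
import re

# Pattern from the text: a letter or underscore, then up to 19 more
# letters, digits, or underscores (1 to 20 characters in total).
NAME_PATTERN = re.compile(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}")

def is_valid_name(name):
    """Return True if the whole string satisfies the pattern (hypothetical helper)."""
    # fullmatch requires the entire string to match, not just a part of it.
    return NAME_PATTERN.fullmatch(name) is not None

print(is_valid_name("A_c8fd33jd9_k0ja3"))  # True
print(is_valid_name("3abc"))               # False: starts with a digit
print(is_valid_name("a" * 21))             # False: longer than 20 characters
```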

A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.
^ represents the beginning of a line; ^\d means the line must start with a digit.
$ represents the end of a line; \d$ means the line must end with a digit.
You may have noticed that py also matches inside 'python'; with ^py$ the pattern has to match the entire line, so it matches only 'py'.
A Python string prefixed with r is a raw string, in which backslashes are not treated as escape characters.

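import re
astr = """csd3份额是否a9
sjhbh353758cdbsv河
1源..is市....cd3防4H不胜
防hh787bb不仅4.r5cd是is范德萨；‘’a8"""

print(re.findall(r"^\d", astr))          # without re.M the whole string is treated as one line; returns the digit if the string starts with one, otherwise an empty list
print(re.findall(r"^\d", astr, re.M))    # re.M treats astr as 4 lines; returns the leading digit of every line that starts with a digit
print(re.findall(r"\d$", astr))          # the string ends with a digit
print(re.findall(r"\d$", astr, re.M))    # re.M treats astr as 4 lines; returns the last digit of every line that ends with a digit, as a list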

Output:
[]
['1']
['8']
['9', '8']
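
The alternation, anchoring, and raw-string rules described before this example can be tried out on a few short words. This is only a sketch; the word list and variable names are made up for illustration.

```python
import re

words = ["Python", "python", "py", "jython"]   # made-up sample words

# (P|p)ython accepts either capitalisation of the first letter.
either_case = [w for w in words if re.search(r"^(P|p)ython$", w)]

# Unanchored, 'py' is found inside 'python' as well as in 'py' itself.
contains_py = [w for w in words if re.search(r"py", w)]

# Anchored with ^ and $, the pattern has to cover the whole line.
exactly_py = [w for w in words if re.search(r"^py$", w)]

print(either_case)   # ['Python', 'python']
print(contains_py)   # ['python', 'py']
print(exactly_py)    # ['py']

# A raw string keeps the backslash: "\n" is one character, r"\n" is two.
print(len("\n"), len(r"\n"))   # 1 2
```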

Third, using regular expressions in practice

1, querying

(1) re.match function

re.match tries to match the pattern at the start of the string. If the match succeeds at the start position, re.match() returns a match object; if the pattern does not match at the start position, match() returns None.

The syntax is as follows: re.match(pattern, string, flags=0)

(1) pattern: the regular expression to match;
(2) string: the string to match against;
(3) flags: flag bits (default 0) that control how the regular expression is matched;

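Common flags are as follows:

(1) re.I ignores case when matching
(2) re.M multi-line matching, affects ^ and $
(3) re.S makes . match every character including the newline (by default . does not match a newline)
(4) re.U interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B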

Examples are as follows:

As we can see, when the match succeeds, re.match returns a match object rather than the matched text; span() returns the position of the match.

>>> import re
>>> astr='11you are 3344 my apple\n 11开心果，you\n66a77'
>>> re.match('11',astr)
<_sre.SRE_Match object; span=(0, 2), match='11'>
>>> re.match('11',astr).span()
(0, 2)
>>> print(re.match('you',astr))
None

To get the matched text itself, use the group(num) or groups() method of the match object.

For example, with re.match(r'\d(\d)(.)', astr), the pattern may contain several pairs of brackets, and each pair of brackets is a group.

group(0) is the string matched by the whole expression, i.e. \d(\d)(.);
group(1) is the content of the first pair of brackets, i.e. (\d); and so on;
group(num), with num = 2, 3, 4 ..., is the content of the corresponding pair of brackets;
groups() returns the contents of all the brackets as a tuple;

>>> import re
>>> astr='11you are 3344 my apple\n 11开心果，you\n66a77'
>>> re.match(r'\d(\d)(.)',astr,re.S).group(0)
'11y'

>>> re.match(r'\d(\d)(.)',astr,re.S).group(1)
'1'

>>> re.match(r'\d(\d)(.)',astr,re.S).group(2)
'y'

>>> re.match(r'\d(\d)(.)',astr,re.S).groups()
('1', 'y')
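
Because re.match returns None when the pattern does not match, calling group() directly on the result raises an AttributeError in that case. A minimal sketch of guarding against this follows; the helper first_groups is a hypothetical name introduced here.

```python
import re

astr = '11you are 3344 my apple\n 11开心果，you\n66a77'

def first_groups(pattern, text):
    """Return the groups of a match at the start of text, or None (hypothetical helper)."""
    m = re.match(pattern, text, re.S)
    if m is None:
        # No match at the start of the string: there is no match object to query.
        return None
    return m.groups()

print(first_groups(r'\d(\d)(.)', astr))   # ('1', 'y')
print(first_groups(r'[a-z](\d)', astr))   # None
```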

(2) re.search function

Searches the entire string and returns the first successful match.

The syntax is as follows: re.search(pattern, string, flags=0)

(1) pattern: the regular expression to match;
(2) string: the string to search;
(3) flags: flag bits (default 0) that control how the regular expression is matched;

Common flags are as follows:

(1) re.I ignores case when matching
(2) re.M multi-line matching, affects ^ and $
(3) re.S makes . match every character including the newline (by default . does not match a newline)
(4) re.U interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B

Examples are as follows:
As we can see, when the search succeeds, re.search returns a match object rather than the matched text; span() returns the position of the match. If there is no match, None is returned.

>>> import re
>>> astr='11you are 3344 my apple\n 11开心果，you\n66a77'
>>> re.search('11',astr)
<_sre.SRE_Match object; span=(0, 2), match='11'>

>>> re.search('you',astr)
<_sre.SRE_Match object; span=(2, 5), match='you'>

>>> re.search('you',astr).span()   # span() returns the position of the match
(2, 5)
 
>>> re.search('11',astr).span()
(0, 2)
 
>>> print(re.search('22',astr))
None

To get the matched text itself, use the group(num) or groups() method of the match object.

For example, with re.search(r'\d(\d)(.)', astr), the pattern may contain several pairs of brackets, and each pair of brackets is a group.

group(0) is the string matched by the whole expression, i.e. \d(\d)(.)
group(1) is the content of the first pair of brackets, i.e. (\d); and so on
group(num), with num = 2, 3, 4 ..., is the content of the corresponding pair of brackets;
groups() returns the contents of all the brackets as a tuple.

>>> import re
>>> astr='1you are 3344 my apple\n 11开心果，you\n66a77'
>>> re.search(r'\d(\d)(.)',astr,re.S).group(0)
'334'

>>> re.search(r'\d(\d)(.)',astr,re.S).group(1)
'3'

>>> re.search(r'\d(\d)(.)',astr,re.S).group(2)
'4'

>>> re.search(r'\d(\d)(.)',astr,re.S).groups()
('3', '4')

The difference between re.match and re.search:
re.match matches only at the beginning of the string; if the beginning of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the entire string until it finds a match; if none is found, it returns None.

Note: match and search return a single match, whereas findall returns all matches.
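
The three functions can be compared side by side on one string. This is a small sketch; the sample string text and the variable names are made up for illustration.

```python
import re

text = "abc 123 def 456"   # made-up sample string

at_start = re.match(r"\d+", text)    # None: the string does not start with a digit
first = re.search(r"\d+", text)      # first run of digits anywhere in the string
every = re.findall(r"\d+", text)     # all runs of digits, as a list of strings

print(at_start)        # None
print(first.group())   # 123
print(every)           # ['123', '456']
```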

(3) re.findall function

Finds all substrings of the string that match the regular expression and returns them as a list; if no match is found, an empty list is returned.

The syntax is as follows: re.findall(pattern, string, flags=0)

Examples are as follows:

>>> import re
>>> astr='1you are 3344 my apple\n 11开心果，you\n66a77'
>>> re.findall(r'\d\d',astr)         # all pairs of digits, as a list
['33', '44', '11', '66', '77']

>>> re.findall(r'\d{2,4}',astr)      # all runs of 2 to 4 digits, as a list; greedy by default
['3344', '11', '66', '77']

>>> re.findall(r'\d+',astr)          # one or more digits
['1', '3344', '11', '66', '77']

>>> re.findall(r'\d*',astr)          # zero or more digits, so empty matches are included
['1', '', '', '', '', '', '', '', '', '3344', '', '', '', '', '', '', '', '', '', '', '', '11', '', '', '', '', '', '', '', '', '66', '', '77', '']

>>> re.findall(r'\d?',astr)          # zero or one digit
['1', '', '', '', '', '', '', '', '', '3', '3', '4', '4', '', '', '', '', '', '', '', '', '', '', '', '1', '1', '', '', '', '', '', '', '', '', '6', '6', '', '7', '7', '']

>>> re.findall(r'\d{2,3}?',astr)     # a ? after a quantifier makes it non-greedy: with {2,3}? two digits are taken whenever two are available
['33', '44', '11', '66', '77']

>>> re.findall(r'\d.\d',astr)        # two digits with any character between them
['334', '6a7']

>>> re.findall(r'^\d',astr)          # starts with a digit
['1']

>>> re.findall(r'^\d',astr,re.M)     # multi-line matching
['1', '6']

>>> re.findall(r'\d$',astr)          # ends with a digit
['7']

>>> re.findall(r'\d$',astr,re.M)     # multi-line matching, affects ^ and $
['7']

>>> re.findall(r'\d(.)(\d)',astr,re.S)   # returned as a list; each item is a tuple of the groups
[('3', '4'), ('a', '7')]

(4) re.compile function

The compile function compiles a regular expression and generates a regular expression (Pattern) object.

The syntax is as follows: re.compile(pattern, flags=0)
pattern: the regular expression to match;
flags: flag bits (default 0) that control how the regular expression is matched

Common flags are as follows:

re.I ignores case when matching
re.M multi-line matching, affects ^ and $
re.S makes . match every character including the newline (by default . does not match a newline)
re.U interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B

Examples are as follows:

>>> import re
>>> astr='AS12as34er567q!"3456'
>>> m1=re.compile(r'\d\d')     # compile
>>> m1.search(astr).group()    # match
'12'

>>> m1.findall(astr)
['12', '34', '56', '34', '56']

>>> m2=re.compile(r'a',re.I)  # compile
>>> m2.findall(astr)          # match
['A', 'a']
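
A compiled pattern object is convenient when the same expression is applied to many strings. A minimal sketch follows; the list lines and the names two_digits and results are made up for illustration.

```python
import re

# Compile once, then reuse the same pattern object for every string.
two_digits = re.compile(r"\d\d")

lines = ["AS12as34", "no digits here", "er567q"]   # made-up sample data

# Map each input line to the list of two-digit matches found in it.
results = {line: two_digits.findall(line) for line in lines}

for line, found in results.items():
    print(line, found)
```

The pattern object also keeps the original expression in its pattern attribute, which can help when debugging.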

(5) re.split function

Splits a string at every match of the regular expression and returns a list.

The syntax is as follows: re.split(pattern, string, maxsplit=0, flags=0)

pattern: the regular expression to match;

string: the string to split;

maxsplit: the maximum number of splits; maxsplit=1 splits once, and the default 0 means the number of splits is not limited.

flags: flag bits (default 0) that control how the regular expression is matched

Common flags are as follows:

re.I ignores case when matching

re.M multi-line matching, affects ^ and $

re.S makes . match every character including the newline (by default . does not match a newline)

re.U interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B

>>> import re
>>> astr='AS12as34er567q!"3456'
>>> astr.split('12')           # split on the literal string 12
['AS', 'as34er567q!"3456']

>>> re.split(r"\d{2}",astr)    # split on every pair of digits
['AS', 'as', 'er', '7q!"', '', '']

>>> re.split(r"\d+",astr)      # split on every run of digits
['AS', 'as', 'er', 'q!"', '']

>>> m3=re.compile(r'\d+')      # equivalent to the call above, using compile
>>> m3.split(astr)
['AS', 'as', 'er', 'q!"', '']

>>> m3.split(astr,3)           # limit the number of splits
['AS', 'as', 'er', 'q!"3456']
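
re.split is most useful when a string mixes several delimiters, which str.split cannot handle in one call. A small sketch follows; the sample string record and the variable names are made up for illustration.

```python
import re

record = "apple, banana;cherry  date"   # made-up sample string

# Split on a comma, semicolon, or whitespace, allowing several in a row.
parts = re.split(r"[,;\s]+", record)

# maxsplit=2 stops after two splits; the rest of the string is kept whole.
first_two = re.split(r"[,;\s]+", record, maxsplit=2)

print(parts)       # ['apple', 'banana', 'cherry', 'date']
print(first_two)   # ['apple', 'banana', 'cherry  date']
```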

(6) re.sub function

Replaces every substring of a string that matches the regular expression and returns the string after replacement.

The syntax is as follows: re.sub(pattern, repl, string, count=0, flags=0)

pattern: the regular expression to match;

repl: the replacement string; it may also be a function;

string: the string in which to search and replace;

count: the maximum number of replacements; the default 0 replaces all occurrences;

flags: flag bits (default 0) that control how the regular expression is matched;

Common flags are as follows:

re.I ignores case when matching;

re.M multi-line matching, affects ^ and $

re.S makes . match every character including the newline (by default . does not match a newline)

re.U interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B

Examples are as follows:

>>> import re
>>> astr='AS12as34er567q!"3456'
>>> re.sub("5",'9',astr)     # replace 5 with 9
'AS12as34er967q!"3496'

>>> m4=re.compile(r"\d+")
>>> m4.sub(' ',astr)         # replace each run of digits with a space
'AS as er q!" '

>>> m4.sub(' ',astr,2)       # limit the number of replacements
'AS as er567q!"3456'

The repl parameter can also be a function; here it is used to multiply each digit in the string by 2:

>>> import re
>>> def f(m):
...     return str(2*int(m.group()))   # m is the match object for one digit
...
>>> re.sub(r'\d',f,'a2233q')
'a4466q'
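
Besides a function, the replacement string itself can refer back to bracketed groups with \1, \2, and so on. The sketch below shows both forms; the sample strings and the helper mask are made up for illustration.

```python
import re

date = "2024-01-31"   # made-up sample string

# \1, \2, \3 in the replacement refer to the three bracketed groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

def mask(m):
    # Replace every matched run of digits with '*' of the same length.
    return "*" * len(m.group())

masked = re.sub(r"\d+", mask, "card 1234 pin 99")

print(swapped)   # 31/01/2024
print(masked)    # card **** pin **
```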

Fourth, supplementary notes:

1, regular expression modifiers - optional flags

Regular expressions can include optional flags to control how the pattern is matched. A modifier is specified as an optional flag. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:

Modifier	Description
re.I	Matching is not case sensitive
re.L	Locale-aware matching
re.M	Multi-line matching, affects ^ and $
re.S	Makes . match every character including the newline
re.U	Interprets characters according to the Unicode character set (the default for str patterns in Python 3). This flag affects \w, \W, \b, \B
re.X	Allows a more flexible format, so that the regular expression can be written in a way that is easier to understand.
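
The effect of the flags, alone and combined with |, can be checked with a short experiment. The following is a sketch only; the sample string text and all variable names are made up for illustration.

```python
import re

text = "Name: alice\nname: Bob"   # made-up sample string

# Without flags, ^ matches only at the very start and matching is case sensitive.
plain = re.findall(r"^name: (\w+)", text)

# re.I | re.M: ignore case and let ^ match at the start of every line.
both = re.findall(r"^name: (\w+)", text, re.I | re.M)

# By default '.' does not match the newline; re.S lets it match.
dot_default = re.search(r"alice.name", text)
dot_all = re.search(r"alice.name", text, re.S)

# re.X allows whitespace and comments inside the pattern itself.
phone = re.compile(r"""
    \d{3}   # area code
    -       # separator
    \d{8}   # number
""", re.X)

print(plain)                                           # []
print(both)                                            # ['alice', 'Bob']
print(dot_default)                                     # None
print(dot_all is not None)                             # True
print(phone.fullmatch("010-12345678") is not None)     # True
```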

2, common regular expression patterns

Common patterns

Pattern	Description
^	Matches the beginning of the string
$	Matches the end of the string
.	Matches any character except a newline; when the re.S flag is specified, matches any character including a newline
*	Repeats the previous character 0 or more times
+	Repeats the previous character 1 or more times
?	Repeats the previous character 0 or 1 times; placed after another quantifier, it makes that quantifier non-greedy
{m}	Repeats the previous character m times. For example, ab{2}c matches abbc
{m,n}	Repeats the previous character m to n times (n included), greedy matching
a|b	Matches a or b
()	Marks a group; the | operator can be used inside it
\w	Matches letters, digits, and the underscore; equivalent to [A-Za-z0-9_] for ASCII text (str patterns also match other word characters, such as Chinese characters)
\W	Matches any character that is not a letter, digit, or underscore; equivalent to [^A-Za-z0-9_] for ASCII text
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text
\S	Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v] for ASCII text
\d	Matches any digit; equivalent to [0-9] for ASCII text
\b	Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but does not match the 'er' in "verb".
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but does not match the 'er' in "never"
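
The \b and \B rows describe positions rather than characters, which is easiest to see from the match spans. The sketch below also checks how \w behaves with and without re.ASCII; the sample strings and variable names are made up for illustration.

```python
import re

text = "never verb"

# er\b: 'er' followed by a word boundary, i.e. the end of "never".
end_of_word = re.search(r"er\b", text).span()

# er\B: 'er' NOT followed by a word boundary, i.e. the 'er' inside "verb".
inside_word = re.search(r"er\B", text).span()

print(end_of_word)   # (3, 5)
print(inside_word)   # (7, 9)

# For str patterns \w is Unicode-aware; re.ASCII restricts it to [A-Za-z0-9_].
unicode_words = re.findall(r"\w", "a河_")
ascii_words = re.findall(r"\w", "a河_", re.ASCII)

print(unicode_words)   # ['a', '河', '_']
print(ascii_words)     # ['a', '_']
```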

3, regular expression examples

Example	Description
^[A-Za-z]+$	String made up of the 26 letters
^[A-Za-z0-9]+$	String made up of the 26 letters and digits
^-?\d+$	Integer string
[1-9]\d{5}	Chinese domestic postal code, 6 digits
[\u4e00-\u9fa5]	Matches a Chinese character
\d{3}-\d{8}|\d{4}-\d{7}	Telephone number
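
The example patterns can be tried against a few inputs. This is a minimal sketch: the dictionary checks, the helper matches, and all sample inputs are made up for illustration, and the Chinese range is wrapped in [ ] here so that it forms a character class.

```python
import re

# Patterns based on the examples table; the keys are hypothetical labels.
checks = {
    "letters": r"^[A-Za-z]+$",
    "letters_digits": r"^[A-Za-z0-9]+$",
    "integer": r"^-?\d+$",
    "zip_code": r"[1-9]\d{5}",
    "chinese": r"[\u4e00-\u9fa5]",
    "phone": r"\d{3}-\d{8}|\d{4}-\d{7}",
}

def matches(name, text):
    """Return True if the named pattern is found in text (hypothetical helper)."""
    return re.search(checks[name], text) is not None

print(matches("letters", "abcXYZ"))        # True
print(matches("letters", "abc1"))          # False
print(matches("integer", "-42"))           # True
print(matches("chinese", "regex 正则"))     # True
print(matches("phone", "010-12345678"))    # True
```

The patterns without ^ and $ only check that a matching substring exists; anchoring them (or using re.fullmatch) would be needed to validate a whole input.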