Table of Contents
- 1. Regular expression syntax
- 2. Basic use of the re library
- 3. The re library's Match object
- 4. Greedy and minimal matching in the re library
A regular expression is a concise way to describe a set of strings.
1. Regular expression syntax
1.1 Common regular expression operators
Operator | Description | Example |
---|---|---|
. | Any single character (except a newline by default) | |
[ ] | Character set: any single character from the given set or range | [abc] matches a, b, or c; [a-z] matches a single character from a to z |
[^ ] | Negated character set: any single character not in the given set or range | [^abc] matches any single character other than a, b, or c |
* | Zero or more repetitions of the preceding character | abc* matches ab, abc, abcc, abccc, etc. |
+ | One or more repetitions of the preceding character | abc+ matches abc, abcc, abccc, etc. |
? | Zero or one repetition of the preceding character | abc? matches ab, abc |
\| | Alternation: either the left or the right expression | abc\|def matches abc or def |
{m} | Exactly m repetitions of the preceding character | ab{2}c matches abbc |
{m,n} | From m to n repetitions of the preceding character (n included) | ab{1,2}c matches abc, abbc |
^ | Matches at the beginning of the string | ^abc matches abc only at the beginning of a string |
$ | Matches at the end of the string | abc$ matches abc only at the end of a string |
( ) | Grouping marker; the \| operator can be used inside it | (abc) matches abc; (abc\|def) matches abc or def |
\d | Digit; equivalent to [0-9] for ASCII text | |
\D | Non-digit | |
\w | Word character (digit / letter / underscore); equivalent to [A-Za-z0-9_] for ASCII text | |
\W | Non-word character (not a digit, letter, or underscore) | |
\s | Whitespace: space / \t / \n | |
\S | Non-whitespace: not a space, \t, or \n | |
Example:
. Any character (except newline)
Matches any single character except a newline.
import re
s = "adasdasdasdasfasf\n\t"
# .: any character (except newline)
print(re.findall(".", s))
['a', 'd', 'a', 's', 'd', 'a', 's', 'd', 'a', 's', 'd', 'a', 's', 'f', 'a', 's', 'f', '\t']
[] Metacharacter (character set)
Matches any single character listed between the brackets.
A range can also be used: [a-z] matches a single character from a to z.
import re
s = "adasdasdasdasfasf\n\t"
# []: match any single character listed between the brackets
print(re.findall("[acef]", s))
['a', 'a', 'a', 'a', 'a', 'f', 'a', 'f']
[^] Negation
A ^ at the start of [] negates the set: it matches any single character except those listed in the brackets.
import re
s = "adasdasdasdasfasf\n\t"
# [^]: exclude the characters listed in []
print(re.findall("[^acef]", s))
['d', 's', 'd', 's', 'd', 's', 'd', 's', 's', '\n', '\t']
* Zero or more repetitions of the preceding character
*: matches the preceding character zero or more times, so the empty string also matches
import re
s = r"abaacaaaaa"
# *: match the character before * zero or more times
print(re.findall("a*", s))  # zero or more a's; empty matches are included
['a', '', 'aa', '', 'aaaaa', '']
+ One or more repetitions of the preceding character
+: matches the preceding character one or more times
import re
s = r"abaacaaaaa"
# +: match the character before + one or more times
print(re.findall("a+", s))  # one or more a's
['a', 'aa', 'aaaaa']
? Zero or one repetition of the preceding character
?: matches the preceding character zero or one time
import re
s = r"abaacaaaaa"
# ?: match the character before ? zero or one time
print(re.findall("a?", s))  # zero or one a
['a', '', 'a', 'a', '', 'a', 'a', 'a', 'a', 'a', '']
| Matches either the left or the right expression
A|B: matches either A or B
import re
s = 'abacad'
# A|B: match either A or B
print(re.findall('a|b', s))
['a', 'b', 'a', 'a']
{m} Exactly m repetitions of the preceding character
{m}: matches the preceding character exactly m times
import re
s = r"abaacaaaaa"
# {m}: match the preceding character m times
print(re.findall("a{2}", s))  # exactly 2 a's
['aa', 'aa', 'aa']
{m,n} From m to n repetitions of the preceding character (n included)
{m,n}: matches the preceding character m to n times
import re
s = r"abaacaaaaa"
# {m,n}: match the preceding character m to n times
print(re.findall("a{2,3}", s))  # 2 or 3 a's
['aa', 'aaa', 'aa']
^ Metacharacter
The pattern matches only if the beginning of the string conforms to it; otherwise there is no match.
Matches at the beginning of the string; in multi-line mode (the re.M flag) it also matches at the beginning of each line.
import re
s = '王大炮打炮被大炮打死了 王大炮打炮被大炮打死了'
# ^: match at the beginning
print(re.findall("^王大炮", s))
['王大炮']
$ Metacharacter
The pattern matches only if the end of the string conforms to it; otherwise there is no match.
Matches at the end of the string; in multi-line mode it also matches at the end of each line.
import re
s = '王大炮打炮被大炮打死了 王大炮打炮被大炮打死了'
# $: match at the end
print(re.findall("打死了$", s))
['打死了']
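The multi-line behaviour of ^ and $ can be sketched as follows. This is a minimal illustration that passes the standard re.M flag through the flags keyword; the sample text is made up for the example.

```python
import re

# Hypothetical sample text with three lines.
text = "abc 1\nabc 2\nxyz 3"

# Without re.M, ^ anchors only at the very start of the string.
start_only = re.findall(r"^abc", text)

# With re.M, ^ also anchors right after each newline.
each_line = re.findall(r"^abc", text, flags=re.M)

# With re.M, $ also anchors right before each newline.
line_ends = re.findall(r"\d$", text, flags=re.M)

print(start_only)  # ['abc']
print(each_line)   # ['abc', 'abc']
print(line_ends)   # ['1', '2', '3']
```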
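() Keep only what is inside the parentheses
(): when the pattern contains a group, re.findall returns only the text captured by the group
import re
s = 'abacad'
# (): keep only what is inside the parentheses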
print(re.findall('a(.)', s))
['b', 'c', 'd']
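To make the effect of groups on re.findall more concrete, here is a small sketch on the same string. The variable names are illustrative only; the (?:...) form is the standard non-capturing group syntax.

```python
import re

s = 'abacad'

# No group: findall returns each whole match.
whole = re.findall(r'a.', s)

# One capturing group: findall returns only the captured text.
one_group = re.findall(r'a(.)', s)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(a)(.)', s)

# Non-capturing group: groups for alternation but returns the whole match.
non_capturing = re.findall(r'a(?:b|c)', s)

print(whole)          # ['ab', 'ac', 'ad']
print(one_group)      # ['b', 'c', 'd']
print(two_groups)     # [('a', 'b'), ('a', 'c'), ('a', 'd')]
print(non_capturing)  # ['ab', 'ac']
```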
\d matches a single digit (0-9)
\d: matches a single digit
import re
s = '1#@¥23abc123 \n_def\t456'
# \d: match a single digit
print(re.findall(r"\d", s))  # each digit is matched separately
['1', '2', '3', '1', '2', '3', '4', '5', '6']
\D matches a single non-digit (including \n)
\D: matches a single non-digit
import re
s = '1#@¥23abc123 \n_def\t456'
# \D: match a single non-digit
print(re.findall(r"\D", s))  # each non-digit, including \n, is matched separately
['#', '@', '¥', 'a', 'b', 'c', ' ', '\n', '_', 'd', 'e', 'f', '\t']
\w matches a digit / letter / underscore
\w: digit / letter / underscore
import re
s = '1#@¥23abc123 \n_def\t456'
# \w: match a digit / letter / underscore
print(re.findall(r"\w", s))
['1', '2', '3', 'a', 'b', 'c', '1', '2', '3', '_', 'd', 'e', 'f', '4', '5', '6']
\W matches anything that is not a digit / letter / underscore
\W: not a digit / not a letter / not an underscore
import re
s = '1#@¥23abc123 \n_def\t456'
# \W: not a digit / not a letter / not an underscore
print(re.findall(r"\W", s))
['#', '@', '¥', ' ', '\n', '\t']
\s matches a space / \t / \n
\s: space / \t / \n
import re
s = '1#@¥23abc123 \n_def\t456'
# \s: space / \t / \n
print(re.findall(r"\s", s))
[' ', '\n', '\t']
\S matches anything that is not a space / \t / \n
\S: not a space / not \t / not \n
import re
s = '1#@¥23abc123 \n_def\t456'
# \S: not a space / not \t / not \n
print(re.findall(r"\S", s))
['1', '#', '@', '¥', '2', '3', 'a', 'b', 'c', '1', '2', '3', '_', 'd', 'e', 'f', '4', '5', '6']
2. Basic use of the re library
2.1 Introduction to the re library
re is the module in Python's standard library for regular expressions; it is mainly used for string matching.
**Import: import re**
2.2 Main functions of the re library
Function | Description |
---|---|
re.search() | Searches a string for the first position where the regular expression matches; returns a match object |
re.match() | Matches the regular expression from the start of a string; returns a match object |
re.findall() | Searches a string and returns all matching substrings as a list |
re.split() | Splits a string at the places where the regular expression matches; returns a list |
re.finditer() | Searches a string and returns an iterator over the matches; each element is a match object |
re.sub() | Replaces every substring that matches the regular expression; returns the string after replacement |
re.search(pattern,string,flags=0)
Searches the string for the first position where the regular expression matches and returns a match object (or None if nothing matches).
- pattern: the regular expression, given as a string or a raw string
- string: the string to be matched
- flags: control flags applied when the regular expression is used
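As a minimal sketch of re.search, the snippet below looks for a six-digit code in a made-up string; the sample text and the pattern are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text containing two six-digit codes.
text = "BIT 100081 TSU 100084"

# search scans the whole string and stops at the first match.
match = re.search(r"[1-9]\d{5}", text)
if match:
    print(match.group(0))  # 100081

# When nothing matches, search returns None, so test before calling .group().
no_match = re.search(r"[a-z]+", text)
print(no_match)  # None
```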
re.match(pattern,string,flags=0)
Matches the regular expression from the start of the string and returns a match object (or None if the start of the string does not match).
- pattern: the regular expression, given as a string or a raw string
- string: the string to be matched
- flags: control flags applied when the regular expression is used
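The difference between re.match and re.search is easiest to see side by side. This is a small sketch on a made-up string; the sample text is an assumption for illustration.

```python
import re

# Hypothetical sample text: the six-digit code is not at the start.
text = "BIT 100081"

# match only tries at position 0, where the text is "BIT", so it fails.
at_start = re.match(r"[1-9]\d{5}", text)

# search scans forward and finds the code later in the string.
anywhere = re.search(r"[1-9]\d{5}", text)

# match succeeds when the pattern fits the beginning of the string.
leading = re.match(r"[A-Z]+", text)

print(at_start)           # None
print(anywhere.group(0))  # 100081
print(leading.group(0))   # BIT
```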
re.findall(pattern,string,flags=0)
Searches the string and returns all matching substrings as a list.
- pattern: the regular expression, given as a string or a raw string
- string: the string to be matched
- flags: control flags applied when the regular expression is used
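A short sketch of re.findall, with and without a flag, might look like this. The sample text is made up for the example.

```python
import re

# Hypothetical sample text.
text = "BIT100081 TSU100084"

# All non-overlapping matches, in order of appearance.
codes = re.findall(r"[1-9]\d{5}", text)

# flags is best passed by keyword; re.I makes [a-z] match uppercase too.
names = re.findall(r"[a-z]+", text, flags=re.I)

print(codes)  # ['100081', '100084']
print(names)  # ['BIT', 'TSU']
```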
re.split(pattern,string,maxsplit=0,flags=0)
Splits the string at the places where the regular expression matches and returns a list.
- pattern: the regular expression, given as a string or a raw string
- string: the string to be matched
- maxsplit: the maximum number of splits; the remainder of the string is returned as the last element
- flags: control flags applied when the regular expression is used
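The effect of maxsplit can be sketched as follows; the sample text is an assumption made for illustration.

```python
import re

# Hypothetical sample text.
text = "BIT100081 TSU100084"

# Split at every match; the text after the last match is an empty string.
parts = re.split(r"[1-9]\d{5}", text)

# Split at most once; everything after the first split stays in one piece.
first_only = re.split(r"[1-9]\d{5}", text, maxsplit=1)

print(parts)       # ['BIT', ' TSU', '']
print(first_only)  # ['BIT', ' TSU100084']
```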
re.finditer(pattern,string,flags=0)
Searches the string and returns an iterator over the matches; each element produced by the iterator is a match object.
- pattern: the regular expression, given as a string or a raw string
- string: the string to be matched
- flags: control flags applied when the regular expression is used
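Because re.finditer yields match objects, the position of each match is available as well as its text. A minimal sketch on a made-up string:

```python
import re

# Hypothetical sample text.
text = "BIT100081 TSU100084"

# Collect the matched text and its (start, end) span for every match.
found = [(m.group(0), m.span()) for m in re.finditer(r"[1-9]\d{5}", text)]

print(found)  # [('100081', (3, 9)), ('100084', (13, 19))]
```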
re.sub(pattern,repl, string,count=0,flags=0)
Replaces every substring that matches the regular expression and returns the string after replacement.
- pattern: the regular expression, given as a string or a raw string
- repl: the string that replaces each matched substring
- string: the string to be matched
- count: the maximum number of matches to replace (0 means all)
- flags: control flags applied when the regular expression is used
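The count parameter can be sketched like this; the sample text and the replacement string are assumptions for illustration.

```python
import re

# Hypothetical sample text.
text = "BIT100081 TSU100084"

# Replace every match (count defaults to 0, meaning no limit).
all_replaced = re.sub(r"[1-9]\d{5}", ":zipcode", text)

# Replace only the first match.
first_replaced = re.sub(r"[1-9]\d{5}", ":zipcode", text, count=1)

print(all_replaced)    # BIT:zipcode TSU:zipcode
print(first_replaced)  # BIT:zipcode TSU100084
```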
2.3 An equivalent way to use the re library
regex = re.compile(pattern,flags=0)
Compiles the string form of a regular expression into a regular expression object.
- pattern: the regular expression, given as a string or a raw string
- flags: control flags applied when the regular expression is used
Method | Description |
---|---|
regex.search() | Searches a string for the first position where the regular expression matches; returns a match object |
regex.match() | Matches the regular expression from the start of a string; returns a match object |
regex.findall() | Searches a string and returns all matching substrings as a list |
regex.split() | Splits a string at the places where the regular expression matches; returns a list |
regex.finditer() | Searches a string and returns an iterator over the matches; each element is a match object |
regex.sub() | Replaces every substring that matches the regular expression; returns the string after replacement |
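A compiled pattern object offers the same operations as methods, without repeating the pattern. The sketch below uses a made-up string and a hypothetical variable name, zip_re.

```python
import re

# Hypothetical sample text.
text = "BIT100081 TSU100084"

# Compile once, then reuse the object for several operations.
zip_re = re.compile(r"[1-9]\d{5}")

first = zip_re.search(text).group(0)
all_codes = zip_re.findall(text)
masked = zip_re.sub("######", text)

# The module-level function gives the same result as the method.
same = re.findall(r"[1-9]\d{5}", text) == all_codes

print(first)      # 100081
print(all_codes)  # ['100081', '100084']
print(masked)     # BIT###### TSU######
print(same)       # True
```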
2.4 Control flags (flags) for regular expressions
Flag | Description |
---|---|
re.I | Ignore case when matching; [A-Z] can then also match lowercase letters |
re.L | Make matching locale-aware (in Python 3, allowed only with bytes patterns) |
re.M | Multi-line mode: ^ also matches at the start of each line, and $ at the end of each line |
re.S | Makes the . operator match every character; by default it matches every character except a newline |
re.U | Interpret characters according to the Unicode character set; affects \w, \W, \b, \B (the default for str patterns in Python 3) |
re.X | Verbose mode: whitespace and comments are allowed inside the pattern, which makes regular expressions easier to read |
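The flags can be tried out with a few one-liners. This is a minimal sketch on a made-up two-line string; flags are combined with the | operator.

```python
import re

# Hypothetical sample text with two lines.
text = "Hello\nworld"

# re.I: [A-Z] also matches lowercase letters.
upper_only = re.findall(r"[A-Z]+", text)
any_case = re.findall(r"[A-Z]+", text, flags=re.I)

# re.S: . also matches the newline between the two words.
dot_default = re.findall(r"o.w", text)
dot_all = re.findall(r"o.w", text, flags=re.S)

# Flags can be combined with |.
combined = re.findall(r"^h.*d$", text, flags=re.I | re.S)

# re.X: whitespace and # comments inside the pattern are ignored.
zip_re = re.compile(r"""
    [1-9]   # first digit is not zero
    \d{5}   # five more digits
""", re.X)
verbose = zip_re.findall("BIT100081")

print(upper_only)   # ['H']
print(any_case)     # ['Hello', 'world']
print(dot_default)  # []
print(dot_all)      # ['o\nw']
print(combined)     # ['Hello\nworld']
print(verbose)      # ['100081']
```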
3. The re library's Match object
A Match object is the result of a single match and carries a good deal of information about it.
3.1 Match object attributes
Attribute | Description |
---|---|
.string | The text that was searched |
.re | The pattern object (regular expression) used for the match |
.pos | Position in the text where the regular expression search starts |
.endpos | Position in the text where the regular expression search ends |
3.2 Match object methods
Method | Description |
---|---|
.group(0) | Returns the matched string |
.start() | Start position of the matched string in the original string |
.end() | End position of the matched string in the original string |
.span() | Returns (.start(), .end()) |
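The attributes and methods listed above can be inspected on a single match. A minimal sketch, using a made-up 19-character string:

```python
import re

# Hypothetical sample text (19 characters long).
text = "BIT100081 TSU100084"
m = re.search(r"[1-9]\d{5}", text)

# Attributes describe the search itself.
print(m.string)      # BIT100081 TSU100084
print(m.re.pattern)  # [1-9]\d{5}
print(m.pos)         # 0
print(m.endpos)      # 19

# Methods describe the matched substring.
print(m.group(0))    # 100081
print(m.start())     # 3
print(m.end())       # 9
print(m.span())      # (3, 9)
```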
4. Greedy and minimal matching in the re library
The re library uses greedy matching by default, i.e., it returns the longest substring that matches.
4.1 Minimal matching
Operator | Description |
---|---|
*? | Zero or more repetitions of the preceding character, minimal match |
+? | One or more repetitions of the preceding character, minimal match |
?? | Zero or one repetition of the preceding character, minimal match |
{m,n}? | From m to n repetitions of the preceding character (n included), minimal match |
Whenever an operator can produce matches of different lengths, appending ? to it turns it into a minimal match.
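.* Greedy mode
.*: greedy mode (maximal): after finding a match, keep going so that the result is as long as possible
import re
s = 'abbbcabc'
# .*: greedy mode (maximal): keep going after a match to make the result as long as possible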
print(re.findall('a.*c', s))
print(re.findall('a.+c', s))
['abbbcabc']
['abbbcabc']
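.*? Non-greedy mode (minimal match)
.*?: non-greedy mode (minimal): stop as soon as a match is found
import re
s = 'abbbcabc'
# .*?: non-greedy mode (minimal): stop as soon as a match is found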
print(re.findall('a.*?c', s))
print(re.findall('a.+?c', s))
['abbbc', 'abc']
['abbbc', 'abc']
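The other two minimal operators, ?? and {m,n}?, behave in the same way. The following sketch compares each with its greedy form on made-up strings.

```python
import re

# {m,n} takes as many repetitions as it can; {m,n}? takes as few as it can.
greedy_range = re.findall(r"a{2,3}", "aaaaa")
lazy_range = re.findall(r"a{2,3}?", "aaaaa")

# ? prefers one repetition; ?? prefers zero.
greedy_opt = re.findall(r"ab?", "abab")
lazy_opt = re.findall(r"ab??", "abab")

print(greedy_range)  # ['aaa', 'aa']
print(lazy_range)    # ['aa', 'aa']
print(greedy_opt)    # ['ab', 'ab']
print(lazy_opt)      # ['a', 'a']
```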