[Web Crawler] Part 3: Regular expressions, an introduction to the re library

First, the concept

A regular expression is a formal language that describes the general form of a set of strings rather than any specific content. It is most commonly used for string matching.

Second, the basic syntax and operators

The re library also accepts regular expressions written as ordinary string literals, but this is more cumbersome because every backslash has to be doubled. For example:
'[1-9]\\d{5}' is equivalent to the raw string r'[1-9]\d{5}'
'\\d{3}-\\d{8}|\\d{4}-\\d{7}' is equivalent to the raw string r'\d{3}-\d{8}|\d{4}-\d{7}'
Recommendation: when a regular expression contains escape characters, use a raw string; otherwise the two forms are identical.
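A minimal sketch of the point above: the doubled-backslash form and the raw-string form produce the same pattern text, so re treats them identically. The variable names are only illustrative.

```python
import re

escaped = '[1-9]\\d{5}'   # ordinary string: each backslash is doubled
raw = r'[1-9]\d{5}'       # raw string: each backslash is written once

# Both literals hold exactly the same characters.
same_pattern = (escaped == raw)

# Either form can be passed to re; here the raw form is used.
first_match = re.search(raw, 'BIT 100081').group(0)

print(same_pattern, first_match)
```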
Common regular expressions:

Regular expression    Meaning
^[A-Za-z]+$    A string made up only of the 26 letters, length >= 1
^[A-Za-z0-9]+$    A string made up only of the 26 letters and digits
^[0-9]*[1-9][0-9]*$    A string in positive-integer form
[\u4e00-\u9fa5]    Matches a Chinese character
\d{3}-\d{8}|\d{4}-\d{7}    Domestic (mainland China) phone number, e.g. 010-68913536
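The common patterns can be checked against a few sample strings. This is a sketch, not a definitive validator: the dictionary keys and the helper `matches` are hypothetical names, the patterns are typed with ASCII hyphens, the positive-integer pattern is written as ^[0-9]*[1-9][0-9]*$, and the phone pattern is wrapped in ^(...)$ here so that it tests the whole string.

```python
import re

# Hypothetical names for the patterns from the table above.
patterns = {
    'letters': r'^[A-Za-z]+$',
    'letters_digits': r'^[A-Za-z0-9]+$',
    'positive_int': r'^[0-9]*[1-9][0-9]*$',
    'phone': r'^(\d{3}-\d{8}|\d{4}-\d{7})$',
}

def matches(name, text):
    """Return True if text fits the named pattern."""
    return re.search(patterns[name], text) is not None

print(matches('letters', 'Python'))          # letters only
print(matches('letters', 'Py3'))             # contains a digit
print(matches('positive_int', '000'))        # no non-zero digit
print(matches('phone', '010-68913536'))      # 3 digits, hyphen, 8 digits

# The Chinese-character class matches one character at a time.
chinese = re.findall(r'[\u4e00-\u9fa5]', 'abc中文')
print(chinese)
```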

A regular expression for a string in IP address form (an IP address has four segments, each in the range 0-255)
Rough version: \d+\.\d+\.\d+\.\d+ or \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Precise version, built from the four ranges a segment can fall in:

0-99: [1-9]?\d
100-199: 1\d{2}
200-249: 2[0-4]\d
250-255: 25[0-5]

(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])

The dot is escaped as \. so that it matches a literal dot rather than any character.
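As a rough illustration, the precise pattern can be assembled from the per-segment alternatives and applied with fullmatch so that the whole string has to be an address. The names `OCTET`, `IP_PATTERN` and `is_ipv4` are made up for this sketch, and the dots are escaped so that they match only a literal dot.

```python
import re

# One segment: 0-99, 100-199, 200-249 or 250-255.
OCTET = r'([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])'

# Three "segment + dot" groups followed by a final segment.
IP_PATTERN = re.compile(r'(' + OCTET + r'\.){3}' + OCTET)

def is_ipv4(text):
    """Return True if the whole string is a dotted IPv4 address."""
    return IP_PATTERN.fullmatch(text) is not None

for candidate in ['192.168.1.255', '0.0.0.0', '256.1.1.1', '1.2.3']:
    print(candidate, is_ipv4(candidate))
```

Using fullmatch instead of search avoids accepting strings that merely contain an address somewhere inside them.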

Other examples:

import re

re.findall(r'[af].', 'abcaabbccfg')  # first character is a or f, second is any character
Output:
['ab', 'aa', 'fg']

re.findall(r'[af]+', 'abcafafaabbccfg')  # every character must be a or f, length at least 1 (greedy matching)
Output:
['a', 'afafaa', 'f']

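# Match at the start of the string: if the start does not fit the pattern, findall returns an empty list; if it does, later positions are not considered
re.findall("^[a-z]{4}", 'abcd0abcd2abb3abbbb')
re.findall("^[a-z]{4}", '3abcd0abcd2abb3abbbb')
Output:
['abcd']
[]

# Match at the end of the string: if the end does not fit the pattern, findall returns an empty list and no other position is considered
string = "0abcdabce2abcd3abcab"
re.findall("[a-z]{2}$", string)
re.findall("[a-z]{4}$", "0abcdabce2abcd3abc2ab")
Output:
['ab']
[]

# Match Chinese characters
re.findall("[\u4e00-\u9fa5]{4}", "0abcdabce2abcd3a匹配中文c2ab")
Output:
['匹配中文']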

Third, the regex functions

Before introducing the individual functions, this section covers two things they have in common: flags and match objects.

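3.1 Flags

Flag    Meaning
re.I (re.IGNORECASE)    Ignore case when matching
re.M (re.MULTILINE)    The ^ operator matches at the start of every line of the given string
re.S (re.DOTALL)    The . operator matches every character; by default it matches everything except a newline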
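The effect of each flag is easiest to see side by side on a two-line string. This is only a sketch; the sample text and variable names are invented for illustration.

```python
import re

text = 'first line\nSecond line'

# re.I: 'second' also matches 'Second'.
ignore_case = re.findall(r'second', text, re.I)

# Without re.M, ^ only matches at the very start of the string.
start_default = re.findall(r'^\w+', text)
# With re.M, ^ also matches right after each newline.
start_multiline = re.findall(r'^\w+', text, re.M)

# Without re.S, . stops at the newline; with re.S it crosses it.
dot_default = re.search(r'first.*line', text).group(0)
dot_all = re.search(r'first.*line', text, re.S).group(0)

print(ignore_case, start_default, start_multiline)
print(repr(dot_default), repr(dot_all))
```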

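A match object is the result returned by some of the regex functions and carries detailed information about the match, exposed through attributes and methods.
Attributes of a match object:

Attribute    Meaning
.string    The text that was searched
.re    The pattern object (regular expression) used for the match
.pos    The position in the text where the search started
.endpos    The position in the text where the search ended

Methods:
.group(0)    The matched string
.start()    Start position of the matched string in the original string
.end()    End position of the matched string in the original string
.span()    Returns (.start(), .end())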

3.2 Regex functions

3.2.1 re.search(pattern, string, flags=0)

Function: returns a match object for the first match only, even if the original string contains many matching substrings.

match = re.search(r'[0-9]\d{5}', 'BIT 100081111111')
print(match)
print(type(match))
print(match.string)
print(match.re)
print(match.group(0))
print(match.pos)
print(match.endpos)
print(match.start())
print(match.end())
print(match.span())
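Output (recorded on an older Python 3; since Python 3.7 the match type is displayed as re.Match rather than _sre.SRE_Match):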
<_sre.SRE_Match object; span=(4, 10), match='100081'>
<class '_sre.SRE_Match'>
BIT 100081111111
re.compile('[0-9]\\d{5}')
100081
0
16
4
10
(4, 10)

# Ignore case
match = re.search(r'[A-Z]{3}', 'BIt 100081111111', re.I)
match.group(0)
Output:
'BIt'

3.2.2 re.match(pattern, string, flags)

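Function: matching is forced to start at the beginning of the string. If the pattern does not match there, None is returned, even if the original string contains matching substrings elsewhere.

match = re.match(r'[1-9]\d{5}', 'BIT 100081')
type(match)
Output:
NoneType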
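Because re.match can return None, calling .group(0) on its result without a check raises an AttributeError. A small sketch contrasting re.match with re.search on the same input, with the None case guarded; the variable names are illustrative only.

```python
import re

pattern = r'[1-9]\d{5}'
text = 'BIT 100081'

at_start = re.match(pattern, text)    # the string starts with 'BIT', so no match
anywhere = re.search(pattern, text)   # scans forward and finds the digits

# Guard against None before reading the match.
match_result = at_start.group(0) if at_start else None
search_result = anywhere.group(0) if anywhere else None

# re.match succeeds when the pattern fits at position 0.
leading = re.match(pattern, '100081 BIT')

print(match_result, search_result, leading.group(0))
```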

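3.2.3 re.findall(pattern, string, flags=0)  # returns a list

Function: returns all matches as a list of strings. The scan moves forward through the string, so a position that has already been matched is not considered again (matches do not overlap).

match = re.findall(r'[0-9]\d{5}', 'BIT 100001,2000123')  # returns a list
print(type(match))
if match:
    for i in match:
        print(i)
Output:
<class 'list'>
100001
200012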

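3.2.4 re.split(pattern, string, maxsplit=0, flags=0)  # returns a list

Function: locates each match and splits the original string there (the matched text itself is not kept). The maxsplit parameter limits the number of splits.

re.split(r'[1-9]\d{5}', 'ABC100081TSU1000792')  # returns the parts left after removing the matched text
re.split(r'[1-9]\d{5}', 'ABC100081TSU1000792', maxsplit=1)  # splits at the first match only
Output:
['ABC', 'TSU', '2']
['ABC', 'TSU1000792']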

3.2.5 re.finditer(pattern, string, flags=0)  # returns an iterator of match objects

Function: similar to findall, except that findall returns a list of matched strings, while finditer returns an iterator that yields one match object per match.

for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    print(type(m), m.group(0))
Output:
<class '_sre.SRE_Match'> 100081
<class '_sre.SRE_Match'> 100084
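Since finditer yields match objects, each hit carries its position as well as its text, which findall does not provide. A short sketch that collects both; the name `hits` is just for illustration.

```python
import re

text = 'BIT100081 TSU100084'

# Collect the matched text together with its (start, end) span.
hits = [(m.group(0), m.span()) for m in re.finditer(r'[1-9]\d{5}', text)]

for matched, (start, end) in hits:
    # Slicing the original string with the span gives back the match.
    print(matched, start, end, text[start:end])
```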

3.2.6 re.sub(pattern, repl, string, count=0, flags=0)

Function: replaces each match with the string passed as repl. The count parameter limits the number of replacements (0 means replace all).

re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081, TSU100084')
re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081, TSU100084', count=1)
Output:
'BIT:zipcode, TSU:zipcode'
'BIT:zipcode, TSU100084'

Fourth, the equivalent usage

First compile the pattern into a regular expression object, then call methods on that object. The search function is used as the example here:
regex = re.compile(pattern, flags=0)
∙ pattern: the regular expression, as an ordinary string or a raw string
∙ flags: control flags applied when the regular expression is used

regex = re.compile(r'[1-9]\d{5}')

This compiles the string form of a regular expression into a regular expression object, for example:

rst = re.search(r'[1-9]\d{5}', 'BIT 100081')  # functional usage: a one-off operation
pat = re.compile(r'[1-9]\d{5}')  # object-oriented usage: compile once, then reuse many times
rst = pat.search('BIT 100081')
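The object-oriented form pays off when the same pattern is applied to many strings. A minimal sketch, with an invented list of input lines and illustrative variable names:

```python
import re

# Compile once...
zip_pattern = re.compile(r'[1-9]\d{5}')

lines = ['BIT 100081', 'TSU 100084', 'no code here']

# ...then reuse the compiled object for every line.
found = []
for line in lines:
    m = zip_pattern.search(line)
    found.append(m.group(0) if m else None)

# The compiled object offers the same operations as the module-level functions.
replaced = zip_pattern.sub(':zipcode', 'BIT100081, TSU100084', count=1)

print(found, replaced)
```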

Fifth, greedy matching and minimal matching

match = re.search(r'PY.*N', 'PYANBNCNDN')
match.group(0)

When several matches of different lengths are possible, which one is returned? -> 'PYANBNCNDN'
The re library uses greedy matching by default, i.e. it returns the longest matching substring.

Minimal matching:

match = re.search(r'PY.*?N', 'PYANBNCNDN')
match.group(0)
'PYAN'

Operator    Description
*?    Zero or more repetitions of the preceding character, minimal match
+?    One or more repetitions of the preceding character, minimal match
??    Zero or one repetition of the preceding character, minimal match
{m,n}?    m to n repetitions of the preceding character (n inclusive), minimal match
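To compare the greedy and minimal forms of each operator, the sketch below runs both on the same string used earlier. The helper `first` is a hypothetical convenience function, not part of re.

```python
import re

text = 'PYANBNCNDN'

def first(pattern):
    """Return the text of the first match of pattern in text."""
    return re.search(pattern, text).group(0)

results = {
    '*':      (first(r'PY.*N'), first(r'PY.*?N')),
    '+':      (first(r'PY.+N'), first(r'PY.+?N')),
    '?':      (first(r'PYA?'), first(r'PYA??')),
    '{m,n}':  (first(r'PY.{1,5}N'), first(r'PY.{1,5}?N')),
}

for operator, (greedy, minimal) in results.items():
    print(operator, greedy, minimal)
```

In each pair the first value is the greedy (longest) match and the second is the minimal (shortest) one.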

Origin blog.csdn.net/weixin_43522964/article/details/100063954