A brief summary of Python's re library

I. Regular expressions

A regular expression (regex, RE) is an expression that concisely describes a set of strings. It is built from characters and operators.

A regular expression is:

A general framework for expressing strings

A concise way to express a set of strings

A tool that applies the ideas of "conciseness" and "features" to string expression

A means of deciding whether a given string has a particular feature

Regular expressions are widely used in text processing:

Expressing the features of a type of text (virus signatures, intrusion patterns, etc.)

Finding or replacing a set of strings at the same time

Matching all or part of a string

Using regular expressions

Compile: convert a string that conforms to regular expression syntax into a regular expression feature (a compiled pattern)

Common regular expression operators

Operator  Explanation                                                   Example
.         Any single character (except newline by default)
[]        Character set: one character from the given set               [abc] matches a, b, or c; [a-z] matches one character from a to z
[^]       Negated character set: one character not in the given set     [^abc] matches any single character other than a, b, or c
*         Zero or more repetitions of the previous character            abc* matches ab, abc, abcc, abccc, etc.
+         One or more repetitions of the previous character             abc+ matches abc, abcc, abccc, etc.
?         Zero or one repetition of the previous character              abc? matches ab, abc
|         Either the left or the right expression                       abc|def matches abc, def
{m}       Exactly m repetitions of the previous character               ab{2}c matches abbc
{m,n}     m to n repetitions (n included) of the previous character     ab{1,2}c matches abc, abbc
^         Matches at the start of the string                            ^abc matches abc at the start of a string
$         Matches at the end of the string                              abc$ matches abc at the end of a string
()        Grouping marker; | inside a group separates alternatives      (abc) matches abc; (abc|def) matches abc, def
\d        Digit; equivalent to [0-9] for ASCII text
\w        Word character; equivalent to [A-Za-z0-9_] for ASCII text
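As a rough check of the operator table, the sketch below tests a few of its examples with re.fullmatch(), which succeeds only when the whole string matches. The sample strings and the helper name check are illustrative choices, not part of the original notes.

```python
import re

# Each entry: (pattern, strings that should match in full, strings that should not).
cases = [
    (r"abc*", ["ab", "abc", "abcc"], ["a", "abd"]),
    (r"abc+", ["abc", "abcc"], ["ab"]),
    (r"abc?", ["ab", "abc"], ["abcc"]),
    (r"abc|def", ["abc", "def"], ["abcdef"]),
    (r"ab{2}c", ["abbc"], ["abc", "abbbc"]),
    (r"ab{1,2}c", ["abc", "abbc"], ["ac", "abbbc"]),
    (r"[^abc]", ["d", "z"], ["a", "b"]),
]

def check(pattern, should_match, should_fail):
    """Return True if every sample behaves as the table predicts."""
    ok = all(re.fullmatch(pattern, s) for s in should_match)
    return ok and not any(re.fullmatch(pattern, s) for s in should_fail)

results = {pattern: check(pattern, yes, no) for pattern, yes, no in cases}
print(results)
```

Every value in results should be True if the table entries are read as whole-string matches.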

Syntax examples

Regular expression        Matching strings
P(Y|YT|YTH|YTHO)?N        'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'
PYTHON+                   'PYTHON', 'PYTHONN', 'PYTHONNN', ...
PY[TH]ON                  'PYTON', 'PYHON'
PY[^TH]?ON                'PYON', 'PYaON', 'PYbON', 'PYcON', ...
PY{0,3}N                  'PN', 'PYN', 'PYYN', 'PYYYN'
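The optional group and the negated character set are the least obvious entries here, so a small sketch may help. It filters a list of candidate strings with fullmatch(); the candidate lists are made up for illustration.

```python
import re

# Optional group: the part in parentheses may appear once or not at all.
optional_group = re.compile(r"P(Y|YT|YTH|YTHO)?N")
candidates = ["PN", "PYN", "PYTN", "PYTHN", "PYTHON", "PYTHOON", "PTN"]
group_hits = [s for s in candidates if optional_group.fullmatch(s)]

# Negated set: at most one character that is neither T nor H between PY and ON.
negated_set = re.compile(r"PY[^TH]?ON")
set_hits = [s for s in ["PYON", "PYaON", "PYTON", "PYHON"] if negated_set.fullmatch(s)]

print(group_hits)
print(set_hits)
```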

Classic examples

Regular expression              Meaning
^[A-Za-z]+$                     A string made up of the 26 letters
^[A-Za-z0-9]+$                  A string made up of the 26 letters and digits
^-?\d+$                         A string in integer form
^[0-9]*[1-9][0-9]*$             A string in positive-integer form
[1-9]\d{5}                      Postal code in mainland China, 6 digits
[\u4e00-\u9fa5]                 A Chinese character
\d{3}-\d{8}|\d{4}-\d{7}         Domestic phone number: 3 digits, a hyphen, 8 digits, or 4 digits, a hyphen, 7 digits
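These patterns can be wrapped in small validators. The following is a minimal sketch, assuming whole-string validation is wanted; the function names are hypothetical and the phone numbers are made-up placeholders, not real numbers.

```python
import re

ZIP_CODE = re.compile(r"[1-9]\d{5}")
PHONE = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")
CHINESE = re.compile(r"[\u4e00-\u9fa5]+")

def is_zip_code(text):
    """True if text is exactly a 6-digit code that does not start with 0."""
    return ZIP_CODE.fullmatch(text) is not None

def is_phone(text):
    """True if text is exactly 3 digits-8 digits or 4 digits-7 digits."""
    return PHONE.fullmatch(text) is not None

def chinese_runs(text):
    """Return every run of consecutive Chinese characters in text."""
    return CHINESE.findall(text)

print(is_zip_code("100081"), is_zip_code("012345"))
print(is_phone("010-12345678"), is_phone("0123-1234567"), is_phone("01-123"))
print(chinese_runs("BIT 北京理工大学 100081"))
```

fullmatch() is used instead of writing ^ and $ into every pattern, so the same compiled pattern can also be reused for searching inside longer text.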

II. The re library

re is part of the Python standard library and is used mainly for string matching. Import it with "import re".

1. Representing regular expressions

The re library uses the raw string type to express regular expressions, written as r'text'. A raw string is a string in which the backslash is not treated as an escape character.

For example: r'[1-9]\d{5}', r'\d{3}-\d{8}|\d{4}-\d{7}'

The ordinary string type is more cumbersome, because every backslash must itself be escaped.

For example: '[1-9]\\d{5}', '\\d{3}-\\d{8}|\\d{4}-\\d{7}'
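A short sketch can confirm that the two spellings produce the same pattern text. The variable names are illustrative.

```python
import re

raw = r"[1-9]\d{5}"        # raw string: the backslash is kept as written
escaped = "[1-9]\\d{5}"    # ordinary string: the backslash must be doubled

same_text = raw == escaped
lengths = (len(r"\d"), len("\\d"))   # both spellings hold two characters

# Either spelling can be passed to the re functions.
found = re.search(escaped, "BIT 100081").group(0)
print(same_text, lengths, found)
```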

2. Main functions of the re library

Function        Description
re.search()     Searches a string for the first position where the regular expression matches; returns a match object
re.match()      Matches the regular expression starting at the beginning of a string; returns a match object
re.findall()    Searches a string and returns all matching substrings as a list
re.split()      Splits a string at each match of the regular expression; returns a list
re.finditer()   Searches a string and returns an iterator over the matches; each element is a match object
re.sub()        Replaces every substring that matches the regular expression; returns the resulting string

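(1) The re.search() function

re.search(pattern, string, flags=0)

Searches a string for the first position where the regular expression matches and returns a match object.

pattern: the regular expression, as a string or raw string

string: the string to be searched

flags: control flags that modify how the regular expression is applied

>>> import re
>>> match = re.search(r'[1-9]\d{5}', 'BIT 100081')
>>> if match:
...     print(match.group(0))
...
100081

Common flag             Description
re.I, re.IGNORECASE     Ignore case; [A-Z] also matches lowercase characters
re.M, re.MULTILINE      The ^ operator matches at the start of every line of the given string
re.S, re.DOTALL         The . operator matches every character; by default it matches every character except a newline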
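The effect of each flag is easiest to see side by side with the default behaviour. This is a small sketch with made-up sample strings; flags can be combined with the | operator, as the last line shows.

```python
import re

# re.I: [A-Z] also matches lowercase letters.
without_i = re.search(r"[A-Z]+", "bit")
with_i = re.search(r"[A-Z]+", "bit", re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string.
text = "one\ntwo"
start_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, re.M)

# re.S: . also matches the newline character.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.S)

# Flags are combined with |.
combined = re.findall(r"^[a-z]+", "One\nTwo", re.I | re.M)

print(without_i, with_i.group(0))
print(start_only, every_line)
print(dot_default, dot_all is not None)
print(combined)
```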

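(2) The re.match() function

re.match(pattern, string, flags=0)

Matches the regular expression starting at the beginning of a string and returns a match object.

pattern: the regular expression, as a string or raw string

string: the string to be matched

flags: control flags that modify how the regular expression is applied

>>> import re
>>> match = re.match(r'[1-9]\d{5}', 'BIT 100081')
>>> if match:
...     print(match.group(0))
...
>>> match.group(0)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

As the code above shows, the pattern does not match a string that begins with 'BIT'. re.match() returns None in that case, so calling match.group(0) raises AttributeError. When the string begins with the postal code, the match succeeds:

>>> import re
>>> match = re.match(r'[1-9]\d{5}', '100081 BIT')
>>> if match:
...     print(match.group(0))
...
100081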
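To make the difference between re.match() and re.search() concrete, and to show one way of guarding against a None result, here is a minimal sketch. The helper name describe is hypothetical.

```python
import re

ZIP = r"[1-9]\d{5}"

def describe(text):
    """Return (result of re.match, result of re.search) as strings or None."""
    at_start = re.match(ZIP, text)    # only tries position 0
    anywhere = re.search(ZIP, text)   # scans the whole string
    return (
        at_start.group(0) if at_start else None,
        anywhere.group(0) if anywhere else None,
    )

print(describe("BIT 100081"))
print(describe("100081 BIT"))
```

Checking the result before calling group() avoids the AttributeError that a failed match would otherwise cause.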

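(3) The re.findall() function

re.findall(pattern, string, flags=0)

Searches a string and returns all matching substrings as a list.

pattern: the regular expression, as a string or raw string

string: the string to be searched

flags: control flags that modify how the regular expression is applied

>>> import re
>>> ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
>>> ls
['100081', '100084']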
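Two details of re.findall() are worth a quick sketch: it returns an empty list rather than None when nothing matches, and the matches it returns do not overlap. The sample strings are made up.

```python
import re

ZIP = r"[1-9]\d{5}"

no_hits = re.findall(ZIP, "no postal codes here")

# Scanning resumes after the end of each match, so an 8-digit run
# yields one 6-digit match and the remaining 2 digits are too short.
digit_run = re.findall(ZIP, "12345678")

print(no_hits, digit_run)
```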

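(4) The re.split() function

re.split(pattern, string, maxsplit=0, flags=0)

Splits a string at each match of the regular expression and returns a list.

pattern: the regular expression, as a string or raw string

string: the string to be split

maxsplit: the maximum number of splits; the remainder of the string is returned as the last element

flags: control flags that modify how the regular expression is applied

>>> import re
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')  # the matches are removed; the remaining pieces form the list
['BIT', ' TSU', '']
>>> re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)  # only the first match is removed; everything after it stays in one piece
['BIT', ' TSU100084']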
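The raw result of re.split() often contains empty strings and stray whitespace, as in the output above. One possible way to tidy it, sketched here as an assumption about what a caller might want, is to strip each piece and drop the empty ones.

```python
import re

parts = re.split(r"[1-9]\d{5}", "BIT100081 TSU100084")

# Strip surrounding whitespace and discard pieces that end up empty.
cleaned = [p.strip() for p in parts if p.strip()]

print(parts)
print(cleaned)
```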

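(5) The re.finditer() function

re.finditer(pattern, string, flags=0)

Searches a string and returns an iterator over the matches; each element is a match object.

pattern: the regular expression, as a string or raw string

string: the string to be searched

flags: control flags that modify how the regular expression is applied

>>> import re
>>> for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
...     print(m.group(0))
...
100081
100084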
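Because each element is a match object, re.finditer() gives access to positions as well as text, which re.findall() does not. A short sketch:

```python
import re

text = "BIT100081 TSU100084"

# Pair every matched substring with its (start, end) position in text.
located = [(m.group(0), m.span()) for m in re.finditer(r"[1-9]\d{5}", text)]

print(located)
```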

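(6) The re.sub() function

re.sub(pattern, repl, string, count=0, flags=0)

Replaces every substring of a string that matches the regular expression and returns the resulting string.

pattern: the regular expression, as a string or raw string

repl: the string that replaces each match

string: the string to be searched

count: the maximum number of replacements

flags: control flags that modify how the regular expression is applied

>>> import re
>>> re.sub(r'[1-9]\d{5}', 'zipcode', 'BIT100081 TSU100084')
'BITzipcode TSUzipcode'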
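The count parameter is described above but not demonstrated. The sketch below compares the default (replace every match) with count=1.

```python
import re

text = "BIT100081 TSU100084"

replace_all = re.sub(r"[1-9]\d{5}", "zipcode", text)
replace_first = re.sub(r"[1-9]\d{5}", "zipcode", text, count=1)

print(replace_all)
print(replace_first)
```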

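3. Ways to use the re library

(1) Functional usage: a one-off operation

>>> import re
>>> rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

(2) Object-oriented usage: compile once, then operate many times

>>> import re
>>> pat = re.compile(r'[1-9]\d{5}')
>>> rst = pat.search('BIT 100081')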

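(3) The re.compile() function

regex = re.compile(pattern, flags=0)

Compiles the string form of a regular expression into a regular expression object.

pattern: the regular expression, as a string or raw string

flags: control flags that modify how the regular expression is applied

After re.compile(), regex is a compiled regular expression object whose methods can be called directly: regex.search(), regex.match(), regex.findall(), regex.split(), regex.finditer(), and regex.sub().

>>> import re
>>> regex = re.compile(r'[1-9]\d{5}')
>>> regex.search('100081')
<re.Match object; span=(0, 6), match='100081'>
>>> regex.search('100081').group(0)
'100081'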
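The "compile once, use many times" idea can be sketched by applying one compiled object to several strings. The list of sample lines is made up for illustration.

```python
import re

regex = re.compile(r"[1-9]\d{5}")   # compiled once

lines = ["BIT 100081", "TSU 100084", "no code here"]

# The same object is reused for every line; lines without a match are skipped.
codes = [m.group(0) for m in map(regex.search, lines) if m]

print(codes)
```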

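4. The Match object

(1) Attributes of Match

Attribute   Description
.string     The text that was searched
.re         The pattern object (regular expression) used for the match
.pos        The position in the text where the search started
.endpos     The position in the text where the search ended

>>> import re
>>> m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
>>> m.string  # the text that was searched
'BIT100081 TSU100084'
>>> m.re  # the regular expression used for the match
re.compile('[1-9]\\d{5}')
>>> m.pos  # where the search started
0
>>> m.endpos  # where the search ended
19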
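With the module-level functions, .pos and .endpos are always 0 and the length of the string. They become more informative with a compiled pattern, whose search() method accepts optional pos and endpos arguments. A minimal sketch:

```python
import re

regex = re.compile(r"[1-9]\d{5}")
text = "BIT100081 TSU100084"

# Start searching at index 9, so the first postal code is skipped.
later = regex.search(text, 9)

# Restrict the search to text[0:8], which holds only five digits.
truncated = regex.search(text, 0, 8)

print(later.group(0), later.pos, later.endpos)
print(truncated)
```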

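(2) Methods of Match

Method      Description
.group(0)   Returns the matched string
.start()    Start position of the match in the original string
.end()      End position of the match in the original string
.span()     Returns (.start(), .end())

>>> import re
>>> m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
>>> m.group(0)  # the matched string; search returns the first match only
'100081'
>>> m.start()  # start position of the match in the original string
3
>>> m.end()  # end position of the match in the original string
9
>>> m.span()  # returns (.start(), .end())
(3, 9)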
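The positions returned by .start() and .end() are ordinary slice indices into the original string. As an illustration, the sketch below uses them to put brackets around the match; the bracket format is an arbitrary choice.

```python
import re

text = "BIT100081 TSU100084"
m = re.search(r"[1-9]\d{5}", text)

start, end = m.span()

# Slicing with the span reproduces the matched text.
same = text[start:end] == m.group(0)

# Rebuild the string with the match marked.
marked = text[:start] + "[" + m.group(0) + "]" + text[end:]

print(same, marked)
```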

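5. Greedy and minimal matching

(1) Greedy matching

Greedy matching matches the longest possible substring. The re library uses greedy matching by default.

>>> import re
>>> match = re.search(r'PY.*N', 'PYANBNCNDN')
>>> match.group(0)
'PYANBNCNDN'

(2) Minimal matching

Minimal matching matches the shortest possible substring.

Operator   Description
*?         Zero or more repetitions of the previous character, minimal match
+?         One or more repetitions of the previous character, minimal match
??         Zero or one repetition of the previous character, minimal match
{m,n}?     m to n repetitions (n included) of the previous character, minimal match

>>> import re
>>> match = re.search(r'PY.*?N', 'PYANBNCNDN')  # minimal match
>>> match.group(0)
'PYAN'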
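The example above covers only *?. The sketch below runs the greedy and minimal form of each operator in the table against the same string so the results can be compared; the helper name first is hypothetical.

```python
import re

text = "PYANBNCNDN"

def first(pattern):
    """Return the text of the first match of pattern in text."""
    return re.search(pattern, text).group(0)

pairs = {
    "*":     (first(r"PY.*N"),      first(r"PY.*?N")),
    "+":     (first(r"PY.+N"),      first(r"PY.+?N")),
    "?":     (first(r"PYA?"),       first(r"PYA??")),
    "{m,n}": (first(r"PY.{1,5}N"),  first(r"PY.{1,5}?N")),
}

for operator, (greedy, minimal) in pairs.items():
    print(operator, greedy, minimal)
```

In each pair the first result is the greedy match and the second is the minimal one.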

Source: Beijing Institute of Technology, Song Tian (嵩天), MOOC

Origin www.cnblogs.com/huskysir/p/12467491.html