Python module | re module

A regular expression describes a set of strings (a pattern). It can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract the substrings that satisfy some condition.

 

 

I. Basic use of the re module

Characters

Metacharacter  Matches
.  Any character except a newline
\w  A letter, digit, or underscore
\s  Any whitespace character
\d  A digit
\n  A newline
\t  A tab
\b  A word boundary: the position between a word character and a non-word character, such as the start or end of a word
^  The start of the string
$  The end of the string
\A  Only the start of the string (like ^)
\Z  Only the end of the string (like $)
\W  Any character that is not a letter, digit, or underscore
\D  Any non-digit character
\S  Any non-whitespace character
a|b  Either a or b
()  A group: the content inside () is treated as a whole. If () is followed by a quantifier, as in (abc)*, the quantifier applies to the entire group rather than only to the character before it
[...]  Any one character in the character set
[^...]  Any one character not in the character set
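
The table entries for \b, a|b and () are easier to see in action. The following is a minimal sketch; the sample strings and variable names are made up for illustration. Note that the patterns are written as raw strings (r"..."), because in an ordinary Python string literal "\b" means a backspace character rather than a word boundary.

```python
import re

text = "cat concat cats cat."

# \b anchors the match at word boundaries, so "concat" and "cats" are skipped.
whole_words = re.findall(r"\bcat\b", text)
print(whole_words)  # ['cat', 'cat']

# a|b matches either alternative.
pets = re.findall(r"cat|dog", "dog cat bird")
print(pets)  # ['dog', 'cat']

# A quantifier after () repeats the whole group, not just the last character.
repeated = re.fullmatch(r"(ab)*", "ababab")
print(repeated is not None)  # True
```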

\d and \D

  \d matches any decimal digit, equivalent to the class [0-9] for ASCII text; use \d+ to match one or more digits. \D matches any non-digit character, equivalent to the class [^0-9].

import re
print(re.findall(r'\d','1234567890 summer *(_'))  # ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
print(re.findall(r'\D','1234567890 summer *(_'))  # [' ', 's', 'u', 'm', 'm', 'e', 'r', ' ', '*', '(', '_']
print(re.findall(r"\d+", "spring2summer134444autumn5winter"))  # ['2', '134444', '5']
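
One caveat about the [0-9] equivalence: for str patterns in Python 3, \d also matches decimal digits from other scripts unless the re.ASCII flag is given. A small sketch, using an Arabic-Indic digit chosen purely as an example:

```python
import re

# ASCII digit 3 followed by ARABIC-INDIC DIGIT THREE (U+0663)
sample = "3\u0663"

unicode_digits = re.findall(r"\d", sample)          # Unicode-aware by default
ascii_digits = re.findall(r"\d", sample, re.ASCII)  # restricted to [0-9]

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['3']
```

If only 0-9 should be accepted, writing [0-9] explicitly or passing re.ASCII are both reasonable choices.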

 

\w and \W

  \w matches any alphanumeric character or underscore, equivalent to the class [a-zA-Z0-9_] for ASCII text (note that the underscore is included); in Python 3 it also matches Unicode letters such as Chinese characters. \W matches any character that is not alphanumeric or an underscore, equivalent to the class [^a-zA-Z0-9_].

print(re.findall(r'\w', 'hello!中国 123*()_'))    # ['h', 'e', 'l', 'l', 'o', '中', '国', '1', '2', '3', '_']
print(re.findall(r'\W', 'hello!中国 123*()_'))    # ['!', ' ', '*', '(', ')']

 

\s and \S

  \s matches any whitespace character, equivalent to the class [ \t\n\r\f\v]; \S matches any non-whitespace character, equivalent to the class [^ \t\n\r\f\v].

print(re.findall(r'\s', 'hello!中国 123*(_ \t \n'))   # [' ', ' ', '\t', ' ', '\n']
print(re.findall(r'\S', 'hello!中国 123*(_ \t \n'))   # ['h', 'e', 'l', 'l', 'o', '!', '中', '国', '1', '2', '3', '*', '(', '_']

 

\A and ^
  Both match only if the pattern is found at the very start of the string; otherwise there is no match. When ^ is the first character inside a [] character set, it negates the set instead.

print(re.findall(r'\Ahel','hello!中国 123*(_ \t \n'))  # ['hel']
print(re.findall('^hel','hello!中国 123*(_ \t \n'))  # ['hel']
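
\A and ^ behave the same in the examples above, but they differ once the re.MULTILINE (re.M) flag is used: ^ then also matches at the start of every line, while \A still matches only at the start of the whole string. A short sketch with an illustrative three-line string:

```python
import re

text = "spring\nsummer\nautumn"

caret_default = re.findall(r"^\w+", text)          # ^ only at the start of the string
caret_multiline = re.findall(r"^\w+", text, re.M)  # ^ at the start of every line
slash_a = re.findall(r"\A\w+", text, re.M)         # \A ignores re.M

print(caret_default)    # ['spring']
print(caret_multiline)  # ['spring', 'summer', 'autumn']
print(slash_a)          # ['spring']
```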

 

\Z and $

   Both match only if the pattern is found at the end of the string; otherwise there is no match.

print(re.findall(r'666\Z','hello!中国 123* *-_-* \n666'))    # ['666']
print(re.findall('666$','hello!中国 123* *-_-* \n666'))    # ['666']
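
There is one difference between the two that the examples above do not show: $ also matches just before a newline at the very end of the string, whereas \Z matches only at the absolute end. A minimal sketch, with sample strings chosen for illustration:

```python
import re

text = "hello 666\n"  # note the trailing newline

dollar = re.findall(r"666$", text)    # $ tolerates one trailing newline
slash_z = re.findall(r"666\Z", text)  # \Z does not

print(dollar)   # ['666']
print(slash_z)  # []

# With re.M, $ matches at the end of every line.
line_ends = re.findall(r"\d+$", "1\n22\n333", re.M)
print(line_ends)  # ['1', '22', '333']
```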

 

\n and \t

print(re.findall('\n','hello \n summer \t*-_-*\t \n666'))  # ['\n', '\n']
print(re.findall('\t','hello \n summer \t*-_-*\t \n666'))  # ['\t', '\t']

 

[^a-z]

  Negation: matches any character other than the lowercase letters a-z

print(re.findall("[^a-z]",'hello!中国 123* world\n666'))      # ['!', '中', '国', ' ', '1', '2', '3', '*', ' ', '\n', '6', '6', '6']
print(re.findall("[^a-z]+",'hello!中国 123* world\n666'))      # ['!中国 123* ', '\n666']
print(re.findall("[^a-z]*",'hello!中国 123* world\n666'))      # ['', '', '', '', '', '!中国 123* ', '', '', '', '', '', '\n666', '']

 

[]
  Any characters can be placed inside the brackets; one pair of square brackets matches exactly one character

print(re.findall('a.b', 'a1b a3b aeb a*b arb a_b'))  # ['a1b', 'a3b', 'aeb', 'a*b', 'arb', 'a_b']
print(re.findall('a[abc]b', 'aab abb acb adb afb a_b'))  # ['aab', 'abb', 'acb']
print(re.findall('a[0-9]b', 'a1b a3b aeb a*b arb a_b'))  # ['a1b', 'a3b']
print(re.findall('a[a-z]b', 'a1b a3b aeb a*b arb a_b'))  # ['aeb', 'arb']
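
Two details about character sets are worth a quick check: most metacharacters such as . * + lose their special meaning inside [], and several ranges can be combined in one set. The sketch below uses made-up sample strings.

```python
import re

# Inside [], the characters . * + are matched literally.
literal_symbols = re.findall(r"a[.*+]b", "a.b a*b a+b axb")
print(literal_symbols)  # ['a.b', 'a*b', 'a+b']

# Ranges can be combined: one hexadecimal digit between a and b.
hex_digit = re.findall(r"a[0-9a-f]b", "a1b afb agb")
print(hex_digit)  # ['a1b', 'afb']

# A hyphen placed first in the set is a literal hyphen, not a range.
hyphen_or_underscore = re.findall(r"a[-_]b", "a-b a_b a b")
print(hyphen_or_underscore)  # ['a-b', 'a_b']
```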

 

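Quantifiers:

Quantifier  Meaning
*  Repeat zero or more times
+  Repeat one or more times
?  Repeat zero or one time
{n}  Repeat exactly n times
{n,}  Repeat n or more times
{n,m}  Repeat n to m times

.   Matches any single character except a newline (with the re.DOTALL flag it also matches \n).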

print(re.findall('a.b', 'ab aab a*b a2b a喜欢b a\nb'))        # ['aab', 'a*b', 'a2b']
print(re.findall('a.b', 'ab aab a*b a2b a喜欢b a\nb',re.DOTALL))  # ['aab', 'a*b', 'a2b', 'a\nb']

 

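?  Matches the preceding character zero or one time. It can also be placed after another quantifier to make that quantifier non-greedy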

print(re.findall('a?b', 'ab aab abb aaaab a喜欢b aba**b'))  # ['ab', 'ab', 'ab', 'b', 'ab', 'b', 'ab', 'b']
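
A common use of ? is to make a single character optional, for example to accept two spellings of a word. A small sketch with an illustrative string:

```python
import re

# u? means the "u" may appear once or not at all.
spellings = re.findall(r"colou?r", "color colour colouur")
print(spellings)  # ['color', 'colour']
```

The last word is not matched because ? allows at most one occurrence of the preceding character.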

 

*   Matches zero or more repetitions of the expression to its left. Matching is greedy

print(re.findall('a*b', 'ab aab aaab abbb'))  # ['ab', 'aab', 'aaab', 'ab', 'b', 'b']
print(re.findall('ab*', 'ab aab aaab abbbbb'))  # ['ab', 'a', 'ab', 'a', 'a', 'ab', 'abbbbb']

 

+  Matches one or more repetitions of the expression to its left. Matching is greedy

print(re.findall('a+b', 'ab aab aaab abbb')) # ['ab', 'aab', 'aaab', 'ab']

 

.*   Greedy match: it consumes as much as possible, from the first possible start to the last possible end.

print(re.findall('a.*b', 'ab aab a*()b')) # ['ab aab a*()b']

 

.*?   The trailing ? qualifies a greedy pattern such as .* and makes it non-greedy: it matches as little as possible

print(re.findall('a.*?b', 'ab a1b a*()b, aaaaaab')) # ['ab', 'a1b', 'a*()b', 'aaaaaab']
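
The difference between greedy and non-greedy matching shows up clearly on tag-like text. The following sketch uses an invented HTML fragment and is only meant to contrast the two modes, not to suggest parsing HTML this way.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```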

 

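{}    Repetition range. {m} matches the preceding character exactly m times; {m,n} matches it m to n times; if n is omitted ({m,}), it matches m or more times

print(re.findall('a{2,4}b', 'ab aab aaab aaaaabb'))  # ['aab', 'aaab', 'aaaab']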
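
Exact repetition counts are handy for checking the shape of fixed-width input. The sketch below assumes a hypothetical helper, looks_like_date, that only checks the digit layout YYYY-MM-DD and does not validate the actual month or day values.

```python
import re

def looks_like_date(text):
    """Hypothetical helper: True if text has the shape YYYY-MM-DD (no range checks)."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None

print(looks_like_date("2019-12-31"))  # True
print(looks_like_date("19-12-31"))    # False

# {2,} means "two or more", so a single "o" is skipped.
long_o = re.findall(r"o{2,}", "go goo gooo")
print(long_o)  # ['oo', 'ooo']
```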

 

()   Grouping: define a sub-pattern and extract the part of each match that satisfies it

print(re.findall('(.*?)_sun', 'spring_sun summer_sun autumn_sun winter_sun'))     # ['spring', ' summer', ' autumn', ' winter']
print(re.findall('href="(.*?)"','<a href="http://www.baidu.com">点击</a>'))       # ['http://www.baidu.com']

 

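II. Common functions of the re module

1. findall()

  Returns all matches as a list, or an empty list if nothing matches. After a successful match, the next search starts right after the end of the previous match; in other words, characters that have already been matched do not take part in later matches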

import re

string = 'aspringasummerasautumnawinter'


# No groups. An empty pattern matches at every position, so the result is a list of empty strings one element longer than the input string
print(re.findall('a', string))                                          # ['a', 'a', 'a', 'a', 'a']
print(re.findall(r"\d+\w\d+", "1a2b3c4d5"))                             # ['1a2', '3c4']
print(re.findall("", "1a2b3c4d"))                                       # ['', '', '', '', '', '', '', '', '']

# One group: only the part of each match captured by the group is returned, similar to groups()
print(re.findall(r"a(\w+)",string ))                                    # ['springasummerasautumnawinter']

# Multiple groups: the captured parts of each match are placed in a tuple, and all tuples are returned in a list, similar to calling groups() on each match
print(re.findall(r"(a)(\w+)", string))                                  #[('a', 'springasummerasautumnawinter')]
print(re.findall(r"(a)(\w+)", 'aspring asummer asautumn awinter'))      #[('a', 'spring'), ('a', 'summer'), ('a', 'sautumn'), ('a', 'winter')]

# Nested groups: each tuple lists the groups in the order of their opening parentheses, so the outer group comes first, followed by the group nested inside it
print(re.findall(r"(a)(\w+(e))", string))                               # [('a', 'springasummerasautumnawinte', 'e')]
print(re.findall(r"(a)(\w+(e))", 'aspring asummer asautumn awinter'))   # [('a', 'summe', 'e'), ('a', 'winte', 'e')]

# ?: makes a group non-capturing, so findall() returns the whole match instead of only the group contents. This matters most for functions such as findall() whose return value depends on groups
print(re.findall(r"a(?:\w+)", string))                                  # ['aspringasummerasautumnawinter']
print(re.findall(r"a(?:\w+)", 'aspring asummer asautumn awinter'))      # ['aspring', 'asummer', 'asautumn', 'awinter']
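
The rule that matched characters do not take part in later matches means findall() never returns overlapping results. A minimal sketch with two illustrative strings:

```python
import re

# "ababa" contains "aba" at index 0 and at index 2, but the second one
# overlaps the first, so only one match is reported.
non_overlapping = re.findall("aba", "ababa")
print(non_overlapping)  # ['aba']

# The scan resumes after each match: "12", then "34"; the lone "5" is left over.
pairs = re.findall(r"\d\d", "12345")
print(pairs)  # ['12', '34']
```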

 

2. search()

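  search scans through the whole string and returns the first match of the pattern, or None if there is no match
  search(pattern, string, flags=0)
    pattern: the regular expression
    string : the string to search
    flags : matching flags

search only finds the first match and returns an object holding the match information; call the object's group() method to get the matched string. If nothing in the string matches, it returns None.
print(re.search('summer', 'spring summer autumn winter'))           # <re.Match object; span=(7, 13), match='summer'>
print(re.search('summer', 'spring summer autumn winter').group())      # summer
print(re.search('spring|summer', 'spring summer autumn winter'))       # <re.Match object; span=(0, 6), match='spring'>
print(re.search('spring|summer', 'spring summer +autumn winter').group())    # spring (still only one result)

r.group() returns the entire match, whether or not the pattern contains groups
r.groups() returns a tuple of the substrings captured by the groups in the pattern
r.groupdict() returns a dict of the substrings captured by the named groups (groups that define a key)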

import re
string = "spring summer autumn winter about also 18"

# No groups
r = re.search(r"a\w+", string)
print(r.group())                    # autumn
print(r.groups())                   # ()
print(r.groupdict())                # {}


# With groups
# Why use groups? To extract specific parts of a match (the whole pattern must match first, then the grouped parts are pulled out of it)
r = re.search(r"a(\w+).*(\d)", string)
print(r.group())                # autumn winter about also 18
print(r.groups())               # ('utumn', '8')
print(r.groupdict())            # {}


# Two groups that define a key
# ?P<> defines a key for the content matched by the group: the key name goes inside <>, and the value is the matched content
r = re.search(r"a(?P<n1>\w+).*(?P<n2>\d)", string)
print(r.group())                # autumn winter about also 18
print(r.groups())               # ('utumn', '8')
print(r.groupdict())            # {'n1': 'utumn', 'n2': '8'}
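
Because search() returns None when nothing matches, calling .group() directly on its result raises AttributeError for non-matching input. A common defensive pattern is sketched below; extract_year is a hypothetical helper introduced only for this illustration.

```python
import re

def extract_year(text):
    """Return the first run of four consecutive digits in text, or None."""
    m = re.search(r"(?P<year>\d{4})", text)
    if m is None:           # search() found nothing, so there is no group() to call
        return None
    return m.group("year")  # same value as m.group(1)

found = extract_year("posted in 2019, edited in 2021")
missing = extract_year("no digits here")
print(found)    # 2019
print(missing)  # None
```

Individual groups can be read by index (m.group(1)) or by name (m.group("year")), which is often more convenient than unpacking groups().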

 

3. match()

re.match(pattern, string[, flags=0])
Matches from the start of the string: returns a match object on success and None otherwise. search with a leading ^ can fully replace match (as long as re.MULTILINE is not set)
Note: match() and search() do essentially the same thing. The difference is that match() only matches a string that conforms to the pattern at the start of the string, whereas search() finds the first conforming string anywhere in the string
import re
string = 'spring summer autumn winter'

# No groups
print(re.match('summer', string))  # None
print(re.match('spring', string))  # <re.Match object; span=(0, 6), match='spring'>
print(re.match('spring', string).group()) # spring

r = re.match(r"s\w+", string)
print(r.group())                    # spring
print(r.groups())                   # ()
print(r.groupdict())                # {}

# With groups
r = re.match(r"s(\w+)", string)
print(r.group())                # spring
print(r.groups())               # ('pring',)
print(r.groupdict())            # {}

# Two groups that define a key
r = re.match(r"(?P<n1>s)(?P<n2>\w+)", string)
print(r.group())                # spring
print(r.groups())               # ('s', 'pring')
print(r.groupdict())            # {'n1': 's', 'n2': 'pring'}
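
The relationship between match() and search() can be checked side by side. The sketch below also uses re.fullmatch(), a related function not covered above that requires the whole string to match; the sample string is illustrative.

```python
import re

text = "spring summer"

# match() is anchored at the start; search() with ^ behaves the same way here.
match_summer = re.match("summer", text)
search_caret_summer = re.search("^summer", text)
print(match_summer, search_caret_summer)  # None None

# search() without ^ may match anywhere.
anywhere = re.search("summer", text).group()
print(anywhere)  # summer

# fullmatch() needs the pattern to cover the entire string.
partial = re.fullmatch("spring", text)
whole = re.fullmatch(r"spring \w+", text).group()
print(partial)  # None
print(whole)    # spring summer
```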

 

4. split()

  Splits the string at each match of the pattern and returns the pieces as a list

  split(pattern, string, maxsplit=0, flags=0)
    pattern: the regular expression
    string : the string to split
    maxsplit: the maximum number of splits (0 means no limit)
    flags : matching flags
import re

# Split the whole string on a single character
print(re.split("a", 'aspringasummeraautumnawinter'))  # ['', 'spring', 'summer', '', 'utumn', 'winter']
print(re.split('[ ::,;;,]','spring:summer:autumn,winter'))  # ['spring', 'summer', 'autumn', 'winter']

# Split using whatever the pattern matches as the delimiter
print(re.split(r"a\w+", 'spring asummer bautumn cwinter'))  # ['spring ', ' b', ' cwinter']
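
Two further behaviours of split() are worth knowing: maxsplit limits how many cuts are made, and a capturing group in the pattern keeps the delimiters in the result. A short sketch with made-up input:

```python
import re

# A delimiter followed by optional whitespace.
seasons = re.split(r"[,;]\s*", "spring, summer;autumn,  winter")
print(seasons)  # ['spring', 'summer', 'autumn', 'winter']

# maxsplit=2 makes at most two cuts; the rest stays in the last element.
limited = re.split(",", "a,b,c,d", maxsplit=2)
print(limited)  # ['a', 'b', 'c,d']

# With a capturing group, the matched delimiters are kept in the list.
with_delims = re.split(r"(,)", "a,b")
print(with_delims)  # ['a', ',', 'b']
```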
 

5. sub()

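  Replaces the substrings that match the pattern

  sub(pattern, repl, string, count=0, flags=0)
    pattern: the regular expression
    repl : the replacement string
    string : the string to search
    count : the maximum number of replacements (0 means replace all)
    flags : matching flags
print(re.sub('spring', 'summer', 'spring is the best season, summer is too hot, I still prefer spring'))   # summer is the best season, summer is too hot, I still prefer summer
print(re.sub('spring', 'summer', 'spring is the best season, summer is too hot, I still prefer spring',1))   # summer is the best season, summer is too hot, I still prefer spring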

 

l = ['1 2 ', '2   3', '  3 4']
print(eval(re.sub(r'\s*', '', str(l))))          # ['12', '23', '34']
print(re.sub(r'\s*', '', str(l)))                # ['12','23','34']
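
The repl argument of sub() does not have to be a fixed string: it can refer back to captured groups with \1, \2, and so on, or it can be a function that receives each match object. A minimal sketch, with sample data chosen for illustration:

```python
import re

# Reorder the captured groups: YYYY-MM-DD becomes DD/MM/YYYY.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2019-12-31")
print(reordered)  # 31/12/2019

# A function as repl: double every number found in the string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a1b22")
print(doubled)  # a2b44
```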
 
 

 

6. subn()

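  Replaces the substrings that match the pattern and also returns the number of replacements made; the two results can be unpacked into two variables

  subn(pattern, repl, string, count=0, flags=0)
    pattern: the regular expression
    repl : the replacement string
    string : the string to search
    count : the maximum number of replacements (0 means replace all)
    flags : matching flags
a, b = re.subn("spring", "summer", "spring summer autumn winter spring spring ")
print(a)            # summer summer autumn winter summer summer
print(b)            # 3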
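
The count returned by subn() is useful for reporting whether anything was changed. A small sketch that collapses runs of whitespace in an illustrative string:

```python
import re

cleaned, count = re.subn(r"\s+", " ", "spring   summer\t\tautumn")
print(cleaned)  # spring summer autumn
print(count)    # 2
```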
 
  

7. compile()

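Compiles a regular expression pattern and returns a pattern object. This function is the factory for the Pattern class: it turns a regular expression written as a string into a Pattern object. (Frequently used regular expressions can be compiled once into pattern objects and reused, which improves efficiency slightly.)

re.compile(pattern[,flags=0])
  • pattern: the expression string to compile.
  • flags: compilation flags that modify how the pattern matches, such as re.I (case-insensitive) and re.S. Several flags can be combined with the bitwise OR operator '|', for example re.I | re.M.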

 

obj=re.compile(r'\d{2}')
print(obj.search('12spring3456summer autumn winter').group())     # 12
print(obj.findall('12spring3456summer autumn winter'))            # ['12', '34', '56']

 

import re

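# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')
# Match text with the Pattern; the result is None when there is no match
match = pattern.match('hello world!')

if match:
    # Use the Match object to get the matched text
    print(match.group())                # hello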
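
Flags combined with | are passed once at compile time and then apply to every call on the pattern object. The sketch below combines re.I and re.M; the sample text is made up for illustration.

```python
import re

# re.I ignores case; re.M lets ^ match at the start of each line.
pattern = re.compile(r"^hello", re.I | re.M)

text = "Hello world\nhello again\nsay HELLO"
starts = pattern.findall(text)
print(starts)  # ['Hello', 'hello']
```

The third line is not matched because HELLO does not appear at the start of that line.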

 

8. finditer()

  Returns an iterator over the match results

ret = re.finditer(r'\d', '1spring2summer3autumn4winter')
print(ret)                              # <callable_iterator object at 0x02C43270>
print(next(ret).group())                # first result: 1
print(next(ret).group())                # second result: 2
print([i.group() for i in ret])         # remaining results: ['3', '4']
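
Unlike findall(), finditer() yields match objects, so the position of each match is available as well as its text. A minimal sketch with an illustrative string:

```python
import re

text = "1spring22summer333"

# Collect the text, start index and end index of every run of digits.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('1', 0, 1), ('22', 7, 9), ('333', 15, 18)]
```

Because the result is an iterator, matches are produced one at a time, which can help when scanning long strings.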

 

 

 

 

 

 

 

 

 

 

 

 

 

 


Origin www.cnblogs.com/Summer-skr--blog/p/12124014.html