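Regular Expression: Python Regular Expressions

Copyright notice: this is an original article by [onefine]. Please credit the source when reposting. https://blog.csdn.net/jiduochou963/article/details/86558318

Python Regular Expressions

Basic Concepts of Regular Expressions

Why Use Regular Expressions

Let's first look at a few examples that use plain string methods: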

# Plain string methods are enough for these cases
# Scenario 1: find the lines that start with "course"
def find_start_str(file_name, start_str):
    with open(file_name) as file:
        for line in file:
            if line.startswith(start_str):
                print(line)

# find_start_str('course.txt', 'course')


# Scenario 2: find the lines that start with "course" and end with "2019"
def find_start_end_str(file_name, start_str, end_str):
    with open(file_name) as file:
        for line in file:
            # Lines read from a file keep their trailing \n, so strip it first
            if line.startswith(start_str) \
                    and line.rstrip('\n').endswith(end_str):
                print(line)

# find_start_end_str('course.txt', 'course', '2019')

# Scenario 3: check that a variable name starts with an underscore or a letter
a = "_valuel"
print(bool(a) and (a[0] == '_' or 'a' <= a[0].lower() <= 'z'))

Note: contents of the course.txt file:

course Java 2019
course Html 2018
course Python 2010
course C++ 2018
course Python3 2019

C 0000
C# 0000
.net 0000
php 0000

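Problem: every kind of match needs its own function. Is there a simpler way? Regular expressions:

  1. Use a single string to describe and match a whole set of strings that follow some syntactic rule

  2. Are a kind of logical formula for operating on strings

  3. Use cases: processing text and data

  4. How matching works:

    The characters of the expression and of the text are taken out and compared one by one. If every character matches, the match succeeds; otherwise it fails.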
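As a rough sketch of how the three scenarios above collapse into patterns, the snippet below works on an in-memory list instead of reading course.txt. The sample list and the names START, START_END and IDENT are illustrative and do not come from the original functions.

```python
import re

lines = ["course Java 2019", "course Html 2018", "C 0000", "course Python3 2019"]

# Scenario 1: lines starting with "course"
START = re.compile(r"course")
# Scenario 2: lines starting with "course" and ending with "2019"
START_END = re.compile(r"course.*2019$")
# Scenario 3: a name whose first character is an underscore or a letter
IDENT = re.compile(r"[_a-zA-Z]")

starts = [ln for ln in lines if START.match(ln)]
starts_ends = [ln for ln in lines if START_END.match(ln)]

print(starts)
print(starts_ends)
print(bool(IDENT.match("_value")), bool(IDENT.match("9lives")))  # True False
```

Only the pattern changes from one scenario to the next; the surrounding loop stays the same.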

The Python re Module

Two ways to use it
>ipython
In [1]: import re

In [3]: type(re.compile(r'course'))
Out[3]: re.Pattern

In [4]:

In [4]: re.match(r'course','course Python3.x').group()
Out[4]: 'course'
>ipython
In [1]: str = 'course python 2019'

In [2]: str.find('one')
Out[2]: -1

In [3]: str.find('py')
Out[3]: 7

In [4]: str.startswith('python')
Out[4]: False

In [5]: str.startswith('course')
Out[5]: True

In [6]: import re

In [7]: pa = re.compile(r'course')

In [8]: type(pa)
Out[8]: re.Pattern

In [9]: re.Pattern.
                    findall()   fullmatch() match()     scanner()   sub()
                    finditer()  groupindex  mro()       search()    subn()
                    flags       groups      pattern     split()

In [9]: help(re.Pattern.match)
Help on method_descriptor:

match(self, /, string, pos=0, endpos=9223372036854775807)
    Matches zero or more characters at the beginning of the string.

In [10]:

Note: the r prefix marks a raw string, so backslashes are not processed as string escapes


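Attributes of the Match object

  1. string attribute:
    The string that was passed in for matching
>>> m = re.match(r'\d+','456abc')
>>> m.string
'456abc'
  2. re attribute:
    The Pattern object used for matching, i.e. the regular expression object that produced this match
>>> m
<re.Match object; span=(0, 3), match='456'>

>>> m.re
re.compile('\\d+')
  3. pos attribute:
    The index in the text at which the regex engine started searching. It has the same value as the parameter of the same name in Pattern.match() and Pattern.search()
>>> m.pos
0
  4. endpos attribute:
    The index in the text at which the regex engine stopped searching. It has the same value as the parameter of the same name in Pattern.match() and Pattern.search()
>>> m.endpos
6
  5. lastindex attribute:
    The integer index (the group number, not a position in the text) of the last captured group. None if no group was captured
>>> m= re.match(r'a(b)(c)d','abcdef')
>>> m.lastindex
2
  6. lastgroup attribute:
    The name of the last captured group. None if that group has no name or if no group was captured.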

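  7. group([group1, …]):
    Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 may be a number or a name; number 0 stands for the whole matched substring; the default is group(0)
    Example: passing several arguments to group()

p = re.compile('(a(b)c)d')
m = p.match('abcd')
resTup = m.group(1,2,1)
print(resTup)
>>>('abc', 'b', 'abc')
  8. groups([default=None])
    Returns the strings captured by all groups as a tuple. Equivalent to calling group(1,2,…last)

  9. start([group=0])
    Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0, i.e. the whole match

  10. end([group=0])
    Returns the end index in string of the substring captured by the given group (the index of its last character plus 1). group defaults to 0, i.e. the whole match

  11. span([group])
    Returns the tuple (start(group), end(group)), i.e. the start and end index of the text matched by a group within the string being matched

  12. expand(template)
    Substitutes the matched groups into template and returns the result. template may reference groups with \id, \g<id> or \g<name>, but \0 does not refer to group 0. \id is equivalent to \g<id>; however \10 is taken as group 10, so to express \1 followed by the character '0' you must write \g<1>0.

m = re.search(r'(\w+)! (\w+) (\w+)','HMan! How finny!')
# Substitute the matched groups into the template
print(m.expand(r'result:\3 \2 \1'))
>>> result:finny How HMan
  13. groupdict([default=None])
    Collects every matched group that has a name into a dictionary, with the name as key and the matched substring as value, and returns it. If the expression defines no named groups, an empty dictionary is returned
>>> m = re.search(r'(?P<num>\d+)(\w+)','78fd')
>>> m.groupdict()
{'num': '78'}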
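To see several of these attributes and methods side by side, here is a small sketch on one Match object. The sample text and the group names year and month are made up for illustration.

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2019-01, updated later")

print(m.group())                  # '2019-01', the whole match
print(m.group("year", "month"))   # ('2019', '01'), several arguments give a tuple
print(m.groups())                 # ('2019', '01')
print(m.start(), m.end())         # 9 16, end is one past the last matched character
print(m.span("month"))            # (14, 16)
print(m.lastindex, m.lastgroup)   # 2 month
print(m.groupdict())              # {'year': '2019', 'month': '01'}
print(m.expand(r"\g<month>/\g<year>"))  # 01/2019
```

Slicing m.string[m.start():m.end()] gives back the same text as m.group().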

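The match() method

  • Matches zero or more characters at the beginning of the string. Returns a Match object, or None.
  • The optional pos and endpos parameters give the start and end positions of the region of the string to match against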
In [10]: str
Out[10]: 'course python 2019'

In [11]: pa.match(str)
Out[11]: <re.Match object; span=(0, 6), match='course'>

In [12]: ma = pa.match(str)

In [13]: ma
Out[13]: <re.Match object; span=(0, 6), match='course'>

In [14]: type(ma)
Out[14]: re.Match

In [15]: re.Match.
                   end()       group()     lastgroup   pos         span()
                   endpos      groupdict() lastindex   re          start()
                   expand()    groups()    mro()       regs        string

In [15]: ma.group()
Out[15]: 'course'

In [16]: help(re.Match.group)
Help on method_descriptor:

group(...)
    group([group1, ...]) -> str or tuple.
    Return subgroup(s) of the match by indices or names.
    For 0 returns the entire match.

In [17]:
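The optional pos and endpos parameters of Pattern.match() are not exercised in the session above. A minimal sketch of their effect, using the same sample string (the variable names are illustrative):

```python
import re

pa = re.compile(r"python")
text = "course python 2019"

no_match = pa.match(text)         # "python" does not start at index 0
at_pos = pa.match(text, 7)        # begin matching at index 7
cut_off = pa.match(text, 7, 10)   # endpos=10 hides everything after "pyt"

print(no_match)        # None
print(at_pos.span())   # (7, 13)
print(cut_off)         # None
```

The reported span is still relative to the full string, not to pos.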

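The group() method

  • Returns a string or a tuple
  • Called with several group arguments it returns a tuple; otherwise it returns a string

Index range of the match in the original string: the span() method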

In [17]: ma.span()
Out[17]: (0, 6)

In [18]:

The string being matched: the string attribute

In [18]: ma.string
Out[18]: 'course python 2019'

The Pattern instance: the re attribute

In [19]: ma.re
Out[19]: re.compile(r'course', re.UNICODE)

In [20]:

Ignoring case:

In [20]: pa
Out[20]: re.compile(r'course', re.UNICODE)

In [21]: pa = re.compile('course', re.I)

In [22]: pa
Out[22]: re.compile(r'course', re.IGNORECASE|re.UNICODE)

In [23]: ma = pa.match('Course: Python3.x \n course:etc...')

In [24]: ma.group()
Out[24]: 'Course'

In [25]: ma = pa.match('Couse: Python3.x \n course:etc...')

In [26]: ma.group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-ad89060ab833> in <module>
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [27]: pa = re.compile('course')

In [28]: pa
Out[28]: re.compile(r'course', re.UNICODE)

In [29]: ma = pa.match('Couse: Python3.x \n course:etc...')

In [30]: ma.group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-18-ad89060ab833> in <module>
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [31]:

Returning the groups as a tuple: groups()

In [31]: pa = re.compile(r'(course)', re.I)

In [32]: pa
Out[32]: re.compile(r'(course)', re.IGNORECASE|re.UNICODE)

In [33]: pa.match('Course: Python3.x \n course:etc...').group()
Out[33]: 'Course'

In [34]: pa.match('Course: Python3.x \n course:etc...').groups()
Out[34]: ('Course',)

In [35]: pa = re.compile(r'course', re.I)

In [36]: pa
Out[36]: re.compile(r'course', re.IGNORECASE|re.UNICODE)

In [37]: pa.match('Course: Python3.x \n course:etc...').group()
Out[37]: 'Course'

In [38]: pa.match('Course: Python3.x \n course:etc...').groups()
Out[38]: ()

In [47]:

About groups()

In [52]: ma = pa.match('Course: Python3.x \n course:etc...')

In [53]: ma
Out[53]: <re.Match object; span=(0, 6), match='Course'>

In [54]: type(ma)
Out[54]: re.Match

In [55]: help(re.Match.groups)
Help on method_descriptor:

groups(self, /, default=None)
    Return a tuple containing all the subgroups of the match, from 1.

    default
      Is used for groups that did not participate in the match.

In [56]:
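  1. The approach above first creates a compiled Pattern object and then matches strings with it. Next, calling match directly:

  2. Calling re.match() directly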

In [56]: help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.


In [57]:

The re.match() function

  • Tries to apply the pattern at the start of the string, returning a Match object, or None if no match was found.
In [57]: ma = re.match(r'course','course python3.x etc..')

In [58]: ma
Out[58]: <re.Match object; span=(0, 6), match='course'>

In [59]: type(ma)
Out[59]: re.Match

In [60]: ma.group()
Out[60]: 'course'

In [61]:

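Note: this form suits patterns that are used only a few times, because every call has to obtain a Pattern object first. The re module caches recently compiled patterns, so the cost is usually small, but a precompiled Pattern skips that step entirely.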
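A short sketch of the two calling styles side by side, assuming a handful of sample lines (the list and variable names are illustrative). Both produce the same result; the compiled form simply reuses one Pattern object.

```python
import re

lines = ["course Java 2019", "C 0000", "course Python3 2019"]

# Style 1: compile once, then reuse the Pattern for every line.
pa = re.compile(r"course")
compiled_hits = [ln for ln in lines if pa.match(ln)]

# Style 2: pass the pattern string to the module-level function each time.
direct_hits = [ln for ln in lines if re.match(r"course", ln)]

print(compiled_hits == direct_hits)  # True
```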


In [10]: re.
             A              copyreg     enum            findall     functools
             L              Match       Pattern         S           split
             sub            TEMPLATE    UNICODE         ASCII       DEBUG
             error          finditer    I               LOCALE      match
             purge          Scanner     sre_compile     subn        template
             VERBOSE        compile     DOTALL          escape      fullmatch
             IGNORECASE     M           MULTILINE       RegexFlag   search
             sre_parse      T           U               X


In [10]: str
Out[10]: 'course python 2019'

In [11]: pa = re.compile(r'2019')

In [12]: pa.match(str)

In [13]: pa.search(str)
Out[13]: <re.Match object; span=(14, 18), match='2019'>

In [14]: sc = pa.search(str)

In [15]: sc.group
Out[15]: <function Match.group>

In [16]: sc.group()
Out[16]: '2019'

In [17]: pa = re.compile(r'course')

In [18]: pa.match(str).group()
Out[18]: 'course'

In [22]: help(sc.group)
Help on built-in function group:

group(...) method of re.Match instance
    group([group1, ...]) -> str or tuple.
    Return subgroup(s) of the match by indices or names.
    For 0 returns the entire match.


In [23]:

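Regular Expression Syntax

Regular expression syntax 1:

Character   Matches
.           any character (except \n)
[…]         a character set
\d/\D       a digit / a non-digit
\s/\S       a whitespace / a non-whitespace character
\w/\W       a word character [a-zA-Z0-9_] (plus Unicode letters and digits for str patterns) / a non-word character

Note: […] matches any single one of the characters inside the brackets; matching several characters is covered below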

In [3]: import re

In [4]: ma = re.match(r'{[abc]}','{b}')

In [5]: ma.group()
Out[5]: '{b}'

In [6]: ma = re.match(r'{[abc]}','{d}')

In [7]: type(ma)
Out[7]: NoneType

In [8]: ma = re.match(r'{[.]}','{d}')

In [9]: ma.group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-ad89060ab833> in <module>
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [10]: ma = re.match(r'{.}','{d}')

In [11]: ma.group()
Out[11]: '{d}'

In [16]: ma = re.match(r'{\[[abcd]\]}','{[d]}')

In [17]: ma.group()
Out[17]: '{[d]}'

In [18]:

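Note: to match literal [ and ] in the text, escape them in the pattern. Inside […] a . is a literal dot, which is why In [8] fails to match {d}.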
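The shorthand classes from the table can be tried out quickly with findall. This is only an illustrative sketch; the sample text is made up.

```python
import re

text = "id_42 costs 3.50 USD\n"

digits = re.findall(r"\d", text)         # single digits
words = re.findall(r"\w+", text)         # runs of word characters, underscore included
spaces = re.findall(r"\s", text)         # spaces and the trailing newline
dot_in_class = re.findall(r"[.]", text)  # inside [] the dot is literal

print(digits)        # ['4', '2', '3', '5', '0']
print(words)         # ['id_42', 'costs', '3', '50', 'USD']
print(spaces)        # [' ', ' ', ' ', '\n']
print(dot_in_class)  # ['.']
```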

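Regular expression syntax 2:

Character   Matches
*           the preceding character 0 or more times
+           the preceding character 1 or more times
?           the preceding character 0 or 1 times
{m}/{m,n}   the preceding character exactly m times / from m to n times
*?/+?/??    switches matching to non-greedy (matches as few characters as possible)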
In [18]: ma = re.match(r'[A-Z][a-z]*','A')

In [19]: ma.group()
Out[19]: 'A'

In [20]: ma = re.match(r'[A-Z][a-z]','A')

In [21]: ma.group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-ad89060ab833> in <module>
----> 1 ma.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [22]: ma = re.match(r'[A-Z][a-z]*','AbcdefgHIJK')

In [23]: ma.group()
Out[23]: 'Abcdefg'

In [24]:

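An example:

Python variable naming rules

  • Variable names may contain only letters, digits and underscores. They may start with a letter or an underscore, but not with a digit.
  • Variable names cannot contain spaces, but underscores can be used to separate words.
  • Do not use Python keywords or function names as variable names, i.e. avoid words that Python reserves for special purposes, such as print.
  • Variable names should be both short and descriptive.
  • Be careful with the lowercase letter l and the uppercase letter O, because they can be mistaken for the digits 1 and 0.
    Note: Python variable names should be lowercase. Using uppercase letters in a variable name does not cause an error, but avoiding them is a good idea.

We focus on the first two points and describe them with a regular expression: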

In [38]: while True:
    ...:     pa = re.compile(r'[_a-zA-Z][_\w]*')
    ...:     text = input("Enter a string: ")
    ...:     ma = pa.match(text)
    ...:     if ma is not None:
    ...:         print(ma.group())
    ...:

In [44]: while True:
    ...:     pa = re.compile(r'[_a-zA-Z][_a-zA-Z0-9]*')
    ...:     text = input("Enter: ")
    ...:     ma = pa.match(text)
    ...:     if ma is not None:
    ...:         print(ma.group())
    ...:

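In [44] improves on In [38]: with In [38], entering __哈哈 still matches in full, because \w also matches Unicode word characters, whereas In [44] matches only __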
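The loops in In [38] and In [44] use match(), which accepts any string that merely begins with a valid name. One possible variation, sketched below, wraps the In [44] pattern in a helper and uses fullmatch() so that the whole input must be a name. The helper name is_ascii_identifier is hypothetical, and the check is deliberately ASCII-only even though Python itself allows Unicode identifiers.

```python
import re

# Same character classes as In [44]
IDENTIFIER = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

def is_ascii_identifier(name):
    """Return True if the entire string is an ASCII-only variable name."""
    return IDENTIFIER.fullmatch(name) is not None

print(is_ascii_identifier("_value1"))  # True
print(is_ascii_identifier("1value"))   # False, starts with a digit
print(is_ascii_identifier("my var"))   # False, contains a space
print(is_ascii_identifier("__哈哈"))    # False, non-ASCII characters
```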

Matching numbers from 0 to 99:

In [5]: ma = re.match(r'[1-9]?[0-9]', '0')

In [6]: ma.group()
Out[6]: '0'

In [7]: ma = re.match(r'[1-9]?[0-9]', '10')

In [8]: ma.group()
Out[8]: '10'

In [9]: ma = re.match(r'[1-9]?[0-9]', '99')

In [10]: ma.group()
Out[10]: '99'

In [11]: ma = re.match(r'[1-9]?[0-9]', '100')

In [12]: ma.group()
Out[12]: '10'

In [13]:

Matching a specified number of times:

In [21]: re.match(r'[a-z]{2}','haha').group()
Out[21]: 'ha'

In [22]: re.match(r'[a-z]{2,3}','haha').group()
Out[22]: 'hah'

In [23]: re.match(r'[a-z]{2,4}','hahaxxxioflsaj').group()
Out[23]: 'haha'

In [24]: re.match(r'[a-z]{2,4}','ha3ha').group()
Out[24]: 'ha'

In [25]: re.match(r'[a-z]{2,4}','hah3ha').group()
Out[25]: 'hah'

In [26]: re.match(r'[a-z]{2,4}','hahx3ha').group()
Out[26]: 'hahx'

In [27]:

Non-greedy mode: *?, +?, ??

In [28]: ma = re.match(r'[0-9][a-z]*','1abcdef').group()

In [29]: re.match(r'[0-9][a-z]*','1abcdef').group()
Out[29]: '1abcdef'

In [30]: re.match(r'[0-9][a-z]*?','1abcdef').group()
Out[30]: '1'

In [31]: re.match(r'[0-9][a-z]+?','1abcdef').group()
Out[31]: '1a'

In [32]: re.match(r'[0-9][a-z]?','1abcdef').group()
Out[32]: '1a'

In [33]: re.match(r'[0-9][a-z]+','1abcdef').group()
Out[33]: '1abcdef'

In [34]: re.match(r'[0-9][a-z]??','1abcdef').group()
Out[34]: '1'

In [35]: re.match(r'[0-9][a-z]?','1abcdef').group()
Out[35]: '1a'

In [36]:
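The difference between greedy and non-greedy matching shows up most clearly when the text contains several possible end points. A small illustrative sketch (the HTML-like sample string is made up):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.match(r"<.+>", html).group()   # runs to the last '>'
lazy = re.match(r"<.+?>", html).group()    # stops at the first '>'

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```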

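Regular expression syntax 3: boundaries

Character   Matches
^           the start of the string
$           the end of the string
\A/\Z       requires the pattern to appear at the very start / very end of the string

Motivating problem: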

In [37]: re.match(r'[a-zA-Z]{4,10}@onefine.top','[email protected]').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-5c2cf60a1190> in <module>
----> 1 re.match(r'[a-zA-Z]{4,10}@onefine.top','[email protected]').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [38]: re.search(r'[a-zA-Z]{4,10}@onefine.top','[email protected]').group()
Out[38]: '[email protected]'

In [39]: re.search(r'^[a-zA-Z]{4,10}@onefine.top$','[email protected]').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-70e4e4e6988c> in <module>
----> 1 re.search(r'^[a-zA-Z]{4,10}@onefine.top$','[email protected]').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [40]: re.search(r'^[a-zA-Z]{4,10}@onefine.top','[email protected]').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-3c138c9aac5e> in <module>
----> 1 re.search(r'^[a-zA-Z]{4,10}@onefine.top','[email protected]').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [41]: re.search(r'^[a-zA-Z]{4,10}@onefine.top$','[email protected]').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-41-4ca35adfb0df> in <module>
----> 1 re.search(r'^[a-zA-Z]{4,10}@onefine.top$','[email protected]').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [42]: re.search(r'^[a-zA-Z]{4,10}@onefine.top$','[email protected]').group()
Out[42]: '[email protected]'

In [43]: re.match(r'^[a-zA-Z]{4,10}@onefine.top$','[email protected]').group()
Out[43]: '[email protected]'

In [45]: re.match(r'^[a-zA-Z]{4,10}@onefine.top','[email protected]').group()
Out[45]: '[email protected]'

In [46]:

Usage of \A/\Z, which require the pattern to appear at the start / end of the string.
Here the string must start with coding and end with top:

In [52]: re.match(r'\A(coding)[\w]*@onefine.(top)\Z','[email protected]').group()
Out[52]: '[email protected]'

In [53]:
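A self-contained sketch of how the anchors change the outcome. The two sample addresses are hypothetical, and unlike the session above the dot in the domain is escaped as \. so that it only matches a literal dot.

```python
import re

pattern = r"[a-zA-Z]{4,10}@onefine\.top"
good = "coding@onefine.top"          # hypothetical sample address
noisy = "123coding@onefine.top.cn"   # extra text on both sides

unanchored = re.search(pattern, noisy)
anchored = re.search(r"^" + pattern + r"$", noisy)
strict = re.search(r"\A" + pattern + r"\Z", good)

print(unanchored.group())  # coding@onefine.top
print(anchored)            # None
print(strict.group())      # coding@onefine.top
```

Without anchors, search() happily picks a valid-looking address out of the middle of a longer string; with them, the whole string has to fit.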

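Regular expression syntax 4: grouping

Character    Matches
|            either the expression on its left or the one on its right
(ab)         the expression in parentheses as a group
\<number>    the string matched by group number <number>; numbering starts at 1
(?P<name>)   gives the group the name name
(?P=name)    the string matched by the group named name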
In [54]: re.match(r'abc|d','d').group()
Out[54]: 'd'

In [55]: re.match(r'abc|d','abc').group()
Out[55]: 'abc'

In [56]: re.match(r'abc|d','c').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-56-dfcd4a662ebe> in <module>
----> 1 re.match(r'abc|d','c').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [57]:

Matching strings from 0 to 100:

In [57]: re.match(r'[1-9]?\d$|100','100').group()
Out[57]: '100'

In [58]: re.match(r'[1-9]?\d$|100','99').group()
Out[58]: '99'

In [59]: re.match(r'[1-9]?\d$|100','1').group()
Out[59]: '1'

In [60]: re.match(r'[1-9]?\d$|100','0').group()
Out[60]: '0'

In [61]:
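In the pattern [1-9]?\d$|100 the $ belongs only to the left alternative, so a longer input such as '1000' is still accepted through the 100 branch. One possible tightening, sketched here with a hypothetical helper in_range, is to use fullmatch() so that both alternatives must cover the whole string.

```python
import re

ZERO_TO_100 = re.compile(r"100|[1-9]?\d")

def in_range(text):
    """Return True if text is exactly a number from 0 to 100 without leading zeros."""
    return ZERO_TO_100.fullmatch(text) is not None

# The original pattern matches a prefix of '1000'
print(re.match(r"[1-9]?\d$|100", "1000").group())   # 100

print([s for s in ["0", "7", "99", "100", "101", "1000", "007"] if in_range(s)])
# ['0', '7', '99', '100']
```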

Matching several email providers:

In [63]: re.match(r'[\w]{4,8}@(163|126).com','[email protected]').group()
Out[63]: '[email protected]'

In [64]: re.match(r'[\w]{4,8}@(163|126).com','[email protected]').group()
Out[64]: '[email protected]'

In [65]: re.match(r'[\w]{4,8}@(163|126).com','[email protected]').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-65-afd701da2fb6> in <module>
----> 1 re.match(r'[\w]{4,8}@(163|126).com','[email protected]').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [66]:

XML: referencing groups

In [80]: re.match(r'<([\w]+>)[\w]+<\\\1','<img>onefine<\img>').group()
Out[80]: '<img>onefine<\\img>'

In [81]: re.match(r'<([\w]+>)[\w]+</\1','<img>onefine</img>').group()
Out[81]: '<img>onefine</img>'

In [82]:

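In [80] was written incorrectly; note that the \\ in it is an escaped \
In [81] is the correct version, but the mistake led to a new discovery:

Regex escaping

First, an example: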

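import re

string = '3\\8'  # the three characters 3, \ and 8
m = re.search('(\\d+)\\\\', string)  # normal string: every backslash is doubled

if m is not None:
    print(m.group(1))  # prints: 3

n = re.search(r'(\d+)\\', string)  # raw string: only the regex-level escape remains

if n is not None:
    print(n.group(1))  # prints: 3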

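A regular expression string goes through two rounds of escaping: the "string escaping" and the "regex escaping" seen above. In my view, string escaping always happens before regex escaping.

1) What happens with '\\\\':
String escaping comes first: the first two backslashes and the last two backslashes are each turned into a single backslash, i.e. "\\|\\" becomes "\|\" (the "|" is only there for readability; please ignore it). Regex escaping follows immediately: "\\" is turned into "\", meaning the regex needs to match one backslash.

2) What happens with r'\\':
Because every character in a raw string is taken literally and special characters are not escaped, no string escaping takes place and the process goes straight to the second step, regex escaping, where "\\" is turned into "\", meaning the regex needs to match one backslash.

Conclusion: a raw string (i.e. r'...') has nothing to do with regex escaping. It only affects string escaping, sparing the string one round of escaping.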
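The two-step view can be checked directly by comparing the strings before they ever reach the re module. A minimal sketch (the variable names are illustrative):

```python
import re

plain = "\\\\"   # string escaping turns four typed backslashes into two characters
raw = r"\\"      # raw string: the two typed backslashes are kept as they are

print(plain == raw, len(plain))   # True 2

text = "3\\8"    # the three characters 3, \ and 8
first = re.search("(\\d+)" + plain, text).group(1)
second = re.search(r"(\d+)" + raw, text).group(1)
print(first, second)              # 3 3
```

Both spellings hand the regex engine the same two-character pattern text, which the engine then reads as one literal backslash.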

Naming a group and referencing it, i.e. using the last two table entries together:

In [97]: re.match(r'<(?P<mark>[\w]+>)[\w]+</(?P=mark)','<img>onefine</img>').group()
Out[97]: '<img>onefine</img>'
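A slightly varied sketch of the same idea: here the closing > sits outside the named group, a second named group (body, a name chosen for illustration) captures the content, and a mismatched closing tag is shown to fail.

```python
import re

TAG = re.compile(r"<(?P<mark>\w+)>(?P<body>\w+)</(?P=mark)>")

ok = TAG.match("<img>onefine</img>")
bad = TAG.match("<img>onefine</div>")   # closing tag differs from the opening one

print(ok.group("mark"), ok.group("body"))  # img onefine
print(ok.groupdict())                      # {'mark': 'img', 'body': 'onefine'}
print(bad)                                 # None
```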

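Using the re module's functions

Some important functions in the re module

Function                              Description
compile(pattern[, flags])             creates a pattern object from a string containing a regular expression
search(pattern, string[, flags])      searches the string for the pattern
match(pattern, string[, flags])       matches the pattern at the start of the string
split(pattern, string[, maxsplit=0])  splits the string on the pattern
findall(pattern, string)              returns a list of all substrings of the string that match the pattern
sub(pat, repl, string[, count=0])     replaces every substring that matches pat with repl
escape(string)                        escapes all regex special characters in the string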
C:\Users\ONEFINE>ipython
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import re

In [2]: help(re.match)
Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.
In [3]: help(re.search)
Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.

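The search() method:

  • Scans through string looking for the first location where the pattern produces a match, and returns the corresponding Match object. Returns None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.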
In [4]: help(re.findall)
Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.

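The findall() method

  • Returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If the pattern contains one or more groups, a list of groups is returned; if the pattern has more than one group, this is a list of tuples. Empty matches are included in the result.

Example: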

In [15]: re.search(r':\d+','Python3.x:4000 C++:5000 JAVA:4800').group()[1:]
Out[15]: '4000'

In [18]: re.findall(r':\d+','Python3.x:4000 C++:5000 JAVA:4800')
Out[18]: [':4000', ':5000', ':4800']

In [19]:
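How the shape of the findall() result depends on the number of groups can be sketched with the same sample string (the variable names are illustrative):

```python
import re

text = "Python3.x:4000 C++:5000 JAVA:4800"

whole = re.findall(r":\d+", text)         # no group: the full matches
numbers = re.findall(r":(\d+)", text)     # one group: only the group text
pairs = re.findall(r"(\S+):(\d+)", text)  # two groups: a list of tuples

print(whole)    # [':4000', ':5000', ':4800']
print(numbers)  # ['4000', '5000', '4800']
print(pairs)    # [('Python3.x', '4000'), ('C++', '5000'), ('JAVA', '4800')]
```

Using a group removes the need for the [1:] slice in In [15].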
In [5]: help(re.compile)
Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a Pattern object.
In [6]: help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.

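The split() method

  • Splits the string at the occurrences of the pattern. If capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list.
  • If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

Example: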

In [50]: str_2 = "course: Python C C++ C# JAVA,etc"

In [51]: re.split(r':| |,',str_2)
Out[51]: ['course', '', 'Python', 'C', 'C++', 'C#', 'JAVA', 'etc']

In [52]: re.split(r' :| |,',str_2)
Out[52]: ['course:', 'Python', 'C', 'C++', 'C#', 'JAVA', 'etc']

In [53]:
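The empty string in Out[51] appears because ":" and the following space are treated as two separate delimiters. A short sketch of three variations: a character class with + to swallow runs of delimiters, a capturing group that keeps the delimiter, and maxsplit (variable names are illustrative).

```python
import re

str_2 = "course: Python C C++ C# JAVA,etc"

plain = re.split(r"[:, ]+", str_2)            # runs of delimiters count once
with_seps = re.split(r"(,)", "JAVA,etc")      # captured delimiter is kept
limited = re.split(r" ", str_2, maxsplit=2)   # at most two splits

print(plain)      # ['course', 'Python', 'C', 'C++', 'C#', 'JAVA', 'etc']
print(with_seps)  # ['JAVA', ',', 'etc']
print(limited)    # ['course:', 'Python', 'C C++ C# JAVA,etc']
```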
In [7]: help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.

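The sub() method

  • Returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string with the replacement repl. If the pattern is not found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so on. Unknown escapes of ASCII letters such as \j are an error as of Python 3.7, while other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

Example with repl as a string: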

In [24]: str = "course Python videonum: 1000"

In [25]: re.sub(r'\d+','9631', str)
Out[25]: 'course Python videonum: 9631'

In [26]:

Used together with groups:

In [23]: re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',r'static PyObject*\npy_\1(void)\n{','def myfunc():')
Out[23]: 'static PyObject*\npy_myfunc(void)\n{'

Example with repl as a function:

In [43]: def add(match):
    ...:     val = match.group()
    ...:     num = int(val) + 1
    ...:     return str(num)
    ...:
    ...:

In [44]: str
Out[44]: 'course Python videonum: 1000'

In [45]: del(str)

In [46]: str_1 = 'course Python videonum: 1000'

In [48]: re.sub(r'\d+', add, str_1)
Out[48]: 'course Python videonum: 1001'

In [49]:

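Tip: str is a Python built-in type. Never use it as a variable name. The error TypeError: 'str' object is not callable literally means that a str object cannot be called: once the name str refers to a string, str(num) tries to call that string instead of the built-in, which is why the session above runs del(str) before calling add.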
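The same replacements can be written without shadowing str at all. A compact sketch, with the hypothetical names add_one and text, that also shows a backreference in repl and the related subn() function:

```python
import re

def add_one(match):
    """Replacement callback: return the matched number plus one, as text."""
    return str(int(match.group()) + 1)

text = "course Python videonum: 1000"

bumped = re.sub(r"\d+", add_one, text)                      # repl is a function
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")   # repl uses backreferences
new_text, count = re.subn(r"\d", "#", "a1b22")              # also reports the count

print(bumped)           # course Python videonum: 1001
print(swapped)          # world hello
print(new_text, count)  # a#b## 3
```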

In [8]: help(re.escape)
Help on function escape in module re:

escape(pattern)
    Escape special characters in a string.


In [9]: help(re.Pattern)
Help on class Pattern in module re:

class Pattern(builtins.object)
 |  Compiled regular expression object.
 |
 |  Methods defined here:
 |
 |  __copy__(self, /)
 |
 |  __deepcopy__(self, memo, /)
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __gt__(self, value, /)
 |      Return self>value.
 |
 |  __hash__(self, /)
 |      Return hash(self).
 |
 |  __le__(self, value, /)
 |      Return self<=value.
 |
 |  __lt__(self, value, /)
 |      Return self<value.
 |
 |  __ne__(self, value, /)
 |      Return self!=value.
 |
 |  __repr__(self, /)
 |      Return repr(self).
 |
 |  findall(self, /, string, pos=0, endpos=9223372036854775807)
 |      Return a list of all non-overlapping matches of pattern in string.
 |
 |  finditer(self, /, string, pos=0, endpos=9223372036854775807)
 |      Return an iterator over all non-overlapping matches for the RE pattern in string.
 |
 |      For each match, the iterator returns a match object.
 |
 |  fullmatch(self, /, string, pos=0, endpos=9223372036854775807)
 |      Matches against all of the string.
 |
 |  match(self, /, string, pos=0, endpos=9223372036854775807)
 |      Matches zero or more characters at the beginning of the string.
 |
 |  scanner(self, /, string, pos=0, endpos=9223372036854775807)
 |
 |  search(self, /, string, pos=0, endpos=9223372036854775807)
 |      Scan through string looking for a match, and return a corresponding match object instance.
 |
 |      Return None if no position in the string matches.
 |
 |  split(self, /, string, maxsplit=0)
 |      Split string by the occurrences of pattern.
 |
 |  sub(self, /, repl, string, count=0)
 |      Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
 |
 |  subn(self, /, repl, string, count=0)
 |      Return the tuple (new_string, number_of_subs_made) found by replacing the leftmost non-overlapping occurrences of pattern with the replacement repl.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  flags
 |      The regex matching flags.
 |
 |  groupindex
 |      A dictionary mapping group names to group numbers.
 |
 |  groups
 |      The number of capturing groups in the pattern.
 |
 |  pattern
 |      The pattern string from which the RE object was compiled.


In [10]: help(re.I)
Help on RegexFlag in module re object:

class RegexFlag(enum.IntFlag)
 |  RegexFlag(value, names=None, *, module=None, qualname=None, type=None, start=1)
 |
 |  An enumeration.
 |
 |  Method resolution order:
 |      RegexFlag
 |      enum.IntFlag
 |      builtins.int
 |      enum.Flag
 |      enum.Enum
 |      builtins.object
 |
 |  Data and other attributes defined here:
 |
 |  ASCII = <RegexFlag.ASCII: 256>
 |
 |  DEBUG = <RegexFlag.DEBUG: 128>
 |
 |  DOTALL = <RegexFlag.DOTALL: 16>
 |
 |  IGNORECASE = <RegexFlag.IGNORECASE: 2>
 |
 |  LOCALE = <RegexFlag.LOCALE: 4>
 |
 |  MULTILINE = <RegexFlag.MULTILINE: 8>
 |
 |  TEMPLATE = <RegexFlag.TEMPLATE: 1>
 |
 |  UNICODE = <RegexFlag.UNICODE: 32>
 |
 |  VERBOSE = <RegexFlag.VERBOSE: 64>
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from enum.Enum:
 |
 |  name
 |      The name of the Enum member.
 |
 |  value
 |      The value of the Enum member.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from enum.EnumMeta:
 |
 |  __members__
 |      Returns a mapping of member name->value.
 |
 |      This mapping lists all enum members, including aliases. Note that this
 |      is a read-only view of the internal mapping.


In [11]:

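Flags

A regular expression can be given flag modifiers that control how it matches. They are passed as the optional flags argument of the regex functions.

Flag               Description
re.I(IGNORECASE)   ignores case
re.M(MULTILINE)    multi-line mode, changes the behavior of '^' and '$'
re.S(DOTALL)       changes the behavior of '.'
re.X(VERBOSE)      lets you write comments in the expression

Notes:

  • Besides the flags above there are also re.L and re.U, but they are rarely used
  • Several flags can be combined with the "|" operator so that they all take effect, for example: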
>ipython
In [1]: str_1 = 'Hello world'
In [2]: import re

In [3]: re.search(r'wo..d', str_1, re.I|re.M).group()
Out[3]: 'world'

In [4]:
re.I(re.IGNORECASE): ignore case when matching
In [1]: import re

In [2]: re.search('a','abc').group()
Out[2]: 'a'

In [3]: re.search('a','Abc').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-94da83bf8997> in <module>
----> 1 re.search('a','Abc').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [4]: re.search('a','Abc',re.I).group()
Out[4]: 'A'

In [5]:
M(MULTILINE): multi-line mode, changes the behavior of ^ and $
In [6]: re.search('foo.$', 'foo2\nfoo3\n').group()
Out[6]: 'foo3'

In [7]: re.search('foo.$', 'foo2\nfoo3\n', re.M).group()
Out[7]: 'foo2'

In [8]:

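Question: does $ not include the newline? Without re.M, $ matches at the end of the string and also just before a newline that ends the string, so In [6] finds foo3. With re.M, $ additionally matches before every newline, so In [7] finds the end of the first line, foo2.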
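The behavior of $ can be made visible by collecting every match instead of only the first one. A small sketch on the same text (the variable names are illustrative); \Z is included for contrast because it matches only at the very end of the string.

```python
import re

text = "foo2\nfoo3\n"

default_hits = re.findall(r"foo.$", text)          # $ only near the end of the string
multiline_hits = re.findall(r"foo.$", text, re.M)  # $ before every newline
strict_end = re.search(r"foo.\Z", text)            # \Z ignores the trailing newline rule

print(default_hits)    # ['foo3']
print(multiline_hits)  # ['foo2', 'foo3']
print(strict_end)      # None
```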

S(DOTALL): changes the behavior of ., making the dot match every character including a newline
- make the '.' special character match any character at all, including a newline; without flag, '.' will match anything except a newline.
In [15]: re.search('.', '\n').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-d4cd05f33b46> in <module>
----> 1 re.search('.', '\n').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [16]: re.search('.', '\n', re.S).group()
Out[16]: '\n'

In [17]:

Note: with S, . matches every character, including the newline

X(re.VERBOSE): in this mode the regular expression may span several lines, whitespace is ignored, and comments can be added to make it more readable.
In [22]: re.search('.  # test', 'onefine', re.X).group()
Out[22]: 'o'

In [23]: re.search('.# test', 'onefine', re.X).group()
Out[23]: 'o'

In [24]: re.search('.       # test', 'onefine', re.X).group()
Out[24]: 'o'

In [25]:  re.search('.  # test', 'onefine').group()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-bffa2d2d5cfc> in <module>
----> 1 re.search('.  # test', 'onefine').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [26]:  re.search('.  # test', 'o  # test  haha').group()
Out[26]: 'o  # test'

In [27]:

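Notes:

  • Whitespace before # is ignored
  • With this flag, whitespace in the RE string is ignored unless it is inside a character class or preceded by a backslash.
  • It also lets you write comments in the RE; the engine ignores them.
  • Comments are marked with #, unless the # is inside a character class or preceded by a backslash.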
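The rules in the notes above can be seen together in one multi-line pattern. This is only an illustrative sketch; the pattern name NUMBER, the group names and the sample text are made up.

```python
import re

NUMBER = re.compile(r"""
    (?P<int>\d+)          # integer part
    (?:\.(?P<frac>\d+))?  # optional fractional part
    [ ]                   # a literal space has to sit in a character class
    \#                    # a literal hash needs a backslash
    """, re.X)

m = NUMBER.search("price 3.50 #sale")

print(m.group())                     # 3.50 #
print(m.group("int"), m.group("frac"))  # 3 50
```

Laying the pattern out this way costs nothing at match time, since the whitespace and comments are dropped when the pattern is compiled.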



Reposted from blog.csdn.net/jiduochou963/article/details/86558318