Regular expressions and the Python re module in detail

A regular expression is a small set of rules for matching text.

    ① metacharacters

rule usage
\d Matches a digit; equivalent to [0-9]
\w Matches a letter, digit, or underscore; equivalent to [_a-zA-Z0-9]
\s Matches whitespace: space, newline, or tab
\D Matches anything except a digit
\W Matches anything except a letter, digit, or underscore
\S Matches anything except whitespace (space, newline, tab)
. Matches any character except a newline; in the re module, the re.S flag lets it match newlines too
[] Square brackets define a character class: a set of characters, any one of which may match at that position
[^] A negated character class: matches any character not listed
^ Matches the start of the string
$ Matches the end of the string
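
To see the metacharacters above in action, here is a small sketch. The sample strings and variable names are made up for illustration; only standard re calls are used.

```python
import re

# Hypothetical sample text containing digits, an underscore, a tab and a newline
text = "id_7 = 42\tok\nnext line"

digits = re.findall(r"\d", text)        # every single digit
words = re.findall(r"\w+", text)        # runs of letters, digits, underscores
spaces = re.findall(r"\s", text)        # every whitespace character
leading_non_digits = re.match(r"\D+", text).group()

# "." stops at a newline unless the re.S flag is passed
dot_default = re.match(r".+", text).group()
dot_with_flag = re.match(r".+", text, re.S).group()

# Character class and negated character class
vowels = re.findall(r"[aeiou]", "regex rules")
non_vowels = re.findall(r"[^aeiou ]", "regex rules")

# ^ anchors to the start of the string, $ to the end
first_and_last = re.findall(r"^\w+|\w+$", text)

print(digits, words, spaces, leading_non_digits)
print(repr(dot_default), repr(dot_with_flag))
print(vowels, non_vowels, first_and_last)
```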

    ② quantifiers

rule usage
? Match 0 or 1 times
+ Match 1 or more times
* Match 0 or more times
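
A quick way to compare the three quantifiers is to run them against the same candidates. This is only a sketch; the sample list is invented, and re.fullmatch is used so the whole string has to match.

```python
import re

# Hypothetical candidates with 0, 1, 2 and 3 "a" characters
samples = ["ct", "cat", "caat", "caaat"]

optional = [s for s in samples if re.fullmatch(r"ca?t", s)]      # 0 or 1 "a"
one_or_more = [s for s in samples if re.fullmatch(r"ca+t", s)]   # 1 or more "a"
zero_or_more = [s for s in samples if re.fullmatch(r"ca*t", s)]  # 0 or more "a"

print(optional)
print(one_or_more)
print(zero_or_more)
```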

    ③ greedy and non-greedy matching

    A quantifier always matches as many times as its range allows - greedy
    Adding ? after a quantifier makes it match as few times as possible - non-greedy (lazy)
    .*?x matches any content, any number of times, stopping at the first x
    .+?x matches any content, at least once, stopping at the first x that follows
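
The difference is easiest to see side by side. The strings below are made-up samples, so treat this as a sketch rather than anything definitive.

```python
import re

text = "aaxbbxcc"

greedy = re.match(r".*x", text).group()    # runs on to the last x
lazy = re.match(r".*?x", text).group()     # stops at the first x

# .+? must consume at least one character before it may stop at an x
lazy_star = re.match(r".*?x", "xax").group()
lazy_plus = re.match(r".+?x", "xax").group()

# A typical case: pulling out tags one at a time
html = "<b>one</b> and <b>two</b>"
greedy_tags = re.findall(r"<b>.*</b>", html)
lazy_tags = re.findall(r"<b>.*?</b>", html)

print(greedy, lazy, lazy_star, lazy_plus)
print(greedy_tags)
print(lazy_tags)
```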

    ④ escaping

    A backslash \ cancels the special meaning of the character that follows it.
    There are two ways to cancel a metacharacter's special meaning:
    put a \ in front of the metacharacter;
    for some metacharacters, put them inside a character class, where they are taken literally:
    [.()+?*]
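
Both ways of escaping can be tried out as follows. The sample strings are invented; re.escape is a standard helper that adds the backslashes for you when the text to match comes from somewhere else.

```python
import re

text = "price: 3.50 (approx) 3x50"

unescaped = re.findall(r"3.50", text)      # "." still means "any character"
backslash = re.findall(r"3\.50", text)     # way 1: backslash
in_class = re.findall(r"3[.]50", text)     # way 2: character class

# Inside a character class these metacharacters are plain characters
literals = re.findall(r"[.()+?*]", "a.b(c)d+e?f*g")

# re.escape builds a pattern that matches the given string literally
raw = "1+1=2?"
matches_itself = re.fullmatch(re.escape(raw), raw) is not None

print(unescaped, backslash, in_class)
print(literals, matches_itself)
```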

Python -> re module

findall
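        Gives priority to group contents: if the pattern has groups, only the groups are returned
        ***** To cancel this priority, use a non-capturing group (?:regex)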
search
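        Returns only the first match
        The result needs .group() to get the value
        By default you get the complete match
        group(n) returns the content of the n-th group
# search still matches against the whole pattern and returns the first match,
# but passing an argument to group() retrieves the content of a specific group
import re

ret = re.search(r'9(\d)(\d)', '19740ash93010uru')
print(ret)  # match object --> <re.Match object; span=(1, 4), match='974'>
if ret:
    print(ret.group())   # --> 974
    print(ret.group(1))  # --> 7
    print(ret.group(2))  # --> 4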

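# findall
    # returns every match, giving priority to group contents
# search returns only the first match and has no such priority rule
    # the result is a match object
        # ret.group() gives exactly the same result as ret.group(0)
        # ret.group(n) returns the content matched by the n-th group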


# Parentheses are added to extract just the content we actually need
ret = re.findall(r'<\w+>(\w+)</\w+>', '<h1>askh930s02391j192agsj</h1>')
print(ret)  # --> ['askh930s02391j192agsj']
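
The group priority of findall, and how (?:regex) switches it off, can be checked with a short sketch. The host names and addresses here are just sample data.

```python
import re

text = "www.python.org and www.pypi.org"

# One capturing group: findall returns only what the group matched
group_only = re.findall(r"www\.(python|pypi)\.org", text)

# Non-capturing group: findall returns the whole match again
whole_match = re.findall(r"www\.(?:python|pypi)\.org", text)

# Several capturing groups: findall returns one tuple per match
pairs = re.findall(r"(\w+)@(\w+)\.com", "a@x.com b@y.com")

print(group_only)
print(whole_match)
print(pairs)
```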

    The remaining functions are explained in the code comments; copy the code, run it step by step, and experiment.

    The following covers: split, sub, subn, match, compile, finditer

# split, sub, subn, match, compile, finditer

# split
res = re.split(r'\d+', "cyx123456cyxx")
print(res)  # --> ['cyx', 'cyxx']

res = re.split(r'(\d+)', "cyx123456cyxx")  # a capturing group keeps the separators in the result
print(res)  # --> ['cyx', '123456', 'cyxx']

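# sub: replace
# All matches are replaced by default; to replace only the first one, pass a count:
# re.sub(r'\d+', '<digits replaced>', "cyx123456cyxxx123456", 1)
res = re.sub(r'\d+', '<digits replaced>', "cyx123456cyxxx123456")
print(res)  # --> cyx<digits replaced>cyxxx<digits replaced>

# subn: replace and also report the number of replacements
res = re.subn(r'\d+', '<digits replaced>', "cyx123456cyxxx123456")
print(res)  # --> ('cyx<digits replaced>cyxxx<digits replaced>', 2)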
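
Beyond a fixed replacement string, sub also accepts a count, a function, or a backreference to a group. A brief sketch, with made-up replacement values:

```python
import re

text = "cyx123456cyxxx123456"

# Replace only the first match
once = re.sub(r"\d+", "#", text, count=1)

# The replacement can be a function that receives each match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(once)
print(doubled)
print(swapped)
```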

# match behaves like search with a ^ added at the front
# --> mainly used to require that a string has a given form from its very beginning
res = re.match(r'\d+', 'cyx123456cyxxx')
print(res)  # --> None
res = re.match(r'\d+', '123cyx456cyxxx')
print(res.group())  # --> 123

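# compile -- a tool for saving time
# When the same regular expression is going to be used many times,
# compiling it once avoids parsing the same expression over and over
ret = re.compile(r"\d+")
res = ret.search("cyx12456cyxXX123")
print(res.group())  # --> 12456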
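
A compiled pattern carries the same methods as the module-level functions, so it can be reused on many inputs. A small sketch with invented input lines:

```python
import re

number = re.compile(r"\d+")  # compiled once, reused below

lines = ["cyx12456cyxXX123", "no digits here", "a1b22c333"]

# First number in each line, or None when there is no match
first_numbers = []
for line in lines:
    m = number.search(line)
    first_numbers.append(m.group() if m else None)

# The same compiled object also offers findall and sub
all_numbers = [number.findall(line) for line in lines]
masked = number.sub("#", lines[2])

print(first_numbers)
print(all_numbers)
print(masked)
```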

# finditer --> saves memory: it returns an iterator of match objects instead of a full list
ret = re.finditer(r"\d+", "cyx123456cyxxx125644")
for r in ret:
    print(r.group())
# --> 123456
# --> 125644
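
Because finditer yields match objects lazily, each result can be pulled only when needed, and each one also knows where it was found. A minimal sketch:

```python
import re

text = "cyx123456cyxxx125644"

matches = re.finditer(r"\d+", text)  # nothing is collected into a list yet

first = next(matches)                # pull a single match on demand
first_info = (first.group(), first.span())

# Consume whatever is left in the iterator
rest_info = [(m.group(), m.span()) for m in matches]

print(first_info)
print(rest_info)
```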

# How do we save both time and memory? Combine compile with finditer
ret = re.compile(r'\d+')
res = ret.finditer("cyx222231fddsf45746sdf2123sdf56456sdf10123sdf123132sdf")
for r in res:
    print(r.group())
"""
222231
45746
2123
56456
10123
123132
"""

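# Named groups: (?P<name>regex) defines one, (?P=name) refers back to it
# Sometimes the content we want is embedded in content we do not want;
# we have to match the unwanted part too, then find a way to strip it from the result

# Using a named group (to require the same content in two places)
exp = '<abc>asdasf54545645698asdasd</abc>00545sdfsdf</abd>'
ret = re.search(r'<(?P<tag>\w+)>.*?</(?P=tag)', exp)
print(ret)  # --> <re.Match object; span=(0, 33), match='<abc>asdasf54545645698asdasd</abc'>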
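
Named groups can also be read back by name. In this sketch the second group name, body, and the closing > in the pattern are additions for illustration and are not part of the example above.

```python
import re

exp = "<abc>asdasf54545645698asdasd</abc>00545sdfsdf</abd>"

# "tag" must appear again in the closing tag; "body" is a hypothetical extra group
pattern = r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>"

m = re.search(pattern, exp)
tag = m.group("tag")
body = m.group("body")
as_dict = m.groupdict()

# When the closing tag differs, the backreference prevents a match
mismatch = re.search(pattern, "<abc>text</abd>")

print(tag, body)
print(as_dict)
print(mismatch)
```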
# Example 2:
import re

ret = re.search(r'\d(\d)\d(\w+?)(\d)(\w)\d(\d)\d(?P<name1>\w+?)(\d)(\w)\d(\d)\d(?P<name2>\w+?)(\d)(\w)',
                '123abc45678agsf_123abc45678agsf123abc45678agsf')
print(ret.group('name1'))  # --> agsf_123abc
print(ret.group('name2'))  # --> agsf

A small exercise for today

    When we have a list like this:

lis = ['', 'z', 'c', 'asd', 'sdf', '', 'asd']

    So how do we remove the empty strings from it?

ret = filter(lambda n: n, lis)
print(list(ret))  # --> ['z', 'c', 'asd', 'sdf', 'asd']
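
The same result can be reached in a couple of other ways; which one reads best is a matter of taste. Note that the first two variants drop every falsy value (0 and None as well), while the last removes empty strings only.

```python
lis = ['', 'z', 'c', 'asd', 'sdf', '', 'asd']

# filter(None, ...) keeps only truthy items, same effect as lambda n: n
by_filter_none = list(filter(None, lis))

# List comprehension with a truthiness test
by_comprehension = [s for s in lis if s]

# Explicit comparison: removes '' and nothing else
by_explicit = [s for s in lis if s != '']

print(by_filter_none)
print(by_comprehension)
print(by_explicit)
```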

Origin blog.csdn.net/caiyongxin_001/article/details/105022789