Regular expressions & re module

A regular expression is a special sequence of characters that lets you easily check whether a string matches a pattern.
The re module was added in Python 1.5 and provides Perl-style regular expression patterns.
The re module gives the Python language the full feature set of regular expressions.
The compile function builds a regular expression object from a pattern string and an optional flags argument. This object has a set of methods for regular expression matching and replacement.
The re module also provides functions that do the same job as these methods; they take the pattern string as their first argument.
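
The two calling styles described above can be sketched as follows. The sample string and variable names are chosen here purely for illustration.

```python
import re

text = "alex83 wusir74"  # sample input, assumed for illustration

# Style 1: compile once, then call methods on the pattern object.
pattern = re.compile(r"\d+")
from_object = pattern.findall(text)

# Style 2: module-level function; the pattern string is the first argument.
from_module = re.findall(r"\d+", text)

print(from_object)   # ['83', '74']
print(from_module)   # ['83', '74']
```

Both styles give the same result; the compiled object is mainly worth it when one pattern is used many times.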

Regular Expressions

  • Regular: a regular expression is a rule for matching strings
    • Regular rules:
      • A literal character matches that same character in the string
      • Character set [char1char2]: one character set matches one character position. If the character at that position appears in the set, it matches. A character set can also be a range; every range must run from the smaller ASCII code to the larger, for example [0-9], [a-z], [A-Z].
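
A small sketch of the character-set rules, using made-up sample strings: a set matches one character, and a range written from the larger code to the smaller is rejected.

```python
import re

# A set matches exactly one character that appears inside the brackets.
from_set = re.findall(r"[abc]", "a1b2c3d4")
print(from_set)        # ['a', 'b', 'c']

# Ranges run from the lower character code to the higher one.
from_digits = re.findall(r"[0-9]", "a1b2c3d4")
from_letters = re.findall(r"[a-zA-Z]", "Ab1-z")
print(from_digits)     # ['1', '2', '3', '4']
print(from_letters)    # ['A', 'b', 'z']

# A reversed range is rejected when the pattern is compiled.
try:
    re.compile(r"[z-a]")
    reversed_ok = True
except re.error:
    reversed_ok = False
print(reversed_ok)     # False
```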

Metacharacters:

  • \d is the same as [0-9] and matches any digit
  • \w matches letters, digits, and underscore
  • \s matches whitespace: space, tab, newline
  • \t matches a tab
  • \n matches a newline character
  • \D matches any non-digit
  • \W matches any character that is not a letter, digit, or underscore
  • \S matches any non-whitespace character
  • . matches any character except newline
  • [] character set: matches one character that is any of the characters in the brackets
  • [^] matches one character that is none of the characters in the brackets
  • ^ matches the start of the string
  • $ matches the end of the string
  • | means "or"; if two alternatives overlap, always put the longer one in front and the shorter one behind
  • () grouping: treats part of the pattern as one unit, which can also limit the scope of |
  • \b matches a word boundary (either end of a word)
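
A few of the metacharacters in this list can be tried out as below. This is only a sketch; the sample strings are assumptions made up for the demonstration.

```python
import re

sample = "id_7 ok\tend\n"  # sample text, assumed for illustration

digits = re.findall(r"\d", sample)     # digits only
words = re.findall(r"\w+", sample)     # runs of letters, digits, underscore
blanks = re.findall(r"\s", sample)     # space, tab, newline
print(digits)   # ['7']
print(words)    # ['id_7', 'ok', 'end']
print(blanks)   # [' ', '\t', '\n']

# \b is a word boundary, so the "ok" inside "okay" is not matched.
whole_words = re.findall(r"\bok\b", "ok okay ok")
print(whole_words)   # ['ok', 'ok']

# With overlapping alternatives the first one that succeeds wins,
# which is why the longer alternative should be listed first.
short_first = re.findall(r"ab|abc", "abc")
long_first = re.findall(r"abc|ab", "abc")
print(short_first)   # ['ab']
print(long_first)    # ['abc']

# Parentheses limit the scope of | to one part of the pattern.
grouped = re.fullmatch(r"gr(a|e)y", "grey")
print(grouped is not None)   # True
```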

Quantifiers:

  • {n} matches exactly n times
  • {n,} matches at least n times
  • {n,m} matches at least n times and at most m times
  • ? matches zero or one time: the item is optional but can appear only once, such as a decimal point
  • + matches one or more times
  • * matches zero or more times: the item is optional but may appear many times, such as the digits after a decimal point
  • .*?x matches any characters any number of times and stops at the first x
  • a?? matches a zero or one time, preferring zero times
    # integer: \d+
    # decimal: \d+\.\d+
    # integer or decimal: \d+\.?\d*
    # what grouping is for: \d+(\.\d+)?
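
To see what the group adds in the last two patterns, here is a minimal sketch that checks a few candidate strings with `re.fullmatch`. The candidate list is an assumption chosen to show the difference.

```python
import re

candidates = ["42", "3.14", "12.", ".5"]  # assumed test inputs

loose = r"\d+\.?\d*"       # dot and fraction digits are optional independently
strict = r"\d+(\.\d+)?"    # dot and fraction digits are optional together

loose_ok = [s for s in candidates if re.fullmatch(loose, s)]
strict_ok = [s for s in candidates if re.fullmatch(strict, s)]

print(loose_ok)    # ['42', '3.14', '12.']
print(strict_ok)   # ['42', '3.14']
```

The grouped form rejects "12." because the dot may only appear when digits follow it.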
Greedy matching

Within the range the quantifier allows, match as many characters as possible.
.*x matches any character any number of times and stops only at the last x.

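Non-greedy (lazy) matching

.*?x matches any character any number of times but stops at the first x -> add a '?' after the quantifier.

*? repeat any number of times, but as few as possible
+? repeat 1 or more times, but as few as possible
?? repeat 0 or 1 times, but as few as possible
{n,m}? repeat n to m times, but as few as possible
{n,}? repeat n or more times, but as few as possible

.*?
. is any character
* takes a length from 0 to unlimited
? switches to non-greedy mode.
Put together, they take as few arbitrary characters as possible. This is rarely written on its own; it is mostly used as:
.*?x

which takes characters of any length up to the point where an x appears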
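
The difference between greedy and lazy quantifiers can be sketched on one short string (the inputs here are assumed for illustration):

```python
import re

text = "axbxcx"  # sample input, assumed for illustration

greedy = re.search(r".*x", text).group()    # runs to the last x
lazy = re.search(r".*?x", text).group()     # stops at the first x
print(greedy)   # axbxcx
print(lazy)     # ax

# The same rule applies to the other quantifiers.
range_greedy = re.search(r"\d{2,4}", "12345").group()
range_lazy = re.search(r"\d{2,4}?", "12345").group()
print(range_greedy)   # 1234
print(range_lazy)     # 12

opt_greedy = re.search(r"a?", "aaa").group()
opt_lazy = re.search(r"a??", "aaa").group()
print(repr(opt_greedy))   # 'a'
print(repr(opt_lazy))     # ''
```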

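General practice:

1. Match the email address on each line of a text. See the detailed explanation:
      http://blog.csdn.net/make164492212/article/details/51656638

2. Match the date string on each line of a text, for example '1990-07-12';

   The 12 months of a year: ^(0?[1-9]|1[0-2])$
   The 31 days of a month: ^((0?[1-9])|((1|2)[0-9])|30|31)$

3. Match a QQ number (Tencent QQ numbers start at 10000):  [1-9][0-9]{4,}

4. Match a floating-point number:       ^(-?\d+)(\.\d+)?$   or  -?\d+\.?\d*

5. Match Chinese characters. Match text that consists entirely of Chinese characters:      ^[\u4e00-\u9fa5]{0,}$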
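
A rough way to check the practice patterns is to run them through `re.fullmatch`. In this sketch the QQ pattern is written with the ranges [1-9] and [0-9], the date pattern is assembled from the month and day fragments above (the `\d{4}-...-...` framing is an assumption), and all test strings are made up.

```python
import re

qq = r"[1-9][0-9]{4,}"
date = r"\d{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[12][0-9]|30|31)"
number = r"-?\d+(\.\d+)?"
chinese = r"[\u4e00-\u9fa5]+"

def ok(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(ok(qq, "10000"), ok(qq, "9999"), ok(qq, "012345"))
# True False False
print(ok(date, "1990-07-12"), ok(date, "1990-13-01"), ok(date, "1990-02-32"))
# True False False
print(ok(number, "-40.35"), ok(number, "3."))
# True False
print(ok(chinese, "汉字"), ok(chinese, "汉zi"))
# True False
```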
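Named groups: matching tags

import re
# Named group: (?P<name>regex), referenced with (?P=name). Never forget the () around the group.

ret = re.search(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>", "<h1>hello</h1>")
# A group can be given a name inside the group with the ?P<name> form
# The matched value can then be read directly with group('name')
print(ret.group('tag_name'))  # result: h1
print(ret.group())  # result: <h1>hello</h1>

ret = re.search(r"<(\w+)>\w+</\1>", "<h1>hello</h1>")
# Without a name, a group can still be referenced as \number, meaning "the same content the earlier group matched"
# Note that \1 has a special meaning in an ordinary Python string, so the pattern needs an r prefix to keep it from being treated as an escape
# The matched value can be read directly with group(number)
print(ret.group(1))  # result: h1
print(ret.group())  # result: <h1>hello</h1>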

re module

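findall returns every match in the string as a list; if nothing matches, it returns an empty list.

import re
ret = re.findall(r'\d+', 'alex83')  # inefficient for large inputs: the whole list is built at once
print(ret)  # ['83']
# ret = re.findall(r'\d(\d)', 'aa1alex83')
# # when the pattern contains a group, findall shows the content of the group in preference to the whole match
# print(ret)  # ['3']
# To switch off this preference, use a non-capturing group: (?:regex)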
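
The effect of the `(?:...)` form mentioned in the comment above can be sketched side by side with a capturing group. The second pair of calls uses a hypothetical host name chosen only for the demonstration.

```python
import re

# Capturing group: findall returns only what the group matched.
captured = re.findall(r"\d(\d)", "aa1alex83")
# Non-capturing group: findall returns the whole match again.
uncaptured = re.findall(r"\d(?:\d)", "aa1alex83")
print(captured)     # ['3']
print(uncaptured)   # ['83']

# The same effect with alternation inside the group.
host_part = re.findall(r"www\.(example|test)\.com", "www.example.com")
host_full = re.findall(r"www\.(?:example|test)\.com", "www.example.com")
print(host_part)    # ['example']
print(host_full)    # ['www.example.com']
```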

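Exercise: extracting the integers
# Sometimes the content we want to match is mixed in with content we do not want;
    # the only way is to match the unwanted content as well, then find a way to drop it from the result

ret = re.findall(r'\d+', '1-2*(60+(-40.35/5)-(-4*3))')
print(ret)  # ['1', '2', '60', '40', '35', '5', '4', '3']
# This step only pulls out runs of digits; it cannot tell what is a decimal or whether a number is negative
ret = re.findall(r'-?\d+\.\d*|(-?\d+)', '1-2*(60+(-40.35/5)-(-4*3))')
print(ret)  # ['1', '-2', '60', '', '5', '-4', '3']
# The parentheses around (-?\d+) are the key: decimals are matched too, but only the content of the group is shown. Without the () the result would be:
# ['1', '-2', '60', '-40.35', '5', '-4', '3']
ret.remove('')
print(ret)
# Only the group content is shown, and '' is used as a placeholder meaning "something matched here but is not shown", so it has to be removed. However, remove() deletes only the first occurrence from the left, so it cannot delete them all

# So another approach is needed: filter(), the built-in that filters a list
ret = list(filter(lambda n: n, ['1', '-2', '60', '', '5', '-4', '3', '']))
# This achieves the goal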

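search returns a match object if a match is found and None if not; the value is read with group().

ret = re.search(r'\d+', 'alex83')
print(ret)
if ret:
    print(ret.group())  # a match object implements group(), so the value can be read
                        # None has no group() method, so calling it would raise an error
# search scans the string from start to end and takes the first item that satisfies the pattern
# If it matches, it returns an object; use group() to read the value
# If it does not match, it returns None, and group() cannot be used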

match checks from the start of the string whether the beginning satisfies the pattern. If so, it returns a match object whose value is read with group(); if not, it returns None.

ret = re.match(r'\d', 'alex83')  # same result as re.search(r'^\d', 'alex83')
print(ret)  # None: 'alex83' does not start with a digit
# match = search + ^ in front of the pattern
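
The relation between `match` and `search` can be checked with a short sketch on the same sample string:

```python
import re

text = "alex83"

by_match = re.match(r"\d+", text)            # anchored at the start: no digits there
by_anchored = re.search(r"^\d+", text)       # search with ^ behaves the same way
by_search = re.search(r"\d+", text)          # search may match anywhere
letters = re.match(r"[a-z]+", text)          # match succeeds when the start fits

print(by_match)            # None
print(by_anchored)         # None
print(by_search.group())   # 83
print(letters.group())     # alex
```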

finditer returns an iterator, which takes up very little space (saves memory).

# ret = re.finditer(r'\d', 'safhl02urhefy023908'*20000000)  # ret is an iterator
# for i in ret:    # each item produced by the iteration is a match object
#     print(i.group())  # read the value with group()

compile pre-compiles a regular expression, which saves time when the same pattern is used repeatedly.

s = r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>' \
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>'
ret = re.compile(r'\d3')  # a digit followed by a literal 3
print(ret)
r1 = ret.search('alex83')
print(r1)
print(ret.findall('wusir74'))  # []
ret = re.compile(r'\d+')
r3 = ret.finditer('taibai40')
for i in r3:
    print(i.group())

# Compile first (if the same pattern is not reused, this saves no time),
# then use finditer: together they save both time and space
ret = re.compile(r'\d+')
res = ret.finditer('agks1ak018as093')
for r in res:
    print(r.group())

split divides the string at every place the regular expression matches.

ret = re.split(r'\d(\d)', 'alex83wusir74taibai')  # the content of a group is kept in the result by default
print(ret)
# result: ['alex', '3', 'wusir', '4', 'taibai']
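
How groups change the output of `split` can be sketched by running the same string through three variants of the pattern:

```python
import re

text = "alex83wusir74taibai"

no_group = re.split(r"\d\d", text)       # separators are dropped
one_digit = re.split(r"\d(\d)", text)    # only the grouped digit is kept
both_digits = re.split(r"(\d\d)", text)  # the whole separator is kept

print(no_group)      # ['alex', 'wusir', 'taibai']
print(one_digit)     # ['alex', '3', 'wusir', '4', 'taibai']
print(both_digits)   # ['alex', '83', 'wusir', '74', 'taibai']
```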

sub replaces the content matched by the regular expression.

ret = re.sub(r'\d', 'D', 'alex83wusir74taibai')
print(ret)
# result: alexDDwusirDDtaibai
ret = re.sub(r'\d', 'D', 'alex83wusir74taibai', count=1)
print(ret)
# result: alexD3wusir74taibai

subn does the same replacement as sub but returns a tuple: the first item is the string after replacement, the second is the number of replacements made.

ret = re.subn(r'\d', 'D', 'alex83wusir74taibai')
print(ret)
# result: ('alexDDwusirDDtaibai', 4)
  • Named group: (?P<name>regex)
  • Group reference: (?P=name) matches exactly the same content that the named group matched; the group must already exist earlier in the pattern
  • Reading a group's value: call group('name') on the object returned by search
  • \1 refers to the content matched by the first group in the pattern
# Matching tags
import re
exp = '<h1>sdfssdsad</h1><h2>sfsdfsfs</h2>'
print(re.findall(r'<\w+>(.*?)</\w+>', exp))  # ['sdfssdsad', 'sfsdfsfs']

ret = re.search(r'<(\w+)>(.*?)</\1>', exp)
print(ret.group(2))  # sdfssdsad

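# Matching an ID card number entered by the user
inp = input('>')
ret = re.match(r'^[1-9]\d{14}(\d{2}[\dx])?$', inp)
if ret:  # match returns None when the input does not fit the pattern
    print(ret.group())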

# Matching a year-month-day date in the format 2018.12.9 or 2017-12-09
[1-9]\d{3}(?P<sub>[^\d])(1[0-2]|0?[1-9])(?P=sub)([12]\d|3[01]|0?[1-9])
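
A minimal sketch of how the date pattern behaves, using `re.fullmatch` and a few made-up inputs. The named group `sub` captures the first separator, and `(?P=sub)` forces the second separator to be the same character.

```python
import re

date = re.compile(
    r"[1-9]\d{3}(?P<sub>[^\d])(1[0-2]|0?[1-9])(?P=sub)([12]\d|3[01]|0?[1-9])"
)

dotted = date.fullmatch("2018.12.9")
dashed = date.fullmatch("2017-12-09")
mixed = date.fullmatch("2018.12-9")    # separators differ

print(dotted.group("sub"))   # .
print(dashed.group("sub"))   # -
print(mixed)                 # None
```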

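# Matching an email address
# Email rules:
# Before the @ there must be content, made up only of letters (upper or lower case), digits, underscore (_), hyphen (-), and dot (.)
# Between the @ and the last dot (.) there must be content, made up only of letters (upper or lower case), digits, dot (.), and hyphen (-), and two dots may not be adjacent
# After the last dot (.) there must be content, made up only of letters (upper or lower case) and digits, with a length of at least 2 and at most 6 characters
# e.g. [email protected]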
[-\w.]+@([-\da-zA-Z]+\.)+[a-zA-Z\d]{2,6}
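
The email pattern can be exercised with a short sketch. The addresses below are hypothetical examples on example.com, not taken from the original text.

```python
import re

email = re.compile(r"[-\w.]+@([-\da-zA-Z]+\.)+[a-zA-Z\d]{2,6}")

def is_email(text):
    """Return True when the whole text fits the email pattern."""
    return email.fullmatch(text) is not None

print(is_email("user.name@mail.example.com"))   # True
print(is_email("user@example"))                 # False: no dot after the @
print(is_email("user@a..com"))                  # False: two adjacent dots
print(is_email("@example.com"))                 # False: nothing before the @
```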

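# ret = re.findall(par, content, flags=re.S)
# flags=re.S lets . match newline characters as well
# Example
    # Sometimes the content we want to match is contained in content we do not want; in that case match the unwanted content first, then remove it afterwards
import re
ret = re.findall(r"\d+\.\d+|(\d+)", "1-2*(60+(-40.35/5)-(-4*3))")
print(ret)  # ['1', '2', '60', '', '5', '4', '3']
ret.remove('')
print(ret)  # ['1', '2', '60', '5', '4', '3']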
  • A crawler example using the re module
import json
import re
from urllib.request import urlopen


def getPage(url):
    # getPage is not defined in the original listing; this urllib-based version is an assumption
    response = urlopen(url)
    return response.read().decode('utf-8')


def parsePage(s):
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    # the with block closes the file once all records are written
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")


if __name__ == '__main__':
    count = 0
    for i in range(10):
        main(count)
        count += 25
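The flags parameter of the re module
flags has many possible values:

re.I (IGNORECASE): ignore case; the name in parentheses is the full spelling
re.M (MULTILINE): multi-line mode, changes the behavior of ^ and $
re.S (DOTALL): the dot matches any character, including newline
re.L (LOCALE): locale-aware matching; makes the special sets \w, \W, \b, \B, \s, \S depend on the current environment; not recommended
re.U (UNICODE): \w \W \s \S \d \D follow the character properties defined by Unicode; in Python 3 this flag is the default
re.X (VERBOSE): verbose mode; the pattern string can span several lines, whitespace is ignored, and comments can be added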
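
The most commonly used flags can be sketched as follows; the sample strings are assumptions made up for the demonstration.

```python
import re

# re.I: ignore case
ignore_case = re.findall(r"python", "Python PYTHON python", re.I)
print(ignore_case)   # ['Python', 'PYTHON', 'python']

# re.M: ^ and $ also match at the start and end of every line
lines = "one\ntwo\nthree"
single = re.findall(r"^\w+", lines)
multi = re.findall(r"^\w+", lines, re.M)
print(single)        # ['one']
print(multi)         # ['one', 'two', 'three']

# re.S: the dot also matches a newline
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.S)
print(dot_default)           # None
print(dot_all is not None)   # True

# re.X: whitespace in the pattern is ignored and comments are allowed
number = re.compile(r"""
    -?\d+       # optional sign and integer part
    (\.\d+)?    # optional fraction
""", re.X)
verbose_ok = number.fullmatch("-40.35") is not None
print(verbose_ok)    # True
```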

Origin www.cnblogs.com/zheng0907/p/12488895.html