[Python advanced notes] Regular expressions

Table of contents

1. The re module

2. Matching a single character

3. Matching multiple characters

4. Matching the beginning and end

4.1. Example: checking whether variable names are valid

4.2. Example: matching an e-mail address

5. Matching groups

5.1. | example: matching a number between 0 and 100

5.2. () examples

5.3. \num example

5.4. (?P<name>) and (?P=name) example

6. Advanced re usage (search, findall, sub, split ...)

6.1. search

6.2. findall

6.3. sub

6.4. split

 

When you need to match strings against a regular expression, you can use Python's built-in module named re.

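1. The re module

Using the re module:

 #coding=utf-8

    # Import the re module
    import re

    # Use match() to perform the match; pattern and string are placeholders
    # for the regular expression and the string to be matched
    result = re.match(pattern, string)

    # If the previous step matched, use group() to extract the matched text
    result.group()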

 

Example (matching a string that starts with hello):

 #coding=utf-8

    import re

    result = re.match(r"hello", "hello python")  # the r prefix makes a raw string, so backslashes are not treated as escapes

    print(result.group())

Output: hello

The re.match function:

re.match tries to match the pattern at the start of the string. If the pattern does not match at the start position, match() returns None.

Function syntax:

re.match(pattern, string, flags=0)
pattern  The regular expression to match.
string   The string to be matched.
flags    Flags that control how the regular expression is matched, for example whether matching is case-sensitive or multi-line.

For the available flags, see: https://www.runoob.com/python3/python3-reg-expressions.html#flags
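A small sketch of the behaviour described above, using only the standard re module. The sample strings are made up for illustration; re.IGNORECASE is one of the standard flags.

```python
import re

# re.match only succeeds when the pattern matches at position 0.
at_start = re.match(r"hello", "hello python")
not_at_start = re.match(r"hello", "say hello")

# flags changes how matching works; re.IGNORECASE ignores letter case.
ignore_case = re.match(r"hello", "HELLO python", flags=re.IGNORECASE)

print(at_start.group())     # hello
print(not_at_start)         # None
print(ignore_case.group())  # HELLO
```

Because match() can return None, it is safer to test the result before calling group() on it.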

 

2. Match a single character

Regular expressions that match a single character:

Character   Function
.           Matches any single character (except \n)
[ ]         Matches any one of the characters listed inside the brackets
\d          Matches a digit, i.e. 0-9
\D          Matches a non-digit
\s          Matches whitespace, e.g. a space or a tab
\S          Matches non-whitespace
\w          Matches a word character, i.e. a-z, A-Z, 0-9, _
\W          Matches a non-word character

Note: each uppercase class matches exactly the opposite of its lowercase counterpart.
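To see each entry of the table in action, here is a minimal sketch that matches one character at the start of a few made-up sample strings (the sample strings are my own, not from the notes):

```python
import re

# pattern -> sample string; each pattern matches only the first character
samples = {
    r".": "a",
    r"[hH]": "Hello",
    r"\d": "7 apples",
    r"\D": "x1",
    r"\s": " tab",
    r"\S": "word",
    r"\w": "_name",
    r"\W": "#tag",
}

matched = {pattern: re.match(pattern, text).group() for pattern, text in samples.items()}
for pattern, char in matched.items():
    print(pattern, "->", repr(char))

# . does not match a newline unless the re.DOTALL flag is given
newline_result = re.match(r".", "\n")
print(newline_result)  # None
```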

 

3. Matching multiple characters

Quantifiers that follow a character:

Character   Function
*           Matches the preceding character zero or more times, i.e. it is optional
+           Matches the preceding character one or more times, i.e. at least once
?           Matches the preceding character zero or one time, i.e. it appears once or not at all
{m}         Matches the preceding character exactly m times
{m,n}       Matches the preceding character from m to n times

 

Example 1: *

Requirement: match a string whose first character is an uppercase letter, followed by lowercase letters; the lowercase letters are optional.

#coding=utf-8
import re

ret = re.match("[A-Z][a-z]*","M")
print(ret.group())

ret = re.match("[A-Z][a-z]*","MnnM")
print(ret.group())

ret = re.match("[A-Z][a-z]*","Aabcdef")
print(ret.group())

Output:

M
Mnn
Aabcdef
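The notes only show an example for *. A similar sketch for the other quantifiers in the table might look like this (the sample strings are invented for illustration):

```python
import re

# +  one or more digits
plus_ok = re.match(r"\d+", "2024abc").group()
plus_fail = re.match(r"\d+", "abc")            # no leading digit, so no match

# ?  an optional leading digit 1-9, then one digit
optional_one = re.match(r"[1-9]?\d", "7").group()
optional_two = re.match(r"[1-9]?\d", "33").group()

# {m}  exactly six digits; the rest of the string is left unmatched
exactly_six = re.match(r"\d{6}", "12345678").group()

# {m,n}  between 8 and 20 word characters
ranged_ok = re.match(r"\w{8,20}", "abc_1234567").group()
ranged_fail = re.match(r"\w{8,20}", "short")   # only 5 characters

print(plus_ok, plus_fail, optional_one, optional_two, exactly_six, ranged_ok, ranged_fail)
```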

 

 

4. Matching the beginning and end

Character   Function
^           Matches the beginning of the string
$           Matches the end of the string

 

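4.1. Example: checking whether variable names are valid

import re


def main():
    names = ["age", "_age", "1age", "age1", "a_age", "age_1_", "age!", "a#123", "__________"]
    for name in names:
        # ret = re.match(r"[a-zA-Z_][a-zA-Z0-9_]*", name)  # Flawed: once the beginning matches, nothing forces the rest to be checked, so the match may stop before the end of the string

        # ^ anchors the start, $ anchors the end.
        # re.match always starts at the beginning of the string, so ^ is optional with match.
        # But match does not check the end, so $ is still needed when the pattern must reach the end of the string.
        ret = re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", name)  # returns a match object on success, None otherwise
        if ret:
            print("Variable name: %s is valid. Text matched by the regex: %s" % (name, ret.group()))
        else:
            print("Variable name: %s is invalid." % name)


if __name__ == "__main__":
    main()

Output:

Variable name: age is valid. Text matched by the regex: age
Variable name: _age is valid. Text matched by the regex: _age
Variable name: 1age is invalid.
Variable name: age1 is valid. Text matched by the regex: age1
Variable name: a_age is valid. Text matched by the regex: a_age
Variable name: age_1_ is valid. Text matched by the regex: age_1_
Variable name: age! is invalid.
Variable name: a#123 is invalid.
Variable name: __________ is valid. Text matched by the regex: __________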

 

4.2. Example: matching an e-mail address

Match a 163 e-mail address that has 4 to 20 letters, digits or underscores before the @ sign, e.g. [email protected].

import re


def main():
    email = input("Please enter an e-mail address: ")
    # If the regex needs to match a character that is normally special, such as . or ?,
    # put a backslash in front of it to escape it
    ret = re.match(r"[a-zA-Z_0-9]{4,20}@163\.com$", email)
    # ret = re.match(r"[\w]{4,20}@163\.com$", email)  # \w also works here; for str patterns in Python 3 it matches Unicode word characters too (including Chinese) unless re.ASCII is set
    if ret:
        print("%s is valid." % email)
    else:
        print("%s is invalid." % email)

if __name__ == "__main__":
    main()
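For trying the same check without typing input, here is a rough sketch that wraps it in a function. The function name is_163_address and the sample addresses are hypothetical, chosen only for illustration. It uses re.fullmatch, which requires the whole string to match, so neither ^ nor $ is needed.

```python
import re

# Same rule as above: 4 to 20 letters, digits or underscores, then @163.com
EMAIL_163 = re.compile(r"[a-zA-Z0-9_]{4,20}@163\.com")


def is_163_address(email):
    """Return True if the whole string is a valid 163 address (hypothetical helper)."""
    return EMAIL_163.fullmatch(email) is not None


# Hypothetical addresses, for illustration only
candidates = ["demo_user@163.com", "abc@163.com", "demo_user@163.comm", "demo_user@163xcom"]
results = {candidate: is_163_address(candidate) for candidate in candidates}

for candidate, ok in results.items():
    print(candidate, "valid" if ok else "invalid")
```

The last candidate is rejected only because the dot is escaped as \. in the pattern; an unescaped . would accept any character in that position.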

 

5. Matching groups

 

Character    Function
|            Matches either the expression on its left or the one on its right
(ab)         Treats the characters in parentheses as a group
\num         Refers back to the string matched by group num
(?P<name>)   Gives the group a name (alias)
(?P=name)    Refers back to the string matched by the group named name

5.1. | example: matching a number between 0 and 100

# coding=utf-8

# Goal: match a number between 0 and 100.

import re

ret = re.match(r"[1-9]?\d", "8")  # written this way it cannot match 100 (three digits) and does not check the end
print(ret.group())  # 8

ret = re.match(r"[1-9]?\d", "78")
print(ret.group())  # 78

# An incorrect case
ret = re.match(r"[1-9]?\d", "08")
print(ret.group())  # 0: "0" is not in [1-9], so [1-9]? matches zero times and \d matches "0"; match() does not have to consume the whole string, so "8" is simply left over

# Corrected with $
ret = re.match(r"[1-9]?\d$", "08")  # still cannot match 100 (three digits)
if ret:
    print(ret.group())
else:
    print("Not between 0 and 100")

# Adding |
ret = re.match(r"[1-9]?\d$|100", "8")
print(ret.group())  # 8

ret = re.match(r"[1-9]?\d$|100", "78")
print(ret.group())  # 78

ret = re.match(r"[1-9]?\d$|100", "08")
# print(ret.group())  # ret is None here ("08" is rejected), so calling group() would raise AttributeError

ret = re.match(r"[1-9]?\d$|100", "100")
print(ret.group())  # 100
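One thing to watch with the last pattern: the $ belongs only to the left alternative, so the 100 branch does not check the end of the string. A possible tightening, sketched here with a hypothetical helper named in_range, is to group the alternatives and use re.fullmatch:

```python
import re

# The pattern from the example: the 100 branch is not anchored at the end
loose = re.match(r"[1-9]?\d$|100", "1000")
print(loose.group())  # 100, even though 1000 is out of range

# Group both alternatives (non-capturing) and require the whole string to match
ZERO_TO_100 = re.compile(r"(?:[1-9]?\d|100)")


def in_range(text):
    """Return True if text is a number from 0 to 100 without a leading zero (hypothetical helper)."""
    return ZERO_TO_100.fullmatch(text) is not None


checks = {text: in_range(text) for text in ["0", "8", "78", "100", "08", "101", "1000"]}
print(checks)
```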

 

5.2. () examples

1) Matching 163, 126 and qq mailboxes

import re

ret = re.match(r"\w{4,20}@163\.com", "[email protected]")
# \w matches a word character, i.e. a-z, A-Z, 0-9, _ ; here it requires 4 to 20 letters, digits or underscores before the @ sign
print(ret.group())  # [email protected]

ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "[email protected]")  # | matches either the expression on its left or on its right
print(ret.group())  # [email protected]

ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "[email protected]")
print(ret.group())  # [email protected]

ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "[email protected]")  # returns None when there is no match
if ret:
    print(ret.group())
else:
    print("Not a 163, 126 or qq mailbox")  # Not a 163, 126 or qq mailbox

Output:

[email protected]
[email protected]
[email protected]
Not a 163, 126 or qq mailbox
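The parentheses do more than limit the scope of |: each pair also captures what it matched. A minimal sketch with hypothetical addresses (invented for illustration, not the ones from the original post):

```python
import re

# Group 1 is the user name, group 2 is the provider
pattern = re.compile(r"(\w{4,20})@(163|126|qq)\.com")

# Hypothetical addresses, for illustration only
addresses = ["demo_user@163.com", "demo_user@126.com", "demo_user@qq.com", "demo_user@gmail.com"]

providers = []
for address in addresses:
    ret = pattern.match(address)
    # group(2) is the text captured by the second pair of parentheses
    providers.append(ret.group(2) if ret else None)

print(providers)  # ['163', '126', 'qq', None]
```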

2) Extracting the area code and phone number

import re
ret = re.match(r"([^-]*)-(\d+)", "010-12345678")
# [^-] matches any character except -; * repeats the preceding item zero or more times; + repeats \d (a digit) one or more times
print(ret.group())  # 010-12345678
print(ret.group(1))  # 010
print(ret.group(2))  # 12345678

Note: [^...] matches any character that is not listed inside the brackets.

For example: [^abc] matches any character except a, b and c; [^-] matches any character except -.
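As a small follow-up sketch, groups() returns all captured groups at once as a tuple, which unpacks neatly. The second sample string is made up to show another negated class.

```python
import re

ret = re.match(r"([^-]+)-(\d+)", "010-12345678")

# groups() returns every captured group as a tuple
area_code, number = ret.groups()
print(area_code, number)  # 010 12345678

# [^aeiou]+ stops at the first vowel
leading_consonants = re.match(r"[^aeiou]+", "strength").group()
print(leading_consonants)  # str
```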

 

5.3. \num example

Requirement: match <html><h1>hello</h1></html>

import re

labels = ["<html><h1>hello</h1></html>", "<html><h1>hello</h2></html>"]

for label in labels:
    ret = re.match(r"<(\w*)><(\w*)>.*</\2></\1>", label)
    if ret:
        print("%s is a valid tag" % ret.group())
    else:
        print("%s is invalid" % label)

Output:

<html><h1>hello</h1></html> is a valid tag
<html><h1>hello</h2></html> is invalid


Note: \num refers back to the string matched by group num. Within the regular expression, the first () is group 1, referenced with \1; the second () is group 2, referenced with \2.

 

5.4. (?P<name>) and (?P=name) example

(?P<name>) gives a group a name (alias); (?P=name) refers back to the string matched by the group with that name.

Requirement: match <html><h1>hello</h1></html>

#coding=utf-8

import re

ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>hello</h1></html>")
print(ret.group())  # <html><h1>hello</h1></html>

ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>hello</h2></html>")
# print(ret.group())  # no match: ret is None, so calling group() would raise an exception

Explanation: (?P<name1>\w*) names the group name1, and the group matches \w zero or more times; (?P=name1) refers back to the string matched by the group named name1.
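Named groups can also be read back by name after matching. A short sketch (the group names outer, inner and text are my own choice, not from the notes):

```python
import re

pattern = re.compile(
    r"<(?P<outer>\w+)><(?P<inner>\w+)>(?P<text>.*)</(?P=inner)></(?P=outer)>"
)

ret = pattern.match("<html><h1>hello</h1></html>")
print(ret.group("outer"))  # html
print(ret.group("inner"))  # h1
print(ret.groupdict())     # {'outer': 'html', 'inner': 'h1', 'text': 'hello'}

# The closing tags must repeat the captured names, so this one fails
mismatch = pattern.match("<html><h1>hello</h2></html>")
print(mismatch)  # None
```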

 

6. Advanced re usage (search, findall, sub, split ...)

6.1. search

re.search scans the entire string and returns the first successful match: re.search(pattern, string, flags=0)

re.search returns a match object if a match is found, otherwise None.

Use group(num) or groups() on the match object to get the matched text.

re.match only matches at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search, in contrast, scans through the whole string until it finds a match.

Example: matching the read count of an article.

# coding=utf-8
import re

ret = re.search(r"\d+", "Read count: 9999")
print(ret.group())  # 9999
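To make the difference between match and search concrete, here is a minimal side-by-side sketch on the same made-up string:

```python
import re

text = "Read count: 9999"

# match only looks at the start of the string, where there is no digit
from_match = re.match(r"\d+", text)
print(from_match)  # None

# search scans forward until the pattern fits somewhere
from_search = re.search(r"\d+", text)
print(from_search.group())  # 9999
print(from_search.span())   # (12, 16), start and end index of the match

# groups() works on search results too
label, count = re.search(r"(\w+) count: (\d+)", text).groups()
print(label, count)  # Read 9999
```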

 

6.2. findall

Finds all substrings in the string that match the regular expression and returns them as a list. If no match is found, an empty list is returned.

(Note: match and search return one match; findall returns all matches.)

re.findall(pattern, string, flags=0)
  • pattern: the regular expression to match.
  • string: the string to be searched.
  • flags: optional matching flags, default 0.

The form findall(string[, pos[, endpos]]) belongs to a compiled pattern object (created with re.compile), where:
  • pos: optional, the position in the string at which to start searching, default 0.
  • endpos: optional, the position in the string at which to stop searching, default the length of the string.

Example: collecting the read counts of the python, c and c++ articles

#coding=utf-8
import re

ret = re.findall(r"\d+", "python = 9999, c = 7890, c++ = 12345")
print(ret)  # ['9999', '7890', '12345']

Note: the return value is a list of strings, not a match object, so there is no group() method.
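Two further behaviours are worth a quick sketch: when the pattern contains groups, findall returns tuples of the groups, and the pos argument is only available on a compiled pattern. The variable names here are my own.

```python
import re

text = "python = 9999, c = 7890, c++ = 12345"

# With two groups, each list item is a (name, count) tuple
pairs = re.findall(r"([\w+]+) = (\d+)", text)
print(pairs)  # [('python', '9999'), ('c', '7890'), ('c++', '12345')]

counts = {name: int(count) for name, count in pairs}
print(counts["c++"])  # 12345

# pos is a parameter of the compiled pattern's findall, not of re.findall
number = re.compile(r"\d+")
after_first = number.findall(text, 15)  # start searching at index 15
print(after_first)  # ['7890', '12345']

# No match gives an empty list rather than None
nothing = re.findall(r"\d+", "no digits here")
print(nothing)  # []
```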

 

6.3. sub

Python's re module provides re.sub for replacing the matched parts of a string.

re.sub(pattern, repl, string, count=0, flags=0)
  • pattern: the regular expression pattern string.
  • repl: the replacement string; it may also be a function.
  • string: the original string to search and replace in.
  • count: the maximum number of matches to replace; the default 0 means replace all matches.
  • flags: the matching flags, given as an integer (for example re.IGNORECASE).

The first three parameters are required; the last two are optional.

Put simply: match pattern against string, replace each match with repl, and return the resulting string.

 

Example: incrementing the matched read count.

# coding=utf-8
import re

ret = re.sub(r"\d+", '998', "python = 997")
print(ret)  # python = 998

Using a function as repl:

# coding=utf-8
import re


def add(temp):
    strNum = temp.group()  # take out the matched text
    num = int(strNum) + 1
    return str(num)


ret = re.sub(r"\d+", add, "python = 997")  # for each match, add is called with the match object and its return value is used as the replacement
print(ret)  # python = 998

ret = re.sub(r"\d+", add, "python = 99")
print(ret)  # python = 100
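A few more sub variations, sketched on a made-up string: limiting the number of replacements with count, referring to groups inside repl, and re.subn, which also reports how many replacements were made.

```python
import re

text = "python = 997, c = 7890"

# count=1 replaces only the first match
first_only = re.sub(r"\d+", "0", text, count=1)
print(first_only)  # python = 0, c = 7890

# \1 and \2 in repl refer to the groups captured by the pattern
swapped = re.sub(r"(\w+) = (\d+)", r"\2 = \1", text)
print(swapped)  # 997 = python, 7890 = c

# subn returns a (new_string, number_of_replacements) tuple
new_text, replaced = re.subn(r"\d+", "0", text)
print(new_text, replaced)  # python = 0, c = 0 2
```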

 

Example: remove the HTML tags from the following string

<div><p>Technical requirements:</p>
<p>1. At least one year of Python development experience, a grasp of object-oriented analysis and design, and an understanding of design patterns</p>
<p>2. Command of the HTTP protocol, familiarity with concepts such as MVC and MVVM and the related WEB development frameworks</p>
<p>3. Command of relational database development and design and of SQL, proficient in one of MySQL/PostgreSQL<br></p>
<p>4. Command of NoSQL and MQ, proficient in the corresponding technical solutions</p>
<p>5. Familiar with Javascript/CSS/HTML5, JQuery, React, Vue.js</p>
<p>&nbsp;<br></p></div>

# coding=utf-8
import re

string = r"<div><p>Technical requirements:</p><p>1. At least one year of Python development experience, a grasp of object-oriented analysis and design, and an understanding of design patterns</p><p>2. Command of the HTTP protocol, familiarity with concepts such as MVC and MVVM and the related WEB development frameworks</p><p>3. Command of relational database development and design and of SQL, proficient in one of MySQL/PostgreSQL<br></p><p>4. Command of NoSQL and MQ, proficient in the corresponding technical solutions</p><p>5. Familiar with Javascript/CSS/HTML5, JQuery, React, Vue.js</p><p>&nbsp;<br></p></div>"

ret = re.sub(r"<[^>]*>|&nbsp;|\n", "", string)
# [^>]* matches any characters other than >, so <[^>]*> matches a whole <...> tag. Overall the pattern matches tags, &nbsp; entities and newlines
print(ret)

Note:...

 

6.4. split

Splits the string at every match of the pattern and returns the pieces as a list.

re.split(pattern, string, maxsplit=0, flags=0)
  • pattern: the regular expression to match;
  • string: the string to split;
  • maxsplit: the maximum number of splits; maxsplit=1 splits once, and the default 0 means no limit;
  • flags: flags that control how the regular expression is matched, for example whether matching is case-sensitive or multi-line.

Example: split the string "info:xiaoZhang 33 shandong" on : and on spaces

# coding=utf-8
import re

ret = re.split(r":| ", "info:xiaoZhang 33 shandong")
print(ret)  # ['info', 'xiaoZhang', '33', 'shandong']
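A short sketch of the remaining options: maxsplit limits the number of cuts, a capturing group keeps the separators in the result, and a character class with + treats a run of separators as one. The last sample string is made up.

```python
import re

text = "info:xiaoZhang 33 shandong"

# maxsplit=1 cuts only at the first separator
once = re.split(r":| ", text, maxsplit=1)
print(once)  # ['info', 'xiaoZhang 33 shandong']

# Wrapping the pattern in () keeps the separators in the list
with_separators = re.split(r"(:| )", text)
print(with_separators)  # ['info', ':', 'xiaoZhang', ' ', '33', ' ', 'shandong']

# [: ]+ treats several separators in a row as a single one
collapsed = re.split(r"[: ]+", "info:  xiaoZhang")
print(collapsed)  # ['info', 'xiaoZhang']
```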

 

For more, see the Runoob tutorial: https://www.runoob.com/python3/python3-reg-expressions.html#flags

 

----------end-------------

 


Origin blog.csdn.net/qq_23996069/article/details/104069963