[Regular Expression] Learning regular expressions and Python's re module


I have heard about regular expressions many times but never actually used them. This time I really need them, so I am going to learn them.

Reference link:
https://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386832260566c26442c671fa489ebc6fe85badda25cd000

http://www.runoob.com/regexp/regexp-syntax.html

http://www.runoob.com/python3/python3-reg-expressions.html


1. Introduction to syntax

1.1 Character composition

Regular expressions are mainly used for string matching.

A regular expression is a text pattern made up of ordinary characters and special characters (metacharacters).

Ordinary characters are all printable and non-printable characters that are not explicitly designated as metacharacters.


1.2 Specific usage

1. "\d" matches one digit, "\w" matches one letter, digit, or underscore, and "." matches any single character except "\n". The quantifiers apply to the element just before them: "*" means it is repeated 0 to n times, "+" means 1 to n times, and "?" means 0 or 1 time.

Example:

"\d\d\d" can match "345"

"\w\w\w" can match "we2"

"a*" can match "aaa" or "aa"
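To check these rules quickly, here is a small sketch using re.fullmatch (available from Python 3.4), which only succeeds when the whole string matches the pattern. Apart from "345", "we2" and "aaa", the sample strings are my own picks for illustration.

```python
import re

# (pattern, text) pairs; fullmatch requires the whole text to match
cases = [
    (r"\d\d\d", "345"),     # three digits
    (r"\w\w\w", "we2"),     # three word characters
    (r"a*", "aaa"),         # zero or more "a"
    (r"a*", ""),            # zero repetitions is also a match
    (r"ab+", "abbb"),       # one or more "b" after "a"
    (r"colou?r", "color"),  # the "u" is optional
    (r"p.", "py"),          # "." matches any one character except "\n"
]

results = [bool(re.fullmatch(p, t)) for p, t in cases]
print(results)

# "." does not match a newline by default, so this returns None
dot_newline = re.fullmatch(r"p.", "p\n")
print(dot_newline)
```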

2. {n} means the preceding element is repeated exactly n times, and {n,m} means it is repeated n to m times.

Example:

"\d{3}\s+\d{3,9}"

This matches 3 digits, then 1 to n whitespace characters, then 3 to 9 digits.

For example: "231     34367"
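A minimal sketch to try this pattern out. The two failing strings are made up by me to show where the counts matter.

```python
import re

pattern = r"\d{3}\s+\d{3,9}"

ok = re.fullmatch(pattern, "231     34367")    # 3 digits, spaces, 5 digits
too_short = re.fullmatch(pattern, "231 34")    # only 2 digits after the space
no_space = re.fullmatch(pattern, "23134367")   # \s+ needs at least one whitespace

print(bool(ok), too_short, no_space)
```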

3. Range representation []

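[0-9a-zA-Z_] matches one digit, letter, or underscore, such as "a" or "3".

[0-9a-zA-Z_]+ matches a string of 1 to n digits, letters, or underscores, such as "sdag1" or "sf321_12".

[a-zA-Z_][0-9a-zA-Z_]* matches a string that starts with a letter or underscore, followed by 0 to n digits, letters, or underscores. This is the form of the variable names we define in most programming languages.

[a-zA-Z_][0-9a-zA-Z_]{0,19} matches a string that starts with a letter or underscore, followed by 0 to 19 digits, letters, or underscores, so the total length is between 1 and 20.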
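As a rough sketch, the last pattern can be wrapped in a small checker. The function name is_identifier is hypothetical, made up just for this example.

```python
import re

# Starts with a letter or underscore, then up to 19 more word characters
IDENT = re.compile(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}")

def is_identifier(name):
    """Return True if name is 1-20 characters long and looks like a variable name."""
    return IDENT.fullmatch(name) is not None

for name in ["sf321_12", "_tmp", "3abc", "a" * 20, "a" * 21]:
    print(name, is_identifier(name))
```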

4. "A|B" matches either A or B.

"^" marks the start of a line; "^\d" means the line must start with a digit.

"$" marks the end of a line; "\d$" means the line must end with a digit.

"^py$" matches the entire line, so it only matches "py".
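Here is a short sketch of the anchors and of "A|B" in action. The sample strings and the "(P|p)ython" pattern are only illustrations I chose.

```python
import re

starts_with_digit = re.compile(r"^\d")
ends_with_digit = re.compile(r"\d$")
only_py = re.compile(r"^py$")
either_case = re.compile(r"^(P|p)ython$")   # "P" or "p", then "ython"

checks = {
    "starts": bool(starts_with_digit.search("3 apples")),
    "ends": bool(ends_with_digit.search("room 101")),
    "py": bool(only_py.search("py")),
    "python_is_not_py": bool(only_py.search("python")),
    "Python": bool(either_case.search("Python")),
    "python": bool(either_case.search("python")),
}
print(checks)
```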

2. re module

2.1 Module introduction

1. re.match function
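A minimal sketch of my understanding: re.match(pattern, string, flags=0) only tries to match at the beginning of the string, and returns a Match object or None. The sample string is my own choice.

```python
import re

m1 = re.match(r"www", "www.runoob.com")   # the pattern is at the start
m2 = re.match(r"com", "www.runoob.com")   # the pattern exists, but not at the start

print(m1.span())   # (0, 3)
print(m2)          # None
```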


2. re.search function
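Unlike re.match, re.search(pattern, string, flags=0) scans the whole string and returns the first match it finds. A small sketch, with a sample string picked by me:

```python
import re

s = "www.runoob.com"
found = re.search(r"com", s)      # found later in the string
missing = re.search(r"xyz", s)    # not present anywhere

print(found.span())   # (11, 14)
print(missing)        # None
```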


3. re.sub search-and-replace function
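A rough sketch of re.sub(pattern, repl, string, count=0, flags=0): it replaces every match of the pattern with repl, which can be a string or a function. The phone number below is made up for illustration.

```python
import re

phone = "2004-959-559 # a made-up phone number"

no_comment = re.sub(r"\s*#.*$", "", phone)       # strip the trailing comment
digits_only = re.sub(r"\D", "", no_comment)      # remove everything that is not a digit
first_only = re.sub(r"-", "", no_comment, count=1)   # count limits the number of replacements

# repl can also be a function that receives the Match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a1b22")

print(no_comment, digits_only, first_only, doubled)
```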


4. compile function


5. findall function
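As far as I understand it, re.findall returns all non-overlapping matches as a list. If the pattern has groups, each item is the group (or a tuple of groups) instead of the whole match. A small sketch with a made-up string:

```python
import re

text = "runoob 123 google 456"

numbers = re.findall(r"\d+", text)         # no groups: list of whole matches
pairs = re.findall(r"(\w+) (\d+)", text)   # two groups: list of tuples

print(numbers)   # ['123', '456']
print(pairs)     # [('runoob', '123'), ('google', '456')]
```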


6. finditer function
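re.finditer is similar to findall, but it returns an iterator of Match objects, so positions are available too. A minimal sketch, with a sample string of my own:

```python
import re

text = "12a32bc43"

# Collect each matched number together with its (start, end) position
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(spans)   # [('12', (0, 2)), ('32', (3, 5)), ('43', (7, 9))]
```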


7. split function


2.2 Extended introduction

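1. The r prefix

Put r in front of the string to make it a raw string, so you do not have to think about backslash escaping.

s = "adb\\-sx2"    # regular expression: "adb\-sx2"

s = r"adb\-sx2"    # regular expression: "adb\-sx2"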
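A quick sketch to confirm that the two spellings give the same pattern string:

```python
import re

normal = "adb\\-sx2"   # backslash escaped by hand
raw = r"adb\-sx2"      # raw string, backslash kept as written

same = normal == raw
print(same, len(raw))  # True 8

# In the regular expression, "\-" matches a literal "-"
m = re.match(raw, "adb-sx2")
print(m.group())       # adb-sx2
```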

2. Example 1: if re.match succeeds, it returns a Match object; otherwise it returns None

import re

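# re.match returns a Match object on success and None on failure
a = re.match(r"\d{3}\-\d{6,8}", "010-123467")   # matches
b = re.match(r"\d{3}\-\d{6,8}", "010 123467")   # no "-" in the string, so no match
print(a)
print(b)

# A Match object is truthy and None is falsy, so the result can be tested directly
if a:
    print("yes")
else:
    print("no")
if b:
    print("yes")
else:
    print("no")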


3. Splitting strings: a regular expression can split on runs of consecutive spaces and other unwanted separators

import re

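s1 = "a b     c  d"
s2 = "a, b,c,   d,  e:  f"
x1 = s1.split(" ")               # plain split leaves empty strings between consecutive spaces
x2 = re.split(r" +", s1)         # split on one or more spaces
x3 = re.split(r"[\s,:]+", s2)    # split on runs of whitespace, commas, or colons

print("x1=", x1)
print("x2=", x2)   # x2= ['a', 'b', 'c', 'd']
print("x3=", x3)   # x3= ['a', 'b', 'c', 'd', 'e', 'f']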


4. Groups: use () to define a group and extract substrings

import re

m = re.match(r"^(\d{3})-(\d{5,8})", "010-123456")
print(m)
print(m.group(0))   # the whole match: 010-123456
print(m.group(1))   # first group: 010
print(m.group(2))   # second group: 123456


import re

time = "19:34:46"
# Hours 00-23, then minutes and seconds 00-59
m = re.match(r"(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])", time)
print(m.groups())   # ('19', '34', '46')
print(m.group(1))
print(m.group(2))
print(m.group(3))


5. Greedy matching

Regular expressions match greedily by default, that is, they match as many characters as possible.

import re

m = re.match(r"^(\d+)(0*)$", "230010000")
print(m.groups())   # ('230010000', '')


Here the intent was to capture two groups: the run of zeros at the end of the string (0 to n of them), and the digits that come before it.

However, because "\d+" is greedy, it also consumes the trailing "0000", which leaves "0*" with nothing but the empty string to match.

To make "\d+" non-greedy, so that it matches as few characters as possible, add a "?" after it.

import re

m = re.match(r"^(\d+?)(0*)$", "230010000")
print(m.groups())   # ('23001', '0000')


6. Compile

When we use a regular expression in Python, the re module does two things internally:

It compiles the regular expression; if the pattern string itself is invalid, an error is raised;

It uses the compiled regular expression to match the string.

If a regular expression is going to be reused thousands of times, we can precompile it for efficiency. The compile step is then skipped on each later use, and matching happens directly.

import re

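# Compile once, then reuse the compiled pattern object for every match
pattern = re.compile(r"^(\d{3})-(\d{5,9})$")
s1 = "010-1234567"
s2 = "212-3421534"
m1 = pattern.match(s1)
m2 = pattern.match(s2)
print(m1.groups())   # ('010', '1234567')
print(m2.groups())   # ('212', '3421534')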
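The compile step is also where an invalid pattern gets reported. A minimal sketch, assuming Python 3; the helper name try_compile is hypothetical, made up for this example.

```python
import re

def try_compile(pattern):
    """Return the compiled pattern, or None if the pattern string is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print("invalid pattern %r: %s" % (pattern, exc))
        return None

good = try_compile(r"^(\d{3})-(\d{5,9})$")
bad = try_compile(r"^(\d{3}-(\d{5,9})$")   # one closing parenthesis is missing
```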


7. Homework 1

Try to write a regular expression that validates email addresses. Version 1 should be able to validate addresses like these:

[email protected]
[email protected]

import re

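pattern = re.compile(r"^[\w.]+@\w+\.com$")   # "\." matches a literal dot; a bare "." would match any character
# Placeholder addresses for illustration only; the original sample addresses are obscured in this copy
s1 = "someone@example.com"
s2 = "first.last@example.com"
print(pattern.match(s1).group())
print(pattern.match(s2).group())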


8. Homework 2

Version 2 should validate email addresses that carry a name, and extract the name:

[email protected]

import re

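# Group 1 captures the name in angle brackets; "\." matches a literal dot
pattern = re.compile(r"^(<[\w\s]+>)\s[\w.]+@\w+\.\w+$")
# Placeholder address for illustration only; the original address is obscured in this copy
s = "<Tom Paris> tom@example.org"
print(pattern.match(s).group())    # the whole string
print(pattern.match(s).group(1))   # <Tom Paris>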


That's all.

