Regular expressions, the re module

regular expression

When writing programs or web pages that process strings, you often need to find strings that conform to some complex rule. Regular expressions are the tool used to describe such rules. In other words, a regular expression is a piece of code that records a text pattern.


^: matches at the beginning of the string
$: matches at the end of the string
For example, if a website requires a QQ number to be 5 to 12 digits long, you can check it with ^\d{5,12}$.
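
As a rough illustration, the QQ number check could be written in Python like this. The helper name is_valid_qq is hypothetical, and re.ASCII is an added assumption so that \d only accepts the digits 0-9.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
# re.ASCII (an assumption for this sketch) limits \d to the digits 0-9.
QQ_PATTERN = re.compile(r"^\d{5,12}$", re.ASCII)

def is_valid_qq(text):
    """Hypothetical helper: True if text is made of 5 to 12 digits."""
    return QQ_PATTERN.search(text) is not None

print(is_valid_qq("12345"))       # True
print(is_valid_qq("1234"))        # False: too short
print(is_valid_qq("12345abc"))    # False: trailing letters
```

Note that $ also matches just before a trailing newline, so "12345\n" passes this check; re.fullmatch(r"\d{5,12}", text) is a stricter alternative.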

Character escape
To match a character that has a special meaning in a regular expression, such as . or \, put a backslash in front of it.
To find deerchao.net, escape the dot: deerchao\.net
To find C:\Windows, escape the backslash: C:\\Windows
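
A small sketch of escaping in Python, using raw strings for the patterns. The sample text is made up for illustration; re.escape is the standard-library function that escapes a literal string for you.

```python
import re

text = r"Visit deerchao.net or open C:\Windows"

# An unescaped dot matches any character, so "deerchaoXnet" also matches.
loose = re.search(r"deerchao.net", "deerchaoXnet")
print(loose is not None)     # True

# An escaped dot matches only a literal ".".
strict = re.search(r"deerchao\.net", "deerchaoXnet")
print(strict is not None)    # False

# Inside a raw string, \\ is the pattern for one literal backslash.
path = re.search(r"C:\\Windows", text).group()
print(path)                  # C:\Windows

# re.escape builds the escaped pattern from a literal string.
escaped = re.escape("deerchao.net")
print(escaped)               # deerchao\.net
```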

repeat


Windows\d+ matches Windows followed by 1 or more digits
^\w+ matches the first word of a line

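11{1,3}: matches a 1 followed by one to three more 1s (11, 111, or 1111), because {1,3} applies only to the character immediately before it. For example, in 1141114111411114 it matches 11, 111, 111, and 1111. To repeat the pair 11 one to three times, group it: (11){1,3}.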
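
The quantifiers above can be tried out with re.findall. This is only a sketch; the sample strings other than 1141114111411114 are invented for illustration.

```python
import re

# \d+ needs at least one digit, so the bare "Windows" is skipped.
versions = re.findall(r"Windows\d+", "Windows98 Windows2000 Windows")
print(versions)      # ['Windows98', 'Windows2000']

# ^\w+ picks up the first word of the string.
first_word = re.findall(r"^\w+", "hello world")
print(first_word)    # ['hello']

# {1,3} applies only to the second 1: a 1 followed by one to three more 1s.
runs = re.findall(r"11{1,3}", "1141114111411114")
print(runs)          # ['11', '111', '111', '1111']
```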


character class

[aeiou] matches any English vowel, [.?!] matches punctuation (. or ? or !)

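[0-9] means the same as \d: a single digit. Similarly, [a-z0-9A-Z_] is equivalent to \w when only ASCII text is considered. In Python 3, \d and \w in str patterns also match Unicode digits and letters unless the re.ASCII flag is set.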

\(?0\d{2}[) -]?\d{8} reads as follows:
1. \(? : an escaped (, appearing 0 or 1 times
2. 0\d{2} : a 0 followed by 2 digits
3. [) -]? : one of ), space, or -, appearing 0 or 1 times
4. \d{8} : 8 digits
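
A quick sketch of character classes in Python. The candidate phone numbers are invented sample data; fullmatch is used here so the whole string has to fit the pattern.

```python
import re

# A character class matches exactly one character from the set.
vowels = re.findall(r"[aeiou]", "regular")
print(vowels)    # ['e', 'u', 'a']

phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")
candidates = ["(010)88886666", "022-22334455", "02912345678", "010)12345678"]

# Map each sample string to whether the whole string matches.
results = {c: phone.fullmatch(c) is not None for c in candidates}
for candidate, ok in results.items():
    print(candidate, ok)
```

All four candidates match, including the unbalanced 010)12345678, because each bracket is optional on its own.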

The branch condition
The expression just now can also match "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branch conditions (alternation). A branch condition in a regular expression means that there are several rules, and satisfying any one of them counts as a match. The rules are separated with |.

0\d{2}-\d{8}|0\d{3}-\d{7}: this expression matches two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (such as 010-12345678), or a 4-digit area code with a 7-digit local number (such as 0376-2233445).
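
The two-branch pattern can be checked in Python as follows. This is a minimal sketch; the sentence being searched is made up around the two sample numbers from the text.

```python
import re

phone = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")

# Each number is matched by whichever branch fits it.
found = phone.findall("Call 010-12345678 or 0376-2233445.")
print(found)    # ['010-12345678', '0376-2233445']

# A number that fits neither branch is rejected.
print(phone.fullmatch("010-1234567"))    # None
```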

grouping

(\d{1,3}\.){3}\d{1,3} reads as follows:
1. (\d{1,3}\.) is a group: 1 to 3 digits followed by an escaped dot (\. matches a literal .)
2. {3} repeats the group 3 times
3. \d{1,3} then matches 1 to 3 more digits
\d: any digit from 0 to 9
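
A short sketch of the grouped pattern in Python. The name ip_like and the sample addresses are assumptions for illustration; the pattern only checks the shape of the text, not the numeric range of each part.

```python
import re

ip_like = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

ok = ip_like.fullmatch("192.168.0.1") is not None
short = ip_like.fullmatch("192.168.0") is not None
out_of_range = ip_like.fullmatch("999.999.999.999") is not None

print(ok)              # True
print(short)           # False: the group repeats only twice
print(out_of_range)    # True: the pattern does not check that each part is <= 255
```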

Negation

Sometimes you need to find characters that do not belong to an easily defined character class. For example, to find any character other than a digit, you need negation:


Example: \S+ matches strings that do not contain whitespace.

<a[^>]+> matches a string that starts with <a, continues with one or more characters other than >, and ends with >, for example an opening a tag with its attributes.
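
The negated forms can be tried like this. The sample strings are invented; \D (any non-digit) is the standard counterpart of \d and is included as an extra illustration.

```python
import re

# \S+ : runs of non-whitespace characters.
words = re.findall(r"\S+", "one  two\tthree")
print(words)         # ['one', 'two', 'three']

# \D+ : runs of characters that are not digits.
non_digits = re.findall(r"\D+", "abc123def")
print(non_digits)    # ['abc', 'def']

# [^>]+ : one or more characters other than ">".
tags = re.findall(r"<a[^>]+>", '<a href="x.html">link</a> <b>bold</b>')
print(tags)          # ['<a href="x.html">']
```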

re module

The re.match function
re.match attempts to match a pattern at the beginning of the string. If the pattern does not match at the beginning, match() returns None.

Function syntax:
re.match(pattern, string, flags=0)
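
A minimal sketch of calling re.match with this signature. The sample strings are made up; re.IGNORECASE is one of the standard flags and is shown only as an example of the flags argument.

```python
import re

# The pattern matches at the start, so a match object is returned.
at_start = re.match(r"\d+", "123abc")
print(at_start.group())    # 123

# The digits are not at the start, so the result is None.
not_at_start = re.match(r"\d+", "abc123")
print(not_at_start)        # None

# flags changes how the pattern is applied, here ignoring case.
ignore_case = re.match(r"abc", "ABC123", flags=re.IGNORECASE)
print(ignore_case.group()) # ABC
```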

Example 1:
import re
s = "ab<h1>xxx</h1>dsafasdf<html>sdfads</html>"
reg = re.compile(r"(<(?P<tag>\w+)>(.*)</(?P=tag)>)")
print(reg.findall(s))

Result:
[('<h1>xxx</h1>', 'h1', 'xxx'), ('<html>sdfads</html>', 'html', 'sdfads')]

Analysis:
1. compile turns the regular expression into a pattern object. This is convenient, and can save time, when the same expression is used many times.
2. r"": a raw string. Backslashes inside it are not treated as Python escape characters, so \w reaches the regular expression engine unchanged. Regular expression metacharacters such as ? and $ still need a \ in front of them to be matched literally.
3. (?P<tag>\w+): captures one or more word characters in a group named tag.
   1. (?P<name>...) names a group; here the name is tag, and the later (?P=tag) matches the same text again.
   2. \w: a letter, digit, or underscore.
4. .*: greedy mode, matching as many characters as possible.
5. When the pattern contains groups in (), findall returns the text captured by the groups rather than the whole match; with several groups, each match becomes a tuple in the list.
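
Two of the points above, raw strings and the shape of the findall result, can be checked with a short sketch. The sample string "10-20 30-40" is invented for illustration.

```python
import re

# A raw string keeps the backslash as written.
same = (r"\d" == "\\d")
print(same)                    # True
print(len("\n"), len(r"\n"))   # 1 2

# No group: findall returns the whole matches.
whole = re.findall(r"\d+-\d+", "10-20 30-40")
print(whole)                   # ['10-20', '30-40']

# One group: findall returns what the group captured.
one_group = re.findall(r"(\d+)-\d+", "10-20 30-40")
print(one_group)               # ['10', '30']

# Several groups: findall returns one tuple per match.
two_groups = re.findall(r"(\d+)-(\d+)", "10-20 30-40")
print(two_groups)              # [('10', '20'), ('30', '40')]
```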

Example 2:
import re
s = "ab<h1>xxx</h1>dsafasdf<html>sdfads</html>"
reg = re.compile(r"(<(?P<tag>\w+)>(.*)</(?P=tag)>)")
print(reg.match(s))

The result is: None (no match, because s starts with ab, and the regular expression starts with <)

The difference between re.match and re.search:

re.match only matches at the beginning of the string. If the beginning of the string does not conform to the regular expression, the match fails and the function returns None. re.search scans the entire string and returns the first match it finds.
Example:
import re
s = "ab<h1>xxx</h1>dsafasdf<html>sdfads</html>"
reg = re.compile(r"(<(?P<tag>\w+)>(.*)</(?P=tag)>)")

# group(1) returns the text captured by the first pair of parentheses;
# group() is the same as group(0) and returns the whole match.
print(reg.search(s).group(1))  # <h1>xxx</h1>

If no match is successful, re.search() returns None
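
Because both functions can return None, it is safer to check the result before calling group(). A minimal sketch, reusing the sample string from the examples above:

```python
import re

s = "ab<h1>xxx</h1>dsafasdf<html>sdfads</html>"
reg = re.compile(r"<(?P<tag>\w+)>(.*)</(?P=tag)>")

# match only looks at the start of s, which is "ab", so it fails.
print(reg.match(s))            # None

# search scans forward and stops at the first match.
m = reg.search(s)
if m is not None:
    print(m.group(0))          # <h1>xxx</h1>
    print(m.group("tag"))      # h1
    print(m.span())            # (2, 14)
```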

split

import re
p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))

Result: ['one', 'two', 'three', 'four', '']
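
The trailing empty string appears because the input ends with a separator. A small sketch of two common follow-ups, dropping empty pieces and limiting the number of splits with the maxsplit argument:

```python
import re

p = re.compile(r"\d+")

parts = p.split("one1two2three3four4")
print(parts)        # ['one', 'two', 'three', 'four', '']

# Drop empty strings left over by a leading or trailing separator.
non_empty = [x for x in parts if x]
print(non_empty)    # ['one', 'two', 'three', 'four']

# maxsplit limits how many splits are made; the rest stays in one piece.
limited = p.split("one1two2three3four4", maxsplit=2)
print(limited)      # ['one', 'two', 'three3four4']
```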
