11 Python regular expressions

Overview

        In the previous section, we introduced Python's file operations: opening, reading, writing, and closing files, moving the file pointer, and obtaining directory listings. In this section, we introduce Python's regular expressions. A regular expression is a powerful tool for matching, searching, and replacing text, and it provides an efficient and flexible way to manipulate strings. With regular expressions, we can quickly find the strings that match a specific pattern in a large amount of text data.

Definition of regular expression

        A regular expression, also known as a regex or regexp, is a text pattern composed of ordinary characters and special characters. The pattern describes a set of strings, and it can be used to search for, validate, replace, and extract text that matches it.

        Python provides a built-in re module for dealing with regular expressions. By importing the re module, we can use its functions to perform regular expression operations.

Syntax of regular expressions

        The syntax of Python regular expressions includes some special characters and metacharacters, which can be used to represent specific patterns. The following table lists some commonly used Python regular expression syntax.

        Syntax      Meaning
        .           Matches any character except a newline
        ^           Matches the beginning of the string
        $           Matches the end of the string
        *           Matches the preceding subexpression zero or more times
        +           Matches the preceding subexpression one or more times
        ?           Matches the preceding subexpression zero or one time
        ()          Groups the enclosed expression and captures the text it matches
        a|b         Matches a or b
        {n}         Matches the preceding subexpression exactly n times
        {n,}        Matches the preceding subexpression at least n times
        {n,m}       Matches the preceding subexpression at least n but not more than m times
        [...]       Matches any one character in the set; for example, [A-Za-z] matches any ASCII letter
        [^...]      Matches any one character not in the set; for example, [^A-Za-z] matches any character except an ASCII letter
        \d          Matches any decimal digit; for ASCII text, equivalent to [0-9]
        \D          Matches any non-digit character; for ASCII text, equivalent to [^0-9]
        \s          Matches any whitespace character (space, tab, newline, form feed, etc.); for ASCII text, equivalent to [ \t\n\r\f\v]
        \S          Matches any non-whitespace character; for ASCII text, equivalent to [^ \t\n\r\f\v]
        \w          Matches any letter, digit, or underscore; for ASCII text, equivalent to [a-zA-Z0-9_]
        \W          Matches any character that is not a letter, digit, or underscore; for ASCII text, equivalent to [^a-zA-Z0-9_]

        In Python 3, \d, \s, and \w applied to str patterns also match the corresponding Unicode characters; pass the re.ASCII flag to restrict them to the ASCII sets shown above.
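        As a rough illustration of a few entries in the table, the sketch below applies several patterns with re.findall. The sample strings and variable names are made up for this example only.

```python
import re

# Sample strings are invented for illustration only.

# "." matches any single character except a newline.
dot_match = re.findall(r'c.t', 'cat cot c\nt')

# "{n,m}" bounds the number of repetitions: here, 2 to 3 digits.
digit_runs = re.findall(r'\d{2,3}', '7 42 123')

# "a|b" matches either alternative; "?" makes the preceding item optional.
colours = re.findall(r'colou?r|gr[ae]y', 'color colour grey gray')

# "[^...]" negates a set: runs of characters that are not ASCII letters.
non_letters = re.findall(r'[^A-Za-z]+', 'abc123-def')

print(dot_match)    # ['cat', 'cot']
print(digit_runs)   # ['42', '123']
print(colours)      # ['color', 'colour', 'grey', 'gray']
print(non_letters)  # ['123-']
```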

        In addition, there are some special character classes and escape sequences that are common in regular expressions, see the table below.

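        Syntax      Meaning
        \t          Tab
        \n          Line feed (newline)
        \r          Carriage return
        \f          Form feed
        \b          Word boundary (the empty position between a word character and a non-word character); inside a character class, [\b] is the backspace character
        \\          The backslash itself
        \'          The single quote itself
        \"          The double quote itself
        \0          Null character
        \xnn        The character with hexadecimal code nn, where nn is a two-digit hexadecimal number
        \unnnn      The Unicode character with code nnnn, where nnnn is a four-digit hexadecimal number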
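        Because both Python string literals and regular expressions use the backslash, patterns are usually written as raw strings (r'...'). The following is a minimal sketch, with invented sample strings, of how \b behaves outside and inside a character class and of why a raw string is convenient.

```python
import re

# Outside a character class, \b is a word boundary (it matches a position).
whole_word = re.findall(r'\bcat\b', 'cat concat cat!')

# Inside a character class, [\b] stands for the backspace character.
# The subject string 'a\bc' is a normal literal, so it contains a backspace.
has_backspace = re.search(r'[\b]', 'a\bc') is not None

# '\\.' in a normal literal and r'\.' in a raw literal are the same pattern:
# both match a literal dot.
same_pattern = ('\\.' == r'\.')

print(whole_word)     # ['cat', 'cat']
print(has_backspace)  # True
print(same_pattern)   # True
```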

re.search function

        The re.search function scans the given string for the first location where the regular expression matches and returns a match object. If the pattern matches at several positions, only the first (leftmost) match is returned; if no match is found, None is returned.

        The re.search function is defined as follows:

          re.search(pattern, string, flags=0)

        The meaning of each parameter is as follows:

        pattern : The regular expression to match.

        string : The string to search for.

        flags : Flags to control the behavior of the regular expression, optional. Multiple flags can be combined with the bitwise OR (|) operator. For example, re.IGNORECASE ignores case, and re.MULTILINE makes ^ and $ match at the beginning and end of each line rather than only of the whole string.
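        To make the flags parameter concrete, here is a small sketch that combines re.IGNORECASE and re.MULTILINE. The sample text is hypothetical.

```python
import re

text = "first line\nSecond LINE\nthird line"  # hypothetical sample text

# Without flags, "^" only matches at the very start of the string,
# and the comparison is case-sensitive, so nothing is found.
plain = re.search(r'^second line$', text)

# With both flags, case is ignored and "^" / "$" match at the start
# and end of every line.
flagged = re.search(r'^second line$', text, re.IGNORECASE | re.MULTILINE)

print(plain)            # None
print(flagged.group())  # Second LINE
```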

        The re.search function returns a match object, or None if no match is found. When the match succeeds, the match object provides the following methods.

        group(index) : Returns the text matched by the group at the specified index. If index is omitted or 0, the entire matched text is returned.

        groups() : Returns a tuple containing all groups (groups with index numbers greater than 0).

        start(index) : Returns the starting position of the group at the specified index in the string.

        end(index) : Returns the end position of the group at the specified index in the string.

        span(index) : Returns a tuple of the start and end positions of the group at the specified index in the string.

        The following calls raise an IndexError if the requested group does not exist in the pattern:

        group(index) : Attempting to get the text of a group that does not exist raises an exception.

        start(index), end(index), span(index) : Attempting to get the boundary positions of a group that does not exist raises an exception.

        We can understand the re.search function through the sample code below.

import re

text = "Hello CSDN!"
result = re.search(r'(CSDN)', text)
if result:
    # Output: Found: CSDN (6, 10)
    print("Found:", result.group(1), result.span(1))
else:
    print("Not found")

text = 'be greater than ever'
result = re.search(r'(.*) greater (.*?) .*', text)
if result:
    # Output: Found all: be greater than ever
    print("Found all:", result.group())
    # Output: Found group 1: be
    print("Found group 1:", result.group(1))
    # Output: Found group 2: than
    print("Found group 2:", result.group(2))
else:
    print("Not found")
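        The match object methods and the exception behavior described above can be sketched as follows. The sample strings are hypothetical, and the variable names are chosen only for this illustration.

```python
import re

result = re.search(r'(\d+)-(\d+)', 'range 10-20')  # hypothetical sample

print(result.group())    # 10-20  (same as group(0): the whole match)
print(result.groups())   # ('10', '20')
print(result.span(2))    # (9, 11)

# The pattern defines only two groups, so asking for group 3 raises IndexError.
raised = False
try:
    result.group(3)
except IndexError:
    raised = True
print(raised)            # True

# When nothing matches, re.search returns None, so test the result
# before calling any method on it.
missing = re.search(r'\d+', 'no digits here')
print(missing is None)   # True
```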

re.match function

        The re.match function tries to match a regular expression at the beginning of a string and returns a match object, or None if there is no match. Its signature, re.match(pattern, string, flags=0), takes the same parameters as re.search. The difference between the two is that re.match only tries the beginning of the string: if the string does not match the regular expression there, the match fails and None is returned. The re.search function, by contrast, scans the whole string until it finds a match.

import re

text = 'Hello CSDN'
result = re.match(r'Hello', text)
# Matching starts at the beginning of the string and succeeds. Output: Found: Hello
if result:
    print("Found:", result.group())
else:
    print("Not found")

text = 'CSDN Hello'
result = re.match(r'Hello', text)
# Matching starts at the beginning of the string and fails. Output: Not found
if result:
    print("Found:", result.group())
else:
    print("Not found")

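# The original address was lost in extraction; this is a placeholder.
text = 'user@example.com'
result = re.match(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$', text)
# Check whether the text looks like an e-mail address (simplified pattern). Output: Found
if result:
    print("Found")
else:
    print("Not found")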
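        For a side-by-side view of the difference, the short sketch below runs re.match and re.search against the same string. The variable names are chosen for this illustration only.

```python
import re

text = 'CSDN Hello'

at_start = re.match(r'Hello', text)   # 'Hello' is not at position 0
anywhere = re.search(r'Hello', text)  # found later in the string
prefix = re.match(r'CSDN', text)      # found at position 0

print(at_start)         # None
print(anywhere.span())  # (5, 10)
print(prefix.group())   # CSDN
```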

re.findall function

        The re.findall() function finds all non-overlapping matches of a regular expression in a string and returns them as a list.

        The re.findall() function is defined as follows:

          re.findall(pattern, string, flags=0)

        Here, pattern is the regular expression to look for, string is the string to search, and flags is the optional set of flags described for re.search. If the pattern contains no groups, each list item is the whole matched text; if it contains one group, each item is the text of that group; if it contains several groups, each item is a tuple of groups. An empty list is returned if no match is found.

import re

text = "Hello, CSDN! Be greater than ever."
matches = re.findall(r'\b\w+\b', text)
# Output: ['Hello', 'CSDN', 'Be', 'greater', 'than', 'ever']
print(matches)
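        What re.findall returns depends on how many groups the pattern contains. The following sketch, using a made-up sample string, shows the three cases.

```python
import re

text = "width=640, height=480"  # hypothetical sample text

# No group: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)
# One group: each item is the text of that group.
one_group = re.findall(r'\w+=(\d+)', text)
# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(whole)       # ['width=640', 'height=480']
print(one_group)   # ['640', '480']
print(two_groups)  # [('width', '640'), ('height', '480')]
```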

re.sub function

        The re.sub() function replaces the parts of a string that match a regular expression. It returns a new string in which each match is replaced by the specified replacement; the original string is not modified.

        The re.sub() function is defined as follows:

          re.sub(pattern, repl, string, count=0, flags=0)

        The meaning of each parameter is as follows:

        pattern : The regular expression to match.

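        repl : The replacement for each match. It can be a string, which may refer to captured groups with backreferences such as \1, or a function that takes a match object and returns the replacement string.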

        string : The original string to replace in.

        count : Specifies the maximum number of replacements, optional. The default is 0, which means replace all matching strings.

        flags : Flags to control the behavior of the regular expression, optional. Multiple flags can be used, combined via the bitwise OR (|) operator.

        In the sample code below, we use the re.sub() function to replace every word in a string with "CSDN". In the regular expression \b\w+\b, \w+ matches the characters of a word, and the \b on each side anchors the match at a word boundary.

import re

text = "Hello, world! Be greater than ever."
result = re.sub(r'\b\w+\b', 'CSDN', text)
# Output: CSDN, CSDN! CSDN CSDN CSDN CSDN.
print(result)
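        The count parameter and the two forms of repl can be sketched as follows. The date strings are invented for illustration.

```python
import re

text = "2023-09-05 and 2024-01-15"  # hypothetical sample text

# count=1 replaces only the first match.
first_only = re.sub(r'\d{4}', 'YYYY', text, count=1)

# In a replacement string, \1, \2, \3 refer to the captured groups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', text)

# repl may also be a function that receives each match object
# and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b20')

print(first_only)  # YYYY-09-05 and 2024-01-15
print(swapped)     # 05/09/2023 and 15/01/2024
print(doubled)     # a2 b40
```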

re.compile function

        The re.compile() function is used to compile the given regular expression into a reusable regular expression object. The function returns a regular expression object that can be used for repeated matching or search operations.

        The re.compile() function is defined as follows:

          re.compile(pattern, flags=0)

        The meaning of each parameter is as follows:

        pattern : The regular expression string to compile.

        flags : Flags to control the behavior of the regular expression, optional. Multiple flags can be used, combined via the bitwise OR (|) operator.

        In the sample code below, we use the re.compile() function to compile the regular expression \d+\w+\d+ into a reusable regular expression object, and use this object for a search operation.

import re

pattern = re.compile(r'\d+\w+\d+')
result = pattern.search('Hello 666OK999 CSDN')
# Output: 666OK999
print(result.group())
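        A compiled pattern object offers methods such as search(), findall(), and sub() that mirror the module-level functions. The sketch below compiles one pattern with a flag and reuses it; the sample strings are hypothetical.

```python
import re

# Compile once with a flag, then reuse the same object several times.
word = re.compile(r'csdn', re.IGNORECASE)

found = word.search('Hello CSDN').group()
all_found = word.findall('csdn, Csdn, CSDN')
replaced = word.sub('***', 'Visit csdn today')

print(found)      # CSDN
print(all_found)  # ['csdn', 'Csdn', 'CSDN']
print(replaced)   # Visit *** today
```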

re.finditer function

        The re.finditer() function is used to find matches of a regular expression in a string and returns an iterator containing the matching results. Each matching result is a Match object, and the matched string can be obtained through the group() method of the object.

        The re.finditer() function is defined as follows:

          re.finditer(pattern, string, flags=0)

        The meaning of each parameter is as follows:

        pattern : The regular expression pattern to match.

        string : The string to look for matches in.

        flags : Flags to control the behavior of the regular expression, optional. Multiple flags can be used, combined via the bitwise OR (|) operator.

        In the following sample code, we compile a pattern and call its finditer() method, which behaves like re.finditer(), to obtain an iterator of match objects. We then traverse the iterator and output each matched string.

import re

text = 'Hello 666 CSDN 999'
pattern = re.compile(r'\d+')
matches = pattern.finditer(text)
# Prints 666, then 999 (one per line)
for match in matches:
    print(match.group())
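        Because finditer yields match objects rather than plain strings, the position of each match is also available. A minimal sketch, reusing the sample text from above:

```python
import re

text = 'Hello 666 CSDN 999'

# Collect the matched text together with its start and end positions.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]

print(positions)  # [('666', 6, 9), ('999', 15, 18)]
```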

re.split function

        The re.split() function is used to split a string according to a regular expression and return a list of split substrings.

        The re.split() function is defined as follows:

          re.split(pattern, string, maxsplit=0, flags=0)

        The meaning of each parameter is as follows:

        pattern : The regular expression pattern to use to split the string.

        string : The string to split.

        maxsplit : Specifies the maximum number of splits, optional. If it is nonzero, at most maxsplit splits are made and the rest of the string is returned as the last element of the list. The default value is 0, which means no limit.

        flags : Flags to control the behavior of the regular expression, optional. Multiple flags can be combined with the bitwise OR (|) operator.

        In the sample code below, we split a string using the re.split() function and return a list of the split substrings.

import re

text = 'ocean-sky-continent'
result = re.split('-', text)
# Output: ['ocean', 'sky', 'continent']
print(result)
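        Splitting on a fixed "-" could also be done with str.split; re.split becomes useful when the separator is itself a pattern. The sketch below, with an invented sample string, splits on several separators, limits the number of splits with maxsplit, and shows that a capturing group in the pattern keeps the separators in the result.

```python
import re

text = 'ocean, sky;continent  island'  # hypothetical sample text

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r'[,;\s]+', text)

# maxsplit=1: split once; the remainder stays in the last element.
limited = re.split(r'[,;\s]+', text, maxsplit=1)

# A capturing group in the pattern keeps the separators in the list.
kept = re.split(r'(-)', 'ocean-sky')

print(parts)    # ['ocean', 'sky', 'continent', 'island']
print(limited)  # ['ocean', 'sky;continent  island']
print(kept)     # ['ocean', '-', 'sky']
```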

Origin blog.csdn.net/hope_wisdom/article/details/132701007