[Python crawler development basics ②] Python basics (regular expressions)

  • Friendly reminder: Because the articles in this column focus on crawlers, they cannot cover everything about Python. Only the key points are covered here.
    If you feel that something has been left out, you are welcome to add it~

  • Previous recommendations : [Python crawler development basics ①] Python basics (1)
    The previous article covered Python's basic variable types. Today we will look at regular expressions, which are used very often in crawler development.



1 What is a regular expression

A regular expression (often abbreviated as regex, regexp, or RE in code) is a pattern string used to search, replace, split, and match text. The basic idea is to use special characters to describe a pattern and then look for text that matches it.

The role of regular expressions:

  1. Matching : Determine whether a given string meets the filter logic of the regular expression;
  2. Obtain substring : We can obtain the specific part we want from the string through regular expressions.

Features of regular expressions:

  1. Great flexibility, logic, and functionality;
  2. Complex string handling can be achieved quickly and very concisely;
  3. For newcomers, the syntax is relatively obscure.

2 The use of regular expressions in Python

In Python, the re module and regex module are mainly used to implement regular expression operations. The following are detailed explanations respectively.

2.1 re module

Python's re module is a powerful and flexible tool for matching, replacing, and splitting strings. It can be used on all kinds of text data, including text files, log files, and source code. It provides functions for searching, matching, replacing, splitting, and extracting text, which helps with efficient text processing.

Note: The re module is part of Python's standard library, so no additional installation is required. Unlike third-party libraries, it can be used without running pip install in the terminal.
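Splitting is mentioned above but not shown elsewhere in this article, so here is a minimal sketch using re.split(). The sample line and its separators are made up for illustration.

```python
import re

# Hypothetical input: fields separated by commas or semicolons,
# with optional whitespace around the separators.
line = "alice, bob;carol ,dave"

# Split on a comma or semicolon together with any surrounding whitespace.
fields = re.split(r"\s*[,;]\s*", line)
print(fields)  # ['alice', 'bob', 'carol', 'dave']
```

The character set [,;] and the \s shorthand used here are explained in section 3.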

2.1.1 re.search()

re.search() is a commonly used function in the re module. It searches a string for a specified regular expression pattern. The function scans the string until it finds the first substring that matches the pattern and returns a match object (Match Object). If no matching substring is found, it returns None.

A common usage of the re.search() function is as follows:

match_object = re.search(pattern, string, flags=0)

Here, the pattern parameter is the regular expression to match, the string parameter is the string to search, and the flags parameter sets matching options (such as whether to ignore case). When the function returns a match object, you can call match_object.group() to obtain the matched substring. The argument of this method is the number of the group to retrieve: group() or group(0) returns the whole match, and if the regular expression contains several pairs of parentheses, each pair defines a group, numbered from 1 in left-to-right order of the opening parentheses.

For example, here is sample code to search a string for a regular expression:

import re

text = "Python is a popular programming language"
pattern = "programming"
match_object = re.search(pattern, text)

if match_object:
    print("Found a match:", match_object.group())
else:
    print("No match found.")

Output result:

Found a match: programming
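The group numbering and the flags parameter described above can be sketched as follows. The pattern is made up for illustration; re.IGNORECASE is one of the standard flags of the re module.

```python
import re

text = "Python is a popular programming language"

# Two pairs of parentheses define two groups, numbered 1 and 2.
match_object = re.search(r"(\w+) is a popular (\w+)", text)
if match_object:
    whole = match_object.group(0)   # the entire matched substring
    first = match_object.group(1)   # text captured by the first group
    second = match_object.group(2)  # text captured by the second group
    print(whole)   # Python is a popular programming
    print(first)   # Python
    print(second)  # programming

# The flags argument changes how matching works, e.g. ignoring case.
ci = re.search("PYTHON", text, flags=re.IGNORECASE)
print(ci.group())  # Python
```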

2.1.2 re.match()

In Python, the re.match() function is similar to re.search(): both look for a regular expression match in a string. However, re.match() only matches at the beginning of the string and returns None if the pattern does not match there. Therefore, re.match() is better suited to scenarios where the match must start at the beginning of the string.

re.match() is called in the same way as re.search(). Let's run it on the same test string and look at the result:

text = "Python is a popular programming language"
pattern = "programming"
match_object = re.match(pattern, text)

if match_object:
    print("Found a match:", match_object.group())
else:
    print("No match found.")

output:

No match found.

Since "programming" does not appear at the beginning of the string, re.match() cannot match it.
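For comparison, here is a short sketch of the same string with a pattern that does occur at the start. re.fullmatch(), which requires the whole string to match, is included as a related standard-library function that the article does not otherwise cover.

```python
import re

text = "Python is a popular programming language"

at_start = re.match("Python", text)        # the pattern occurs at index 0
anywhere = re.search("programming", text)  # the pattern occurs later in the string
whole = re.fullmatch("Python", text)       # the pattern must cover the whole string

print(at_start.group())   # Python
print(anywhere.start())   # 20
print(whole)              # None
```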

2.1.3 re.findall()

re.findall() is another pattern matching function provided by the re module. It searches a string for all non-overlapping substrings that match the regular expression and returns them as a list, where each element is a matched substring. Unlike re.search() and re.match(), re.findall() returns all matches, not just the first one. It is therefore very useful when you need every occurrence of a pattern in a text.

Here is sample code for this function:

import re

# Define a regular expression that matches one or more consecutive digits
pattern = r'\d+'

# Define the string to search
text = "Today is Oct 15, 2021, and the temperature is 20 degrees Celsius."

# Use re.findall() to find all matches and store them in a list
matches = re.findall(pattern, text)

# Print the matches
print(matches)

In this example, we first define a regular expression pattern, r'\d+', which matches one or more consecutive digits. Then we define the string to search, text. Next, we use re.findall() to find all matches and store them in a list, matches. Finally, we print the matches.

Output result:

['15', '2021', '20']

This is because the example text contains three runs of digits: 15, 2021, and 20. re.findall() finds them and returns them in a list.
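re.findall() returns only the matched text. When the position of each match is also needed, re.finditer() yields match objects instead. A minimal sketch on the same text:

```python
import re

text = "Today is Oct 15, 2021, and the temperature is 20 degrees Celsius."

# Each item yielded by finditer() is a match object, so both the text
# and its start index are available.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

for number, start in positions:
    print(number, "found at index", start)
```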

2.1.4 re.sub()

re.sub() is another pattern matching function provided in the Python re module, which is used to replace a substring matched by a certain pattern in a string . The re.sub() function returns a new string in which all substrings matching the specified pattern are replaced with the specified content.

Here is a simple code example of how to use re.sub() for string substitution:

import re

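# Define a regular expression that matches every occurrence of 'is'
pattern = 'is'

# Define the string to process
text = "The pattern of the book is not easy to find."

# Use re.sub() to replace each match with the given string
new_text = re.sub(pattern, "was", text)

# Print the result
print(new_text)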

In this example, we first define a regular expression pattern, 'is', which matches every occurrence of the characters "is". Then we define the string to process, text. Next, we use re.sub() to replace all matches with "was". Finally, we print the new string.

output:

The pattern of the book was not easy to find.
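Beyond a fixed replacement string, re.sub() also accepts backreferences such as \1 in the replacement and a count argument that limits the number of replacements. A minimal sketch with made-up dates (capture groups are covered in section 3.5):

```python
import re

text = "2021-10-15 and 2022-01-31"

# Reorder year-month-day into day/month/year using numbered backreferences.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(swapped)     # 15/10/2021 and 31/01/2022

# count=1 replaces only the first match.
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)
print(first_only)  # YYYY-10-15 and 2022-01-31
```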

2.2 regex module

In addition to the re module in the standard library, there are some third-party regular expression modules, such as the regex module, which provide functions that are more comprehensive, advanced, and compatible with Perl regular expression syntax than the re module.

The regex module is similar to the re module and provides most of the same functions, but it supports more regular expression syntax and features, such as more complex assertions, Unicode properties, and matching nested structures. Its performance may also differ from that of the re module depending on the pattern and the input, so it is worth measuring on your own data.

In short, there are many regular expression modules in Python, among which the re module in the standard library is the most commonly used. If you need more complex syntax and functionality when dealing with regular expressions, you can try the regex module.


3 Classification of regular expressions

A regular expression consists of ordinary characters and metacharacters. Ordinary characters, such as uppercase and lowercase letters and digits, match themselves, while metacharacters have special meanings and are the building blocks we use to describe patterns.

3.1 Simple metacharacters

Metacharacter   Effect
\               Marks the next character as a special character, a literal character, a backreference, or an octal escape.
^               Matches the start of the string (and of each line with re.MULTILINE).
$               Matches the end of the string (and of each line with re.MULTILINE).
*               (asterisk) Matches the preceding subexpression zero or more times.
+               (plus sign) Matches the preceding subexpression one or more times.
?               Matches the preceding subexpression zero or one time.

Sample code:

import re
# Match a character at the start
res1 = re.match('^a', 'abandon')
print(res1)          # <re.Match object; span=(0, 1), match='a'>
print(res1.group())  # a
# Match a character at the end
res2 = re.match('.*d$', 'wood')
print(res2)          # <re.Match object; span=(0, 4), match='wood'>
print(res2.group())  # wood
# Match a character that appears one or more times
res3 = re.match('a+', 'aabcd')
print(res3)          # <re.Match object; span=(0, 2), match='aa'>
print(res3.group())  # aa
# Match a character that appears zero or one time
res4 = re.match('a?', 'aaabandon')
print(res4)          # <re.Match object; span=(0, 1), match='a'>
print(res4.group())  # a
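The backslash row of the table is not covered by the sample above, so here is a small sketch of escaping. It uses the dot metacharacter, which is introduced in section 3.2, and re.escape() from the standard library. The test strings are made up.

```python
import re

# An unescaped "." matches any character, so "3x50" is matched as well.
unescaped = re.findall(r"3.50", "3.50 3x50")
print(unescaped)  # ['3.50', '3x50']

# "\." matches only a literal dot.
escaped = re.findall(r"3\.50", "3.50 3x50")
print(escaped)    # ['3.50']

# re.escape() escapes every special character in a literal string.
price = "Total: 3.50 USD (approx.)"
found = re.search(re.escape("(approx.)"), price)
print(found.group())  # (approx.)
```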

3.2 Metacharacters for single-character matching

Metacharacter   Effect
.               (dot) Matches any single character except a newline ("\n"), unless the re.DOTALL flag is set.
\d              Matches a digit character.
\D              Matches a non-digit character.
\f              Matches a form feed character.
\n              Matches a newline character.
\r              Matches a carriage return.
\s              Matches any whitespace character, including spaces, tabs, form feeds, and so on.
\S              Matches any non-whitespace character.
\t              Matches a tab character.
\w              Matches any word character (letters, digits, and the underscore).
\W              Matches any non-word character.

Note: Adding a plus sign (+) after one of these metacharacters matches one or more characters of that type.

Code example one (.):

import re

# Pattern to match: "py" followed by any single character
pattern = "py."
# Test string 1
test_str1 = "python"
result1 = re.match(pattern, test_str1)
print(result1)  # Output: <re.Match object; span=(0, 3), match='pyt'>

Code example two (\d, \D):

import re

# Pattern to match: a single digit (raw string so that \d is not treated as a string escape)
pattern = r"\d"
# Test string 2
test_str2 = "The price is 199.99 dollars"
result2 = re.findall(pattern, test_str2)
print(result2)  # Output: ['1', '9', '9', '9', '9']

# Pattern to match: a single non-digit character
pattern = r"\D"

# Test string 3
test_str3 = "My phone number is 555-1234"
result3 = re.findall(pattern, test_str3)
print(result3)  # Output: ['M', 'y', ' ', 'p', 'h', 'o', 'n', 'e', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ', '-']

Code example three (\s, \S):

import re

# Patterns to match
pattern1 = r"\s+"  # one or more whitespace characters
pattern2 = r"\S+"  # one or more non-whitespace characters

# Test string 1
test_str1 = "Hello\tworld\n"
result1 = re.findall(pattern1, test_str1)
print(result1)  # Output: ['\t', '\n']

result2 = re.findall(pattern2, test_str1)
print(result2)  # Output: ['Hello', 'world']

# Test string 2 (note the leading and trailing spaces)
test_str2 = " This is a demo. "
result3 = re.findall(pattern1, test_str2)
print(result3)  # Output: [' ', ' ', ' ', ' ', ' ']

result4 = re.findall(pattern2, test_str2)
print(result4)  # Output: ['This', 'is', 'a', 'demo.']

Code example four (\w, \W):

import re

# Patterns to match
pattern1 = r"\w+"  # one or more word characters
pattern2 = r"\W+"  # one or more non-word characters

# Test string 1
test_str1 = "Hello, world!"
result1 = re.findall(pattern1, test_str1)
print(result1)  # Output: ['Hello', 'world']

result2 = re.findall(pattern2, test_str1)
print(result2)  # Output: [', ', '!']

# Test string 2
test_str2 = "This is a demo."
result3 = re.findall(pattern1, test_str2)
print(result3)  # Output: ['This', 'is', 'a', 'demo']

result4 = re.findall(pattern2, test_str2)
print(result4)  # Output: [' ', ' ', ' ', '.']

3.3 Metacharacters for character set matching

Metacharacter   Effect
[xyz]           Character set. Matches any one of the enclosed characters.

Code example :

import re

# Patterns to match
pattern1 = r"[aeiou]"  # any lowercase vowel
pattern2 = r"[A-Z]"    # any uppercase letter

# Test string 1
test_str1 = "Hello, world!"
result1 = re.findall(pattern1, test_str1)
print(result1)  # Output: ['e', 'o', 'o']

# Test string 2
test_str2 = "This is a Demo."
result2 = re.findall(pattern2, test_str2)
print(result2)  # Output: ['T', 'D']
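Character sets can also be negated with a leading ^ and can combine several ranges. The following sketch illustrates both; the test strings are made up.

```python
import re

# [^a-z] matches any single character that is NOT a lowercase letter.
not_lower = re.findall(r"[^a-z]", "Hello, world!")
print(not_lower)  # ['H', ',', ' ', '!']

# Several ranges can be combined inside one set.
alnum_runs = re.findall(r"[A-Za-z0-9]+", "user_42@example.com")
print(alnum_runs)  # ['user', '42', 'example', 'com']
```

Note that inside a set, ^ means negation only when it is the first character; elsewhere in a pattern it anchors the match to the start of the string.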

3.4 Metacharacters for Quantifier Matching

Metacharacter   Effect
{n}             n is a non-negative integer. Matches exactly n times.
{n,}            n is a non-negative integer. Matches at least n times.
{n,m}           m and n are non-negative integers with n <= m. Matches at least n times and at most m times.

Code example :

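import re

# Patterns to match
pattern1 = r"[a-z]{3}"  # exactly three consecutive lowercase letters
pattern2 = r"\d{2,3}"   # two or three consecutive digits

# Test string 1
test_str1 = "apple banana cherry"
result1 = re.match(pattern1, test_str1)
print(result1)  # Output: <re.Match object; span=(0, 3), match='app'>

# Test string 2
# findall() scans left to right and takes up to three digits at a time;
# a single leftover digit (the final "0" and "4") is too short to match.
test_str2 = "1234567890 12 123 1234"
result2 = re.findall(pattern2, test_str2)
print(result2)  # Output: ['123', '456', '789', '12', '123', '123']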

3.5 Metacharacters for group matching

Metacharacter   Effect
()              Defines the expression between ( and ) as a "group" and captures the text it matches; the captured text can be retrieved with group() or, as a tuple, with groups().

Sample code :

import re

# Patterns to match
pattern1 = r"(\d{3})-(\d{4})-(\d{4})"  # a phone number in 3-4-4 format
pattern2 = r"<(\w+)>.*</\1>"           # an XML node of the form <tag>value</tag>

# Test string 1
test_str1 = "My phone number is 123-4567-8901."
result1 = re.search(pattern1, test_str1)
if result1:
    area_code, prefix, line_number = result1.groups()
    print("Area code: {}, Prefix: {}, Line number: {}".format(area_code, prefix, line_number))
else:
    print("No match.")

# Test string 2
test_str2 = "<title>This is a title</title>"
result2 = re.match(pattern2, test_str2)
if result2:
    tag = result2.group(1)
    print("Tag name: {}".format(tag))
else:
    print("No match.")

output:

Area code: 123, Prefix: 4567, Line number: 8901
Tag name: title

In the above example, we defined two patterns: (\d{3})-(\d{4})-(\d{4}), where each () turns the enclosed subexpression into a capture group whose content can be retrieved later through the groups() method, and <(\w+)>.*</\1>, where \1 refers to the text matched by the first capture group. We then use re.search() on the first test string and re.match() on the second to see whether they match these patterns. The first test string contains a phone number in 3-4-4 format; we match it and extract the area code, prefix, and line number. The second test string is an XML node; we match it and extract the node's name.
You can modify this example as needed to implement more complex text processing.

  • Finally, there is the metacharacter "|", which performs a logical "or" between two alternative patterns.

The sample code is as follows:

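import re

# Pattern to match: cat, dog, or bird, followed by zero or more digits.
# Because the pattern contains one capture group, findall() returns
# the text of that group rather than the whole match.
pattern = r"(cat|dog|bird)\d*"

# Test string
test_str = "I have a cat3 and a dog4 but no bird."
results = re.findall(pattern, test_str)
if results:
    print("Matching results: ", results)
else:
    print("No match.")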

output:

Matching results:  ['cat', 'dog', 'bird']
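Notice that the digits are missing from the output: when a pattern contains a capture group, re.findall() returns the group's text instead of the whole match. If the whole match is wanted, one option is a non-capturing group, written (?:...). A minimal sketch on the same test string:

```python
import re

test_str = "I have a cat3 and a dog4 but no bird."

# Capturing group: findall() returns only what the group captured.
captured = re.findall(r"(cat|dog|bird)\d*", test_str)
print(captured)  # ['cat', 'dog', 'bird']

# Non-capturing group: findall() returns the whole match, digits included.
full = re.findall(r"(?:cat|dog|bird)\d*", test_str)
print(full)      # ['cat3', 'dog4', 'bird']
```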

4 Equivalence of regular expressions

Regular expressions can be hard to read partly because of equivalence: many symbols are shorthand for a longer form. This shorthand makes patterns harder to understand and confuses many beginners. If you expand each shorthand back into its original form, writing a regular expression yourself becomes much simpler, almost like saying out loud what you want to match.

?, *, +, \d, and \w are all shorthand forms:

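  • ? is equivalent to the quantifier {0,1}
  • * is equivalent to the quantifier {0,}
  • + is equivalent to the quantifier {1,}
  • \d is equivalent to [0-9]
  • \D is equivalent to [^0-9]
  • \w is equivalent to [A-Za-z0-9_]
  • \W is equivalent to [^A-Za-z0-9_]

Note: For Python 3 string patterns, the character-class equivalences hold exactly only in ASCII mode (the re.ASCII flag); by default \d and \w also match Unicode digits and letters.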
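These equivalences can be checked directly. The sketch below compares each shorthand with its expanded form on a made-up ASCII string, and then shows how \w behaves on a non-ASCII letter with and without the re.ASCII flag.

```python
import re

text = "order_42 costs 1999 yen"

# On ASCII text, each shorthand and its expanded form give the same result.
same_digits = re.findall(r"\d+", text) == re.findall(r"[0-9]+", text)
same_words = re.findall(r"\w+", text) == re.findall(r"[A-Za-z0-9_]+", text)
same_plus = re.findall(r"a+", "caaat") == re.findall(r"a{1,}", "caaat")
print(same_digits, same_words, same_plus)  # True True True

# By default, \w also matches non-ASCII letters such as "é".
unicode_default = re.findall(r"\w+", "café")
print(unicode_default)  # ['café']

# With re.ASCII, \w is limited to [A-Za-z0-9_].
ascii_only = re.findall(r"\w+", "café", flags=re.ASCII)
print(ascii_only)       # ['caf']
```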

5 greedy matching

Greedy and non-greedy matching describe how a regular expression chooses when several different matches are possible.

Greedy matching means matching as much text as possible, that is, preferring the longest match. For example, a.*b means "starts with a, ends with b, with any characters (including spaces) appearing zero or more times in between". Starting from the leftmost possible position, the engine takes as many characters as it can while still satisfying the pattern. For the string abbbcbbbd, the greedy result of this regular expression is abbbcbbb.

Correspondingly, non-greedy matching, also called lazy or minimal matching, means matching only the shortest text that satisfies the pattern. By default the engine is greedy; adding a ? after a quantifier makes that quantifier non-greedy. For example, a.*?b still means "starts with a, ends with b, with any characters zero or more times in between", but the added ? asks for the shortest possible match. For the string abbbcbbbd, the non-greedy result of this regular expression is ab.

  • Whether matching is greedy or non-greedy depends on whether a question mark follows the quantifier; choose according to actual need. Note that although non-greedy matching is often the safer choice when extracting delimited text, it can cost performance on long inputs, because the engine extends the match one character at a time and re-checks the rest of the pattern at each step.

Sample code :

import re

# Original string
str1 = "hello-world-and-hi"

# Greedy match: everything from the first hyphen to the last hyphen
result1_greedy = re.findall(r"-.*-", str1)
print(result1_greedy)    # Output: ['-world-and-']

# Non-greedy match: only the text from the first hyphen to the second hyphen
result1_non_greedy = re.findall(r"-\w+?-", str1)
print(result1_non_greedy)   # Output: ['-world-']

# Original string
s = "hello world, this is a test string."

# Greedy match: the longest text that starts with h and ends with a space
result_greedy = re.findall(r"h.* ", s)
print(result_greedy)  # ['hello world, this is a test ']

# Non-greedy match: the shortest texts that start with h and end with a space
result_non_greedy = re.findall(r"h.*? ", s)
print(result_non_greedy)  # ['hello ', 'his ']
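The abbbcbbbd example from the explanation can be checked the same way. The second half of this sketch applies the idea to a made-up HTML fragment, since extracting text between tags is a typical crawler task where the greedy form silently captures too much.

```python
import re

s = "abbbcbbbd"
greedy = re.search(r"a.*b", s).group()
lazy = re.search(r"a.*?b", s).group()
print(greedy)  # abbbcbbb
print(lazy)    # ab

# Hypothetical HTML fragment with two adjacent tags.
html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last </b>, so both tags end up in one match.
greedy_tags = re.findall(r"<b>(.*)</b>", html)
print(greedy_tags)  # ['one</b><b>two']

# Non-greedy: .*? stops at the first </b>, giving one match per tag.
lazy_tags = re.findall(r"<b>(.*?)</b>", html)
print(lazy_tags)    # ['one', 'two']
```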

Origin blog.csdn.net/z135733/article/details/131078962