Crawlers from Getting Started to Jail (1): Regular Expressions

The content of this article is taken from "Python Crawler Development".

1.1 Regular expressions

A regular expression is a string that describes a pattern in text. Python ships with a regular expression module that lets you find, extract, and replace text that follows a pattern. Finding one ordinary person among 10,000 people is hard, but finding a person with very distinctive features is easy. Suppose someone has green skin and is three meters tall: even in a crowd of 10,000, others can spot him at a glance. In regular expressions, this process of "finding" is called "matching". In program development, regular expressions let a program find what it needs in a large piece of text. Using regular expressions involves the following steps.
(1) Find the pattern.
(2) Express the pattern with regular expression symbols.
(3) Extract the information.

1.2 Basic symbols of regular expressions

1.2.1 Dot " . "

A dot can stand for any single character except a line break, including but not limited to English letters, digits, Chinese characters, English punctuation, and Chinese punctuation.
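A minimal sketch of the dot, using Python's re module (introduced in section 1.3). The sample text is made up for illustration.

```python
import re

# Hypothetical sample text. "a.c" means: "a", any one non-newline character, "c".
text = "abc a-c a9c a\nc"

dot_matches = re.findall("a.c", text)
print(dot_matches)  # ['abc', 'a-c', 'a9c']; "a\nc" is skipped because "." does not match "\n"
```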

1.2.2 Asterisk "*"

An asterisk means that the subexpression immediately before it (an ordinary character, another regular expression symbol, or a group of several symbols) is matched zero or more times, with no upper limit.

The asterisk does not match anything by itself: it only repeats the expression immediately before it.
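As a small illustration, the sketch below checks a few made-up strings against the pattern "ab*c". The strings and the use of re.fullmatch are choices made for this example only.

```python
import re

# "ab*c" means: "a", then "b" repeated zero or more times, then "c".
samples = ["ac", "abc", "abbbbc", "adc"]

# re.fullmatch succeeds only if the whole string fits the pattern.
star_results = [bool(re.fullmatch("ab*c", s)) for s in samples]
print(star_results)  # [True, True, True, False]
```

The last string fails because "d" is not "b", and the asterisk only repeats the character before it.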

1.2.3 Dot + asterisk ".*"

A dot matches any single non-newline character, and an asterisk matches the preceding subexpression zero or more times. So ".*" matches a string of any length, including the empty string, as long as it contains no line break.
For example, when ".*" is placed between "such as" and "ha", it means that any number of arbitrary non-newline characters may appear between "such as" and "ha".
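The behavior can be sketched with the re module as follows. The three sample strings are invented for this example.

```python
import re

# Hypothetical inputs: text on one line, nothing in between, and a line break in between.
same_line = "such as 123, xyz ... ha"
nothing_between = "such asha"
split_lines = "such as\nha"

pattern = "such as.*ha"
on_one_line = re.findall(pattern, same_line)
empty_middle = re.findall(pattern, nothing_between)
across_lines = re.findall(pattern, split_lines)

print(on_one_line)   # ['such as 123, xyz ... ha']
print(empty_middle)  # ['such asha']  (".*" may match zero characters)
print(across_lines)  # []             (".*" does not cross a line break)
```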

1.2.4 Question mark " ? "

A question mark means that the subexpression before it is matched either 0 or 1 time. Note that this is the English (half-width) question mark "?", not the Chinese full-width one.
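A short sketch with a made-up example: in "colou?r" the question mark makes the "u" optional.

```python
import re

# Hypothetical sample text with zero, one, and two "u" characters.
text = "color colour colouur"

optional_u = re.findall("colou?r", text)
print(optional_u)  # ['color', 'colour']; "colouur" has two "u"s, so it does not match
```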

1.2.5 Dot + asterisk + question mark ".*?" (most common)

The dot, asterisk, and question mark are often combined as ".*?".
Note the difference between ".*?" and ".*": ".*?" matches the shortest string that satisfies the requirements. In one sentence each:
① ".*": greedy mode, which gets the longest string that satisfies the condition.
② ".*?": non-greedy mode, which gets the shortest string that satisfies the condition.
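The difference is easiest to see side by side. The following is a minimal sketch on a made-up HTML-like string.

```python
import re

# Hypothetical sample: two adjacent bold tags.
text = "<b>one</b><b>two</b>"

# Greedy: ".*" grabs as much as possible, so both tags end up in one match.
greedy = re.findall("<b>.*</b>", text)
# Non-greedy: ".*?" stops at the first "</b>" it can reach.
lazy = re.findall("<b>.*?</b>", text)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```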

1.2.6 Parentheses "()"

A part of the content is "extracted" from a string.
There is a string as follows:
Please add image description
It can be seen that the password here has an English colon on the left and a Chinese character "you" on the right. When constructing a regular expression: .*? you, the result will be:
Please add image description
However, the colon and the Chinese character "you" are not part of the password, if you only want "12345abcde", you need to use parentheses:
Please add image description
get:
Please add image description

1.2.7 Backslash " \ "

In regular expressions, many symbols have special meanings, such as the question mark, asterisk, curly brackets, square brackets, and parentheses. A backslash is used together with another character, either to turn a special symbol into an ordinary one or to turn an ordinary character into a special one. For example, "\*" matches a literal asterisk, while "\n" stands for a line break rather than the letter "n".
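A small sketch of both directions, on a made-up string. It uses "\d" (a digit, covered in the next section) and Python raw strings (the r"..." prefix), which are additions for this example so that the backslashes reach the re module unchanged.

```python
import re

# Hypothetical sample text containing a literal asterisk.
text = "2*3=6 and 2+3=5"

# "\*" is an ordinary asterisk; "\d" turns the ordinary letter "d" into "any digit".
escaped = re.findall(r"\d\*\d", text)
# Without the backslash, "2*3" means: "2" zero or more times, then "3".
unescaped = re.findall("2*3", text)

print(escaped)    # ['2*3']
print(unescaped)  # ['3', '3']
```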

1.2.8 The number "\d"

Regular expressions use "\d" to represent a single digit.
To extract a two-digit number, use "\d\d"; to extract a three-digit number, use "\d\d\d". But what if you do not know how many digits the number has? In that case, combine "\d" with the asterisk: "\d*" represents a number with any number of digits.
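As an illustration, the sketch below applies "\d\d" and "\d*" to made-up strings. The raw-string prefix r"..." is an addition for this example.

```python
import re

# Hypothetical sample text with numbers of different lengths.
text = "order 7, order 42, order 2023"

# "\d*" after "order " accepts a number with any number of digits.
orders = re.findall(r"order \d*", text)
# "\d\d" always takes exactly two digits at a time.
two_digits = re.findall(r"\d\d", "42 2023 7")

print(orders)      # ['order 7', 'order 42', 'order 2023']
print(two_digits)  # ['42', '20', '23']
```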

1.3 Using regular expressions

Python's regular expression module is named "re", short for "regular expression". You need to import the module before using it. The import statement is:

import re

1.3.1 findall method

Python's regular expression module provides a findall method that returns all strings that satisfy the requirements as a list.
The function prototype of findall is:

re.findall(pattern, string, flags=0)

pattern is the regular expression, string is the original string, and flags enables some special behavior. The result of findall is a list of all matches. If nothing matches, an empty list is returned.
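A minimal sketch of findall, with a match and a no-match case. The sample text and the "token" pattern are hypothetical.

```python
import re

# Hypothetical sample text with two passwords, each followed by a comma.
text = "email password:1234567, bank password:888888, "

found = re.findall("password:(.*?),", text)
not_found = re.findall("token:(.*?),", text)

print(found)      # ['1234567', '888888']
print(not_found)  # []
```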

When you need to extract something, enclose it in parentheses so that you do not get irrelevant information. What is returned if the pattern contains several "(.*?)" groups? As shown in Figure 3-2, the result is still a list, but its elements become tuples. In that example, the first element of each tuple is the account number and the second is the password.
[Figure 3-2: findall with several "(.*?)" groups returns a list of tuples]
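The tuple behavior can be sketched like this. The account and password values are invented and are not the ones from the original figure.

```python
import re

# Hypothetical sample text: account/password pairs separated by semicolons.
text = "account:alice,password:111;account:bob,password:222;"

# Two groups in the pattern, so each list element is a 2-tuple.
pairs = re.findall("account:(.*?),password:(.*?);", text)
print(pairs)  # [('alice', '111'), ('bob', '222')]

for account, password in pairs:
    print(account, password)
```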

The function prototype has a flags parameter, which can be omitted. When supplied, it provides auxiliary behavior such as ignoring case or letting a match span line breaks.
To match across line breaks, use the "re.S" flag, which makes the dot match newline characters as well.
Although "\n" then appears in the matched result, that is better than getting nothing. The newline characters can be replaced later when cleaning the data.
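A short sketch of the re.S flag on a made-up two-line string, including the later clean-up of the newline:

```python
import re

# Hypothetical sample: the content of the tag spans two lines.
text = "<div>line one\nline two</div>"

without_flag = re.findall("<div>(.*?)</div>", text)
with_flag = re.findall("<div>(.*?)</div>", text, re.S)

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']

# Data cleaning step: replace the newline afterwards.
cleaned = with_flag[0].replace("\n", " ")
print(cleaned)       # line one line two
```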

1.3.2 The search method

search() is used in the same way as findall(), but it only returns the first match. Once it finds something that satisfies the requirements, it stops looking. This is especially useful when you need only the first result from a very large text, because it avoids scanning the rest of the text.

The function prototype of search() is:

re.search(pattern, string, flags=0)

If the match succeeds, the result is a match object; if nothing matches, the result is None.

To get the matched text, call the .group() method on the result.
.group() with no argument (or with 0) returns the entire matched string. Only when the argument is 1 or greater does it return the content captured by a pair of parentheses in the regular expression.

The argument of .group() cannot exceed the number of parenthesized groups in the regular expression. An argument of 1 reads the content of the first pair of parentheses, an argument of 2 reads the content of the second pair, and so on.
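The following sketch shows search() and .group() together. The sample text and the "token" pattern are hypothetical.

```python
import re

# Hypothetical sample text with two account/password pairs.
text = "account:alice,password:111;account:bob,password:222;"

# search() stops at the first match and returns a match object.
first = re.search("account:(.*?),password:(.*?);", text)
if first is not None:
    print(first.group())   # account:alice,password:111;
    print(first.group(1))  # alice
    print(first.group(2))  # 111

# No match: the result is None, so check before calling .group().
missing = re.search("token:(.*?);", text)
print(missing)  # None

# The pattern has only two groups, so asking for a third raises IndexError.
try:
    first.group(3)
except IndexError as exc:
    group_error = exc
    print("no third group:", exc)
```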

1.3.3 compile method

re.findall() compiles the pattern internally, in the same way re.compile() would, so there is usually no need to call re.compile() yourself.
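For comparison, here is a minimal sketch showing that the two styles give the same result on a made-up string:

```python
import re

# Hypothetical sample text.
text = "email password:1234567, bank password:888888, "

# Explicit compilation, then calling findall on the compiled pattern.
compiled = re.compile("password:(.*?),")
via_compiled = compiled.findall(text)

# Module-level call, which compiles the pattern internally.
via_module = re.findall("password:(.*?),", text)

print(via_compiled == via_module)  # True
```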

1.4 Extraction skills of regular expressions

1.4.1 Grasp the big first and then the small: secondary extraction

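First extract the larger block that contains the target content, then run a second regular expression on that block to extract the individual items.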
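One way to sketch secondary extraction is shown below. The HTML fragment, class names, and field layout are all invented for illustration. The point is that matching directly on the whole text also picks up an unwanted entry, while extracting the block first avoids it.

```python
import re

# Hypothetical page: an ad block and a comments block share the same line format.
html = """
<div class="ad">user: spam, comment: buy now;</div>
<div class="comments">
user: alice, comment: nice post;
user: bob, comment: thanks;
</div>
"""

item_pattern = "user: (.*?), comment: (.*?);"

# Direct extraction on the whole text also returns the ad.
direct = re.findall(item_pattern, html)

# Step 1 (big): grab only the comments block; re.S lets ".*?" span line breaks.
block = re.findall('<div class="comments">(.*?)</div>', html, re.S)[0]
# Step 2 (small): extract the individual items from that block.
comments = re.findall(item_pattern, block)

print(direct)    # [('spam', 'buy now'), ('alice', 'nice post'), ('bob', 'thanks')]
print(comments)  # [('alice', 'nice post'), ('bob', 'thanks')]
```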

1.4.2 Inside and outside the parentheses

Other characters can appear inside the parentheses alongside ".*?".
If ordinary characters are placed inside the parentheses, those characters appear in the extracted result; if they are placed outside, they only help locate the match and are not returned.
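A small sketch of the effect, moving the ordinary characters step by step into the parentheses. The sample string is hypothetical.

```python
import re

# Hypothetical sample text.
text = "account:alice,password:111;"

# Ordinary characters outside the parentheses: only the password is returned.
outside = re.findall("password:(.*?);", text)
# "password:" moved inside: it becomes part of the result.
prefix_inside = re.findall("(password:.*?);", text)
# The semicolon moved inside as well.
all_inside = re.findall("(password:.*?;)", text)

print(outside)        # ['111']
print(prefix_inside)  # ['password:111']
print(all_inside)     # ['password:111;']
```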

Origin blog.csdn.net/weixin_55159605/article/details/124085670