The content of this article is from "python crawler development".
1.1 Regular expressions
A regular expression is a string that describes a pattern in text. Python comes with a regular expression module, through which you can find, extract, and replace text that follows a pattern. Finding one particular person among 10,000 people is difficult, but finding a person with very distinctive features is easy. Suppose there is a person with green skin who is three meters tall. Even among 10,000 people, others can spot him at a glance. In regular expressions, this "finding" process is called "matching". In program development, regular expressions let a program pick out what it needs from a large piece of text. Using a regular expression involves the following steps.
(1) Find the pattern.
(2) Express the pattern with regular expression symbols.
(3) Extract the information.
1.2 Basic symbols of regular expressions
1.2.1 Dot " . "
A dot matches any single character except a line break, including but not limited to English letters, digits, Chinese characters, English punctuation, and Chinese punctuation.
1.2.2 Asterisk "*"
An asterisk means that the subexpression preceding it (an ordinary character, another regular expression symbol, or several of them) may appear any number of times, from 0 to infinitely many.
Note that the asterisk always refers to the expression immediately before it.
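The following sketch shows the asterisk applied to a single ordinary character. The pattern "go*d" and the candidate words are hypothetical examples, and re.fullmatch is used only so that each word is tested as a whole.

```python
import re

# "*" applies to the expression right before it: here, the letter "o".
# So "go*d" means: "g", then zero or more "o", then "d".
candidates = ["gd", "god", "good", "gooood", "gold"]

star_results = [bool(re.fullmatch("go*d", word)) for word in candidates]
print(star_results)  # [True, True, True, True, False]
```

"gold" fails because the "l" is neither an "o" nor the final "d".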
1.2.3 dot + asterisk " .* "
A dot means any non-newline character, and an asterisk means match the preceding expression zero or more times. So ".*" matches a string of any length (including the empty string) that contains no newline.
Placed between "such as" and "ha" in a pattern, it means that "any number of arbitrary characters other than newlines" may appear between those two strings.
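A minimal sketch of ".*" between two fixed strings, assuming the pattern is written as "such as.*ha". The sample texts are invented for illustration.

```python
import re

# ".*" may consume any number of non-newline characters, including none.
with_text = re.findall("such as.*ha", "such as laughing: ha")
with_nothing = re.findall("such as.*ha", "such asha")

# A newline between the two parts prevents a match, because "." stops at "\n".
across_lines = re.findall("such as.*ha", "such as\nha")

print(with_text)     # ['such as laughing: ha']
print(with_nothing)  # ['such asha']
print(across_lines)  # []
```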
1.2.4 Question mark " ? "
A question mark means that the subexpression preceding it appears either 0 times or 1 time. Note that this is the English (half-width) question mark "?", not the Chinese (full-width) one.
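To make "0 or 1 time" concrete, here is a short sketch with a hypothetical pattern "colou?r", in which the "u" is optional.

```python
import re

# "u?" means the letter "u" may appear once or not at all.
optional_matches = re.findall("colou?r", "color colour colouur")
print(optional_matches)  # ['color', 'colour']
```

"colouur" is not matched because it contains two "u" characters, one more than the question mark allows.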
1.2.5 dot + asterisk + question mark ".*?" (most common)
The dot, asterisk, and question mark are often used together as ".*?".
Note the difference between ".*?" and ".*":
".*?" matches the shortest string that satisfies the requirements, whereas ".*" matches the longest.
In short:
①".*": Greedy mode, get the longest string that satisfies the condition.
②".*?": Non-greedy mode, get the shortest string that satisfies the condition.
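The difference between the two modes is easiest to see by running both on the same text. The tag-like sample string below is a hypothetical example, not taken from the article.

```python
import re

# Hypothetical sample with two separate <a>...</a> pieces.
sample = "<a>first</a><a>second</a>"

# Greedy: ".*" runs as far as it can, so both pieces merge into one match.
greedy = re.findall("<a>.*</a>", sample)

# Non-greedy: ".*?" stops at the first "</a>" it can reach.
non_greedy = re.findall("<a>.*?</a>", sample)

print(greedy)      # ['<a>first</a><a>second</a>']
print(non_greedy)  # ['<a>first</a>', '<a>second</a>']
```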
1.2.6 Parentheses "()"
A part of the content is "extracted" from a string.
There is a string as follows:
It can be seen that the password here has an English colon on the left and a Chinese character "you" on the right. When constructing a regular expression: .*? you, the result will be:
However, the colon and the Chinese character "you" are not part of the password, if you only want "12345abcde", you need to use parentheses:
get:
1.2.7 Backslash " \ "
In regular expressions, many symbols have special meanings, such as question marks, asterisks, curly brackets, square brackets, and parentheses. Combined with another character, a backslash either turns a special symbol into an ordinary one (for example, "\?" matches a literal question mark) or turns an ordinary symbol into a special one (for example, "\d" matches a digit rather than the letter "d").
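Both directions of the backslash can be checked with a few one-line searches. The sample text is hypothetical. The r"..." raw-string prefix is an addition of this sketch: it keeps Python itself from interpreting the backslash before the re module sees it.

```python
import re

sample = "price? 3*4=12"  # hypothetical sample text

# Special -> ordinary: "\?" and "\*" match the literal characters.
literal_question = re.findall(r"\?", sample)
literal_star = re.findall(r"\*", sample)

# Ordinary -> special: "d" is just a letter, "\d" is any single digit.
plain_d = re.findall("d", "add 12")
digits = re.findall(r"\d", sample)

print(literal_question)  # ['?']
print(literal_star)      # ['*']
print(plain_d)           # ['d', 'd']
print(digits)            # ['3', '4', '1', '2']
```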
1.2.8 The number "\d"
Regular expressions use "\d" to represent a single digit.
If you want to extract two digits, you can use "\d\d"; if you want to extract three digits, you can use "\d\d\d". But what if you don't know how many digits the number has? In that case, add the * sign: "\d*" represents a number with any number of digits, so numbers of any length can all be represented by this one regular expression. Because * also allows zero digits, "\d+" is the safer choice when at least one digit is required.
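A short sketch of digit matching with fixed and variable lengths. The sample text and the word "order" are made up for illustration, and the "\d+" variant is an addition that is not discussed in the original text.

```python
import re

sample = "order 7, order 42, order 12345"  # hypothetical sample text

# Fixed length: exactly two digits at a time.
two_digits = re.findall(r"\d\d", sample)

# Variable length inside a larger pattern: "\d*" takes however many digits follow.
after_order = re.findall(r"order (\d*)", sample)

# "\d+" requires at least one digit, so it never returns empty strings.
one_or_more = re.findall(r"\d+", sample)

# On its own, "\d*" also matches "nothing", which produces empty strings.
star_alone = re.findall(r"\d*", "a1")

print(two_digits)   # ['42', '12', '34']
print(after_order)  # ['7', '42', '12345']
print(one_or_more)  # ['7', '42', '12345']
print(star_alone)
```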
1.3 Using regular expressions
Python's regular expression module is named "re", which is short for "regular expression". In Python, you need to import this module before using it. The import statement is:
import re
1.3.1 findall method
Python's regular expression module includes a findall method that returns all strings that satisfy the requirements in the form of a list.
The function prototype of findall is:
re.findall(pattern, string, flags=0)
pattern represents a regular expression, string represents the original string, and flags represents the flags of some special functions. The result of findall is a list of all matching results. If no matches are found, an empty list is returned.
When you need to extract something, enclose it in parentheses so that you won't get irrelevant information. What is returned if the pattern contains multiple "(.*?)" groups? As shown in Figure 3-2, the result is still a list, but its elements are now tuples. In that example, the first element of each tuple is the account number, and the second element is the password.
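The behavior with several pairs of parentheses can be sketched as follows. The account and password text is invented for this example and is not the content of Figure 3-2.

```python
import re

# Hypothetical sample text with two account/password pairs.
sample = "account:alice, password:12345; account:bob, password:abcde;"

# Two pairs of parentheses -> each list element is a 2-tuple.
pairs = re.findall("account:(.*?), password:(.*?);", sample)

# No match at all -> an empty list, not an error.
nothing = re.findall("token:(.*?);", sample)

print(pairs)    # [('alice', '12345'), ('bob', 'abcde')]
print(nothing)  # []
```

Each tuple can then be unpacked in a loop, for example `for account, password in pairs:`.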
The function prototype includes a flags parameter, which can be omitted. When supplied, it enables auxiliary features such as ignoring case or letting the dot match newlines.
Take newlines as an example: a dot normally does not match a line break, so to match across lines you need the "re.S" flag.
Although the "\n" symbol then appears in the matched result, that is better than getting no result at all. The newline characters can be replaced later, when cleaning the data.
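A minimal sketch of the flags parameter, using a hypothetical two-line sample. It shows re.S for newlines, followed by re.I for ignoring case.

```python
import re

# Hypothetical sample whose content spans two lines.
sample = "<p>first line\nsecond line</p>"

# Without re.S the dot stops at "\n", so nothing is found.
without_flag = re.findall("<p>(.*?)</p>", sample)

# With re.S the dot also matches "\n".
with_flag = re.findall("<p>(.*?)</p>", sample, re.S)

# Cleaning step: replace the newline afterwards.
cleaned = [item.replace("\n", " ") for item in with_flag]

# re.I makes the match case-insensitive.
any_case = re.findall("python", "Python PYTHON python", re.I)

print(without_flag)  # []
print(with_flag)     # ['first line\nsecond line']
print(cleaned)       # ['first line second line']
print(any_case)      # ['Python', 'PYTHON', 'python']
```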
1.3.2 The search method
search() is used in the same way as findall(), but search() only returns the first result that satisfies the requirements. Once it finds a match, it stops looking. This is especially useful when only the first piece of data is needed from a very large text, since it avoids scanning the rest of the text and can improve the running efficiency of the program.
The function prototype of search() is:
re.search(pattern, string, flags=0)
If the match succeeds, the result is a match object; if nothing is matched, the result is None.
To get the matched text, call the .group() method on the result.
Called with no argument (or with 0), .group() returns the entire match. Only when the argument is 1 does it return just the content of the first pair of parentheses in the regular expression.
The argument of .group() cannot exceed the number of pairs of parentheses in the regular expression. An argument of 1 reads the content of the first pair, an argument of 2 reads the content of the second pair, and so on.
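The following sketch walks through search() and .group() on a hypothetical account/password text. All sample names are assumptions made for illustration.

```python
import re

# Hypothetical sample text; search() only looks at the first match.
sample = "account:alice, password:12345; account:bob, password:abcde;"

match = re.search("account:(.*?), password:(.*?);", sample)

whole = match.group()    # the entire matched text
first = match.group(1)   # content of the first pair of parentheses
second = match.group(2)  # content of the second pair of parentheses

# Asking for a group that does not exist raises IndexError.
too_many = False
try:
    match.group(3)
except IndexError:
    too_many = True

# When nothing matches, search() returns None, so check before calling .group().
missing = re.search("token:(.*?);", sample)

print(whole)     # account:alice, password:12345;
print(first)     # alice
print(second)    # 12345
print(too_many)  # True
print(missing)   # None
```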
1.3.3 compile method
re.findall() compiles the pattern internally, just as re.compile() would, so there is usually no need to call re.compile() yourself.
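For comparison, the sketch below runs the same hypothetical pattern both ways. The two forms return the same result; the explicit compile step only gives the pattern a reusable name.

```python
import re

sample = "id 7, id 42"  # hypothetical sample text

# Direct call: the module compiles the pattern for you.
direct = re.findall(r"\d+", sample)

# Explicit compile: the compiled pattern object has its own findall method.
number_pattern = re.compile(r"\d+")
via_compiled = number_pattern.findall(sample)

print(direct)        # ['7', '42']
print(via_compiled)  # ['7', '42']
```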
1.4 Extraction skills of regular expressions
1.4.1 Large pieces first, then small ones: secondary extraction
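The heading names the technique without showing it, so the sketch below is one possible reading: first extract the large block that contains the data, then extract the small items from that block only. The "[orders]" markers and the sample text are assumptions made for illustration.

```python
import re

# Hypothetical text: the wanted ids sit inside an [orders] block,
# surrounded by similar-looking lines that should be ignored.
sample = "unrelated: 111\n[orders]\nid: 7\nid: 42\n[/orders]\nunrelated: 999\n"

# Extracting in one step picks up the unrelated numbers too.
one_step = re.findall(r": (\d+)", sample)

# Step 1 (large): cut out the block. re.S lets "." cross the line breaks.
block = re.search(r"\[orders\](.*?)\[/orders\]", sample, re.S).group(1)

# Step 2 (small): extract the items from the block only.
two_step = re.findall(r"id: (\d+)", block)

print(one_step)  # ['111', '7', '42', '999']
print(two_step)  # ['7', '42']
```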
1.4.2 Inside and outside the parentheses
There can be other characters in parentheses.
The specific impact is shown in the figure below.
If there are other ordinary characters inside the parentheses, these ordinary characters will also appear in the result.
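The effect can be sketched by moving the same ordinary characters in and out of the parentheses. The sample text is hypothetical.

```python
import re

sample = "account:alice, password:12345;"  # hypothetical sample text

# "password:" outside the parentheses: it guides the match but is not returned.
outside = re.findall("password:(.*?);", sample)

# "password:" inside the parentheses: it becomes part of the result.
inside = re.findall("(password:.*?);", sample)

print(outside)  # ['12345']
print(inside)   # ['password:12345']
```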