In crawler development, you often need to extract useful information from a large block of text. Regular expressions are one way of extracting that information.
1. Regular Expression Basics
A regular expression (Regular Expression) is a string that describes a piece of information that follows a pattern. Python comes with a regular expression module, re, through which you can find, extract, and replace information that follows a pattern. In program development, regular expressions let a program find what it needs in a large block of text.
Using regular expressions involves the following steps:
(1) Find the pattern.
(2) Express the pattern with regular expression symbols.
(3) Extract the information.
2. Basic symbols of regular expressions
2.1 Dot “.”
A dot can stand for any single character except a newline, including but not limited to English letters, digits, Chinese characters, and English or Chinese punctuation marks.
2.2 Asterisk “*”
An asterisk means that the subexpression preceding it (an ordinary character, another regular expression symbol, or several of them grouped together) is repeated 0 to infinitely many times.
2.3 Question mark “?”
A question mark means that the subexpression preceding it appears 0 or 1 times. Note that this is the English (half-width) question mark, not the Chinese full-width one.
2.4 Backslash “\”
A backslash cannot be used on its own in a regular expression, or indeed anywhere in Python. It must be combined with another character, either to turn a special symbol into an ordinary one or to turn an ordinary character into a special one. For example, “\*” is a literal asterisk rather than a repetition symbol, while “\n” is a newline rather than the letter n.
2.5 Digit “\d”
In a regular expression, “\d” represents a single digit. Why the letter d? Because d is the first letter of the English word "digit". Although “\d” is written as a backslash followed by the letter d, it should be treated as one regular expression symbol.
2.6 Parentheses “()”
Parentheses extract the content matched inside them.
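The symbols in this section can be tried out in a few lines. This is only a minimal sketch: the sample strings are made up for illustration and do not come from the original examples.

```python
import re

# "." stands for any single character except a newline.
dot_result = re.findall('a.c', 'abc a-c a\nc')

# "*" repeats the preceding subexpression zero or more times.
star_result = re.findall('ab*', 'a ab abbb')

# "?" makes the preceding subexpression optional (0 or 1 times).
optional_result = re.findall('colou?r', 'color colour')

# "\d" matches one digit; "\." is a literal dot, not "any character".
digit_result = re.findall(r'\d\.\d', '3.5 3x5')

# "()" extracts only the part of the match inside the parentheses.
group_result = re.findall(r'id=(\d*)', 'id=42 id=7')

print(dot_result)       # ['abc', 'a-c']
print(star_result)      # ['a', 'ab', 'abbb']
print(optional_result)  # ['color', 'colour']
print(digit_result)     # ['3.5']
print(group_result)     # ['42', '7']
```

The patterns containing a backslash are written as raw strings (r'...') so that Python passes the backslash through to the re module unchanged.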
3. Using regular expressions in Python
Python comes with a very powerful regular expression module, which makes it convenient to extract patterned information from a large block of text. The module is named "re", short for "regular expression". You need to import it before using it. The import statement is:
import re  # If PyCharm reports an error, press Alt+Enter to import it automatically
Let's introduce the commonly used APIs:
3.1 findall
Python's regular expression module contains a findall function that returns all strings that meet the requirements as a list.
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
Here pattern is the regular expression, string is the original string, and flags holds flags that switch on special behavior.
The result of findall is a list containing all matches. If no match is found, an empty list is returned:
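# Sample text: "My computer password is: 123456, my phone password is: 888888,
# my front door password is: 000000, don't forget!"
content = '我的电脑密码是:123456,我的手机密码是:888888,我的家门密码是:000000,勿忘!'
pwd_list = re.findall('是:(.*?),', content)  # the passwords
machine_list = re.findall('我的(.*?)密码是:', content)  # what each password belongs to
name_list = re.findall('名字是(.*?),', content)  # "名字是" ("the name is") never occurs in the text
print('All passwords: {}'.format(pwd_list))
print('They belong to: {}'.format(machine_list))
print('User name: {}'.format(name_list))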
The empty list shows that nothing matched the name pattern. One more point deserves attention: the password rule is '是:(.*?),', so only text that strictly fits this format is extracted, and the trailing “,” is the key. The third password is extracted only because it is followed by “,勿忘!”; without that comma, the result would contain one password fewer.
When you need to extract part of the matched text, enclose that part in parentheses so that you do not pick up irrelevant information. If the pattern contains several “(.*?)” groups, the result is still a list, but its elements become tuples. For example, with one group for an account and one for its password, the first element of each tuple is the account and the second is the password.
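A minimal sketch of the multi-group case follows. The sample text and the account names in it are hypothetical, chosen only to show the shape of the result.

```python
import re

# Hypothetical sample text: each entry holds an account and a password.
text = 'account: alice, password: 123456; account: bob, password: 888888;'

# Two "(.*?)" groups, so every match becomes an (account, password) tuple.
pairs = re.findall('account: (.*?), password: (.*?);', text)

for account, password in pairs:
    print(account, password)
```

Because each element is a tuple, it can be unpacked directly in the for loop, as shown.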
The function prototype has a flags parameter. It can be omitted; when supplied, it provides auxiliary behavior such as ignoring letter case or ignoring line breaks (re.S, which lets “.” match newline characters as well).
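The effect of ignoring line breaks can be sketched as follows. The sample text is made up for illustration; only re.findall and the re.S flag are taken from the discussion above.

```python
import re

# Hypothetical sample text in which the wanted content spans two lines.
text = 'start: first line\nsecond line :end'

# Without a flag, "." stops at the newline, so nothing matches.
without_flag = re.findall('start: (.*?) :end', text)

# With re.S (DOTALL), "." also matches the newline.
with_flag = re.findall('start: (.*?) :end', text, re.S)

print(without_flag)  # []
print(with_flag)     # ['first line\nsecond line']
```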
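Common flags:
- re.I (IGNORECASE): ignores letter case.
- re.L (LOCALE): makes “\w”, “\W”, “\b”, and “\B” depend on the current locale.
- re.M (MULTILINE): with this flag, “^” and “$” match not only at the start and end of the string but also at the start and end of each line, that is, just after and just before each newline.
- re.S (DOTALL): makes “.” match any character at all, including a newline; without this flag, “.” matches any character except a newline.
- re.X (VERBOSE): whitespace in the pattern is ignored unless it is inside a character class or preceded by a backslash. It also lets you write comments in the pattern, which the engine ignores; a comment starts with “#”, unless the “#” is inside a character class or preceded by a backslash.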
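The other flags can be tried in the same way. The following is a small sketch with made-up sample strings, covering re.I, re.M, and re.X; re.L is left out because its effect depends on the locale of the machine.

```python
import re

# re.I: letter case is ignored.
case_result = re.findall('python', 'Python PYTHON python', re.I)

# re.M: "^" matches at the start of every line, not only the start of the string.
text = 'name: a\nname: b'
single_line = re.findall('^name: (.)', text)
multi_line = re.findall('^name: (.)', text, re.M)

# re.X: whitespace and "#" comments inside the pattern are ignored.
verbose_pattern = r'''
    (\d+)   # integer part
    \.      # literal dot
    (\d+)   # fractional part
'''
verbose_result = re.findall(verbose_pattern, 'pi is about 3.14', re.X)

print(case_result)     # ['Python', 'PYTHON', 'python']
print(single_line)     # ['a']
print(multi_line)      # ['a', 'b']
print(verbose_result)  # [('3', '14')]
```

Several flags can be combined with the “|” operator, for example re.I | re.S.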
Reference: Python regular expression flags parameter
3.2 search
search() is used in the same way as findall(), but it returns only the first string that meets the requirements. Once it finds a match, it stops looking. This is especially useful when you need only the first piece of data from a very large text, where it can greatly improve the program's running efficiency.
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
If the match succeeds, the result is a Match object. To get the matched text, call its .group() method; if nothing matches, the result is None.
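Both outcomes can be seen with the password text used earlier. This is a sketch; the two patterns are chosen here for illustration.

```python
import re

content = '我的电脑密码是:123456,我的手机密码是:888888,我的家门密码是:000000,勿忘!'

# search() stops at the first match and returns a Match object.
found = re.search('密码是:(.*?),', content)
if found is not None:
    print(found.group())  # the whole matched text: 密码是:123456,

# When nothing matches, search() returns None instead of a Match object.
missing = re.search('名字是(.*?),', content)
print(missing)  # None
```

Checking for None before calling .group() avoids an AttributeError when the pattern does not match.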
Only when the argument to .group() is 1 is the content of the parentheses in the regular expression returned; with no argument (or 0), the whole matched text is returned.
The argument to .group() cannot exceed the number of pairs of parentheses in the regular expression. An argument of 1 reads the content of the first pair of parentheses, an argument of 2 reads the second, and so on.
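A short sketch of the .group() arguments follows. The sample text and its account name are hypothetical.

```python
import re

# Hypothetical sample text with an account and a password.
text = 'account: alice, password: 123456;'
match = re.search('account: (.*?), password: (.*?);', text)

whole = match.group()      # same as group(0): the entire matched text
account = match.group(1)   # content of the first pair of parentheses
password = match.group(2)  # content of the second pair of parentheses
print(whole, account, password)

# The pattern has only two groups, so group(3) raises IndexError.
try:
    match.group(3)
    group_error = None
except IndexError as error:
    group_error = error
    print('group(3) failed:', error)
```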
3.3 Difference between ".*" and ".*?"
In crawler development, the three symbols “.*?” are mostly used together.
- A dot means any non-newline character, and an asterisk means the preceding subexpression is matched zero or more times, so “.*” matches a string of any length. Other symbols must therefore be added before and after “.*” to limit the range; otherwise the result is the entire original string.
- If you add a question mark after “.*”, it becomes “.*?”. What result does this give? Placed after an asterisk, the question mark no longer means "0 or 1 times"; it tells the asterisk to match as few characters as possible. So “.*?” matches the shortest string that satisfies the requirement.
With “(.*)”, the result is a list with only one element, which is a very long string.
With “(.*?)”, the result is a list of 3 elements, each corresponding directly to one password in the original text.
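The two results can be reproduced with the password text from section 3.1. This is a sketch; the exact patterns are an assumption, since the original comparison is described only in prose.

```python
import re

content = '我的电脑密码是:123456,我的手机密码是:888888,我的家门密码是:000000,勿忘!'

# Greedy: ".*" runs on to the LAST "," in the text.
greedy = re.findall('密码是:(.*),', content)

# Non-greedy: ".*?" stops at the FIRST "," after each "密码是:".
non_greedy = re.findall('密码是:(.*?),', content)

print(len(greedy), greedy)
print(len(non_greedy), non_greedy)
```

The greedy list holds a single string that starts at the first password and ends at the last one; the non-greedy list holds the three passwords separately.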
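Summary:
- ① “.*”: greedy mode, which matches the longest string that satisfies the condition.
- ② “.*?”: non-greedy mode, which matches the shortest string that satisfies the condition.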
4. Regular expression extraction techniques
4.1 No need to use compile
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a Pattern object."
    return _compile(pattern, flags)
When you call re.compile(), the module calls _compile() internally; when you call re.findall(), the module first calls _compile() itself and then calls findall() on the result. In other words, re.findall() already does what re.compile() does, so there is no need to call re.compile() separately.
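The equivalence can be checked directly. The following sketch reuses the password text from section 3.1 and compares the two call styles.

```python
import re

content = '我的电脑密码是:123456,我的手机密码是:888888,我的家门密码是:000000,勿忘!'
pattern = '密码是:(.*?),'

# Style 1: let re.findall() compile the pattern internally.
direct = re.findall(pattern, content)

# Style 2: compile explicitly, then call findall() on the Pattern object.
compiled = re.compile(pattern)
via_compile = compiled.findall(content)

print(direct == via_compile)  # True
```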
4.2 Match the large scope first, then the small
Valid and invalid content may follow the same pattern. In that case the two are easily mixed up, as in the following text:
有效用户:
姓名: 张三
姓名: 李四
姓名: 王五
无效用户:
姓名: 不知名的小虾米
姓名: 隐身的张大侠
The names of both valid users (有效用户) and invalid users (无效用户) are preceded by “姓名: ” ("Name: "). If “姓名: (.*?)\n” is used to match, valid and invalid information are mixed together and are hard to tell apart.
To solve this problem, match the large scope first and then the small: first match the valid-user section as a whole, and then match the names within that section.
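The two-step approach can be sketched as follows, using the user list above. The outer pattern, which cuts out everything between the two section headers, is an assumption made for this illustration.

```python
import re

text = '''有效用户:
姓名: 张三
姓名: 李四
姓名: 王五
无效用户:
姓名: 不知名的小虾米
姓名: 隐身的张大侠
'''

# Matching names directly mixes valid and invalid users together.
mixed = re.findall('姓名: (.*?)\n', text)

# Step 1 (large): cut out the block between the two headers.
# re.S lets "." cross line breaks.
valid_block = re.findall('有效用户(.*?)无效用户', text, re.S)

# Step 2 (small): match names inside that block only.
valid_names = re.findall('姓名: (.*?)\n', valid_block[0])

print(mixed)        # all five names
print(valid_names)  # only the three valid users
```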
4.3 Inside and outside parentheses
In the examples above, parentheses and “.*?” are always used together, so some readers may think that only these three symbols can appear inside parentheses and no ordinary characters. In fact, other characters can also appear inside the parentheses, and they affect the result: whatever is inside the parentheses is included in the extracted text.
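The difference can be seen by moving the same ordinary characters from outside the parentheses to inside them. This sketch reuses the password text from section 3.1.

```python
import re

content = '我的电脑密码是:123456,我的手机密码是:888888,我的家门密码是:000000,勿忘!'

# Ordinary characters outside the parentheses only locate the match.
outside = re.findall('密码是:(.*?),', content)

# The same characters inside the parentheses are extracted as well.
inside = re.findall('(密码是:.*?),', content)

print(outside)  # ['123456', '888888', '000000']
print(inside)   # ['密码是:123456', '密码是:888888', '密码是:000000']
```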
It is not hard to understand. Just remember: "search according to the matching rule; what is inside the parentheses is extracted"!