[Python crawler] Regular expressions



When developing crawlers, you often need to extract useful information from a large block of text. Regular expressions are one of the ways to do that.


1. Regular Expression Basics

A regular expression is a string that describes a pattern in a piece of text. Python ships with a regular expression module, re, which lets you find, extract, and replace text that follows a pattern. In program development, regular expressions let a program pick out what it needs from a large block of text.

Using a regular expression takes the following steps:
(1) Find the pattern
(2) Express the pattern with regular expression symbols
(3) Extract the information



2. Basic symbols of regular expressions

2.1 Dot “.”

A dot stands for any single character except a newline, including but not limited to English letters, digits, Chinese characters, and English or Chinese punctuation marks.
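
As a minimal sketch of the dot, the snippet below matches an “a”, any one character, then a “c”. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text: the last candidate has a newline between "a" and "c".
sample = 'abc a1c a中c a\nc'

# "." stands for exactly one character, but never for a newline.
dot_matches = re.findall('a.c', sample)
print(dot_matches)  # ['abc', 'a1c', 'a中c']
```

The candidate with a newline in the middle is skipped, which is the one exception the dot has.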


2.2 Asterisk “*”

An asterisk matches the subexpression immediately before it (an ordinary character, another regular expression symbol, or a group of them) zero or more times.
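
A small sketch of the asterisk, using a made-up sample string: “b*” allows zero, one, or many “b” characters between “a” and “c”.

```python
import re

# Hypothetical sample text.
sample = 'ac abc abbbc adc'

# "b*" means: the letter b repeated zero or more times.
star_matches = re.findall('ab*c', sample)
print(star_matches)  # ['ac', 'abc', 'abbbc']
```

“adc” is not matched because “d” is not a repetition of “b”.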


2.3 Question mark “?”

A question mark matches the subexpression immediately before it 0 or 1 times. Note that it must be the English (half-width) question mark, not the full-width Chinese one.
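
To illustrate, here is a short sketch with a made-up sample string in which the letter “u” is optional.

```python
import re

# Hypothetical sample text.
sample = 'color colour colouur'

# "u?" means: the letter u zero times or one time, never more.
optional_matches = re.findall('colou?r', sample)
print(optional_matches)  # ['color', 'colour']
```

“colouur” is not matched because it contains two “u” characters in a row.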


2.4 Backslash “\”

A backslash cannot stand alone in a regular expression, or in an ordinary Python string for that matter. It has to be combined with another character, either to turn a special symbol into an ordinary one (for example, “\.” matches a literal dot) or to turn an ordinary character into a special one (for example, “\n” is a newline and “\d” is a digit).
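
The following sketch, with a made-up sample string, shows the first use of the backslash: turning the special dot into an ordinary dot. The patterns are written as raw strings (r'...') so that Python passes the backslash through to the re module unchanged.

```python
import re

# Hypothetical sample text.
sample = '3.14 3x14'

# Unescaped: "." is special and matches any character, so both items match.
unescaped = re.findall(r'3.14', sample)
# Escaped: "\." is an ordinary dot, so only the literal "3.14" matches.
escaped = re.findall(r'3\.14', sample)

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']
```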


2.5 Digit “\d”

In a regular expression, “\d” represents a single digit. Why the letter d? Because it is the first letter of the English word “digit”. Note that although “\d” is written as a backslash plus the letter d, it should be treated as one regular expression symbol.
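
A brief sketch of “\d” on a made-up sample string: one “\d” matches one digit, so two of them in a row match only where two digits are adjacent.

```python
import re

# Hypothetical sample text.
sample = 'a1b22'

single_digits = re.findall(r'\d', sample)   # every individual digit
digit_pairs = re.findall(r'\d\d', sample)   # two digits side by side

print(single_digits)  # ['1', '2', '2']
print(digit_pairs)    # ['22']
```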


2.6 Parentheses “()”

Parentheses extract the part of the match that falls inside them.
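
As a rough illustration (the sample string and the “id=” format are invented for this sketch), compare the same pattern with and without parentheses:

```python
import re

# Hypothetical sample text.
sample = 'id=42, id=7,'

# Without parentheses, findall returns everything the pattern matched.
whole = re.findall(r'id=\d*,', sample)
# With parentheses, findall returns only the part inside them.
inside = re.findall(r'id=(\d*),', sample)

print(whole)   # ['id=42,', 'id=7,']
print(inside)  # ['42', '7']
```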



3. Using regular expressions in Python

Python comes with a very capable regular expression module that makes it convenient to extract patterned information from a large block of text. The module is named “re”, short for “regular expression”. You need to import it before use. The import statement is:

import re  # If PyCharm reports an error, press Alt+Enter to import it automatically

The commonly used functions are introduced below:

3.1 findall

Python's regular expression module contains a findall function, which returns all strings that meet the requirements as a list:

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

pattern is the regular expression, string is the original text, and flags holds optional special-purpose flags.
findall returns a list containing every match. If nothing matches, it returns an empty list:

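import re

# Sample text, in English: "My computer password is:123456,my phone password is:888888,my front door password is:000000,don't forget!"
content = '我的电脑密码是:123456,我的手机密码是:888888,我的家门密码是:000000,勿忘!'

pwd_list = re.findall('是:(.*?),', content)           # text between "是:" (is:) and the next ","
machine_list = re.findall('我的(.*?)密码是:', content)  # what each password belongs to
name_list = re.findall('名字是(.*?),', content)        # "名字是" (name is) never occurs in content
print('All passwords: {}'.format(pwd_list))
print('Belonging to: {}'.format(machine_list))
print('User name: {}'.format(name_list))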

Because the text contains no “名字是” (“name is”), name_list is an empty list. Note also the “,勿忘!” (“, don't forget!”) at the end of content. The rule '是:(.*?),' only extracts a password that is strictly followed by a comma, so the key is the trailing comma: without “,勿忘!” the last password would have no comma after it, and one fewer password would be matched.
When you need only part of the matched text, enclose that part in parentheses so that irrelevant text is left out. If the pattern contains several “(.*?)” groups, findall still returns a list, but each element becomes a tuple with one item per group. In an account-and-password pattern, for example, the first item of each tuple is the account and the second is the password.
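
Both points can be sketched as follows. The sample strings and the “account: …, password: …;” format are assumptions made up for this illustration, not the author's original data.

```python
import re

# Hypothetical sample text; the last password has no comma after it.
no_tail = 'pc password is:123456,phone password is:888888,door password is:000000'
with_tail = no_tail + ',remember!'

# The pattern requires a comma after each password.
two_found = re.findall('is:(.*?),', no_tail)
three_found = re.findall('is:(.*?),', with_tail)
print(two_found)    # ['123456', '888888']
print(three_found)  # ['123456', '888888', '000000']

# Two groups in one pattern: each list element is a tuple.
accounts = 'account: alice, password: 123456; account: bob, password: 888888;'
pairs = re.findall('account: (.*?), password: (.*?);', accounts)
print(pairs)  # [('alice', '123456'), ('bob', '888888')]
```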

The function prototype also has a flags parameter. It can be omitted; when supplied, it enables auxiliary behavior such as ignoring letter case or letting “.” match newline characters.
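
Here is a minimal sketch of the newline case, assuming a made-up snippet of page text whose content spans two lines. Without re.S the dot cannot cross the newline, so nothing is found.

```python
import re

# Hypothetical page fragment with a newline inside the tag.
page = '<div>first line\nsecond line</div>'

without_flag = re.findall('<div>(.*?)</div>', page)
with_flag = re.findall('<div>(.*?)</div>', page, re.S)

print(without_flag)  # []
print(with_flag)     # ['first line\nsecond line']
```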
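Common flags:

re.I
    IGNORECASE
    Ignore letter case.

re.L
    LOCALE
    Makes \w, \W, \b and \B depend on the current locale setting.

re.M
    MULTILINE
    With this flag, “^” and “$” also match right after and right before each newline, in addition to the start and end of the whole string.

re.S
    DOTALL
    Makes the special character “.” match any character at all, including a newline; without this flag, “.” matches anything except a newline.

re.X
    VERBOSE
    When this flag is set, whitespace in the RE string is ignored unless it is inside a character class or preceded by a backslash.
    It also lets you write comments inside the RE, which the engine ignores;
    a comment starts with “#”, unless the “#” is inside a character class or preceded by a backslash.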
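
As a rough sketch of three of these flags in use (all sample strings, and the phone-number format, are invented for illustration):

```python
import re

# re.I: letter case is ignored.
case_hits = re.findall('python', 'Python PYTHON python', re.I)
print(case_hits)  # ['Python', 'PYTHON', 'python']

# re.M: "^" also matches at the start of every line.
lines = 'one two\nthree four'
first_only = re.findall(r'^\w+', lines)
each_line = re.findall(r'^\w+', lines, re.M)
print(first_only)  # ['one']
print(each_line)   # ['one', 'three']

# re.X: whitespace and "#" comments inside the pattern are ignored.
phone_pattern = r'''
    (\d+)   # area code (hypothetical format)
    -       # separator
    (\d+)   # local number
'''
phone_hits = re.findall(phone_pattern, 'call 010-12345 now', re.X)
print(phone_hits)  # [('010', '12345')]
```

Several flags can be combined with “|”, for example re.I | re.S.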

Reference: Python regular expression flags parameter



3.2 search

search() is used the same way as findall(), but it returns only the first match. Once it finds something that meets the requirements, it stops looking. This is especially useful when you need only the first hit in a very large text, since the rest of the text is not scanned.

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

If the match succeeds, the result is a match object, and you call its .group() method to get the matched text; if nothing matches, the result is None.

.group() with no argument (or with 0) returns the whole matched string. Only when the argument is 1 or greater is the content of a pair of parentheses returned: 1 reads the content of the first pair of parentheses, 2 the second, and so on. The argument cannot exceed the number of pairs of parentheses in the regular expression.
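
A short sketch of search() and .group(), using a made-up sample string with two parenthesized groups in the pattern:

```python
import re

# Hypothetical sample text.
content = 'pc password is:123456,phone password is:888888,'

match = re.search('(.*?) password is:(.*?),', content)
whole_match = match.group()    # same as match.group(0): the entire match
first_group = match.group(1)   # content of the first pair of parentheses
second_group = match.group(2)  # content of the second pair of parentheses
print(whole_match)   # pc password is:123456,
print(first_group)   # pc
print(second_group)  # 123456

# No match: search() returns None, so check before calling .group().
missing = re.search('name is:(.*?),', content)
print(missing)  # None
```

Only the first occurrence is reported; the phone password later in the text is never reached.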


3.3 Difference between ".*" and ".*?"

In crawler development, the three symbols “.*?” are mostly used together.

  • A dot matches any single non-newline character, and an asterisk matches whatever precedes it zero or more times. So “.*” matches a string of any length.
  • Other symbols must therefore be added before and after “.*” to limit the range; otherwise the match simply swallows the whole line of text.
  • What happens if you add a question mark after “.*”, making it “.*?”? Directly after “*”, the question mark no longer means “0 or 1 times”; it makes the asterisk match as few characters as possible. So “.*?” matches the shortest string that satisfies the requirements.

Applied to the password text from section 3.1, “(.*)” yields a list with only one element, which is a very long string.
“(.*?)” yields a list of 3 elements, each corresponding directly to one password in the original text.
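
The difference can be checked with a sketch like the one below. The sample string is an English stand-in written for this illustration.

```python
import re

# Hypothetical sample text with three passwords.
content = ('pc password is:123456,phone password is:888888,'
           'door password is:000000,remember!')

# Greedy: ".*" runs to the LAST comma it can reach.
greedy = re.findall('is:(.*),', content)
# Non-greedy: ".*?" stops at the FIRST comma after each "is:".
lazy = re.findall('is:(.*?),', content)

print(greedy)  # ['123456,phone password is:888888,door password is:000000']
print(lazy)    # ['123456', '888888', '000000']
```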

Summary:

  • “.*”: greedy mode, gets the longest string that satisfies the condition
  • “.*?”: non-greedy mode, gets the shortest string that satisfies the condition



4. Regular expression extraction skills

4.1 No need to use compile

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a Pattern object."
    return _compile(pattern, flags)

re.compile() simply calls _compile() internally; re.findall() also calls _compile() first and then calls findall() on the result. In other words, re.findall() already does the work of re.compile(), so there is no need to call re.compile() yourself.
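
As a quick check of this point, the sketch below runs the same pattern both ways on a made-up string. The results are identical, and since the re module also keeps a cache of recently compiled patterns, repeated calls to re.findall() with the same pattern do not recompile it each time.

```python
import re

# Hypothetical sample text.
text = 'a1b22c333'

# Module-level function: compiles (or fetches from cache) and searches.
direct = re.findall(r'\d\d*', text)

# Explicit compile first, then call findall on the Pattern object.
pattern = re.compile(r'\d\d*')
precompiled = pattern.findall(text)

print(direct)       # ['1', '22', '333']
print(precompiled)  # ['1', '22', '333']
```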



4.2 Match the big block first, then the small pieces

Invalid content and valid content sometimes follow the same pattern, which makes the two easy to mix up, as in the following text:

有效用户:
姓名: 张三
姓名: 李四
姓名: 王五
无效用户:
姓名: 不知名的小虾米
姓名: 隐身的张大侠

The names of both the valid users (有效用户) and the invalid users (无效用户) are preceded by “姓名: ” (“Name: ”). Matching with “姓名: (.*?)\n” mixes valid and invalid names together, making them hard to tell apart.
To solve this, match the big block first and then the small pieces: first match the valid-user section as a whole, and then match the names within that section.
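
One way to sketch this two-step approach on the sample text is shown below. The variable names are chosen for illustration, and the re.S flag is used so that “.*?” can span the line breaks inside the block.

```python
import re

big_text = '''有效用户:
姓名: 张三
姓名: 李四
姓名: 王五
无效用户:
姓名: 不知名的小虾米
姓名: 隐身的张大侠
'''

# Naive approach: valid and invalid names come back mixed together.
all_names = re.findall('姓名: (.*?)\n', big_text)
print(len(all_names))  # 5

# Step 1 (big): take only the block between the two headings.
valid_block = re.findall('有效用户(.*?)无效用户', big_text, re.S)
# Step 2 (small): extract the names from that block alone.
valid_names = re.findall('姓名: (.*?)\n', valid_block[0])
print(valid_names)  # ['张三', '李四', '王五']
```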


4.3 Inside and outside parentheses

In the examples above, parentheses and “.*?” always appear together, so some readers may think that parentheses can contain only these three symbols and no ordinary characters. In fact, ordinary characters may also appear inside the parentheses, and they then become part of the extracted result.
It is not hard to understand. Just remember: the whole pattern is used to find a match, and only the part inside the parentheses is extracted!
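
The effect of moving the parentheses can be sketched like this, with a made-up sample string. All three patterns find the same places in the text; only the extracted part differs.

```python
import re

# Hypothetical sample text.
content = 'pc password is:123456,phone password is:888888,'

# Only ".*?" inside the parentheses: just the password is extracted.
only_value = re.findall('password is:(.*?),', content)
# Ordinary characters inside the parentheses are extracted as well.
with_label = re.findall('(password is:.*?),', content)
# Moving the comma inside includes it in the result too.
with_comma = re.findall('(password is:.*?,)', content)

print(only_value)  # ['123456', '888888']
print(with_label)  # ['password is:123456', 'password is:888888']
print(with_comma)  # ['password is:123456,', 'password is:888888,']
```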




Origin blog.csdn.net/qq_45797116/article/details/123160308