Python Crawler Selection, Episode 03 (the re regular expression module explained in simple terms)

Article Directory

Python Crawler Selection, Episode 03 (the re regular expression module explained in simple terms)
One. Regular expressions (basic introduction)
Two. Metacharacter usage
Three. Matching rules
Four. Summary

Chasing dreams requires passion and ideals, and realizing dreams requires struggle and dedication

One. Regular expressions (basic introduction)

Purpose
1. Processing text data.
2. Searching, locating, and extracting text content are logically complex tasks.
3. Regular expressions were created to solve these problems quickly and conveniently.
Definition
A regular expression is an advanced pattern for matching text. In essence it is a string made up of ordinary characters and special symbols.
Principle
Ordinary characters are combined with characters that carry special meaning to describe rules about strings, such as repetition and position. The resulting pattern expresses a particular class of strings, which can then be matched against text.
Goals

Become familiar with regular expression metacharacters
Be able to read common regular expressions and write simple ones
Be able to use the re module to work with regular expressions

Official documentation for the re module:
https://docs.python.org/zh-cn/3.8/library/re.html
Source code of the re module:
https://github.com/python/cpython/blob/3.8/Lib/re.py

Two. Metacharacter usage

To run the following examples in Python or PyCharm, import the re module first.

Template example:

import re
print(re.findall('ab',"abcdefabcd"))
# ['ab', 'ab']

2.1 Ordinary characters:

Matching rule: each ordinary character matches itself
Example: re.findall('ab',"abcdefabcd")
# ['ab', 'ab']

Note: regular expressions in Python can also match Chinese characters.

2.2 Metacharacter: | (alternation)

Matching rule: matches the regular expression on either side of |
Example: re.findall('com|cn',"www.baidu.com/www.jingdong.cn")
# ['com', 'cn']

2.3 Metacharacter: . (matches any single character)

Matching rule: matches any one character except a newline
Example: re.findall('钱天.',"钱天二,钱天三,钱天四")
# ['钱天二', '钱天三', '钱天四']

2.4 Metacharacters: [Character Set]

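Matching rule: matches any one character in the character set
Forms:
    [abc#!好] matches any one of the characters inside []
    [0-9],[a-z],[A-Z] match any one character in the given range
    [_#?0-9a-z] mixes both forms; ranges are usually written last
Example: re.findall('[aeiou]',"How are you!")
# ['o', 'a', 'e', 'o', 'u']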

2.5 Metacharacters: [^character set] (negated character set)

Matching rule: matches any one character that is not in the character set
Example: re.findall('[^0-9]',"Use 007 port")
# ['U', 's', 'e', ' ', ' ', 'p', 'o', 'r', 't']

2.6 Metacharacters: ^ \A

Matching rule: matches the start position of the string
Example: re.findall('^Jame',"Jame,hello")
# ['Jame']

2.7 Metacharacters: $ \Z

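Matching rule: matches the end position of the target string
Example: re.findall('Jame$',"Hi,Jame")
# ['Jame']

Tip: ^ and $ normally appear at the beginning and end of a regular expression respectively. When both are present, the part between them must match the entire target string.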
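To illustrate the tip above, here is a minimal sketch that checks whether a whole string is an 11-digit number starting with 1. The helper name `is_mobile` and the sample inputs are made up for illustration; `re.fullmatch` is shown as an alternative that anchors both ends implicitly.

```python
import re

def is_mobile(text):
    """Hypothetical helper: True if text is exactly 1 followed by ten digits."""
    return re.search(r'^1[0-9]{10}$', text) is not None

print(is_mobile("13886495728"))        # True
print(is_mobile("Jame:13886495728"))   # False, extra text before the number
print(is_mobile("138864957280"))       # False, one digit too many

# re.fullmatch anchors both ends implicitly, so ^ and $ are not needed
print(re.fullmatch(r'1[0-9]{10}', "13886495728") is not None)   # True
```

One detail worth knowing: `$` also matches just before a trailing newline, whereas `\Z` matches only at the very end of the string.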

2.8 Metacharacters: *

Matching rule: matches zero or more occurrences of the preceding character
Example: re.findall('ha*',"haaaaaa~~~~h!")
# ['haaaaaa', 'h']

2.9 Metacharacter: +

Matching rule: matches one or more occurrences of the preceding character
Example: re.findall('[A-Z][a-z]+',"Hello World")
# ['Hello', 'World']

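2.10 Metacharacter: ?

Matching rule: matches zero or one occurrence of the preceding character
Example (matching integers with an optional minus sign): re.findall('-?[0-9]+',"Jame,age:18, -26")
# ['18', '-26']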

2.11 Metacharacters: {n}

Matching rule: matches exactly n occurrences of the preceding character
Example: re.findall('1[0-9]{10}',"Jame:13886495728")
# ['13886495728']

2.12 Metacharacters: {m,n}

Matching rule: matches between m and n occurrences of the preceding character
Example (matching a QQ number): re.findall('[1-9][0-9]{5,10}',"QQ:1259296994")
# ['1259296994']

2.13 Metacharacters: \d \D

Matching rule: \d matches any digit character, \D matches any non-digit character
Example (matching port numbers): re.findall(r'\d{1,5}',"Mysql: 3306, http:80")
# ['3306', '80']

2.14 Metacharacters: \w \W

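Matching rule: \w matches word characters, \W matches non-word characters
Note: word characters are digits, letters, the underscore, and Chinese characters.
Example: re.findall(r'\w+',"server_port = 8888")
# ['server_port', '8888']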
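The note that \w also covers Chinese characters applies to ordinary (Unicode) string patterns. As a small sketch, the `re.ASCII` flag restricts \w to ASCII letters, digits, and the underscore; the sample text below is made up for illustration.

```python
import re

text = "用户名 = user_01"   # made-up sample: a Chinese word and an ASCII identifier

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)   # ['用户名', 'user_01']
print(ascii_words)     # ['user_01']
```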

2.15 Metacharacters: \s \S

Matching rule: \s matches whitespace characters, \S matches non-whitespace characters
Note: whitespace characters are the space, \r \n \t \v \f
Example: re.findall(r'\w+\s+\w+',"hello    world")
# ['hello    world']

2.16 Metacharacters: \b \B

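Matching rule: \b matches a word boundary, \B matches a non-word boundary
Note: a word boundary is the position between a word character (digit, letter, Chinese character, underscore) and any other character, or the start or end of the string.
Example: re.findall(r'\bis\b',"This is a test.")
# ['is']

Note: when a metacharacter sequence collides with a Python string escape (for example \b, which Python reads as a backspace), declare the pattern as a raw string with the r prefix. If you are not sure which sequences are Python string escapes, simply put r in front of every regular expression.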
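The effect of the r prefix can be seen with a short sketch. It reuses the sentence from the \b example; the variable names are chosen here for illustration only.

```python
import re

text = "This is a test."

# Without r, Python converts '\b' to a backspace character before re sees it
print(len('\b'), len(r'\b'))   # 1 2

plain_result = re.findall('\bis\b', text)    # pattern is: backspace, "is", backspace
raw_result = re.findall(r'\bis\b', text)     # pattern is: word boundary, "is", word boundary

print(plain_result)   # []
print(raw_result)     # ['is']
```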

Three. Matching rules

3.1 Special character matching

Purpose: when the target string contains characters that are regular expression metacharacters, those metacharacters must be escaped in the pattern so that they stand for the literal character.
```
Special characters: . * + ? ^ $ [] () {} | \
```
Method: put \ in front of the metacharacter. This removes its special meaning, so it matches the character itself.

e.g. to match the special character . literally, use \.
In : re.findall(r'-?\d+\.?\d*',"123,-123,1.23,-1.23")
Out: ['123', '-123', '1.23', '-1.23']
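When the text to search for comes from a variable, escaping by hand is error-prone. A minimal sketch, assuming the standard library helper `re.escape` is acceptable for the task; the keyword and sample text are made up for illustration.

```python
import re

keyword = "3.14"                  # contains the metacharacter .
text = "pi is 3.14, not 3514"

unescaped = re.findall(keyword, text)            # . matches any character
escaped_pattern = re.escape(keyword)             # backslash added before .
escaped = re.findall(escaped_pattern, text)      # . matches only a literal dot

print(unescaped)         # ['3.14', '3514']
print(escaped_pattern)   # 3\.14
print(escaped)           # ['3.14']
```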

3.2 Greedy mode and non-greedy mode

Definition

Greedy mode: by default, the repetition metacharacters match as much of the following content as possible. These are: * + ? {m,n}

Non-greedy mode (lazy mode): the repetition metacharacters match as little of the following content as possible.

Converting greedy mode to non-greedy mode
Greedy mode takes the maximum number of repetitions by default; non-greedy mode takes the minimum.

To switch, add a ? after the repetition metacharacter:

*  ->  *?
+  ->  +?
?  ->  ??
{m,n} -> {m,n}?

e.g.
In : re.findall(r'\(.+?\)',"(abcd)efgh(higk)")
Out: ['(abcd)', '(higk)']

Demo

import re
# Greedy match: ['《java入门到放弃》，派神:《python入门到放弃》，前端:《Html直接放弃,are you ok》']
print(re.findall('《.+》', "抓娃：《java入门到放弃》，派神:《python入门到放弃》，前端:《Html直接放弃,are you ok》"))
# Non-greedy match: ['《java入门到放弃》', '《python入门到放弃》', '《Html直接放弃,are you ok》']
print(re.findall('《.+?》', "抓娃：《java入门到放弃》，派神:《python入门到放弃》，前端:《Html直接放弃,are you ok》"))
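The same switch works for the other repetition metacharacters. The sketch below compares `*` with `*?` and `{m,n}` with `{m,n}?`; the sample strings are made up for illustration.

```python
import re

html = "<b>bold</b>"
greedy_tags = re.findall(r'<.*>', html)     # runs to the last >
lazy_tags = re.findall(r'<.*?>', html)      # stops at the first >
print(greedy_tags)   # ['<b>bold</b>']
print(lazy_tags)     # ['<b>', '</b>']

digits = "123456"
greedy_runs = re.findall(r'\d{2,4}', digits)    # takes up to 4 digits at a time
lazy_runs = re.findall(r'\d{2,4}?', digits)     # takes only the minimum of 2
print(greedy_runs)   # ['1234', '56']
print(lazy_runs)     # ['12', '34', '56']
```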

3.3 Regular expression grouping

Definition:

In a regular expression, () creates an internal group. A subgroup is part of the regular expression and can be operated on as a single unit.
Uses:

(1) A group lets a metacharacter operate on the group as a whole instead of on a single character.

# (1) Change what + repeats
re.search(r'(ab)+',"ababababab").group()
# 'ababababab'
# (2) Change what | applies to
re.search(r'(王|李)\w{1,3}',"王者荣耀").group()
# '王者荣耀'

(2) The part of the matched content that corresponds to a subgroup can be retrieved through the interfaces the programming language provides.

# Get the protocol type of a URL
re.search(r'(https|http|ftp|file)://\S+',"https://www.baidu.com").group(1)
# 'https'
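As a small extension of the URL example, the sketch below adds a second group (an assumption, not part of the original pattern) to show `group(0)`, `groups()`, and how `re.findall` returns group contents when the pattern has groups. The e-mail addresses are made up for illustration.

```python
import re

url = "https://www.baidu.com"
m = re.search(r'(https|http|ftp|file)://(\S+)', url)

print(m.group(0))   # https://www.baidu.com  (the whole match)
print(m.group(1))   # https
print(m.group(2))   # www.baidu.com
print(m.groups())   # ('https', 'www.baidu.com')

# With groups in the pattern, findall returns the groups, not the whole match
pairs = re.findall(r'(\w+)@(\w+)\.com', "tom@qq.com, amy@163.com")
print(pairs)        # [('tom', 'qq'), ('amy', '163')]
```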

Named capture group:

You can give a subgroup of a regular expression a name that describes its meaning. Such a named subgroup is called a named capture group.

Format: "(?P<name>pattern)"

# Name the subgroup "pig"
re.search(r'(?P<pig>ab)+',"ababababab").group('pig')
# 'ab'
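Named groups are most useful when a pattern has several parts. A minimal sketch, using a made-up date string and group names chosen for illustration:

```python
import re

# The sample text and the group names year/month/day are made up for illustration
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', "date: 2021-03-15")

print(m.group('year'))   # 2021
print(m.group(1))        # 2021  (named groups are still numbered)
print(m.groupdict())     # {'year': '2021', 'month': '03', 'day': '15'}
```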

Notes

A regular expression can contain multiple subgroups
Subgroups can be nested, but they must not overlap, and the nesting should be kept simple
Subgroups are generally numbered from the outside in and from left to right
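The numbering rule can be checked with a nested pattern: groups are numbered by the position of their opening parenthesis. The date string below is made up for illustration.

```python
import re

# Group 1 is the outer group; groups 2 and 3 are nested inside it; group 4 follows
m = re.search(r'((\d{4})-(\d{2}))-(\d{2})', "2021-03-15")

print(m.group(1))   # 2021-03
print(m.group(2))   # 2021
print(m.group(3))   # 03
print(m.group(4))   # 15
print(m.groups())   # ('2021-03', '2021', '03', '15')
```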

Four. Summary

4.1 Principles of Regular Expressions

Correctness: the expression matches the target strings correctly.
Exclusivity: apart from the target strings, it matches as little other content as possible.
Comprehensiveness: it takes as many forms of the target string into account as possible, without omissions.

4.2 Shorthand for common character sets
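As a rough illustration of this heading, the shorthand classes from section 2 can be compared against explicit character sets. The equivalences in this sketch are an assumption that holds for ASCII matching only (hence the `re.ASCII` flag); without the flag, \d, \w, and \s also match non-ASCII digits, letters, and whitespace. The sample string is made up.

```python
import re

sample = "port_80 \t ok!\n"

# (shorthand, explicit character set) pairs, valid under re.ASCII
pairs = [
    (r'\d', r'[0-9]'),
    (r'\D', r'[^0-9]'),
    (r'\w', r'[A-Za-z0-9_]'),
    (r'\W', r'[^A-Za-z0-9_]'),
    (r'\s', r'[ \t\n\r\f\v]'),
    (r'\S', r'[^ \t\n\r\f\v]'),
]

results = []
for shorthand, explicit in pairs:
    same = re.findall(shorthand, sample, re.ASCII) == re.findall(explicit, sample)
    results.append(same)
    print(shorthand, explicit, same)   # each line ends with True
```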

4.3 Quantifiers
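The quantifiers from sections 2.8 to 2.12 can be compared side by side by running them over the same input. This is only a summary sketch; the sample text is made up for illustration.

```python
import re

text = "a ab abb abbb abbbb"

star = re.findall(r'ab*', text)         # zero or more b
plus = re.findall(r'ab+', text)         # one or more b
optional = re.findall(r'ab?', text)     # zero or one b
exact = re.findall(r'ab{2}', text)      # exactly two b
ranged = re.findall(r'ab{2,3}', text)   # two to three b

print(star)       # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)       # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)   # ['a', 'ab', 'ab', 'ab', 'ab']
print(exact)      # ['abb', 'abb', 'abb']
print(ranged)     # ['abb', 'abbb', 'abbb']
```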

4.4 Consolidation and practice

1. To match a special character with a regular expression, put \ in front of it to escape it.

Special characters: . * + ? ^ $ [] () {} | \
To match the special character . literally, use \.
Example 1:

import re
print(re.findall(r'-?\d+\.?\d*',"123,-123,1.23,-1.23"))
# ['123', '-123', '1.23', '-1.23']

Example 2:

import re
s="1992年5月2日出生于中国，中国内地影视女演员-12,12.34,1/0,40%,-1.6"
r=re.findall(r'-?\d+\.?/?\d*%?',s)
print(r)
# ['1992', '5', '2', '-12', '12.34', '1/0', '40%', '-1.6']

2. In programming languages, regular expressions are usually written as raw strings to avoid the trouble of escaping twice.
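The double escaping can be checked directly with a short sketch. The pattern matches a dollar amount; the variable names and the sample text are made up for illustration.

```python
import re

escaped = "\\$\\d+"    # ordinary string: every backslash is written twice
raw = r"\$\d+"         # raw string: backslashes are written once

print(escaped == raw)   # True
print(escaped)          # \$\d+

amounts = re.findall(raw, "price: $100")
print(amounts)          # ['$100']
```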

 Python string  -->    regex    -->    target string
    "\\$\\d+"   is parsed as   \$\d+   which matches   "$100"
    "\\$\\d+"   is equivalent to   r"\$\d+"