Data analysis in Python crawlers: a detailed explanation of regular expressions

1. What is a regular expression

A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept in computer science. Regular expressions are usually used to search for and replace text that matches a certain pattern (rule).

Application scenarios of regular expressions:

1. Check the legality of a string

  • Ⅰ. Verify a user name (a-z, 0-9, not all digits, not all letters)
  • Ⅱ. Verify an email format
  • Ⅲ. Verify a phone number (11 digits)
  • Ⅳ. Verify an ID card number (18 digits)
  • Ⅴ. Verify the QQ number format (5-12 digits, the first digit cannot be 0)

2. Extract information from a string

  • Ⅰ. Extract the numbers in a text message
  • Ⅱ. Extract the suffix of a file name
  • Ⅲ. Collector (web crawler)

3. Replace parts of a string

  • Ⅰ. Replace illegal characters in a string
  • Ⅱ. Mask a phone number (18323876542)
  • Ⅲ. Replace a placeholder: "hello {{name}}" becomes "hello Wang Lao Er" (template frameworks)

4. Split a string

  • Ⅰ. Split a string according to specified rules
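Two of the scenarios above can be sketched in a few lines. This is only an illustration: the helper names `is_valid_qq` and `mask_phone` are hypothetical, and the masking layout (keep the first 3 and last 4 digits) is an assumption, not something the list prescribes.

```python
import re

# Rule taken from the list above: 5-12 digits, first digit not 0.
QQ_PATTERN = re.compile(r"[1-9]\d{4,11}")


def is_valid_qq(text):
    """Hypothetical helper: True if the whole string is a valid QQ number."""
    # fullmatch() requires the entire string to match, not just a prefix.
    return QQ_PATTERN.fullmatch(text) is not None


def mask_phone(phone):
    """Hypothetical helper: hide the middle 4 digits of an 11-digit number."""
    # \1 and \2 refer to the first and second capture groups.
    return re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", phone)


print(is_valid_qq("10001"))       # True
print(is_valid_qq("012345"))      # False (starts with 0)
print(mask_phone("18323876542"))  # 183****6542
```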

metacharacter

Explanation:
\d and [] (character set)
[123456zxcv] A character set matches exactly one character, and only if that character appears in the set.
\d represents the digits 0 to 9; for ASCII text, [0123456789] is equivalent to \d.
[a-zA-Z0-9_] must be written exactly like this. Do not add commas or spaces between the ranges, because they would be treated as members of the set. Note that the set also contains an underscore.
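A small sketch of the character sets described above. The sample string and variable names are made up for illustration. The last line shows what happens when a comma and a space are added inside the brackets: they become characters the set will match.

```python
import re

text = "user_42, id=7"

digits_short = re.findall(r"\d", text)            # \d shorthand
digits_set = re.findall(r"[0123456789]", text)    # explicit character set
words = re.findall(r"[a-zA-Z0-9_]+", text)        # letters, digits, underscore
with_comma = re.findall(r"[a-z, ]+", text)        # comma and space are now members

print(digits_short)  # ['4', '2', '7']
print(digits_set)    # ['4', '2', '7']
print(words)         # ['user_42', 'id', '7']
print(with_comma)    # ['user', ', id']
```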

quantifier

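A minimal sketch of the four common quantifiers `*`, `+`, `?`, and `{m,n}`, applied to a made-up sample string. Each quantifier applies to the element immediately before it (here, the letter `b`).

```python
import re

text = "a ab abb abbb"

star = re.findall(r"ab*", text)       # b zero or more times
plus = re.findall(r"ab+", text)       # b one or more times
optional = re.findall(r"ab?", text)   # b zero or one time
ranged = re.findall(r"ab{2,3}", text) # b two to three times

print(star)      # ['a', 'ab', 'abb', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['a', 'ab', 'ab', 'ab']
print(ranged)    # ['abb', 'abbb']
```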

exact match vs generic match

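Explanation:
Generic match: a generic match uses a catch-all such as .* to match everything between two known parts of the text.
Exact match: an exact match extracts only the content inside the parentheses, that is, a capture group.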
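One way to read the distinction above, sketched under the assumption that "brackets" refers to the parentheses of a capture group. The sample string is invented for illustration.

```python
import re

text = "Hello 1234567 World_This is a Regex Demo"

# Generic match: .* swallows everything between "Hello" and "Demo".
generic = re.match(r"^Hello.*Demo$", text)
print(generic.group())   # the whole string

# Exact match: the parentheses capture only the digits.
exact = re.match(r"^Hello\s(\d+)\sWorld", text)
print(exact.group(1))    # 1234567
```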

Greedy and non-greedy matching

Quantifiers in Python are greedy by default (a few languages default to non-greedy): they always try to match as many characters as possible.
Non-greedy matching is the opposite: it always tries to match as few characters as possible.
Adding ? after "*", "?", "+", or "{m,n}" turns a greedy quantifier into a non-greedy one.
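The difference is easiest to see side by side. The following sketch uses a made-up HTML fragment; the variable names are illustrative only.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the LAST </b> it can reach.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? stops at the FIRST </b> it can reach.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']

# The same applies to {m,n}: greedy takes 4 digits, non-greedy takes 2.
greedy_digits = re.match(r"\d{2,4}", "12345").group()
lazy_digits = re.match(r"\d{2,4}?", "12345").group()
print(greedy_digits)  # 1234
print(lazy_digits)    # 12
```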

2. The re module

The re module can be used in two ways: an object-based approach (compile a Pattern object, then call its methods) and a function-based approach (call module-level functions such as re.match directly).

1. re.match

match() looks for a match only at the beginning of the string (the method form on a compiled Pattern object also accepts a starting position). It performs a single match: it returns as soon as one result is found rather than collecting all matches. Its general usage is as follows:


match(pattern, string[, flags])


Explanation:
Here, pattern is the regular expression string, string is the string to be matched, and flags is an optional parameter.
When the match succeeds, a Match object is returned; if nothing matches, None is returned.
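A short sketch of this behaviour with made-up inputs. The last part uses the `pos` argument of `Pattern.match`, which is how a starting position can be specified.

```python
import re

# Digits at the very start: match succeeds.
m = re.match(r"\d+", "123abc")
print(m.group())  # 123
print(m.span())   # (0, 3)

# Digits exist, but not at the start: match() returns None.
no_match = re.match(r"\d+", "abc123")
print(no_match)   # None

# A compiled Pattern can start matching at a given position.
pattern = re.compile(r"\d+")
m_pos = pattern.match("abc123", 3)
print(m_pos.group())  # 123
```

Because the result may be None, it is safer to test it before calling group().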

2. re.search

search() scans the string and looks for a match at any position. It also performs a single match: it returns as soon as the first result is found rather than collecting all matches. Its general usage is as follows:


search(pattern, string[, flags])

Explanation:
When the match succeeds, a Match object is returned; if nothing matches, None is returned.
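For comparison with match(), a minimal sketch on an invented string: search() finds the first run of digits wherever it occurs, and only the first.

```python
import re

found = re.search(r"\d+", "abc123def456")
print(found.group())  # 123  (only the first match)
print(found.start())  # 3

missing = re.search(r"\d+", "no digits here")
print(missing)        # None
```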

3. re.findall

The match() and search() functions above both perform a single match and return as soon as one result is found. Most of the time, however, we need to search the entire string and get every match. The usage of findall() is as follows:


findall(pattern, string[, flags])

findall() returns all matching substrings as a list, or an empty list if there is no match.
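A brief sketch with made-up inputs. Note one detail of findall(): when the pattern contains capture groups, the list holds the group contents (tuples for several groups) rather than the whole match.

```python
import re

numbers = re.findall(r"\d+", "a1b22c333")
print(numbers)  # ['1', '22', '333']

nothing = re.findall(r"\d+", "abc")
print(nothing)  # []

# With two capture groups, each result is a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")
print(pairs)    # [('x', '1'), ('y', '22')]
```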

4. re.split

split() splits the string at every substring that the pattern matches and returns a list. Its usage is as follows:


split(pattern, string[, maxsplit, flags])

Explanation:
Here, maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
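A minimal sketch of splitting on several delimiters at once, with and without maxsplit. The input string is invented for illustration.

```python
import re

text = "a, b;c  d"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", text)
print(parts)    # ['a', 'b', 'c', 'd']

# Stop after two splits; the remainder stays in the last element.
limited = re.split(r"[,;\s]+", text, maxsplit=2)
print(limited)  # ['a', 'b', 'c  d']
```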

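5. re.sub

sub() is used for replacement. Its usage is as follows:


sub(pattern, repl, string[, count, flags])

Explanation:
The first argument is the regular expression, the second is the replacement string, and the third is the source string. The fourth argument is optional and gives the maximum number of replacements; if it is omitted, every match of the pattern is replaced.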
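A short sketch of sub() on made-up strings. It also shows that repl may be a function that receives each Match object, and it reuses the "hello {{name}}" placeholder scenario from the beginning of the article.

```python
import re

replaced_all = re.sub(r"\d+", "#", "a1b22c333")
print(replaced_all)    # a#b#c#

# count limits how many matches are replaced.
replaced_two = re.sub(r"\d+", "#", "a1b22c333", count=2)
print(replaced_two)    # a#b#c333

# repl can be a function: here every number is doubled.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a1b22")
print(doubled)         # a2b44

# Filling a template placeholder.
greeting = re.sub(r"\{\{\s*name\s*\}\}", "Wang Lao Er", "hello {{name}}")
print(greeting)        # hello Wang Lao Er
```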

6. re.compile

The compile() function compiles the string form of a regular expression into a Pattern object. The methods of this object can then be used to match and search text and obtain results (Match objects). Compiling once is mainly useful when the same pattern is reused many times. Its general usage is as follows:


import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+', re.S)
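Once compiled, the Pattern object offers the same operations as the module-level functions, without passing the pattern each time. A minimal sketch with invented inputs:

```python
import re

pattern = re.compile(r"\d+")

first = pattern.match("123abc").group()     # match at the start
anywhere = pattern.search("abc456").group() # first match anywhere
every = pattern.findall("1 22 333")         # all matches
hashed = pattern.sub("#", "a1b2")           # replace matches
pieces = pattern.split("a1b22c")            # split at matches

print(first)     # 123
print(anywhere)  # 456
print(every)     # ['1', '22', '333']
print(hashed)    # a#b#
print(pieces)    # ['a', 'b', 'c']
```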

7. Usage of flags

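A sketch of three frequently used flags: re.I (ignore case), re.S (let . match newlines), and re.M (let ^ and $ match at each line). The selection and the sample strings are assumptions for illustration, not a complete list of flags.

```python
import re

# re.I: case-insensitive matching.
ignore_case = re.findall(r"python", "Python PYTHON python", re.I)
print(ignore_case)   # ['Python', 'PYTHON', 'python']

# re.S: by default . does not match a newline; with re.S it does.
without_s = re.search(r"a.b", "a\nb")
with_s = re.search(r"a.b", "a\nb", re.S)
print(without_s)     # None
print(with_s.group() == "a\nb")  # True

# re.M: ^ matches at the start of every line, not only the string.
single_line = re.findall(r"^\w+", "one\ntwo")
multi_line = re.findall(r"^\w+", "one\ntwo", re.M)
print(single_line)   # ['one']
print(multi_line)    # ['one', 'two']

# Flags are combined with the | operator.
combined = re.search(r"A.B", "a\nb", re.I | re.S)
print(combined is not None)  # True
```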

8. A general-purpose pattern

In (.*?), the . matches any character except a newline, the * repeats it any number of times, and the trailing ? makes the match non-greedy, so it stops at the shortest possible match. The parentheses capture the matched text.
This pattern can extract most of the data you are likely to want. Try this combination first when writing regular expressions; it may achieve twice the result with half the effort. It is often combined with the re.findall() function.
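A minimal sketch of combining (.*?) with re.findall() in a crawler-style task. The HTML fragments are invented for illustration. The second part shows that re.S is needed when the wanted text spans several lines, because . does not match a newline by default.

```python
import re

html = '''<li><a href="/p/1">First post</a></li>
<li><a href="/p/2">Second post</a></li>'''

# Two (.*?) groups: one for the link, one for the title.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)  # [('/p/1', 'First post'), ('/p/2', 'Second post')]

block = "<div>\nline one\nline two\n</div>"

# Without re.S the content is not found because it contains newlines.
no_flag = re.findall(r"<div>(.*?)</div>", block)
with_flag = re.findall(r"<div>(.*?)</div>", block, re.S)
print(no_flag)    # []
print(with_flag)  # ['\nline one\nline two\n']
```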


Origin blog.csdn.net/m0_74459049/article/details/130220352