Crawler Prerequisites - Regular Expression Syntax and Use in Python

Regular expressions are a powerful tool for manipulating strings. They are not a programming language in themselves.

Regular expressions have their own independent processing engine, and their core syntax is largely the same regardless of the programming language.

 

Regular expression matching process

1. Take the expression and compare it with the text one character at a time.

2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.

3. If the expression contains quantifiers or boundary matchers, the process is slightly different.
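The steps above can be tried out with Python's re module. This is only a minimal sketch; the sample strings are made up for illustration.

```python
import re

# re.match compares the pattern with the text starting at the first character.
full = re.match(r"abc", "abcdef")     # every character matched: a Match object
failed = re.match(r"abc", "abxdef")   # 'c' does not equal 'x': None

# With a quantifier, one pattern element may consume several characters.
repeated = re.match(r"ab+c", "abbbc")

print(full.group())      # abc
print(failed)            # None
print(repeated.group())  # abbbc
```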

 

Here are some examples of symbols

[...]

Character set (character class). The corresponding position can be any one character in the set. The characters can be listed one by one or given as a range, such as [abc] or [a-c]. If the first character is ^, the set is negated: [^abc] matches any character other than a, b or c. Special characters lose their special meaning inside a character set. To use ], - or ^ literally in a set, put the escape character backslash \ in front of it, or place ] or - as the first character and ^ anywhere except the first position.
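A short sketch of these character-set rules in Python follows. The sample text is invented for illustration, and placing - last in the set relies on Python's rule that a leading or trailing - is literal.

```python
import re

text = "a-b]c^d"

in_set = re.findall(r"[abc]", text)        # any one of a, b, c
in_range = re.findall(r"[a-c]", text)      # the same set written as a range
negated = re.findall(r"[^a-d]", text)      # anything except a, b, c, d
escaped = re.findall(r"[\]\-\^]", text)    # ], - and ^ escaped with a backslash
positioned = re.findall(r"[]^-]", text)    # ] first, ^ not first, - last

# Special characters such as . * + are plain characters inside a set.
literals = re.findall(r"[.*+]", "1.5*2+3")

print(in_set, in_range)       # ['a', 'b', 'c'] ['a', 'b', 'c']
print(negated)                # ['-', ']', '^']
print(escaped == positioned)  # True
print(literals)               # ['.', '*', '+']
```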

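Predefined character sets (they can also be written inside a character set [...]). The equivalents below hold for ASCII matching; with Python 3 str patterns, \d, \s and \w also match Unicode characters unless the re.ASCII flag is set: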

\d Digits: [0-9]

\D non-digit: [^\d]

\s whitespace: [<space>\t\r\n\f\v]

\S non-whitespace character: [^\s]

\w word character: [A-Za-z0-9_]

\W non-word character: [^\w]
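The predefined sets can be checked quickly in Python. The sample string below is hypothetical and contains only ASCII characters, so the ASCII equivalents listed above apply.

```python
import re

sample = "id_7 = 42\n"

digits = re.findall(r"\d", sample)       # single digits
words = re.findall(r"\w+", sample)       # runs of word characters
spaces = re.findall(r"\s", sample)       # whitespace, including the newline
non_word = re.findall(r"\W", sample)     # everything that is not a word character
in_set = re.findall(r"[\d_]", sample)    # a predefined set used inside [...]

print(digits)    # ['7', '4', '2']
print(words)     # ['id_7', '42']
print(spaces)    # [' ', ' ', '\n']
print(non_word)  # [' ', '=', ' ', '\n']
print(in_set)    # ['_', '7', '4', '2']
```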

Quantifiers (used after characters or (...))

* matches the previous character 0 or more times

+ matches the previous character 1 or more times

? matches the previous character 0 or 1 time

{m} matches the previous character exactly m times

{m,n} matches the previous character at least m and at most n times

    m and n can be omitted: if m is omitted, match 0 to n times; if n is omitted, match m or more times
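As a rough illustration of the quantifiers, the sketch below uses re.fullmatch so that a pattern only counts as matching when it covers the whole string. The helper name whole is made up for this example.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

star = [whole(r"ab*", s) for s in ("a", "ab", "abbb")]        # 0 or more b
plus = [whole(r"ab+", s) for s in ("a", "ab", "abbb")]        # 1 or more b
optional = [whole(r"ab?", s) for s in ("a", "ab", "abb")]     # 0 or 1 b
exact = [whole(r"ab{2}", s) for s in ("ab", "abb", "abbb")]   # exactly 2 b
bounded = [whole(r"ab{1,2}", s) for s in ("a", "ab", "abb", "abbb")]
at_most = whole(r"ab{,2}", "a")          # m omitted: 0 to 2 times
at_least = whole(r"ab{2,}", "abbbbb")    # n omitted: 2 or more times

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(exact)     # [False, True, False]
print(bounded)   # [False, True, True, False]
print(at_most, at_least)  # True True
```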

Boundary matching (does not consume characters in the string to be matched)

^ matches the beginning of the string. In multiline mode it also matches the beginning of each line.

$ matches the end of the string. In multiline mode it also matches the end of each line.

\A matches only the beginning of the string.

\Z matches only the end of the string.

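\b matches a word boundary: the position between a \w and a \W character, or between a \w and the start or end of the string.

\B matches wherever \b does not: [^\b]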
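The boundary matchers can be compared side by side in Python. The two-line sample text is invented; re.M is the multiline flag.

```python
import re

text = "first line\nsecond line"

starts = re.findall(r"^\w+", text)               # start of the string only
starts_multi = re.findall(r"^\w+", text, re.M)   # start of every line
ends = re.findall(r"\w+$", text)                 # end of the string only
ends_multi = re.findall(r"\w+$", text, re.M)     # end of every line
only_start = re.findall(r"\A\w+", text, re.M)    # \A ignores multiline mode
only_end = re.findall(r"\w+\Z", text, re.M)      # \Z ignores multiline mode

words = "line inline lines"
whole_word = re.findall(r"\bline\b", words)      # 'line' as a separate word
inside = re.search(r"\Bline\b", words)           # 'line' at the end of 'inline'

print(starts, starts_multi)   # ['first'] ['first', 'second']
print(ends, ends_multi)       # ['line'] ['line', 'line']
print(only_start, only_end)   # ['first'] ['line']
print(whole_word)             # ['line']
print(inside.start())         # 7
```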

Logic, grouping:

| matches either the expression on its left or the one on its right. (Like a short-circuit OR in C, the left expression is always tried first, and the right one is skipped if the left one matches. If | is not enclosed in (), its scope is the entire regular expression.)

(...) The enclosed expression is treated as a group. Groups are numbered from the left of the expression: each time the opening bracket '(' of a group is encountered, the number increases by 1. The group as a whole can be followed by a quantifier. A | inside the group is only valid within that group.

(?P<name>...) A group that is given the alias name in addition to its number.

\<number> refers to the string matched by the grouping with the number <number>.

(?P=name) Refers to the string matched by the group whose alias is <name>.
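Group numbering, alternation and backreferences could be exercised as follows. This is a sketch with invented sample strings, not a definitive set of examples.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
m = re.match(r"((\d{4})-(\d{2}))-(\d{2})", "2024-05-17")
numbered = m.groups()

# A group as a whole can take a quantifier; the | inside it stays within the group.
repeated = re.fullmatch(r"(ab|cd)+", "abcdab") is not None

# A | outside any group splits the entire expression.
alternation = re.findall(r"cat|dog", "cat, dog, cow")

# Backreference by number: the same word character twice in a row.
doubled = re.search(r"(\w)\1", "hello").group()

# Named group and backreference by name: a repeated word.
named = re.search(r"(?P<word>\w+) (?P=word)", "it is is fine")

print(numbered)       # ('2024-05', '2024', '05', '17')
print(repeated)       # True
print(alternation)    # ['cat', 'dog']
print(doubled)        # ll
print(named.group(), named.group("word"))  # is is is
```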

Special constructs (not treated as groups):

(?:...) The non-capturing version of (...), used to limit the scope of '|' or to be followed by a quantifier.

(?iLmsux) Each character in iLmsux stands for a matching mode (flag). This construct can only be used at the beginning of a regular expression, and several flags can be combined.

(?#...) The content after # is ignored as a comment.

(?=...) The string content that follows must match the expression for the match to succeed. String content is not consumed.

(?!...) The string content that follows must not match the expression for the match to succeed. String content is not consumed.

(?<=...) The string content that precedes must match the expression for the match to succeed. String content is not consumed.

(?<!...) The string content that precedes must not match the expression for the match to succeed. String content is not consumed.

(?(id/name)yes-pattern|no-pattern) If the group with the number id or the alias name has matched, yes-pattern must match; otherwise no-pattern must match. |no-pattern can be omitted.
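The special constructs are easier to follow with small examples. The sketch below uses hypothetical sample strings; the conditional pattern accepts a number either with both parentheses or with none.

```python
import re

# (?:...) groups without capturing, so findall returns only the captured digits.
non_capturing = re.findall(r"(?:item|part)-(\d+)", "item-1 part-22")

# (?i) at the start of the pattern switches on case-insensitive matching.
inline_flag = re.findall(r"(?i)python", "Python PYTHON python")

# (?#...) is a comment and does not affect matching.
commented = re.findall(r"\d+(?#one or more digits)", "a1b22")

prices = "10 USD, 20 EUR, 30 USD"
followed = re.findall(r"\d+(?= USD)", prices)         # lookahead
not_followed = re.findall(r"\d+\b(?! USD)", prices)   # negative lookahead

tags = "$10 #20 $30"
preceded = re.findall(r"(?<=\$)\d+", tags)            # lookbehind
not_preceded = re.findall(r"(?<!\$)\b\d+", tags)      # negative lookbehind

# Conditional: a closing bracket is required only if group 1 (the opening one) matched.
cond = re.compile(r"(\()?\d+(?(1)\))")
balanced = [cond.fullmatch(s) is not None for s in ("(12)", "12", "(12", "12)")]

print(non_capturing)            # ['1', '22']
print(len(inline_flag))         # 3
print(commented)                # ['1', '22']
print(followed, not_followed)   # ['10', '30'] ['20']
print(preceded, not_preceded)   # ['10', '30'] ['20']
print(balanced)                 # [True, True, False, False]
```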

 

Greedy and non-greedy modes of quantifiers

Regular expressions are often used to find matching strings in text.

Greedy mode: always try to match as many characters as possible. (Quantifiers in Python are greedy by default.)

Non-greedy mode: always try to match as few characters as possible. (Adding ? after a greedy quantifier such as *, +, ? or {m,n} makes it non-greedy.)
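The difference between the two modes shows up clearly on markup-like text. A minimal sketch, with a made-up sample string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)    # .+ runs to the last '>' in the text
lazy = re.findall(r"<.+?>", html)     # .+? stops at the first '>' it can

greedy_digits = re.match(r"\d{2,4}", "123456").group()    # takes as many as allowed
lazy_digits = re.match(r"\d{2,4}?", "123456").group()     # takes as few as allowed

print(greedy)         # ['<b>bold</b> and <i>italic</i>']
print(lazy)           # ['<b>', '</b>', '<i>', '</i>']
print(greedy_digits)  # 1234
print(lazy_digits)    # 12
```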

 

How to use regular expressions in Python

Python supports regular expressions through the standard library module re.
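A short script along the following lines shows the module in use. The pattern and the findall call are taken from this article; the sample text is an assumption made up for illustration.

```python
import re

# Compile the pattern once so that it can be reused.
pattern = re.compile(r'\d+\.\d*')

# Hypothetical sample text (not from the original article).
text = "3.14 is close to pi, 10. ends with a dot, and 42 has no dot"

result = pattern.findall(text)
print(result)   # ['3.14', '10.']
```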


Let's analyze the statement pattern = re.compile(r'\d+\.\d*'):

\d means a digit [0-9]

+ means repeating the preceding item 1 or more times

\. represents the literal character '.'

* means repeating the preceding item 0 or more times

The r prefix marks a raw string: Python does not process the backslash escape sequences in it, so the string reaches the regular expression engine exactly as written.
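The effect of the r prefix can be seen by comparing raw and ordinary string literals. A small sketch, with a made-up sample string:

```python
import re

# In an ordinary string every backslash meant for the regex must be doubled.
plain = "\\d+\\.\\d*"
raw = r"\d+\.\d*"          # raw string: backslashes are kept as written
same = plain == raw

# "\b" is a single backspace character, while r"\b" is a backslash followed by b.
backspace_len = len("\b")
raw_len = len(r"\b")

found_plain = re.findall("\bcat\b", "a cat")    # searches for backspace characters
found_raw = re.findall(r"\bcat\b", "a cat")     # searches for the word 'cat'

print(same)                    # True
print(backspace_len, raw_len)  # 1 2
print(found_plain, found_raw)  # [] ['cat']
```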

 

So \d+\.\d* is a rule for matching some decimals. However, this expression does not match all decimals correctly: strings like '0.' are matched too. The example is only meant to show a few more symbols.
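The limitation can be checked directly. In the sketch below, the stricter pattern is only one possible variant and an assumption of this example, not something the article prescribes.

```python
import re

loose = re.compile(r'\d+\.\d*')

accepts_trailing_dot = loose.fullmatch("0.") is not None   # \d* may match nothing
rejects_leading_dot = loose.fullmatch(".5") is None        # \d+ needs a digit first
rejects_integer = loose.fullmatch("7") is None             # the dot is mandatory

# Hypothetical stricter variant: require digits on both sides of the dot.
strict = re.compile(r'\d+\.\d+')
strict_rejects = strict.fullmatch("0.") is None
strict_accepts = strict.fullmatch("0.5") is not None

print(accepts_trailing_dot, rejects_leading_dot, rejects_integer)  # True True True
print(strict_rejects, strict_accepts)                              # True True
```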

 

We now have a pattern object that matches the rule \d+\.\d*.

With the pattern's findall method, we can extract every matching substring from a text.

It returns a list of strings.
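The return value of findall can be inspected as follows. This is a sketch with invented input strings; the conversion to float is an optional extra step, not part of the original example.

```python
import re

pattern = re.compile(r'\d+\.\d*')

matches = pattern.findall("width 1.5, height 20.25, depth 3")
no_matches = pattern.findall("no decimals here")   # an empty list, not None

# The items are strings; convert them if numbers are needed.
values = [float(s) for s in matches]

# The module-level function gives the same result without an explicit compile step.
same = re.findall(r'\d+\.\d*', "width 1.5, height 20.25, depth 3") == matches

print(matches)     # ['1.5', '20.25']
print(no_matches)  # []
print(values)      # [1.5, 20.25]
print(same)        # True
```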

