Python: Regular Expressions (1)

The original text is from FishC (a very good website with no ads; becoming a member is a good choice). Here are my notes.

In Python, regular expressions are provided by a module called re. Personally, I think of a regular expression as a super wildcard: it describes a set of strings, and we use it to search text for anything that belongs to that set. Such a set might contain English sentences, e-mail addresses, TeX commands, or anything else.

The regular expression language is relatively small and limited, which means that not all possible string tasks are easily accomplished using regular expressions.

Simple patterns

The basic use of a simple pattern is character matching. 1) Most characters match themselves, and matching is case sensitive by default: FishC matches FishC exactly. Only when case-insensitive mode (re.IGNORECASE) is enabled will it also match FISHC or fishc.
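As a minimal sketch of literal matching, the snippet below compares the default case-sensitive behaviour with re.IGNORECASE. The sample sentences and the variable names exact and loose are made up for illustration.

```python
import re

# Case-sensitive by default: only the exact spelling matches.
exact = re.compile(r"FishC")
print(bool(exact.search("I love FishC")))   # True
print(bool(exact.search("I love fishc")))   # False

# With re.IGNORECASE the same pattern also matches other casings.
loose = re.compile(r"FishC", re.IGNORECASE)
print(bool(loose.search("I love FISHC")))   # True
print(bool(loose.search("I love fishc")))   # True
```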

2) A few characters, called metacharacters, do not match themselves: . ^ $ * + ? { } [ ] \ | ( ). They can be thought of as special characters. Inside square brackets (a character class), however, most metacharacters lose their special function and match only themselves; the exceptions are \, ], a - between two characters, and a ^ in first position. For example, [akm$] matches any one of the characters 'a', 'k', 'm', or '$'. $ is normally a metacharacter, but inside square brackets it has no special meaning and just matches the $ character itself.

3) You can match every character that is not listed in the square brackets by adding a caret ^ as the first character of the class. For example, [^5] matches any character except 5.
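A small sketch of the two character classes mentioned in points 2) and 3). The sample string text is invented for this example.

```python
import re

text = "a$k5mz"

# Inside a class, $ is an ordinary character.
in_class = re.findall(r"[akm$]", text)
print(in_class)   # ['a', '$', 'k', 'm']

# A leading ^ complements the class: everything except 5.
not_five = re.findall(r"[^5]", text)
print(not_five)   # ['a', '$', 'k', 'm', 'z']
```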

4) The most important metacharacter is the backslash \. When a backslash is followed by a metacharacter, that metacharacter's special function is not triggered. For example, to match a literal [ or \, put a backslash in front of it to remove its special function: \[ or \\.
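The following sketch shows escaping in practice. The patterns are written as raw strings (r"...") so that Python itself does not interpret the backslashes first. The sample text is invented, and re.escape is a standard-library helper that the original notes do not mention; it is included only as a convenience.

```python
import re

text = r"list[0] lives in C:\temp"

# \[ matches a literal opening bracket.
brackets = re.findall(r"\[", text)
print(brackets)       # ['[']

# \\ matches one literal backslash.
backslashes = re.findall(r"\\", text)
print(len(backslashes))   # 1

# re.escape adds the backslashes for you (not covered in the notes above).
escaped = re.escape("a.b")
print(escaped)        # a\.b
```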

5) Besides removing the special function of metacharacters, the backslash can also give a special function to ordinary characters. For example, \w matches any word character. If the regular expression is a bytes pattern, this is equivalent to the character class [a-zA-Z0-9_]; if the regular expression is a string, \w matches the characters marked as letters or digits in the Unicode database, plus the underscore.

    \d matches any decimal digit; for ASCII text, equivalent to [0-9]

    \D matches any character that is not a decimal digit; equivalent to [^0-9]

    \s matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]

    \S matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]

    \w matches any word character; for ASCII text, equivalent to [a-zA-Z0-9_]

    \W matches any non-word character; equivalent to [^a-zA-Z0-9_]

    \b matches the empty position at the start or end of a word

    \B matches any position that is not the start or end of a word

The class shorthands such as \s, \d and \w can be used inside a character class and keep their special meaning there. For example, [\s,.] is a character class that matches any whitespace character, ',' or '.'.
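A short sketch of the special sequences listed above. The sample text and the e-mail address in it are invented for illustration.

```python
import re

text = "Call 555-1234, or mail fish_c@example.com today."

digits = re.findall(r"\d+", text)          # runs of decimal digits
print(digits)       # ['555', '1234']

words = re.findall(r"\w+", text)           # runs of word characters
print(words)

only_digits = re.sub(r"\D", "", "555-1234")  # drop every non-digit
print(only_digits)  # 5551234

whole_word = re.findall(r"\bor\b", text)   # "or" as a separate word only
print(whole_word)   # ['or']

# A shorthand inside a character class: split on whitespace, ',' or '.'.
parts = re.split(r"[\s,.]+", "a, b.c  d")
print(parts)        # ['a', 'b', 'c', 'd']
```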

6) Metacharacter '.': matches any character except newline. If re.DOTALL is set, . will match any character including newlines.
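A minimal sketch of the difference re.DOTALL makes, using a made-up two-line string:

```python
import re

text = "line one\nline two"

# By default . stops at the newline.
default_match = re.match(r".+", text).group()
print(repr(default_match))   # 'line one'

# With re.DOTALL, . also matches the newline.
dotall_match = re.match(r".+", text, re.DOTALL).group()
print(repr(dotall_match))    # 'line one\nline two'
```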

7) Besides matching different sets of characters, a regular expression can also specify how many times a part of the RE must be repeated.

8) The metacharacter * specifies that the previous character may be matched zero or more times. For example, ca*t matches ct (0 a characters), cat (1 a character), caaat (3 a characters), and so on.
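The ca*t example can be checked directly. This sketch uses re.fullmatch so that the whole word has to match; the extra word cbt is added here only as a negative case.

```python
import re

pattern = re.compile(r"ca*t")

results = {word: bool(pattern.fullmatch(word))
           for word in ["ct", "cat", "caaat", "cbt"]}
print(results)
# {'ct': True, 'cat': True, 'caaat': True, 'cbt': False}
```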

9) Repetition in regular expressions is greedy by default. When repeating part of a RE, the matching engine first tries to match as many repetitions as possible. If the rest of the pattern then fails to match, the engine backs up one character and tries the rest of the pattern again, repeating this until the whole pattern matches or there is nothing left to give back (this logic is a bit complicated).
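To make the backing-up behaviour concrete, here is a small sketch. The pattern a[bcd]*b and both input strings are chosen only for illustration.

```python
import re

# [bcd]* first grabs "bcbd"; the final b then has nothing to match,
# so the engine gives back one character at a time until b fits.
backtracked = re.match(r"a[bcd]*b", "abcbd").group()
print(backtracked)   # abcb

# Greedy .* runs to the last '>' instead of stopping at the first one.
greedy = re.match(r"<.*>", "<b>bold</b>").group()
print(greedy)        # <b>bold</b>
```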

10) Another repetition metacharacter is +, which specifies that the previous character is matched one or more times. Unlike *, it requires at least one occurrence.

11) The metacharacter ? specifies that the previous character is matched zero or one time. You can think of it as marking something as optional.
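A brief sketch of + and ? side by side. The pattern home-?brew and the sample strings are illustrative choices, not taken from the notes above.

```python
import re

# + needs at least one a, so "ct" no longer matches.
plus_ct = bool(re.fullmatch(r"ca+t", "ct"))
plus_caaat = bool(re.fullmatch(r"ca+t", "caaat"))
print(plus_ct, plus_caaat)   # False True

# ? makes the hyphen optional.
optional = re.findall(r"home-?brew", "homebrew and home-brew")
print(optional)              # ['homebrew', 'home-brew']
```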

12) The most flexible repetition qualifier is {m,n}: the previous character must be matched at least m times and at most n times. For example, a/{1,3}b matches a/b, a//b, and a///b, but it does not match ab (no slashes) or a////b (more than three slashes).

13) Shorthand forms of the above: {,n} is equivalent to {0,n}; {m,} is equivalent to {m,+infinity}; {n} repeats the previous character exactly n times. Writing {m, n} with a space is a problem: spaces cannot be added freely inside a regular expression, otherwise the original meaning is changed.
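The sketch below checks the a/{1,3}b example and the shorthand forms. The last two lines show what happens with a space inside the braces in Python's re module: as far as I can tell, the braces are then no longer read as a repetition and are matched as literal text.

```python
import re

pattern = re.compile(r"a/{1,3}b")
slashes = {s: bool(pattern.fullmatch(s))
           for s in ["ab", "a/b", "a//b", "a///b", "a////b"]}
print(slashes)
# {'ab': False, 'a/b': True, 'a//b': True, 'a///b': True, 'a////b': False}

# Shorthand forms.
up_to_two = bool(re.fullmatch(r"a{,2}", "aa"))       # same as a{0,2}
at_least_two = bool(re.fullmatch(r"a{2,}", "aaaaa"))  # two or more
exactly_three = bool(re.fullmatch(r"a{3}", "aaa"))    # exactly three
print(up_to_two, at_least_two, exactly_three)         # True True True

# A space inside the braces changes the meaning.
with_space = bool(re.fullmatch(r"a/{1, 3}b", "a//b"))
literal = bool(re.fullmatch(r"a/{1, 3}b", "a/{1, 3}b"))
print(with_space, literal)   # False True
```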
