Detailed explanation of Python3 regular expressions (1)

This article is translated from: https://docs.python.org/3.4/howto/regex.html

The blogger has made some comments and modifications to this ^_^

Introduction to Regular Expressions

Regular expressions (also known as REs, regexes or regex patterns) are essentially a tiny, highly specialized programming language. This language is embedded in Python and made available to programmers through the re module. With regular expressions, you specify rules that describe the set of strings you want to match. That set might contain English sentences, e-mail addresses, TeX commands, or anything else you like.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, you may need to pay attention to how the engine executes a given RE, and write the RE so that it produces bytecode that runs faster. This article does not cover optimization, because that requires a good understanding of the matching engine's internals. The examples in this article all use standard regular expression syntax.

Note: Python's regular expression engine is written in C, so it is fast. Also, the "regular expression" (the RE mentioned here) is the set of "rules" mentioned above.

The regular expression language is relatively small and limited, so not all possible character processing tasks can be accomplished using regular expressions. There are also some special tasks that can be done with regular expressions, but the expressions can get very complicated because of this. In this case, you might be better off by writing your own Python code; although Python code will be slower to execute than a neat regular expression, it may be easier to understand.

Note: This is probably what people mean by "getting the bad news out first". Don't worry about it: regular expressions are very good and can handle 98.3% of your text tasks, so you must learn them~

Simple Patterns

We will start by learning the simplest regular expressions. Since regular expressions are most often used to manipulate strings, let's begin with the most common task: matching characters.

Matching Characters

Most letters and characters match themselves. For example, the regular expression Fanfan will match the string "Fanfan" exactly (you can enable case-insensitive mode, which will make Fanfan match "FANFAN" or "fanfan", which we'll discuss later).
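To make this concrete, here is a minimal sketch using re.search and the re.IGNORECASE flag from the standard re module. The sample sentences are made up for illustration.

```python
import re

# Ordinary characters match themselves, and matching is case-sensitive by default.
exact = re.search("Fanfan", "Hello, Fanfan!")
no_match = re.search("Fanfan", "Hello, FANFAN!")

# The re.IGNORECASE flag switches on case-insensitive matching.
relaxed = re.search("Fanfan", "Hello, FANFAN!", re.IGNORECASE)

print(exact.group())    # Fanfan
print(no_match)         # None
print(relaxed.group())  # FANFAN
```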

Of course, there are exceptions to this rule. A few special characters, called metacharacters, do not match themselves. Instead, they define character classes, subgroup matching, and how many times a pattern repeats. Much of this article is devoted to the various metacharacters and what they do.

Below is the full list of metacharacters (we'll go through each of them later):

. ^ $ * + ? {} [] \ | ()

Note: Without these metacharacters, regular expressions would be as trivial as the find() method of strings...

Let's first look at the square brackets [ ], which specify a character class: the set of characters you want to match. The characters can be listed individually, or given as a range by writing two characters separated by a hyphen -. For example, [abc] will match the character a, b or c; [a-c] will do the same, because the range describes the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

One thing to note: metacharacters inside square brackets do not trigger their "special functions"; in a character class they only match themselves. For example, [akm$] will match any of the characters 'a', 'k', 'm' or '$'. Although '$' is a metacharacter, inside square brackets it has no special meaning and matches only the '$' character itself.

You can also match every character not listed in the square brackets by adding a caret ^ at the beginning of the class. For example, [^5] will match any character except '5'.
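The character-class rules above can be tried out with re.findall, which returns every non-overlapping match. This is only a small sketch, and the sample strings are invented for the demonstration.

```python
import re

text = "cab$ 15 km"

# Listing characters individually and using a range give the same result.
listed = re.findall(r"[abc]", text)
ranged = re.findall(r"[a-c]", text)

# Inside a class, '$' is an ordinary character.
with_dollar = re.findall(r"[akm$]", text)

# A leading ^ negates the class: everything except '5'.
not_five = re.findall(r"[^5]", "1525")

print(listed)       # ['c', 'a', 'b']
print(ranged)       # ['c', 'a', 'b']
print(with_dollar)  # ['a', '$', 'k', 'm']
print(not_five)     # ['1', '2']
```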

Perhaps the most important metacharacter is the backslash \. As in Python string literals, a backslash placed before a metacharacter stops that metacharacter's "special function" from being triggered. For example, if you need to match the symbol [ or \, you can put a backslash in front of it to remove its special meaning: \[ and \\.
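Here is a short sketch of escaping in practice. It uses Python raw strings (the r"..." prefix), which this part of the article has not introduced yet; they are used here only so that each backslash reaches the re module unchanged. The sample text is made up.

```python
import re

# A raw string keeps the backslash in C:\temp as a single literal character.
text = r"path[0] = C:\temp"

# \[ matches a literal opening bracket, \\ matches a literal backslash.
brackets = re.findall(r"\[", text)
backslashes = re.findall(r"\\", text)

print(brackets)          # ['[']
print(len(backslashes))  # 1
```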

A backslash followed by certain characters can also stand for a special set, such as the set of decimal digits, the set of letters, or the set of non-whitespace characters.

Note: The backslash is really awesome. Followed by a metacharacter, it removes the special function; followed by certain ordinary characters, it adds one.

Let's take an example: \w matches any word character. If the regular expression pattern is a bytes object, this is equivalent to the character class [a-zA-Z0-9_]; if the pattern is a string, \w matches every character marked as a letter in the Unicode database (provided by the unicodedata module). You can restrict \w to the narrower definition by supplying the re.ASCII flag when compiling the regular expression.

Note: The re.ASCII flag makes \w match only ASCII characters. Don't forget that Python 3 strings are Unicode.
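The difference between the three cases is easiest to see side by side. The following is a minimal sketch; the sample text is invented, and the accented letters are written as \u escapes only so the snippet does not depend on the file encoding.

```python
import re

text = "na\u00efve caf\u00e9_1"  # "naïve café_1"

# str pattern: \w follows the Unicode database, so accented letters count.
unicode_words = re.findall(r"\w+", text)

# str pattern with re.ASCII: \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: \w is also limited to [a-zA-Z0-9_].
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['naïve', 'café_1']
print(ascii_words)    # ['na', 've', 'caf', '_1']
print(byte_words)     # [b'na', b've', b'caf', b'_1']
```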

The following table lists some of the special sequences formed by a backslash and a character:

Special sequence   Meaning
\d                 Matches any decimal digit; equivalent to the class [0-9]
\D                 Opposite of \d: matches any character that is not a decimal digit; equivalent to [^0-9]
\s                 Matches any whitespace character (space, newline, tab, etc.); equivalent to the class [ \t\n\r\f\v]
\S                 Opposite of \s: matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v]
\w                 Matches any word character, as explained above
\W                 Opposite of \w
\b                 Matches the beginning or end of a word
\B                 Opposite of \b

These sequences can be included in a character class and keep their special meaning there. For example, [\s,.] is a character class that will match any whitespace character (the special meaning of \s), ',' or '.'.
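A few of these sequences in action, as a small sketch. The sample sentences are made up, and re.split is used to show the [\s,.] class acting as a separator.

```python
import re

text = "Room 42, floor 3.\tOpen 9 to 5"

# \d+ collects runs of decimal digits.
digits = re.findall(r"\d+", text)

# [\s,.]+ treats whitespace, commas and full stops as separators.
pieces = re.split(r"[\s,.]+", text)

# \b anchors the pattern at word boundaries, so only whole words match.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat")

print(digits)       # ['42', '3', '9', '5']
print(pieces)       # ['Room', '42', 'floor', '3', 'Open', '9', 'to', '5']
print(whole_words)  # ['cat', 'cat']
```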

The last metacharacter in this section is the dot . , which matches any character except a newline. If the re.DOTALL flag is set, . will match any character, including a newline.
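A tiny sketch of the difference the re.DOTALL flag makes, using a two-line sample string:

```python
import re

text = "a\nb"

# By default the dot skips the newline.
default_dot = re.findall(r".", text)

# With re.DOTALL the newline is matched as well.
dotall_dot = re.findall(r".", text, re.DOTALL)

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
```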

Repeating Things

Regular expressions can easily match varying sets of characters, which the existing string methods cannot do. However, if you think that's their only advantage, you're too young, too naive. Another powerful feature is that you can specify how many times a part of the RE must be repeated.

Let's take a look at the * metacharacter first. It does not match the '*' character itself (as we said, metacharacters have special abilities). Instead, it specifies that the previous character can be matched zero or more times.

For example, ca*t will match ct (0 'a' characters), cat (1 'a' character), caaat (3 'a' characters), and so on. Note that, because of limits that come from the size of C's int type, the regular expression engine cannot repeat the character 'a' more than about 2 billion times; we do not usually work with data that large, though.
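The ca*t example can be checked directly. This sketch uses re.fullmatch so that the whole string has to match the pattern; the extra string "cbt" is added here only as a counterexample.

```python
import re

pattern = re.compile(r"ca*t")

# True when the entire string matches ca*t, False otherwise.
star_results = {s: bool(pattern.fullmatch(s)) for s in ["ct", "cat", "caaat", "cbt"]}

print(star_results)
# {'ct': True, 'cat': True, 'caaat': True, 'cbt': False}
```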

Repetition is greedy by default. When repeating part of an RE, the matching engine tries to match as many repetitions as possible. If the rest of the pattern then fails to match, the engine backs up one character and tries again with fewer repetitions.

A step-by-step example shows what "greedy" means. Consider the expression a[bcd]*b: it matches the character 'a', then zero or more characters from the class [bcd], and finally ends with a 'b'. Now imagine matching this RE against the string abcbd:

Step  Matched  Explanation
1     a        The 'a' in the RE matches.
2     abcbd    The engine matches [bcd]* as far as it can, which is to the end of the string.
3     failure  The engine tries to match the final 'b' of the RE, but the current position is already the end of the string, so it fails.
4     abcb     Back up, so that [bcd]* matches one character fewer.
5     failure  Try to match the final 'b' again, but the character at the current position is 'd', so it fails.
6     abc      Back up again, so that [bcd]* matches only 'bc'.
7     abcb     Try 'b' again. This time the character at the current position is 'b', so the match succeeds.

Ultimately, the result of the RE match is abcb.
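The result can be confirmed with re.match. The second half of the sketch below is a hand-written imitation of the backing-up steps, written only for illustration; it is not how the real engine is implemented, and the variable names are made up.

```python
import re

text = "abcbd"

# What the real engine reports.
match = re.match(r"a[bcd]*b", text)
print(match.group())  # abcb

# Hand-written imitation of the steps (illustration only).
# Step 1: the leading 'a' matches at position 0.
# Step 2: [bcd]* greedily takes everything it can after the 'a'.
run = re.match(r"[bcd]*", text[1:]).group()

found = None
# Steps 3 to 7: give back one character at a time until a 'b' follows.
for length in range(len(run), -1, -1):
    pos = 1 + length                 # position where the final 'b' must sit
    if text[pos:pos + 1] == "b":
        found = text[:pos + 1]
        break

print(run)    # bcbd
print(found)  # abcb
```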

Note: The default matching rule of regular expressions is greedy, and there are instructions on how to use non-greedy methods to match later.

Another metacharacter that implements repetition is +, which specifies that the previous character is matched one or more times.

Pay special attention to the difference between * and + : * matches zero or more times, so the repeated content may not appear at all; + needs to appear at least once. For example ca+t will match cat and caaat, but not ct.
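A quick side-by-side check of * and +, sketched with re.fullmatch on the strings from the text:

```python
import re

samples = ["ct", "cat", "caaat"]

# * allows zero repetitions of 'a', + requires at least one.
star = [s for s in samples if re.fullmatch(r"ca*t", s)]
plus = [s for s in samples if re.fullmatch(r"ca+t", s)]

print(star)  # ['ct', 'cat', 'caaat']
print(plus)  # ['cat', 'caaat']
```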

There are two more metacharacters that express repetition. The first is the question mark ?, which specifies that the previous character is matched zero or one time. You can think of it as marking something as optional.
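As an illustration of "optional", the pattern home-?brew below (an example pattern chosen for this sketch, not taken from the text above) accepts the word with or without the hyphen:

```python
import re

pattern = re.compile(r"home-?brew")

# The hyphen may appear zero times or once, but not twice.
optional_results = {
    s: bool(pattern.fullmatch(s))
    for s in ["homebrew", "home-brew", "home--brew"]
}

print(optional_results)
# {'homebrew': True, 'home-brew': True, 'home--brew': False}
```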

The most flexible repetition qualifier is {m,n}, where m and n are decimal integers; it can express all of the metacharacters above. It means that the previous character must be matched at least m times and at most n times. For example, a/{1,3}b matches a/b, a//b and a///b. It does not match ab (no slashes), nor a////b (four slashes, one more than allowed).
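The a/{1,3}b example can be verified like this (a minimal sketch using re.fullmatch):

```python
import re

pattern = re.compile(r"a/{1,3}b")

# Between one and three slashes are accepted, nothing else.
slash_results = {
    s: bool(pattern.fullmatch(s))
    for s in ["ab", "a/b", "a//b", "a///b", "a////b"]
}

print(slash_results)
# {'ab': False, 'a/b': True, 'a//b': True, 'a///b': True, 'a////b': False}
```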

You can omit m or n, in which case the engine will assume a reasonable value instead. If m is omitted, it will be interpreted as the lower limit of 0; if n is omitted, it will be interpreted as infinity (in fact, the 2 billion we mentioned above).

Note: {,n} is equivalent to {0,n}; {m,} is equivalent to {m, positive infinity}; and {n} repeats the previous character exactly n times. A very easy mistake is to write {m, n} with a space. It looks prettier, but spaces cannot be added freely inside a regular expression: with the space, Python's re module no longer treats the braces as a repetition and matches them as literal characters instead.
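The cases in this note can be tried one by one. The sketch below relies on the behaviour of Python's re module, where a brace expression that is not a valid repetition is matched literally; other regex engines may treat the stray space differently.

```python
import re

def full(pattern, s):
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# {,n}: lower bound defaults to 0.
upper_only = [full(r"a{,2}", s) for s in ["", "aa", "aaa"]]

# {m,}: no upper bound.
lower_only = [full(r"a{2,}", s) for s in ["a", "aa", "aaaaa"]]

# {n}: exactly n repetitions.
exact_count = [full(r"a{3}", s) for s in ["aa", "aaa", "aaaa"]]

# {m, n} with a space is not a repetition in Python's re; it is literal text.
spaced_repeat = full(r"a{1, 3}", "aa")
spaced_literal = full(r"a{1, 3}", "a{1, 3}")

print(upper_only)      # [True, True, False]
print(lower_only)      # [False, True, True]
print(exact_count)     # [False, True, False]
print(spaced_repeat)   # False
print(spaced_literal)  # True
```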

Finally, * , + and ? can all be replaced by {m,n}. {0,} is the same as * ; {1,} is the same as +; {0,1} is the same as ? . However, you are encouraged to remember and use * , + and ? as these characters are shorter and easier to read.
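These equivalences can be spot-checked by comparing each short form with its {m,n} form on a few sample strings. This is a sketch that tests only a handful of inputs, not a proof.

```python
import re

samples = ["", "a", "aa", "aaa", "b"]
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]

# For every pair, both patterns should accept exactly the same samples.
all_agree = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(braced, s))
    for short, braced in pairs
    for s in samples
)

print(all_agree)  # True
```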

Note: In Python's re module, * + ? and their {m,n} equivalents describe the same repetition, so the choice is mainly about readability rather than speed.

(End of this article)

Next: Detailed explanation of Python3 regular expressions (2)

