Python regular expressions: the re module explained

The concept of regular expressions
Regular expressions are a computer science concept, usually used to find and replace text that follows a certain rule. A regular expression is a kind of logical formula for string manipulation: it uses predefined rule strings to filter and process strings.

A regular expression is essentially a small, highly specialized programming language. In Python, regular expressions are implemented by the re module. You first write a pattern that specifies the set of strings to match, and then use the re module to search, modify, or split strings with it.

The pattern is first compiled into a series of bytecodes, which are then executed by a matching engine written in C, so in some cases it is faster than equivalent string-processing code written in Python. However, not every string-matching task suits regular expressions. Even when a single expression can do the job, it may become very complicated and hard to read; in those cases it is better to write plain Python code.

Composition
Regular expressions are composed of two types of characters: metacharacters, which have a special meaning in regular expressions, and ordinary characters.

Characters and syntax:

Syntax | Description | Example pattern | Example match
Ordinary characters | Match themselves | abc | abc
. | Matches any character except newline (\n) | a.c | abc
\ | Escape character: the next character is taken literally | a\.c | a.c
[abcd] | Matches a, b, c, or d | [abc] | a
[0-9] | Matches any digit from 0 to 9; equivalent to [0123456789] | [0-3] | 1
[\u4e00-\u9fa5] | Matches any Chinese character in this range | [\u4e00-\u9fa5] | 中
[^a0=] | Matches any character except a, 0, and = | [^abc] | d
[^a-z] | Matches any character except lowercase letters | [^a-z] | A
\d | Matches any digit; equivalent to [0-9] for ASCII text | a\dc | a6c
\D | Matches any non-digit character; equivalent to [^0-9] for ASCII text | a\Dc | abc
\s | Matches any whitespace character; equivalent to [ \r\n\f\t\v] for ASCII text | a\sc | a c
\S | Matches any non-whitespace character; equivalent to [^ \r\n\f\t\v] for ASCII text | a\Sc | aYc
\w | Matches any letter, digit, or underscore; equivalent to [a-zA-Z0-9_] for ASCII text | a\wc | a_c
\W | Matches any character that is not a letter, digit, or underscore; equivalent to [^a-zA-Z0-9_] for ASCII text | a\Wc | a*c
* | Matches the previous item 0 or more times | a*c | c
+ | Matches the previous item 1 or more times | a+c | aaaac
? | Matches the previous item 0 or 1 time | a?c | ac
{m} | Matches the previous item exactly m times | a{3}c | aaac
{m,n} | Matches the previous item m to n times; m or n can be omitted and default to 0 and unlimited respectively | a{1,2}c | aac
^ | Matches the start of the string; consumes no characters | ^abc | abc
$ | Matches the end of the string; consumes no characters | abc$ | abc
| | Alternation: matches the expression on the left or the one on the right | abc|def | def
(...) | Capturing group | (abc){2} | abcabc
(?P<name>...) | Capturing group that gets the alias name in addition to its number | (?P<id>abc){2} | abcabc
\<number> | Backreference: matches the text captured by group <number> | (\d)abc\1 | 1abc1
(?:...) | Non-capturing group; can be followed by a quantifier | (?:abc){2} | abcabc
(?iLmsux) | Inline flags: each of the letters i, L, m, s, u, x turns on a matching mode; only allowed at the start of the pattern; several can be combined | (?i)abc | AbC
(?#...) | Comment: the content after # is ignored | a(?#test)bc | abc
(?(id/name)yes-pattern|no-pattern) | If the group with the given id or name matched, yes-pattern must match, otherwise no-pattern must match; similar to the ternary operator | (\d)abc(?(1)\d|abc) | 1abc2

The rules above each cover a single kind of match. Real applications combine many of them, so it is best to master them from the start in order to use Python proficiently. Memorizing the rules directly is boring, though, so the rest of this article explains them together with Python's re module.
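
To make a few rows of the table concrete, here is a minimal sketch that checks some of the example patterns against their example strings. It uses re.fullmatch() so that a pattern only counts as matching when it covers the whole string; the list name examples and the extra non-matching cases are illustrative additions, not part of the original table.

```python
import re

# (pattern, text, whether the pattern is expected to match the whole text)
examples = [
    (r"a.c", "abc", True),                 # . matches any character except newline
    (r"a\.c", "abc", False),               # an escaped dot only matches a literal "."
    (r"a\.c", "a.c", True),
    (r"[^a-z]", "A", True),                # negated character class
    (r"a{1,2}c", "aac", True),             # bounded repetition
    (r"(?P<id>abc){2}", "abcabc", True),   # named group, repeated twice
    (r"(\d)abc\1", "1abc1", True),         # backreference to group 1
    (r"(\d)abc\1", "1abc2", False),        # the digit after abc must repeat group 1
    (r"(?i)abc", "AbC", True),             # inline case-insensitive flag
    (r"a(?#test)bc", "abc", True),         # inline comment is ignored
    (r"(\d)abc(?(1)\d|abc)", "1abc2", True),  # conditional on group 1
]

results = [re.fullmatch(p, t) is not None for p, t, _ in examples]

for (pattern, text, _), matched in zip(examples, results):
    print(f"{pattern!r:28} {text!r:10} matched={matched}")
```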

Using the re module
1. Viewing the version
The re module has been part of Python since version 1.5 and provides Perl-style regular expression patterns.
The re module is built into Python, so it can be imported directly. Use the attribute __version__ to view the version and the attribute __all__ to view the public names:

import re
print(re.__version__)
print(re.__all__)
# The output is as follows:
2.2.1
['match','fullmatch','search','sub','subn','split','findall','finditer','compile','purge','template','escape','error','Pattern','Match','A','I','L','M','S','X','U','ASCII','IGNORECASE','LOCALE','MULTILINE','DOTALL','VERBOSE','UNICODE']

The output shows that the re module does not expose many functions. They fall into three groups: finding patterns in text, compiling expressions, and matching multiple times. Some constants are defined as well.

2. search(): searching
Finding a pattern in text is mainly done with the search() function. It takes three parameters: pattern, string, and flags.

pattern is the regular expression string.
string is the string to search.
flags is the compilation flag, which modifies how the regular expression matches, for example case-insensitive or multi-line matching. The default value is 0.
The commonly used flag values are as follows:
Flag | Meaning
re.S (DOTALL) | makes "." match every character, including line breaks
re.I (IGNORECASE) | makes matching case-insensitive
re.L (LOCALE) | makes matching depend on the current locale
re.M (MULTILINE) | multi-line matching; affects ^ and $
re.X (VERBOSE) | allows a more flexible layout so that the regex is easier to read
re.U (UNICODE) | interprets characters according to the Unicode character set; affects \w, \W, \b, \B (already the default for str patterns in Python 3)

The re.search() function takes the pattern and the text to scan as input and returns a match object. If the pattern is not found, it returns None.
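
As a small illustration of search() and its flags, the sketch below runs a few searches on a two-line string. The sample text and variable names are made up for this example.

```python
import re

text = "Hello\nWorld"

# Without flags the search is case-sensitive, so "world" is not found.
no_flag = re.search(r"world", text)

# re.I makes the match case-insensitive.
ignore_case = re.search(r"world", text, re.I)

# re.S lets "." match the newline; several flags are combined with "|".
dot_all = re.search(r"hello.world", text, re.I | re.S)

# re.M lets ^ match at the start of every line, not only the first.
multi_line = re.search(r"^World", text, re.M)

print(no_flag)                 # None
print(ignore_case.group())     # World
print(repr(dot_all.group()))   # 'Hello\nWorld'
print(multi_line.start())      # 6
```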

The match object returned by search() contains information about the match, such as where the pattern occurs in the original string. It provides methods such as start(), end(), group(), span(), and groups():

start() returns the start position of the match
end() returns the end position of the match
group() returns the matched string
span() returns a tuple (start, end) containing the position of the match
groups() returns a tuple containing the strings of all groups in the expression, from group 1 up to the last group; it is usually called without arguments
In addition, group(n, m) returns a tuple of the strings matched by groups n and m.
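
The methods listed above can be tried on a single match object. This is a minimal sketch; the sample sentence and the date-like pattern are assumptions chosen only for illustration.

```python
import re

# Two groups: a four-digit year and a two-digit month.
match = re.search(r"(\d{4})-(\d{2})", "Released on 2021-03, updated later")

print(match.group())      # 2021-03
print(match.start())      # 12
print(match.end())        # 19
print(match.span())       # (12, 19)
print(match.groups())     # ('2021', '03')
print(match.group(1, 2))  # ('2021', '03')
```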


3. compile(): pre-compilation
Use the function compile() to compile a regular expression into a pattern object, which improves efficiency when the expression is reused. The function takes two parameters, pattern and flags=0, with the same meaning as in search() above, and returns a pattern object.

Compiled expressions are usually the ones a program uses frequently, so compiling them once is more efficient, at the cost of some memory for the cached objects. Another advantage is that expressions compiled at module level are compiled when the module is loaded, rather than when the program is responding to user actions.
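
A minimal sketch of this idea: the pattern is compiled once at module level and then reused inside a function. The names DATE_PATTERN and extract_dates are hypothetical and exist only for this example.

```python
import re

# Compiled once when the module is loaded, then reused on every call.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def extract_dates(lines):
    """Return the (year, month, day) groups of the first date in each line."""
    found = []
    for line in lines:
        match = DATE_PATTERN.search(line)  # pattern objects have search(), match(), etc.
        if match:
            found.append(match.groups())
    return found

dates = extract_dates(["start 2021-03-04", "no date here", "end 2021-12-31"])
print(dates)  # [('2021', '03', '04'), ('2021', '12', '31')]
```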

4. match(): matching at the start
Use the function match() to match at the beginning of a string. This is not a full match: it only checks whether the pattern matches at the start of the string and does not require the rest of the string to match. Occurrences of the pattern later in the string are ignored.
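
The difference between match() and the other lookup functions is easiest to see side by side. In this small sketch the sample string is made up, and fullmatch() is included for contrast because it requires the whole string to match.

```python
import re

text = "abc123"

at_start = re.match(r"abc", text)      # pattern occurs at position 0
not_start = re.match(r"123", text)     # "123" exists, but not at the start
anywhere = re.search(r"123", text)     # search() scans the whole string
whole = re.fullmatch(r"abc", text)     # fullmatch() needs the entire string to match

print(at_start.group())  # abc
print(not_start)         # None
print(anywhere.span())   # (3, 6)
print(whole)             # None
```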


5. findall() and finditer(): iterating over matches
Use the function findall() to get every match in a string as a list. It takes the same parameters as search(), but returns all non-overlapping matching substrings.

The function finditer() is used in the same way as findall(), except that it returns an iterator instead of a list. The iterator yields Match instances.
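
A short sketch of both functions on the same made-up input: findall() returns the matched strings directly, while finditer() yields match objects from which positions can also be read.

```python
import re

text = "a1b22c333"

# findall() returns a list of the matched substrings.
numbers = re.findall(r"\d+", text)

# finditer() yields Match objects, so positions are available as well.
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

# Matches never overlap: "aaaa" contains two non-overlapping "aa", not three.
pairs = re.findall(r"aa", "aaaa")

print(numbers)  # ['1', '22', '333']
print(spans)    # [('1', (1, 2)), ('22', (3, 5)), ('333', (6, 9))]
print(pairs)    # ['aa', 'aa']
```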



6. split(): splitting
The function split() splits a string at each substring that matches the pattern and returns a list.
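
The following sketch shows split() with a character class as separator, with a limit on the number of splits, and with a capturing group that keeps the separators in the result. The sample strings are invented for illustration.

```python
import re

text = "apple, banana;cherry  date"

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r"[,;\s]+", text)

# maxsplit limits the number of splits; the rest of the string stays intact.
limited = re.split(r"[,;\s]+", text, maxsplit=1)

# If the pattern contains a capturing group, the separators are returned too.
with_separators = re.split(r"([,;])", "a,b;c")

print(parts)            # ['apple', 'banana', 'cherry', 'date']
print(limited)          # ['apple', 'banana;cherry  date']
print(with_separators)  # ['a', ',', 'b', ';', 'c']
```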

7. sub() and subn(): replacement
The function sub() replaces each substring of string that matches pattern with repl and returns the resulting string. Its format is re.sub(pattern, repl, string, count, flags). The function subn() additionally returns the number of replacements made.

pattern is the expression string
repl is the replacement string
string is the string used for matching
count is the maximum number of replacements; matches beyond this number are left unchanged
flags has the same meaning as above
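
The parameters above can be illustrated with a few replacements on one made-up string. The backreferences \1, \2, \3 in the replacement string refer to the groups of the pattern.

```python
import re

text = "2021-03-04 and 2022-05-06"

# Replace every digit.
masked = re.sub(r"\d", "#", text)

# count=1 replaces only the first match.
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)

# The replacement may reference groups captured by the pattern.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# subn() returns a tuple: (new string, number of replacements).
result, replacements = re.subn(r"-", "/", text)

print(masked)                # ####-##-## and ####-##-##
print(first_only)            # YYYY-03-04 and 2022-05-06
print(swapped)               # 04/03/2021 and 06/05/2022
print(result, replacements)  # 2021/03/04 and 2022/05/06 4
```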


Commonly used regular expressions
Numbers
These expressions validate numbers that appear in text. Some commonly used ones are explained below and can be tried out with the re module.

^[0-9]*$
As mentioned above, ^ matches the start of the string; $ matches the end of the string; [0-9] matches any digit; * matches the previous item 0 or more times. These are not explained again below; refer to the table above if you are unsure.
In summary, this expression matches a string made up only of digits (including the empty string).


^[1-9]\d*$ matches positive integers without a leading zero


^\d{n}$ matches exactly n digits


^\d{n,}$ matches at least n digits


^\d{m,n}$ matches numbers with m to n digits


^[1-9][0-9]*(\.[0-9]{1,2})?$ matches numbers that do not start with 0 and have up to two decimal places
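
The numeric patterns can be checked with a small helper. This is a sketch under some assumptions: the helper name is_valid is hypothetical, n and m are replaced by concrete values (3, and 2 to 4), and the decimal pattern is written with an escaped dot and an optional fractional part, which is my reading of the intended rule.

```python
import re

def is_valid(pattern, text):
    """Return True if the anchored pattern matches the text."""
    return re.match(pattern, text) is not None

DIGITS = r"^[0-9]*$"
POSITIVE_INT = r"^[1-9]\d*$"
EXACTLY_3 = r"^\d{3}$"          # ^\d{n}$ with n = 3
AT_LEAST_3 = r"^\d{3,}$"        # ^\d{n,}$ with n = 3
FROM_2_TO_4 = r"^\d{2,4}$"      # ^\d{m,n}$ with m = 2, n = 4
DECIMAL = r"^[1-9][0-9]*(\.[0-9]{1,2})?$"  # assumed form, see the note above

print(is_valid(DIGITS, "12345"), is_valid(DIGITS, "12a45"))        # True False
print(is_valid(POSITIVE_INT, "102"), is_valid(POSITIVE_INT, "012"))  # True False
print(is_valid(EXACTLY_3, "123"), is_valid(EXACTLY_3, "1234"))     # True False
print(is_valid(AT_LEAST_3, "12345"), is_valid(AT_LEAST_3, "12"))   # True False
print(is_valid(FROM_2_TO_4, "123"), is_valid(FROM_2_TO_4, "12345"))  # True False
print(is_valid(DECIMAL, "12.34"), is_valid(DECIMAL, "12.345"))     # True False
```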


Characters
Text analysis often involves character expressions, for example to extract Chinese characters or to filter by character length.

[\u4e00-\u9fa5] matches Chinese characters


^[A-Za-z0-9]+$ matches English letters and digits


^[\u4E00-\u9FA5A-Za-z0-9]+$ matches Chinese characters, English letters, and digits


Other
^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$ E-mail address verification


^1[34589]\d{9}$ Chinese mobile phone number


^[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$ Chinese 18-digit ID number


^[1-9]\d{5}(?!\d)$ Postal code


(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$ Domain name
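
These validation patterns can be exercised as follows. This is a sketch under stated assumptions: the patterns are written as I read them, with stray spaces and doubled backslashes normalized; the dictionary name PATTERNS and the helper check are hypothetical; all sample values, including the ID number, are made up and do not belong to real people.

```python
import re

# Patterns as raw strings so that backslashes reach the regex engine unchanged.
PATTERNS = {
    "email": r"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$",
    "mobile": r"^1[34589]\d{9}$",
    "id_number": (
        r"^[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))"
        r"(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$"
    ),
    "postal_code": r"^[1-9]\d{5}(?!\d)$",
    "domain": r"(?i)^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$",
}

def check(kind, value):
    """Return True if value passes the pattern registered under kind."""
    return re.match(PATTERNS[kind], value) is not None

print(check("email", "user.name+tag@example.com"), check("email", "user@@example.com"))
print(check("mobile", "13812345678"), check("mobile", "12812345678"))
print(check("id_number", "11010519900307123X"), check("id_number", "1101051990030712"))
print(check("postal_code", "100000"), check("postal_code", "012345"))
print(check("domain", "Example-Site.co.uk"), check("domain", "-bad.com"))
```

Each line prints True for the first value and False for the second. Patterns like these only check the shape of the input, not whether the address, number, or domain actually exists.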


Summary
The most important use of the re module is filtering: picking the required data out of the target text and, by combining functions, extracting data with any given characteristics from strings. This is the basis for parsing data in later Python crawlers.

Origin blog.csdn.net/qq_26280383/article/details/114367775