Python regular expressions: a summary

1. Regular expression syntax

1.1 Characters and character classes

(1) Special characters: \ . ^ $ ? + * { } [ ] ( ) |

To use any of the special characters above as a literal, escape it with \.

(2) Character class

  • One or more characters enclosed in [] form a character class. Without a quantifier, a character class matches exactly one character from the class.
  • A range can be specified within a character class, such as [a-zA-Z0-9], which matches any character from a to z, A to Z, or 0 to 9.
  • A ^ immediately after the left square bracket negates the character class; for example, [^0-9] matches any non-digit character.
  • Inside a character class, most special characters lose their special meaning and stand for themselves. The exceptions are \, ^, - and ]: ^ in the first position means negation and is a literal anywhere else; - between two characters denotes a range and is a literal when it is the first character in the class.
  • Shorthands such as \d \s \w can be used inside character classes.
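
The character class rules above can be checked with a few calls to re.findall. This is only a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# [0-9] is a range; without a quantifier the class matches one character at a time.
digits = re.findall(r"[0-9]", "a1b22")

# A leading ^ negates the class: every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "a1b22")

# - in the first position and ^ away from the first position are literals.
literals = re.findall(r"[-a^]", "a-b^c")

# Shorthands such as \d and \s can be used inside a class.
digit_or_space = re.findall(r"[\d\s]", "7 x")

print(digits)          # ['1', '2', '2']
print(non_digits)      # ['a', 'b']
print(literals)        # ['a', '-', '^']
print(digit_or_space)  # ['7', ' ']
```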

(3) Shorthands

  • . matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline.
  • \d matches a Unicode digit, or only 0-9 with re.ASCII.
  • \D matches any character that is not a Unicode digit, or anything other than 0-9 with re.ASCII.
  • \s matches Unicode whitespace, or one of space, \t \n \r \f \v with re.ASCII.
  • \S matches any character that is not Unicode whitespace.
  • \w matches a Unicode word character, or one of [a-zA-Z0-9_] with re.ASCII.
  • \W matches any character that is not a Unicode word character.
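
To see the effect of re.ASCII and re.DOTALL on the shorthands, here is a minimal sketch. The sample text uses the Arabic-Indic digit three (U+0663) purely as an illustration of a non-ASCII Unicode digit.

```python
import re

sample = "3\u0663"  # ASCII digit 3 followed by ARABIC-INDIC DIGIT THREE

# By default \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d", sample)

# With re.ASCII, \d is restricted to 0-9.
ascii_digits = re.findall(r"\d", sample, re.ASCII)

# . does not match a newline unless re.DOTALL is set.
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"a.b", "a\nb", re.DOTALL)

print(len(unicode_digits), ascii_digits)  # 2 ['3']
print(dot_default, dot_all)               # [] ['a\nb']
```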

1.2 Quantifiers

  • ? matches the preceding expression 0 or 1 times.
  • * matches the preceding expression 0 or more times.
  • + matches the preceding expression 1 or more times.
  • {m} matches the preceding expression exactly m times.
  • {m,} matches the preceding expression at least m times.
  • {,n} matches the preceding expression from 0 up to n times.
  • {m,n} matches the preceding expression at least m and at most n times.

1.3 Laziness and Greed

All of the quantifiers above are greedy: they match as much as possible. To make a quantifier non-greedy (lazy), follow it with a ?, as in *?, +?, ?? or {m,n}?; it then matches as little as possible.
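
The difference between greedy and lazy matching is easiest to see on text with repeated delimiters. A small sketch, with a made-up sample string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last > in the string, giving one long match.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first > it can, giving one match per tag.
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```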

1.4 Group and Capture

(1) The functions of ():

Capture the text matched by the regular expression inside () for further processing. To turn off capturing for a pair of parentheses, follow the opening parenthesis with ?:, as in (?:...).

Group parts of a regular expression so that a quantifier or | applies to the group as a whole.

(2) Backreferences to the content captured by a preceding ():

Backreferences by number:

Each pair of parentheses that does not use ?: is assigned a number, starting from 1 and increasing from left to right in the order of the opening parentheses. Use \i to refer to the content captured by group number i.

Backreferences by group name:

A group can be named by following the opening parenthesis with ?P and the name in angle brackets, as in (?P<name>...); (?P=name) then refers to the previously captured content. For example, (?P<word>\w+)\s+(?P=word) matches repeated words.

(3) Note:

Backreferences cannot be used in character classes [].
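
Here is a short sketch of numbered and named backreferences applied to the repeated-word case. The sample sentence is invented, and the \b anchors are an addition so that only whole words are compared.

```python
import re

text = "this is is a test test sentence"

# \1 refers back to whatever group 1 captured.
by_number = re.findall(r"\b(\w+)\s+\1\b", text)

# The same idea with a named group and (?P=word).
named_rx = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
by_name = [m.group("word") for m in named_rx.finditer(text)]

# (?:...) groups without capturing, so (c) is still group 1.
only_group = re.search(r"(?:ab)+(c)", "ababc").group(1)

print(by_number)   # ['is', 'test']
print(by_name)     # ['is', 'test']
print(only_group)  # c
```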

1.5 Assertions and flags

Assertions don't match any text; they only impose constraints on the text at the position where the assertion appears.

(1) Common assertions:

  • \b matches at a word boundary; inside a character class [] it represents a backspace instead.
  • \B matches at a non-word boundary. Both \b and \B are affected by the re.ASCII flag.
  • \A matches at the beginning of the string.
  • ^ matches at the beginning of the string, and also after each newline if the MULTILINE flag is present.
  • \Z matches at the end of the string.
  • $ matches at the end of the string (or just before a trailing newline) and, if the MULTILINE flag is present, before each newline.
  • (?=e) Positive lookahead.
  • (?!e) Negative lookahead.
  • (?<=e) Positive lookbehind.
  • (?<!e) Negative lookbehind.

(2) Explanation of lookahead and lookbehind

  • Positive lookahead: exp1(?=exp2) The text after exp1 must match exp2.
  • Negative lookahead: exp1(?!exp2) The text after exp1 must not match exp2.
  • Positive lookbehind: (?<=exp2)exp1 The text before exp1 must match exp2.
  • Negative lookbehind: (?<!exp2)exp1 The text before exp1 must not match exp2.

For example, to find hello only when it is followed by world, the regular expression can be written as "(hello)\s+(?=world)". Against "hello wangxing" it finds nothing. Against "hello world" it matches "hello " (the word plus the whitespace consumed by \s+), and group 1 holds just hello. The lookahead itself consumes no text, so world is never part of the match.
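
The four lookaround forms can be tried out as follows. This is a sketch on made-up strings; the price example is an illustration that does not come from the text above.

```python
import re

# The pattern from the text: the lookahead checks for "world" without consuming it.
m = re.search(r"(hello)\s+(?=world)", "hello world")
whole, captured = m.group(), m.group(1)   # 'hello ' and 'hello'
no_match = re.search(r"(hello)\s+(?=world)", "hello wangxing")  # None

# Negative lookahead: hello NOT followed by whitespace and world.
neg_ahead = re.findall(r"hello(?!\s+world)", "hello world, hello wangxing")

# Positive lookbehind: digits directly preceded by a dollar sign.
behind = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
neg_behind = re.findall(r"(?<!\$)\b\d+", "cost $15, qty 3")

print(repr(whole), captured, no_match)  # 'hello ' hello None
print(neg_ahead, behind, neg_behind)    # ['hello'] ['15'] ['3']
```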

1.6 Conditional matching

(?(id)yes_exp|no_exp): If the group identified by id (a group number or name) has matched, then yes_exp must match at this point; otherwise no_exp must match. The |no_exp part is optional.
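
A conditional is handy when a closing delimiter is required only if the opening one was present. The sketch below uses a deliberately loose e-mail pattern, assumed here only for illustration: a closing > is required exactly when an opening < was captured by group 1.

```python
import re

# Group 1 optionally captures "<". (?(1)>|$) then demands ">" if group 1
# matched, and the end of the string otherwise.
rx = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {s: rx.match(s) is not None for s in candidates}

for s, ok in results.items():
    print(s, ok)
```

Only the first two candidates match: the brackets are either both present or both absent.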

1.7 Flags for regular expressions

(1) There are two ways to set regular expression flags:

  • Pass the flags as an argument to compile(); multiple flags are combined with |, such as re.compile(r"#[\da-f]{6}\b", re.IGNORECASE|re.MULTILINE).
  • Embed the flags in the pattern by writing (?flags) at the very start of the regular expression, such as (?ms)#[\da-z]{6}\b.
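
Both ways of setting flags should behave the same. A minimal sketch comparing them on an invented two-line string:

```python
import re

text = "color: #FFAA00;\nborder: #12ab9f"

# Flags passed as an argument, combined with |.
by_argument = re.compile(r"#[\da-f]{6}\b", re.IGNORECASE | re.MULTILINE)

# The same flags embedded at the start of the pattern.
inline = re.compile(r"(?im)#[\da-f]{6}\b")

found_by_argument = by_argument.findall(text)
found_inline = inline.findall(text)

print(found_by_argument)  # ['#FFAA00', '#12ab9f']
print(found_inline)       # ['#FFAA00', '#12ab9f']
```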

(2) Commonly used flags

  • re.A or re.ASCII makes \b \B \s \S \w \W \d \D match ASCII only instead of full Unicode.
  • re.I or re.IGNORECASE makes the regular expression ignore case.
  • re.M or re.MULTILINE enables multi-line matching: ^ also matches after each newline, and $ also matches before each newline.
  • re.S or re.DOTALL makes . match any character, including a newline.
  • re.X or re.VERBOSE lets the regular expression span multiple lines and contain comments. Whitespace in the pattern is ignored in this mode, so literal whitespace must be written as \s or [ ]. For example:
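import re
img_rx = re.compile(r"""
<img\s+                        # start of the tag
[^>]*?                         # any attributes that are not src
src=                           # start of the src attribute
(?:
(?P<quote>["'])                # opening quote
(?P<image_name>[^>]+?)         # image name (\1 cannot be used inside [])
(?P=quote)                     # closing quote, same as the opening quote
)
""", re.VERBOSE | re.IGNORECASE)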

2. Regular expression module

2.1 The four main uses of regular expressions on strings

  • Match: check whether a string conforms to a regular expression; the result is generally true or false.
  • Extract: use a regular expression to pull the required text out of a string.
  • Replace: find text in a string that matches a regular expression and replace it with another string.
  • Split: use a regular expression to split a string.

2.2 Two ways to use regular expressions with the re module

  • Call re.compile(r, f) to build a regular expression object from pattern r and flags f, then call methods on that object. The advantage is that the object can be reused many times after it is created.
  • Every method of a regular expression object has a corresponding module-level function in re; the only difference is that its first argument is the regular expression string. This approach suits regular expressions that are used only once.
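
The two styles look like this side by side. A small sketch; word_rx and the sample lines are hypothetical names chosen for illustration.

```python
import re

# Style 1: compile once, reuse the object many times.
word_rx = re.compile(r"\w+")
lines = ["alpha beta", "gamma"]
compiled_counts = [len(word_rx.findall(line)) for line in lines]

# Style 2: module-level function, pattern string passed as the first argument.
one_off = re.findall(r"\w+", "delta epsilon")

print(compiled_counts)  # [2, 1]
print(one_off)          # ['delta', 'epsilon']
```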

2.3 Common methods of regular expression objects

(1) rx.findall(s, start, end):

Returns a list. If the regular expression has no groups, the list contains every match as a string. If it has exactly one group, the list contains the text matched by that group. If it has more than one group, each element is a tuple holding the text matched by each group; the text matched by the whole expression is not returned.
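
How the shape of the findall result depends on the number of groups can be seen in this sketch (sample string invented):

```python
import re

s = "x=1, y=22"

no_group = re.findall(r"\w=\d+", s)        # whole matches
one_group = re.findall(r"\w=(\d+)", s)     # just the group's text
two_groups = re.findall(r"(\w)=(\d+)", s)  # one tuple per match

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```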

(2) rx.finditer(s, start, end):

Returns an iterator.

Each iteration yields a match object. Call the match object's group() method to see the text matched by a given group; 0 means the text matched by the whole regular expression.
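
Unlike findall, finditer gives access to the whole match as well as each group and its position. A minimal sketch, with group names chosen for illustration:

```python
import re

rx = re.compile(r"(?P<key>\w)=(?P<value>\d+)")

found = []
for m in rx.finditer("x=1, y=22"):
    # group(0) is the whole match; groups can be read by name or by number.
    found.append((m.group(0), m.group("key"), m.group(2), m.start()))

print(found)  # [('x=1', 'x', '1', 0), ('y=22', 'y', '22', 5)]
```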

(3) rx.search(s, start, end):

Returns a match object, or None if no match is found.

search stops at the first match and does not look for further matches.

(4) rx.match(s, start, end):

Returns a match object if the regular expression matches at the beginning of the string, otherwise returns None.
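
The difference between search and match, and the effect of the start argument, can be sketched like this (sample string invented):

```python
import re

rx = re.compile(r"\d+")
s = "abc 123 def 456"

first = rx.search(s)       # scans forward and stops at the first match
at_start = rx.match(s)     # None: the string does not begin with digits
from_pos = rx.match(s, 4)  # start=4: matching begins at index 4

print(first.group(), first.span())  # 123 (4, 7)
print(at_start)                     # None
print(from_pos.group())             # 123
```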

(5) rx.sub(x, s, m):

Returns a string in which each match in s is replaced with x. If m is specified, at most m replacements are made. In x, use \i or \g<id>, where id is a group number or group name, to refer to captured content.

x can also be a function, both in rx.sub(x, s, m) and in the module function re.sub(r, x, s, m). The function is called with the match object for each match, and its return value is used as the replacement text.
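
The replacement forms described above can be sketched as follows. The names and sample strings are made up; double is a hypothetical helper.

```python
import re

name_rx = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

# Refer to groups by number or by name in the replacement string.
swapped_num = name_rx.sub(r"\2 \1", "Ada Lovelace")
swapped_name = name_rx.sub(r"\g<last>, \g<first>", "Ada Lovelace")

# Limit the number of replacements with count.
limited = re.sub(r"\d", "#", "a1b2c3", count=2)

# A function receives each match object and returns the replacement text.
def double(m):
    return str(int(m.group()) * 2)

doubled = re.compile(r"\d+").sub(double, "3 apples, 10 pears")

print(swapped_num)   # Lovelace Ada
print(swapped_name)  # Lovelace, Ada
print(limited)       # a#b#c3
print(doubled)       # 6 apples, 20 pears
```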

(6) rx.subn(x, s, m):

Same as sub(), except that it returns a 2-tuple: the result string and the number of replacements made.
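
A quick sketch of subn, collapsing runs of whitespace in an invented string:

```python
import re

# subn returns (new_string, number_of_replacements).
result, count = re.subn(r"\s+", " ", "a   b \t c")

print(repr(result), count)  # 'a b c' 2
```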

(7) rx.split(s, m): split a string

Returns a list.

Splits the string at each place the regular expression matches. If m is specified, at most m splits are made.

If the regular expression contains groups, the text captured by each group is also placed in the list, between the pieces it separates, such as:

import re
rx =  re.compile(r"(\d)[a-z]+(\d)")
s =  "ab12dk3klj8jk9jks5"
result =  rx.split(s)
print(result)
# returns ['ab1', '2', '3', 'klj', '8', '9', 'jks5']

(8) rx.flags: The flags set when the regular expression was compiled. This is an attribute, not a method.

(9) rx.pattern: The string the regular expression was compiled from. This is also an attribute.

(10) m.group(g, …)

Returns the text matched by the group with the given number or name. The default, 0, means the text matched by the whole expression. If more than one group is specified, a tuple is returned.

(11) m.groupdict(default)

Returns a dictionary. The keys are the names of all named groups, and the values are whatever each named group captured.

If default is given, it is used as the value for groups that did not participate in the match.

(12) m.groups(default)

Returns a tuple containing the text captured by every group, starting from group 1. If default is given, it is used as the value for groups that did not participate in the match.
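
The default argument of groups() and groupdict() matters when an optional group did not take part in the match. A sketch, with the group names whole and frac chosen for illustration:

```python
import re

# The fractional part is optional, so the frac group may not participate.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "42")

entire = m.group()               # same as m.group(0)
pair = m.group("whole", "frac")  # several groups at once give a tuple

print(entire, pair)       # 42 ('42', None)
print(m.groups())         # ('42', None)
print(m.groups("0"))      # ('42', '0')
print(m.groupdict())      # {'whole': '42', 'frac': None}
print(m.groupdict("0"))   # {'whole': '42', 'frac': '0'}
```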

(13) m.lastgroup

An attribute: the name of the last capturing group that matched, or None if that group has no name or no group matched (not commonly used).

(14) m.lastindex

An attribute: the number of the last capturing group that matched, or None if no group matched.

(15) m.start(g):

The position in the string where the match of group g starts (g defaults to 0, the whole match). Returns -1 if the group did not participate in the match.

(16) m.end(g)

The position in the string where the match of group g ends, that is, one past its last character. Returns -1 if the group did not participate in the match.

(17) m.span(g)

Returns a 2-tuple whose contents are the return values of m.start(g) and m.end(g) respectively.
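
Positions and the last matched group can be inspected as in this sketch. The pattern has an optional second group that does not take part in the match; the sample string is invented.

```python
import re

m = re.search(r"(?P<word>[a-z]+)(\d+)?", "  abc  ")

whole_span = m.span()       # (start, end) of the whole match
word_span = m.span("word")  # the same positions for the named group
missing_start = m.start(2)  # group 2 did not participate
missing_span = m.span(2)

print(m.start(), m.end(), whole_span)  # 2 5 (2, 5)
print(word_span)                       # (2, 5)
print(missing_start, missing_span)     # -1 (-1, -1)
print(m.lastindex, m.lastgroup)        # 1 word
```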

(18) m.re

An attribute: the regular expression object that produced this match object.

(19) m.string

An attribute: the string that was passed to match or search.

(20) m.pos

An attribute: the starting position of the search, that is, the beginning of the string or the position given by start (not commonly used).

(21) m.endpos

An attribute: the end position of the search, that is, the end of the string or the position given by end (not commonly used).
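
These match object attributes simply record how the match was produced. A brief sketch, using search with explicit start and end positions on an invented string:

```python
import re

rx = re.compile(r"\d+")
s = "ab 12 cd 34"

# Search only within s[5:11], which skips the first number.
m = rx.search(s, 5, 11)

print(m.group())          # 34
print(m.pos, m.endpos)    # 5 11
print(m.re is rx)         # True
print(m.string == s)      # True
```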

2.4 Summary

  • Matching: Python has no method that returns true or false directly, but you can test whether the return value of match or search is None.
  • Extracting: for a single search, use the match object returned by search or match. For multiple matches, iterate over the iterator returned by finditer.
  • Replacing: use the sub or subn method of a regular expression object, or the sub or subn function of the re module. In both cases the replacement can be a string or a function that generates the replacement text.
  • Splitting: use the split method of the regular expression object. Note that if the regular expression contains groups, the text captured by the groups is also placed in the returned list.
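
For the first point, a true/false test is just a comparison with None. A minimal sketch:

```python
import re

# search looks anywhere in the string; match only at the beginning.
has_digit = re.search(r"\d", "abc1") is not None
starts_with_digit = re.match(r"\d", "abc1") is not None

print(has_digit, starts_with_digit)  # True False
```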

Appendix: The meaning of characters commonly used in regular expressions

A regular expression is itself a small, highly specialized programming language. In Python it is available through the built-in re module, which can be called directly to perform regular expression matching. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C.

1. Common characters and 11 metacharacters

Each entry below gives the special character, what it does, an example expression, and text that the example matches.

  • . Matches any character except the newline character "\n"; in DOTALL mode it also matches the newline. Example: a.c matches abc.
  • \ Escape character: changes the meaning of the character that follows it. Example: a\.c matches a.c; a\\c matches a\c.
  • * Matches the previous character 0 or more times. Example: abc* matches ab and abccc.
  • + Matches the previous character 1 or more times. Example: abc+ matches abc and abccc.
  • ? Matches the previous character 0 or 1 times. Example: abc? matches ab and abc.
  • ^ Matches the beginning of the string, or the beginning of each line in multiline mode. Example: ^abc matches abc.
  • $ Matches the end of the string, or the end of each line in multiline mode. Example: abc$ matches abc.
  • | Alternation (or). Matches either the expression on its left or the one on its right, trying them from left to right. If | is not enclosed in (), its scope is the entire regular expression. Example: abc|def matches abc and def.
  • {} {m} matches the previous character m times; {m,n} matches it m to n times; if n is omitted, as in {m,}, it matches m or more times. Example: ab{1,2}c matches abc and abbc.
  • [] Character set. Matches any one character in the set. The characters can be listed one by one or given as a range, such as [abc] or [a-c]. [^abc] means negation, that is, any character other than a, b or c. Most special characters lose their special meaning inside a character set; ], ^, - and \ keep a special role there and are escaped with a backslash to be used literally. Example: a[bcd]e matches abe, ace and ade.
  • () The enclosed expression is treated as a group. Groups are numbered from the left of the expression: each left parenthesis "(" that opens a group increases the number by 1. The group as a whole can be followed by a quantifier, and a | inside the group applies only within that group. Example: (abc){2} matches abcabc; a(123|456)c matches a456c.

Note the effect of the backslash \:

  • A backslash followed by a metacharacter removes its special meaning (that is, it escapes the special character into an ordinary character).
  • A backslash followed by certain ordinary characters gives them a special function (that is, a predefined character set such as \d).
  • A backslash followed by a number refers to the string matched by the group with that number.
import re
a=re.search(r'(tina)(fei)haha\2','tinafeihahafei tinafeihahatina').group()
print(a)
# Result:
# tinafeihahafei

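2. Predefined character sets (the shorthands \d \D \s \S \w \W can also be written inside a character set […])

Each entry below gives the special character, what it does, an example expression, and text that the example matches. The bracketed equivalents are the ASCII forms; without re.ASCII the Unicode equivalents also match.

  • \d Digit: [0-9]. Example: a\dc matches a1c.
  • \D Non-digit: [^\d]. Example: a\Dc matches abc.
  • \s Any whitespace character: [<space>\t\r\n\f\v]. Example: a\sc matches a c.
  • \S Non-whitespace character: [^\s]. Example: a\Sc matches abc.
  • \w Word character, including the underscore: [A-Za-z0-9_]. Example: a\wc matches abc.
  • \W Non-word character: [^\w]. Example: a\Wc matches a c.
  • \A Matches only at the beginning of the string; same as ^ outside multiline mode. Example: \Aabc matches abc.
  • \Z Matches only at the end of the string; unlike $, it does not match before a trailing newline. Example: abc\Z matches abc.
  • \b Matches the position between \w and \W, that is, a word boundary such as the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb". Example: \babc\b matches abc in " abc "; a\b!bc matches a!bc.
  • \B Matches a position that is not a word boundary. Example: a\Bbc matches abc.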
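import re
# Without the r prefix, '\b' is a backspace character, so nothing matches.
w = re.findall('\btina','tian tinaaaa')
print(w)
# In a raw string, \b is a word boundary.
s = re.findall(r'\btina','tian tinaaaa')
print(s)
# '#' is a non-word character, so there is a boundary before 'tina'.
v = re.findall(r'\btina','tian#tinaaaa')
print(v)
a = re.findall(r'\btina\b','tian#tina@aaa')
print(a)
'''Output:
[]
['tina']
['tina']
['tina']'''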

3. Special group usage

Each entry below gives the syntax, what it does, an example expression, and text that the example matches.

  • (?P<name>...) Named group: gives the group an alias in addition to its number. Example: (?P<id>abc){2} matches abcabc.
  • (?P=name) References the string matched by the group with that alias. Example: (?P<id>\d)abc(?P=id) matches 1abc1 and 5abc5.
  • \<number> References the string matched by the group with that number. Example: (\d)abc\1 matches 1abc1 and 5abc5.

Origin blog.csdn.net/weixin_61587867/article/details/132363621