Data Structures: Mastering Regular Expressions in Five Minutes (Python)
PAY ATTENTION!
⭐1. re.search(), re.match() and re.finditer() do not return the matched text or its position directly; they return match objects (re.finditer() yields one per match). re.findall() is the exception: it returns a list of the matched strings.
⭐2. The span() method returns the start and end positions of the match as a half-open interval: start included, end excluded, e.g. [3, 5).
⭐3. The group() function directly returns the matched content
PAY ATTENTION!
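The three points above can be checked with a short sketch. The sample string and variable names are only illustrative.

```python
import re

text = "www.google.com"

# search() hands back a match object, not the matched text itself
m = re.search(r"go+gle", text)

start, end = m.span()   # half-open interval [start, end)
matched = m.group()     # the matched text

print(start, end)                    # 4 10
print(matched)                       # google
print(text[start:end] == matched)    # True: slicing with span() recovers the match
```

Because the end index is excluded, the span can be passed straight to a slice.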
-
0. Regular expression rule table
-
1. re.search()------match search
-
2. re.match() ----- match from the first
-
3. re.findall()------find all
-
4. re.finditer()----return iterator
-
5. re.split()---------regularized segmentation
-
6. Expression matching rules – quick search
-
7. Expression matching case – quick understanding
0. Regular expression rule table
0) Regex modifiers - optional flags
Modifier | Description |
---|---|
re.I | Makes matching case-insensitive. |
re.L | Does locale-aware matching (only valid with bytes patterns in Python 3). |
re.M | Multiline matching; ^ and $ also match at the start and end of each line. |
re.S | Makes . match every character, including newlines. |
re.U | Interprets \w, \W, \b, \B according to the Unicode character set. This is already the default for str patterns in Python 3. |
re.X | Verbose mode: whitespace and # comments inside the pattern are ignored, so a regular expression can be laid out in a more readable way. |
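A minimal sketch of how the most common flags change matching. The sample strings are made up for illustration.

```python
import re

# re.I: ignore case
case_hits = re.findall(r"python", "Python PYTHON python", flags=re.I)

# re.M: ^ and $ also match at the start and end of every line
line_starts = re.findall(r"^\w+", "first line\nsecond line", flags=re.M)

# re.S: '.' also matches a newline
no_dotall = re.search(r"a.b", "a\nb")                 # None
with_dotall = re.search(r"a.b", "a\nb", flags=re.S)   # match object

# re.X: whitespace and comments inside the pattern are ignored
date = re.compile(r"""
    (\d{4})    # year
    -(\d{2})   # month
    -(\d{2})   # day
""", re.X)
date_parts = date.search("released on 2024-01-15").groups()

# Flags are combined with |
combined = re.findall(r"^b.*$", "Alpha\nBeta\nbravo", flags=re.I | re.M)

print(case_hits)      # ['Python', 'PYTHON', 'python']
print(line_starts)    # ['first', 'second']
print(date_parts)     # ('2024', '01', '15')
print(combined)       # ['Beta', 'bravo']
```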
1) Special characters
Pattern | Description |
---|---|
. | Matches any single character except "\n". To match any character including "\n", use the re.S flag or a pattern like '[\s\S]'. |
\d | Matches a digit. Equivalent to [0-9] for ASCII text. |
\D | Matches a non-digit character. Equivalent to [^0-9] for ASCII text. |
\s | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v] for ASCII text. |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] for ASCII text. |
\w | Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]' for ASCII text; with str patterns Python 3 also matches Unicode letters and digits unless re.A is set. |
\W | Matches any non-word character. Equivalent to '[^A-Za-z0-9_]' for ASCII text. |
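The character classes in this table can be tried out as follows. This is a small sketch with an invented sample string.

```python
import re

sample = "Order 66:\tship_to=Mars\n"

digits = re.findall(r"\d+", sample)
words = re.findall(r"\w+", sample)
spaces = re.findall(r"\s", sample)

print(digits)   # ['66']
print(words)    # ['Order', '66', 'ship_to', 'Mars']
print(spaces)   # [' ', '\t', '\n']

# '.' skips the newline by default; [\s\S] (or the re.S flag) does not
dot_default = re.findall(r".", "a\nb")       # ['a', 'b']
any_char = re.findall(r"[\s\S]", "a\nb")     # ['a', '\n', 'b']

# Inside [...] a dot is just a literal dot
literal_dot = re.findall(r"[.]", "a.b")      # ['.']

# With str patterns \w is Unicode-aware unless re.A (ASCII) is set
unicode_word = re.findall(r"\w+", "caf\u00e9")               # ['café']
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.A)     # ['caf']
```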
3) Must remember
Pattern | Description |
---|---|
^ | Matches the beginning of the string. |
$ | Matches the end of the string. |
. | Matches any character, except newline, and when the re.DOTALL flag is specified, matches any character including newline. |
[…] | Used to represent a group of characters, listed separately: [amk] matches 'a', 'm' or 'k' |
[^…] | Characters not in []: [^abc] matches characters other than a, b, c. |
re* | Matches 0 or more repetitions of the preceding expression, greedily. |
re+ | Matches 1 or more repetitions of the preceding expression, greedily. |
re? | Matches 0 or 1 occurrence of the preceding expression, greedily. Placed after another quantifier (*?, +?, ??, {n,m}?), the ? makes that quantifier non-greedy. |
re{n} | Matches exactly n occurrences of the preceding expression. For example, o{2} would not match the "o" in "Bob", but would match both o's in "food". |
re{n,} | Matches n or more occurrences of the preceding expression. For example, o{2,} would not match the "o" in "Bob", but would match all o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*". |
re{n,m} | Matches from n to m occurrences of the preceding expression, greedily. |
a| b | match a or b |
(re) | Group regular expressions and remember matched text |
(?imx) | Inline form of the optional flags i, m, or x. In Python's re these apply to the whole pattern and should be placed at the very start of the expression. |
(?-imx) | Turns the i, m, or x flags off in Perl-style engines. Python's re does not accept this unscoped form; use (?-imx: re) instead. |
(?: re) | Like (…), but does not create a capturing group. |
(?imx: re) | Turns the i, m, or x flags on only inside the parentheses (Python 3.6+). |
(?-imx: re) | Turns the i, m, or x flags off only inside the parentheses (Python 3.6+). |
(?#…) | Comment; the contents are ignored. |
(?= re) | Positive lookahead. Succeeds if re matches at the current position and fails otherwise. The lookahead consumes no characters, so the rest of the pattern continues matching from the same position. |
(?! re) | Negative lookahead. The opposite of a positive lookahead: succeeds when re cannot be matched at the current position in the string. |
(?> re) | Atomic group: once re has matched, the engine does not backtrack into it. Supported by Python's re from version 3.11. |
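\w | Matches letters, digits, and underscores. |
\W | Matches any character that is not a letter, digit, or underscore. |
\s | Matches any whitespace character; equivalent to [ \t\n\r\f\v]. |
\S | Matches any non-whitespace character. |
\d | Matches any digit; equivalent to [0-9]. |
\D | Matches any non-digit character. |
\A | Matches the start of the string. |
\Z | Matches the end of the string. In Perl-style engines it also matches just before a trailing newline; in Python's re it matches only at the very end. |
\z | Matches the end of the string (Perl-style). Older versions of Python's re do not support it; use \Z instead. |
\G | Matches the position where the previous match ended. Not supported by Python's re module. |
\b | Matches a word boundary, i.e. the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb". |
\B | Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never". |
\n, \t, etc. | Match a newline, a tab, and so on. |
\1…\9 | Matches the content of the nth group. |
\10 | Matches the content of group 10 if that group exists. Some engines otherwise read it as an octal character code; Python's re treats it as a group reference and raises an error when there is no group 10. |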
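The rows that most often cause confusion (greedy vs. non-greedy quantifiers, groups, lookaheads, backreferences, and word boundaries) are sketched below. All sample strings are invented for illustration.

```python
import re

# Greedy vs. non-greedy: '?' after a quantifier makes it match as little as possible
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)    # ['<b>one</b><b>two</b>']
lazy = re.findall(r"<b>.*?</b>", html)     # ['<b>one</b>', '<b>two</b>']

# {n} and {n,}
exactly_two = re.findall(r"o{2}", "Bob food")         # ['oo']
two_or_more = re.findall(r"o{2,}", "Bob foooood")     # ['ooooo']

# Capturing groups are numbered from 1; group(0) is the whole match
m = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com")
user, domain = m.group(1), m.group(2)                 # 'bob', 'example'

# (?:...) groups without capturing, so findall returns the whole match
non_capturing = re.findall(r"(?:ab)+", "ababab cd")   # ['ababab']

# Lookaheads test what follows without consuming it
py_names = re.findall(r"\w+(?=\.py)", "run.py test.txt lib.py")       # ['run', 'lib']
not_bar = re.findall(r"\bfoo(?!bar)\w*", "foobar foobaz food")        # ['foobaz', 'food']

# \1 refers back to whatever group 1 matched: here, a doubled word
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")       # ['is', 'test']

# \b and \B: 'er' at a word end vs. 'er' inside a word
at_word_end = re.search(r"er\b", "never verb").span()   # (3, 5), in "never"
inside_word = re.search(r"er\B", "never verb").span()   # (7, 9), in "verb"
```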
1. re.search()------match search
-
Purpose:
Scan the entire string and return the first successful match.
-
Usage
re.search(pattern, string)
pattern the regular expression to match
string the string to search
-
Use case
>>> import re
>>> re.search('go*gle', 'www.google.com')
<re.Match object; span=(4, 10), match='google'>
>>> re.search('go*gle', 'www.google.com').span()
(4, 10)
>>> re.search('go*gle', 'www.google.com').span()[0]
4
>>> re.search('go*gle', 'www.google.com').group()
'google'
-
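Notes
1. The return value is a match object. It cannot be used as the result directly; call methods such as span() or group() on it to get what you need.
2. If search() finds nothing it returns None. None has no span() or group() method, so calling them directly raises an error. Add a None check in your program before using the result.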
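The None check from note 2 might look like this. `find_span` is a hypothetical helper written for this example, not part of the re module.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    if m is None:        # guard: None has no span() or group()
        return None
    return m.span()

found = find_span(r"go*gle", "www.google.com")
missing = find_span(r"bing", "www.google.com")

print(found)     # (4, 10)
print(missing)   # None
```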
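2. re.match() ----- match from the first
-
Purpose:
Try to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.
-
Usage
re.match(pattern, string)
pattern the regular expression to match
string the string to match against
-
Use case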
>>> import re
>>> print(re.match('google.', 'www.google.com'))
None
>>> re.match('w.', 'www.google.com')
<re.Match object; span=(0, 2), match='ww'>
>>> re.match('w.', 'www.google.com').span()
(0, 2)
>>> re.match('w.', 'www.google.com').span()[0]
0
>>> re.match('w.', 'www.google.com').group()
'ww'
-
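Notes
Matching can only begin at the first character. If the pattern does not match there, None is returned, so match() is rarely used in practice.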
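To make the difference concrete, here is a small comparison of re.match() with re.search(). It also shows re.fullmatch(), which this article does not cover: it is the standard-library function that requires the whole string to match.

```python
import re

url = "www.google.com"

at_start = re.match(r"www", url)          # pattern is at position 0: match
not_at_start = re.match(r"google", url)   # "google" is not at position 0: None
anywhere = re.search(r"google", url)      # search() scans forward: match

whole = re.fullmatch(r"www\.\w+\.com", url)   # entire string fits the pattern: match
partial = re.fullmatch(r"www", url)           # only a prefix fits: None

print(at_start.span())     # (0, 3)
print(not_at_start)        # None
print(anywhere.span())     # (4, 10)
print(whole is not None)   # True
print(partial)             # None
```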
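3. re.findall()------find all
-
Purpose:
Find every substring that the regular expression matches and return them as a list.
-
Usage
pattern.findall(string[, pos[, endpos]])
pattern a compiled regular expression (from re.compile()); the module-level form is re.findall(pattern, string, flags=0)
string the string to search
pos start position in the string, default 0 (optional)
endpos end position in the string, default the length of the string (optional)
-
Use case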
>>> import re
>>> pattern = re.compile(r'\d+')
>>> text = 'asd123qwe456opi789mnb012'
>>> pattern.findall(text)
['123', '456', '789', '012']
>>> pattern.findall(text, 0, 12)
['123', '456']
>>> re.findall(r'(\w+)=(\d+)', 'set w=30 and h=40')   # several groups: returns a list of tuples
[('w', '30'), ('h', '40')]
-
Notes
findall() returns the matched substrings themselves, not their positions.
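What findall() puts in the list depends on how many capturing groups the pattern has. A short sketch, reusing the string from the use case above:

```python
import re

text = "set w=30 and h=40"

no_group = re.findall(r"\w+=\d+", text)         # whole matches
one_group = re.findall(r"\w+=(\d+)", text)      # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)   # one tuple per match

# A list of 2-tuples converts naturally into a dict
as_dict = {key: int(value) for key, value in two_groups}

print(no_group)     # ['w=30', 'h=40']
print(one_group)    # ['30', '40']
print(two_groups)   # [('w', '30'), ('h', '40')]
print(as_dict)      # {'w': 30, 'h': 40}
```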
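4. re.finditer()----return iterator
-
Purpose:
Find every substring that the regular expression matches and return them as an iterator of match objects.
-
Usage
re.finditer(pattern, string, flags=0)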
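-
Use case
import re

# Loop over the match objects and print the text of each one
matches = re.finditer(r"\d+", "78a32bc43jf3")
for m in matches:
    print(m.group())
-
Notes
The return value is an iterator, so it cannot be printed directly. Loop over it (or convert it to a list) to get the match objects.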
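Since each item is a match object, finditer() is the natural choice when positions are needed as well as text. A minimal sketch using the same input string:

```python
import re

text = "78a32bc43jf3"

# Collect both the matched text and its span for every match
results = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(results)
# [('78', (0, 2)), ('32', (3, 5)), ('43', (7, 9)), ('3', (11, 12))]

# The iterator can only be consumed once
it = re.finditer(r"\d+", text)
first_pass = [m.group() for m in it]    # ['78', '32', '43', '3']
second_pass = [m.group() for m in it]   # [] because the iterator is exhausted
```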
5. re.split()---------regularized segmentation
-
Purpose:
Split the string at every substring the pattern matches and return the pieces as a list.
-
Usage
re.split(pattern, string[, maxsplit=0, flags=0])
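-
Use case
>>> import re
>>> re.split(r'\W+', 'baidu, google, sogo.')
['baidu', 'google', 'sogo', '']
-
Notes
re.split() collects the pieces of the string that lie between the matches into a list and returns it. When the pattern matches at the very end of the string, the list ends with an empty string, as in the example above.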
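A few variations on the split example, sketched with the same input: dropping the trailing empty string, limiting the number of splits with maxsplit, and keeping the separators by wrapping the pattern in a capturing group.

```python
import re

text = "baidu, google, sogo."

parts = re.split(r"\W+", text)
cleaned = [p for p in parts if p]              # drop empty strings
limited = re.split(r"\W+", text, maxsplit=1)   # split at most once
with_seps = re.split(r"(\W+)", text)           # capturing group keeps the separators

print(parts)       # ['baidu', 'google', 'sogo', '']
print(cleaned)     # ['baidu', 'google', 'sogo']
print(limited)     # ['baidu', 'google, sogo.']
print(with_seps)   # ['baidu', ', ', 'google', ', ', 'sogo', '.', '']
```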