Regular expressions:
The re module reads the regular expressions you write and carries out matching tasks according to them.
Regular expressions are a tool for working with strings. They have two typical uses:
Checking whether a string meets a set of rules, for example form validation.
Finding the content that meets a set of rules inside a string, for example in a crawler.
Character group: [character group]
A character group represents everything that may appear at one character position. The characters that may appear at the same position form the group, which is written with [] in a regular expression.
1. Within a range, the endpoints must run from the smaller ASCII code to the larger one, so [a-z] is valid and [z-a] is not.
2. A character group can contain multiple ranges, such as [a-zA-Z0-9].
Characters fall into many categories, such as digits, letters, and punctuation.
If a position must hold exactly one digit, the character at that position can only be one of the ten digits 0, 1, 2 ... 9, which is written [0-9].
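The two rules above can be checked directly with the re module. The following is a minimal sketch; the sample strings and the helper name `is_valid_pattern` are illustrative assumptions, not part of the original notes.

```python
import re

# One position, one digit: [0-9] accepts exactly one of the ten digits.
single_digits = re.findall(r"[0-9]", "a1b22")   # each digit is a separate match

# Several ranges in one group: lowercase, uppercase, and digits.
multi_range = re.compile(r"[a-zA-Z0-9]")
first_hit = multi_range.search("_x_")           # '_' is outside every range, 'x' is inside a-z


def is_valid_pattern(pattern):
    """Return True if `pattern` compiles, False if re raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(single_digits, first_hit.group())
# A range written from the larger code to the smaller one is rejected.
print(is_valid_pattern(r"[a-z]"), is_valid_pattern(r"[z-a]"))
```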
metacharacter | matches |
. | any character except a newline |
\w | a letter, digit, or underscore |
\s | any whitespace character |
\d | a digit |
\n | a newline |
\t | a tab character |
\b | a word boundary (the start or end of a word) |
^ | the beginning of the string |
$ | the end of the string |
\W | any character that is not a letter, digit, or underscore |
\D | any non-digit |
\S | any non-whitespace character |
a|b | character a or character b |
() | the expression inside the parentheses, which also forms a group |
[...] | any character in the character group |
[^...] | any character not in the character group |
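A few of the metacharacters in the table can be tried out with re.findall. This is only a sketch; the sample text is a hypothetical string chosen to contain a space, a tab, and a newline.

```python
import re

# Hypothetical sample text containing a space, a tab, and a newline.
text = "eva_1 egon\t22\nyuan"

words = re.findall(r"\w+", text)            # runs of letters, digits, underscores
digits = re.findall(r"\d+", text)           # runs of digits
spaces = re.findall(r"\s", text)            # the space, the tab, and the newline
first = re.findall(r"^\w+", text)           # anchored at the start of the string
last = re.findall(r"\w+$", text)            # anchored at the end of the string
standalone = re.findall(r"\b\d+\b", text)   # digits with a word boundary on each side

print(words, digits, spaces, first, last, standalone)
```

The last pattern finds only 22: the 1 in eva_1 follows an underscore, which counts as a word character, so there is no word boundary in front of it.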
Quantifiers:
quantifier | meaning |
* | repeat zero or more times |
+ | repeat one or more times |
? | repeat zero or one time |
{n} | repeat exactly n times |
{n,} | repeat n or more times |
{n,m} | repeat n to m times |
. ^ $ :
The sample text is the three names Haiyan (海燕), Haijiao (海娇), and Haidong (海东) written without separators. Each name is 海 followed by one more character.
regex | text to match | match result | explanation |
海. | 海燕海娇海东 | 海燕 海娇 海东 | matches every "海" followed by one character |
^海. | 海燕海娇海东 | 海燕 | matches "海." only at the beginning of the string |
海.$ | 海燕海娇海东 | 海东 | matches "海." only at the end of the string |
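The three rows above can be reproduced with re.findall. This sketch assumes the sample text is the Chinese string 海燕海娇海东 (the names Haiyan, Haijiao, Haidong without separators), so that each name is the character 海 plus exactly one more character.

```python
import re

# Assumed sample text: three two-character names, each starting with 海.
text = "海燕海娇海东"

everywhere = re.findall(r"海.", text)    # every 海 plus one following character
at_start = re.findall(r"^海.", text)     # only at the start of the string
at_end = re.findall(r"海.$", text)       # only at the end of the string

print(everywhere, at_start, at_end)
```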
* + ? { }:
The sample text 李杰和李莲英和李二棍子 reads "Li Jie and Li Lianying and Li Ergunzi"; 李 is the surname Li and 和 means "and".
regex | text to match | match result | explanation |
李.? | 李杰和李莲英和李二棍子 | 李杰 李莲 李二 | ? means zero or one repetition, so at most one arbitrary character is matched after "李" |
李.* | 李杰和李莲英和李二棍子 | 李杰和李莲英和李二棍子 | * means zero or more repetitions, so zero or more arbitrary characters are matched after "李" |
李.+ | 李杰和李莲英和李二棍子 | 李杰和李莲英和李二棍子 | + means one or more repetitions, so one or more arbitrary characters are matched after "李" |
李.{1,2} | 李杰和李莲英和李二棍子 | 李杰和 李莲英 李二棍 | {1,2} matches one or two arbitrary characters after "李" |
Note: the quantifiers *, +, ?, and so on are greedy by default, that is, they match as much as possible. Adding a ? after the quantifier makes it lazy, so that it matches as little as possible.
regex | text to match | match result | explanation |
李.*? | 李杰和李莲英和李二棍子 | 李 李 李 | lazy matching: .*? takes zero characters after "李" (Li) |
李.+? | 李杰和李莲英和李二棍子 | 李杰 李莲 李二 | lazy matching: .+? takes exactly one character after "李" (Li) |
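The greedy and lazy rows in the two tables above can be compared side by side. This sketch assumes the sample text is the Chinese string 李杰和李莲英和李二棍子 ("Li Jie and Li Lianying and Li Ergunzi"); the dictionary `results` is only a convenient way to collect the output.

```python
import re

# Assumed sample text: three names joined by 和 ("and").
text = "李杰和李莲英和李二棍子"

patterns = [r"李.?", r"李.*", r"李.+", r"李.{1,2}", r"李.*?", r"李.+?"]

# Map each pattern to the list of all non-overlapping matches.
results = {pattern: re.findall(pattern, text) for pattern in patterns}

for pattern, matches in results.items():
    print(pattern, matches)
```

The greedy forms .* and .+ swallow the whole string in one match, while the lazy forms stop at the first point where the pattern is satisfied.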
Character set [ ] [^ ]:
regex | text to match | match result | explanation |
李[杰莲英二棍子]* | 李杰和李莲英和李二棍子 | 李杰 李莲英 李二棍子 | matches "李" (Li) followed by any number of characters from the set [杰莲英二棍子] |
李[^和]* | 李杰和李莲英和李二棍子 | 李杰 李莲英 李二棍子 | matches "李" (Li) followed by any number of characters other than "和" ("and") |
[\d] | 456bdha3 | 4 5 6 3 | matches a single digit, giving 4 results |
[\d]+ | 456bdha3 | 456 3 | matches a run of one or more digits, giving 2 results |
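The character-set rows can be verified in the same way. This sketch assumes the Chinese sample text 李杰和李莲英和李二棍子 for the first two rows and uses the string 456bdha3 from the table for the last two.

```python
import re

names = "李杰和李莲英和李二棍子"   # assumed sample text for the first two rows

from_set = re.findall(r"李[杰莲英二棍子]*", names)   # 李 plus characters from the set
not_and = re.findall(r"李[^和]*", names)             # 李 plus anything except 和

one_digit = re.findall(r"[\d]", "456bdha3")     # each digit on its own
digit_runs = re.findall(r"[\d]+", "456bdha3")   # consecutive digits grouped together

print(from_set, not_and, one_digit, digit_runs)
```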
Grouping () and | (or):
An ID card number is a string of 15 or 18 characters. If it has 15 characters, they are all digits and the first cannot be 0. If it has 18 characters, the first cannot be 0, the first 17 are all digits, and the last may be a digit or x. Let's try to express this with regular expressions:
regex | text to match | match result | explanation |
^[1-9]\d{13,16}[0-9x]$ | 110101198001017032 | 110101198001017032 | matches a correct ID number |
^[1-9]\d{13,16}[0-9x]$ | 1101011980010170 | 1101011980010170 | this string also matches, but it has 16 digits and is not a correct ID number |
^[1-9]\d{14}(\d{2}[0-9x])?$ | 1101011980010170 | False | the wrong ID number is no longer matched |
^([1-9]\d{16}[0-9x]|[1-9]\d{14})$ | 110105199812067023 | 110105199812067023 | tries [1-9]\d{16}[0-9x] first and, if that does not match, tries [1-9]\d{14} |
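The last pattern in the table can be wrapped in a small validation helper. This is a sketch under the rules stated above; the function name `looks_like_id` is hypothetical, and the check covers only the length and character rules, not whether the number is a real, issued ID.

```python
import re

# 18 characters (17 digits plus a digit or x) or 15 digits; first digit never 0.
ID_PATTERN = re.compile(r"^([1-9]\d{16}[0-9x]|[1-9]\d{14})$")


def looks_like_id(candidate):
    """Return True if `candidate` has the 15- or 18-character shape described above."""
    return ID_PATTERN.match(candidate) is not None


print(looks_like_id("110101198001017032"))   # 18 digits
print(looks_like_id("1101011980010170"))     # 16 digits, wrong length
```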
Escapes:
Regular expressions have many metacharacters with special meaning, such as \d and \s. To match the literal text '\d' rather than a digit, the '\' in the pattern has to be escaped, which turns it into '\\'.
In Python, both the regular expression and the text to match are written as strings, and in a string '\' also has a special meaning and needs escaping. So to match the literal text '\d', the text is written as the string '\\d' and the pattern as '\\\\d', which is tedious. Raw strings avoid this: with the r prefix the text is r'\d' and the pattern is r'\\d'.
regex | text to match | match result | explanation |
\d | \d | False | \d is a metacharacter that matches a digit, so the expression \d cannot match the literal text \d |
\\d | \d | True | escaping \ as \\ makes the pattern match a literal backslash followed by d |
"\\\\d" | '\\d' | True | in Python the '\' in a string literal must also be escaped, so every '\' is written twice |
r'\\d' | r'\d' | True | the r prefix turns off string escaping for the whole literal |
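The rows of the escape table can be checked in a short session. This is a sketch; the variable names are illustrative, and re.escape is shown only as a convenient way to build the escaped pattern automatically.

```python
import re

target = r"\d"   # the literal two-character text: a backslash followed by d

as_metachar = re.search(r"\d", target)      # \d means "a digit", so nothing matches
escaped_raw = re.search(r"\\d", target)     # \\ matches one literal backslash
escaped_plain = re.search("\\\\d", target)  # the same pattern without the r prefix

same_pattern = "\\\\d" == r"\\d"    # both literals spell the same three characters
auto_escaped = re.escape(target)    # escapes the backslash for use in a pattern

print(as_metachar, escaped_raw.group(), escaped_plain.group(), same_pattern)
```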
Greedy match:
When a match is possible, match the longest possible string. Greedy matching is the default.
regex | text to match | match result | explanation |
<.*> | <script>...<script> | <script>...<script> | greedy mode is the default and matches the longest possible string |
<.*?> | <script>...<script> | <script> <script> | adding ? switches from greedy to non-greedy mode, which matches the shortest possible string |
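Common non-greedy patterns:
*? repeats any number of times, but as few as possible.
+? repeats one or more times, but as few as possible.
?? repeats zero or one time, but as few as possible.
{n,m}? repeats n to m times, but as few as possible.
{n,}? repeats n or more times, but as few as possible.
Usage of .*? :
. is any character, * takes zero to unlimited repetitions, and ? switches to non-greedy mode. Together they take as few arbitrary characters as possible. The combination is rarely written on its own; for example, .*?x takes characters of any length up to the first x.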
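The difference between greedy and non-greedy matching is easiest to see on the same input. The following sketch uses the `<script>...<script>` text from the table; the string "abcxdefx" is a hypothetical input for the .*?x case.

```python
import re

html = "<script>...<script>"

greedy_tags = re.findall(r"<.*>", html)    # one match running to the last '>'
lazy_tags = re.findall(r"<.*?>", html)     # stops at the first '>' each time

# .*?x stops at the first x, while .*x runs on to the last x.
up_to_first_x = re.search(r".*?x", "abcxdefx").group()
up_to_last_x = re.search(r".*x", "abcxdefx").group()

print(greedy_tags, lazy_tags, up_to_first_x, up_to_last_x)
```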
Common methods of the re module:
findall:
    import re

    # findall takes two arguments: the regular expression and the string to search.
    ret = re.findall('a', 'eva egon yuan')
    # The return value is a list holding every result that matches the pattern.
    print(ret)  # ['a', 'a']
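search:
    import re

    ret = re.search('a', 'eva egon yuan')
    if ret:
        print(ret)          # <re.Match object; span=(2, 3), match='a'>
        print(ret.group())  # a
    # search returns as soon as it finds one match; read the result from the match object.
    # If there is a match, it returns a match object.
    # If there is no match, it returns None.
Differences between findall and search:
1. search returns as soon as it finds the first match; findall returns only after it has found all matches.
2. findall returns a list of results directly; search returns a match object.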
match:
    import re

    ret = re.match('a', 'ava egon yuan')
    print(ret)          # <re.Match object; span=(0, 1), match='a'>
    print(ret.group())  # a
1. match behaves as if a ^ had been added to the start of the pattern: 'a' ---> '^a'.
2. Like search, it returns a match object on a match and None when nothing matches.
3. Like search, the value is read from the match object with group.
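Because match only looks at the start of the string, it can return None where search succeeds. The sketch below contrasts the two on the sample text used earlier; the helper `first_match` is a hypothetical name showing one way to guard against None before calling group.

```python
import re

text = "eva egon yuan"

at_start = re.match("a", text)      # None: the string starts with 'e', not 'a'
anywhere = re.search("a", text)     # finds the first 'a' at index 2
anchored = re.search("^a", text)    # None as well: here it behaves like match


def first_match(pattern, string):
    """Return the matched text, or None when re.search finds nothing."""
    found = re.search(pattern, string)
    return found.group() if found else None


print(at_start, anywhere.span(), anchored, first_match(r"\d", text))
```

Calling group on None raises AttributeError, so checking the result first, as `first_match` does, avoids that failure.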
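compile:
1. A regular expression is a rule for matching strings.
2. Python's job is to find the strings that fit the rule inside a larger string.
3. The rule is first compiled into a form Python can execute.
4. If a pattern is run many times, compiling it on every run wastes time.
5. Compiling once with re.compile() and reusing the result saves that time.
    import re

    obj = re.compile(r'\d{3}')  # compile once: three consecutive digits
    ret = obj.search('abc123eeee')
    print(ret.group())  # 123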
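The benefit of compile shows up when the same pattern is applied to many strings. A minimal sketch, assuming a hypothetical list of inputs; the compiled object offers the same search and findall methods as the module-level functions.

```python
import re

# Compile once, then reuse the same object for every input.
THREE_DIGITS = re.compile(r"\d{3}")

samples = ["abc123eeee", "no digits here", "x9y88z777"]   # hypothetical inputs

found = []
for sample in samples:
    match = THREE_DIGITS.search(sample)
    # search returns None when there is no run of three digits.
    found.append(match.group() if match else None)

all_runs = THREE_DIGITS.findall("111222333")

print(found, all_runs)
```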
finditer: returns an iterator, which saves memory
    import re

    ret = re.finditer(r'\d', 'dsfd24sdf324sf')  # an iterator over the match objects
    print(ret)  # <callable_iterator object at 0x0000016DBB712860>
    # print(next(ret))  # <re.Match object; span=(4, 5), match='2'>
    for i in ret:
        print(i.group())
split:
    import re

    # Split on 'a' first, giving '' and 'bcd', then split each piece on 'b'.
    ret = re.split('[ab]', 'abcd')
    print(ret)  # ['', '', 'cd']
sub:
    import re

    # Without a count, every match is replaced.
    ret1 = re.sub(r'\d', 'H', 'eva3egon4alex5')
    print(ret1)  # evaHegonHalexH
    # Replace only the first two matches.
    ret2 = re.sub(r'\d', 'H', 'eva3egon4alex5', count=2)
    print(ret2)  # evaHegonHalex5
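subn:
    import re

    # Replaces every match by default and returns a tuple:
    # (the string after replacement, the number of replacements made).
    ret1 = re.subn(r'\d', 'H', 'eva3egon4yuan5')
    print(ret1)  # ('evaHegonHyuanH', 3)
    # Replace only the first match.
    ret2 = re.subn(r'\d', 'H', 'eva3egon4yuan5', count=1)
    print(ret2)  # ('evaHegon4yuan5', 1)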
Group priority in findall:
    import re

    # findall returns the content of the group in preference to the whole match.
    ret1 = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')
    print(ret1)  # ['oldboy']
    # Starting the group with ?: makes it non-capturing, which removes that priority.
    ret2 = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')
    print(ret2)  # ['www.oldboy.com']
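The same priority rule extends to patterns with more than one group, where findall returns a tuple per match. The sketch below uses a hypothetical two-address string; finditer is shown as an alternative that keeps the group while still giving access to the whole match.

```python
import re

text = "www.baidu.com www.oldboy.com"   # hypothetical input with two addresses

one_group = re.findall(r"www\.(baidu|oldboy)\.com", text)      # group content only
no_group = re.findall(r"www\.(?:baidu|oldboy)\.com", text)     # whole matches
two_groups = re.findall(r"(www)\.(baidu|oldboy)\.com", text)   # one tuple per match

# finditer yields match objects, so group() and group(1) are both available.
pairs = [(m.group(), m.group(1))
         for m in re.finditer(r"www\.(baidu|oldboy)\.com", text)]

print(one_group, no_group, two_groups, pairs)
```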
Group priority in split:
    import re

    ret1 = re.split(r'\d+', 'eva3egon4yuan5')
    print(ret1)  # ['eva', 'egon', 'yuan', '']
    ret2 = re.split(r'(\d+)', 'eva3egon4yuan5')
    print(ret2)  # ['eva', '3', 'egon', '4', 'yuan', '5', '']
    # Wrapping the pattern in () changes the structure of the result.
    # Without () the matched separators are discarded; with () they are kept.
    # This matters whenever the separators themselves need to be preserved.
Matching tags:
    import re

    # A group can be given a name with the form ?P<name>.
    # (?P=tag_name) requires the same text that the named group matched.
    ret = re.search(r'<(?P<tag_name>\w+)>\w+</(?P=tag_name)>', '<h1>hello</h1>')
    # The value of a named group is available as group('name').
    print(ret.group('tag_name'))  # h1
    print(ret.group())            # <h1>hello</h1>

    # Without a name, \number refers to the group and requires the same text that group matched.
    # The value of a numbered group is available as group(number).
    ret = re.search(r'<(\w+)>\w+</\1>', '<h1>hello</h1>')
    print(ret)           # <re.Match object; span=(0, 14), match='<h1>hello</h1>'>
    print(ret.group(0))  # <h1>hello</h1>
    print(ret.group(1))  # h1
    print(ret.group())   # <h1>hello</h1>, group() defaults to group 0
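The backreference is what makes the closing tag agree with the opening tag. The sketch below adds a second named group, `content`, which is an assumption for illustration and not part of the original pattern, and shows that a mismatched closing tag is rejected.

```python
import re

# `content` is a hypothetical extra group name added for this example.
TAG = re.compile(r"<(?P<tag_name>\w+)>(?P<content>\w+)</(?P=tag_name)>")

good = TAG.search("<h1>hello</h1>")
bad = TAG.search("<h1>hello</h2>")   # closing tag differs, so nothing matches

# groupdict() returns every named group as a dictionary.
named = good.groupdict()

print(named, bad)
```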