Python, day 31: the re module

Regular expressions:

  The re module reads the regular expressions you write and carries out matching operations according to them.

  Regular expressions: rules for working with strings.

  Use a set of rules to check whether a string meets my requirements - form validation

  Find the content that meets my requirements inside a string - web crawlers

  Character group: a character group lists everything that may appear at a single character position.

    1. A range must run from the smaller code to the larger one in ASCII order, e.g. [0-9] or [a-z], never [9-0].

    2. A character group can contain several ranges, e.g. [0-9a-fA-F].

Character group: [characters]
The characters that may appear at the same position form a character group, which is written with [] in a regular expression.
Characters fall into many categories, such as digits, letters, and punctuation.
If a position must hold exactly one digit, the character at that position can only be one of the ten digits 0, 1, 2 ... 9, which is written [0123456789] or [0-9].
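
A minimal sketch of the two rules above, using only the standard re module. The variable names are made up for illustration.

```python
import re

# [0-9] allows exactly one digit at a position.
digit = re.compile(r'[0-9]')
# Several ranges in one group: any one hexadecimal digit.
hex_digit = re.compile(r'[0-9a-fA-F]')

one_digit = digit.fullmatch('7') is not None
not_a_digit = digit.fullmatch('x') is None
hex_found = hex_digit.findall('0x1fG')

# A range written from large to small is rejected when the pattern is compiled.
try:
    re.compile(r'[9-0]')
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False

print(one_digit, not_a_digit, hex_found, reversed_range_ok)
```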
 
Metacharacter    What it matches
.                any character except a newline
\w               a letter, digit, or underscore
\s               any whitespace character
\d               a digit
\n               a newline
\t               a tab character
\b               a word boundary (the start or end of a word)
^                the beginning of the string
$                the end of the string
\W               any character that is not a letter, digit, or underscore
\D               any non-digit
\S               any non-whitespace character
a|b              character a or character b
()               the expression inside the parentheses; also denotes a group
[...]            any character in the character group
[^...]           any character except those in the character group
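
To see a few of these metacharacters at work, here is a small sketch run against a made-up sample string. The string and the variable names are assumptions chosen for the example.

```python
import re

text = 'id_42 costs\t$9\n'

words = re.findall(r'\w+', text)      # runs of letters, digits, underscores
digits = re.findall(r'\d', text)      # single digits
spaces = re.findall(r'\s', text)      # whitespace characters
non_words = re.findall(r'\W', text)   # everything that \w does not match
# \b is a word boundary, so only the standalone word "cost" is found.
whole_word = re.findall(r'\bcost\b', 'cost costs')

print(words, digits, spaces, non_words, whole_word)
```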

Quantifiers:

Quantifier    Meaning
*             repeat zero or more times
+             repeat one or more times
?             repeat zero or one time
{n}           repeat exactly n times
{n,}          repeat n or more times
{n,m}         repeat n to m times
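
One way to check the table is to test each quantifier against strings of zero to three a's. This is only a sketch; the sample strings and the name `table` are made up for illustration.

```python
import re

samples = ['', 'a', 'aa', 'aaa']
patterns = [r'a*', r'a+', r'a?', r'a{2}', r'a{2,}', r'a{1,2}']

# For each quantifier, record which samples match in full.
table = {
    p: [s for s in samples if re.fullmatch(p, s) is not None]
    for p in patterns
}

for pattern, matched in table.items():
    print(f'{pattern:8} {matched}')
```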

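.  ^  $ :

The sample string 海燕海娇海东 is three names run together (Haiyan, Haijiao, Haidong); each Chinese character counts as one character.

Regex:        海.
String:       海燕海娇海东
Result:       海燕, 海娇, 海东
Explanation:  matches every occurrence of 海 followed by one character

Regex:        ^海.
String:       海燕海娇海东
Result:       海燕
Explanation:  matches "海." only at the beginning of the string

Regex:        海.$
String:       海燕海娇海东
Result:       海东
Explanation:  matches "海." only at the end of the string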
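
The three rows can be reproduced with re.findall. This sketch assumes the sample string is 海燕海娇海东 (the names Haiyan, Haijiao, Haidong written without spaces); the variable names are illustrative.

```python
import re

names = '海燕海娇海东'

everywhere = re.findall('海.', names)   # 海 plus any one character
at_start = re.findall('^海.', names)    # only at the beginning
at_end = re.findall('海.$', names)      # only at the end

print(everywhere, at_start, at_end)
```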

* + ? { }:

The sample string 李杰和李莲英和李二棍子 reads "Li Jie and Li Lianying and Li Ergunzi" (和 means "and"); each Chinese character counts as one character.

Regex:        李.?
String:       李杰和李莲英和李二棍子
Result:       李杰, 李莲, 李二
Explanation:  ? repeats zero or one time, so at most one arbitrary character is matched after 李

Regex:        李.*
String:       李杰和李莲英和李二棍子
Result:       李杰和李莲英和李二棍子
Explanation:  * repeats zero or more times, so zero or more arbitrary characters are matched after 李

Regex:        李.+
String:       李杰和李莲英和李二棍子
Result:       李杰和李莲英和李二棍子
Explanation:  + repeats one or more times, so one or more arbitrary characters are matched after 李

Regex:        李.{1,2}
String:       李杰和李莲英和李二棍子
Result:       李杰和, 李莲英, 李二棍
Explanation:  {1,2} matches one or two arbitrary characters after 李
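
The same four rows as a runnable sketch. It assumes the sample string is 李杰和李莲英和李二棍子 ("Li Jie and Li Lianying and Li Ergunzi"); the variable names are illustrative.

```python
import re

text = '李杰和李莲英和李二棍子'

optional = re.findall('李.?', text)        # at most one character after 李
star = re.findall('李.*', text)            # as many characters as possible
plus = re.findall('李.+', text)            # at least one, as many as possible
one_or_two = re.findall('李.{1,2}', text)  # one or two characters after 李

print(optional, star, plus, one_or_two)
```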

Note: The quantifiers above (*, +, ? and so on) are all greedy, that is, they match as much as possible. Adding a ? after a quantifier makes it lazy, so it matches as little as possible.

Regex:        李.*?
String:       李杰和李莲英和李二棍子
Result:       李, 李, 李
Explanation:  lazy matching: .*? takes zero characters whenever it can

Regex:        李.+?
String:       李杰和李莲英和李二棍子
Result:       李杰, 李莲, 李二
Explanation:  lazy matching: .+? takes one character, the minimum it is allowed
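
A short sketch that puts the greedy and lazy forms side by side on the sample string from the table above (assumed to be 李杰和李莲英和李二棍子). The variable names are illustrative.

```python
import re

text = '李杰和李莲英和李二棍子'

greedy_star = re.findall('李.*', text)   # grabs everything up to the end
lazy_star = re.findall('李.*?', text)    # stops right after 李
lazy_plus = re.findall('李.+?', text)    # 李 plus exactly one character

print(greedy_star, lazy_star, lazy_plus)
```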

Character set [ ] [^ ]:

Regex:        李[杰莲英二棍子]*
String:       李杰和李莲英和李二棍子
Result:       李杰, 李莲英, 李二棍子
Explanation:  matches 李 followed by any number of characters taken from [杰莲英二棍子]

Regex:        李[^和]*
String:       李杰和李莲英和李二棍子
Result:       李杰, 李莲英, 李二棍子
Explanation:  matches 李 followed by any number of characters other than 和 ("and")

Regex:        [\d]
String:       456bdha3
Result:       4, 5, 6, 3
Explanation:  matches any single digit, giving 4 results

Regex:        [\d]+
String:       456bdha3
Result:       456, 3
Explanation:  matches runs of one or more digits, giving 2 results
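
These rows can be checked the same way. The sketch assumes the sample string 李杰和李莲英和李二棍子 from the tables above; the variable names are illustrative.

```python
import re

text = '李杰和李莲英和李二棍子'

# 李 followed by any number of characters from the listed set.
from_set = re.findall('李[杰莲英二棍子]*', text)
# 李 followed by any number of characters that are not 和.
not_and = re.findall('李[^和]*', text)

single_digits = re.findall(r'[\d]', '456bdha3')   # one digit per result
digit_runs = re.findall(r'[\d]+', '456bdha3')     # consecutive digits grouped

print(from_set, not_and, single_digits, digit_runs)
```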

Grouping () and alternation | [^]:

 An ID card number is a string of 15 or 18 characters. If it has 15 characters, they are all digits and the first cannot be 0. If it has 18 characters, the first cannot be 0, the first 17 are all digits, and the last may be a digit or x. Let's try to express this with regular expressions:

Regex:        ^[1-9]\d{13,16}[0-9x]$
String:       110101198001017032
Result:       110101198001017032
Explanation:  matches a correct ID number

Regex:        ^[1-9]\d{13,16}[0-9x]$
String:       1101011980010170
Result:       1101011980010170
Explanation:  this string also matches, but it is not a correct ID number: it has 16 digits

Regex:        ^[1-9]\d{14}(\d{2}[0-9x])?$
String:       1101011980010170
Result:       False (no match)
Explanation:  the wrong ID number no longer matches. () denotes a group: putting \d{2}[0-9x] into a group lets the ? limit the group as a whole to 0 or 1 occurrences.

Regex:        ^([1-9]\d{16}[0-9x]|[1-9]\d{14})$
String:       110105199812067023
Result:       110105199812067023
Explanation:  tries [1-9]\d{16}[0-9x] first; if that does not match, tries [1-9]\d{14}
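
The three patterns can be compared in code. This is a sketch only: the helper name `looks_like_id` is made up, and it checks the format described above, not whether the number is a real, valid ID.

```python
import re

loose = re.compile(r'^[1-9]\d{13,16}[0-9x]$')
optional_tail = re.compile(r'^[1-9]\d{14}(\d{2}[0-9x])?$')
either_length = re.compile(r'^([1-9]\d{16}[0-9x]|[1-9]\d{14})$')

def looks_like_id(number):
    """Hypothetical helper: True for a 15- or 18-character ID format."""
    return either_length.match(number) is not None

sixteen = '1101011980010170'   # 16 digits, not a correct ID number

loose_accepts_16 = loose.match(sixteen) is not None
grouped_accepts_16 = optional_tail.match(sixteen) is not None

print(loose_accepts_16, grouped_accepts_16)
print(looks_like_id('110105199812067023'), looks_like_id(sixteen))
```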

Escapes:

  In regular expressions, many metacharacters have a special meaning, such as \d and \s. If you want to match the literal text '\d' rather than a digit, you need to escape the '\', which turns it into '\\'.

  In Python, both the regular expression and the content to match are written as strings, and inside a string literal '\' also has a special meaning and needs to be escaped. So to match '\d' once, the string has to be written as '\\d' and the regex as '\\\\d', which is too much trouble. This is where the raw-string form r'\d' comes in: the string is written r'\d' and the regex r'\\d'.

Regex:        \d
String:       \d
Result:       False
Explanation:  \d is a metacharacter meaning "a digit", so the expression \d cannot match the literal text \d

Regex:        \\d
String:       \d
Result:       True
Explanation:  escaping the \ as \\ makes it match

Regex:        "\\\\d"
String:       '\\d'
Result:       True
Explanation:  in Python, each '\' in a string literal also has to be escaped, so every backslash is escaped once more

Regex:        r'\\d'
String:       r'\d'
Result:       True
Explanation:  putting r in front of a string turns off escape processing for the whole literal
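
The rows above as a runnable sketch. The variable names are illustrative; re.escape is a standard-library helper that builds the escaped pattern for you.

```python
import re

target = r'\d'   # two characters: a backslash and the letter d

digit_class = re.search(r'\d', target)      # \d means "a digit": no match
escaped_raw = re.search(r'\\d', target)     # regex \\d matches a literal \d
escaped_plain = re.search('\\\\d', '\\d')   # the same thing without raw strings

# Both spellings produce the identical pattern string.
same_pattern = (r'\\d' == '\\\\d')
# re.escape adds the backslashes automatically.
auto_escaped = re.escape(target)

print(digit_class, escaped_raw.group(), same_pattern, auto_escaped)
```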

Greedy match:

  When a match is possible, match the longest possible string. Greedy matching is the default.

Regex:        <.*>
String:       <script>...<script>
Result:       <script>...<script>
Explanation:  the default is greedy mode, which matches the longest possible string

Regex:        <.*?>
String:       <script>...<script>
Result:       <script>, <script>
Explanation:  adding ? switches from greedy to non-greedy mode, which matches the shortest possible string
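
A quick sketch of the two modes on the string from the table, assumed to be <script>...<script>. The variable names are illustrative.

```python
import re

html = '<script>...<script>'

greedy = re.findall(r'<.*>', html)   # runs from the first < to the last >
lazy = re.findall(r'<.*?>', html)    # stops at the first > each time

print(greedy)
print(lazy)
```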

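Some commonly used non-greedy patterns:

*?        repeat any number of times, but as few as possible
+?        repeat one or more times, but as few as possible
??        repeat zero or one time, but as few as possible
{n,m}?    repeat n to m times, but as few as possible
{n,}?     repeat n or more times, but as few as possible

Using .*?:

.    any character
*    zero to unlimited repetitions
?    non-greedy mode
Together they mean "take as few arbitrary characters as possible". The combination is rarely written on its own; for example:
.*?x
    : takes characters of any length up to the first x that appears.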
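Common methods in the re module:

  findall:

import re
# findall takes two arguments: the regular expression and the string to search
ret = re.findall('a', 'eva egon yuan')
# The return value is a list holding every result that matches the regex.
print(ret)  # ['a', 'a'] every match is returned, collected in a list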

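  search:

import re

ret = re.search('a', 'eva egon yuan')
if ret:
    # Python 3.7+ prints re.Match; older versions print _sre.SRE_Match
    print(ret)  # <re.Match object; span=(2, 3), match='a'>
    print(ret.group())  # a # search returns at the first match; read the text from the match object
# If there is a match, a match object is returned.
# If there is no match, None is returned.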

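Differences between findall and search:

  1. search returns as soon as it finds one match; findall returns only after it has found all of them.

  2. findall returns a list of results directly; search returns a match object.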
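
The two differences side by side, as a sketch. It also shows what each function gives back when nothing matches, which is worth knowing before calling .group(). The variable names are illustrative.

```python
import re

text = 'eva egon yuan'

first = re.search('a', text)      # a match object for the first 'a' only
every = re.findall('a', text)     # a plain list with every 'a'

no_search = re.search('z', text)    # None when nothing matches
no_findall = re.findall('z', text)  # an empty list when nothing matches

print(first.span(), every, no_search, no_findall)
```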

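  match:  equivalent to adding a ^ to the regular expression     'a' ---> '^a'

import re
ret = re.match('a', 'ava egon yuan')
print(ret)  # <re.Match object; span=(0, 1), match='a'>
print(ret.group())  # a

  1. It behaves as if a ^ had been added to the front of the regular expression.

  2. Like search, it returns a match object on success and None on failure.

  3. Like search, the value is read from the match object with group.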
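
A sketch of the first point: with no flags such as re.MULTILINE, re.match('a', s) and re.search('^a', s) agree. The variable names are illustrative.

```python
import re

starts_with_a = re.match('a', 'ava egon yuan')       # 'a' is at position 0
not_at_start = re.match('a', 'eva egon yuan')        # None: string starts with 'e'
caret_search = re.search('^a', 'eva egon yuan')      # None for the same reason
plain_search = re.search('a', 'eva egon yuan')       # finds the 'a' further in

print(starts_with_a.group(), not_at_start, caret_search, plain_search.span())
```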

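  compile:

  1. A regular expression matches strings according to rules.

  2. Finding the strings that satisfy the rules inside another string is the job Python does.

  3. The regex rules are compiled into a form Python can execute.

  4. Without a compiled object, each call has to compile the pattern again or look it up in re's internal cache, which wastes time when repeated often.

  5. Compiling once with re.compile() and reusing the object saves that time.

import re
obj = re.compile(r'\d{3}')
ret = obj.search('abc123eeee')
print(ret.group())  # 123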
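
The benefit shows up when the same pattern is applied many times. A minimal sketch, with a made-up list of input lines:

```python
import re

# Compile once, outside the loop.
three_digits = re.compile(r'\d{3}')

lines = ['abc123eeee', 'no digits', 'x4567']

found = []
for line in lines:
    m = three_digits.search(line)   # reuse the compiled object each time
    found.append(m.group() if m else None)

print(found)
```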

  finditer:  returns an iterator, which saves memory

import re
ret = re.finditer(r'\d', 'dsfd24sdf324sf')
# returns an iterator over the match objects
print(ret)  # <callable_iterator object at 0x0000016DBB712860>
# print(next(ret))   # <re.Match object; span=(4, 5), match='2'>
for i in ret:
    print(i.group())

  split:

import re
# splitting on 'a' gives '' and 'bcd'; splitting 'bcd' on 'b' then gives '' and 'cd'
ret = re.split('[ab]', 'abcd')
print(ret)  # ['', '', 'cd']

  sub:

import re

ret1 = re.sub(r'\d', 'H', 'eva3egon4alex5')
# With no count given, every match is replaced.
print(ret1)  # evaHegonHalexH

ret2 = re.sub(r'\d', 'H', 'eva3egon4alex5', count=2)
# replace only the first two matches
print(ret2) # evaHegonHalex5

  subn:

import re

ret1 = re.subn(r'\d', 'H', 'eva3egon4yuan5')
# Replaces every match by default and returns a tuple: (result string, number of replacements)
print(ret1)  # ('evaHegonHyuanH', 3)

ret2 = re.subn(r'\d', 'H', 'eva3egon4yuan5', count=1)
# replace only the first match
print(ret2) # ('evaHegon4yuan5', 1)

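Group priority in findall:

import re

ret1 = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')
# findall gives priority to the group and returns its content; to get the full match, cancel that priority.
print(ret1)  # ['oldboy']

# cancel the group's priority in findall
ret2 = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')
# putting ?: at the start of the group makes it non-capturing, which cancels its priority in findall
print(ret2) # ['www.oldboy.com']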

Group priority in split:

import re

ret1 = re.split(r'\d+', 'eva3egon4yuan5')

print(ret1)  # ['eva', 'egon', 'yuan', '']

ret2 = re.split(r'(\d+)', 'eva3egon4yuan5')
print(ret2) # ['eva', '3', 'egon', '4', 'yuan', '5', '']

# Wrapping the pattern in () changes what split returns.
# Without () the matched separators are dropped; with () they are kept in the result.
# This matters whenever the separators themselves need to be preserved.

Matching tags:

import re

ret = re.search(r'<(?P<tag_name>\w+)>\w+</(?P=tag_name)>', '<h1>hello</h1>')
# A group can be given a name with the form (?P<name>...)
# The matched value can then be fetched directly with group('name')
print(ret.group('tag_name'))    # h1
print(ret.group())  # <h1>hello</h1>

# If the group has no name, \number refers to it instead, meaning "the same text the earlier group matched".
# The matched value can be fetched directly with group(number)
ret = re.search(r'<(\w+)>\w+</\1>', '<h1>hello</h1>')
print(ret)  # <re.Match object; span=(0, 14), match='<h1>hello</h1>'>
print(ret.group(0)) # <h1>hello</h1>
print(ret.group(1)) # h1
print(ret.group())  # <h1>hello</h1> group() defaults to group 0
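
The back-reference is what makes the closing tag agree with the opening one. A short sketch with a deliberately mismatched tag; the variable names are illustrative.

```python
import re

tag = re.compile(r'<(?P<tag_name>\w+)>\w+</(?P=tag_name)>')

ok = tag.search('<h1>hello</h1>')
mismatch = tag.search('<h1>hello</h2>')   # closing tag differs, so no match

# groupdict() returns all named groups as a dictionary.
names = ok.groupdict()

print(names, mismatch)
```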

 
