Python regular expressions (complete guide), with a detailed analysis of a real LeetCode problem

A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept from computer science: a text pattern made up of ordinary characters (for example, the letters a through z) and special characters (called "metacharacters"). A regular expression uses a single string to describe and match a set of strings that follow a given syntax rule, and is usually used to search for and replace text that matches a pattern.

Python with 'RE':

Python has included the re module since version 1.5. It provides Perl-style regular expression patterns.

The re module brings full regular expression functionality to the Python language.

The compile function generates a regular expression object from a pattern string and optional flags arguments. This object has a set of methods for regular expression matching and replacement.

The re module also provides functions that do exactly what these methods do, taking a pattern string as their first argument.
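As a minimal sketch of that equivalence (the sample text and variable names are made up for illustration), the compiled-object call and the module-level call below should return the same result:

```python
import re

text = "order 66 shipped on day 12"  # hypothetical sample text

# Compile once, then call the method on the pattern object.
digits = re.compile(r"\d+")
via_method = digits.search(text).group()

# Module-level function: the pattern string is the first argument.
via_function = re.search(r"\d+", text).group()

print(via_method, via_function)
```

Compiling mainly pays off when the same pattern is reused many times. For one-off calls the module-level functions are usually fine, since the re module also keeps a cache of recently compiled patterns.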

Simply put, regular expressions are an essential tool in Python, mainly used to find and match strings, especially in web crawlers.

How to use ‘RE’?

First, import the re module. It is built into Python, so nothing needs to be installed; a plain import re is enough.

import re

Six commonly used functions in the re module:

  • re.compile: compiles a regular expression pattern into a pattern object
  • re.match: matches only at the beginning of the string; use the group() method to get the matched text
  • re.search: scans the string and finds the first match at any position; use the group() method to get the matched text
  • re.findall: finds every match at any position in the string and returns all of them as a list
  • re.sub: finds matches and replaces them
  • re.split: uses the matched text as the separator, splits the string, and returns a list

regular expression pattern

Pattern strings use a special syntax to represent a regular expression:
Letters and digits match themselves: a letter or digit in a pattern matches the same character in the target string.
Most letters and digits take on a different meaning when preceded by a backslash.
Punctuation characters that are metacharacters (such as . ^ $ * + ? { } [ ] | ( )) have a special meaning and match themselves only when escaped with a backslash; other punctuation marks match themselves.
The backslash itself needs to be escaped with a backslash.
Since regular expressions often contain backslashes, it is best to write them as raw strings. For a pattern element such as r'\t', the raw string and the ordinary string '\t' both match the corresponding special character (here, a tab).
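The difference between raw and ordinary string literals can be sketched as follows. The sample path is hypothetical, and the snippet only illustrates the escaping rules above:

```python
import re

# In a normal string literal, "\t" is a single tab character;
# in a raw string, r"\t" is two characters: a backslash and "t".
print(len("\t"), len(r"\t"))

# Both spellings still match a tab, because the regex engine
# itself understands the two-character sequence \t.
print(bool(re.search("\t", "a\tb")), bool(re.search(r"\t", "a\tb")))

# Matching one literal backslash: the regex needs \\, which is
# r"\\" as a raw string but "\\\\" as a normal string.
path = "C:\\temp"  # hypothetical sample: the text C:\temp
raw_hit = re.search(r"\\", path)
plain_hit = re.search("\\\\", path)
print(raw_hit.span(), plain_hit.span())
```

Raw strings keep the pattern readable because each backslash is written only once.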

Python regular expression symbol meaning

(Image not available: table of regular expression symbols and their meanings.)

Regular expression objects

re.Pattern (called RegexObject in older Python versions)

  • re.compile() returns a Pattern object.

re.Match (called MatchObject in older Python versions)

  • group() returns the string matched by the RE.
  • start() returns the position where the match starts.
  • end() returns the position where the match ends (the index just past the last matched character).
  • span() returns a tuple (start, end) containing the positions of the match.
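A short sketch of these four methods on one match object (the input string is a made-up example):

```python
import re

text = "abc123def"  # hypothetical sample input
m = re.search(r"\d+", text)

print(m.group())  # the matched text
print(m.start())  # index where the match begins
print(m.end())    # index just past the last matched character
print(m.span())   # (start, end) as a tuple

# Because end() is exclusive, slicing with the span gives the match back.
same_text = text[m.start():m.end()]
```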

Regex modifiers: optional flags

A regular expression can take optional flag modifiers that control how the pattern is matched. Flags are passed through the optional flags argument, and several flags can be combined with a bitwise OR (|). For example, re.I | re.M sets both the I and M flags:

Modifier Description
re.I Makes matching case-insensitive.
re.L Does locale-aware matching (in Python 3, only allowed with bytes patterns).
re.M Multiline matching; affects ^ and $.
re.S Makes . match every character, including newlines.
re.U Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B. In Python 3 it is already the default for str patterns.
re.X Ignores whitespace and comments inside the pattern and lets '#' start a comment, so that patterns can be written more readably.
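The flags above might be exercised like this. It is a small sketch with made-up sample text, not an exhaustive tour:

```python
import re

text = "first line\nSECOND line"  # hypothetical sample text

# re.I ignores case; re.M lets ^ match at the start of every line.
combined = re.findall(r"^second", text, re.I | re.M)

# Without re.S the dot stops at the newline; with it, the dot crosses lines.
without_s = re.match(r".+", text).group()
with_s = re.match(r".+", text, re.S).group()

# re.X: whitespace and # comments inside the pattern are ignored.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)
price = number.search("total: 19.99 USD").group()

print(combined, repr(without_s), repr(with_s), price)
```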

Regular expression examples

  • Character match
Example Description
python Matches "python".
  • Character class
Example Description
[Pp]ython Matches "Python" or "python".
rub[ye] Matches "ruby" or "rube".
[credit] Matches any one of the letters inside the brackets.
[0-9] Matches any digit. Same as [0123456789].
[a-z] Matches any lowercase letter.
[A-Z] Matches any uppercase letter.
[a-zA-Z0-9] Matches any letter or digit.
[^aeiou] Matches any character other than the letters a, e, i, o, u.
[^0-9] Matches any character except digits.
  • Special character class
Example Description
. Matches any single character except "\n". To match any character including "\n", use the re.S flag or a pattern like '[\s\S]'. (Inside brackets the dot is literal, so '[.\n]' only matches a dot or a newline.)
\d Matches a digit. Equivalent to [0-9].
\D Matches a non-digit character. Equivalent to [^0-9].
\s Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\w Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
Note: in Python 3, \d, \s, and \w in str patterns also match Unicode digits, whitespace, and word characters. The equivalents listed above hold exactly when the re.ASCII flag is set.
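To see the special character classes in action, here is a brief sketch. The sample string is invented, and the variable names are only illustrative:

```python
import re

sample = "Room 42, floor_3\tB"  # hypothetical sample text

digits = re.findall(r"\d", sample)    # single digits
words = re.findall(r"\w+", sample)    # runs of word characters
spaces = re.findall(r"\s", sample)    # whitespace characters

# The dot skips "\n" unless the re.S flag is given.
dot_default = re.findall(r".", "a\nb")
dot_all = re.findall(r".", "a\nb", re.S)

print(digits, words, spaces, dot_default, dot_all)
```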

1. re.compile function

The compile function compiles a regular expression and returns a regular expression (Pattern) object, whose methods such as match(), search(), and findall() can then be used.

re.compile(pattern, flags=0)

Parameters:

pattern : a regular expression in the form of a string
flags : optional; sets the matching mode, such as ignoring case or multi-line mode. The flags are described above and are not repeated here.

import re
pattern = re.compile(r'\d+')  
m = pattern.findall('one12twothree34four')
print(m)
#Output
#['12', '34']

2. re.match function

re.match tries to match a pattern at the beginning of the string. If the pattern does not match at the start of the string, match() returns None.

Syntax:

re.match(pattern, string, flags=0)

Parameter description:

Parameter Description
pattern The regular expression to match.
string The string to search.
flags Flags that control how the regular expression is matched, such as case sensitivity or multi-line matching.

re.match returns a match object if the match succeeds; otherwise it returns None.

We can use the match object methods group(num) or groups() to get the matched text.

import re
print(re.match('on', 'one12twothree34four'))  # matches at the start
print(re.match('our', 'one12twothree34four'))  # does not match at the start
print(re.match('on', 'one12twothree34four').span())  # matches at the start
#Output
#<re.Match object; span=(0, 2), match='on'>
#None
#(0, 2)

Match object method Description
group(num=0) Returns the string matched by the whole expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups() Returns a tuple containing the strings of all groups, from group 1 up to the last group.

import re
line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
#Output
# matchObj.group() :  Cats are smarter than dogs
# matchObj.group(1) :  Cats
# matchObj.group(2) :  smarter
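The example above calls group() with one number at a time. As a complementary sketch on the same sentence, groups() and a multi-argument group() call look like this:

```python
import re

line = "Cats are smarter than dogs"
m = re.match(r"(.*) are (.*?) .*", line)

all_groups = m.groups()  # every captured group, starting from group 1
pair = m.group(1, 2)     # several group numbers at once give a tuple
whole = m.group(0)       # group 0 is the entire match

print(all_groups, pair, whole)
```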

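3. re.search function

  • re.search scans the whole string and returns the first successful match.
  • re.search is similar to re.match. The only difference is that re.match matches only at the beginning of the string, while re.search can match at any position. If a match object is returned, use match.group() to extract the matched string.

Syntax:

re.search(pattern, string, flags=0)

Parameter description:

Parameter Description
pattern The regular expression to match.
string The string to search.
flags Flags that control how the regular expression is matched, such as case sensitivity or multi-line matching.
  • re.search returns a match object if the match succeeds; otherwise it returns None.

  • We can use the match object methods group(num) or groups() to get the matched text.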

Match object method Description
group(num=0) Returns the string matched by the whole expression. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups.
groups() Returns a tuple containing the strings of all groups, from group 1 up to the last group.

import re
print(re.search('www', 'www.runoob.com').span())  # matches at the start
print(re.search('com', 'www.runoob.com').span())  # matches away from the start
#Output
#(0, 3)
#(11, 14)


import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")
#Output
#searchObj.group() :  Cats are smarter than dogs
#searchObj.group(1) :  Cats
#searchObj.group(2) :  smarter
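The difference between match and search is easiest to see when the same pattern is run through both. The sketch below reuses the sample string from the earlier sections and also shows a guard against a None result:

```python
import re

text = "one12twothree34four"

at_start = re.match(r"\d+", text)   # None: the text does not start with a digit
anywhere = re.search(r"\d+", text)  # finds the first run of digits

# Check for None before calling group(), otherwise an AttributeError is raised.
for name, m in (("match", at_start), ("search", anywhere)):
    print(name, m.group() if m else None)
```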

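4. re.findall function

Finds all substrings of the string that match the regular expression and returns them as a list. If the pattern contains more than one group, it returns a list of tuples; if nothing matches, it returns an empty list.

This is a heavyweight function: use re.findall whenever you want to extract a list of all the strings that match a regular expression. It can be called in two ways: pattern.findall(string) or re.findall(pattern, string). re.findall is often used to extract useful information from text fetched by a crawler.

  • Note: match and search find one match; findall finds all matches.

Syntax:

pattern.findall(string[, pos[, endpos]])

Parameters:

Parameter Description
string The string to search.
pos Optional; the position in the string where the search starts. Defaults to 0.
endpos Optional; the position in the string where the search ends. Defaults to the length of the string.

The following example finds all the numbers in a string: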

import re

pattern = re.compile(r'\d+')   # find runs of digits
result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10)

print(result1)
print(result2)
#Output
#['123', '456']
#['88', '12']

With more than one group in the pattern, a list of tuples is returned:

import re

result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)
#Output
#[('width', '20'), ('height', '10')]
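How the shape of the findall result depends on the number of groups can be sketched like this. The settings dictionary at the end is just one illustrative use of the tuples, not part of the original example:

```python
import re

text = "set width=20 and height=10"

no_group = re.findall(r"\w+=\d+", text)        # whole matches
one_group = re.findall(r"\w+=(\d+)", text)     # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)  # tuples, one item per group

# The tuples convert naturally into a dictionary.
settings = {key: int(value) for key, value in two_groups}

print(no_group, one_group, two_groups, settings)
```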

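5. re.finditer function

Like findall, it finds all substrings of the string that match the regular expression, but returns them as an iterator of match objects.

Syntax:

re.finditer(pattern, string, flags=0)

Parameter Description
pattern The regular expression to match.
string The string to search.
flags Flags.

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())
#Output
#12
#32
#43
#3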
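Since finditer yields match objects instead of plain strings, the position of each match is available too. A small sketch on the same input:

```python
import re

text = "12a32bc43jf3"

# Collect both the matched text and its (start, end) span.
positions = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(positions)
```

This is the main reason to prefer finditer over findall when the location of a match matters, or when the input is large and the matches should be processed one at a time.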

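6. re.split function

split splits the string at every substring that matches the pattern and returns the pieces as a list. It is used as follows:

re.split(pattern, string, maxsplit=0, flags=0)

Parameter description:

Parameter Description
pattern The regular expression to match.
string The string to split.
maxsplit The maximum number of splits; maxsplit=1 splits once. Defaults to 0, meaning no limit.
flags Flags.

import re
re.split(r'\W+', 'runoob, runoob, runoob.')
#['runoob', 'runoob', 'runoob', '']

re.split(r'(\W+)', ' runoob, runoob, runoob.')
#['', ' ', 'runoob', ', ', 'runoob', ', ', 'runoob', '.', '']

re.split(r'\W+', ' runoob, runoob, runoob.', maxsplit=1)
#['', 'runoob, runoob, runoob.']

import re
string1 = "1cat2dogs3cats4"
list1 = re.split(r'\d+', string1)
print(list1)
#Output
#['', 'cat', 'dogs', 'cats', '']

re.split is not perfect: in the example above, the resulting list has an extra empty string at both the beginning and the end, which has to be removed by hand.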
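One possible way to drop those empty strings is sketched below. Both variants are suggestions, not the only options:

```python
import re

string1 = "1cat2dogs3cats4"

parts = re.split(r"\d+", string1)
cleaned = [p for p in parts if p]  # keep only the non-empty pieces

# Alternative: describe what to keep instead of what to split on.
words = re.findall(r"[a-z]+", string1)

print(parts, cleaned, words)
```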

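Search and replace

Python's re module provides re.sub for replacing matches in a string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

  • pattern : the regular expression pattern string.
  • repl : the replacement string; it can also be a function.
  • string : the original string to search and replace in.
  • count : the maximum number of replacements to make; the default 0 replaces all matches.

import re

phone = "2004-959-559 # This is a foreign phone number"

# Remove the Python-style comment from the string
num = re.sub(r'#.*$', "", phone)
print("Phone number: ", num)

# Remove every character that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number : ", num)

#Output
#Phone number:  2004-959-559 
#Phone number :  2004959559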

Using a function as the repl argument

The following example multiplies every number matched in the string by 2:

import re

# Multiply the matched number by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))
#Output
#A46G8HFD1134
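Besides a function, repl can be a string that refers back to captured groups, and count limits how many replacements are made. A hedged sketch with invented sample dates:

```python
import re

dates = "2022-11-30 and 2023-01-15"  # hypothetical sample text

# \g<name> in the replacement string refers to a named group.
swapped = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
                 r"\g<d>/\g<m>/\g<y>", dates)

# count=1 replaces only the first match.
first_only = re.sub(r"\d{4}", "YYYY", dates, count=1)

print(swapped)
print(first_only)
```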

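LeetCode problem walkthrough:

Finally, we close with a real LeetCode problem to see how neatly regular expressions can be used in an algorithm problem.

For details on the approach to the original problem, see this link.

Problem: 8. String to Integer (atoi)

Implement the function myAtoi(string s), which converts a string to a 32-bit signed integer (similar to the atoi function in C/C++).

The algorithm for myAtoi(string s) is as follows:

1. Read in the string and discard any useless leading spaces.
2. Check whether the next character (assuming the end of the string has not been reached) is a minus or a plus sign, and read that character if it is. This determines whether the final result is negative or positive. If neither is present, assume the result is positive.
3. Read in the following characters until the next non-digit character or the end of the input is reached. The rest of the string is ignored.
4. Convert the digits read in the previous steps to an integer (i.e., "123" -> 123, "0032" -> 32). If no digits were read, the integer is 0. Change the sign as necessary (from step 2).
5. If the integer is outside the 32-bit signed integer range [−2^31, 2^31 − 1], clamp it so that it stays in the range. Specifically, integers less than −2^31 should be clamped to −2^31, and integers greater than 2^31 − 1 should be clamped to 2^31 − 1.
6. Return the integer as the final result.
Note:

Only the space character ' ' counts as whitespace in this problem.
Do not ignore any characters other than the leading spaces or the rest of the string after the digits.

Approach: this problem has many constraints to handle, such as the bounds on both sides, the plus and minus signs, and the case where the first match is not a number, and each one has to be checked and filtered. First define the bounds, MIN = -2 ** 31 and MAX = 2 ** 31 - 1. Next, s.lstrip() removes the spaces on the left (s.strip() would remove them on both sides). After that come a few checks; see the code for details. Then comes the most important part, recognizing and selecting the numeric part, which uses the regular expressions discussed in this article: set up the regex rule, then look for matching content in the string. Because the result is a list, unpack it, convert it to an integer, and store it in num. Finally, output num clamped to the bounds defined at the start.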

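Code (regular expression):

import re

INT_MAX = 2147483647
INT_MIN = -2147483648

s = "  -43423423 -fd 14646 455 "  # sample input
s = s.lstrip()  # remove the extra spaces on the left
num_re = re.compile(r'^[+-]?\d+')  # regex rule: optional sign, then digits, anchored at the start
num = num_re.findall(s)  # find the matching content (a list with zero or one item)
num = int(*num)  # unpack the list and convert to an integer; int() with no argument gives 0
print(max(min(num, INT_MAX), INT_MIN))  # clamp to the 32-bit range and output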
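For submitting to LeetCode, the script above would need to be wrapped in a function. The following is one possible sketch, not the original author's solution: the name my_atoi and the use of re.match with a capture group are choices made here for illustration.

```python
import re

INT_MAX = 2**31 - 1
INT_MIN = -2**31


def my_atoi(s: str) -> int:
    """Sketch of LeetCode 8 using a regular expression."""
    # Skip leading spaces, then capture an optional sign and a run of digits.
    m = re.match(r" *([+-]?\d+)", s)
    if m is None:
        return 0  # no digits were read
    # Clamp the parsed value to the 32-bit signed range.
    return max(INT_MIN, min(int(m.group(1)), INT_MAX))


for sample in ("42", "   -42", "4193 with words", "words and 987", "-91283472332"):
    print(repr(sample), "->", my_atoi(sample))
```

Using re.match removes the need for the ^ anchor, and matching the leading spaces inside the pattern removes the separate lstrip() call.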

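Another approach is to use the built-in ord() function to convert each character of the string to its ASCII code, then check the characters one by one and build the output. The link to that approach is at the start of this subsection; interested readers can follow it, so it is not repeated here.

Summary

Regular expressions are mainly used to find and match strings, and they suit scenarios where many pieces of data need to be extracted. They let us filter, find, and extract the data we want more quickly, and are an essential tool in Python.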


Origin blog.csdn.net/chenjh027/article/details/128120852