Regular expressions are a basic tool in NLP. A regular expression is a sequence of characters that defines a search pattern, used mainly to match strings or parts of strings. In Python, the re module provides the functions for working with regular expressions.
One, matching with re.match
1. The usage of re.match
re.match tries to match a pattern at the beginning of the string. If the pattern does not match at the beginning, match() returns None.
a) Function syntax
re.match(pattern, string, flags=0)
# re.match(<regular expression>, <string to match>)
b) Function parameter description
parameter | description |
---|---|
pattern | The regular expression to match |
string | The string to match against. |
flags | Flags that control how the regular expression is matched, such as case-insensitive matching (re.I) or multi-line matching (re.M). |
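To make the flags parameter concrete, here is a minimal sketch of how re.I changes the result of re.match. The variable names are illustrative only, and the input string is the sample domain with its first part upper-cased.

```python
import re

# Without a flag the match is case-sensitive, so "www" does not match "WWW".
case_sensitive = re.match(r"www", "WWW.runoob.com")

# re.I (re.IGNORECASE) lets the same pattern match regardless of case.
case_insensitive = re.match(r"www", "WWW.runoob.com", flags=re.I)

print(case_sensitive)            # None
print(case_insensitive.group())  # WWW
```

Several flags can be combined with |, for example re.M | re.I.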
c) The returned match object
If the match is successful, re.match returns a match object; otherwise it returns None.
The match object's group(num) and groups() methods return the matched text.
Match object method | description |
---|---|
group(num=0) | Returns the string matched by the whole expression (group 0) or by group num. group() can take several group numbers at once, in which case it returns a tuple containing the values of those groups. |
groups() | Returns a tuple containing the strings of all groups, from group 1 to the last group in the pattern. |
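The table above can be illustrated with a short sketch. The e-mail address is made up for the example; the point is only the difference between group() with no argument, group() with several group numbers, and groups().

```python
import re

# Two capturing groups: the part before "@" and the domain name.
m = re.match(r"(\w+)@(\w+)\.com", "alice@example.com")

whole = m.group()        # group 0: the whole match
pair = m.group(1, 2)     # several group numbers at once give a tuple
all_groups = m.groups()  # every group from 1 upward, as a tuple

print(whole)       # alice@example.com
print(pair)        # ('alice', 'example')
print(all_groups)  # ('alice', 'example')
```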
2. Examples of re.match
a) Example 1
import re
print(re.match('www', 'www.runoob.com')) # matches at the start of the string
print(re.match('www', 'www.runoob.com').group()) # returns the matched text
print(re.match('www', 'www.runoob.com').span()) # returns the (start, end) indices of the match
print(re.match('com', 'www.runoob.com'))
# ---output-----
<re.Match object; span=(0, 3), match='www'>
www
(0, 3)
None
Note:
- The match object's span() method returns the start and end indices of the match.
- If there is no match, re.match returns None, so calling .group() on the result would raise an AttributeError.
b) Example 2
import re
line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
    print("matchObj.groups() :", matchObj.groups())
else:
    print("No match!!")
# ---output------------
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
matchObj.groups() : ('Cats', 'smarter')
Two, regular expression patterns
1. Match a single character
Match symbol | Match meaning |
---|---|
. | Match any single character (except \n; pass re.S to include \n) |
[ ] | Match any one of the characters listed in [] |
\d | Match a digit, e.g. 0-9 |
\D | Match a non-digit |
\s | Match whitespace, e.g. space, tab, newline |
\S | Match non-whitespace |
\w | Match a word character, i.e. a-z, A-Z, 0-9, _, and letters of other languages such as Chinese characters |
\W | Match a non-word character, i.e. anything that is not a letter, digit, underscore or Chinese character |
Note:
- '.' matches any character except \n. To make it match \n as well, pass the re.S flag.
- \w also matches letters from many languages, so use it with caution.
- \s also matches \n
- Ranges can be used inside []: [0-9] matches the 10 digits and [a-z] matches the 26 lowercase letters
- A leading ^ inside [] negates the set: [^abcde] matches any character except the ones listed
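The single-character symbols above can be tried out with re.findall. This is a small sketch on a made-up string; re.ASCII is a standard flag of the re module that is not discussed in the text and is shown only to illustrate the caution about \w.

```python
import re

text = "a1 _b\n"

digits = re.findall(r"\d", text)            # ['1']
spaces = re.findall(r"\s", text)            # [' ', '\n']  (\s includes \n)
words = re.findall(r"\w", text)             # ['a', '1', '_', 'b']
in_set = re.findall(r"[0-9a-z]", text)      # ['a', '1', 'b']
not_in_set = re.findall(r"[^a-z\s]", text)  # ['1', '_']

# '.' skips \n by default and includes it with re.S.
dot_default = re.findall(r".", text)        # ['a', '1', ' ', '_', 'b']
dot_all = re.findall(r".", text, re.S)      # ['a', '1', ' ', '_', 'b', '\n']

# \w matches Chinese characters unless re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r"\w", "中a")           # ['中', 'a']
ascii_words = re.findall(r"\w", "中a", re.ASCII)   # ['a']
```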
2. Match multiple characters
Match symbol | Match meaning |
---|---|
* | Match the preceding character 0 or more times, i.e. it is optional |
+ | Match the preceding character 1 or more times, i.e. at least once |
? | Match the preceding character 0 or 1 times, i.e. it appears once or not at all |
{m} | Match the preceding character exactly m times |
{m,n} | Match the preceding character from m to n times |
Note:
This is where the greedy nature of regular expressions shows: by default *, +, ? and {m,n} (for example {1,5}) match as many characters as possible. To turn off greedy matching, use *?, +?, ?? or {m,n}?
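As a rough illustration of the quantifiers and of the greedy default, the sketch below runs each one over the same made-up string and then compares {1,5} with its non-greedy form {1,5}?.

```python
import re

sample = "a ab abbb"

star = re.findall(r"ab*", sample)       # ['a', 'ab', 'abbb']  b zero or more times
plus = re.findall(r"ab+", sample)       # ['ab', 'abbb']       b at least once
maybe = re.findall(r"ab?", sample)      # ['a', 'ab', 'ab']    b at most once
exact = re.findall(r"ab{2}", sample)    # ['abb']              b exactly twice
ranged = re.findall(r"ab{1,2}", sample) # ['ab', 'abb']        b once or twice

# Greedy: take as many digits as allowed. Non-greedy: take as few as allowed.
greedy = re.match(r"\d{1,5}", "1234567").group()   # '12345'
lazy = re.match(r"\d{1,5}?", "1234567").group()    # '1'
```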
3. Match the beginning and end, and exclude specified characters
a) Match the beginning and end
If the expression starts with ^, the match must begin at the start of the string; otherwise there is no match.
If the expression ends with $, the match must end at the end of the string; otherwise there is no match.
Match symbol | Match meaning |
---|---|
^ | Match the beginning of the string |
$ | Match the end of the string |
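A minimal sketch of the two anchors, using re.search so that the effect of ^ is visible. The last lines show the standard re.M flag, under which ^ and $ also match at the start and end of each line; the two-line string is made up for the example.

```python
import re

url = "www.runoob.com"

starts_ok = re.search(r"^www", url)      # match: the string starts with "www"
starts_bad = re.search(r"^runoob", url)  # None: "runoob" is not at the start
ends_ok = re.search(r"com$", url)        # match: the string ends with "com"
ends_bad = re.search(r"runoob$", url)    # None: "runoob" is not at the end

# With re.M the anchors apply to every line, not only to the whole string.
first_words = re.findall(r"^\w+", "one\ntwo")             # ['one']
first_words_multi = re.findall(r"^\w+", "one\ntwo", re.M) # ['one', 'two']
```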
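b) Match everything except the specified characters
[^specified characters]: matches any character except the specified ones
# [^>]*> matches any run of characters that are not >, up to and including the next >
# | means "or" here
re.sub(r'<[^>]*>|\s| ','',strs) # replaces every match in strs with the empty string and returns the resulting string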
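The re.sub line above assumes a variable strs that is not defined in the text. Here is a runnable sketch with a made-up HTML fragment; the third alternative of the original pattern is left out because \s already matches a plain space.

```python
import re

# Hypothetical input: a small HTML fragment with tags, a space and a newline.
strs = "<p>Hello <b>world</b></p>\n"

# Remove every tag (<, then anything that is not >, then >) and all whitespace.
cleaned = re.sub(r"<[^>]*>|\s", "", strs)
print(cleaned)  # Helloworld

# A negated set on its own: keep only the characters that are not a, b, c, d or e.
others = re.findall(r"[^abcde]", "abcxyz")
print(others)   # ['x', 'y', 'z']
```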
4. Matching group
- The character '|' means "or" here, and the scope of the "or" is limited by parentheses ()
- The characters in () form a group, and num in group(num) selects which group to return
- \num refers to the text matched by group num earlier in the regular expression
- (?P<name>...) gives a group the alias name; (?P=name) refers back to the text matched by the group with that alias
Match symbol | Match meaning |
---|---|
| | Match any one of the left and right expressions |
(ab) | Use the characters in the parentheses as a group |
\num | Refer to the string matched by group num |
(?P<name>) | Give the group an alias |
(?P=name) | Quote the string matched by the name group by alias |
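The grouping symbols can be sketched as follows: alternation limited by parentheses, a numbered back-reference with \1, and a named group written (?P<name>...) that is referred to again with (?P=name). The addresses, tag names and group names are made up for the example.

```python
import re

# Alternation inside a group: the domain must be "163" or "qq".
mail = re.match(r"\w+@(163|qq)\.com", "hello@qq.com")
print(mail.group(1))  # qq

# \1 must repeat exactly what group 1 matched, so the closing tag has to agree.
same_tag = re.match(r"<(\w+)>.*</\1>", "<h1>hi</h1>")
wrong_tag = re.match(r"<(\w+)>.*</\1>", "<h1>hi</h2>")
print(same_tag.group(1))  # h1
print(wrong_tag)          # None

# Named groups: (?P<tag>...) defines the alias, (?P=tag) refers back to it.
named = re.match(r"<(?P<tag>\w+)>(?P<body>.*)</(?P=tag)>", "<h1>hi</h1>")
print(named.group("tag"))   # h1
print(named.group("body"))  # hi
```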
Three, searching with re.search
The difference from match is that search does not have to match at the beginning: it scans the text for the pattern and returns only the first match it finds.
import re
# Search the text for the pattern. Note: only the first match is returned
match_obj = re.search(r"\d+", "There are 20 fruits, 10 of them apples.")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("No match")
#---output-----
20
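To contrast the two functions directly, this sketch applies the same pattern to the same made-up sentence with re.match and with re.search.

```python
import re

text = "There are 20 fruits"

# re.match only looks at the start of the string, where there is no digit.
at_start = re.match(r"\d+", text)

# re.search scans forward and stops at the first match.
anywhere = re.search(r"\d+", text)

print(at_start)          # None
print(anywhere.group())  # 20
print(anywhere.span())   # (10, 12)
```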
Four, finding all matches with re.findall
Works like search, but returns every match in a list instead of only the first one
import re
# Find all matches of the pattern in the text
result = re.findall(r"\d+", "There are 20 fruits, 10 of them apples.")
print(result)
# ---output------
['20', '10']
Five, replacing matched data with re.sub
1. Use string to replace
import re
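# count is the maximum number of replacements: count=0 (the default) replaces all matches, count=1 replaces only the first
result = re.sub(r"\d+", "2", "comments: 10, likes: 20", count=1)
print(result)
# ---output------
comments: 2, likes: 20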
2. Use functions to replace
import re
# match_obj: the match object, passed in automatically by re.sub
def add(match_obj):
    # Get the matched text
    value = match_obj.group()
    result = int(value) + 1
    # The return value must be a string
    return str(result)
result = re.sub(r"\d+", add, "reads: 10")
print(result)
# ---output-----
reads: 11
Six, re.split (| means "or")
Split the string at every match of the pattern and return a list
import re
ret = re.split(r":| ",'info:xiaozhang 33 shangdong')
print(ret)
# ---output----
['info', 'xiaozhang', '33', 'shangdong']
Seven, greedy and non-greedy matching
Adding ? after "*", "?", "+" or "{m,n}" turns a greedy quantifier into a non-greedy one.
import re
s = "This is a number 234-235-22-423"
r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
#---output------
4-235-22-423
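When a regular expression pattern contains wildcards, it is evaluated from left to right and tries to "grab" the longest string that satisfies the match. In the example above, ".+" grabs the longest run of characters from the start of the string that still lets the pattern match, and that run includes most of the first integer field we wanted. "\d+" needs only one digit to match, so it matches the digit "4", while ".+" matches everything from the start of the string up to that digit 4.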
import re
s = "This is a number 234-235-22-423"
r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
#---output------
234-235-22-423
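Solution: use the non-greedy operator "?". It can be placed after "*", "+" or "?". The expression before the "?" then matches as little as possible, so it no longer consumes data that the expression after it can match.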
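Eight, the role of r
- In Python, prefixing a string with r makes it a raw string: backslashes in it do not need to be escaped. The prefix affects backslashes only.
- Regular expressions and Python string literals both use the backslash as an escape character, so matching one literal backslash with an ordinary string takes four backslashes. Raw strings solve this neatly: you no longer have to worry about a missing backslash, and the expression is easier to read.
- Recommendation: write every regular expression pattern with the r prefix. Remember that r only stops Python from processing backslashes; the regular expression engine still interprets them.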
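A short sketch of what the r prefix changes. It compares string lengths with and without the prefix and shows that an escaped ordinary string and a raw string produce the same pattern; the variable names are illustrative only.

```python
import re

# In an ordinary string "\n" is one newline character.
# In a raw string it is two characters: a backslash and the letter n.
plain_len = len("\n")   # 1
raw_len = len(r"\n")    # 2

# "\\d+" and r"\d+" are the same three characters, so they give the same regex.
same_pattern = ("\\d+" == r"\d+")   # True
number = re.search(r"\d+", "abc123").group()   # '123'

# Matching one literal backslash: four backslashes without r, two with r.
without_r = re.search("\\\\", "a\\b").group()
with_r = re.search(r"\\", "a\\b").group()
print(without_r == with_r)  # True
```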
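import re
# Without r, four backslashes in the pattern are needed to match one literal backslash
# (with r the same pattern would be written r'e\\/')
match_obj = re.search('e\\\\/', r'i have one nee\/dle')
print(match_obj.group())
#---output----
e\/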
import re
match_obj = re.match(r"<([a-zA-Z1-9]+)>.*</\1>", "<html>hh</html>")
if match_obj:
    print(match_obj.group())
    print(match_obj.group(1))
    print(match_obj.groups())
else:
    print("No match")
# ---output------
<html>hh</html>
html
('html',)