Detailed explanation of Python's regular expression methods

Regular expressions are a basic tool in NLP. A regular expression is a sequence of characters that defines a search pattern, used mainly for matching patterns in strings. In Python, the re module provides the regular expression operations.

One, re.match matching

1. The usage of re.match

re.match tries to match a pattern at the beginning of the string. If the pattern does not match at the beginning, match() returns None.

a) Function syntax

re.match(pattern, string, flags=0)
# re.match(<regular expression>, <string to match>)

b) Function parameter description

parameter	description
pattern	The regular expression to match
string	The string to match.
flags	Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.
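
To make the flags parameter concrete, here is a minimal sketch using re.I (case-insensitive matching). The upper-case pattern 'WWW' is chosen only for illustration; the target string is the one used in the examples of this article.

```python
import re

# Without flags the match is case-sensitive, so 'WWW' does not match 'www...'
no_flag = re.match('WWW', 'www.runoob.com')

# re.I (re.IGNORECASE) makes the same pattern match regardless of case
with_flag = re.match('WWW', 'www.runoob.com', flags=re.I)

print(no_flag)             # None
print(with_flag.group())   # www
```

Several flags can be combined with |, for example re.M|re.I.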

c) Return the matching object

If the match succeeds, re.match returns a match object; otherwise it returns None.
We can use the match object's group(num) or groups() method to get the matched text.

Match object method	description
group(num=0)	Returns the string matched by the whole expression (group 0) or by group num. group() can take several group numbers at once, in which case it returns a tuple with the values of those groups.
groups()	Returns a tuple with the strings of all groups, from group 1 up to the last group.
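
The two methods in the table can be sketched as follows. The date string and the variable names are hypothetical, chosen only to give the pattern two groups.

```python
import re

# Two groups: the year and the month of a hypothetical date string
m = re.match(r'(\d{4})-(\d{2})', '2024-05-17')

whole = m.group()        # group 0: the entire match
pair = m.group(1, 2)     # several group numbers at once -> tuple
all_groups = m.groups()  # every group from 1 upward -> tuple

print(whole)       # 2024-05
print(pair)        # ('2024', '05')
print(all_groups)  # ('2024', '05')
```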

2. Examples of re.match

a) Example 1

import re
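print(re.match('www', 'www.runoob.com'))  # matches at the start of the string
print(re.match('www', 'www.runoob.com').group())  # returns the matched text
print(re.match('www', 'www.runoob.com').span())  # returns the index range of the match in the text
print(re.match('com', 'www.runoob.com'))  # 'com' is not at the start, so there is no match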

# ---output-----
<re.Match object; span=(0, 3), match='www'>
www
(0, 3)
None

note:

The span() method of the returned match object gives the start and end indexes of the match.
If there is no match, re.match returns None, so calling group() or span() on the result raises AttributeError.

b) Example 2

import re

line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
    print("matchObj.groups() : ", matchObj.groups())
else:
    print("No match!!")
   
# ---output------------
matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter
matchObj.groups() :  ('Cats', 'smarter')

Two, regular expression patterns

1. Match a single character

Match symbol	Match meaning
.	Match any single character (except \n; use re.S to include \n)
[ ]	Match any one of the characters listed in []
\d	Match a digit, e.g. 0-9
\D	Match a non-digit
\s	Match whitespace, i.e. space, tab, newline and similar
\S	Match non-whitespace
\w	Match a word character, i.e. a-z, A-Z, 0-9, _, and other letters such as Chinese characters
\W	Match a non-word character, i.e. anything that is not a letter, digit, underscore or Chinese character

note:

'.' matches any character except \n. To match \n as well, pass re.S as the flags argument.
\w also matches letters from other languages (for str patterns), so use it with caution.
\s also matches \n.
[] accepts ranges: [0-9] matches the 10 digits, [a-z] matches the 26 lowercase letters.
[^...] matches any character except the ones listed, e.g. [^abcde].
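
The notes above can be checked with a short sketch. The sample strings are made up for illustration, and re.A (ASCII-only matching) is an addition not mentioned in the text, shown as one way to keep \w from matching Chinese characters.

```python
import re

text = 'id_42\tok\n'

digits = re.findall(r'\d', text)                 # ['4', '2']
word_runs = re.findall(r'\w+', text)             # ['id_42', 'ok']
spaces = re.findall(r'\s', text)                 # ['\t', '\n'] (\s matches \n too)
not_vowels = re.findall(r'[^aeiou\s]', 'a bee')  # ['b']

# '.' stops at a newline unless re.S (re.DOTALL) is given
dot_default = re.match(r'.+', 'ab\ncd').group()    # 'ab'
dot_all = re.match(r'.+', 'ab\ncd', re.S).group()  # 'ab\ncd'

# \w matches Chinese characters in str patterns; re.A restricts it to ASCII
unicode_words = re.findall(r'\w+', '你好 world')        # ['你好', 'world']
ascii_words = re.findall(r'\w+', '你好 world', re.A)    # ['world']
```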

2. Match multiple characters

Match symbol	Match meaning
*	Match the previous character 0 or more times, i.e. it is optional
+	Match the previous character 1 or more times, i.e. at least once
?	Match the previous character 0 or 1 times, i.e. at most once
{m}	Match the previous character exactly m times
{m,n}	Match the previous character from m to n times

Note:
These quantifiers are greedy: under the same conditions *, +, ? and {m,n} (e.g. {1,5}) match as many characters as possible. To turn off greedy matching, use *?, +?, ?? or {m,n}?.
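
A minimal sketch of the quantifiers in the table, assuming a made-up list of sample strings. re.fullmatch is not used elsewhere in this article; it is used here only so that each pattern has to cover the whole sample.

```python
import re

samples = ['ac', 'abc', 'abbc', 'abbbc']

star = [s for s in samples if re.fullmatch(r'ab*c', s)]        # b zero or more times
plus = [s for s in samples if re.fullmatch(r'ab+c', s)]        # b at least once
optional = [s for s in samples if re.fullmatch(r'ab?c', s)]    # b at most once
exact = [s for s in samples if re.fullmatch(r'ab{2}c', s)]     # b exactly twice
ranged = [s for s in samples if re.fullmatch(r'ab{2,3}c', s)]  # b two or three times

# Greedy {1,5} takes as many digits as allowed; {1,5}? takes as few as possible
greedy = re.match(r'\d{1,5}', '1234567').group()   # '12345'
lazy = re.match(r'\d{1,5}?', '1234567').group()    # '1'
```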

3. Match the beginning and end, and exclude specified characters

a) Match the beginning and end

If the pattern starts with ^, the match must begin at the first character of the string; otherwise there is no match.
If the pattern ends with $, the match must end at the last character of the string (a single trailing newline is allowed); otherwise there is no match.

Match symbol	Match meaning
^	Match the beginning of the string
$	Match end of string
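
The two anchors can be sketched with re.search, which would otherwise find the pattern anywhere in the string. The variable names and the digit check are illustrative assumptions.

```python
import re

url = 'www.runoob.com'

starts_www = re.search(r'^www', url)   # match: the string starts with 'www'
starts_com = re.search(r'^com', url)   # None: 'com' is not at the start
ends_com = re.search(r'com$', url)     # match: the string ends with 'com'
ends_www = re.search(r'www$', url)     # None: 'www' is not at the end

# With both anchors the whole string has to fit the pattern
is_number = re.search(r'^\d+$', '12345') is not None    # True
not_number = re.search(r'^\d+$', '123abc') is not None  # False
```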

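b) Match everything except the specified characters

[^specified characters]: matches any character except the specified ones
# [^>]*> matches any run of characters that are not >, up to and including the next >
# | means "or" here
re.sub(r'<[^>]*>|\s|&nbsp;','',strs)       # replaces every match in strs with nothing and returns the new string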
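
The call above can be tried on a small input. The HTML fragment assigned to strs here is a made-up sample, since the original does not define strs.

```python
import re

# Hypothetical input: an HTML fragment with tags, an &nbsp; entity and a newline
strs = '<p>Hello&nbsp;<b>world</b></p>\n'

# Remove tags (<...>), whitespace characters and &nbsp; entities in one pass
cleaned = re.sub(r'<[^>]*>|\s|&nbsp;', '', strs)

print(cleaned)  # Helloworld
```

Note that \s also removes ordinary spaces between words, so this pattern only suits text where whitespace carries no meaning.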

4. Matching group

The character '|' means "or", and the scope of the alternation is limited by parentheses ()
The characters in () form a group, and the num in group(num) specifies which group to retrieve
\num refers to the string matched by group num in the regular expression
(?P<name>) gives a group the alias name, and (?P=name) refers to the string matched by the group with that alias

Match symbol	Match meaning
|	Match either the left or the right expression
(ab)	Use the characters in the parentheses as a group
\num	Refer to the string matched by group num
(?P<name>)	Give the group an alias
(?P=name)	Refer to the string matched by the group with the alias name
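
A sketch of each symbol in the table. The mail address, the tags and the group names tag and body are hypothetical, chosen only to exercise the syntax.

```python
import re

# | inside (): the domain must be 163, 126 or qq
m1 = re.match(r'\w+@(163|126|qq)\.com$', 'hello@163.com')
domain = m1.group(1)   # '163'

# \1 refers back to group 1, so the closing tag must repeat the opening tag
m2 = re.match(r'<(\w+)>.*</\1>', '<b>bold</b>')   # match
m3 = re.match(r'<(\w+)>.*</\1>', '<b>bold</i>')   # None: tags differ

# (?P<tag>...) names a group, (?P=tag) refers back to it by name
m4 = re.match(r'<(?P<tag>\w+)>(?P<body>.*)</(?P=tag)>', '<h1>title</h1>')
tag = m4.group('tag')    # 'h1'
body = m4.group('body')  # 'title'
```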

Three, re.search matching

The difference from match: search does not have to match from the beginning. It scans the text for the pattern and returns only the first match.

import re

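# Search for data with a regular expression. Note: only the first match is returned
match_obj = re.search(r"\d+", "There are 20 fruits, 10 of them are apples.")
if match_obj:
    # Get the matched text
    print(match_obj.group())
else:
    print("No match")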

#---output-----
20
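
To make the difference from re.match concrete, the following sketch runs both functions on the same string. The sample text is made up for illustration.

```python
import re

text = 'total: 20 items'

by_match = re.match(r'\d+', text)    # None: the string does not start with a digit
by_search = re.search(r'\d+', text)  # finds the first run of digits anywhere

print(by_match)           # None
print(by_search.group())  # 20
print(by_search.span())   # (7, 9)
```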

Four, re.findall matching

Basically the same as search, but it returns every match instead of only the first one

import re

# Find all matches of the regular expression in the string
result = re.findall(r"\d+", "There are 20 fruits, 10 of them are apples.")
print(result)

# ---output------
['20', '10']
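
One detail worth knowing: when the pattern contains groups, re.findall returns the group contents rather than the whole match. A small sketch, with a made-up key=value string:

```python
import re

text = 'apples=10, pears=20'

# No groups: each item is the whole match
whole = re.findall(r'\w+=\d+', text)      # ['apples=10', 'pears=20']

# Two groups: each item is a tuple of the group contents
pairs = re.findall(r'(\w+)=(\d+)', text)  # [('apples', '10'), ('pears', '20')]

# No match at all gives an empty list, not None
nothing = re.findall(r'\d+', 'no digits here')  # []
```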

Five, re.sub replaces the matched data

1. Replace with a string

import re

# count=0 (the default) replaces every match; count=1 replaces only the first match
result = re.sub(r"\d+", "2", "comments:10,likes:20", count=1)
print(result)

# ---output------
comments:2,likes:20

2. Replace with a function

import re

# match_obj: passed in automatically by re.sub for each match
def add(match_obj):
    # Get the matched text
    value = match_obj.group()
    result = int(value) + 1
    # The return value must be a string
    return str(result)

result = re.sub(r"\d+", add, "views:10")
print(result)

# ---output-----
views:11

Six, re.split (| means or)

Split the string at every match and return a list

import re 

ret = re.split(r":| ",'info:xiaozhang 33 shangdong')
print(ret)

# ---output----
['info', 'xiaozhang', '33', 'shangdong']
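
Two variations on the same input, as a sketch: the maxsplit argument limits the number of splits, and a group around the separator keeps the separators in the result. Both are standard re.split behavior; the variable names are illustrative.

```python
import re

s = 'info:xiaozhang 33 shangdong'

# maxsplit=1: split only at the first separator
first_only = re.split(r':| ', s, maxsplit=1)
# ['info', 'xiaozhang 33 shangdong']

# A group around the separator keeps the separators in the list
with_seps = re.split(r'(:| )', 'info:xiaozhang 33')
# ['info', ':', 'xiaozhang', ' ', '33']

# A character class with + treats a run of separators as one
collapsed = re.split(r'[: ]+', 'info:  xiaozhang')
# ['info', 'xiaozhang']
```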

Seven, greedy and non-greedy matching

Append "?" to "*", "?", "+" or "{m,n}" to turn greedy matching into non-greedy matching.

import re 

s = "This is a number 234-235-22-423"
r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
print(r.group(1))

#---output------
4-235-22-423

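When a pattern contains wildcards, the engine evaluates it from left to right and grabs the longest string that still lets the whole pattern match. In the example above, ".+" takes as many characters as it can from the start of the string, including most of the first integer field we wanted. "\d+" needs only one character to match, so it gets just the digit "4", while ".+" matches everything from the start of the string up to that digit.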

import re 

s = "This is a number 234-235-22-423"
r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
print(r.group(1))

#---output------
234-235-22-423

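Solution: use the non-greedy operator "?". It can follow "*", "+", "?" or "{m,n}" and makes the preceding expression match as few characters as possible, so it no longer consumes data that the expression after it should match.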
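
The same effect shows up when extracting repeated tags. A minimal sketch with a made-up HTML string:

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: .* runs to the last </b>, so the two elements merge into one match
greedy = re.findall(r'<b>.*</b>', html)
# ['<b>one</b><b>two</b>']

# Non-greedy: .*? stops at the first </b> each time
lazy = re.findall(r'<b>.*?</b>', html)
# ['<b>one</b>', '<b>two</b>']

# With a group, findall returns only the text between the tags
inner = re.findall(r'<b>(.*?)</b>', html)
# ['one', 'two']
```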

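Eight, the role of r

In Python, prefixing a string with r makes it a raw string: backslashes in it are taken literally and do not need to be escaped. The prefix affects only backslashes.
Regular expressions use the backslash as their own escape character, so in an ordinary string a literal backslash has to be written four times in the pattern. Raw strings solve this problem: you no longer need to worry about missing a backslash, and the expressions are easier to read.
Suggestion: add r to every pattern you pass to the re module. Remember that r only affects backslashes, which then need no escaping at the string level.

import re

# Without r: four backslashes in the pattern match one literal backslash
# (with r the same pattern is written r'e\\/')
match_obj = re.search('e\\\\/', 'i have one nee\\/dle')
print(repr(match_obj.group()))

#---output----
'e\\/'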
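
A short sketch comparing ordinary and raw string literals. The decimal-number pattern and the sample sentence are illustrative assumptions.

```python
import re

# The same pattern written without and with the r prefix
plain = '\\d+\\.\\d+'
raw = r'\d+\.\d+'
same_pattern = (plain == raw)   # True: both contain the characters \d+\.\d+

# r changes how the literal is read, not the resulting type
newline_len = len('\n')     # 1: a single newline character
raw_len = len(r'\n')        # 2: a backslash followed by n

number = re.search(raw, 'pi is about 3.14').group()   # '3.14'

# Matching one literal backslash still needs \\ inside the regex itself
found = re.search(r'e\\/', r'i have one nee\/dle').group()
```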

import re 

# \1 refers back to group 1, so the closing tag must match the opening tag
match_obj = re.match(r"<([a-zA-Z0-9]+)>.*</\1>", "<html>hh</html>")
if match_obj:
    print(match_obj.group())
    print(match_obj.group(1))
    print(match_obj.groups())
else:
    print("No match")

# ---output------
<html>hh</html>
html
('html',)