Using regular expressions in Python

  • A regular expression (often shortened to regex) is a computer science concept: a pattern that is usually used to find and replace text that follows certain rules.

1. Regular expression syntax

  • A regular expression is a compact notation that records a rule for matching text.

1. Line anchors

  • Line anchors describe the boundaries of a string. "^" marks the beginning of a line, and "$" marks the end of a line.
^tm     # Matches a string in which "tm" is at the start of the line.
# "tm equal Tomorrow Moon" matches; "Tomorrow Moon equal tm" does not.
tm$     # Matches a string in which "tm" is at the end of the line.
# "tm equal Tomorrow Moon" does not match; "Tomorrow Moon equal tm" matches.
tm     # Matches a string that contains "tm" at any position.
# "tm equal Tomorrow Moon" matches; "Tomorrow Moon equal tm" matches.
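
The three anchor cases above can be checked with a short Python sketch. It relies on re.search() from the standard library, which this article introduces later; the variable names are only illustrative.

```python
import re

samples = ["tm equal Tomorrow Moon", "Tomorrow Moon equal tm"]

# For each pattern, record which of the sample strings contain a match.
results = {
    pattern: [re.search(pattern, s) is not None for s in samples]
    for pattern in (r"^tm", r"tm$", r"tm")
}

for pattern, matched in results.items():
    print(pattern, matched)
```

Each list holds one True/False value per sample string, in the same order as `samples`.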

2. Metacharacters

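\bmr\w*\b     # Matches words that begin with "mr": first the start of a word (\b), then the letters "mr", then any number of letters or digits (\w*), and finally the end of the word (\b).
# This expression matches "mrsoft", "mrbook", "mr123456", and so on.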

Common metacharacters

Code  Description
.     Matches any character except a newline.
\w    Matches a letter, digit, underscore, or Chinese character.
\s    Matches any whitespace character.
\d    Matches a digit.
\b    Matches the start or end of a word.
^     Matches the beginning of a string.
$     Matches the end of a string.
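
As a rough illustration of these metacharacters, the sketch below runs the \bmr\w*\b expression and two of the single-character codes through re.findall(), a function covered later in this article. The sample sentence is made up for the example.

```python
import re

# Hypothetical sample text; "armrest" contains "mr" but not at a word start.
text = "mrsoft, mrbook and mr123456 match, but armrest does not"

# \b requires a word boundary before "mr", so "armrest" is skipped.
words = re.findall(r"\bmr\w*\b", text)
print(words)

# \d matches one digit at a time, \s one whitespace character at a time.
digits = re.findall(r"\d", "a1 b22")
spaces = re.findall(r"\s", "a1 b22")
print(digits, spaces)
```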

3. Repetition

  • "\w*" matches any number of letters or digits. To match a specific number of characters instead, regular expressions provide quantifiers (also called qualifiers), which specify how many times the preceding item may repeat. For example:
^\d{8}$       # Matches an 8-digit QQ number

Common quantifiers

Quantifier  Description                                                            Example
?           Matches the preceding character 0 or 1 time.                           colou?r matches "colour" and "color".
+           Matches the preceding character 1 or more times.                       go+gle matches "gogle", "google", "gooogle", and so on.
*           Matches the preceding character 0 or more times.                       go*gle matches "ggle", "gogle", "google", and so on.
{n}         Matches the preceding character exactly n times.                       go{2}gle matches only "google".
{n,}        Matches the preceding character at least n times.                      go{2,}gle matches "google", "gooogle", and so on.
{n,m}       Matches the preceding character at least n times and at most m times.  employe{0,2} matches "employ", "employe", and "employee".
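
The quantifier examples can be verified with a small table-driven sketch. It uses re.fullmatch(), a standard-library function that this article does not otherwise cover, so that each pattern has to match the entire test word. The helper name `matches` and the list of cases are assumptions made for this illustration.

```python
import re

def matches(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
cases = [
    (r"colou?r", "color", True),
    (r"colou?r", "colour", True),
    (r"go+gle", "ggle", False),        # + needs at least one "o"
    (r"go*gle", "ggle", True),         # * also accepts zero "o"
    (r"go{2}gle", "google", True),
    (r"go{2}gle", "gooogle", False),   # {2} means exactly two
    (r"go{2,}gle", "gooogle", True),
    (r"employe{0,2}", "employ", True),
    (r"employe{0,2}", "employeee", False),
    (r"\d{8}", "12345678", True),
]

outcomes = [matches(pattern, text) for pattern, text, _ in cases]
for (pattern, text, _), outcome in zip(cases, outcomes):
    print(pattern, text, outcome)
```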

4. Character classes

  • Finding digits or letters with a regular expression is simple, because metacharacters already exist for those character sets (such as \d and \w). But what if you want to match a set of characters that has no predefined metacharacter, such as the vowels a, e, i, o, u?
  • Just list the characters in square brackets: [aeiou] matches any English vowel, and [.?!] matches the punctuation marks ".", "?", or "!". You can also specify a range of characters: for ASCII text, [0-9] means the same as \d, a single digit, and [a-z0-9A-Z_] is equivalent to \w (if only English text is considered).
  • To match any single Chinese character in a string, use [\u4e00-\u9fa5]; to match several consecutive Chinese characters, use [\u4e00-\u9fa5]+.
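
A brief sketch of these character classes in Python follows. The sample strings are invented for the illustration, and re.findall() is described later in this article.

```python
import re

# A bracketed set matches exactly one character out of the listed ones.
vowels = re.findall(r"[aeiou]", "regular expression")
marks = re.findall(r"[.?!]", "Really? Yes! Done.")

# A range inside brackets: [0-9] finds the same ASCII digits as \d.
same_digits = re.findall(r"[0-9]", "a1b2") == re.findall(r"\d", "a1b2")

# Runs of consecutive Chinese characters in mixed text.
chinese = re.findall(r"[\u4e00-\u9fa5]+", "Python 正则表达式 re 模块")

print(vowels, marks, same_digits, chinese)
```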

5. Exclude characters

  • The previous section matched characters that belong to a specified set. The reverse is also possible: matching characters that do not belong to the set. Regular expressions use the "^" character for this. This metacharacter appeared earlier as the beginning-of-line anchor; placed as the first character inside square brackets, it means exclusion.
[^a-zA-Z]    # Matches one character that is not a letter
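
To see the two meanings of "^" side by side, here is a minimal sketch with a made-up sample string. It uses re.findall(), re.sub(), and re.search(), which are explained later in this article.

```python
import re

text = "abc123_XYZ!"

# Inside brackets, ^ negates the set: every character that is not a letter.
non_letters = re.findall(r"[^a-zA-Z]", text)

# Removing the non-letters leaves only the letters.
letters_only = re.sub(r"[^a-zA-Z]", "", text)

# Outside brackets, ^ is still the start anchor: the text starts with a letter,
# so "a non-letter at the start" is not found.
starts_with_non_letter = re.search(r"^[^a-zA-Z]", text) is not None

print(non_letters, letters_only, starts_with_non_letter)
```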

6. Select characters

  • Consider how to match an ID card number. First you need to know the rules: an ID card number is 15 or 18 characters long. A 15-character number consists entirely of digits; in an 18-character number, the first 17 characters are digits and the last one is a check character, which may be a digit or the letter X.
  • This description contains conditional logic, which is expressed with the selection character (|). It can be read as "or", and the expression that matches an ID card number can be written as follows:
(^\d{15}$)|(^\d{18}$)|(^\d{17})(\d|X|x)$  # This expression matches 15 digits, or 18 digits, or 17 digits followed by a final character that is a digit, X, or x.
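
The format rules above can be wrapped in a small helper, sketched below. The pattern is a simplified rewrite (an assumption of this example, not the expression given above): the alternatives are grouped once between a single pair of anchors. It only checks the shape of the number, not whether the check character is actually correct, and the function name is hypothetical.

```python
import re

# Simplified pattern (assumption): 15 digits, or 17 digits plus a digit/X/x.
ID_PATTERN = re.compile(r"^(\d{15}|\d{17}[\dXx])$")

def looks_like_id_number(text):
    """Return True if text has the 15- or 18-character ID number format."""
    return ID_PATTERN.match(text) is not None

print(looks_like_id_number("1" * 15))         # 15 digits
print(looks_like_id_number("1" * 17 + "X"))   # 17 digits and X
print(looks_like_id_number("1" * 16))         # wrong length
```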

7. Escape characters

  • The escape character (\) in a regular expression works much like the one in Python strings: it turns a special character (such as ".", "?", or "\") into an ordinary character. Take IP addresses as an example: suppose you want a regular expression that matches addresses in the format 127.0.0.1. If the dot is used directly, the expression is:
[1-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}
  • This is clearly wrong, because "." matches any character. The expression matches not only addresses like 127.0.0.1 but also strings like 127101011. To match a literal ".", prefix it with the escape character (\). The corrected regular expression is:
[1-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
  • Parentheses are also metacharacters in regular expressions, so a literal "(" or ")" must be escaped in the same way.
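
The difference between the escaped and unescaped dot can be demonstrated with the sketch below. It uses re.fullmatch() so that the whole candidate string has to match, and it also shows re.escape(), a standard-library helper that escapes metacharacters in a literal string. Both functions go beyond what this article covers, and the variable names are only illustrative.

```python
import re

unescaped = r"[1-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}"
escaped = r"[1-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

# For each candidate: (matches unescaped pattern, matches escaped pattern).
results = {
    candidate: (
        re.fullmatch(unescaped, candidate) is not None,
        re.fullmatch(escaped, candidate) is not None,
    )
    for candidate in ("127.0.0.1", "127101011")
}
print(results)

# re.escape() builds a pattern that matches the given text literally.
literal_pattern = re.escape("1+1=2?")
print(re.fullmatch(literal_pattern, "1+1=2?") is not None)
```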

8. Grouping

  • The ID card example in section 6 already showed one role of parentheses. Their first function is to change the scope of an operator such as "|", "*", or "^".
(thir|four)th   # Matches the word "thirth" or "fourth". Without the parentheses, the expression would match "thir" or "fourth" instead.
  • The second function of parentheses is grouping, that is, forming subexpressions. For example, (\.[0-9]{1,3}){3} repeats the group (\.[0-9]{1,3}) three times.
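
Both roles of parentheses can be observed in a short sketch. It uses re.fullmatch() (not covered in this article) so that the entire word must match, and the group() method of the Match object to read what the repeated group captured last.

```python
import re

grouped = r"(thir|four)th"
ungrouped = r"thir|fourth"

# With parentheses, "th" is appended to either alternative.
grouped_ok = [re.fullmatch(grouped, w) is not None for w in ("thirth", "fourth")]

# Without them, the alternatives are "thir" and "fourth".
ungrouped_ok = [re.fullmatch(ungrouped, w) is not None for w in ("thirth", "thir", "fourth")]

# A repeated group: the dot-plus-digits part must occur three times.
ip = re.fullmatch(r"[1-9]{1,3}(\.[0-9]{1,3}){3}", "192.168.1.66")
last_part = ip.group(1)   # a repeated group keeps only its last repetition

print(grouped_ok, ungrouped_ok, last_part)
```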

9. Using regular expression syntax in Python

  • When regular expressions are used in Python, they are written as pattern strings. For example, the regular expression that matches a character that is not a letter is written as the following pattern string:
'[^a-zA-Z]'
  • If you turn a regular expression that matches words beginning with the letter m into a pattern string, you cannot simply put quotes around it. For example, the following code is incorrect, because Python interprets "\b" in an ordinary string as a backspace character:
'\bm\w*\b'
  • Each "\" needs to be escaped, and the converted code is:
'\\bm\\w*\\b'
  • Since a pattern string may contain many special characters and backslashes, it is better written as a raw string, that is, with r or R in front of the pattern string. For example, the pattern string above written as a raw string is:
r'\bm\w*\b'
  • When writing pattern strings, not every backslash has to be escaped. For example, the backslash in the regular expression "^\d{8}$" written earlier does not need to be escaped, because "\d" is not an escape sequence that Python strings recognize (although recent Python versions warn about such unrecognized escapes). Even so, it is simpler and safer to write your own regular expressions as raw strings.
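
The effect of the three spellings can be compared directly in Python. The sketch below is only an illustration; the sample sentence is invented, and the first string is shortened to '\bm' so that it contains nothing but the backspace problem.

```python
import re

backspace_version = '\bm'    # "\b" is a backspace character here, not a word boundary
escaped = '\\bm\\w*\\b'      # every backslash doubled by hand
raw = r'\bm\w*\b'            # raw string: backslashes are kept as written

print(len(backspace_version))    # two characters: backspace and "m"
print(escaped == raw)            # both spell the same pattern

sentence = "my map is on the modem shelf"
found_raw = re.findall(raw, sentence)
found_backspace = re.findall(backspace_version, sentence)
print(found_raw, found_backspace)
```

The unescaped version finds nothing because the sentence contains no backspace character.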

2. Use the re module to implement regular expression operations

  • The syntax of regular expressions was introduced above; this part shows how to use them in Python. Python provides the re module for regular expression operations. You can either call the methods of the re module (such as search(), match(), and findall()) directly to process strings, or use the module's compile() method to convert the pattern string into a regular expression object and then call that object's methods to process strings.
  • Before the re module can be used, it must be imported with the import statement:
import re
  • If the module is used without being imported, Python raises a NameError saying that the name 're' is not defined.
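
The second approach mentioned above, compiling the pattern first, might look like the following sketch. The pattern and the sample text are borrowed from the examples later in this article; the variable names are assumptions.

```python
import re

# Compile once; the flag is stored with the pattern object.
shop_pattern = re.compile(r"mr_\w+", re.I)

text = "MR_SHOP mr_shop"
first = shop_pattern.search(text)            # first match as a Match object
every = shop_pattern.findall(text)           # all matches as a list
replaced = shop_pattern.sub("***", text)     # replace every match

print(first.group(), every, replaced)
```

Compiling is mainly a convenience when the same pattern is used many times; the module-level functions accept the same pattern string directly.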

1. Match string

  • To match strings, you can use methods such as match(), search(), and findall() provided by the re module.

1. Use the match() method to match

  • The match() method matches from the beginning of the string. If the match succeeds at the starting position, a Match object is returned; otherwise None is returned. The syntax is as follows:
re.match(pattern, string, [flags])
# pattern: the pattern string, converted from the regular expression to match.
# string: the string to match against.
# flags: optional; flags that control how matching is performed, such as whether it is case-sensitive. Common flags are listed in the table below.

Flag             Description
A or ASCII       Perform ASCII-only matching for \w, \W, \b, \B, \d, \D, \s, and \S (Python 3.x only).
I or IGNORECASE  Perform case-insensitive matching.
M or MULTILINE   Make ^ and $ match at the beginning and end of each line as well as of the whole string (by default, only at the beginning and end of the whole string).
S or DOTALL      Make "." match every character, including newlines.
X or VERBOSE     Ignore unescaped whitespace and comments in the pattern string.
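
The effect of several of these flags is easiest to see in a side-by-side sketch. The sample text and the "ddd-dddd" number format are made up for this illustration.

```python
import re

text = "first line\nsecond line"

# Without MULTILINE, ^ only matches at the very start of the string.
default_starts = re.findall(r"^\w+", text)
multiline_starts = re.findall(r"^\w+", text, re.M)

# Without DOTALL, "." does not match the newline between the two lines.
dot_default = re.search(r"line.second", text)
dot_all = re.search(r"line.second", text, re.S)

# VERBOSE allows whitespace and comments inside the pattern itself.
number = re.compile(r"""
    \d{3}   # first three digits
    -       # literal hyphen
    \d{4}   # last four digits
""", re.X)

# Flags can be combined with the | operator.
combined = re.findall(r"^second", "First\nSECOND", re.I | re.M)

print(default_starts, multiline_starts, dot_default, dot_all.group(), combined)
```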

Example one

  • For example, check whether a string starts with "mr_", ignoring case:
import re

pattern = r'mr_\w+'     # pattern string
string = 'MR_SHOP mr_shop'     # string to match
match = re.match(pattern, string, re.I)    # match the string, ignoring case
print(match)    # print the match result
string = 'Project name MR_SHOP mr_shop'
match = re.match(pattern, string, re.I)   # match the string, ignoring case
print(match)    # print the match result
  • The execution results are as follows:
<re.Match object; span=(0, 7), match='MR_SHOP'>
None
  • The string "MR_SHOP mr_shop" starts with "mr_" (ignoring case), so a Match object is returned, while the string "Project name MR_SHOP mr_shop" does not start with "mr_", so None is returned. This is because match() starts matching at the beginning of the string: if the beginning does not satisfy the pattern, it does not look any further and returns None directly.
  • The Match object contains the position of the matched value and the matched data. The start() method returns the start position of the match, the end() method returns its end position, the span() method returns both positions as a tuple, the string attribute holds the string that was matched against, and the group() method returns the matched text.
import re

pattern = r'mr_\w+'     # pattern string
string = 'MR_SHOP mr_shop'    # string to match
match = re.match(pattern, string, re.I)    # match the string, ignoring case
print('Start position of the match:', match.start())
print('End position of the match:', match.end())
print('Span of the match:', match.span())
print('String being matched:', match.string)
print('Matched data:', match.group())
  • Result:
Start position of the match: 0
End position of the match: 7
Span of the match: (0, 7)
String being matched: MR_SHOP mr_shop
Matched data: MR_SHOP

Example two

  • Verify that an entered mobile phone number is valid (here, a China Mobile number):
import re

pattern = r'(13[4-9]\d{8})$|(15[01289]\d{8})$'    # number prefixes accepted in this example
mobile = '13634222222'
match = re.match(pattern, mobile)     # perform the pattern match
if match is None:     # None means the match failed
    print(mobile, 'is not a valid China Mobile number.')
else:
    print(mobile, 'is a valid China Mobile number.')
mobile = '13144222221'
match = re.match(pattern, mobile)    # perform the pattern match
if match is None:     # None means the match failed
    print(mobile, 'is not a valid China Mobile number.')
else:
    print(mobile, 'is a valid China Mobile number.')
  • Result:
13634222222 is a valid China Mobile number.
13144222221 is not a valid China Mobile number.

2. Use the search() method to match

  • The search() method searches the whole string for the first matching value. If a match is found anywhere in the string, it returns a Match object; otherwise it returns None. The syntax is as follows:
re.search(pattern, string, [flags])
# pattern: the pattern string, converted from the regular expression to match.
# string: the string to match against.
# flags: optional; flags that control how matching is performed, such as whether it is case-sensitive.

Example one

  • Search for the first substring that starts with "mr_", ignoring case:
import re

pattern = r'mr_\w+'     # pattern string
string = 'MR_SHOP mr_shop'     # string to match
match = re.search(pattern, string, re.I)    # search the string, ignoring case
print(match)    # print the match result
string = 'Project name MR_SHOP mr_shop'
match = re.search(pattern, string, re.I)    # search the string, ignoring case
print(match)   # print the match result
  • The result is as follows:
<re.Match object; span=(0, 7), match='MR_SHOP'>
<re.Match object; span=(13, 20), match='MR_SHOP'>
  • As the results show, search() does not only look at the start of the string; it also finds matches at other positions.
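
The difference between match() and search() can be summarized in one sketch, shown next to re.fullmatch(), a related standard-library function that this article does not cover and that requires the pattern to span the entire string. The sample text is an assumption of this example.

```python
import re

pattern = r"mr_\w+"
text = "Project name MR_SHOP mr_shop"   # sample text for this sketch

at_start = re.match(pattern, text, re.I)      # None: the text does not start with "mr_"
anywhere = re.search(pattern, text, re.I)     # first match anywhere in the text
whole = re.fullmatch(pattern, text, re.I)     # None: the pattern must cover the whole text
whole_word = re.fullmatch(pattern, "mr_shop")

print(at_start, anywhere.group(), anywhere.span(), whole, whole_word.group())
```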

Example two

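  • Check whether dangerous words are present:
import re     # the re module

pattern = r'(hacker)|(packet capture)|(eavesdropping)|(Trojan)'     # pattern string
about = 'I am a programmer, I like reading hacker books, and I want to study Trojan programs.'
match = re.search(pattern, about)   # perform the pattern search
if match is None:     # None means nothing was found
    print(about, '@ Safe!')
else:
    print(about, '@ Dangerous words found!')
about = 'I am a programmer, I like reading books on computer networks, and I enjoy building websites.'
match = re.search(pattern, about)    # perform the pattern search
if match is None:     # None means nothing was found
    print(about, '@ Safe!')
else:
    print(about, '@ Dangerous words found!')
  • The result is as follows:
I am a programmer, I like reading hacker books, and I want to study Trojan programs. @ Dangerous words found!
I am a programmer, I like reading books on computer networks, and I enjoy building websites. @ Safe!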

3. Use the findall() method to match

  • The findall() method searches the whole string for all substrings that match the regular expression and returns them as a list. If there are matches, the list contains the matched results; otherwise an empty list is returned.
  • The syntax is as follows:
re.findall(pattern, string, [flags])
# pattern: the pattern string, converted from the regular expression to match.
# string: the string to match against.
# flags: optional; flags that control how matching is performed, such as whether it is case-sensitive.

Example one

  • Search for substrings beginning with "mr_":
import re

pattern = r'mr_\w+'     # pattern string
string = 'MR_SHOP mr_shop'    # string to match
match = re.findall(pattern, string, re.I)    # search the string, ignoring case
print(match)    # print the match result
string = 'Project name MR_SHOP mr_shop'
match = re.findall(pattern, string)     # search the string, case-sensitive
print(match)    # print the match result
  • Result:
['MR_SHOP', 'mr_shop']
['mr_shop']

Example two

  • If the pattern string contains a group, findall() returns a list of the text matched by the group:
import re
pattern = r'[1-9]{1,3}(\.[0-9]{1,3}){3}'   # pattern string
str1 = '127.0.0.1 192.168.1.66'    # string to match
match = re.findall(pattern, str1)    # perform the pattern match
print(match)
  • The result is as follows:
['.1', '.66']
  • As the result shows, the IP addresses themselves are not returned. Because the pattern string contains a group, the result holds what the group matched, that is, the last text matched by "(\.[0-9]{1,3})". To get the matches for the whole pattern string, wrap the entire pattern in a pair of parentheses. Each element of the returned list is then a tuple, and only its first element is needed.
import re
pattern = r'([1-9]{1,3}(\.[0-9]{1,3}){3})'    # pattern string
str1 = '127.0.0.1 192.168.1.66'    # string to match
match = re.findall(pattern, str1)     # perform the pattern match
for item in match:
    print(item[0])    # first element of each tuple: the whole address
  • The result is as follows:
127.0.0.1
192.168.1.66
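
An alternative to wrapping the whole pattern in an extra group is a non-capturing group, written (?:...), which groups without adding an entry to the findall() result. The sketch below shows this variant together with re.finditer(), which yields Match objects; neither feature is covered elsewhere in this article, so treat it as one possible approach rather than the only one.

```python
import re

# (?:...) groups the repeated part but does not capture it.
pattern = r"[1-9]{1,3}(?:\.[0-9]{1,3}){3}"
text = "127.0.0.1 192.168.1.66"

# With no capturing groups, findall() returns the whole matches.
addresses = re.findall(pattern, text)

# finditer() gives Match objects, so positions are available too.
positions = [(m.group(), m.span()) for m in re.finditer(pattern, text)]

print(addresses)
print(positions)
```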

2. Replace string - sub() method

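  • The sub() method performs string replacement. The syntax is as follows:
re.sub(pattern, repl, string, [count], [flags])
# pattern: the pattern string, converted from the regular expression to match.
# repl: the replacement string.
# string: the original string to search and replace in.
# count: optional; the maximum number of replacements to make. The default is 0, which replaces all matches.
# flags: optional; flags that control how matching is performed, such as whether it is case-sensitive.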

Example one

  • Hide the mobile phone number in a prize announcement:
import re
pattern = r'1[34578]\d{9}'     # pattern string for the text to replace
string = 'Winning number: 84978981 Phone: 13611111111'
result = re.sub(pattern, '1XXXXXXXXXX', string)    # replace the matched text
print(result)
  • The result is as follows:
Winning number: 84978981 Phone: 1XXXXXXXXXX

Example two

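  • Replace occurrences of dangerous words:
import re
pattern = r'(hacker)|(packet capture)|(eavesdropping)|(Trojan)'    # pattern string
about = 'I am a programmer, I like reading hacker books, and I want to study Trojan programs.'
sub = re.sub(pattern, '@_@', about)    # perform the replacement
print(sub)
about = 'I am a programmer, I like reading books on computer networks, and I enjoy building websites.'
sub = re.sub(pattern, '@_@', about)    # perform the replacement
print(sub)
  • The result is as follows:
I am a programmer, I like reading @_@ books, and I want to study @_@ programs.
I am a programmer, I like reading books on computer networks, and I enjoy building websites.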
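
Two further options of sub() are worth a quick sketch: the count parameter described above, and passing a function as repl, which the re module also accepts. The helper name mask_phone and the sample text are assumptions made for this illustration.

```python
import re

def mask_phone(match):
    """Keep the first three and last four digits, mask the middle four."""
    number = match.group()
    return number[:3] + "****" + number[-4:]

pattern = r"1[34578]\d{9}"
text = "Phone: 13611111111, backup: 15822223333"

# repl may be a function that receives each Match object.
masked = re.sub(pattern, mask_phone, text)

# count=1 replaces only the first match.
first_only = re.sub(pattern, "1XXXXXXXXXX", text, count=1)

print(masked)
print(first_only)
```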

3. Use regular expressions to split strings - split() method

  • The split() method splits a string according to a regular expression and returns the pieces as a list. It works like the split() method of string objects, except that the separator is specified by a pattern string. The syntax is as follows:
re.split(pattern, string, [maxsplit], [flags])
# pattern: the pattern string, converted from the regular expression to match.
# string: the string to split.
# maxsplit: optional; the maximum number of splits.
# flags: optional; flags that control how matching is performed, such as whether it is case-sensitive.

Example one

  • Extract the request address and the individual parameters from a given URL:
import re

pattern = r'[?&]'    # separators: "?" or "&" (no "|" is needed inside a character class)
url = 'http://www.mingrisoft.com/login.jsp?username="mr"&pwd="mrsoft"'
result = re.split(pattern, url)    # split the string
print(result)
  • The result is as follows:
['http://www.mingrisoft.com/login.jsp', 'username="mr"', 'pwd="mrsoft"']
  • Scenario: in the @ friends field of Weibo, entering "@MingriTech @Zuckerberg @Gates" (with a space between the friends' names) mentions three friends at once.

Example two

  • Output the names of the friends who were mentioned with @:
import re

str1 = '@MingriTech @Zuckerberg @Gates'
pattern = r'\s*@'
list1 = re.split(pattern, str1)    # split on "@", with or without leading whitespace
print('Friends you mentioned:')
for item in list1:
    if item != '':     # skip empty elements
        print(item)    # print each friend's name
  • Result:
Friends you mentioned:
MingriTech
Zuckerberg
Gates
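
The maxsplit parameter listed in the syntax above can be combined with the URL example, as in the following sketch. Turning the parameters into a dictionary is an addition made for this illustration, and the variable names are assumptions.

```python
import re

url = 'http://www.mingrisoft.com/login.jsp?username="mr"&pwd="mrsoft"'

# maxsplit=1 stops after the first separator, so the query string stays in one piece.
address, query = re.split(r"[?&]", url, maxsplit=1)

# Split the query into name=value pairs and collect them in a dict.
params = dict(pair.split("=", 1) for pair in re.split(r"&", query))

print(address)
print(params)
```

Passing maxsplit as a keyword argument keeps the call readable and avoids confusing it with the flags argument.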

Origin blog.csdn.net/ungoing/article/details/130780443