A regular expression (often shortened to regex) is a computer science concept that is typically used to search for and replace text matching certain rules.
1. Regular expression syntax
A regular expression is a pattern that records the rules a piece of text must follow.
1. Line anchors
Line anchors describe the boundaries of a string: "^" marks the beginning of a line, and "$" marks the end of a line.
2. Metacharacters
\w
Matches a letter, digit, underscore, or Chinese character.
\s
Matches any whitespace character.
\d
Matches a digit.
\b
Matches the start or end of a word.
^
Matches the beginning of a string.
$
Matches the end of a string.
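The metacharacters above can be tried out with Python's re module, which the second part of this article introduces. The following is a minimal sketch; the sample strings are made up for illustration only.

```python
import re

text = "user_01 paid 25 yuan"  # hypothetical sample text

# \w+ grabs runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# \d+ grabs runs of digits.
numbers = re.findall(r"\d+", text)

# \s matches each single whitespace character.
spaces = re.findall(r"\s", text)

# \b is a word boundary, so only the standalone word "paid" matches.
paid = re.findall(r"\bpaid\b", "paid prepaid paid_up")

# ^ and $ pin the pattern to the start and the end of the string.
starts_with_user = re.search(r"^user", text) is not None
ends_with_yuan = re.search(r"yuan$", text) is not None

print(words, numbers, spaces, paid, starts_with_user, ends_with_yuan)
```

Note that "paid_up" is not matched by \bpaid\b, because the underscore counts as a word character and so there is no boundary after "paid".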
3. Repetition
"\w*" matches any number of letters or digits. But how do you match a specific number of digits? Regular expressions provide quantifiers, which specify how many times the preceding character may occur. For example:
^\d{8}$    # matches an 8-digit QQ number
Common quantifiers
Quantifier  Description                                                      Example
?           Matches the preceding character 0 or 1 times.                    colou?r matches colour and color.
+           Matches the preceding character 1 or more times.                 go+gle matches gogle, google, gooogle, and so on.
*           Matches the preceding character 0 or more times.                 go*gle matches ggle, gogle, google, and so on.
{n}         Matches the preceding character exactly n times.                 go{2}gle matches only google.
{n,}        Matches the preceding character at least n times.                go{2,}gle matches google, gooogle, and so on.
{n,m}       Matches the preceding character at least n and at most m times.  employe{0,2} matches employ, employe, and employee.
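The quantifiers can be checked with a short Python sketch. The helper name `matches` and the candidate word lists are hypothetical, and re.fullmatch() is used so that a candidate counts only when the whole string fits the pattern. The last pattern is written as employe{0,2}, without a space inside the braces, because a space would stop the braces from being read as a quantifier.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

results = {
    "?": matches(r"colou?r", ["color", "colour", "colouur"]),
    "+": matches(r"go+gle", ["ggle", "gogle", "google"]),
    "*": matches(r"go*gle", ["ggle", "gogle", "google"]),
    "{n}": matches(r"go{2}gle", ["gogle", "google", "gooogle"]),
    "{n,}": matches(r"go{2,}gle", ["gogle", "google", "gooogle"]),
    "{n,m}": matches(r"employe{0,2}", ["employ", "employe", "employee", "employeee"]),
    "digits": matches(r"\d{8}", ["1234567", "12345678", "123456789"]),
}

for quantifier, matched in results.items():
    print(quantifier, matched)
```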
4. Character classes
Finding digits or letters with a regular expression is easy, because metacharacters already exist for those character sets (such as \d and \w). But what if you want to match a character set that has no predefined metacharacter, such as the vowels a, e, i, o, u?
The answer is simple: list the characters inside square brackets. For example, [aeiou] matches any English vowel, and [.?!] matches the punctuation marks ".", "?", or "!". You can also specify a range of characters: [0-9] means the same as \d, a single digit, and [a-z0-9A-Z_] is equivalent to \w (if only English text is considered).
If you want to match any Chinese character in the given string, you can use [\u4e00-\u9fa5]; if you want to match multiple consecutive Chinese characters, you can use [\u4e00-\u9fa5]+.
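As a small illustration of the character classes above, the following sketch runs them through re.findall(). The sample strings are assumptions chosen for the demonstration.

```python
import re

# [aeiou] matches one English vowel at a time.
vowels = re.findall(r"[aeiou]", "regular expression")

# [.?!] matches one of the three listed punctuation marks.
punctuation = re.findall(r"[.?!]", "Really? Yes! Done.")

# [\u4e00-\u9fa5]+ matches runs of consecutive Chinese characters.
chinese_runs = re.findall(r"[\u4e00-\u9fa5]+", "Python 正则表达式 and 字符集")

print(vowels)
print(punctuation)
print(chinese_runs)
```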
5. Excluding characters
The previous section matched characters that belong to a specified set. The reverse is also possible: matching characters that do not belong to the set. Regular expressions use the "^" character for this. This metacharacter appeared earlier as the beginning-of-line anchor, but when it is placed first inside square brackets it means exclusion.
[^a-zA-Z]    # matches one character that is not a letter
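The two meanings of "^" can be compared side by side in Python. This is only a sketch, and the sample string is made up.

```python
import re

sample = "abc-123_X!"  # hypothetical sample string

# Inside brackets, ^ negates the set: every character that is not a letter.
non_letters = re.findall(r"[^a-zA-Z]", sample)

# Outside brackets, ^ is the start anchor: a letter at the very beginning.
leading_letter = re.findall(r"^[a-zA-Z]", sample)

print(non_letters)
print(leading_letter)
```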
6. Selection character
How would you match an ID card number? First you need to know its rules. An ID card number is 15 or 18 characters long. If it has 15 characters, they are all digits; if it has 18 characters, the first 17 are digits and the last one is a check character, which may be a digit or the character X.
This description contains conditional logic, which is expressed with the selection character (|). The character can be read as "or", so an expression for the ID card number joins the 15-character case and the 18-character case with "|".
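One way to write such an expression is sketched below. It is an assumption built only from the rules stated above, not necessarily the exact expression the author had in mind: it checks the shape of the number (length and allowed characters) and does not verify the check character. The names `ID_PATTERN` and `looks_like_id` are hypothetical.

```python
import re

# 15 digits, or 17 digits followed by a digit or "X".
# The parentheses make "^" and "$" apply to both alternatives.
ID_PATTERN = r"^(\d{15}|\d{17}[\dX])$"

def looks_like_id(number):
    """Return True if the string has the shape of a 15- or 18-character ID number."""
    return re.match(ID_PATTERN, number) is not None

print(looks_like_id("1" * 15))        # 15 digits
print(looks_like_id("1" * 17 + "X"))  # 17 digits plus the check character X
print(looks_like_id("1" * 16))        # wrong length
```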
7. Escape characters
The escape character (\) in a regular expression works much like the one in Python: it turns special characters (such as ".", "?", and "\") into ordinary characters. Take an IP address as an example, and use a regular expression to match addresses in the format 127.0.0.1. If the dot character is used directly, the expression is:
[1-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}
This is clearly wrong, because "." matches any character. Not only addresses such as 127.0.0.1 but also strings such as 127101011 will be matched. So a literal "." must be preceded by the escape character (\). The corrected regular expression is:
[1-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
Parentheses also count as metacharacters in regular expressions.
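The difference between an unescaped and an escaped dot can be verified with a short sketch. The variable names are hypothetical; the test strings are the two mentioned above. The last line shows the same escaping applied to literal parentheses.

```python
import re

unescaped = r"[1-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}"
escaped = r"[1-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

report = {}
for text in ("127.0.0.1", "127101011"):
    report[text] = (
        re.search(unescaped, text) is not None,  # "." accepts any character
        re.search(escaped, text) is not None,    # "\." accepts only a literal dot
    )
print(report)

# Literal parentheses need the same treatment.
in_parens = re.findall(r"\(\w+\)", "call (now) or (later)")
print(in_parens)
```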
8. Grouping
The ID card number example in section 6 already hints at the role of parentheses. Their first function is to change the scope of operators such as "|", "*", and "^". For example, (six|four)th matches "sixth" or "fourth", whereas six|fourth matches "six" or "fourth".
Their second function is grouping, that is, forming subexpressions. For example, (\.[0-9]{1,3}){3} repeats the group (\.[0-9]{1,3}) three times.
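Both functions of parentheses can be observed in Python. The sketch below uses re.fullmatch() so that the whole string has to fit the pattern; the sample words and the address are assumptions for illustration.

```python
import re

# Scope: with parentheses, "th" follows either alternative.
scoped = re.fullmatch(r"(six|four)th", "sixth") is not None

# Without parentheses, the alternatives are "six" and "fourth".
unscoped = re.fullmatch(r"six|fourth", "sixth") is not None

# Grouping: the quantifier {3} repeats the whole group (\.[0-9]{1,3}).
ip = re.fullmatch(r"[1-9]{1,3}(\.[0-9]{1,3}){3}", "192.168.1.66")

print(scoped, unscoped)
print(ip.group(0))  # the whole match
print(ip.group(1))  # the last repetition captured by the group
```

A repeated group keeps only its last repetition, which is why group(1) holds just the final segment of the address.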
9. Using regular expression syntax in Python
In Python, regular expressions are written as pattern strings. For example, the regular expression that matches one character that is not a letter is expressed as the following pattern string:
'[^a-zA-Z]'
However, a regular expression that matches words starting with the letter m cannot simply be wrapped in quotes. For example, the following code is incorrect:
'\bm\w*\b'
In an ordinary Python string, "\b" is the backspace character, so each "\" needs to be escaped. The converted code is:
'\\bm\\w*\\b'
Because a pattern string may contain many special characters and backslashes, it is better written as a raw string, that is, with r or R placed before the opening quote. For example, the pattern string above expressed as a raw string:
r'\bm\w*\b'
Not every backslash in a pattern string has to be escaped. For example, the backslash in the regular expression "^\d{8}$" written earlier survives in an ordinary string, because "\d" is not an escape sequence that Python recognizes (although recent Python versions emit a warning for such unrecognized escapes). To keep things simple, it is recommended to always write regular expressions as raw strings.
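The equivalence of the escaped string and the raw string, and the reason the unescaped version fails, can be checked directly. This is a minimal sketch with a made-up sample sentence.

```python
import re

escaped = '\\bm\\w*\\b'  # ordinary string with doubled backslashes
raw = r'\bm\w*\b'        # raw string, backslashes kept as typed
same = escaped == raw

# In an ordinary string, '\b' is one backspace character (0x08),
# so the regex engine never sees the word-boundary metacharacter.
lengths = (len('\b'), len(r'\b'))

sentence = 'my mother made a map'  # hypothetical sample sentence
words = re.findall(raw, sentence)
broken = re.findall('\bm', sentence)  # looks for backspace followed by "m"

print(same, lengths)
print(words)
print(broken)
```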
2. Use the re module to implement regular expression operations
The syntax of regular expressions was introduced above; this part describes how to use regular expressions in Python. Python provides the re module for regular expression operations. You can either call the methods of the re module (such as search(), match(), and findall()) directly with a pattern string, or use the compile() method of the re module to convert the pattern string into a regular expression object and then call that object's methods to process strings.
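The two styles described above, calling a module-level method directly and compiling the pattern first, give the same result. The sketch below assumes a sample pattern and string in the style of the later examples.

```python
import re

pattern = r"mr_\w+"
text = "MR_SHOP mr_shop"

# Style 1: pass the pattern string to a module-level method.
direct = re.findall(pattern, text, re.I)

# Style 2: compile once, then call methods on the regular expression object.
compiled = re.compile(pattern, re.I)
via_object = compiled.findall(text)

print(direct)
print(via_object)
```

Compiling is mainly useful when the same pattern is applied many times, since the flags and the pattern are then stated in one place.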
Before the re module can be used, it must be imported with the import statement:
import re
If the module is used without being imported, Python raises a NameError reporting that the name 're' is not defined.
1. Match string
To match strings, you can use methods such as match(), search(), and findall() provided by the re module.
1. Use the match() method to match
The match() method matches from the beginning of the string. If the match succeeds at the starting position, a Match object is returned; otherwise None is returned. The syntax is as follows:
re.match(pattern, string, flags=0)
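A minimal sketch of match() follows. The two sample strings are assumptions chosen to fit the discussion below, and re.I makes the match case-insensitive.

```python
import re

pattern = r"mr_\w+"

# The first string begins with "MR_", which fits the pattern when case is ignored.
first = re.match(pattern, "MR_SHOP mr_shop", re.I)

# The second string begins with other text, so match() gives up immediately.
second = re.match(pattern, "item name MR_SHOP mr_shop", re.I)

print(first)
print(second)
```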
With case ignored (re.I), the string "MR_SHOP mr_shop" starts with "mr_", so a Match object is returned, while the string "item name MR_SHOP mr_shop" does not start with "mr_", so None is returned. This is because match() only tries to match at the beginning of the string: when the first characters do not fit the pattern, it stops and returns None.
The Match object contains the position of the match and the matched data. The start() method returns the start position of the match, the end() method returns the end position, the span() method returns both positions as a tuple, the string attribute holds the string that was searched, and the group() method returns the matched data.
import re
pattern = r'mr_\w+'  # pattern string
string = 'MR_SHOP mr_shop'  # string to match
match = re.match(pattern, string, re.I)  # match the string, ignoring case
print('Start position of the match:', match.start())
print('End position of the match:', match.end())
print('Tuple of the match position:', match.span())
print('String being matched:', match.string)
print('Matched data:', match.group())
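import re
# Check whether a mobile number is a valid China Mobile number
pattern = r'(13[4-9]\d{8})$|(15[01289]\d{8})$'
mobile = '13634222222'
match = re.match(pattern, mobile)  # perform the pattern match
if match is None:  # None means the match failed
    print(mobile, 'is not a valid China Mobile number.')
else:
    print(mobile, 'is a valid China Mobile number.')
mobile = '13144222221'
match = re.match(pattern, mobile)  # perform the pattern match
if match is None:  # None means the match failed
    print(mobile, 'is not a valid China Mobile number.')
else:
    print(mobile, 'is a valid China Mobile number.')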
2. Use the search() method to match
The search() method searches the entire string for the first match. If a match is found anywhere in the string, a Match object is returned; otherwise None is returned. The syntax is as follows:
re.search(pattern, string, flags=0)
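A minimal sketch of search() follows, using assumed sample strings. Unlike match(), both calls succeed here, because the second string contains a match further along.

```python
import re

pattern = r"mr_\w+"

first = re.search(pattern, "MR_SHOP mr_shop", re.I)
second = re.search(pattern, "item name MR_SHOP mr_shop", re.I)

print(first.group(), first.span())
print(second.group(), second.span())
```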
Unlike match(), the search() method does not look only at the starting position of the string; it also finds matches at other positions.
Example two
Check whether a piece of text contains dangerous words.
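import re  # the re module
# pattern string: 黑客 (hacker), 抓包 (packet capture), 监听 (eavesdropping), Trojan
pattern = r'(黑客)|(抓包)|(监听)|(Trojan)'
about = '我是一名程序员,我喜欢看黑客方面的图书,想研究一下Trojan。'
match = re.search(pattern, about)  # perform the pattern search
if match is None:  # None means nothing was found
    print(about, '@ safe!')
else:
    print(about, '@ contains dangerous words!')
about = '我是一名程序员,我喜欢看计算机网络方面的图书,喜欢开发网站。'
match = re.search(pattern, about)  # perform the pattern search
if match is None:  # None means nothing was found
    print(about, '@ safe!')
else:
    print(about, '@ contains dangerous words!')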
3. Use the findall() method to match
The findall() method searches the entire string for all substrings that match the regular expression and returns them as a list. If there are matches, the list contains the matched strings; otherwise an empty list is returned.
import re
pattern = r'mr_\w+'  # pattern string
string = 'MR_SHOP mr_shop'  # string to search
match = re.findall(pattern, string, re.I)  # search the string, ignoring case
print(match)
string = '项目名称 MR_SHOP mr_shop'
match = re.findall(pattern, string)  # search the string, case-sensitive
print(match)  # print the matches
The result is as follows:
['MR_SHOP', 'mr_shop']
['mr_shop']
Example two
If the pattern string contains a group, findall() returns a list of the text matched by the group.
import re
pattern = r'[1-9]{1,3}(\.[0-9]{1,3}){3}'  # pattern string
str1 = '127.0.0.1 192.168.1.66'  # string to search
match = re.findall(pattern, str1)  # perform the pattern search
print(match)
The result of the execution is as follows:
['.1', '.66']
As the result shows, the IP addresses themselves are not returned. Because the pattern string contains a group, the result holds only what the group "(\.[0-9]{1,3})" matched. To get the match for the entire pattern string, wrap the whole pattern in a pair of parentheses. Each element of the returned list is then a tuple, and only the first item of each tuple needs to be taken.
import re
pattern = r'([1-9]{1,3}(\.[0-9]{1,3}){3})'  # pattern string
str1 = '127.0.0.1 192.168.1.66'  # string to search
match = re.findall(pattern, str1)  # perform the pattern search
for item in match:
    print(item[0])  # the first item of each tuple is the whole address
The result is as follows:
127.0.0.1
192.168.1.66
2. Replace string - sub() method
The sub() method replaces the parts of a string that match a pattern. The syntax is as follows:
re.sub(pattern, repl, string, count=0, flags=0)
Hide the mobile phone number in the winning information:
import re
pattern = r'1[34578]\d{9}'  # pattern string for the text to replace
string = '中奖号码为:84978981 电话为:13611111111'
result = re.sub(pattern, '1XXXXXXXXXX', string)  # replace the matched text
print(result)
The result of the operation is as follows:
中奖号码为:84978981 电话为:1XXXXXXXXXX
Example two
Replace occurrences of dangerous strings:
import re
# pattern string: 黑客 (hacker), 抓包 (packet capture), 监听 (eavesdropping), Trojan
pattern = r'(黑客)|(抓包)|(监听)|(Trojan)'
about = "我是一名程序员,我喜欢看黑客方面的图书,想研究一下Trojan。"
sub = re.sub(pattern, '@_@', about)  # perform the replacement
print(sub)
about = '我是一名程序员,我喜欢看计算机网络方面的图书,喜欢开发网站。'
sub = re.sub(pattern, '@_@', about)  # perform the replacement
print(sub)
3. Use regular expressions to split strings - split() method
The split() method splits a string according to a regular expression and returns the pieces as a list. It works like the split() method of the string object, except that the delimiter is specified by a pattern string. The syntax is as follows:
re.split(pattern, string, maxsplit=0, flags=0)
Scenario: in the @Friends box of Weibo, entering "@明日科技 @扎克伯格 @盖茨" (with a space between friends' names) mentions three friends at the same time.
Example two
Output the names of the friends who are @-mentioned.
import re
str1 = '@明日科技 @扎克伯格 @盖茨'
pattern = r'\s*@'
list1 = re.split(pattern, str1)  # split on "@", optionally preceded by whitespace
print('The friends you @-mentioned are:')
for item in list1:
    if item != '':  # skip empty elements
        print(item)  # print each friend's name
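To contrast re.split() with the split() method of the string object mentioned above, the following sketch splits a made-up string on several different delimiters at once, something a single fixed delimiter cannot do. The sample data is an assumption for illustration.

```python
import re

data = "apple, banana;cherry  date"  # hypothetical sample data

# str.split() accepts only one fixed delimiter.
by_comma = data.split(",")

# re.split() accepts a pattern: any run of commas, semicolons, or whitespace.
by_pattern = re.split(r"[,;\s]+", data)

# maxsplit limits the number of splits that are performed.
first_only = re.split(r"[,;\s]+", data, maxsplit=1)

print(by_comma)
print(by_pattern)
print(first_only)
```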