Python: How to quickly process strings using regular expressions


Foreword

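1. A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "pattern string" that expresses a filtering rule for strings.

2. In Python, the match, search and findall functions are commonly used together to extract key substrings quickly.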


1. Regular expressions

1. Detailed explanation of metacharacters

pattern: the keyword (expression) to match

Metacharacter Description
\ Escapes the next character so that it is treated literally, or introduces a special sequence, a backreference, or an octal escape.
^ Matches the start of the string (and, with re.M, the start of each line).
$ Matches the end of the string, or just before a trailing newline (and, with re.M, the end of each line).
* Matches the preceding expression 0 or more times. Example: zo* matches "z" as well as "zo" and "zoooo", because o may occur 0 or more times.
+ Matches the preceding expression 1 or more times. Example: zo+ matches "zo" and "zoooo" but not "z", because o must occur at least once.
? Matches the preceding expression 0 or 1 times. Example: zo? matches "z" and "zo", because o may occur 0 or 1 times.
{n} Matches the preceding expression exactly n times. Example: o{2} matches the two o's in "food".
{n,} Matches the preceding expression at least n times. Example: o{2,} matches all the o's in "foooood" but does not match the "o" in "fod", because fod contains only one o and o{2,} requires at least two.
{n,m} Matches the preceding expression at least n and at most m times. Example: o{1,3} matches the o's in "fod", "food" and "foood", where o occurs 1 to 3 times.
? When this character immediately follows another quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, on the string "oooo", "o+" matches as many "o" as possible and findall returns ['oooo'], while "o+?" matches as few as possible and findall returns ['o', 'o', 'o', 'o'].
. Matches any single character except "\n". In Python, "\r" is matched by "."; to match any character including "\n", pass the re.S flag or use a pattern such as "[\s\S]".
(pattern) Matches pattern and captures the match as a group.
(?:pattern) Non-capturing group: matches pattern but does not store the result for later use.
(?=pattern) Positive lookahead: matches a position that is followed by pattern, without consuming it. For example, \S+(?= you) matches "love" in "I love you", because "love" is followed by " you". (\S matches any non-whitespace character.)
(?!pattern) Negative lookahead: matches a position that is not followed by pattern. For example, \b\w+\b(?! you) matches "I" and "you" in "I love you" but not "love", because "love" is followed by " you".
(?<=pattern) Positive lookbehind: matches a position that is preceded by pattern. For example, (?<=I )\S+ matches "love" in "I love you", because "love" is preceded by "I ".
(?<!pattern) Negative lookbehind: matches a position that is not preceded by pattern. For example, (?<!I )\b\w+ matches "I" and "you" in "I love you" but not "love", because "love" is preceded by "I ".
x|y Matches x or y. For example, "z|food" matches either "z" or "food" (note that the alternation covers the whole of "food", not just the "f"). To match "zood" or "food", use "(z|f)ood" or "[zf]ood"; inside a character class, | is a literal character.
[xyz] Character set. Matches any one of the contained characters. For example, "[abc]" matches the "a" in "plain".
[^xyz] Negated character set. Matches any one character that is not contained. For example, "[^abc]" matches each of "p", "l", "i" and "n" in "plain".
[a-z] Character range. Matches any one character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" through "z".
[^a-z] Negated character range. Matches any one character that is not in the specified range. For example, "[^a-z]" matches any character outside "a" through "z".
\b Matches a word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). Regular expressions can match either characters or positions, and \b matches a position. For example, "er\b" matches the "er" in "never" but not the "er" in "verb"; "\b1_" matches the "1_" in "1_23" but not the "1_" in "21_3".
\B Matches a position that is not a word boundary. "er\B" matches the "er" in "verb" but not the "er" in "never".
\d Matches a digit. Equivalent to [0-9] for ASCII matching; by default, Python 3 str patterns also match other Unicode decimal digits.
\D Matches a non-digit character. The complement of \d ([^0-9] for ASCII matching).
\f Matches a form feed character.
\n Matches a newline character.
\r Matches a carriage return.
\s Matches any whitespace character, including spaces, tabs, form feeds and newlines.
\S Matches any non-whitespace character.
\t Matches a tab character.
\v Matches a vertical tab character.
\w Matches any word character, including the underscore. Similar to, but not the same as, "[A-Za-z0-9_]", because "word" characters are taken from the Unicode character set by default.
\W Matches any non-word character. The complement of \w ("[^A-Za-z0-9_]" for ASCII matching).
\num Backreference, where num is a positive integer. Matches the same text that group number num captured. For example, "(.)\1" matches two consecutive identical characters.
\nnn Octal escape or backreference, depending on the digits. In Python's re module, a number that starts with 0 or is three octal digits long is read as the character with that octal value; otherwise it is read as a reference to the group with that number.
\< \> Match the start (\<) and end (\>) of a word in some tools, such as GNU grep. For example, \<the\> matches "the" in "for the wise" but not the "the" inside "otherwise". Note: not all software supports these; Python's re module does not, so use \b there.
( ) Defines the expression between ( and ) as a group and saves the text it matched in a temporary area. Saved groups can be referenced as \1 to \9 (Python's re module allows \1 to \99).
| Combines two match conditions with a logical "or".
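A few of the table entries are easier to see by running them. The following is a minimal sketch using only Python's standard re module; the test strings are made up for illustration.

```python
import re

# Quantifiers: * allows zero occurrences, + requires at least one
star = [bool(re.fullmatch(r"zo*", s)) for s in ("z", "zo", "zooo")]
plus = [bool(re.fullmatch(r"zo+", s)) for s in ("z", "zo", "zooo")]
print(star)  # [True, True, True]
print(plus)  # [False, True, True]

# Greedy vs non-greedy on the same input
greedy = re.findall(r"o+", "oooo")
lazy = re.findall(r"o+?", "oooo")
print(greedy)  # ['oooo']
print(lazy)    # ['o', 'o', 'o', 'o']

# Inside [...] the | is a literal character, so a group is needed for "or"
in_class = re.findall(r"[z|f]ood", "zood food |ood")
in_group = re.findall(r"(?:z|f)ood", "zood food |ood")
print(in_class)  # ['zood', 'food', '|ood']
print(in_group)  # ['zood', 'food']

# \b matches a position, not a character
boundary = re.findall(r"er\b", "never verb")
print(boundary)  # ['er']

# Backreference: (.)\1 finds a doubled character
doubled = re.search(r"(.)\1", "food").group(0)
print(doubled)  # oo
```

Raw strings (the r"..." prefix) are used so that Python does not interpret backslashes before the re module sees them.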

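2. Equivalent notations (shorthand)

1) Equivalences:
"Equivalent" means the same function written with different symbols.
?, *, +, \d and \w are all shorthand notations.
? is equivalent to the repetition {0,1}
* is equivalent to the repetition {0,}
+ is equivalent to the repetition {1,}
\d is equivalent to [0-9]
\D is equivalent to [^0-9]
\w is equivalent to [A-Za-z_0-9]
\W is equivalent to [^A-Za-z_0-9]
For \d, \D, \w and \W the equivalence is exact only for ASCII matching (for example with the re.A flag); by default, Python 3 str patterns also match Unicode digits and letters.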
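The equivalences can be checked directly. This is a small sketch with the standard re module; the sample strings are illustrative, and the second half shows where \d and \w differ from their ASCII character classes in Python 3.

```python
import re

text = "z zo zoo"

# ? and {0,1}, * and {0,}, + and {1,} give identical results
pairs = [(r"zo?", r"zo{0,1}"), (r"zo*", r"zo{0,}"), (r"zo+", r"zo{1,}")]
same = [re.findall(a, text) == re.findall(b, text) for a, b in pairs]
print(same)  # [True, True, True]

# For str patterns, \d is Unicode-aware unless re.ASCII is passed
arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = bool(re.fullmatch(r"\d", arabic_indic_three))
ascii_digit = bool(re.fullmatch(r"\d", arabic_indic_three, re.ASCII))
class_digit = bool(re.fullmatch(r"[0-9]", arabic_indic_three))
print(unicode_digit, ascii_digit, class_digit)  # True False False

# \w also matches non-ASCII letters unless re.ASCII is passed
sample = "na\u00efve \u4e2d\u6587"
word_unicode = re.findall(r"\w+", sample)
word_ascii = re.findall(r"\w+", sample, re.ASCII)
print(word_unicode)  # ['naïve', '中文']
print(word_ascii)    # ['na', 've']
```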

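2) Common operators and expressions:
^ start
() group
[] includes; matches one character by default
[^] excludes; matches one character by default
{n,m} repetition count
. any single character (\. is a literal dot)
| or
\ escape
$ end
[A-Z] the 26 uppercase letters
[a-z] the 26 lowercase letters
[0-9] the digits 0 to 9
[A-Za-z0-9] the 26 uppercase letters, the 26 lowercase letters and the digits 0 to 9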
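As a quick illustration of these operators, here is a short sketch using the standard re module. The sample strings are invented, and re.escape is shown as one way to build a literal pattern from text that contains metacharacters.

```python
import re

# An unescaped . matches any character; \. matches only a literal dot
loose = re.findall(r"a.c", "a.c abc a-c")
strict = re.findall(r"a\.c", "a.c abc a-c")
print(loose)   # ['a.c', 'abc', 'a-c']
print(strict)  # ['a.c']

# ^ and $ anchor the pattern to the start and end of the string
anchored = [bool(re.search(r"^[A-Z][a-z]+$", s))
            for s in ("Python", "python", "Python3")]
print(anchored)  # [True, False, False]

# [^...] matches one character that is NOT in the set
non_digits = re.findall(r"[^0-9]", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']

# re.escape turns text with metacharacters into a literal pattern
literal = re.escape("1+1=2")
found = bool(re.search(literal, "so 1+1=2 holds"))
print(found)  # True
```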

2. Common expressions

1. Common regular expressions

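1. Validate a user name or password: "[a-zA-Z]\w{5,15}". Valid input consists of letters, digits and underscores, starts with a letter, and is 6 to 16 characters long.
2. Validate a landline phone number: "(\d{3,4}-)\d{7,8}". Valid format: xxx/xxxx-xxxxxxx/xxxxxxxx.
3. Validate a mobile phone number (including virtual numbers and newer number ranges): "1([38][0-9]|4[5-9]|5[0-35-9]|66|7[0-8]|9[89])[0-9]{8}"
4. Validate an ID card number: 15 digits "\d{14}[0-9xX]", 18 digits "\d{17}(\d|X|x)"
5. Validate an email address: "\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
6. Only digits and the 26 English letters: "[A-Za-z0-9]+"
7. An integer or a decimal: "[0-9]+([.][0-9]+){0,1}"
8. Digits only: "[0-9]*"
9. Exactly n digits: "\d{n}"
10. At least n digits: "\d{n,}"
11. Between m and n digits: "\d{m,n}"
12. Zero, or a number that does not start with zero: "(0|[1-9][0-9]*)"
13. A positive real number with two decimal places: "[0-9]+(\.[0-9]{2})?"
14. A positive real number with 1 to 3 decimal places: "[0-9]+(\.[0-9]{1,3})?"
15. A non-zero positive integer: "\+?[1-9][0-9]*"
16. A non-zero negative integer: "\-[1-9][0-9]*"
17. Exactly 3 characters: ".{3}"
18. Only the 26 English letters: "[A-Za-z]+"
19. Only the 26 uppercase English letters: "[A-Z]+"
20. Only the 26 lowercase English letters: "[a-z]+"
21. Check for characters such as ^ % & ' , ; = ? $ \: "[%&',;=?$\\^]+"
22. Chinese characters only: "[\u4e00-\u9fa5]{0,}"
23. Validate a URL: "http://([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?" (the hyphen is placed last in the final character class, because Python rejects "\w-." as a bad character range)
24. Validate the 12 months of a year: "(0?[1-9]|1[0-2])". Valid values include "01", "09", "10" and "12".
25. Validate the 31 days of a month: "((0?[1-9])|((1|2)[0-9])|30|31)". Valid values include "01", "09", "10", "29", "30" and "31".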
26. Extract dates: \d{4}[年\-.]\d{1,2}[月\-.]\d{1,2}日?
Note: matches most year-month-day strings.
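27. Match double-byte characters (including Chinese characters): [^\x00-\xff]
Note: can be used to compute the length of a string (a double-byte character counts as 2, an ASCII character as 1).
28. Match blank lines: \n\s*\r
Note: can be used to delete blank lines.
29. Match HTML tags: <(\S*?)[^>]*>.*?</\1>|<.*? />
30. Match leading and trailing whitespace: ^\s*|\s*$
Note: can be used to strip whitespace (spaces, tabs, form feeds and so on) from the start and end of a line; a very useful expression.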
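31. Match a URL: [a-zA-Z]+://[^\s]*
32. Check whether an account name is valid (starts with a letter, 5 to 16 characters, letters, digits and underscores allowed): [a-zA-Z][a-zA-Z0-9_]{4,15}
Note: very practical for form validation.
33. Match a Tencent QQ number: [1-9][0-9]{4,}
34. Match a Chinese postal code: [1-9]\d{5}(?!\d)
Note: Chinese postal codes are 6 digits.
35. Match an IP address: (\d{1,3}\.){3}\d{1,3}
Note: useful for extracting IP addresses; it does not check that each number is in the range 0 to 255.
36. Match a MAC address: ([A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2}
37. Match the content inside parentheses (lazy match): \((.*?)\)
38. Match the content inside parentheses (greedy match): \((.*)\)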
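When a pattern is used for validation, the whole input has to match, not just a part of it. The sketch below assumes re.fullmatch for that purpose and uses a few of the listed patterns written as Python raw strings; the dictionary keys and the is_valid helper are illustrative names, not part of any library.

```python
import re

# A few of the listed patterns; the key names are illustrative only
PATTERNS = {
    "postal_code": r"[1-9]\d{5}",
    "qq_number": r"[1-9][0-9]{4,}",
    "email": r"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
    "two_decimals": r"[0-9]+(\.[0-9]{2})?",
}

def is_valid(kind: str, value: str) -> bool:
    """Return True only if the whole value matches the named pattern."""
    return re.fullmatch(PATTERNS[kind], value) is not None

print(is_valid("postal_code", "100080"))            # True
print(is_valid("postal_code", "1000800"))           # False (7 digits)
print(is_valid("qq_number", "01234"))               # False (leading zero)
print(is_valid("email", "user.name@example.com"))   # True
print(is_valid("email", "user@localhost"))          # False (no dot part)
print(is_valid("two_decimals", "3.14"))             # True
print(is_valid("two_decimals", "3.1"))              # False

# Items 37 and 38: lazy vs greedy extraction of parenthesised content
text = "f(a) + g(b)"
lazy = re.findall(r"\((.*?)\)", text)
greedy = re.findall(r"\((.*)\)", text)
print(lazy)    # ['a', 'b']
print(greedy)  # ['a) + g(b']
```

With re.search or re.match instead of re.fullmatch, an input such as "1000800" would pass the postal code check, because its first six digits match.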

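2. Lookahead and lookbehind assertions

Lookahead and lookbehind assertions are very useful in practice but also harder to understand, so they are explained here.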

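1) Positive lookahead: (?=pattern)
A positive lookahead matches content that is followed by pattern.

How do you match the first python in "python yyds,what is the python"? A positive lookahead does it.

Because the first python is immediately followed by yyds, .*(?=yyds) matches everything before yyds, that is "python " including the trailing space. To match only the word itself, use python(?= yyds).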
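The positive lookahead case can be reproduced with the standard re module. This is a minimal sketch; the second pattern, python(?= yyds), is an assumed variant that matches only the word.

```python
import re

text = "python yyds,what is the python"

# .* is greedy, but the lookahead forces it to stop right before "yyds"
before_yyds = re.search(r".*(?=yyds)", text)
print(repr(before_yyds.group(0)))  # 'python '
print(before_yyds.span())          # (0, 7)

# Matching only the word: "python" followed by " yyds"
first_python = re.search(r"python(?= yyds)", text)
print(first_python.group(0), first_python.span())  # python (0, 6)

# Only one of the two occurrences satisfies the lookahead
count = len(re.findall(r"python(?= yyds)", text))
print(count)  # 1
```

The text inside the lookahead is checked but not consumed, which is why " yyds" does not appear in the result.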


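2) Negative lookahead: (?!pattern)
A negative lookahead matches content that is not followed by pattern.

How do you match the pycharm that contains py in "python yyds,what is the pycharm"? A negative lookahead does it.

The only py that is not followed by thon is the one in pycharm, so a pattern such as py(?!thon) finds it.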
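A small sketch of the negative lookahead case follows. The pattern py(?!thon) is an assumption about what the missing screenshot showed; the second pattern extends it to the whole word.

```python
import re

text = "python yyds,what is the pycharm"

# py(?!thon): "py" that is NOT immediately followed by "thon"
m = re.search(r"py(?!thon)", text)
print(m.group(0), m.span())  # py (24, 26)

# Extend the match to the whole word
words = re.findall(r"\bpy(?!thon)\w*", text)
print(words)  # ['pycharm']
```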


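3) Positive lookbehind: (?<=pattern)
A positive lookbehind matches content that is preceded by pattern.
How do you match the content that comes after python in "python yyds,what is the pycharm"? A positive lookbehind does it.

With a pattern such as (?<=python).*, the content preceded by python is " yyds,what is the pycharm".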
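The positive lookbehind case, sketched with the standard re module. The pattern (?<=python).* is an assumed reconstruction of the example; the second half shows a restriction of Python's re module, which only accepts lookbehind patterns of fixed width.

```python
import re

text = "python yyds,what is the pycharm"

# (?<=python) requires "python" immediately before the match start
after_python = re.search(r"(?<=python).*", text)
print(repr(after_python.group(0)))  # ' yyds,what is the pycharm'

# Lookbehind patterns must have a fixed width in Python's re module
try:
    re.compile(r"(?<=py\w+) yyds")
    fixed_width_required = False
except re.error:
    fixed_width_required = True
print(fixed_width_required)  # True
```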


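4) Negative lookbehind: (?<!pattern)

A negative lookbehind matches content that is not preceded by pattern.
How do you match the yyds that is not preceded by python in "python yyds,what is the pycharm yyds"? A negative lookbehind does it.

With a pattern such as (?<!python )yyds, the yyds not preceded by python is the one after pycharm.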
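A sketch of the negative lookbehind case. The pattern (?<!python )yyds is an assumed reconstruction of the example; re.finditer is used here only to report where each match starts.

```python
import re

text = "python yyds,what is the pycharm yyds"

# (?<!python )yyds: "yyds" NOT immediately preceded by "python "
matches = [(m.group(0), m.start())
           for m in re.finditer(r"(?<!python )yyds", text)]
print(matches)  # [('yyds', 32)]

# For comparison, the positive lookbehind selects the other occurrence
first = re.search(r"(?<=python )yyds", text)
print(first.start())  # 7
```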


.* matches any sequence of zero or more characters, excluding newlines.

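3. Python matching functions

Note: these three functions require the re module.

1. The match function

1) Introduction to re.match
re.match tries to match at the start of the string; if the pattern does not match there, it returns None.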

import re

pattern = "python"      # example pattern
string = "python yyds"  # example input
# Returns a match object on success, otherwise None
matchobj = re.match(pattern, string, flags=0)

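Parameters
1) pattern: the regular expression to match
2) string: the string to match against
3) flags: optional flags that control how matching is done (case sensitivity, multi-line matching and so on)
re.I ignore case
re.L make \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only)
re.M multi-line mode: ^ and $ also match at the start and end of each line
re.S make . match any character, including a newline (by default . does not match a newline)
re.U make \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (already the default for str patterns in Python 3)
re.X verbose mode: ignore whitespace and # comments inside the pattern, to improve readability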
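The effect of the most common flags can be seen side by side. This is a minimal sketch with the standard re module; the sample text and the phone pattern layout are illustrative.

```python
import re

text = "Python\npython yyds"

# re.I: case-insensitive matching
ignore_case = re.findall(r"python", text, re.I)
print(ignore_case)  # ['Python', 'python']

# re.M: ^ also matches at the start of each line
single_line = re.findall(r"^python", text, re.I)
multi_line = re.findall(r"^python", text, re.I | re.M)
print(single_line)  # ['Python']
print(multi_line)   # ['Python', 'python']

# re.S: . also matches the newline character
without_s = re.search(r"Python.python", text)
with_s = re.search(r"Python.python", text, re.S)
print(without_s)      # None
print(with_s.span())  # (0, 13)

# re.X: whitespace and # comments inside the pattern are ignored
phone = re.compile(r"""
    (\d{3,4})   # area code
    -           # separator
    (\d{7,8})   # local number
""", re.X)
parts = phone.fullmatch("010-12345678").groups()
print(parts)  # ('010', '12345678')
```

Flags are combined with the | operator, as in re.I | re.M.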


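# Methods on the returned match object
matchobj.groups()   # tuple of all capturing groups
matchobj.group(0)   # the whole match; group(n) returns the n-th group
matchobj.span()     # (start, end) position of the match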
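These methods are most useful when the pattern contains capturing groups. The following sketch assumes a two-group pattern chosen for illustration.

```python
import re

m = re.match(r"(\w+) (\w+)", "python yyds")

whole = m.group(0)   # the entire match
first = m.group(1)   # text of the first capturing group
both = m.groups()    # all capturing groups as a tuple
span = m.span()      # (start, end) of the entire match
span2 = m.span(2)    # (start, end) of group 2

print(whole)  # python yyds
print(first)  # python
print(both)   # ('python', 'yyds')
print(span)   # (0, 11)
print(span2)  # (7, 11)
```

group() takes the group number as a positional argument; group(0), or group() with no argument, returns the whole match.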

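2) re.match examples

import re

# Note 1: using group()
string = "python yyds"
matchobj = re.match("python", string)
print(matchobj.group(0))
# Output: python

# Note 2: matching has to start at the beginning of the string
string = "python yyds"
matchobj = re.match("yyds", string)
print(matchobj.group(0))
# Raises AttributeError: 'NoneType' object has no attribute 'group'
# re.match only matches at the start of the string. "yyds" is not at the start,
# so re.match returns None, and calling group() on None raises the error.

# Note 3: check the result before using it
string = "python yyds"
matchobj = re.match("Python", string, re.I)
# Calling matchobj.group(0) when nothing matched raises an error, so test first.
if matchobj:
    print(matchobj.group(0))

# Note 4: case-insensitive matching with flags
string = "python yyds"
matchobj = re.match("Python", string, re.I)
if matchobj:
    print(matchobj.group(0))
# Output: python
# "Python" matches "python" here because re.I ignores case.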

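2. The search function

1) Introduction to re.search
re.search scans the whole string and returns the first successful match. The difference from re.match is that re.match only matches at the start of the string and fails if the pattern does not match there, whereas re.search looks for a match anywhere in the string.

import re

# Same regular expression: compare re.match and re.search

# re.match
string = "python yyds"
matchobj = re.match("yyds", string)
if matchobj:
    print(matchobj.group(0))
# Prints nothing: re.match only matches at the start of the string,
# so it returns None here and the if block is skipped.

# re.search
string = "python yyds"
matchobj = re.search("yyds", string)
if matchobj:
    print(matchobj.group(0))
# Output: yyds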
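The difference between the functions can be summarized in one run. This sketch also includes re.fullmatch, which is not covered above and is added here only for comparison; it requires the pattern to match the entire string.

```python
import re

string = "python yyds"

results = {
    "match": re.match(r"yyds", string),              # start of string only
    "search": re.search(r"yyds", string),            # anywhere in the string
    "search_anchored": re.search(r"^yyds", string),  # ^ behaves like match here
    "fullmatch": re.fullmatch(r"python", string),    # must cover the whole string
}
summary = {name: (m.group(0) if m else None) for name, m in results.items()}
for name, value in summary.items():
    print(name, value)
# match None
# search yyds
# search_anchored None
# fullmatch None
```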

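3. The findall function

1) Introduction to re.findall
re.findall returns all matches as a list. The difference from search and match is that those two stop at the first match and return a match object, whereas findall keeps scanning and returns a list of every non-overlapping match (an empty list if there is none).

import re

string = "python is the best language,python yyds"
matches = re.findall("python", string)
# findall returns an empty list, never None, when nothing matches
if matches:
    print(matches)
    print(matches[0])
# Output:
# ['python', 'python']
# python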
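Two details of re.findall are easy to miss: what it returns when the pattern contains capturing groups, and that it gives no positions. The sketch below illustrates both, using re.finditer (not covered above) as one way to get match positions.

```python
import re

string = "python is the best language,python yyds"

# No match: findall returns an empty list, never None
none_found = re.findall(r"java", string)
print(none_found)  # []

# One capturing group: the list holds the group's text, not the whole match
one_group = re.findall(r"(py)thon", string)
print(one_group)  # ['py', 'py']

# Several groups: the list holds one tuple per match
two_groups = re.findall(r"(py)(thon)", string)
print(two_groups)  # [('py', 'thon'), ('py', 'thon')]

# finditer yields match objects, so positions are available too
spans = [m.span() for m in re.finditer(r"python", string)]
print(spans)  # [(0, 6), (28, 34)]
```

To keep the whole match while still grouping, a non-capturing group such as (?:py)thon can be used.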

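4. Recommended site: regex101

On this site you can enter a regular expression together with a test string and see the matches immediately, which makes it very handy for trying out patterns.

Click to visit the regex101 website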



Origin blog.csdn.net/zataji/article/details/128356853