Python - regular expressions (Regex)

Metacharacter
Description
\
Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "\\n" matches the literal two-character text "\n", while "\n" matches a newline. The sequence "\\" matches "\" and "\(" matches "(". This corresponds to the "escape character" found in many programming languages.
^
Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r". In Python the re.MULTILINE flag plays this role, with only "\n" counting as a line break.
$
Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r". In Python the re.MULTILINE flag plays this role, with only "\n" counting as a line break.
*
Matches the preceding subexpression zero or more times. For example, "zo*" matches "z" as well as "zo" and "zoo". * is equivalent to {0,}.
+
Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?
Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}
n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the single "o" in "Bob", but matches the two o's in "food".
{n,}
n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m}
m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood" as one match and the next three o's as another. "o{0,1}" is equivalent to "o?". Note that no spaces are allowed between the comma and the two numbers.
?
When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, on the string "oooo", "o+" matches as many "o" as possible, giving ['oooo'], while "o+?" matches as few as possible, giving ['o', 'o', 'o', 'o'].
. (dot)
Matches any single character except "\n". To match any character including "\n", use a pattern such as "[\s\S]".
(pattern)
Matches pattern and captures the match. The captured matches can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript and the $0…$9 properties in JScript. To match literal parentheses, use "\(" or "\)".
(?:pattern)
Non-capturing group: matches pattern but does not capture the result or store it for later use. This is useful when combining parts of a pattern with the "or" character "|". For example, "industr(?:y|ies)" is a shorter expression than "industry|industries".
(?=pattern)
Positive lookahead (non-capturing): matches at a position where the text that follows matches pattern. The lookahead text is not captured for later use. For example, "Windows(?=95|98|NT|2000)" matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". A lookahead consumes no characters: after a match, the search for the next match starts immediately after the matched text, not after the characters examined by the lookahead.
(?!pattern)
Negative lookahead (non-capturing): matches at a position where the text that follows does not match pattern. The lookahead text is not captured for later use. For example, "Windows(?!95|98|NT|2000)" matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000".
(?<=pattern)
Positive lookbehind (non-capturing): like positive lookahead, but it checks the text before the position. For example, "(?<=95|98|NT|2000)Windows" matches "Windows" in "2000Windows", but not "Windows" in "3.1Windows".
(?<!pattern)
Negative lookbehind (non-capturing): like negative lookahead, but it checks the text before the position. For example, "(?<!95|98|NT|2000)Windows" matches "Windows" in "3.1Windows", but not "Windows" in "2000Windows".
Caveat: the two lookbehind examples above do not work as written in Python. Python's re module requires a lookbehind to match text of one fixed length, so all of its alternatives must be the same length. "(?<!95|98|NT|20)Windows" is accepted because every alternative is two characters long, while "(?<!95|980|NT|20)Windows" raises an error. A single alternative may be any length, e.g. "(?<!2000)Windows" works.
x|y
Matches x or y. For example, "z|food" matches "z" or "food". Take care here: | applies to the whole expression on each side, so this is not "zood" or "food". "[zf]ood" matches "zood" or "food".
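[xyz]
Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]
Negated character set. Matches any one character that is not enclosed. For example, "[^abc]" matches each of "p", "l", "i", and "n" in "plain".
[a-z]
Character range. Matches any one character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z".
Note: a hyphen denotes a range only when it is inside a character class and sits between two characters. At the start of the class it stands for a literal hyphen.
[^a-z]
Negated character range. Matches any one character outside the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".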
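\b
Matches a word boundary, that is, the position between a word and a space. (Regular expression "matching" comes in two kinds: matching characters and matching positions. \b matches a position.) For example, "er\b" matches the "er" in "never", but not the "er" in "verb".
\B
Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never".
\cx
Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return. x must be in A-Z or a-z. Otherwise, c is treated as a literal "c".
\d
Matches a digit character. Equivalent to [0-9]. With grep, add -P (Perl-compatible regular expressions) to use it.
\D
Matches a non-digit character. Equivalent to [^0-9]. With grep, add -P (Perl-compatible regular expressions) to use it.
\f
Matches a form feed. Equivalent to \x0c and \cL.
\n
Matches a newline. Equivalent to \x0a and \cJ.
\r
Matches a carriage return. Equivalent to \x0d and \cM.
\s
Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S
Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t
Matches a tab. Equivalent to \x09 and \cI.
\v
Matches a vertical tab. Equivalent to \x0b and \cK.
\w
Matches any word character, including the underscore. Similar but not equivalent to "[A-Za-z0-9_]": the "word" characters here are drawn from the Unicode character set.
\W
Matches any non-word character, the complement of \w. Equivalent to "[^A-Za-z0-9_]" only when \w is limited to ASCII.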
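\xn
Matches n, where n is a hexadecimal escape value. The hexadecimal escape must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". ASCII codes can be used in regular expressions this way.
\num
Matches num, where num is a positive integer: a backreference to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n
Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm
Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml
If n is an octal digit (0-3) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un
Matches n, where n is a Unicode character written as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).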
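\p{P}
The lowercase p stands for "property", meaning a Unicode property. It is the prefix used in Unicode regular expressions. The "P" inside the braces names one of the seven Unicode character categories: punctuation.
The other six categories:
L: letters;
M: marks (these do not usually appear on their own);
Z: separators (such as spaces and line breaks);
S: symbols (such as mathematical and currency symbols);
N: numbers (such as Arabic and Roman numerals);
C: other characters.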
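*Note: some languages or versions do not support this syntax, for example JavaScript before ES2018 and Python's built-in re module.
\<
\>
Match the start (\<) and end (\>) of a word. For example, the regular expression \<the\> matches "the" in the string "for the wise", but not "the" in the string "otherwise". Note: not all software supports these metacharacters.
( )
Defines the expression between ( and ) as a "group" and saves the characters that match it to a temporary area (up to 9 can be saved in one regular expression). They can be referenced as \1 through \9.
|
Performs a logical OR between two match conditions. For example, the regular expression (him|her) matches "it belongs to him" and "it belongs to her", but not "it belongs to them." Note: not all software supports this metacharacter.

Note: this table comes from Baidu Baike.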
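
Two rows of the table are easier to follow when run: the non-greedy ? and the lookaround groups. The following is a minimal sketch with Python's re module. The sample strings come from the table, and the shorter lookbehind pattern (?<=95|98|NT) is an adjusted variant, assumed here only so that every alternative has the same length.

```python
import re

# Greedy vs. non-greedy: "o+" takes as many "o" as possible, "o+?" as few.
greedy = re.findall(r"o+", "oooo")
lazy = re.findall(r"o+?", "oooo")
print(greedy)  # ['oooo']
print(lazy)    # ['o', 'o', 'o', 'o']

# Positive lookahead: "Windows" matches only when a listed version follows.
ahead = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")
print(ahead.group())  # Windows (the "2000" is checked but not consumed)

# Negative lookahead: the same text is rejected when a listed version follows.
not_ahead = re.search(r"Windows(?!95|98|NT|2000)", "Windows2000")
print(not_ahead)  # None

# Lookbehind with equal-width alternatives compiles and matches in Python.
fixed = re.search(r"(?<=95|98|NT)Windows", "98Windows")
print(fixed.group())  # Windows

# Mixed-width alternatives (2 and 4 characters) are rejected at compile time.
try:
    re.compile(r"(?<=95|98|NT|2000)Windows")
    mixed_width_ok = True
except re.error:
    mixed_width_ok = False
print(mixed_width_ok)  # False
```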

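To use regular expressions, import the re module: import re

Match objects

1. re.compile()  # returns a Regex pattern object

    eg: a = re.compile(r'\d\d\d\d')  # a now holds a Regex object; the r prefix marks a raw string, in which backslashes are not treated as escape characters

2. search(): searches the string passed to it

    eg: b = a.search('sdasfds4444jkh')  # a is the Regex object; returns None if nothing is found

If a match is found, search() returns a Match object, whose group() method returns the matched text.

In this example, print(b.group()) shows the result, 4444.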
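
Put together, the compile, search, and group steps can be sketched as follows. The variable names and the second sample string are illustrative assumptions. The None check is worth keeping, because calling group() on a failed search raises an AttributeError.

```python
import re

# Compile once, then reuse the pattern object.
four_digits = re.compile(r"\d\d\d\d")

found = four_digits.search("sdasfds4444jkh")
missing = four_digits.search("no digits here")  # hypothetical non-matching input

# search() returns None when nothing matches, so check before calling group().
if found is not None:
    print(found.group())  # 4444
print(missing)            # None
```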

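Grouping

1. Use parentheses to create groups

  eg: (\d\d\d)-(\d\d\d\d) contains two groups. Call b.group(1) to read the first group; passing 0 or no argument returns the entire matched text.

    To get all groups at once, use groups().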
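
A short sketch of group() and groups() on the two-group pattern above. The sample text and variable names are made up for illustration.

```python
import re

phone = re.compile(r"(\d\d\d)-(\d\d\d\d)")
m = phone.search("call 555-4242 today")  # hypothetical sample text

whole = m.group()    # entire match, same as m.group(0)
first = m.group(1)   # text captured by the first pair of parentheses
second = m.group(2)  # text captured by the second pair
both = m.groups()    # tuple holding every group

print(whole)   # 555-4242
print(first)   # 555
print(second)  # 4242
print(both)    # ('555', '4242')
```

Because groups() returns a tuple, it can be unpacked directly, for example prefix, number = m.groups().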


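Pipes

1. Symbol: |

2. Purpose: separate several expressions with pipes to match any one of them. When the expressions on both sides of a pipe both occur in the text, the match returned is whichever occurs first.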
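
The "first occurrence wins" behaviour can be checked with a small sketch. The words cat and dog and the sentences are illustrative assumptions. The last two lines show that | splits the whole pattern unless the alternatives are wrapped in a group.

```python
import re

pattern = re.compile(r"cat|dog")

# search() returns whichever alternative appears first in the text,
# regardless of the order of the alternatives in the pattern.
first_in_a = pattern.search("my dog chased the cat").group()
first_in_b = pattern.search("the cat chased my dog").group()
print(first_in_a)  # dog
print(first_in_b)  # cat

# "z|food" means "z" or "food", so it cannot match "zood" as a whole.
loose = re.fullmatch(r"z|food", "zood")
# Grouping limits the alternation to the first letter.
grouped = re.fullmatch(r"(?:z|f)ood", "zood")
print(loose)            # None
print(grouped.group())  # zood
```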

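Optional matching

1. Symbol: ?

Purpose: in bat(x)?hjj, x may appear zero or one time.

2. Symbol: *

Purpose: the group before * may repeat zero or more times.

3. Symbol: +

Purpose: the group before + must repeat one or more times.

4. Symbol: {x}  # x is a non-negative integer

Purpose: the group before {x} repeats exactly x times.

{3,5} repeats 3 to 5 times

{3,} repeats 3 or more times

{,5} repeats 5 or fewer times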
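
The quantifiers above can be compared side by side. This is a sketch only: the bat(x)hjj pattern comes from the notes, while the "ha" test string is an assumption chosen to make the repeat counts easy to see.

```python
import re

# ? : the group may appear zero or one time.
optional = re.compile(r"bat(x)?hjj")
zero_x = optional.search("bathjj")    # matches, x absent
one_x = optional.search("batxhjj")    # matches, x present once
two_x = optional.search("batxxhjj")   # None, x appears twice

# * : zero or more times; + : one or more times.
star_hit = re.search(r"bat(x)*hjj", "batxxxhjj").group()
plus_miss = re.search(r"bat(x)+hjj", "bathjj")  # None, at least one x required

# {..} : explicit repeat counts, shown on six repetitions of "ha".
laughs = "ha" * 6
exactly_three = re.search(r"(ha){3}", laughs).group()
three_to_five = re.search(r"(ha){3,5}", laughs).group()
at_least_three = re.search(r"(ha){3,}", laughs).group()
at_most_five = re.search(r"(ha){,5}", laughs).group()

print(zero_x.group(), one_x.group(), two_x)  # bathjj batxhjj None
print(star_hit, plus_miss)                   # batxxxhjj None
print(len(exactly_three) // 2)   # 3
print(len(three_to_five) // 2)   # 5 (greedy: takes the upper bound)
print(len(at_least_three) // 2)  # 6
print(len(at_most_five) // 2)    # 5
```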

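Character classes

[0-5] matches a single digit from 0 to 5

\d+ matches one or more digits

[dhoiymne] matches any single character listed inside the []

Note: inside [], most regular-expression metacharacters (such as . * + ?) are not interpreted and stand for themselves.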
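
A quick sketch of these character classes with findall(). The input strings are illustrative assumptions.

```python
import re

# [0-5] matches one digit in the range 0 to 5.
low_digits = re.findall(r"[0-5]", "0123456789")
print(low_digits)  # ['0', '1', '2', '3', '4', '5']

# \d+ matches a run of one or more digits.
numbers = re.findall(r"\d+", "room 12, floor 3")
print(numbers)  # ['12', '3']

# A class matches exactly one of the listed characters at a time.
listed = re.findall(r"[dhoiymne]", "phone")
print(listed)  # ['h', 'o', 'n', 'e']

# Inside [], characters such as . * + are literal, not metacharacters.
literal = re.findall(r"[.*+]", "a.b*c+d")
print(literal)  # ['.', '*', '+']
```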

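findall(): a method of Regex objects

1. Returns a list of strings containing every match.

2. If the regular expression contains more than one group, it returns a list of tuples, one per match; with exactly one group, it returns a list of that group's strings.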
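
How the return value of findall() changes with the number of groups can be seen in a small sketch. The sample text is an assumption.

```python
import re

text = "Cell: 415-5555 Work: 212-0000"  # hypothetical sample text

# No groups: a list of the full matched strings.
no_groups = re.compile(r"\d\d\d-\d\d\d\d").findall(text)
print(no_groups)  # ['415-5555', '212-0000']

# Two groups: a list of tuples, one tuple per match.
two_groups = re.compile(r"(\d\d\d)-(\d\d\d\d)").findall(text)
print(two_groups)  # [('415', '5555'), ('212', '0000')]

# Exactly one group: a list of that group's strings only.
one_group = re.compile(r"(\d\d\d)-\d\d\d\d").findall(text)
print(one_group)  # ['415', '212']
```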

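sub(): a method of Regex objects

1. Purpose: find and replace.

2. Usage: call it on a pattern built with compile.

eg: x = re.compile(r'sas')

x.sub('c', 'sasikkksas')

This replaces every sas in sasikkksas with c, giving 'cikkkc'.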
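
The same replacement as a runnable sketch, plus two common variations that the notes do not cover and that are added here as assumptions: limiting the number of replacements with count, and reusing a captured group in the replacement with \1.

```python
import re

x = re.compile(r"sas")

# Replace every occurrence.
replaced = x.sub("c", "sasikkksas")
print(replaced)  # cikkkc

# count=1 replaces only the first occurrence.
first_only = x.sub("c", "sasikkksas", count=1)
print(first_only)  # cikkksas

# \1 in the replacement inserts the text of group 1 (hypothetical phone text).
masked = re.sub(r"(\d\d\d)-(\d\d\d\d)", r"\1-****", "call 555-4242")
print(masked)  # call 555-****
```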

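The second argument to re.compile()

1. re.DOTALL

Makes . match every character, including newline, so .* can match all characters.

2. re.IGNORECASE or re.I

Makes the regular expression case-insensitive.

3. re.VERBOSE

Ignores whitespace and comments inside the pattern string.

Note: to pass more than one flag as the second argument, join them with |.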
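
A sketch of the three flags, alone and combined with |. The two-line sample text and the patterns are illustrative assumptions.

```python
import re

text = "First line\nSecond line"  # hypothetical two-line input

# Without re.DOTALL, "." stops at the newline.
default_dot = re.search(r".*", text).group()
# With re.DOTALL, ".*" runs across the newline.
dotall = re.compile(r".*", re.DOTALL).search(text).group()
print(repr(default_dot))  # 'First line'
print(repr(dotall))       # 'First line\nSecond line'

# re.I ignores case.
ignore_case = re.compile(r"second", re.I).search(text).group()
print(ignore_case)  # Second

# re.VERBOSE lets the pattern span lines and carry comments.
phone = re.compile(r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
    """, re.VERBOSE)
parts = phone.search("555-4242").groups()
print(parts)  # ('555', '4242')

# Several flags are joined with |.
combined = re.compile(r"""
    first   # matched in any case
    .*      # anything, including the newline
    LINE    # matched in any case
    """, re.IGNORECASE | re.DOTALL | re.VERBOSE)
spanning = combined.search(text).group()
print(repr(spanning))  # 'First line\nSecond line'
```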



