Day 15: Regular expressions and the re module

1. Regular Expressions

What is a regular expression?

  A regular expression combines symbols that carry special meaning into a pattern that describes a set of strings. In other words, it is a rule for describing a class of things.

  In Python, regular expressions are mainly used to search strings and filter out specific content.

Where regular expressions are used

  1. Web crawlers

  2. Data analysis

Regular expression symbols

(1) Characters

Metacharacter   What it matches
.               Any character except a newline
\w              A letter, digit, or underscore
\s              Any whitespace character
\d              A digit
\n              A newline
\t              A tab
\b              A word boundary (the start or end of a word)
^               The beginning of the string
$               The end of the string
\W              Any character that is not a letter, digit, or underscore
\D              Any non-digit character
\S              Any non-whitespace character
a|b             Either a or b
()              The expression inside the parentheses; also defines a group
[...]           Any one character in the set
[^...]          Any one character that is not in the set
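As a quick illustration of the table above, the following sketch applies a few of the metacharacters with re.findall. The sample strings are made up for this example and are not from the original post.

```python
import re

# Sample text invented for illustration.
text = "Tom_1 has 42 cats\tand 7 dogs\n"

words = re.findall(r"\w+", text)    # runs of letters, digits, underscores
numbers = re.findall(r"\d+", text)  # runs of digits
blanks = re.findall(r"\s", text)    # every single whitespace character

# [^...] matches one character that is NOT in the set.
consonants = re.findall(r"[^aeiou\s]", "a cat")

# \b is a word boundary, so only the standalone word "cat" matches.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat")

print(words)        # ['Tom_1', 'has', '42', 'cats', 'and', '7', 'dogs']
print(numbers)      # ['1', '42', '7']
print(consonants)   # ['c', 't']
print(whole_words)  # ['cat', 'cat']
```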

 Notes:

^ and $ mark the beginning and end of the string. Used together they force an exact match: the whole string must consist of exactly what is written between them.

With an alternation such as abc|ab, put the longer alternative first. Alternatives are tried from left to right, so if the shorter one came first it would match and the longer one would not get its chance.
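The effect of alternative order can be checked directly. This is a minimal sketch using the standard re module; the input string is only an example.

```python
import re

text = "abc"

# Alternatives are tried left to right, and the first one that matches wins.
short_first = re.findall(r"ab|abc", text)  # 'ab' matches, 'abc' is never reached
long_first = re.findall(r"abc|ab", text)   # 'abc' is tried first and matches

print(short_first)  # ['ab']
print(long_first)   # ['abc']
```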

^ is used in two ways:

  Written outside brackets, ^ anchors the match to the beginning of the string: whatever follows it must appear at the very start.

  Written as [^...], with ^ as the first character inside the brackets, it negates the set: the set matches any character except the ones listed.
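The two roles of ^, and the exact-match effect of combining ^ with $, might look like this in Python. The strings are illustrative only.

```python
import re

# Outside brackets: ^ anchors the match to the start of the string.
leading_digits = re.findall(r"^\d+", "123abc456")

# Inside brackets: ^ negates the set, so this matches runs of non-digits.
non_digits = re.findall(r"[^\d]+", "123abc456")

# ^ and $ together: the whole string must be exactly three digits.
exact = re.search(r"^\d{3}$", "123") is not None
too_long = re.search(r"^\d{3}$", "1234") is not None

print(leading_digits)   # ['123']
print(non_digits)       # ['abc']
print(exact, too_long)  # True False
```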

Groups: when several symbols need to be repeated, or otherwise treated as a single unit, wrap them in a group.

  The syntax for a group is ()
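A small sketch of why grouping matters when repeating more than one symbol. The pattern and input are chosen only for illustration.

```python
import re

# Without a group, + applies only to the symbol right before it (the b).
no_group = re.search(r"ab+", "ababab").group()

# With a group, + repeats the whole unit "ab".
grouped = re.search(r"(ab)+", "ababab").group()

print(no_group)  # ab
print(grouped)   # ababab
```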

 

(2) Commonly used character sets

Character set: [characters], with the candidate characters listed inside square brackets

  A character set matches any one of several characters that may appear at the same position; in a regular expression it is written with []

  Characters fall into categories such as digits, letters, and punctuation

[0-9]         All digits from 0 to 9
[a-z]         All lowercase letters
[A-Z]         All uppercase letters
[0-9a-zA-Z]   All digits and letters
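The ranges above can be tried out as follows. This is only a sketch, and the sample text is invented.

```python
import re

text = "Room 42B, floor 3a"

digits = re.findall(r"[0-9]", text)        # one digit per match
lower = re.findall(r"[a-z]+", text)        # runs of lowercase letters
upper = re.findall(r"[A-Z]", text)         # one uppercase letter per match
alnum = re.findall(r"[0-9a-zA-Z]+", text)  # runs of digits and letters

print(digits)  # ['4', '2', '3']
print(lower)   # ['oom', 'floor', 'a']
print(upper)   # ['R', 'B']
print(alnum)   # ['Room', '42B', 'floor', '3a']
```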

 

(3) Quantifiers

Quantifier   Meaning
*            Repeat zero or more times
+            Repeat one or more times
?            Repeat zero or one time
{n}          Repeat exactly n times
{n,}         Repeat n or more times
{n,m}        Repeat n to m times

 A quantifier must follow another regular expression symbol, and it applies only to the symbol immediately before it
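Each quantifier in the table can be exercised with a short test. The inputs below are made up, and \b is used only to keep the digit runs separate.

```python
import re

star = re.findall(r"ba*", "b ba baa")              # a repeated zero or more times
plus = re.findall(r"ba+", "b ba baa")              # a repeated one or more times
optional = re.findall(r"colou?r", "color colour")  # u is optional

nums = "1 22 333 4444"
exactly_two = re.findall(r"\b\d{2}\b", nums)       # exactly 2 digits
two_or_more = re.findall(r"\b\d{2,}\b", nums)      # 2 digits or more
two_to_three = re.findall(r"\b\d{2,3}\b", nums)    # between 2 and 3 digits

print(star)          # ['b', 'ba', 'baa']
print(plus)          # ['ba', 'baa']
print(optional)      # ['color', 'colour']
print(exactly_two)   # ['22']
print(two_or_more)   # ['22', '333', '4444']
print(two_to_three)  # ['22', '333']
```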

 

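(4) The escape character \

  Regular expressions have many sequences with special meaning, such as \n and \s. To match a literal \n (a backslash followed by n) rather than a newline, escape the backslash with another backslash, which gives the regular expression \\n.

  A normal Python string applies its own escaping, though, so each backslash in the pattern has to be doubled again and the pattern must be written '\\\\n', which is cumbersome. Prefixing the string with r makes it a raw string, and then r'\\n' is enough.

How the pattern is written   Text to match (literal)   Pattern
Regular expression           \n                        \\n
Normal Python string         \n                        '\\\\n'
Raw Python string (r)        \n                        r'\\n'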
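The difference between the normal-string and raw-string spellings can be verified in a few lines. The sample text is an assumption made for this sketch.

```python
import re

text = r"C:\new"  # contains a backslash followed by n, not a newline

normal = re.findall("\\\\n", text)  # normal string: four backslashes needed
raw = re.findall(r"\\n", text)      # raw string: two backslashes are enough
newline = re.findall(r"\n", text)   # regex \n means a real newline: no match here

print("\\\\n" == r"\\n")  # True: both spell the same pattern
print(normal == raw)      # True
print(len(raw))           # 1
print(newline)            # []
```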

 

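(5) Greedy matching

Greedy matching: when a match is possible, match the longest string possible. Matching is greedy by default.

Regex   Text to match         Result                   Notes
<.*>    <script>...<script>   '<script>...<script>'    Greedy by default: matches as many characters as possible
<.*?>   <script>...<script>   '<script>', '<script>'   Adding ? makes the quantifier non-greedy: matches as few characters as possible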

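The two patterns can be compared on the same input with re.findall, as in this short sketch.

```python
import re

html = "<script>...<script>"

greedy = re.findall(r"<.*>", html)  # .* runs to the last > in the string
lazy = re.findall(r"<.*?>", html)   # .*? stops at the first > it can

print(greedy)  # ['<script>...<script>']
print(lazy)    # ['<script>', '<script>']
```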
 

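Common non-greedy quantifiers

  *? Repeat any number of times, but as few times as possible

  +? Repeat one or more times, but as few times as possible

  ?? Repeat zero or one time, but as few times as possible

  {n,m}? Repeat n to m times, but as few times as possible

  {n,}? Repeat n or more times, but as few times as possible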
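To see how each non-greedy form differs from its greedy counterpart, one option is to run both against the same digits. The helper name first_match is hypothetical and exists only for this sketch.

```python
import re

def first_match(pattern, text="12345"):
    """Return the text of the first match (hypothetical helper for this demo)."""
    return re.search(pattern, text).group()

# Pairs of (greedy result, non-greedy result) for each quantifier.
results = {
    "*": (first_match(r"\d*"), first_match(r"\d*?")),
    "+": (first_match(r"\d+"), first_match(r"\d+?")),
    "?": (first_match(r"\d?"), first_match(r"\d??")),
    "{2,4}": (first_match(r"\d{2,4}"), first_match(r"\d{2,4}?")),
    "{2,}": (first_match(r"\d{2,}"), first_match(r"\d{2,}?")),
}

for quantifier, (greedy, lazy) in results.items():
    print(quantifier, repr(greedy), repr(lazy))
# *     '12345' ''
# +     '12345' '1'
# ?     '1'     ''
# {2,4} '1234'  '12'
# {2,}  '12345' '12'
```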

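How .*? works

  . matches any character (except a newline)

  * repeats it zero or more times

  ? makes the repetition non-greedy

  Together they consume as few characters as possible, so .*? is rarely written on its own

  It usually appears as .*?x: consume characters of any length up to the first occurrence of x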
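Here is a sketch of the .*?x idiom with ; playing the role of x. The input string is invented for illustration.

```python
import re

text = "key=value; other=thing;"

# Non-greedy: stop at the first ';'.
up_to_first = re.search(r".*?;", text).group()

# Greedy: run on to the last ';'.
up_to_last = re.search(r".*;", text).group()

# With findall, the non-greedy form yields one piece per ';'.
pieces = re.findall(r".*?;", text)

print(up_to_first)  # key=value;
print(up_to_last)   # key=value; other=thing;
print(pieces)       # ['key=value;', ' other=thing;']
```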

 

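2. The re module

How the re module relates to regular expressions

  Regular expressions are not unique to Python. They are an independent technology that most programming languages support. In Python, they are used through the standard library's re module.

Common functions in the re module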

import re
res = re.findall('z', 'sxc zzj zzp')  # return every match, collected in a list
print(res)  # ['z', 'z', 'z', 'z']

res1 = re.search('z', 'sxc zzj zzp')  # scan the string and stop at the first match
print(res1)  # a Match object describing the match
print(res1.group())  # group() returns the matched text: z
res2 = re.search('a', 'sxc zzj zzp')
print(res2)  # None, because nothing in the string matches

res3 = re.match('s', 'sxc zzj zzp')  # like search, but only matches at the start of the string
print(res3.group())  # s

res4 = re.split('[23]', '2a1b2c3d4')  # split wherever a '2' or a '3' appears
print(res4)  # ['', 'a1b', 'c', 'd4']

res5 = re.sub(r'\d', 'haha', '1-2-3-4-5-')  # replace every digit with haha
print(res5)  # haha-haha-haha-haha-haha-
res6 = re.sub(r'\d', 'haha', '1-2-3-4-5-', 3)  # the optional count limits how many matches are replaced
print(res6)  # haha-haha-haha-4-5-

res7 = re.subn(r'\d', 'zz', '1-2-3-4-5', 3)  # returns a tuple: (new string, number of replacements)
print(res7)  # ('zz-zz-zz-4-5', 3)

obj = re.compile(r'\d{5}')  # compile the pattern into a regex object; it matches 5 digits
res8 = obj.search('s123456xc')  # call search on the compiled object, passing the string to search
print(res8.group())  # 12345

res9 = re.finditer(r'\d', '1s2s5f6t')  # finditer returns an iterator of Match objects
print(res9)  # an iterator object
print(next(res9).group())  # 1
print(next(res9).group())  # 2
print(next(res9).group())  # 5
print(next(res9).group())  # 6
print(next(res9).group())  # raises StopIteration: the iterator is exhausted
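Two of the calls above fail if used carelessly: group() on a None result raises AttributeError, and next() on an exhausted iterator raises StopIteration. The following sketch shows one possible way to guard against both. It is a suggestion, not part of the original walkthrough.

```python
import re

# search and match return None when nothing matches, so check before group().
m = re.search(r"a", "sxc zzj zzp")
found = m.group() if m is not None else None

# A for loop (here in a comprehension) consumes the iterator and stops cleanly.
digits = [(d.group(), d.start()) for d in re.finditer(r"\d", "1s2s5f6t")]

# next() also accepts a default that is returned once the iterator is empty.
it = re.finditer(r"\d", "1")
first = next(it).group()
after_end = next(it, None)

print(found)             # None
print(digits)            # [('1', 0), ('2', 2), ('5', 4), ('6', 6)]
print(first, after_end)  # 1 None
```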

 

Group priority in findall

import re
res = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')  # we want www.oldboy.com
print(res)  # ['oldboy'], because findall returns the group's content when the pattern has a group

res1 = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')  # we want www.oldboy.com
print(res1)  # ['www.oldboy.com'], because ?: makes the group non-capturing
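If both the whole match and the group content are needed, a Match object from re.search exposes each separately. The sketch below also shows what findall returns when a pattern has more than one group. The e-mail-like strings are invented for illustration.

```python
import re

m = re.search(r"www\.(baidu|oldboy)\.com", "www.oldboy.com")
whole = m.group(0)  # the entire match
inner = m.group(1)  # the content of the first group

# With several groups, findall returns one tuple per match.
pairs = re.findall(r"(\w+)@(\w+)\.com", "a@x.com b@y.com")

print(whole)  # www.oldboy.com
print(inner)  # oldboy
print(pairs)  # [('a', 'x'), ('b', 'y')]
```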

 

 

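Group priority in split

import re
res = re.split(r'\d', 'sxc1zzj2zzp3')  # split on every digit
print(res)  # ['sxc', 'zzj', 'zzp', '']

res1 = re.split(r'(\d)', 'sxc1zzj2zzp3')  # with parentheses, the digits are kept
print(res1)  # ['sxc', '1', 'zzj', '2', 'zzp', '3', '']

Wrapping the pattern in () changes the result completely: without (), the matched separators are dropped; with (), they are kept in the result. This matters whenever the matched items need to be preserved.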
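One place where keeping the separators is useful is splitting an arithmetic expression into numbers and operators. This is a hypothetical use case chosen for illustration, not one taken from the original post.

```python
import re

expr = "3+4*5-6"

# Capturing group: the operators stay in the result.
tokens = re.split(r"([+*-])", expr)

# Non-capturing group: the operators are dropped, as if there were no group.
numbers = re.split(r"(?:[+*-])", expr)

print(tokens)   # ['3', '+', '4', '*', '5', '-', '6']
print(numbers)  # ['3', '4', '5', '6']
```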

 

Origin www.cnblogs.com/sxchen/p/11202117.html