Regular Expressions
The relationship between the re module and regular expressions
Regular expressions are not unique to Python; they are an independent technology.
Any programming language can use regular expressions.
To use them in Python, however, you must rely on the re module.
A regular expression filters specific content out of a string.
Regular expression application scenarios
1 Web crawlers
2 Data analysis
Note: names that begin with re generally have something to do with regular expressions.
Character group [character set]
The set of characters that may appear at one position is written inside [] in a regular expression. The characters can be listed directly: [0123456789] means that any one of the 10 digits may appear at that position.
Metacharacter | Matched content |
. | Matches any character except a newline |
\w | Matches a letter, digit, or underscore |
\d | Matches a digit |
\s | Matches any whitespace character |
\W | Matches any character that is not a letter, digit, or underscore |
\D | Matches a non-digit |
\S | Matches a non-whitespace character |
\n | Matches a newline |
\t | Matches a tab |
\b | Matches a word boundary (the start or end of a word) |
^ | Matches the beginning of the string |
$ | Matches the end of the string |
a|b | Matches a or b (put the longer alternative first) |
() | Groups the regular expression inside the parentheses |
[...] | Matches any one character in the set |
[^...] | Matches any one character not in the set |
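A few of the metacharacters in the table can be tried out with a short sketch. The sample text and variable names here are made up for illustration; only re.findall from the standard library is used.

```python
import re

# Made-up sample text; the "\t" in a normal string is a real tab character.
text = "user_42 paid 15 dollars\ton 2024-01-05"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscores
non_space = re.findall(r"\S+", text)   # runs of non-whitespace characters

# \b is a word boundary, so "on" inside "button" is not matched.
whole_on = re.findall(r"\bon\b", "on a button, carry on")

print(digits)
print(words)
print(non_space)
print(whole_on)
```

Raw strings (the r prefix) keep backslashes such as \d and \b from being interpreted by Python before the re module sees them.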
^ and $ are often used together to pin down the whole string: the string must match exactly what is written between them, with nothing extra and nothing missing.
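The effect of using ^ and $ together can be sketched as follows. The rule itself (11 digits, the first being 1) is a hypothetical example chosen only to show the anchors at work.

```python
import re

# Hypothetical rule, for illustration only: exactly 11 digits, the first being 1.
pattern = r"^1\d{10}$"
candidates = ["13812345678", "138123456789", "x13812345678"]

# With both anchors, one digit too many or a stray leading character fails.
results = {c: re.search(pattern, c) is not None for c in candidates}
for c, ok in results.items():
    print(c, ok)

# Without the anchors, the same rule accepts any string that contains a match.
loose = re.search(r"1\d{10}", "x13812345678") is not None
print(loose)
```

One detail worth knowing: $ also matches just before a trailing newline, so re.fullmatch(r"1\d{10}", s) is a stricter alternative when the whole string must match.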
Regular | Description |
[0-9] | Any one digit from 0 to 9 |
[a-z] | Any one of the 26 lowercase letters |
[A-Z] | Any one of the 26 uppercase letters |
[0-9a-zA-Z] | Any one digit, lowercase letter, or uppercase letter |
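The ranges in the table can be compared side by side on one string. This is a small sketch; the sample string is invented for the example.

```python
import re

# Invented sample string containing digits, lowercase and uppercase letters.
s = "Room 7B, floor 3, code x9Z"

digits = re.findall(r"[0-9]", s)          # one digit per match
uppercase = re.findall(r"[A-Z]", s)       # one uppercase letter per match
runs = re.findall(r"[0-9a-zA-Z]+", s)     # runs of letters and digits
others = re.findall(r"[^0-9a-zA-Z]", s)   # everything outside the set

print(digits)
print(uppercase)
print(runs)
print(others)
```

A character set always matches exactly one character; the + after the set in the third pattern is what turns single characters into runs.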
Quantifier | Usage notes |
* | Repeat zero or more times |
+ | Repeat one or more times |
? | Repeat zero or one time |
{n} | Repeat exactly n times |
{n,} | Repeat n or more times |
{n,m} | Repeat n to m times |
<.*> | Greedy by default: matches the longest possible string |
<.*?> | Adding ? switches from greedy to non-greedy: matches the shortest possible string |
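Greedy matching: take as many characters as possible. Every quantifier is greedy by default.
Non-greedy matching: adding ? after a quantifier turns it from greedy to non-greedy.
*? repeat any number of times, but as few times as possible
+? repeat one or more times, but as few times as possible
?? repeat zero or one time, but as few times as possible
{n,m}? repeat n to m times, but as few times as possible
{n,}? repeat n or more times, but as few times as possible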
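The counted quantifiers and the greedy versus non-greedy distinction can be checked with a short sketch. The sample strings are invented; the patterns are the ones listed above.

```python
import re

# Counted repetition: \b keeps each match aligned to a whole "word" of a's.
s = "a aa aaa aaaa"
exactly_two = re.findall(r"\ba{2}\b", s)
two_or_more = re.findall(r"\ba{2,}\b", s)
two_to_three = re.findall(r"\ba{2,3}\b", s)

# Greedy vs. non-greedy on the same input.
html = "<b>bold</b> and <i>italic</i>"
greedy_tags = re.findall(r"<.*>", html)    # one match spanning the whole string
lazy_tags = re.findall(r"<.*?>", html)     # each tag on its own

digits = "123456"
greedy_digits = re.search(r"\d{2,4}", digits).group()   # takes the maximum, 4
lazy_digits = re.search(r"\d{2,4}?", digits).group()    # takes the minimum, 2
lazy_plus = re.search(r"\d+?", digits).group()          # takes the minimum, 1

print(exactly_two, two_or_more, two_to_three)
print(greedy_tags)
print(lazy_tags)
print(greedy_digits, lazy_digits, lazy_plus)
```

A non-greedy quantifier still has to let the rest of the pattern succeed, so it takes the fewest repetitions that make the whole match work, not always the minimum.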
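Usage of .*?
. is any character
* repeats from zero to an unlimited number of times
? makes the match non-greedy
Together, the three take as few arbitrary characters as possible
Typical use:
.*?x takes characters of any length and stops as soon as the first x appears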
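The .*?x idiom can be compared with its greedy counterpart in a short sketch. The strings and the name=...; record layout are made up for the example.

```python
import re

s = "abcx123x"

lazy = re.search(r".*?x", s).group()     # stops at the first x
greedy = re.search(r".*x", s).group()    # runs on to the last x
pieces = re.findall(r".*?x", s)          # one piece per x

# Made-up record format: pull out the text between "name=" and the next ";".
# With one group in the pattern, findall returns the group contents only.
record = "name=tom;name=amy;"
names = re.findall(r"name=(.*?);", record)

print(lazy)
print(greedy)
print(pieces)
print(names)
```

This is why .*? followed by a delimiter is a common way to cut a field out of a longer string: the delimiter acts as the stopping point.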
re module
findall returns every result that satisfies the pattern, collected in a list
    import re

    s = '0123456789'
    print(re.findall('1', s))       # ['1']
    print(re.findall('[0-3]', s))   # ['0', '1', '2', '3']
    print(re.findall('[0-9]', s))   # ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    print(re.findall('asd', s))     # [] an empty list is returned when nothing matches

    s1 = 'hello my name is james'
    print(re.findall('[a-h]', s1))  # ['h', 'e', 'a', 'e', 'a', 'e']
    print(re.findall('[a]', s1))    # ['a', 'a']
    print(re.findall('[123]', s1))  # [] nothing matches, so the list is empty
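search also looks for a match, but it stops as soon as it finds the first one. The pattern can be several consecutive characters.
Call .group() on the result to get the matched text.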
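    import re

    s = 'egon owen mac'
    # search returns a match object, not the text itself; use .group() to get the value.
    # Older Python versions print this object as <_sre.SRE_Match object; ...>.
    print(re.search('o', s))            # <re.Match object; span=(2, 3), match='o'>
    print(re.search('k', s))            # None: without .group(), a failed search returns None
    print(re.search('o', s).group())    # o
    print(re.search('ego', s).group())  # ego
    # print(re.search('k', s).group())  # no match: calling .group() on None raises AttributeError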
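Because search returns None when nothing matches, it is usually safer to check the result before calling .group(). A minimal sketch follows; first_match is a hypothetical helper name, not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the first matched text, or `default` when there is no match."""
    m = re.search(pattern, text)
    # A match object is truthy; None is falsy.
    return m.group() if m else default

s = 'egon owen mac'
found = first_match('o', s)
missing = first_match('k', s)
fallback = first_match('k', s, default='')

print(found)
print(missing)
print(repr(fallback))
```

The same check works for match, since it also returns None on failure.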
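match works like search: it looks for the pattern in the string and returns the matched text through .group(). The difference is that match only checks whether the string starts with the pattern. If it does not, match returns None, and calling .group() on that None raises an error.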
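    import re

    s = 'egon owen mac'
    print(re.match('k', s))            # None: no match at the start of the string
    print(re.match('e', s).group())    # e: on success, .group() returns the matched text
    # print(re.match('k', s).group())  # no match: calling .group() on None raises AttributeError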
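The difference between match and search is easiest to see when the pattern occurs in the middle of the string. A small sketch using the same sample string:

```python
import re

s = 'egon owen mac'

at_start = re.match('egon', s)         # pattern is at the start: match succeeds
not_at_start = re.match('owen', s)     # pattern is in the middle: match fails
anywhere = re.search('owen', s)        # search scans the whole string
anchored = re.search('^owen', s)       # with ^, search behaves much like match here

print(at_start.group())
print(not_at_start)
print(anywhere.group())
print(anchored)
```

For a single-line string like this one, match(p, s) and search('^' + p, s) give the same answer.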
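split cuts the string at every place the pattern matches. With [], every character in the set acts as a separator. Each separator at the very start of the string, or directly after another separator, produces an empty string in the result.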
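    import re

    s = 'asadfaghjkl'
    # Splitting on one character: a separator at the very start yields an empty string;
    # separators in the middle are removed and the pieces become list items.
    print(re.split('a', s))      # ['', 's', 'df', 'ghjkl']
    print(re.split('k', s))      # ['asadfaghj', 'l'] the separator is not at the start
    # [asa] is the set {a, s}: the string is cut at every a and every s.
    # The first three characters are all separators, so three empty strings come out.
    print(re.split('[asa]', s))  # ['', '', '', 'df', 'ghjkl']
    print(re.split('[fag]', s))  # ['', 's', 'd', '', '', 'hjkl']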
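The empty strings produced by split can be dropped or avoided in a couple of ways. The sketch below reuses the sample string; the maxsplit argument is a standard parameter of re.split.

```python
import re

s = 'asadfaghjkl'

parts = re.split('[fag]', s)            # one cut per separator character
non_empty = [p for p in parts if p]     # drop the empty strings afterwards
runs = re.split('[fag]+', s)            # treat a run of separators as one cut
first_cut = re.split('a', s, maxsplit=1)  # stop after the first cut

print(parts)
print(non_empty)
print(runs)
print(first_cut)
```

Adding + to the set merges adjacent separators, but a separator at the very start of the string still leaves one leading empty string.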
subn also performs replacement, but it returns a tuple: the new string followed by the number of replacements made.
    import re

    s = 'aaassaasdfaaghj'
    print(re.subn(r'\w', '3', s))  # ('333333333333333', 15)
    print(re.subn('a', '3', s))    # ('333ss33sdf33ghj', 7)
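For comparison, re.sub returns only the new string, and both functions accept a count argument that limits the number of replacements. A short sketch on the same sample string:

```python
import re

s = 'aaassaasdfaaghj'

replaced = re.sub('a', '3', s)                 # new string only
replaced_n = re.subn('a', '3', s)              # (new string, number of replacements)
first_two = re.sub('a', '3', s, count=2)       # replace at most 2 occurrences
first_two_n = re.subn('a', '3', s, count=2)

print(replaced)
print(replaced_n)
print(first_two)
print(first_two_n)
```

subn is handy when the caller needs to know whether anything was changed at all.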
Good to know
    import re

    obj = re.compile(r'\d{3}')        # compile the pattern into a regular expression object; the rule is 3 digits
    ret = obj.search('abc123eeee')    # the pattern object calls search on the string to be matched
    print(ret.group())                # result: 123

    ret = re.finditer(r'\d', 'ds3sy4784a')  # finditer returns an iterator over the match results
    print(ret)                        # <callable_iterator object at 0x10195f940>
    print(next(ret).group())          # the first result
    print(next(ret).group())          # the second result
    print([i.group() for i in ret])   # the remaining results
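A compiled pattern pays off when the same rule is applied to many strings, and finditer gives access to positions as well as text. The sketch below assumes a made-up list of input lines.

```python
import re

# Compile once, reuse for every line. The input lines are made up for illustration.
three_digits = re.compile(r'\d{3}')
lines = ['abc123eeee', 'no digits', 'x4567y89']

found = []
for line in lines:
    m = three_digits.search(line)
    if m:                      # skip lines without three consecutive digits
        found.append(m.group())

# Each item from finditer is a match object, so .span() is available too.
spans = [(m.group(), m.span()) for m in re.finditer(r'\d', 'ds3sy4784a')]

print(found)
print(spans)
```

Because finditer is lazy, it avoids building the full result list up front, which can matter on large inputs.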