re module & regular expressions

 

How the re module relates to regular expressions:

               Regular expressions are not unique to Python.

               Almost every programming language supports them; they form a small language of their own.

               To use them in Python, you rely on the re module.

Regular expression: a regular expression is a formula for operating on strings. It combines predefined special characters into a "rule string", and that rule string expresses filtering logic applied to other strings. Regular expressions are used to pick specific content out of text, for example in data analysis and in web crawlers.

Note: names that begin with "reg" are almost always related to regular expressions.

 

PS: online testing tool     http://tool.chinaz.com/regex/

The syntax of regular expressions:

1. Characters

   Metacharacter   Matches

   .       matches any character except a newline
   \w      matches a letter, digit, or underscore
   \s      matches any whitespace character
   \d      matches a digit
   \n      matches a newline
   \t      matches a tab
   \b      matches a word boundary (the start or end of a word)
   ^       matches the beginning of the string
   $       matches the end of the string
   \W      matches anything that is not a letter, digit, or underscore
   \D      matches a non-digit
   \S      matches a non-whitespace character
   a|b     matches either a or b
   ()      matches the expression inside the parentheses and also forms a group
   [...]   matches any one character in the set
   [^...]  matches any one character not in the set
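The table above can be tried out directly with re.findall and re.search. This is only a small sketch; the sample text and the variable names are made up for illustration.

```python
import re

text = "eva_01 has 2 cats\tand 10 dogs"   # made-up sample text

words = re.findall(r"\w+", text)            # runs of letters, digits, underscores
digits = re.findall(r"\d+", text)           # runs of digits
spaces = re.findall(r"\s", text)            # single whitespace characters (spaces and the tab)
either = re.findall(r"cats|dogs", text)     # alternation: either word
not_lower = re.findall(r"[^a-z\s]+", text)  # anything except lowercase letters and whitespace
starts = re.search(r"^eva", text) is not None   # ^ anchors at the start of the string
ends = re.search(r"dogs$", text) is not None    # $ anchors at the end of the string

print(words)         # ['eva_01', 'has', '2', 'cats', 'and', '10', 'dogs']
print(digits)        # ['01', '2', '10']
print(either)        # ['cats', 'dogs']
print(not_lower)     # ['_01', '2', '10']
print(starts, ends)  # True True
```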


2. Quantifiers

   *       repeat 0 or more times
   +       repeat 1 or more times
   ?       repeat 0 or 1 times
   {n}     repeat exactly n times
   {n,}    repeat n or more times
   {n,m}   repeat n to m times
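To see how each quantifier behaves, here is a rough sketch that applies all six to the same made-up string. In every pattern the quantifier applies to the letter b only.

```python
import re

s = "a ab abb abbb abbbb"   # made-up sample text

star = re.findall(r"ab*", s)         # b repeated 0 or more times
plus = re.findall(r"ab+", s)         # b repeated 1 or more times
optional = re.findall(r"ab?", s)     # b repeated 0 or 1 times
exactly = re.findall(r"ab{2}", s)    # b repeated exactly 2 times
at_least = re.findall(r"ab{2,}", s)  # b repeated 2 or more times
between = re.findall(r"ab{2,3}", s)  # b repeated 2 to 3 times

print(star)      # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)      # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)  # ['a', 'ab', 'ab', 'ab', 'ab']
print(exactly)   # ['abb', 'abb', 'abb']
print(at_least)  # ['abb', 'abbb', 'abbbb']
print(between)   # ['abb', 'abbb', 'abbb']
```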


3. Character sets

For example, [0123456789] matches any single digit and can be abbreviated to [0-9].

[a-z] matches any single lowercase letter from a to z.

[A-Z] matches any single uppercase letter from A to Z.

Ranges can also be combined, with no space between them. For example:

[0-9a-z] matches any single digit or lowercase letter from a to z.
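A quick way to check these character sets is to run them over a short string. The sample text below is invented for the sketch.

```python
import re

s = "Room 42B, floor 7a"   # made-up sample text

digits = re.findall(r"[0-9]", s)     # each match is one digit
lower = re.findall(r"[a-z]", s)      # each match is one lowercase letter
upper = re.findall(r"[A-Z]", s)      # each match is one uppercase letter
mixed = re.findall(r"[0-9a-z]+", s)  # combined ranges, repeated with +

print(digits)          # ['4', '2', '7']
print("".join(lower))  # oomfloora
print(upper)           # ['R', 'B']
print(mixed)           # ['oom', '42', 'floor', '7a']
```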

************************************************************

  • Character set []: the characters inside [] are alternatives (an "or" relationship), and the set matches exactly one character of the target string.
  • ^ and $ used together pin the match to the whole string: the string must be exactly what is written between them, no more and no less.
  • With alternation such as abc|ab, put the longer alternative first; alternatives are tried left to right, so ab|abc would stop at ab.
  • A ^ written outside brackets anchors the start of the string; inside brackets, [^...] matches any character other than those listed.
  • Matching is greedy by default (it matches as much as possible); adding ? after a quantifier turns it into a non-greedy (lazy) match.
  • A quantifier must come immediately after the item it repeats, and it applies only to that one item.
  • To repeat several items as a unit, or to apply another operation to them as a whole, put them in a group; a group is written with ().
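Three of the notes above (alternation order, ^ with $, and grouping before a quantifier) are easy to check in code. A minimal sketch with made-up inputs:

```python
import re

# Alternatives are tried left to right, so a shorter one listed first wins.
short_first = re.findall(r"ab|abc", "abc")
long_first = re.findall(r"abc|ab", "abc")
print(short_first)  # ['ab']
print(long_first)   # ['abc']

# ^ and $ together require the whole string to match.
exact = re.search(r"^\d{3}$", "123")
too_long = re.search(r"^\d{3}$", "1234")
print(exact.group(), too_long)  # 123 None

# A quantifier applies only to the item right before it, unless a group is used.
last_char_only = re.search(r"ab+", "ababab").group()
whole_group = re.search(r"(ab)+", "ababab").group()
print(last_char_only)  # ab
print(whole_group)     # ababab
```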

Validating an ID number: ^([1-9]\d{16}[0-9x]|[1-9]\d{14})$

Analysis: ① The pattern starts with ^ and ends with $, so what lies between must match the entire string.

     ② The group () makes its contents act as a single unit.

     ③ | means either side may match; the longer alternative is placed on the left.

     ④ The left side of | matches a digit 1-9, followed by 16 digits, followed by a final character that is 0-9 or a lowercase x: 18 characters in total.

     ⑤ The right side of | matches a digit 1-9 followed by 14 digits: 15 characters in total.
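The analysis can be turned into a small format check. The sketch below wraps the same pattern in a helper; the names ID_PATTERN and looks_like_id are hypothetical, and the check covers the format only, not whether the number is a real or valid ID.

```python
import re

# Same pattern as in the text, written as a raw string.
ID_PATTERN = re.compile(r"^([1-9]\d{16}[0-9x]|[1-9]\d{14})$")

def looks_like_id(value):
    """Return True if value has the 18- or 15-character shape described above."""
    return ID_PATTERN.search(value) is not None

print(looks_like_id("110105199812067023"))  # True  (18 characters)
print(looks_like_id("11010519981206702x"))  # True  (18 characters, ending in x)
print(looks_like_id("110105981206702"))     # True  (15 characters)
print(looks_like_id("11010519981206702"))   # False (17 characters)
print(looks_like_id("010105199812067023"))  # False (starts with 0)
```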

************************************************************

 

 

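Supplement:

Escaping: \

  Many metacharacters in regular expressions have a special meaning, such as \n and \s. To match a literal "\n" (a backslash followed by n) rather than a newline, the "\" itself must be escaped, giving '\\'.

  In Python, both the regular expression and the text to be matched are written as strings, and inside a string \ also has a special meaning and has to be escaped itself.

  So to match one literal "\n", the text must be written as '\\n' in a string literal and the pattern as "\\\\n", which is tedious. This is where raw strings such as r'\n' come in: with a raw string, the text is r'\n' and the pattern is simply r'\\n'.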
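The two levels of escaping are easier to follow in code. A small sketch (variable names are illustrative) showing that the four-backslash form and the raw-string form are the same pattern:

```python
import re

text = "a\\nb"    # four characters: a, backslash, n, b
print(len(text))  # 4

plain_pattern = "\\\\n"  # ordinary string literal: four backslashes in the source
raw_pattern = r"\\n"     # raw string literal: the same pattern with two
print(plain_pattern == raw_pattern)  # True

matches = re.findall(raw_pattern, text)   # finds the literal backslash + n
print(len(matches), matches[0] == "\\n")  # 1 True

# For comparison, r"\n" as a pattern matches a real newline character.
newline_hits = re.findall(r"\n", "a\nb")
print(len(newline_hits))  # 1
```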

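Common non-greedy patterns: adding a ? after a quantifier turns a greedy match into a non-greedy (lazy) match.

  *?           repeat any number of times, but as few as possible

  +?           repeat 1 or more times, but as few as possible

  ??           repeat 0 or 1 times, but as few as possible

  {n,m}?       repeat n to m times, but as few as possible

  {n,}?        repeat n or more times, but as few as possible

Using .*?:

. is any character, * repeats it 0 to unlimited times, and ? makes the match non-greedy.

Together they mean: take as few characters as possible.

Typically .*?x is written to mean: take characters of any length up to the first x.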
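The difference between greedy and lazy matching shows up as soon as the closing character occurs more than once. A minimal sketch with made-up inputs:

```python
import re

text = 'say "hi" and "bye"'   # made-up sample text

greedy = re.findall(r'".*"', text)   # runs to the last quote
lazy = re.findall(r'".*?"', text)    # stops at the nearest closing quote
print(greedy)  # ['"hi" and "bye"']
print(lazy)    # ['"hi"', '"bye"']

# .*?x stops at the first x; .*x runs to the last x.
first_x = re.search(r".*?x", "abxcdx").group()
last_x = re.search(r".*x", "abxcdx").group()
print(first_x)  # abx
print(last_x)   # abxcdx

# The same applies to {n,m}: lazy takes n repetitions, greedy takes up to m.
lazy_count = re.search(r"\d{2,4}?", "12345").group()
greedy_count = re.search(r"\d{2,4}", "12345").group()
print(lazy_count, greedy_count)  # 12 1234
```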

 

 

 

Common methods of the re module:

First import the module in your .py file:   import re

The re module is mainly called in three ways: findall / search / match. Each takes ('regular expression', 'string to match').

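import re

ret = re.findall('a', 'eva egon yuan')  # returns every match, collected in a list
print(ret)  # result: ['a', 'a']

ret = re.search('a', 'eva egon yuan').group()
print(ret)  # result: 'a'
# search scans the string and stops at the first match, returning an object that holds the match information; call group() on that object to get the matched string. If nothing matches, search returns None.

ret = re.match('a', 'abc').group()  # same as search, but matches only at the start of the string
print(ret)
# result: 'a'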
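Because search and match return None when nothing matches, calling .group() directly raises AttributeError on a failed match. One possible way to guard against that is sketched below; the helper name first_match is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match('a', 'eva egon yuan'))  # a
print(first_match('z', 'eva egon yuan'))  # None

# match only looks at the start of the string, search looks anywhere.
at_start = re.match('b', 'abc')
anywhere = re.search('b', 'abc')
print(at_start)          # None
print(anywhere.group())  # b
```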

 

In addition, there is split:

ret = re.split('[ab]', 'abcd')  # split on 'a' to get '' and 'bcd', then split each of those on 'b'
print(ret)  # ['', '', 'cd']

sub: sub('regular expression', 'replacement', 'string to search', count=n)

ret = re.sub(r'\d', 'H', 'eva3egon4yuan4', count=1)  # replace digits with 'H'; count=1 replaces only the first one
print(ret)  # evaHegon4yuan4

subn:

ret = re.subn(r'\d', 'H', 'eva3egon4yuan4')  # replace digits with 'H'; returns a tuple (new string, number of replacements)
print(ret)  # ('evaHegonHyuanH', 3)

 

compile:

obj = re.compile(r'\d{3}')  # compile the regular expression into a pattern object; the rule matches 3 digits
ret = obj.search('abc123eeee')  # call search on the pattern object, passing the string to match
print(ret.group())  # result: 123
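A compiled pattern object offers the same methods as the module (search, findall, sub and so on), so it is convenient when one pattern is applied to many strings. A small sketch, with made-up input lines and variable names:

```python
import re

three_digits = re.compile(r"\d{3}")   # compiled once, reused below

lines = ["abc123eeee", "no digits", "x4567y890"]   # made-up sample data
found = [three_digits.findall(line) for line in lines]
masked = three_digits.sub("###", "x4567y890")

print(found)   # [['123'], [], ['456', '890']]
print(masked)  # x###7y###
```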

 

finditer:

import re
ret = re.finditer(r'\d', 'ds3sy4784a')  # finditer returns an iterator over the match objects
print(ret)  # <callable_iterator object at 0x10195f940>
print(next(ret).group())  # the first result
print(next(ret).group())  # the second result

print([i.group() for i in ret])  # the remaining results

 

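Note: group priority

1. Group priority in findall

import re

ret = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')
print(ret)  # ['oldboy']   findall returns the contents of the group in preference to the whole match; to get the whole match, turn the group's capturing off

ret = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')  # ?: makes the group non-capturing, which cancels the priority
print(ret)  # ['www.oldboy.com']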
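The same rule extends to patterns with more than one group: findall then returns a tuple per match. A short sketch with invented sample text:

```python
import re

text = "eva@baidu.com egon@oldboy.com"   # made-up sample text

one_group = re.findall(r"\w+@(\w+)\.com", text)     # one group: list of group contents
two_groups = re.findall(r"(\w+)@(\w+)\.com", text)  # two groups: list of tuples
no_groups = re.findall(r"\w+@(?:\w+)\.com", text)   # non-capturing: whole matches

print(one_group)   # ['baidu', 'oldboy']
print(two_groups)  # [('eva', 'baidu'), ('egon', 'oldboy')]
print(no_groups)   # ['eva@baidu.com', 'egon@oldboy.com']
```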

 

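2. Group priority in split

ret = re.split(r"\d+", "eva3egon4yuan")
print(ret)  # result: ['eva', 'egon', 'yuan']

ret = re.split(r"(\d+)", "eva3egon4yuan")
print(ret)  # result: ['eva', '3', 'egon', '4', 'yuan']

# Wrapping the pattern in () changes the result of the split:
# without () the matched separators are dropped, but with () they are kept in the list.
# This matters whenever the matched part needs to be preserved.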

 

 

Giving a group a name:

import re
res = re.search(r'^[1-9](\d{14})(\d{2}[0-9x])?$', '110105199812067023')
res = re.search(r'^[1-9](?P<password>\d{14})(?P<username>\d{2}[0-9x])?$', '110105199812067023')  # fixed format: ? + uppercase P + <name>
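Once a group has a name, its content can be read back by that name. The sketch below reuses the pattern and group names from the snippet above; the 15-character input is made up to show what happens when the optional group does not take part in the match.

```python
import re

pattern = re.compile(r"^[1-9](?P<password>\d{14})(?P<username>\d{2}[0-9x])?$")

res = pattern.search("110105199812067023")
print(res.group("password"))                  # 10105199812067
print(res.group("username"))                  # 023
print(res.group(1) == res.group("password"))  # True: named groups are still numbered
print(res.groupdict())  # {'password': '10105199812067', 'username': '023'}

# With a 15-character input the optional group is skipped and reads as None.
short = pattern.search("110105981206702")
print(short.group("username"))  # None
```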

 


Origin www.cnblogs.com/pupy/p/11201938.html