Commonly used Python modules: regular expressions and the re module

Primer

Extract all the phone numbers from the following file:

Name       Region  Height Weight    Phone
况咏蜜     北京    171    48    13651054608
王心颜     上海    169    46    13813234424
马纤羽     深圳    173    50    13744234523
乔亦菲     广州    172    52    15823423525
罗梦竹     北京    175    49    18623423421
刘诺涵     北京    170    48    18623423765
岳妮妮     深圳    177    54    18835324553
贺婉萱     深圳    174    52    18933434452
叶梓萱     上海    171    49    18042432324
杜姗姗     北京    167    49    13324523342

What approach comes to mind?

Probably something like this:

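# Read the file line by line and keep every 11-character field that starts with '1'.
phones = []
with open("兼职白领学生空姐模特护士联系方式.txt", 'r', encoding="gbk") as f:
    for line in f:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        name, city, height, weight, phone = fields
        if phone.startswith('1') and len(phone) == 11:
            phones.append(phone)
print(phones)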

Is there an easier way?

Phone numbers follow a pattern: they consist only of digits, there are 11 of them, and, to be stricter, they start with 1. If we could write that rule down in code and match it directly against the file contents, wouldn't that give us the list we want?

import re
# Read the whole file at once, then collect every run of 11 digits.
with open("兼职模特空姐联系方式.txt") as f:
    data = f.read()
print(re.findall(r"[0-9]{11}", data))

So what is this impressive trick? It's called regular expressions!
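The stricter rule described earlier (11 digits, starting with 1) can also be written as a single pattern. The following is a minimal sketch, assuming the data is held in an inline string `sample` rather than read from the file; the last line of `sample` is a made-up record, and the `\b` word boundaries are an addition here so that part of a longer digit run is not mistaken for a phone number.

```python
import re

# Two rows from the sample file plus a hypothetical line with a 16-digit run.
sample = (
    "况咏蜜     北京    171    48    13651054608\n"
    "王心颜     上海    169    46    13813234424\n"
    "order-id 9913651054608123\n"
)

# Loose rule: any 11 consecutive digits, even inside a longer number.
loose_phones = re.findall(r"[0-9]{11}", sample)

# Strict rule: a 1 followed by exactly ten digits, not embedded in a longer run.
strict_phones = re.findall(r"\b1[0-9]{10}\b", sample)

print(loose_phones)   # ['13651054608', '13813234424', '99136510546']
print(strict_phones)  # ['13651054608', '13813234424']
```

The loose pattern picks up the first 11 digits of the order id, while the strict one returns only the two real numbers.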

re module

A regular expression is a rule for matching strings. Most programming languages support them; in Python the corresponding module is re.

1. Commonly used expression rules

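'.'     Matches any single character except \n by default; with the DOTALL flag it matches any character, including a newline
'^'     Matches the start of the string; with the MULTILINE flag it also matches at the start of each line, e.g. re.search(r"^a", "\nabc\neee", flags=re.MULTILINE)
'$'     Matches the end of the string; with the MULTILINE flag it also matches at the end of each line: re.search('foo.$', 'foo1\nfoo2\n', re.MULTILINE).group() gives 'foo1'
'*'     Matches the preceding character 0 or more times: re.search('a*', 'aaaabac').group() gives 'aaaa'
'+'     Matches the preceding character 1 or more times: re.findall("ab+", "ab+cd+abb+bba") gives ['ab', 'abb']
'?'     Matches the preceding character 0 or 1 times: re.search('b?', 'alex').group() gives '' (b matched 0 times)
'{m}'   Matches the preceding character exactly m times: re.search('b{3}', 'alexbbbs').group() gives 'bbb'
'{n,m}' Matches the preceding character n to m times: re.findall("ab{1,3}", "abb abc abbcbbb") gives ['abb', 'ab', 'abb']
'|'     Matches the expression on the left or on the right of |: re.search("abc|ABC", "ABCBabcCD").group() gives 'ABC'
'(...)' Group: re.search("(abc){2}a(123|45)", "abcabca456c").group() gives 'abcabca45'
'\A'    Matches only at the start of the string: re.search(r"\Aabc", "alexabc") finds nothing; equivalent to re.match('abc', "alexabc") or to '^' without MULTILINE
'\Z'    Matches only at the end of the string; like '$', except that '$' also matches just before a trailing newline
'\d'    Matches a digit 0-9 (for str patterns, any Unicode decimal digit unless re.ASCII is set)
'\D'    Matches a non-digit
'\w'    Matches a word character, [A-Za-z0-9_] (for str patterns, other Unicode letters and digits as well)
'\W'    Matches a non-word character
'\s'    Matches whitespace such as space, \t, \n, \r: re.search(r"\s+", "ab\tc1\n3").group() gives '\t'
'(?P<name>...)' Named group: re.search("(?P<a>[0-9]{4})(?P<b>[0-9]{2})(?P<c>[0-9]{4})", s), where a, b, c are placeholder group names and s is the string to search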
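Named groups are easier to see in a complete call. The sketch below is only an illustration: the 10-digit string `code` and the group names `prefix`, `middle`, and `year` are made up for this example and do not come from the original text.

```python
import re

# Hypothetical 10-digit code: a 4-digit prefix, a 2-digit middle part, a 4-digit year.
code = "3714811993"

m = re.search(r"(?P<prefix>[0-9]{4})(?P<middle>[0-9]{2})(?P<year>[0-9]{4})", code)
if m:
    print(m.group("year"))  # 1993
    print(m.groupdict())    # {'prefix': '3714', 'middle': '81', 'year': '1993'}
```

Each named group can be read with `group(name)`, and `groupdict()` returns all of them as a dictionary.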

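2. re matching syntax

re.match matches from the start of the string

re.search finds the first match anywhere in the string

re.findall collects all matches and returns them as the elements of a list

re.split splits the string, using the matches as separators

re.sub finds matches and replaces them

re.fullmatch matches only if the whole string matches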
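The difference between match, search, and fullmatch is easiest to see by running the same pattern against one string. This is a small sketch with a made-up input `text`; the variable names are only for illustration.

```python
import re

text = "abc123"

at_start = re.match(r"\d+", text)            # None: the string does not start with a digit
anywhere = re.search(r"\d+", text)           # match object covering '123'
whole = re.fullmatch(r"\d+", text)           # None: the string is not digits only
whole_ok = re.fullmatch(r"[a-z]+\d+", text)  # matches the entire string

print(at_start, anywhere.group(), whole, whole_ok.group())
```

Since match and fullmatch return None on failure, check the result before calling group() on it.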

re.match(pattern, string, flags=0)

Matches the pattern against the beginning of the string and returns a single match.

  • pattern: the regular expression
  • string: the string to match against
  • flags: flags that control the matching mode

import re
obj = re.match(r'\d+', '123uuasf')  # returns a match object on success, otherwise None
if obj:
    print(obj.group())

Flags

  • re.I (re.IGNORECASE): ignore case (the full name is given in parentheses, here and below)
  • re.M (re.MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
  • re.S (re.DOTALL): changes the behavior of '.' so that it matches any character at all, including a newline; without this flag, '.' matches anything except a newline
  • re.X (re.VERBOSE): lets you add whitespace and comments inside the pattern to make it more readable; the following two patterns mean the same thing
a = re.compile(r"""\d + # the integral part
                \. # the decimal point
                \d * # some fractional digits""", 
                re.X)
b = re.compile(r"\d+\.\d*")
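The effect of the other flags can be checked the same way. The sketch below uses a made-up three-line string `text`; flags are combined with `|`.

```python
import re

text = "Foo\nfoo\nBAR"

ignore_case = re.findall(r"foo", text, flags=re.I)        # ['Foo', 'foo']
line_starts = re.findall(r"^\w+", text, flags=re.M)       # ['Foo', 'foo', 'BAR']
no_dotall = re.search(r"Foo.foo", text)                   # None: '.' stops at the newline
with_dotall = re.search(r"Foo.foo", text, flags=re.S)     # '.' now matches the newline
combined = re.findall(r"^foo$", text, flags=re.I | re.M)  # ['Foo', 'foo']

print(ignore_case, line_starts, no_dotall, combined)
```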

re.search(pattern, string, flags=0)

Scans the string for the first location where the pattern matches and returns a single match.

import re
obj = re.search(r'\d+', 'u123uu888asf')  # finds '123', the first run of digits
if obj:
    print(obj.group())

re.findall(pattern, string, flags=0)

match and search each return a single match. To get every part of the string that matches the pattern, use findall.

import re
obj = re.findall(r'\d+', 'fa123uu888asf')  # a list of all matches: ['123', '888']
print(obj)
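One detail worth knowing about findall: when the pattern contains groups, it returns the groups rather than the whole match. A short sketch, using a made-up string `text`:

```python
import re

text = "alex=22, jack=33"

whole_matches = re.findall(r"\w+=\d+", text)    # no groups: the full matches
one_group = re.findall(r"\w+=(\d+)", text)      # one group: only the group text
two_groups = re.findall(r"(\w+)=(\d+)", text)   # several groups: a list of tuples

print(whole_matches)  # ['alex=22', 'jack=33']
print(one_group)      # ['22', '33']
print(two_groups)     # [('alex', '22'), ('jack', '33')]
```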

re.sub(pattern, repl, string, count=0, flags=0)

Replaces the text that matches the pattern; more powerful than str.replace.

>>> re.sub('[a-z]+', 'sb', '武配齐是abc123')
'武配齐是sb123'
>>> re.sub(r'\d+', '|', 'alex22wupeiqi33oldboy55', count=2)
'alex|wupeiqi|oldboy55'
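Two things that str.replace cannot do are computing the replacement with a function and reusing captured groups in the replacement. A minimal sketch; the helper name `double` is made up for this example.

```python
import re

text = "alex22wupeiqi33oldboy55"

def double(match):
    # match.group() is the matched run of digits; return twice its value as text
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)
# \1 and \2 in the replacement refer to the first and second group
swapped = re.sub(r"([a-z]+)(\d+)", r"\2\1", text)

print(doubled)  # alex44wupeiqi66oldboy110
print(swapped)  # 22alex33wupeiqi55oldboy
```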

re.split(pattern, string, maxsplit=0, flags=0)

Splits the string at every match of the pattern and returns the pieces as a list.

>>> s = '9-2*5/3+7/3*99/4*2998+10*568/14'
>>> re.split(r'[*\-/+]', s)
['9', '2', '5', '3', '7', '3', '99', '4', '2998', '10', '568', '14']
>>> re.split(r'[*\-/+]', s, 3)
['9', '2', '5', '3+7/3*99/4*2998+10*568/14']
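If the separators are needed too, for example to keep the operators of an arithmetic expression, the pattern can be wrapped in a capturing group. A short sketch with a shortened expression:

```python
import re

s = "9-2*5/3+7"

numbers = re.split(r"[*\-/+]", s)   # separators are dropped
tokens = re.split(r"([*\-/+])", s)  # the capturing group keeps the separators

print(numbers)  # ['9', '2', '5', '3', '7']
print(tokens)   # ['9', '-', '2', '*', '5', '/', '3', '+', '7']
```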

re.fullmatch(pattern, string, flags=0)

Returns a match object if the entire string matches the pattern, otherwise None.

re.fullmatch(r'\w+@\w+\.(com|cn|edu)', "[email protected]")
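To see how fullmatch differs from search with the same pattern, here is a small sketch. The addresses are hypothetical placeholders chosen for illustration, not values from the original text.

```python
import re

pattern = r"\w+@\w+\.(com|cn|edu)"

# Hypothetical addresses, used only for illustration.
ok = re.fullmatch(pattern, "user@example.com")
trailing = re.fullmatch(pattern, "user@example.com.")  # one extra character at the end
partial = re.search(pattern, "user@example.com.")      # search still finds the valid part

print(ok.group())       # user@example.com
print(trailing)         # None
print(partial.group())  # user@example.com
```

This makes fullmatch a natural fit for validating a whole input, while search is for finding a pattern inside longer text.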

Origin www.cnblogs.com/Kwan-C/p/11620942.html