"Automate the Boring Stuff with Python", Chapter 7 examples: regular expressions, greedy matching, and phone number and email extraction

Chapter 7 finds a phone number in a string and compares a program written without regular expressions against one that uses them. The regular expression version is significantly more concise and easier to extend.
Pattern: three digits, a dash, three digits, a dash, and then four digits. For example: 415-555-4242

import re

'''
Find the pattern without a regular expression:
three digits, a dash, three digits, a dash, four digits
e.g. 111-222-3334
'''

def isPhoneNo(text):
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

'''
A regular expression matching the pattern above
'''
def regPhoneNo(text):
    phoneNoReg = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
    res = phoneNoReg.search(text)
    if res is not None:
        print('phone No found by reg: ' + res.group())

print(isPhoneNo('123-122-9090'))
print(isPhoneNo('1234123321'))
msg = 'call me at 415-443-1111 tomorrow. 415-443-2222 is my office'
# Slide a 12-character window over the message and test each slice
for i in range(len(msg)):
    tmp = msg[i:i+12]
    if isPhoneNo(tmp):
        print('phone No found: ' + tmp)
    regPhoneNo(tmp)
print('msg find end')
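The claim that the regular expression version is easier to extend can be illustrated with a minimal sketch. It assumes the same message string as above; the names phone_re, flexible_re, numbers and dotted are hypothetical and not from the book. Since the compiled pattern can scan the whole string, the 12-character window is not needed in this variant.

```python
import re

# Hypothetical names; the pattern is equivalent to the one regPhoneNo compiles.
phone_re = re.compile(r'\d{3}-\d{3}-\d{4}')
msg = 'call me at 415-443-1111 tomorrow. 415-443-2222 is my office'

# findall scans the whole string and returns every non-overlapping match.
numbers = phone_re.findall(msg)
print(numbers)

# Extending the format only means editing the pattern, e.g. to accept dots as well.
flexible_re = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
dotted = flexible_re.findall('fax: 415.443.3333')
print(dotted)
```

Doing the same with isPhoneNo would mean rewriting its index checks by hand, which is the difference the chapter is pointing at.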

 

Python regular expressions are "greedy" by default, which means that in ambiguous situations they match the longest string possible.

The "non-greedy" version of the braces matches the shortest possible string. It is written by putting a question mark right after the closing brace.

Example:

import re

'''
Greedy and non-greedy matching in Python
'''
def showGreedReg():
    greedReg = re.compile(r'(ha){3,5}')       # greedy: as many repeats as possible
    nonGreedReg = re.compile(r'(ha){3,5}?')   # non-greedy: as few repeats as possible
    inp = 'hahahahahah'
    r1 = greedReg.search(inp)
    r2 = nonGreedReg.search(inp)
    print('greedy reg res: ' + r1.group())
    print('non-greedy reg res: ' + r2.group())

showGreedReg()

 

The Chapter 7 project extracts phone numbers and email addresses with regular expressions. The clipboard part is omitted here.

import pyperclip, re

phoneReg = re.compile(r'''(
(\d{3}|\(\d{3}\))?   # area code
(\s|-|\.)?     # separator
(\d{3})    # first 3 digits
(\s|-|\.)?     # separator
(\d{4})    # last 4 digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))?    # extension
)''', re.VERBOSE)

emailReg = re.compile(r'''(
[a-zA-Z0-9._%+-]+    # username
@    # @ symbol
[a-zA-Z0-9.-]+    # domain name
(\.[a-zA-Z]{2,4})    # dot-something
)''', re.VERBOSE)

The phone number begins with an optional area code, so the area code group is followed by a question mark.

Because the area code may be just three digits (that is, \d{3}) or three digits in parentheses (that is, \(\d{3}\)), the two alternatives are joined with a pipe.

Because the pattern is a multi-line string, this part can carry the regex comment # area code to remind you what (\d{3}|\(\d{3}\))? is supposed to match.
The separator in a phone number may be a space (\s), a dash (-), or a period (.), so these alternatives are also joined with pipes.

The next parts of the regular expression are simple: three digits, followed by another separator, followed by four digits.
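To see how the optional area code and the separators behave, here is a minimal sketch that uses only the parts described so far, without the extension. The names core_re and area_code_of are hypothetical, and fullmatch is used here (instead of the project's search over longer text) so that each sample is tested as a whole.

```python
import re

# Core of the project pattern without the extension part (hypothetical name).
core_re = re.compile(r'(\d{3}|\(\d{3}\))?(\s|-|\.)?(\d{3})(\s|-|\.)?(\d{4})')

def area_code_of(number):
    """Return the area code as written, or None if it is absent or the number does not match."""
    m = core_re.fullmatch(number)
    return m.group(1) if m else None

for sample in ['415-555-4242', '(415) 555-4242', '415.555.4242', '555-4242']:
    print(sample, '->', area_code_of(sample))
```

For the last sample the engine first tries 555 as the area code, fails to find the remaining digits, and backtracks to treat the area code as missing, which is what the question mark allows.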

The last part is an optional extension: any number of spaces, then ext, x, or ext., followed by two to five digits.
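The sketch below shows one way the groups of the complete phone pattern might be used once a match is found. It is an assumption-laden illustration, not the omitted clipboard code: phone_re, normalize_phones and sample are hypothetical names, the period in ext\. is escaped so it only matches a literal dot, and stripping the parentheses from the area code is a choice made here for uniform output.

```python
import re

phone_re = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code
    (\s|-|\.)?                       # separator
    (\d{3})                          # first 3 digits
    (\s|-|\.)?                       # separator
    (\d{4})                          # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # optional extension
)''', re.VERBOSE)

def normalize_phones(text):
    """Return every phone number in text in the form 415-555-4242 x123."""
    results = []
    for groups in phone_re.findall(text):
        # groups[1], groups[3], groups[5]: area code, first 3 digits, last 4 digits
        parts = [groups[1].strip('()'), groups[3], groups[5]]
        number = '-'.join(p for p in parts if p)
        if groups[8]:  # extension digits, empty string when absent
            number += ' x' + groups[8]
        results.append(number)
    return results

sample = 'Office: 415-555-4242 ext 123, cell (415) 555-1111.'
print(normalize_phones(sample))
```

Because the pattern contains groups, findall returns one tuple per match, with an empty string for every optional group that did not take part.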

 

The username part of an email address is one or more characters, which may be lowercase and uppercase letters, digits, a dot, an underscore, a percent sign, a plus sign, or a dash.

All of these fit into one character class: [a-zA-Z0-9._%+-].
The @ symbol separates the username from the domain name. The domain name allows a smaller set of characters, only letters, digits, periods, and dashes: [a-zA-Z0-9.-].

The last part is the "dot-com" part (technically called the top-level domain), which can actually be "dot-anything". It has two to four characters.
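A short sketch of the email pattern built from the two character classes described above. The names email_re and extract_emails and the sample addresses are hypothetical. The last call shows a consequence of the two-to-four-character limit: a longer top-level domain is cut off rather than rejected.

```python
import re

# Username class, @, domain class, then a dot and 2-4 letters (hypothetical name).
email_re = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+(\.[a-zA-Z]{2,4})')

def extract_emails(text):
    """Return the full text of every email-like match in text."""
    # finditer is used because findall would return only the captured group.
    return [m.group() for m in email_re.finditer(text)]

text = 'Write to alice.smith+news@example.com or bob_99@mail.example.org today.'
print(extract_emails(text))

# A top-level domain longer than four letters is truncated by {2,4}.
print(extract_emails('carol@site.museum'))
```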

Passing re.VERBOSE tells the regex compiler to ignore whitespace and comments inside the regular expression string.
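A small sketch of what re.VERBOSE changes, using a shortened pattern rather than the project's full one. The names compact, verbose, ignored_space and escaped_space are hypothetical. The last two patterns illustrate the documented rule that unescaped whitespace is dropped in verbose mode, so a literal space has to be escaped or written as \s.

```python
import re

# The same pattern written on one line and in verbose form.
compact = re.compile(r'(\d{3})-(\d{4})')
verbose = re.compile(r'''
    (\d{3})   # first 3 digits
    -         # separator
    (\d{4})   # last 4 digits
''', re.VERBOSE)

text = 'call 555-4242 now'
print(compact.search(text).groups())
print(verbose.search(text).groups())

# In verbose mode an unescaped space is ignored, an escaped one is literal.
ignored_space = re.compile(r'a b', re.VERBOSE)
escaped_space = re.compile(r'a\ b', re.VERBOSE)
print(bool(ignored_space.fullmatch('ab')), bool(escaped_space.fullmatch('a b')))
```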

That concludes the contents of Chapter 7. The practice project, strong password detection, will be covered in the next blog post.


Origin www.cnblogs.com/chenzhefan/p/11932976.html