Notes on "Automate the Boring Stuff with Python"

Pattern matching and regular expressions

Finding text with regular expressions

  1. All regular expression functions in Python are in the re module. Import this module first:
import re
  1. Pass a regular expression string to re.compile(). It returns a Regex pattern
    object (or simply a Regex object).

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

The phoneNumRegex variable contains a Regex object.

Pass the string you want to search to the search() method of the Regex object. It returns a Match object. Call the group() method of the Match object to get the text that was actually matched.

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242
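
search() returns None when the pattern is not found, so calling group() directly on the result would raise an AttributeError. A minimal sketch of a guard follows; find_phone is a hypothetical helper name, not something from the book.

```python
import re

phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_num_regex.search(text)
    # search() returns None on failure, so check before calling group().
    return mo.group() if mo is not None else None

print(find_phone('My number is 415-555-4242.'))
print(find_phone('No number here.'))
```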

Matching more patterns with regular expressions

  1. Use parentheses to group
>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
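
Besides group(1) and group(2), a Match object also has group(0) for the whole match and groups() for all groups at once. A short sketch, reusing the pattern from the example above:

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My number is 415-555-4242.')

whole = mo.group(0)                    # group(0), like group(), is the entire match
area_code, main_number = mo.groups()   # groups() returns every group as a tuple
print(whole, area_code, main_number)
```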
  1. Use pipes to match multiple groups
>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'
  1. Use question marks to achieve optional matching
>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'
>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

Note: the ? can be read as "match the group before this question mark zero or one time".

  1. Match zero or more times with an asterisk (*)
>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'
  1. Match one or more times with the plus sign (+)
>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'
  1. Curly braces match a specific number of repetitions. The
    regular expression (Ha){3} will match the string 'HaHaHa' but not 'HaHa',
    because the latter repeats the (Ha) group only twice.
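
The (Ha){3} case can be checked directly. This is a small sketch; the variable names are only illustrative.

```python
import re

ha_regex = re.compile(r'(Ha){3}')

mo1 = ha_regex.search('HaHaHa')
mo2 = ha_regex.search('HaHa')

print(mo1.group())   # the three repetitions are matched
print(mo2 is None)   # only two repetitions, so there is no match
```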

  2. Greedy and non-greedy match

>>> greedyHaRegex = re.compile(r'(Ha){3,5}')

The pattern (Ha){3,5} can match 3, 4, or 5 instances of Ha, and group() returns the longest possible match. Even though 3 or 4 repetitions would already satisfy the pattern, it matches all 5 when they are available.

>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'

Python's regular expressions have "greedy" matching by default. If you want to use non-greedy matching,
follow the closing curly brace with a question mark.
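
A side-by-side sketch of the two modes on the same string. The non-greedy pattern stops at the shortest match that satisfies {3,5}.

```python
import re

greedy_ha_regex = re.compile(r'(Ha){3,5}')
nongreedy_ha_regex = re.compile(r'(Ha){3,5}?')   # the trailing ? makes it non-greedy

text = 'HaHaHaHaHa'
greedy = greedy_ha_regex.search(text).group()        # 'HaHaHaHaHa'
nongreedy = nongreedy_ha_regex.search(text).group()  # 'HaHaHa'
print(greedy, nongreedy)
```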

findall() method

  1. findall() method

In addition to the search() method, Regex objects also have a findall() method. search() returns a Match object for only the first match in the searched string.
The findall() method returns a list of strings covering every match in the searched string.

As long as the regular expression has no groups, each string in the list returned by findall() is one piece of matched text.

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

If you call findall() on a regular expression with grouping, it will return a list of tuples of strings

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]

Character classes
  1. Shorthand character classes

\d Any digit from 0 to 9
\D Any character that is not a digit from 0 to 9
\w Any letter, digit, or underscore (think of it as matching a "word" character)
\W Any character that is not a letter, digit, or underscore
\s Any space, tab, or newline (think of it as matching a "whitespace" character)
\S Any character that is not a space, tab, or newline

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']

You can also define your own character class with square brackets. The class [a-zA-Z0-9] matches all
lowercase letters, uppercase letters, and digits.

The character class [0-5] matches only the digits 0 to 5.
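
A quick sketch of both bracket classes in use. The sample strings are made up for illustration.

```python
import re

alnum_regex = re.compile(r'[a-zA-Z0-9]+')   # runs of letters and digits
low_digit_regex = re.compile(r'[0-5]')      # a single digit from 0 to 5

words = alnum_regex.findall('Room 42, Floor B-7!')
digits = low_digit_regex.findall('Call 0123456789')
print(words)
print(digits)
```

Punctuation and spaces are not in [a-zA-Z0-9], so they split the text into separate matches; the digits 6 to 9 are simply skipped by [0-5].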

  1. Negative character classes (^)

Placing a caret (^) just after the opening bracket makes a negative character class, which matches every character that is not in the class.

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']
  1. Use the caret (^) at the beginning of the regular expression to indicate that the match must occur at
    the beginning of the searched text
>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<_sre.SRE_Match object; span=(0, 5), match='Hello'>

This differs from the negative character class above: there the caret sits inside square brackets, while here it is the first character of the pattern string (the text inside the quotes).

  1. You can add a dollar sign ($) to the end of the regular expression to indicate that the string must
    end in the pattern of this regular expression

You can use ^ and $ together to indicate that the entire string must match the pattern; a match on only part of the string is not enough.

>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
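
The dollar sign can also be used on its own. A small sketch with a hypothetical pattern that requires the string to end in a digit:

```python
import re

ends_with_number = re.compile(r'\d$')

a = ends_with_number.search('Your number is 42')
b = ends_with_number.search('Your number is forty two.')

print(a.group())   # the final digit of the string
print(b is None)   # the string ends with a period, so there is no match
```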

Wildcard characters
  1. The . (period) character is called a "wildcard". It matches any
character except a newline

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']
  1. Use dot-star (.*) to match any run of characters,
    and compare the result in non-greedy and greedy mode
>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'
>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

By default, dot-star matches everything except a newline. To make the dot match newlines as well, pass re.DOTALL as the second argument to re.compile().


>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'
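
For comparison, passing re.DOTALL as the second argument to re.compile() lets the dot match newlines too. A short sketch on the same text:

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# Without re.DOTALL the match stops at the first newline.
no_newline = re.compile('.*').search(text).group()

# With re.DOTALL the dot also matches newline characters.
with_newline = re.compile('.*', re.DOTALL).search(text).group()

print(repr(no_newline))
print(repr(with_newline))
```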

? matches zero or one of the preceding group.
* matches zero or more of the preceding group.
+ matches one or more of the preceding group.
{n} matches exactly n of the preceding group.
{n,} matches n or more of the preceding group.
{,m} matches 0 to m of the preceding group.
{n,m} matches at least n and at most m of the preceding group.
{n,m}? or *? or +? performs a non-greedy match of the preceding group.
^spam means the string must begin with spam.
spam$ means the string must end with spam.
. matches any character except a newline.
\d, \w, and \s match a digit, word, or space character, respectively.
\D, \W, and \S match anything except a digit, word, or space character, respectively.
[abc] matches any character between the brackets (such as a, b, or c).
[^abc] matches any character that is not between the brackets.

  1. Complex regular expressions

Pass re.VERBOSE as the second argument to re.compile() to tell it to ignore
whitespace and comments inside the regular expression string

phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))? # area code
(\s|-|\.)? # separator
\d{3} # first 3 digits
(\s|-|\.) # separator
\d{4} # last 4 digits
(\s*(ext|x|ext.)\s*\d{2,5})? # extension
)''', re.VERBOSE)
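
To see verbose mode at work, here is a minimal sketch with a deliberately simplified pattern. simple_phone_regex is a hypothetical name, and the pattern handles far fewer formats than the one above.

```python
import re

# Hypothetical, simplified pattern for illustration only.
simple_phone_regex = re.compile(r'''
    (\d{3})      # area code
    [\s.-]       # separator: whitespace, period, or hyphen
    (\d{3})      # first 3 digits
    [\s.-]       # separator
    (\d{4})      # last 4 digits
    ''', re.VERBOSE)

mo = simple_phone_regex.search('Call 415.555.4242 today')
print(mo.groups())
```

The line breaks, indentation, and # comments inside the triple-quoted string are ignored, so the pattern behaves the same as its one-line form.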

Exercises

  1. The re.compile() function returns a Regex object.

  2. Raw strings are used so that backslashes do not have to be escaped.

  3. The search() method returns a Match object.

  4. The group() method returns a string of matching text.

  5. Group 0 is the entire match, group 1 covers the first set of parentheses, and group 2 covers the second set of parentheses.

  6. Periods and parentheses can be escaped with a backslash: \., \(, and \).

  7. If the regular expression has no groups, findall() returns a list of strings. If the regular expression has groups, it returns a list of tuples of strings.

  8. The | character means to match "any one" of the two groups.

  9. The ? character can mean "match the preceding group zero or one time", or it can be used to mark a non-greedy match.

  10. + matches one or more times. * matches zero or more times.

  11. {3} matches exactly 3 instances of the preceding group. {3,5} matches 3 to 5 instances.

  12. The shorthand character classes \d, \w, and \s match a single digit, word, or whitespace character, respectively.

  13. The shorthand character classes \D, \W, and \S match a single character that is not a digit, word, or whitespace character, respectively.

  14. Pass re.I or re.IGNORECASE as the second parameter to re.compile() to make the matching case insensitive.

  15. The . character usually matches any character except a newline. If you pass re.DOTALL as the second argument to re.compile(), the dot will also match the newline character.

  16. .* performs a greedy match; .*? performs a non-greedy match.

  17. [a-z0-9] or [0-9a-z]

  18. 'X drummers, X pipers, five rings, X hens'

  19. The re.VERBOSE parameter ignores whitespace and comments in the regular expression string

  20. re.compile(r'^\d{1,3}(,\d{3})*$')

  21. re.compile(r'[A-Z][a-z]*\sNakamoto')
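
Patterns like those in answers 20 and 21 are easy to check by running them against a few sample strings. The sketch below assumes the comma-separated number pattern r'^\d{1,3}(,\d{3})*$'; the sample strings and variable names are illustrative.

```python
import re

comma_num_regex = re.compile(r'^\d{1,3}(,\d{3})*$')
nakamoto_regex = re.compile(r'[A-Z][a-z]*\sNakamoto')

# Map each sample string to whether the pattern matches it.
comma_results = {s: bool(comma_num_regex.search(s))
                 for s in ['42', '1,234', '6,368,745', '12,34,567', '1234']}
name_results = {s: bool(nakamoto_regex.search(s))
                for s in ['Satoshi Nakamoto', 'satoshi Nakamoto', 'Satoshi nakamoto']}

print(comma_results)
print(name_results)
```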

Practice project
Strong password detection:

  1. No less than 8 characters in length
  2. Contains uppercase and lowercase characters
  3. At least one digit
import re

# One pattern per rule.
len_str = re.compile(r'.{8,}')   # at least 8 characters
num_str = re.compile(r'\d')      # at least one digit
str_str = re.compile(r'[a-z].*[A-Z]|[A-Z].*[a-z]')  # both lowercase and uppercase letters

def pwdTest(pwd):
    # The password is strong only if all three patterns match.
    if len_str.search(pwd) and num_str.search(pwd) and str_str.search(pwd):
        print(pwd + ' is strong enough')
    else:
        print("No, it's too weak")

test = 'test123TEST'
pwdTest(test)
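
An alternative sketch that returns True or False instead of printing, which makes the three rules easier to test one by one. is_strong_password is a hypothetical name, and this version checks the length with len() rather than with a regular expression.

```python
import re

def is_strong_password(pwd):
    """Return True if pwd has 8+ characters, a digit, and both letter cases."""
    checks = [
        len(pwd) >= 8,               # rule 1: length
        re.search(r'[a-z]', pwd),    # rule 2: a lowercase letter
        re.search(r'[A-Z]', pwd),    # rule 2: an uppercase letter
        re.search(r'\d', pwd),       # rule 3: a digit
    ]
    # re.search() returns a Match object (truthy) or None (falsy).
    return all(checks)

for candidate in ['test123TEST', 'short1A', 'alllowercase1', 'NoDigitsHere']:
    print(candidate, is_strong_password(candidate))
```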
	

Read and write files

On Windows, paths are written with backslashes (\) as the separator between folders. OS X and
Linux use forward slashes (/) as the path separator instead.

  • os.path.join() joins folder and file names into a path, using the correct separator for the operating system
>>> import os
>>> myFiles = ['accounts.txt', 'details.csv', 'invite.docx']
>>> for filename in myFiles:
...     print(os.path.join('C:\\Users\\asweigart', filename))
...
C:\Users\asweigart\accounts.txt
C:\Users\asweigart\details.csv
C:\Users\asweigart\invite.docx
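
The output above is from Windows. A small sketch of why os.path.join() is useful in scripts that run on more than one system: the separator it inserts is os.sep, whatever that is on the current machine. The folder names are taken from the example above.

```python
import os

parts = ['Users', 'asweigart', 'accounts.txt']

# Join the pieces with the current operating system's separator.
path = os.path.join(*parts)

print(os.sep)   # the separator in use on this machine
print(path)
```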
  • Current working directory
>>> os.getcwd()
'C:\\Python34'
>>> os.chdir('C:\\Windows\\System32')
>>> os.getcwd()
'C:\\Windows\\System32'
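
os.chdir() raises FileNotFoundError if the folder does not exist, and the folders above exist only on Windows. A portable sketch that changes into a temporary folder and back again; using tempfile here is an assumption made for illustration, not something from the notes.

```python
import os
import tempfile

original_dir = os.getcwd()   # remember where we started

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)            # change the current working directory
    inside = os.getcwd()
    os.chdir(original_dir)   # switch back before the temporary folder is removed

print(inside)
print(os.getcwd())
```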


Origin blog.csdn.net/qq_42812036/article/details/104651951