Crawl a website:
import urllib.request
res=urllib.request.urlopen('https://www.csdn.net/')
print(res.read())
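res.read() returns raw bytes, so the page usually has to be decoded before a regular expression written as a str can be applied to it. A minimal sketch, assuming the server declares its charset in the Content-Type header and falling back to UTF-8 when it does not. The helper name fetch_text and the 10-second timeout are illustrative choices, not part of the snippet above:

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Download url and return the body as str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as res:
        # Use the charset announced in the response headers, else assume UTF-8.
        charset = res.headers.get_content_charset() or 'utf-8'
        # errors='replace' keeps going if a few bytes do not decode cleanly.
        return res.read().decode(charset, errors='replace')


# Usage: print(fetch_text('https://www.csdn.net/')[:200])
```

The with block closes the connection even if decoding raises an exception.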
Regular expressions for crawlers:
1. Extract four consecutive digits: \d\d\d\d
import re
m=re.findall(r'\d\d\d\d','123adfa56sne6742')
print(m)
Result: ['6742']
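When a string contains more than one run of four digits, findall returns every non-overlapping match from left to right. A small sketch on a made-up sample string. The {4} form is shown only as an equivalent shorthand and is an addition to the notes above:

```python
import re

text = 'a2023b12345678'  # made-up sample string

# The r prefix keeps the backslashes intact for the regex engine.
long_form = re.findall(r'\d\d\d\d', text)
short_form = re.findall(r'\d{4}', text)  # {4} means "exactly four times"

print(long_form)   # ['2023', '1234', '5678']
print(short_form)  # ['2023', '1234', '5678']
```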
2. When the text before and after the target is known, extract the middle part by wrapping it in a capturing group: (.*)
m=re.findall(r'<div>(.*)</div>','<div>hello</div>')
Result: ['hello']
3. When the string contains several <div>...</div> pairs, .* is greedy, so the pattern above matches everything from the first <div> to the last </div>:
m=re.findall(r'<div>(.*)</div>','<div>hello</div><div>world</div>')
Result: ['hello</div><div>world']
To match each pair separately, add ? after .* inside the brackets to make it non-greedy:
m=re.findall(r'<div>(.*?)</div>','<div>hello</div><div>world</div>')
Result: ['hello', 'world']
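The difference between the greedy and the non-greedy form is easiest to see side by side. A short sketch on a different made-up snippet, where the tag <b> is chosen only for illustration:

```python
import re

html = '<b>one</b> and <b>two</b>'  # made-up sample markup

greedy = re.findall(r'<b>(.*)</b>', html)   # .* grabs as much as possible
lazy = re.findall(r'<b>(.*?)</b>', html)    # .*? stops at the first </b>

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```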
4. Match any character except a line break: .
m=re.findall('.','sd\nefwe')
Result: ['s', 'd', 'e', 'f', 'w', 'e']
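Downloaded pages normally span many lines, so it can matter that . skips line breaks. A brief sketch, assuming the re.S flag (also spelled re.DOTALL), which makes . match the newline as well. The flag is an addition to the notes above:

```python
import re

text = 'sd\nefwe'

default = re.findall('.', text)        # the newline is skipped
dotall = re.findall('.', text, re.S)   # the newline is matched too

print(default)  # ['s', 'd', 'e', 'f', 'w', 'e']
print(dotall)   # ['s', 'd', '\n', 'e', 'f', 'w', 'e']
```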
5. Match any one of the characters listed inside the [ ] brackets
m=re.findall('a[bcd]e','jabesadebacesse')
Result: ['abe', 'ade', 'ace']
6. \d matches a digit; \D matches any non-digit character
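Item 6 has no example of its own, so here is a small sketch on a made-up sample string:

```python
import re

sample = 'a1b22c'  # made-up sample string

digits = re.findall(r'\d', sample)      # every single digit
non_digits = re.findall(r'\D', sample)  # everything that is not a digit

print(digits)      # ['1', '2', '2']
print(non_digits)  # ['a', 'b', 'c']
```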
7. \s matches a whitespace character (space, tab, line break); \S matches any non-whitespace character
m=re.findall(r'\s',' vssf\t s')
Result: [' ', '\t', ' ']
8. \w matches a letter, digit, or underscore; \W matches any other character
m=re.findall(r'\w','1d*31&%4')
Result: ['1', 'd', '3', '1', '4']
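The example above only exercises \w. A short sketch showing \W as well, and that the underscore counts as a word character, on a made-up sample string:

```python
import re

sample = 'user_1 & id-2'  # made-up sample string

word_chars = re.findall(r'\w', sample)   # letters, digits, underscore
other_chars = re.findall(r'\W', sample)  # spaces and punctuation

print(word_chars)   # ['u', 's', 'e', 'r', '_', '1', 'i', 'd', '2']
print(other_chars)  # [' ', '&', ' ', '-']
```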
9. Extract the literal string abs
m=re.findall('abs','absdgregabssff')
Result: ['abs', 'abs']
To match only at the beginning of the string, put ^ before the pattern
m=re.findall('^abs','absdgregabssff')
Result: ['abs']
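By default ^ anchors only at the very start of the string. For multi-line text such as a downloaded page, the re.M flag (an addition to the notes above) makes ^ match at the start of every line. A minimal sketch on a made-up two-line string:

```python
import re

text = 'abs1\nabs2'  # made-up two-line sample

start_only = re.findall(r'^abs', text)         # start of the whole string
each_line = re.findall(r'^abs', text, re.M)    # start of every line

print(start_only)  # ['abs']
print(each_line)   # ['abs', 'abs']
```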
10. To make matching case-insensitive, pass the re.I flag
m=re.findall('abc','abcABCdf',re.I)
Result: ['abc', 'ABC']
11. Match 0 or 1 occurrence of the preceding character: ?
m=re.findall('ab?','abbbabbba')
Result: ['ab', 'ab', 'a']
12. Match 1 or more occurrences of the preceding character: +
m=re.findall('ab+','abbbabbba')
Result: ['abbb', 'abbb']
13. Match 0 or more occurrences of the preceding character: *
m=re.findall('ab*','abbbabbba')
Result: ['abbb', 'abbb', 'a']
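The three quantifiers from items 11 to 13 differ only in how many b characters they accept. A small sketch that puts a c after the b so the contrast is visible on whole words; the sample string is made up:

```python
import re

text = 'ac abc abbc'  # made-up sample: zero, one, and two b's

zero_or_one = re.findall(r'ab?c', text)   # at most one b
one_or_more = re.findall(r'ab+c', text)   # at least one b
zero_or_more = re.findall(r'ab*c', text)  # any number of b's

print(zero_or_one)   # ['ac', 'abc']
print(one_or_more)   # ['abc', 'abbc']
print(zero_or_more)  # ['ac', 'abc', 'abbc']
```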
14. Match email addresses ending in .com
m=re.findall(r'\w+@\w+\.com','[email protected];[email protected]')
Result: ['[email protected]']
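A runnable sketch of the same pattern, using two hypothetical addresses on the reserved example.com and example.org domains. They are placeholders for illustration and are not the addresses from the example above:

```python
import re

# Hypothetical addresses, separated by a semicolon.
addresses = 'alice@example.com;bob@example.org'

# \w+ before and after the @, then a literal ".com" (the dot is escaped).
com_only = re.findall(r'\w+@\w+\.com', addresses)

print(com_only)  # ['alice@example.com']
```

The pattern is deliberately simple: \w does not cover dots or hyphens, so addresses that contain them would need a wider character class.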