Python crawler basics + regular expressions

Crawl a website:

import urllib.request
res=urllib.request.urlopen('https://www.csdn.net/')
print(res.read())  # read() returns raw bytes; call .decode('utf-8') to get text
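A slightly fuller sketch of the same fetch may be useful, since some sites reject requests without a browser-like User-Agent and read() returns bytes rather than text. The helper names fetch_text and decode_body, the header value, and the 10-second timeout are illustrative assumptions, not part of the original post.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Turn response bytes into text, falling back to UTF-8."""
    return raw.decode(charset or 'utf-8', errors='replace')


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body as a string."""
    # Assumed header value; some servers refuse urllib's default User-Agent.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=timeout) as res:
        # The charset comes from the Content-Type header when the server sends one.
        charset = res.headers.get_content_charset()
        return decode_body(res.read(), charset)

# Usage: print(fetch_text('https://www.csdn.net/')[:200])
```

The returned string can then be passed straight to re.findall.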

Regular expressions for crawlers:

1. Extract four consecutive digits: \d\d\d\d

import re
m=re.findall(r'\d\d\d\d','123adfa56sne6742')
print(m)

Result: ['6742']
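As a small sketch of the same idea, the repeated \d can be written with a count, and a raw string keeps the backslash from being treated as a Python escape. The second input string is made up for illustration.

```python
import re

text = '123adfa56sne6742'

# A raw string (r'...') passes the backslash through to the regex engine.
four_digits = re.findall(r'\d\d\d\d', text)
# {4} is shorthand for repeating the previous token four times.
same_with_count = re.findall(r'\d{4}', text)

print(four_digits)      # ['6742']
print(same_with_count)  # ['6742']

# A longer run of digits is split into consecutive four-digit chunks.
chunks = re.findall(r'\d{4}', '20240131')
print(chunks)           # ['2024', '0131']
```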

2. When the text before and after the target is known, wrap the middle part in a capturing group (.*) so that only the middle is returned

m=re.findall(r'<div>(.*)</div>','<div>hello</div>')

Result: ['hello']

3. When there are several <div>...</div> pairs, the pattern above matches from the first <div> to the last </div>, because .* is greedy:

m=re.findall(r'<div>(.*)</div>','<div>hello</div><div>world</div>')

This returns ['hello</div><div>world']. To match each pair separately, add ? after .* inside the parentheses to make the match non-greedy:

m=re.findall(r'<div>(.*?)</div>','<div>hello</div><div>world</div>')

Result: ['hello', 'world']
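Running the greedy and non-greedy patterns side by side makes the difference easier to see. This is only a sketch on the same sample string; the variable names are mine.

```python
import re

html = '<div>hello</div><div>world</div>'

# Greedy: .* grabs as much as possible, up to the last </div>.
greedy = re.findall(r'<div>(.*)</div>', html)
# Non-greedy: .*? stops at the first </div> it can reach.
lazy = re.findall(r'<div>(.*?)</div>', html)

print(greedy)  # ['hello</div><div>world']
print(lazy)    # ['hello', 'world']
```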

4. Match any character except a line break: .

m=re.findall('.','sd\nefwe')

Result: ['s', 'd', 'e', 'f', 'w', 'e']
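Page source usually spans many lines, so it may help to know that the re.S (re.DOTALL) flag lets . match line breaks too. A minimal sketch, with a made-up <p> snippet as the second input:

```python
import re

text = 'sd\nefwe'

without_flag = re.findall(r'.', text)
with_dotall = re.findall(r'.', text, re.S)

print(without_flag)  # ['s', 'd', 'e', 'f', 'w', 'e']
print(with_dotall)   # ['s', 'd', '\n', 'e', 'f', 'w', 'e']

# With re.S, a captured group can cross line breaks.
paragraph = re.findall(r'<p>(.*?)</p>', '<p>line one\nline two</p>', re.S)
print(paragraph)     # ['line one\nline two']
```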

5. Match any one of the characters listed inside [ ] brackets

m=re.findall('a[bcd]e','jabesadebacesse')

Result: ['abe', 'ade', 'ace']
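Character classes also accept ranges and negation. The sketch below reuses the sample string; the input for the negated class is made up.

```python
import re

text = 'jabesadebacesse'

listed = re.findall(r'a[bcd]e', text)   # characters listed one by one
ranged = re.findall(r'a[b-d]e', text)   # the same set written as a range
print(listed)   # ['abe', 'ade', 'ace']
print(ranged)   # ['abe', 'ade', 'ace']

# A leading ^ inside the brackets negates the set.
negated = re.findall(r'a[^bcd]e', 'abe axe a1e')
print(negated)  # ['axe', 'a1e']
```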

6. \d matches a digit; \D matches a non-digit character
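This item has no example, so here is a small sketch on a made-up string:

```python
import re

text = 'a1b22'

digits = re.findall(r'\d', text)
non_digits = re.findall(r'\D', text)
numbers = re.findall(r'\d+', text)  # + groups adjacent digits together

print(digits)      # ['1', '2', '2']
print(non_digits)  # ['a', 'b']
print(numbers)     # ['1', '22']
```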

7. \s matches a whitespace character (space, tab, line break); \S matches a non-whitespace character

m=re.findall(r'\s',' vssf\t s')

Result: [' ', '\t', ' ']
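The opposite class, \S, is often the more useful one when cleaning scraped text. A sketch on the same string:

```python
import re

text = ' vssf\t s'

non_space_chars = re.findall(r'\S', text)
words = re.findall(r'\S+', text)            # runs of non-whitespace
split_words = re.split(r'\s+', text.strip())  # split on runs of whitespace

print(non_space_chars)  # ['v', 's', 's', 'f', 's']
print(words)            # ['vssf', 's']
print(split_words)      # ['vssf', 's']
```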

8. \w matches a letter, digit, or underscore; \W matches any other character

m=re.findall(r'\w','1d*31&%4')

Result: ['1', 'd', '3', '1', '4']
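A short sketch of \W on the same string, plus two made-up inputs showing that \w also accepts the underscore:

```python
import re

symbols = re.findall(r'\W', '1d*31&%4')
print(symbols)          # ['*', '&', '%']

# The underscore counts as a word character; the hyphen does not.
with_underscore = re.findall(r'\w', 'a_b-c')
print(with_underscore)  # ['a', '_', 'b', 'c']

identifiers = re.findall(r'\w+', 'user_name = guest01')
print(identifiers)      # ['user_name', 'guest01']
```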

9. Match the literal string abs

m=re.findall('abs','absdgregabssff')

Result: ['abs', 'abs']

To match only at the beginning of the string, put ^ before the pattern

m=re.findall('^abs','absdgregabssff')

Result: ['abs']
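The counterpart of ^ is $, which anchors a match at the end. With re.M (re.MULTILINE), both anchors also apply at each line break. A sketch, where the two-line input is made up:

```python
import re

text = 'absdgregabssff'

at_end = re.findall(r'ff$', text)
print(at_end)          # ['ff']

lines = 'abs1\nabs2'
first_line_only = re.findall(r'^abs', lines)
every_line = re.findall(r'^abs', lines, re.M)

print(first_line_only)  # ['abs']
print(every_line)       # ['abs', 'abs']
```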

10. For case-insensitive matching, pass the re.I flag

m=re.findall('abc','abcABCdf',re.I)

Result: ['abc', 'ABC']
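re.I is short for re.IGNORECASE, and the same behaviour can be requested inside the pattern or on a compiled pattern. A sketch of the three spellings; the last input string is made up.

```python
import re

text = 'abcABCdf'

long_flag = re.findall(r'abc', text, re.IGNORECASE)
inline_flag = re.findall(r'(?i)abc', text)  # flag written inside the pattern

print(long_flag)    # ['abc', 'ABC']
print(inline_flag)  # ['abc', 'ABC']

# Compiling once is convenient when the same pattern is reused on many pages.
pattern = re.compile(r'abc', re.I)
reused = pattern.findall('AbC abc')
print(reused)       # ['AbC', 'abc']
```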

11. ? matches the preceding character 0 or 1 times

m=re.findall('ab?','abbbabbba')

Result: ['ab', 'ab', 'a']

12. + matches the preceding character 1 or more times

m=re.findall('ab+','abbbabbba')

Result: ['abbb', 'abbb']

13. * matches the preceding character 0 or more times

m=re.findall('ab*','abbbabbba')

Result: ['abbb', 'abbb', 'a']
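The three quantifiers above can be compared in one run, together with an exact count {n} and the non-greedy form +?. This is just a sketch on the same sample string.

```python
import re

text = 'abbbabbba'

zero_or_one = re.findall(r'ab?', text)
one_or_more = re.findall(r'ab+', text)
zero_or_more = re.findall(r'ab*', text)
exactly_two = re.findall(r'ab{2}', text)   # 'a' followed by exactly two 'b'
lazy_plus = re.findall(r'ab+?', text)      # as few 'b' as possible, at least one

print(zero_or_one)   # ['ab', 'ab', 'a']
print(one_or_more)   # ['abbb', 'abbb']
print(zero_or_more)  # ['abbb', 'abbb', 'a']
print(exactly_two)   # ['abb', 'abb']
print(lazy_plus)     # ['ab', 'ab']
```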

14. Match email addresses that end in .com

m=re.findall(r'\w+@\w+\.com','[email protected];[email protected]')

Result: ['[email protected]']
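A runnable sketch of the same pattern follows. The addresses are made-up placeholders on example.com and example.org, not the ones from the original post. It also shows one limitation: \w does not match a dot, so a dotted name before the @ is only partly captured unless the class is widened.

```python
import re

# Placeholder addresses, invented for this example.
text = 'alice@example.com;bob@example.org'

com_only = re.findall(r'\w+@\w+\.com', text)
print(com_only)  # ['alice@example.com']

dotted = 'first.last@example.com'
partial = re.findall(r'\w+@\w+\.com', dotted)
full = re.findall(r'[\w.]+@\w+\.com', dotted)  # allow dots before the @

print(partial)   # ['last@example.com']
print(full)      # ['first.last@example.com']
```

Real-world addresses vary more than this, so the pattern is best treated as a quick filter rather than a validator.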

Origin blog.csdn.net/qq_42740834/article/details/105329719