Lagou crawler knowledge points
map()
Pool.map() assigns tasks automatically, applying the function to each URL in the array in turn, which is convenient for multi-process crawlers.
import urllib.request
import urllib.error
from multiprocessing import Pool

def scrape(url):
    try:
        urllib.request.urlopen(url)
        print(f'URL {url} Scraped')
    except (urllib.error.HTTPError, urllib.error.URLError):
        print(f'URL {url} not Scraped')

if __name__ == '__main__':
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'http://www.meituan.com/',
        'http://blog.csdn.net/',
        'http://xxxyxxx.net'
    ]
    pool.map(scrape, urls)
    pool.close()
    pool.join()  # wait for all worker processes to finish
Several uses of requests
When you need to download an image, save it directly in binary form. Note that r.content is the raw byte string, while r.text is the decoded unicode text.
r = requests.get(CONST.RESOURCES[0], headers=headers)
# print(r.text)
with open('picTest.png', 'wb') as pic:
    pic.write(r.content)
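The bytes-vs-text distinction can be checked offline by constructing a Response object by hand; note that `_content` is a private attribute, set here only so the sketch needs no network:

```python
import requests

# build a Response manually so no network is needed; _content is private
resp = requests.models.Response()
resp._content = '你好'.encode('utf-8')  # raw bytes, as received on the wire
resp.encoding = 'utf-8'                 # tells .text which codec to use

print(type(resp.content))  # <class 'bytes'>
print(resp.text)           # 你好
```

For a real response the same two attributes behave identically: r.content for binary files, r.text for HTML or JSON you intend to read as a string.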
r.cookies lets you read the cookies; cookies can also be set in the headers:
'Cookie': '_octo=GH1.1.1849343058.1576602081; _ga=GA1.2.90460451.1576602
You can also pass cookies explicitly in the request:
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)
r = requests.get('https://github.com/', cookies=jar, headers=headers)
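The split logic can be exercised offline with a sample cookie header (the string below is made up). The maxsplit=1 in split('=', 1) matters because cookie values may themselves contain '=':

```python
raw = '_octo=GH1.1; _ga=GA1.2; token=abc=def'  # made-up sample header
cookies = {}
for item in raw.split(';'):
    # maxsplit=1 keeps any '=' inside the value intact
    key, value = item.strip().split('=', 1)
    cookies[key] = value

print(cookies['token'])  # abc=def
```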
Session and SSL certificates
requests.Session() creates a session that persists cookies and headers across requests.
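A minimal sketch (the User-Agent value and cookie are arbitrary examples): anything set on the session is reused by every request it sends.

```python
import requests

s = requests.Session()
s.headers.update({'User-Agent': 'my-crawler'})  # arbitrary example value
s.cookies.set('sessionid', 'abc123')            # persisted across requests

# every s.get()/s.post() now carries this header and cookie automatically
print(s.headers['User-Agent'])
print(s.cookies.get('sessionid'))
```

This is why a session is the usual way to stay logged in: the server's Set-Cookie responses are stored on s.cookies and sent back on subsequent calls without any manual handling.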
SSL certificate verification error
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)
Suppressed by passing the verify parameter. Note that verify=False disables certificate verification entirely, so only use it when you accept that risk:
response = requests.get('url', verify=False)
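With verify=False, requests still emits an InsecureRequestWarning on every call; it can be silenced through urllib3, which ships as a dependency of requests:

```python
import urllib3

# silence the InsecureRequestWarning emitted when verify=False is used
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
```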
Timeout setting
Without a timeout argument, requests waits indefinitely by default; pass a single number, or a (connect, read) tuple in seconds:
r = requests.get('url', timeout=(5, 30))  # 5 s to connect, 30 s to read
Authentication
Use the auth parameter built into requests; a (user, password) tuple is shorthand for HTTP Basic authentication.
r = requests.get('url', auth=('admin', 'admin'))
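You can see what the tuple does without sending anything by preparing the request: the credentials become a base64-encoded Basic Authorization header (example.com and the credentials are placeholders):

```python
import requests

# prepare (but do not send) a request to inspect the generated header
req = requests.Request('GET', 'https://example.com',
                       auth=('admin', 'admin')).prepare()
print(req.headers['Authorization'])  # Basic YWRtaW46YWRtaW4=
```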
Proxy settings
Find a working proxy pool online and substitute the IPs below. (With my current shallow knowledge, simply swapping in a proxy IP does not seem very effective; I will study this further.)
proxies = {
'http': 'http://10.10.10.10:1080',
'https': 'http://10.10.10.10:1080',
}
requests.get('https://httpbin.org/get', proxies=proxies)