Notes on the Python requests library

Knowledge points from a Lagou crawler project

map()

Pool.map() distributes tasks automatically: it applies the given function to each URL in the list in turn, which is convenient for multi-process crawlers.

from multiprocessing import Pool
import urllib.error
import urllib.request

def scrape(url):
    try:
        urllib.request.urlopen(url)
        print(f'URL {url} Scraped')
    except (urllib.error.HTTPError, urllib.error.URLError):
        print(f'URL {url} not Scraped')

if __name__ == '__main__':
    pool = Pool(processes=3)
    urls = [
        'https://www.baidu.com',
        'http://www.meituan.com/',
        'http://blog.csdn.net/',
        'http://xxxyxxx.net'
    ]
    pool.map(scrape, urls)
    pool.close()

Several uses of requests

When you need to download an image, you can save it directly in the appropriate format. Note that r.content is the raw response body as bytes, while r.text is the body decoded to a unicode string.

r = requests.get(CONST.RESOURCES[0], headers=headers)
# print(r.text)
with open("picTest.png", 'wb') as pic:
    pic.write(r.content)

r.cookies lets you read the cookies a response sets; you can then put cookies into the request headers:

 'Cookie': '_octo=GH1.1.1849343058.1576602081; _ga=GA1.2.90460451.1576602
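A minimal sketch of reading the cookies a response sets and rebuilding a Cookie header from them (baidu.com is used here only as a stand-in target):

import requests

r = requests.get('https://www.baidu.com')
# r.cookies is a RequestsCookieJar; convert it to a plain dict to inspect it
cookie_dict = requests.utils.dict_from_cookiejar(r.cookies)
print(cookie_dict)

# rebuild a Cookie header string from the dict for use in later requests
headers = {'Cookie': '; '.join(f'{k}={v}' for k, v in cookie_dict.items())}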

You can also pass cookies separately in the request via a cookie jar:

# 'cookies' is the raw cookie string copied from the browser, e.g. 'key1=value1; key2=value2'
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key.strip(), value)
r = requests.get('https://github.com/', cookies=jar, headers=headers)

Session and SSL certificate

requests.Session() establishes a session that persists cookies and other parameters across requests:
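A minimal sketch of a session carrying a cookie from one request to the next (httpbin.org is used only as a test target; the cookie name and value are arbitrary):

import requests

s = requests.Session()
# the first request sets a cookie on the session
s.get('https://httpbin.org/cookies/set/number/123456789')
# the second request automatically carries that cookie
r = s.get('https://httpbin.org/cookies')
print(r.text)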
SSL certificate verification error

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)

This can be bypassed by passing the verify parameter:

response = requests.get('url', verify=False)
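With verify=False the request still goes through, but requests prints an InsecureRequestWarning on every call. A minimal sketch for silencing it, using the urllib3 package that requests depends on (the target URL is just a placeholder):

import urllib3
import requests

# suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('https://example.com', verify=False)
print(response.status_code)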

Timeout setting

If no timeout parameter is passed, requests waits indefinitely by default; the tuple below sets a 5-second connect timeout and a 30-second read timeout.

r = requests.get('url', timeout=(5, 30))
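When the timeout is exceeded, requests raises requests.exceptions.Timeout, which you can catch. A minimal sketch (httpbin.org/delay/10 responds after 10 seconds, so the 3-second read timeout fires):

import requests

try:
    # (connect timeout, read timeout) in seconds; a single number applies to both
    r = requests.get('https://httpbin.org/delay/10', timeout=(5, 3))
except requests.exceptions.Timeout:
    print('request timed out')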

Authentication

Use the auth parameter built into requests (HTTP Basic authentication):

r = requests.get('url', auth=('admin', 'admin'))
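Passing a tuple is shorthand for requests.auth.HTTPBasicAuth. A minimal sketch against httpbin's basic-auth test endpoint (the admin/admin credentials are just the ones from the example above):

import requests
from requests.auth import HTTPBasicAuth

# the tuple form auth=('admin', 'admin') is equivalent to this explicit object
r = requests.get('https://httpbin.org/basic-auth/admin/admin',
                 auth=HTTPBasicAuth('admin', 'admin'))
print(r.status_code)  # 200 when the credentials match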

Proxy settings

Find a working proxy pool online and substitute its IPs below.
With my current limited knowledge, simply swapping in these IPs doesn't seem to achieve much yet; I'll keep studying this.

proxies = {
    'http': 'http://10.10.10.10:1080',
    'https': 'http://10.10.10.10:1080',
}
requests.get('https://httpbin.org/get', proxies=proxies)
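If the proxy requires authentication, the credentials go into the proxy URL. httpbin's /get endpoint echoes the origin IP, which is one way to check whether the proxy actually took effect (the proxy address and credentials below are placeholders):

import requests

# user, password, and the proxy address are placeholders for a real proxy
proxies = {
    'http': 'http://user:password@10.10.10.10:1080',
    'https': 'http://user:password@10.10.10.10:1080',
}
r = requests.get('https://httpbin.org/get', proxies=proxies)
print(r.json()['origin'])  # the IP address the target server saw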


Source: blog.csdn.net/weixin_44602409/article/details/107306337