Python crawler base library

1. urllib.request

  1. API of the urlopen() function
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

1. The data parameter
The data parameter is optional. If you want to add parameters, they must be in byte-stream (bytes) format; if they are not, convert them with the bytes() method. When data is supplied, the request method is no longer GET but POST.

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
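Before converting, it helps to see what urlencode() actually produces. A minimal sketch (the parameter names here are just examples, not part of the API):

```python
from urllib import parse

# urlencode() turns a dict into a query string; bytes() then converts it
# to the byte type that urlopen's data parameter requires
params = {'word': 'hello', 'page': 2}
query = parse.urlencode(params)
print(query)   # word=hello&page=2
data = bytes(query, encoding='utf8')
print(data)    # b'word=hello&page=2'
```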

2. The timeout parameter
The timeout parameter sets the timeout in seconds: if the server does not respond within the set time, an error is raised.

import socket
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen('http://www.baidu.com/get',timeout=1)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')
  2. Request
    Constructing a Request
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request; this parameter is required.
data: if supplied, it must be of type bytes. If you have a dictionary, encode it first with urlencode() from the urllib.parse module.
headers: the request headers, commonly used to disguise the request as coming from a browser.
origin_req_host: the host name or IP address of the requester.
unverifiable: indicates whether the request is unverifiable; the default is False. In other words, the user does not have sufficient permission to choose whether to accept the result of the request.
method: the HTTP method to use, such as GET, POST, or PUT.

from urllib import request,parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent':'Mozilla/4.0',
    'Host':'httpbin.org'
}
params = {
    'name':'Germey'
}
data = bytes(parse.urlencode(params),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))
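Headers can also be attached after constructing the Request with add_header(), as an alternative to passing the headers dict up front. A short sketch (no request is actually sent here):

```python
from urllib import parse, request

data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf8')
req = request.Request('http://httpbin.org/post', data=data)

# add_header() attaches one header at a time; key names are stored capitalized
req.add_header('User-Agent', 'Mozilla/4.0')

print(req.get_method())              # POST, because data was supplied
print(req.has_header('User-agent'))  # True
```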

  3. Proxies
from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener
proxy_handler = ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
  4. Cookies
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
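To reuse cookies across runs, http.cookiejar also provides MozillaCookieJar, which persists cookies to disk in the Netscape cookies.txt format. A minimal sketch (the filename is just an example, and the opener call is left commented out since it needs a live connection):

```python
import http.cookiejar
import urllib.request

# A MozillaCookieJar bound to a file can save and reload cookies
cookie = http.cookiejar.MozillaCookieJar('cookies.txt')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# response = opener.open('http://www.baidu.com')  # populates the jar

# ignore_discard keeps session cookies; ignore_expires keeps expired ones
cookie.save(ignore_discard=True, ignore_expires=True)

# a fresh jar can load the saved file back later
restored = http.cookiejar.MozillaCookieJar()
restored.load('cookies.txt', ignore_discard=True, ignore_expires=True)
```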

Origin blog.csdn.net/qq_42692319/article/details/104343466