The Requests library for web crawlers and a general code framework for crawling web pages

Requests library

7 main methods

| Method | Description |
| --- | --- |
| requests.request() | Constructs a request; the foundation of the six methods below |
| requests.get() | The main method for retrieving an HTML page, corresponding to HTTP GET |
| requests.head() | Retrieves only the header information of an HTML page, corresponding to HTTP HEAD |
| requests.post() | Submits a POST request to an HTML page, corresponding to HTTP POST |
| requests.put() | Submits a PUT request to an HTML page, corresponding to HTTP PUT |
| requests.patch() | Submits a partial-modification request to an HTML page, corresponding to HTTP PATCH |
| requests.delete() | Submits a delete request to an HTML page, corresponding to HTTP DELETE |
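
For a quick feel for the two most common of these methods, here is a minimal sketch; https://httpbin.org/get is only an assumed public test endpoint, and any reachable URL behaves the same way.

import requests

# httpbin.org is an assumed test endpoint, used here purely for illustration
r = requests.get('https://httpbin.org/get')    # full response: headers and body
print(r.status_code)

r = requests.head('https://httpbin.org/get')   # header information only, no body
print(r.headers)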

get method

Introduction to the get method
requests.get(url, params=None, **kwargs)

  • url: the URL of the page to be retrieved
  • params: extra query parameters appended to the URL, as a dictionary or byte stream; optional
  • **kwargs: 13 parameters that control access, listed below (a combined example follows the list):
params: A dictionary or byte sequence, appended to the URL as query parameters.

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('GET', 'http://www.python123.io/ws', params=kv)
    print(r.url)
    # https://www.python123.io/ws?key1=value1&key2=value2

data: A dictionary, byte sequence, or file object, sent as the body of the request.

    kv = {'key1': 'value1', 'key2': 'value2'}
    r = requests.request('GET', 'http://www.python123.io/ws', data=kv)

json: Data in JSON format, sent as the body of the request.

    kv = {'key1': 'value1'}
    r = requests.request('POST', 'http://www.python123.io/ws', json=kv)

headers: A dictionary of custom HTTP headers.

    hd = {'user-agent': 'Chrome/10'}
    r = requests.request('POST', 'http://www.python123.io/ws', headers=hd)

cookies: A dictionary or CookieJar, the cookies to send with the request.

auth: A tuple, enabling HTTP authentication.

files: A dictionary, for uploading files.

    fs = {'file': open('test.xls', 'rb')}
    r = requests.request('POST', 'http://www.python123.io/ws', files=fs)

timeout: The timeout in seconds.

    r = requests.request('GET', 'http://www.python123.io/ws', timeout=10)

proxies: A dictionary of proxy servers; login credentials may be included.

    pxs = {'http': 'http://user:pass@10.10.10.1:1234',
           'https': 'https://10.10.10.1:4321'}
    r = requests.request('GET', 'http://www.python123.io/ws', proxies=pxs)

allow_redirects: True/False, default True; switch for following redirects.

stream: True/False, default False; switch for deferring the download of the response body (when True, the body is not downloaded until accessed).

verify: True/False, default True; switch for SSL certificate verification.

cert: Path to a local SSL certificate.
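
As a combined illustration of these parameters, the sketch below passes params, headers, and timeout to a single get() call, reusing the example URL and values from the list above.

import requests

kv = {'key1': 'value1', 'key2': 'value2'}   # query parameters for the URL
hd = {'user-agent': 'Chrome/10'}            # custom User-Agent header

# params, headers and timeout combined in one call
r = requests.get('http://www.python123.io/ws', params=kv, headers=hd, timeout=10)
print(r.url)   # the final URL, with ?key1=value1&key2=value2 appended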

### Properties of the Response object

| Attribute | Description |
| --- | --- |
| r.status_code | The HTTP status code of the response; 200 means success, any other value (e.g. 404) means failure |
| r.text | The response body as a string, i.e. the page content for the URL |
| r.encoding | The response encoding guessed from the HTTP headers |
| r.apparent_encoding | The response encoding inferred from the content itself |
| r.content | The response body in binary form |
| r.raise_for_status() | Raises requests.HTTPError if the status code indicates an error (not 200) |
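
Put together, the usual encoding handling with these attributes looks like the sketch below, using the same baidu.com page as the framework at the end of this article.

import requests

r = requests.get('http://www.baidu.com')
print(r.status_code)                # 200 if the request succeeded
print(r.encoding)                   # guessed from the headers, often ISO-8859-1
print(r.apparent_encoding)          # inferred from the body, e.g. utf-8
r.encoding = r.apparent_encoding    # re-decode r.text with the better guess
print(r.text[:200])                 # the first 200 characters of the page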

Requests library exceptions

| Exception | Description |
| --- | --- |
| requests.ConnectionError | Network connection errors, such as DNS lookup failure or a refused connection |
| requests.HTTPError | HTTP error (an error status code) |
| requests.URLRequired | A required URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | The connection to the remote server timed out |
| requests.Timeout | The request timed out (while connecting or reading) |
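
These exceptions can also be caught individually, as in the sketch below; the common framework that follows simply catches them all at once. The https://httpbin.org/status/404 URL is only an assumed test target.

import requests

try:
    # httpbin.org is an assumed test endpoint that returns the requested status code
    r = requests.get('https://httpbin.org/status/404', timeout=5)
    r.raise_for_status()                # raises requests.HTTPError for an error status
except requests.Timeout:                # covers requests.ConnectTimeout as well
    print('request timed out')
except requests.ConnectionError:
    print('network problem: DNS failure, refused connection, ...')
except requests.HTTPError as err:
    print('bad HTTP status:', err)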

Common code framework

import requests

def get_html_text(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise requests.HTTPError on an error status
        r.encoding = r.apparent_encoding  # use the encoding inferred from the content
        return r.text
    except requests.RequestException:     # any Requests exception yields an empty page
        return ''

url = 'http://www.baidu.com'
text = get_html_text(url)  # the page text, ready for further processing
