Python crawler notes: the requests library

  1. Preparations
    Install PyCharm and the requests library (e.g. pip install requests)
  2. requests crawler (template)
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()            # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))             # print the content of the url page

When starting to write a crawler, we must pay attention to the response status: if the server returns 404, we can adjust the request in time.

  1. Attributes of the Response object
    r.status_code: the return status of the HTTP request; 200 indicates a successful connection, 404 indicates failure.
    r.text: The string form of the HTTP response content, that is, the page content corresponding to the URL.
    r.encoding: The response content encoding method guessed from the HTTP header.
    r.apparent_encoding: the response content encoding analyzed from the content itself (an alternative to r.encoding).
    r.content: Binary form of HTTP response content.
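
A minimal sketch that prints these attributes (baidu.com is used here only as an example URL):

import requests

r = requests.get("http://www.baidu.com", timeout=30)
print(r.status_code)            # 200 on success
print(r.encoding)               # encoding guessed from the HTTP headers
print(r.apparent_encoding)      # encoding analyzed from the content itself
r.encoding = r.apparent_encoding
print(r.text[:200])             # first 200 characters of the decoded page
print(len(r.content))           # size of the raw binary content in bytes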

  2. Understanding the exceptions
    requests.ConnectionError: network connection errors, such as DNS query failure or connection refused.
    requests.HTTPError: HTTP error exception.
    requests.URLRequired: URL missing exception.
    requests.TooManyRedirects: raised when the maximum number of redirects is exceeded.
    requests.ConnectTimeout: timeout when connecting to a remote server.
    requests.Timeout: the request to the URL timed out, producing a timeout exception.
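
A sketch of catching these exceptions individually instead of using a bare except (the URL is only a placeholder):

import requests

try:
    r = requests.get("http://www.example.com", timeout=10)
    r.raise_for_status()
    print(r.status_code)
except requests.ConnectionError:
    print("Network connection error (DNS failure, connection refused, ...)")
except requests.Timeout:
    print("The request timed out")
except requests.HTTPError as e:
    print("HTTP error:", e)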

  3. The 7 main methods of the requests library
    requests.request(): constructs a request; the basic method underlying the six below.
    requests.get(): the main method for fetching an HTML page, corresponding to HTTP GET.
    requests.head(): obtains the header information of an HTML page, corresponding to HTTP HEAD.
    requests.post(): submits a POST request to an HTML page, corresponding to HTTP POST.
    requests.put(): submits a PUT request to an HTML page, corresponding to HTTP PUT.
    requests.patch(): submits a partial-modification request to an HTML page, corresponding to HTTP PATCH.
    requests.delete(): submits a delete request to an HTML page, corresponding to HTTP DELETE.
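
A short sketch contrasting requests.head() and requests.get() (baidu.com is only an example):

import requests

r = requests.head("http://www.baidu.com", timeout=30)
print(r.headers)        # only the response headers are retrieved
print(r.text)           # empty string: HEAD returns no body

r = requests.get("http://www.baidu.com", timeout=30)
print(len(r.text))      # the full page body is retrieved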

  4. HTTP protocol
    HTTP, Hypertext Transfer Protocol.
    HTTP is a stateless application layer protocol based on the "request and response" mode. The HTTP protocol uses a URL as an identifier for locating network resources. The format of the URL is as follows:

http://host[:port][path]
host: legal Internet host domain name or IP address
port: port number, default port is 80
path: path of the requested resource
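
A sketch mapping the URL parts onto a concrete request (httpbin.org and the explicit port are assumptions used only for illustration):

import requests

# host = httpbin.org, port = 80 (the HTTP default), path = /get
r = requests.get("http://httpbin.org:80/get", timeout=30)
print(r.url)
print(r.status_code)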

  1. HTTP protocol operation on resources
    GET: request to obtain the resource at the URL location.
    HEAD: request to obtain the response headers of the resource at the URL location, i.e. only the resource's header information.
    POST: Request to append new data to the resource at the URL location.
    PUT: Request to store a resource to the URL location, overwriting the resource at the original URL location.
    PATCH: Request to partially update the resource of the URL location, that is, change part of the content of the resource there.
    DELETE: Request to delete the resource stored in the URL location.
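
A sketch contrasting PUT (overwrite the whole resource) with PATCH (modify only part of it), using httpbin.org as an assumed test endpoint that simply echoes the submitted form data:

import requests

payload = {"name": "alice", "city": "beijing"}

# PUT: the submitted data replaces the whole resource
r = requests.put("http://httpbin.org/put", data=payload, timeout=30)
print(r.json()["form"])

# PATCH: only the field being changed needs to be sent
r = requests.patch("http://httpbin.org/patch", data={"city": "shanghai"}, timeout=30)
print(r.json()["form"])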
  2. requests.request(method, url, **kwargs)

method: request method, one of the 7 types such as GET / PUT / POST
url: URL of the page to be fetched
**kwargs: access control parameters, 13 in total

method: request method

r=requests.request('GET',url,**kwargs)
r=requests.request('HEAD',url,**kwargs)
r=requests.request('POST',url,**kwargs)
r=requests.request('PUT',url,**kwargs)
r=requests.request('PATCH',url,**kwargs)
r=requests.request('DELETE',url,**kwargs)
r=requests.request('OPTIONS',url,**kwargs)
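
The six convenience methods are thin wrappers around requests.request(); for example, the two calls below send the same request (the URL is only a placeholder):

import requests

url = "http://www.baidu.com"
r1 = requests.get(url, timeout=30)
r2 = requests.request('GET', url, timeout=30)
print(r1.status_code, r2.status_code)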

**kwargs: the optional parameters that control access

params: dictionary or byte sequence, added to the url as a parameter.
data: Dictionary, byte sequence or file object, as the content of the Request.
json: data in JSON format, as the content of the Request.
headers: dictionary, HTTP custom headers.
cookies: dictionary or CookieJar, cookies in Request.
auth: tuple, supports HTTP authentication.
files: Dictionary type, transfer files.
timeout: Set the timeout time in seconds.
proxies: dictionary type, set access to proxy server, you can add login authentication.
allow_redirects: True / False, the default is True, redirect switch.
stream: True / False, default False; when False the response content is downloaded immediately, when True it is streamed.
verify: True / False, default True, authentication SSL certificate switch.
cert: Local SSL certificate path.
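
A sketch combining several of these optional parameters (httpbin.org and the header value are assumptions chosen only for demonstration):

import requests

kv = {"key1": "value1", "key2": "value2"}
hd = {"user-agent": "Mozilla/5.0"}

r = requests.get("http://httpbin.org/get",
                 params=kv,      # appended to the url: ?key1=value1&key2=value2
                 headers=hd,     # custom HTTP headers
                 timeout=10)     # seconds before a Timeout exception is raised
print(r.url)
print(r.request.headers["user-agent"])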
