A detailed explanation of the urllib library for Python web crawling

One: The basic process of a crawler:
Initiate a request: send a request to the target site (the request can carry additional information, such as headers) and wait for the server's response.
Get the response content: the content of the returned page may be HTML, JSON, or binary data.
Parse the response content: HTML can be parsed with regular expressions or with a parsing library; JSON can be loaded into a Python object for processing.
Save the data, either as text or in a database.
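The parsing and saving steps above can be sketched with the standard library alone. The HTML string and the file name `result.txt` here are made-up samples for illustration; the request steps are covered by the urllib examples that follow.

```python
import re

# Step 3 of the flow above: parse HTML content, here with a regular expression.
# The HTML string is a made-up sample so the sketch runs without a network.
html = '<html><head><title>Example Page</title></head><body></body></html>'
match = re.search(r'<title>(.*?)</title>', html)
title = match.group(1) if match else None
print(title)  # Example Page

# Step 4: save the extracted data as text.
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write(title)
```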

Two: Libraries for requesting websites

1.

(1): The urllib library is Python's built-in HTTP request library. It differs between Python 2 and Python 3: in Python 2 the same functionality was split across the urllib and urllib2 libraries.
The urllib library consists of four main modules:
urllib.request: sends requests and retrieves their results {the most basic request module}
urllib.error: exception handling module
urllib.parse: URL parsing module
urllib.robotparser: robots.txt parsing module
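The parse and robotparser modules are only listed above, so here is a quick sketch of both. The URLs and robots.txt rules are made-up examples; the robots.txt lines are fed in as a string so the snippet runs without a network connection.

```python
from urllib.parse import urlparse, urlencode
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components.
parts = urlparse('http://www.baidu.com/s?wd=python')
print(parts.scheme)  # http
print(parts.netloc)  # www.baidu.com

# urlencode: turn a parameter dict into a query string.
print(urlencode({'wd': 'python'}))  # wd=python

# urllib.robotparser: parse robots.txt rules.  Normally you would call
# set_url() and read() against a live site; parse() accepts the lines
# directly, which keeps this example offline.
rp = RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'http://example.com/private/page'))  # False
print(rp.can_fetch('*', 'http://example.com/index.html'))    # True
```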

(2): Code examples of some methods in the urllib library

#urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
#This is the parameter of the urllib.request.urlopen function.
import urllib.request
response=urllib.request.urlopen("http://www.baidu.com")#This is a get request, just a url parameter.
print(type(response))
print(response.getheader("Date"))#Returns a single response header; pass the header name as the key.
print(response.getheaders())#Returns all of the response headers as a list of (name, value) pairs, without needing a key.
print(response.status)#Returns the status code.
print(response.read().decode('utf-8'))
#<class 'http.client.HTTPResponse'> is the type of response. It mainly provides
# methods such as read(), readinto(), getheader(name), getheaders(), fileno(),
# and attributes such as msg, version, status, reason, debuglevel, closed.

(3): The previous example used only the url parameter, but urlopen() actually accepts several more; the first three are the most commonly used. We can also pass other arguments, such as data (the request body) and timeout (the timeout in seconds).

The data parameter is optional. If you want to add data, it must be in byte-stream encoding, i.e. the bytes type, which can be produced with the bytes() function. Note that when the data parameter is passed, the request is no longer a GET request but a POST.

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
Here we pass a parameter word with the value hello. It needs to be transcoded to the bytes (byte stream) type with the bytes() method: its first argument must be of str (string) type, so urllib.parse.urlencode() is used first to convert the parameter dictionary into a string; the second argument specifies the encoding format, in this case utf8.
The request is submitted to httpbin.org, a site that provides HTTP request testing. The address http://httpbin.org/post can be used to test POST requests; it echoes back the request and response information, including the data parameter we passed.
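Since httpbin.org replies with JSON, the bytes returned by read() can be decoded and loaded into a dict with the standard json module. To keep the snippet runnable without a network connection, it parses a captured sample body instead of a live response:

```python
import json

# A sample of the JSON body that httpbin.org/post echoes back; in real use
# this would be the result of response.read().
body = b'{"form": {"word": "hello"}, "url": "http://httpbin.org/post"}'
payload = json.loads(body.decode('utf-8'))
print(payload['form']['word'])  # hello
```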

(4): timeout method

import urllib.request
response = urllib.request.urlopen('http://httpbin.org', timeout=0.1)
print(response.read())
#Raises urllib.error.URLError: <urlopen error timed out>, i.e. the request timed out.

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

This demonstrates handling timeouts with exception handling. We request the test link http://httpbin.org/get with the timeout set to 0.1 seconds, catch the urllib.error.URLError exception, and then check whether the reason for the exception is a timeout. If so, we conclude that the error was indeed caused by a timeout and print TIME OUT.
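The same check can be extended into a small retry helper, which is a common pattern in crawlers. This is only a sketch: the fetch_with_retry helper and the flaky fake opener are made up here, and the opener is passed in as a parameter so the retry logic can be exercised offline (in real use you would pass urllib.request.urlopen).

```python
import socket
import urllib.error

def fetch_with_retry(url, opener, retries=3):
    """Retry when a request times out; re-raise any other URLError.

    opener is any callable taking a URL (e.g. urllib.request.urlopen);
    making it a parameter keeps this sketch testable without a network.
    """
    for attempt in range(retries):
        try:
            return opener(url)
        except urllib.error.URLError as e:
            if isinstance(e.reason, socket.timeout) and attempt < retries - 1:
                continue  # timed out: try again
            raise  # out of retries, or a different error

# A fake opener that times out twice and then succeeds:
calls = []
def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.URLError(socket.timeout('timed out'))
    return 'OK'

result = fetch_with_retry('http://httpbin.org/get', flaky)
print(result)  # OK
```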

2.

Use of urllib.request.Request.

From the above, we know that the most basic requests can be initiated with the urlopen() method, but its few simple parameters are not enough to construct a complete request. If headers or other information need to be added to the request, we can use the more powerful Request class to construct it.


First, let's use an example to feel the usage of Request:

import urllib.request
request=urllib.request.Request("http://www.baidu.com")
response=urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
#In fact, the result of this method is the same as the result of using urlopen directly before.

Note that we still use the urlopen() method to send the request, but this time its argument is no longer a URL but a Request object. By constructing this object we can, on the one hand, treat the request as an independent object and, on the other, configure its parameters more richly and flexibly.

Let's take a look at some of the parameters of Request

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The url is mandatory, and the rest are optional.

If the data parameter is passed, it must be of the bytes (byte stream) type. If you start from a dictionary, encode it first with urllib.parse.urlencode().
The headers parameter is a dictionary. You can pass headers when constructing the Request, or add request headers afterwards by calling the add_header() method of the Request object. The most common use of request headers is to disguise the client as a browser by modifying the User-Agent. The default User-Agent
is Python-urllib; to impersonate, for example, the Firefox browser, you could set it to Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
origin_req_host is the host name or IP address of the requester.
unverifiable indicates whether the request is unverifiable; the default is False. It means the user does not have sufficient permission to choose whether to receive the result of the request. For example, if we request an image embedded in an HTML document but have no permission to fetch it automatically, then the value of unverifiable is True.

method is a string that indicates the method used by the request, such as GET, POST, PUT, etc.
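The headers and method parameters can be inspected on a Request object without ever sending it, which makes the add_header() method mentioned above easy to demonstrate offline. Note that Request normalizes header names by capitalizing only the first word (e.g. 'User-agent'), so lookups via get_header() use that form:

```python
import urllib.request

# Attach headers two ways: in the constructor and via add_header().
# Nothing is sent over the network; we only inspect the object.
req = urllib.request.Request(
    'http://www.baidu.com',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'})
req.add_header('Accept-Language', 'en-US')

# Request stores header names as 'User-agent', 'Accept-language', etc.
print(req.get_header('User-agent'))
print(req.get_header('Accept-language'))  # en-US
print(req.get_method())  # GET (no data and no method given, so GET is the default)
```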

Let's take an example to understand:

import urllib.request
import urllib.parse
url="http://httpbin.org/post"
data=bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
header={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    "Host":'httpbin.org'}
request=urllib.request.Request(url=url,headers=header,data=data,method="POST")
response=urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Operation result:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "222.173.104.238",
  "url": "http://httpbin.org/post"
}
You will find that we have successfully set parameters such as headers, data, and method.
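One last point on the urllib.error module introduced earlier: besides URLError, it defines HTTPError, raised when the server does answer but with an error status. Since HTTPError is a subclass of URLError, the usual pattern is to catch it first. The describe helper and the .invalid hostname below are made up for illustration; .invalid is a reserved top-level domain that never resolves, so the URLError branch can be exercised without a working server.

```python
import urllib.error
import urllib.request

def describe(url):
    """Classify the outcome of a request: success, HTTP error, or no response."""
    try:
        with urllib.request.urlopen(url) as resp:
            return 'status %d' % resp.status
    except urllib.error.HTTPError as e:
        # The server responded, but with an error status code.
        return 'HTTP error %d' % e.code
    except urllib.error.URLError as e:
        # The request never got a response (DNS failure, refused, timeout...).
        return 'failed to reach server: %s' % e.reason

# .invalid is guaranteed never to resolve, so this hits the URLError branch:
print(describe('http://nonexistent.invalid/'))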


