urllib is a standard library that ships with Python, so it can be used without installing anything.
It provides the following features:
- Making web page requests
- Reading responses
- Proxy and cookie settings
- Exception handling
- URL parsing
Almost everything a crawler needs can be found in urllib, and learning this standard library will also give you a deeper understanding of the more convenient requests library that came later.
---------------------------------------- I am the dividing line ----------------------------------------
Let's start with the simplest example:
```python
from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
After execution, you can get the HTML of the Baidu homepage.
The response object has a read() method that returns the fetched page content.
If you print the response directly without calling read(), you only get a description of the object rather than the HTML.
The example above uses urlopen(), the function that opens the target URL.
urlopen syntax
```python
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None,
                       capath=None, cadefault=False, context=None)
# url: the address to request
# data: extra data to send in the request body, such as an encoded form
#       (passing data switches the request from GET to POST)
```
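The effect of the data parameter on the request method can be checked without touching the network at all, by building urllib.request.Request objects and inspecting them (www.example.com below is just a placeholder host):

```python
from urllib import parse, request

# A Request without data defaults to the GET method
get_req = request.Request('http://www.example.com/')
print(get_req.get_method())  # GET

# Attaching a data payload switches the method to POST;
# the payload must be bytes, hence the .encode('utf-8')
payload = parse.urlencode({'key': 'value'}).encode('utf-8')
post_req = request.Request('http://www.example.com/', data=payload)
print(post_req.get_method())  # POST
```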
As the previous article showed, data is mainly transmitted in two ways: GET and POST. So what is the difference between them?
The most important difference:
With GET, you access the resource directly through a link, and the link carries all the parameters. If it contains a password this is not safe, but you can see at a glance what you submitted. With POST, the parameters are not shown in the URL, which is safer but makes it inconvenient to inspect what was submitted.
POST method:
Let's demonstrate first:
```python
import urllib.parse
import urllib.request

values = {"username": "1559186****", "password": "*********"}
# urlopen expects the POST body as bytes, so encode the query string
data = urllib.parse.urlencode(values).encode('utf-8')
url = "https://passport.csdn.net/account/login"
response = urllib.request.urlopen(url, data)
print(response.read())
```
urlencode converts a dictionary of parameters into a URL-encoded query string.
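A quick illustration of what urlencode produces (the parameter values are made up for the example):

```python
from urllib.parse import urlencode

params = {'username': 'user', 'password': 'pass word'}
query = urlencode(params)
# Keys and values are joined with '=' and '&', and unsafe
# characters such as the space are escaped ('+' here)
print(query)  # username=user&password=pass+word
```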
You can see that the core line is response = urllib.request.urlopen(url, data) — passing the data parameter is what makes this a POST request. The credentials in the example are placeholders, so the login itself will not succeed.
GET method:
Likewise, let's take an example:
```python
import urllib.parse
import urllib.request

values = {"username": "1559186****", "password": "*********"}
data = urllib.parse.urlencode(values)
url = "https://passport.csdn.net/account/login"
geturl = url + "?" + data
response = urllib.request.urlopen(geturl)
print(response.read())
```

You can see that the core line this time is: response = urllib.request.urlopen(geturl)
There is no data parameter here, only the url. And the URL itself is the original address, plus "?", plus the encoded parameters.
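Such a URL can also be taken apart again with urllib's URL-parsing helpers, the last feature in the list at the top. A small sketch, using a placeholder address in place of the login URL:

```python
from urllib.parse import urlparse, parse_qs

geturl = 'https://www.example.com/login?username=user&lang=en'
parts = urlparse(geturl)
print(parts.path)   # /login
print(parts.query)  # username=user&lang=en

# parse_qs turns the query string back into a dict;
# every value is a list, since a key may repeat in a URL
params = parse_qs(parts.query)
print(params)       # {'username': ['user'], 'lang': ['en']}
```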
These two examples, POST and GET, verify the conclusion from before:
- GET accesses the resource directly through a link, and the link contains all the parameters.
- POST does not show the parameters in the URL.
---------------------------------------- The above are the crawler basics I learned ----------------------------------------
---------------------------------------- Below is some extended material (other parts of the urllib library) ----------------------------------------
Use of timeout parameter
When I first practiced, I sent a request to http://httpbin.org/post, but there was no response for a long time.
Eventually it returned this error:
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
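urllib reports failures through two exception classes: URLError for problems reaching the server at all (DNS failure, timeout), and HTTPError for responses the server did send but with an error status, like the 503 above. HTTPError is a subclass of URLError, so when catching both, HTTPError has to come first. This can be sketched without any network traffic by constructing an HTTPError by hand (the URL is a placeholder):

```python
import urllib.error
from io import BytesIO

# HTTPError is a subclass of URLError, so an `except URLError`
# clause listed first would also swallow HTTP errors
assert issubclass(urllib.error.HTTPError, urllib.error.URLError)

# An HTTPError carries the status code and reason of the response
err = urllib.error.HTTPError('http://www.example.com/', 503,
                             'Service Unavailable', None, BytesIO(b''))
print(err.code)    # 503
print(err.reason)  # Service Unavailable
```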
It is better to set a timeout on the request; urlopen has a timeout parameter for this:
```python
from urllib import request

# 0.1 s is deliberately too short, so this request times out
response = request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())
```

The error returned this time is:
urllib.error.URLError: <urlopen error timed out>
You can also catch this as an exception:
```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # the underlying reason for a timeout is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
```