Python crawler from entry to abandonment (3) - basic usage of the Urllib library

Urllib is a standard library that ships with Python, so it can be used directly without installation.
It provides the following functions:

  • Requesting web pages
  • Getting responses
  • Proxy and cookie settings
  • Exception handling
  • URL parsing

The functions a crawler needs can basically all be found in urllib, and learning this standard library gives you a deeper understanding of the more convenient requests library that comes later.
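
As a quick orientation (my own summary, not from the original article), those functions map onto urllib's submodules roughly like this:

import urllib.request      # opening URLs: urlopen(), Request, proxy/cookie handlers
import urllib.error        # exceptions raised by urllib.request: URLError, HTTPError
import urllib.parse        # URL parsing and query-string encoding: urlparse(), urlencode()
import urllib.robotparser  # parsing robots.txt files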

-------------------------------------------------- I am the dividing line --------------------------------------------------

Let's start with the simplest example:

from urllib import request

# urlopen returns a response object; read() gives the raw bytes of the page
response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

After execution, you can get the HTML of the Baidu homepage.

The response object has a read() method that returns the obtained web page content as bytes.

If you print the response object directly without calling read(), you just get a description of the object (something like <http.client.HTTPResponse object at 0x...>).
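
Besides read(), the response object (an http.client.HTTPResponse for HTTP URLs) exposes some useful metadata; a small sketch of my own:

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.status)        # HTTP status code, e.g. 200
print(response.getheaders())  # response headers as a list of (name, value) tuples
print(response.read(100))     # read() consumes the body; here only the first 100 bytes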

The example above uses the urlopen() function, which is what actually accesses the target URL.

urlopen syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

# url: the URL to access

# data: additional data to send with the request, such as form data; supplying it turns the request into a POST
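
One detail worth noting (my addition, not in the original article): rather than hard-coding utf-8 when decoding, you can ask the response which charset the server declared:

from urllib import request

response = request.urlopen('http://www.baidu.com', timeout=10)
# response.headers behaves like an email.message.Message;
# get_content_charset() reads the charset from the Content-Type header
charset = response.headers.get_content_charset() or 'utf-8'
print(response.read().decode(charset))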

From the previous article, we know that data is mainly transmitted in two ways: POST and GET. So what is the difference between these two methods?

The most important difference:

The GET method accesses the resource directly through a link, and the link contains all the parameters. If one of them is a password this is insecure, but you can see at a glance what you submitted. POST does not show the parameters in the URL, which is safer, but it is less convenient when you want to inspect directly what was submitted.

POST method:

Let's demonstrate first:

import urllib.parse
import urllib.request

values = {"username":"1559186****","password":"*********"}
data = urllib.parse.urlencode(values)
url = "https://passport.csdn.net/account/login"
response = urllib.request.urlopen(url,data)
print(response.read())

urlencode is a function that URL-encodes a dict (or sequence of key-value pairs) of parameters into a query string.
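
To make concrete what urlencode produces, a quick illustration (my addition):

import urllib.parse

params = {"username": "abc", "next": "/home page"}
print(urllib.parse.urlencode(params))
# prints: username=abc&next=%2Fhome+page  -- special characters are escaped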

You can see that the core code is  response = urllib.request.urlopen(url, data)

Passing the data parameter is what makes urlopen issue a POST request. The content of data in this example is far too simple, so it cannot actually log in.
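
Since the CSDN login above cannot actually succeed, here is the same POST pattern sketched against http://httpbin.org/post, an echo service the article also uses later (my substitution), which reflects the submitted form back as JSON:

import urllib.parse
import urllib.request

values = {"username": "demo", "password": "demo"}
data = urllib.parse.urlencode(values).encode('utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data)
# the 'form' field of the returned JSON echoes the submitted parameters
print(response.read().decode('utf-8'))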

GET method:

Likewise, let's take an example:

import urllib.parse
import urllib.request

values = {"username":"15591861964","password":"yanhang1235813"}
data = urllib.parse.urlencode(values)
url = "https://passport.csdn.net/account/login"
geturl = url + "?"+data
response = urllib.request.urlopen(geturl)
print(response.read())
You can see that the core code this time is like this: response = urllib.request.urlopen(geturl)

There is no data parameter this time, only the URL, and that URL is the original one plus a '?' followed by the encoded parameters.
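
For comparison, the same parameters sent by GET against http://httpbin.org/get (my addition); this time they are visible in the URL itself:

import urllib.parse
import urllib.request

values = {"username": "demo", "password": "demo"}
data = urllib.parse.urlencode(values)
# the full URL is http://httpbin.org/get?username=demo&password=demo
response = urllib.request.urlopen('http://httpbin.org/get?' + data)
print(response.read().decode('utf-8'))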


The POST and GET examples above verify the conclusion:

The GET method accesses the resource directly through a link, and the link contains all the parameters.

POST does not display the parameters in the URL.

------------------------------------------ The above are some crawler principles I learned ------------------------------------------

------------------------------------------ Below is some extended learning (other contents of the Urllib library) ------------------------------------------

Use of timeout parameter

When I started to practice, I made a request to http://httpbin.org/post, but there was no response for a long time.

It finally returned this error message:

urllib.error.HTTPError: HTTP Error 503: Service Unavailable

It would be better to set a timeout on the request; urlopen has a timeout parameter for exactly this:

from urllib import request

# a deliberately tiny timeout to force the error
response = request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())

The error returned this time is:

urllib.error.URLError: <urlopen error timed out>

You can also treat this as an exception and catch it:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # a timeout surfaces as a URLError whose reason is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
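
For completeness (my addition): the 503 seen earlier is raised as urllib.error.HTTPError, which is a subclass of URLError, so a robust handler catches HTTPError first:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/status/503')
except urllib.error.HTTPError as e:
    # the server responded, but with an error status code
    print('HTTP Error:', e.code, e.reason)
except urllib.error.URLError as e:
    # the request failed before getting a proper response (DNS error, timeout, ...)
    print('URL Error:', e.reason)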
