Python web crawler study notes (three): the use of urllib library

Use urllib library

First, let's get to know the urllib library, Python's built-in HTTP request library, which means it can be used without any additional installation. It contains the following four modules.

request : the most basic HTTP request module, used to simulate sending a request. Just like typing a URL into the browser and pressing Enter, you only need to pass the URL and a few extra arguments to the library's methods to simulate this process.

error : the exception handling module. If a request error occurs, we can catch the exception and then retry or take other action to ensure that the program does not terminate unexpectedly.

parse : a utility module that provides many URL processing methods, such as splitting, parsing, and merging (see the sketch below).

robotparser : mainly used to parse a website's robots.txt file and determine which pages may be crawled and which may not. It is used less often in practice.
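
As a quick, hedged illustration of the parse and robotparser modules mentioned above (the URLs below are placeholders chosen for illustration, not taken from the original notes):

from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into its components
result = urlparse('http://www.python.org/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path)

# Merge a base URL with a relative link
print(urljoin('http://www.python.org/', 'about/'))

# Parse robots.txt and check whether a page may be crawled
rp = RobotFileParser()
rp.set_url('http://www.python.org/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.python.org/about/'))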

1.urllib.request.urlopen() function

The urllib.request module provides the most basic way to construct an HTTP request; it can be used to simulate the request-initiation process of a browser.

import urllib.request

response = urllib.request.urlopen("http://www.python.org")
# print(response.read().decode('utf-8'))
# print(type(response))  # the type of the response object
# html = response.read()  # read the page's HTML; the crawled content is a utf-8 encoded bytes object
# print(html)  # when printed, the string is prefixed with b, indicating a bytes object
#
# # To restore readable HTML (including any Chinese characters), decode it back into a Unicode string
# html = html.decode("utf-8")
# print(html)
# print("=======================")

Next, let's see what it returns. Use the type() function to output the response type:

print(type(response))
<class 'http.client.HTTPResponse'>

It can be seen that it is an object of type HTTPResponse. It mainly provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed.

Calling the read() method returns the content of the webpage, and reading the status attribute gives the status code of the result. For example, 200 means the request succeeded and 404 means the page was not found.

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
200
[('Connection', 'close'), ('Content-Length', '49243'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur'), ('Via', '1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Wed, 23 Sep 2020 00:49:55 GMT'), ('Via', '1.1 varnish'), ('Age', '1774'), ('X-Served-By', 'cache-bwi5142-BWI, cache-hkg17933-HKG'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '1, 2695'), ('X-Timer', 'S1600822196.530321,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

The first two outputs are the response status code and the response headers, and the last output obtains the Server value from the response headers by calling the getheader() method with the argument 'Server'. The result is nginx, which means the site is served by Nginx.

urlopen() function API:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
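
Of these, the data and timeout parameters are discussed below; cafile, capath and context relate to SSL certificates for HTTPS requests. As a minimal sketch (an illustration, not part of the original notes), an SSL context can be passed like this:

import ssl
import urllib.request

# Build a default SSL context and pass it to urlopen() for an HTTPS request
context = ssl.create_default_context()
response = urllib.request.urlopen('https://www.python.org', timeout=5, context=context)
print(response.status)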

data parameter

If this parameter is passed, the request method is no longer GET but POST.

The data parameter is optional. If you want to add it, the content must be in byte-stream form, i.e. the bytes type; if it is not, it needs to be converted with the bytes() method first.

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Here we pass a parameter word with the value hello. It needs to be converted into the bytes (byte stream) type. The conversion uses the bytes() method: its first argument must be of type str (string), so the urlencode() method from the urllib.parse module is used to convert the parameter dictionary into a string; the second argument specifies the encoding format, here utf8.
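
To make the two-step conversion explicit, here is a small sketch of the intermediate values (illustration only):

import urllib.parse

params = urllib.parse.urlencode({'word': 'hello'})
print(params)  # word=hello  -- a str produced by urlencode()
data = bytes(params, encoding='utf8')
print(data)    # b'word=hello'  -- the bytes object that urlopen() expects as data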

The site requested here, httpbin.org, provides HTTP request testing. The URL we request this time, http://httpbin.org/post, can be used to test POST requests: it echoes back information about the request, including the data parameter we passed.

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.5"
  },
  "json": null,
  "origin": "123.124.23.253",
  "url": "http://httpbin.org/post"
}

The parameter we passed appears in the form field, which indicates that a form submission was simulated and the data was transmitted via POST.

timeout parameter

The timeout parameter sets the timeout period in seconds: if a request exceeds this time without receiving a response, an exception is thrown. If the parameter is not specified, the global default timeout is used. It supports HTTP, HTTPS, and FTP requests.

In the first example below, the timeout is set to 0.1 seconds. Since the server does not respond within that time, a URLError exception is thrown. The exception belongs to the urllib.error module, and its cause is a timeout.

Therefore, you can use the timeout setting to skip the crawling of a page that does not respond for a long time. This can be achieved with a try...except statement, as the second example shows:

import urllib.request
 
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())

import socket
import urllib.request
import urllib.error
 
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
TIME OUT
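
The "global default time" mentioned above can be set with socket.setdefaulttimeout(); a brief sketch (assuming the standard socket module, as an illustration):

import socket
import urllib.request

# Requests made without an explicit timeout fall back to this global default (in seconds)
socket.setdefaulttimeout(10)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.status)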

2.urllib.request.Request class

Its constructor is as follows:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The first parameter, url, is the URL to request; it is mandatory, while the others are optional.

The second parameter, data, if passed, must be of type bytes (a byte stream). If the data is a dictionary, you can first encode it with urlencode() from the urllib.parse module.

The third parameter, headers, is a dictionary of request headers. We can pass it directly via the headers argument when constructing the request, or add headers afterwards by calling the add_header() method of the request instance.
The most common use of request headers is to disguise the crawler as a browser by modifying the User-Agent. The default User-Agent is Python-urllib; to pose as a Firefox browser, for example, you can set it to:

Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11

The fourth parameter, origin_req_host, refers to the host name or IP address of the requesting party.

The fifth parameter, unverifiable, indicates whether the request is unverifiable. The default is False, meaning the user does not have sufficient permission to choose whether to receive the result of this request. For example, if we request an image embedded in an HTML document but do not have permission to fetch the image automatically, then the value of unverifiable is True.

The sixth parameter, method, is a string indicating the HTTP method of the request, such as GET, POST, or PUT.
Let's construct a request by passing in multiple parameters:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Here we construct a request with four parameters: url is the request URL; headers specifies the User-Agent and Host; and data is converted into a byte stream with the urlencode() and bytes() methods. In addition, the request method is specified as POST.

The results are as follows:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "219.224.169.11",
  "url": "http://httpbin.org/post"
}

In addition, headers can also be added using the add_header() method:

req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

Verification

Some websites pop up an authentication prompt when opened, asking you to enter a username and password; the page can only be viewed after successful verification.

Such a request can be completed with the help of HTTPBasicAuthHandler; the relevant code is as follows:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
 
username = 'username'
password = 'password'
url = 'http://localhost:5000/'
 
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
 
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Here we first instantiate an HTTPBasicAuthHandler object, whose parameter is an HTTPPasswordMgrWithDefaultRealm object; we use its add_password() method to add the username and password, which establishes a Handler for processing authentication.

Next, we use this Handler with the build_opener() method to build an Opener. When this Opener sends a request, it is as if the authentication has already succeeded.

Finally, we use the Opener's open() method to open the link to complete the verification. The result obtained here is the source code of the page behind the authentication.

Proxy

When writing crawlers, using a proxy is often unavoidable. To add a proxy, you can do this:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
 
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Here we assume a proxy is running locally on port 9743.

ProxyHandler is used here; its parameter is a dictionary whose keys are protocol types (such as HTTP or HTTPS) and whose values are proxy links, and multiple proxies can be added.

Then, use this Handler and build_opener() method to construct an Opener, and then send the request.

Cookies

import http.cookiejar, urllib.request
 
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

First we declare a CookieJar object. Then we use HTTPCookieProcessor to build a Handler, and finally build an Opener with the build_opener() method and call its open() method.

The results are as follows:

BAIDUID=2E65A683F8A8BA3DF521469DF8EFF1E1:FG=1
BIDUPSID=2E65A683F8A8BA3DF521469DF8EFF1E1
H_PS_PSSID=20987_1421_18282_17949_21122_17001_21227_21189_21161_20927
PSTM=1474900615
BDSVRTM=0
BD_HOME=0

As you can see, the name and value of each cookie are output here.

Save Cookies

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Here CookieJar is replaced with MozillaCookieJar, which is needed when generating a file. It is a subclass of CookieJar that handles Cookies and file-related events, such as reading and saving Cookies, and it saves Cookies in the format used by Mozilla-style browsers.
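
The Read Cookies example below loads a file in LWPCookieJar format, so for reference here is a minimal sketch of saving in that format (an illustrative sketch, assuming the same baidu.com request as above):

import http.cookiejar, urllib.request

filename = 'cookies.txt'
# LWPCookieJar stores cookies in libwww-perl (LWP) format rather than Mozilla format
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)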

Read Cookies

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Here the load() method is called to read the local Cookies file and obtain its contents. The premise is that we have first generated Cookies in the LWPCookieJar format and saved them to a file; after loading the Cookies, we construct the Handler and Opener in the same way as before to complete the request.

Origin blog.csdn.net/qq_43328040/article/details/108761028