Crawler notes: a detailed explanation of the urllib library

If you are not yet familiar with crawlers, please see the first crawler lesson:
the basic principles of crawlers

What is Urllib

urllib is Python's built-in HTTP request library. It contains four modules:

  • urllib.request: the request module
  • urllib.error: the exception handling module
  • urllib.parse: the URL parsing module
  • urllib.robotparser: the robots.txt parsing module (identifies which parts of a website may be crawled); see the short sketch after this list

This article mainly explains the first three modules.
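
Since robotparser is not covered later, here is a minimal sketch of how urllib.robotparser is typically used (the robots.txt URL is only an example):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')  # example robots.txt; any site works
rp.read()
print(rp.can_fetch('*', 'https://www.baidu.com/baidu'))  # may the '*' user agent crawl this URL?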

Difference from the urllib2 module in Python 2

Python 2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')

Python 3
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

Usage explanation

Making requests
urlopen

The first parameter url is the address to request; the second parameter data is required for POST requests; the third parameter timeout sets a time limit for the request. The remaining parameters are rarely needed and are not used here.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open a webpage with a GET request

import urllib.request

response = urllib.request.urlopen('https://www.baidu.com/')
print(response.read().decode('utf-8'))  # decode the byte stream

Open a webpage with a POST request

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')  # encode the form data
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

timeout
Sets a timeout: if the request does not complete within the specified time, an error is raised.

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())

The request above cannot finish within 0.1 seconds and raises an exception. It can be caught like this:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

response

Response type

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Status code and response headers
The response includes two important pieces of information: the status code and the response headers.

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.status)  # status code
print(response.getheaders())  # all response headers
print(response.getheader('Server'))  # look up a specific response header

Reading the response: read()

# read() returns byte-stream data, which needs to be decoded as utf-8

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Request
With urlopen we can request a web page and get the response, but urlopen cannot attach extra information such as request headers, so a more flexible way to build requests is needed: the Request class.

Read webpage

import urllib.request

request = urllib.request.Request('https://python.org')  # declare a Request object
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Add request headers and send the request as a POST at the same time

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')  # encode the form data
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Above, the request headers were added through the headers parameter; you can also use the add_header method:

from urllib import request, parse

url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handler: setting a proxy

To route requests through a proxy server, build an opener with a ProxyHandler:

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',   # proxy server address
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

Cookie
A cookie is a piece of text saved on the client side and used to record the user's identity.

If we open the Taobao homepage while logged in, the username is displayed in the upper left corner.

Inspect the elements, find the Taobao cookies, and right-click to clear them.

Refresh the page and the login information disappears.

Cookies therefore maintain the user's login state; when writing a crawler, carrying a cookie makes it possible to crawl content that requires login.
1. Print cookies

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()  # declare a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')  # Baidu; I am logged into a Baidu account here
for item in cookie:
    print(item.name+"="+item.value)

2. Save the cookie as a text file

Way 1

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)

Way 2

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)

3. Read web pages with cookies

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()  # must match the CookieJar type used when saving
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
print(response.read().decode('utf-8'))

Exception handling

Catch URLError to handle a failed request:

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

HTTPError is a subclass of URLError, so catch the more specific HTTPError first and then URLError:

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

The reason attribute can also be inspected to determine the exact cause, for example a timeout:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

URL parsing
urlparse
Pass in a URL and it is split into its components: protocol, domain name, path, parameters, query, fragment.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
# <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

Specify a default protocol; the incoming URL here has no protocol, so the scheme argument is used:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

If the incoming URL already carries a protocol, the scheme argument is ignored and the URL's own protocol is kept:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

allow_fragments: whether to parse the anchor (fragment) separately. With allow_fragments=False, the fragment is not split off but kept as part of the preceding component (here, the query):

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)  # the fragment stays inside the query
print(result)
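
A small additional sketch: when the URL has no params or query string, the unparsed fragment is folded into the path instead.

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)  # '#comment' stays inside path; fragment is empty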

urlunparse: assemble a URL from its parts

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']  # scheme, netloc, path, params, query, fragment
print(urlunparse(data))  # http://www.baidu.com/index.html;user?a=6#comment

urljoin
Joins a base URL with a second URL: fields present in the second URL take precedence, and missing fields are filled in from the base.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

urlencode
Converts a dictionary into GET request parameters (a query string).

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=germey&age=22

Author: Yudeng Wu (Electrical).
Creating content is not easy; please leave a like before you go.
