Data acquisition - crawler 2 (the urllib library explained)

Urllib library

urllib is Python's built-in HTTP request library and is used to send requests. It consists primarily of the following basic modules:

  1. urllib.request: the request module, used to simulate opening a page and sending requests.
  2. urllib.error: the exception handling module, which collects and handles the errors returned by requests.
  3. urllib.parse: a parsing module that provides many methods for handling URLs.
  4. urllib.robotparser: parses robots.txt files to decide whether a page may be crawled (see the sketch after this list).
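As a minimal sketch of the last module, assuming www.baidu.com as the target site, urllib.robotparser can be used like this:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')  # location of the site's robots.txt
rp.read()                                      # download and parse the file
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))  # True/False: may a generic crawler fetch this URL?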

Request

Although urllib is a built-in Python library, it still needs to be imported. After importing, urllib.request.urlopen() can send a request directly to the server. When the request carries a data argument it is a POST request; otherwise it is a GET request. The detailed code is as follows:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  # signature of urlopen; mainly the first three parameters are used

# GET request
import urllib.request  # import the required library

response = urllib.request.urlopen('http://www.baidu.com')  # send the request
# Print the response. If the usual encodings cannot decode the page, check the page source:
# the charset attribute in the first line of <head> usually gives the correct encoding.
print(response.read().decode('utf-8'))

# POST request
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# a POST request carries one extra data argument compared with GET
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
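httpbin.org/post echoes the request back, so the printed JSON contains a "form" field with the submitted word=hello data.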

# set a timeout
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # wait at most 0.1 s for a response
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check the type of error
        print('TIME OUT')

urlopen() can send a request, but it does not directly support further settings such as request headers. In that case you can first declare a Request object, pass the relevant information into it, and finally pass the Request object to urlopen().

from urllib import request, parse  # import the required modules

url = 'http://httpbin.org/post'  # target URL
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}  # set the request headers

form_data = {
    'name': 'Germey'
}  # set the form data

data = bytes(parse.urlencode(form_data), encoding='utf8')  # encode the form data into bytes
req = request.Request(url=url, data=data, headers=headers, method='POST')  # build the Request object
# if the Request lacks a header, urllib provides the add_header method
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)  # pass the Request object to urlopen
print(response.read().decode('utf-8'))  # print the response

Response

From the response returned by the server we can get its type, status code, and response headers.

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))  # get the response type
print(response.status)  # get the status code
print(response.getheaders())  # get the response headers
print(response.getheader('Server'))  # get a single field from the response headers
print(response.read().decode('utf-8'))  # print the response body

Handler

In addition to plain requests, urllib provides many extra features, which are commonly implemented with handlers.

Proxy

To use a proxy, set up a ProxyHandler, build it into an opener, and fetch the page with the opener's open() method. The urlopen() calls above also build an opener internally and then call open() to open the page.

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
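Another handler urllib provides is HTTPBasicAuthHandler, for pages protected by HTTP basic authentication. A minimal sketch follows; the username, password, and URL are placeholder assumptions:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'user'                # placeholder credentials
password = 'password'
url = 'http://localhost:5000/'   # placeholder URL of a site requiring basic auth

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)  # None means the default realm
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    print(result.read().decode('utf-8'))
except URLError as e:
    print(e.reason)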

Cookie

Cookies are used to maintain login state, which is needed when crawling sites that require signing in. Common ways of handling cookies are shown below:

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()  # first create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)  # let a handler process the cookies
opener = urllib.request.build_opener(handler)  # build the opener
response = opener.open('http://www.baidu.com')  # open the page
for item in cookie:
    print(item.name + "=" + item.value)  # print the cookie values
    
import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)  # store cookies in Firefox (Mozilla) format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # save the cookies to the file
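The load example below expects a file saved in LWP format, so as a minimal sketch (again assuming www.baidu.com as the site), the cookies can first be saved with LWPCookieJar:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)  # store cookies in LWP (libwww-perl) format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # save to cookie.txt in LWP format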

import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()  # another storage format (LWP)
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)  # load cookies from a file saved in LWP format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

Exception Handling

Python defines only two error classes for this, URLError and its subclass HTTPError. With try...except you can determine the type of error and catch it.

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)


from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
    
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

URL parsing

urllib.parse is like a toolkit with many useful functions.

  1. urlparse: splits a URL and returns a ParseResult object, which holds each part of the URL (a short attribute-access sketch follows the examples below).
# get the parts of a URL
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
# supply a default for a missing part; if the URL already contains it, the default has no effect
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# the split result can also be changed by declaring a part to be absent
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)  # treat the fragment as not present
print(result)
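The ParseResult returned above is a named tuple, so each part can be read by attribute or by index; a short sketch:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
# attribute access and index access give the same value
print(result.scheme, result[0])       # http http
print(result.netloc, result[1])       # www.baidu.com www.baidu.com
print(result.query, result.fragment)  # id=5 comment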
  2. urlunparse: splices the parts back together into a URL.
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
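Running the snippet above prints http://www.baidu.com/index.html;user?a=6#comment.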
  3. urljoin: merges two URLs. A URL can be divided into six parts; the function takes the second argument as the main URL and fills in any parts it lacks from the first (base) URL to obtain the final URL.
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
  4. urlencode: converts a dictionary of parameters into GET request parameters, so the resulting URL can be used directly.
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)  # concatenate the URL
print(url)
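Since the dictionary preserves insertion order, this prints http://www.baidu.com?name=germey&age=22, which can be requested directly.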
