Part 2: The urllib module in Python 3

Original link: https://www.cnblogs.com/zhangxinqi/p/9170312.html

urllib is Python's built-in HTTP request library; it can be used without installing anything and contains four modules:

request: the most basic HTTP request module, used to simulate sending requests

error: the exception handling module; if an error occurs, these exceptions can be caught

parse: a utility module that provides many URL handling methods, such as splitting, parsing, and merging

robotparser: mainly used to parse a site's robots.txt file and then determine which pages may be crawled

1. urllib.request.urlopen()

urllib.request.urlopen(url,data=None,[timeout,],cafile=None,capath=None,cadefault=False,context=None)

urlopen() sends a request to the given site and returns an HTTPResponse object, which provides the following methods and properties:

Methods: read(), readinto(), getheader(name), getheaders(), fileno()

Properties: msg, version, status, reason, debuglevel, closed

import urllib.request

response=urllib.request.urlopen('https://www.python.org')  # request the site and get an HTTPResponse object
#print(response.read().decode('utf-8'))   # returns the page content
#print(response.getheader('server')) # returns the Server value from the response headers
#print(response.getheaders()) # returns the response headers as a list of (name, value) tuples
#print(response.fileno()) # returns the file descriptor
#print(response.version)  # returns the HTTP version
#print(response.status)  # returns the status code, e.g. 200; 404 means the page was not found
#print(response.debuglevel) # returns the debug level
#print(response.closed)  # returns a boolean indicating whether the object is closed
#print(response.geturl()) # returns the URL that was retrieved
#print(response.info()) # returns the page's header information
#print(response.getcode()) # returns the HTTP status code of the response
#print(response.msg)  # returns "OK" on success
#print(response.reason) # returns the status message

Parameters of the urlopen() method:

url: the address to request; a str, or a Request object

data: optional; it must be a byte stream (bytes type). If the data parameter is passed, urlopen sends the request using POST

from urllib.request import urlopen
import urllib.parse

data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8') 
#data must be bytes, so it is converted with the bytes() function;
#urlencode() from urllib.parse converts the parameter dict into a string, and the encoding is specified
response = urlopen('http://httpbin.org/post',data=data)
print(response.read())

#output
b'{
"args":{},
"data":"",
"files":{},
"form":{"word":"hello"},  #form字段表明模拟以表单的方法提交数据,post方式传输数据
"headers":{"Accept-Encoding":"identity",
    "Connection":"close",
    "Content-Length":"10",
    "Content-Type":"application/x-www-form-urlencoded",
    "Host":"httpbin.org",
    "User-Agent":"Python-urllib/3.5"},
"json":null,
"origin":"114.245.157.49",
"url":"http://httpbin.org/post"}\n'

timeout: sets a timeout in seconds; if no response is received within that time, an exception is raised. HTTP, HTTPS, and FTP requests are supported.

import urllib.request
response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)  # set the timeout to 0.1 seconds, which will raise an exception
print(response.read())

#output
urllib.error.URLError: <urlopen error timed out>

# the exception can be caught with exception handling
import urllib.request
import urllib.error
import socket
try:
    response=urllib.request.urlopen('http://httpbin.org/get',timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout): # check whether the object is an instance of socket.timeout
        print(e.reason) # print the error message
#output
timed out

Other parameters: the context parameter must be of type ssl.SSLContext and is used to specify SSL settings. In addition, the cafile and capath parameters specify a CA certificate file and a directory of CA certificates; they are used for HTTPS links.
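
A minimal sketch of using the context parameter (assuming the default system CA store is acceptable; cafile/capath would instead point at a specific CA certificate file or directory):

#!/usr/bin/env python
#coding:utf8
import ssl
import urllib.request

context = ssl.create_default_context()  # build an SSLContext with default certificate checks
response = urllib.request.urlopen('https://www.python.org', context=context)
print(response.status)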

2. urllib.request.Request()

urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)

Parameters:

url: the URL to request; it is required, while the other parameters are optional

data: the data to send; it must be a byte stream (bytes type). If it is a dictionary, it can first be encoded with urlencode() from the urllib.parse module

headers: a dictionary of request headers. It can be passed when constructing the Request, or added afterwards by calling add_header() on the Request instance. For example, to disguise the request as a browser, the User-Agent can be set like this:

{'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}

origin_req_host: the host name or IP address of the requester

unverifiable: indicates whether the request is unverifiable, i.e. the user had no chance to approve it; the default is False. For example, when an image embedded in a page is fetched automatically, that request is unverifiable and the value is True

method: a string indicating the request method, such as GET, POST, or PUT

#!/usr/bin/env python
#coding:utf8
from urllib import request,parse

url='http://httpbin.org/post'
headers={
    'User-Agent':'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)',
    'Host':'httpbin.org'
}  # define the header information

dict={'name':'germey'}
data = bytes(parse.urlencode(dict),encoding='utf-8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
#req.add_header('User-Agent','Mozilla/5.0 (compatible; MSIE 8.4; Windows NT)') # headers can also be added with the Request object's add_header() method

response = request.urlopen(req) 
print(response.read())

3. Advanced classes in urllib.request

1. The BaseHandler class in the urllib.request module is the parent class of all other handlers. A handler is a processor used, for example, to handle login authentication, cookies, proxy settings, and redirects.

It provides methods that can be used directly or by derived classes:

add_parent(director): adds the director as the parent

close(): closes its parent

parent(): opens using a different protocol or handles errors

default_open(req): a catch-all for all URLs and protocols, called before the protocol-specific open methods

Handler subclasses include:

HTTPDefaultErrorHandler: handles HTTP response errors; errors are raised as HTTPError exceptions

HTTPRedirectHandler: handles redirects

HTTPCookieProcessor: handles cookies

ProxyHandler: used to set a proxy; by default the proxy is empty

HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords

HTTPBasicAuthHandler: manages authentication; if a link requires authentication when it is opened, this handler can be used to perform it

 

2. The OpenerDirector class is a higher-level class used to open URLs.

It opens a URL in three stages; the order in which the methods are called within each stage is determined by sorting the handler instances (a minimal handler sketch follows this list):

  • first, every handler's protocol_request() method is called to pre-process the request,

  • then protocol_open() is called to handle the request,

  • finally, protocol_response() is called to post-process the response.
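
As a minimal sketch of this mechanism (the handler class and the X-Demo header below are made up purely for illustration), a handler that defines http_request() takes part in the pre-processing stage:

#!/usr/bin/env python
#coding:utf8
import urllib.request

class DemoHeaderHandler(urllib.request.BaseHandler):
    def http_request(self, req):       # called in the protocol_request() pre-processing stage
        req.add_header('X-Demo', '1')  # hypothetical header, only to show the hook
        return req                     # the (possibly modified) request must be returned

opener = urllib.request.build_opener(DemoHeaderHandler())
response = opener.open('http://httpbin.org/get')
print(response.read().decode('utf-8'))  # httpbin echoes the request headers back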

3. The urlopen() method described earlier is an Opener provided by urllib. To get more advanced features such as cookie handling, proxy settings, and password management, we use handlers to build our own Opener.

An Opener provides the following methods:

  • add_handler(handler): adds a handler to the opener

  • open(url, data=None[, timeout]): opens the given URL in the same way as urlopen()

  • error(proto, *args): handles errors for the given protocol
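
A quick sketch of these methods: build_opener() returns an OpenerDirector, open() behaves like urlopen(), and install_opener() can optionally make this opener the one urlopen() uses from then on:

#!/usr/bin/env python
#coding:utf8
import urllib.request

opener = urllib.request.build_opener()            # an OpenerDirector with the default handlers
response = opener.open('http://httpbin.org/get')  # same usage as urlopen()
print(response.status)

urllib.request.install_opener(opener)             # make urlopen() use this opener globally
print(urllib.request.urlopen('http://httpbin.org/get').status)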

(1) password authentication:

#!/usr/bin/env python
#coding:utf8
from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username='username'
password='password'
url='http://localhost'
p=HTTPPasswordMgrWithDefaultRealm() # create a password manager instance
p.add_password(None,url,username,password) # add the username and password to the manager
auth_handler=HTTPBasicAuthHandler(p) # build an auth handler from the password manager
opener=build_opener(auth_handler)  # build an Opener
try:
    result=opener.open(url)  # open the link; authentication happens here and the result is the authenticated page content
    html=result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

(2) proxy settings:

#!/usr/bin/env python
#coding:utf8
from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener

proxy_handler=ProxyHandler({
    'http':'http://127.0.0.1:8888',
    'https':'http://127.0.0.1:9999'
})
opener=build_opener(proxy_handler) # build an Opener
try:
    response=opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

(3) Cookies:

Getting a site's cookies:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar() # instantiate a CookieJar object
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
print(cookie)
for item in cookie:
    print(item.name+"="+item.value)
(4) Saving cookies to a file in the Mozilla browser cookie format:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
filename='cookies.txt'
# create a cookie jar that saves cookies in the Mozilla browser format
cookie=http.cookiejar.MozillaCookieJar(filename=filename)
#cookie=http.cookiejar.CookieJar() # plain CookieJar object
handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
cookie.save(ignore_discard=True,ignore_expires=True)
 

(5) Cookies can also be saved to a file in libwww-perl (LWP) format:

cookie=http.cookiejar.LWPCookieJar(filename=filename)

Reading cookies from a file:

#!/usr/bin/env python
#coding:utf8
import http.cookiejar,urllib.request
#filename='cookiesLWP.txt'
# commented-out alternatives: MozillaCookieJar saves in the Mozilla browser format
#cookie=http.cookiejar.MozillaCookieJar(filename=filename)
#cookie=http.cookiejar.LWPCookieJar(filename=filename) # cookies in LWP format
#cookie=http.cookiejar.CookieJar() # plain CookieJar object
cookie=http.cookiejar.LWPCookieJar()
cookie.load('cookiesLWP.txt',ignore_discard=True,ignore_expires=True)

handler=urllib.request.HTTPCookieProcessor(cookie) # build a handler
opener=urllib.request.build_opener(handler) # build an Opener
response=opener.open('http://www.baidu.com') # send the request
print(response.read().decode('utf-8'))

4. Exception handling

The error module of urllib defines the exceptions produced by the request module; if a problem occurs, the request module raises an exception defined in the error module.

1. URLError

The URLError class comes from urllib's error module. It inherits from OSError and is the base exception class of the error module; errors produced by the request module can be handled by catching this class.

It has only one attribute, reason, which returns the reason for the error.

 

#!/usr/bin/env python
#coding:utf8
from urllib import request,error

try:
    response=request.urlopen('https://hehe.com/index')
except error.URLError as e:
    print(e.reason)  # if the page does not exist the program does not crash; the reason of the caught exception is printed (Not Found)

The reason attribute may also be an object rather than a string, for example socket.timeout:

 

#!/usr/bin/env python
#coding:utf8

import socket
import urllib.request
import urllib.error
try:
    response=urllib.request.urlopen('https://www.baidu.com',timeout=0.001)
except urllib.error.URLError as e:
    print(e.reason)
    if isinstance(e.reason,socket.timeout):
        print('time out')

2. HTTPError

It is a subclass of URLError, designed to handle HTTP request errors, such as authentication failures. It has three properties:

code: the HTTP status code, e.g. 404 for a page that does not exist or 500 for a server error

reason: the same as in the parent class; returns the reason for the error

headers: returns the response headers

#!/usr/bin/env python
#coding:utf8
from urllib import request,error

try:
    response=request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:  # catch the subclass exception first
    print(e.reason,e.code,e.headers,sep='\n')
except error.URLError as e:  # then catch the parent class exception
    print(e.reason)
else:
    print('request successfully')

5. Parsing links

The urllib library provides the parse module, which defines a standard interface for processing URLs, such as extracting, merging, and converting the parts of a URL. It supports URL processing for the following protocols:

file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, wais

1. urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

As the API shows, urlparse() takes three parameters:

urlstring: the URL to parse; a string

scheme: the default protocol, e.g. http or https. It takes effect only when the URL itself contains no protocol; if the URL specifies one, the URL's protocol is used

allow_fragments: whether to parse the fragment, i.e. the anchor. If set to False, the fragment part is ignored; otherwise it is not

This method identifies and splits a URL into its parts: scheme (protocol), netloc (domain), path, params (parameters), query (query string), and fragment (anchor).

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlparse
result=urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result,sep='\n')  # the result is a named tuple
print(result.scheme,result[0])  # values can be accessed by attribute or by index
print(result.netloc,result[1])
print(result.path,result[2])
print(result.params,result[3])
print(result.query,result[4])
print(result.fragment,result[5])

#output
# the result is a ParseResult object containing six parts:
# scheme (protocol), netloc (domain), path, params (parameters), query (query string), fragment (anchor)

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', 
    path='/index.html', params='user', query='id=5', fragment='comment')
http http
www.baidu.com www.baidu.com
/index.html /index.html
user user
id=5 id=5
comment comment

Specifying a default scheme and ignoring the anchor with allow_fragments:

from urllib.parse import urlparse
result=urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https',allow_fragments=False)
print(result) 

#output
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', 
        params='user', query='id=5#comment', fragment='')

2. urlunparse()

The opposite of urlparse(): it accepts an iterable object such as a list or tuple and constructs a URL from it.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data)) # construct a complete URL

#output
http://www.baidu.com/index.html;user?a=6#comment

3. urlsplit()

Similar to urlparse(), but it returns only five parts: the params part is merged into the path.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit
result=urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

#output
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', 
        query='id=5', fragment='comment')

4. urlunsplit()

Similar to urlunparse(), it combines the parts of a link into a complete URL. The argument is also an iterable such as a list or tuple; the only difference is that its length must be 5, since params is omitted.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlsplit,urlunsplit
data=['http','www.baidu.com','index.html','a=5','comment']
result=urlunsplit(data)
print(result)

#output
http://www.baidu.com/index.html?a=5#comment

5. urljoin()

It builds a complete URL by combining a base URL with another URL. It uses the scheme (protocol), netloc (domain), and path of the base URL to fill in whatever is missing from the new link, and returns the result.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com','index.html'))
print(urljoin('http://www.baidu.com','http://cdblogs.com/index.html'))
print(urljoin('http://www.baidu.com/home.html','https://cnblog.com/index.html'))
print(urljoin('http://www.baidu.com?id=3','https://cnblog.com/index.html?id=6'))
print(urljoin('http://www.baidu.com','?id=2#comment'))
print(urljoin('www.baidu.com','https://cnblog.com/index.html?id=6'))

#output
http://www.baidu.com/index.html
http://cdblogs.com/index.html
https://cnblog.com/index.html
https://cnblog.com/index.html?id=6
http://www.baidu.com?id=2#comment
https://cnblog.com/index.html?id=6

base_url provides three elements: scheme, netloc, and path. If any of them is missing from the new link, it is filled in from base_url; if the new link already has it, the new link's value is used. The params, query, and fragment of base_url have no effect. With the urljoin() method we can easily parse, join, and generate links.

6. urlencode()

urlencode() is useful when constructing GET request parameters: it converts a dictionary into GET parameters.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode
params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # convert the dict into GET parameters
print(url)

#output
http://www.baidu.com?password=123&username=zs

7. parse_qs()

parse_qs() is the opposite of urlencode(): it deserializes, converting GET parameters back into a dictionary.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode,parse_qs,urlsplit
params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # convert the dict into GET parameters

query=urlsplit(url).query  # get the query part of the URL
print(parse_qs(query))  # convert the GET parameters back into a dict

#output
{'username': ['zs'], 'password': ['123']}

8. parse_qsl()

It converts the parameters into a list of tuples.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import urlencode,urlsplit,parse_qsl

params = {'username':'zs','password':'123'}
base_url='http://www.baidu.com'
url=base_url+'?'+urlencode(params) # convert the dict into GET parameters

query=urlsplit(url).query  # get the query part of the URL
print(parse_qsl(query)) # convert into a list of (name, value) tuples

#output
[('username', 'zs'), ('password', '123')]

9. quote()

This method converts content into URL-encoded form. For example, parameters containing Chinese characters can sometimes cause garbling; in that case this method converts the Chinese characters into URL encoding.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote
key='中文'
url='https://www.baidu.com/s?key='+quote(key)
print(url)

#output
https://www.baidu.com/s?key=%E4%B8%AD%E6%96%87

10. unquote()

The opposite of quote(): it is used to decode a URL-encoded string.

#!/usr/bin/env python
#coding:utf8
from urllib.parse import quote,urlsplit,unquote
key='中文'
url='https://www.baidu.com/s?key='+quote(key)
print(url)
unq=urlsplit(url).query.split('=')[1] # get the parameter value

print(unquote(unq))  # decode the parameter

6. Parsing the Robots protocol

Using urllib's robotparser module, we can analyze a website's Robots protocol.

1. The Robots protocol

The Robots protocol, also called the crawler or robots protocol, whose full name is the Robots Exclusion Protocol, is used to tell crawlers and search engines which pages may be crawled and which may not. It is usually a text file called robots.txt placed in the root directory of a site.

When a search crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls according to the scope defined there; if not, the crawler accesses every page it can reach.

Let's look at an example robots.txt:

User-agent: *
Disallow: /
Allow: /public/

This allows all search crawlers to crawl only the public directory. Save the above content as a robots.txt file and place it in the root directory of the site, alongside the site's entry file (e.g. index.html).

User-agent specifies the name of the search crawler. Setting it to * means the rules apply to any crawler; setting it to Baiduspider means the rules apply only to Baidu's crawler. Multiple User-agent records can restrict multiple crawlers, but at least one must be specified.

(1) Some common search crawler names:

BaiduSpider: Baidu's crawler (www.baidu.com)

Googlebot: Google's crawler (www.google.com)

360Spider: 360's crawler (www.so.com)

YodaoBot: Youdao's crawler (www.youdao.com)

ia_archiver: Alexa's crawler (www.alexa.cn)

Scooter: AltaVista's crawler (www.altavista.com)

Disallow specifies the directories that may not be crawled; setting it to / as in the example above means that no pages may be crawled at all.

Allow is usually used together with Disallow to make exceptions to a restriction; in the example, setting it to /public/ means that, although nothing else may be crawled, the public directory may be.

Configuration examples:

# Block all crawlers
User-agent: *
Disallow: /

# Allow all crawlers to access any directory (leaving the file empty also works)
User-agent: *
Disallow:

# Block all crawlers from certain directories
User-agent: *
Disallow: /home/
Disallow: /tmp/

# Allow only one specific crawler
User-agent: BaiduSpider
Disallow:
User-agent: *
Disallow: /

2. robotparser

The robotparser module is used to parse robots.txt. It provides a single class, RobotFileParser, which can determine, based on a site's robots.txt file, whether a given crawler has permission to crawl a given page.

urllib.robotparser.RobotFileParser(url='')

(1) Commonly used methods of the RobotFileParser class:

set_url(): sets the URL of the robots.txt file; if the URL was already passed when the RobotFileParser object was created, this method is not needed

read(): reads and analyzes the robots.txt file; it returns nothing, but the read and analysis must be performed before any checks

parse(): parses robots.txt content; the argument is some lines of robots.txt, which are analyzed according to the robots.txt syntax rules

can_fetch(): takes two arguments, a User-agent and the URL to crawl; it returns whether that crawler may fetch the URL, i.e. True or False

mtime(): returns the time robots.txt was last fetched and analyzed

modified(): sets the last fetch-and-analysis time of robots.txt to the current time

#!/usr/bin/env python
#coding:utf8
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()  # create the object
# set the robots.txt URL; it can also be passed when creating the object
rp.set_url('https://www.cnblogs.com/robots.txt')
rp.read()  # read and parse the file
# check whether the link may be crawled
print(rp.can_fetch('*','https://i.cnblogs.com/EditPosts.aspx?postid=9170312&update=1')) 
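
parse(), modified() and mtime() can be exercised in a similar way; the following is a small sketch that downloads robots.txt by hand and feeds its lines to parse():

#!/usr/bin/env python
#coding:utf8
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
lines = urlopen('https://www.cnblogs.com/robots.txt').read().decode('utf-8').splitlines()
rp.parse(lines)     # analyze the robots.txt content line by line
rp.modified()       # record the current time as the last fetch-and-analysis time
print(rp.mtime())   # the last time robots.txt was fetched and analyzed
print(rp.can_fetch('*','https://www.cnblogs.com/'))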

 
