Python crawler development: a detailed guide to the urllib module, with examples

Every function a crawler needs can be found in urllib. Learning this standard library first gives a deeper understanding, and makes the more convenient requests library easier to pick up later.

First of all, the Python 2.x to Python 3.x correspondences:

import urllib2 used in Python 2.x corresponds to import urllib.request, urllib.error in Python 3.x

import urllib used in Python 2.x corresponds to import urllib.request, urllib.error, urllib.parse in Python 3.x

import urlparse used in Python 2.x corresponds to import urllib.parse in Python 3.x

urlopen used in Python 2.x corresponds to urllib.request.urlopen in Python 3.x

urlencode used in Python 2.x corresponds to urllib.parse.urlencode in Python 3.x

urllib.quote used in Python 2.x corresponds to urllib.parse.quote in Python 3.x

cookielib.CookieJar used in Python 2.x corresponds to http.cookiejar.CookieJar in Python 3.x

urllib2.Request used in Python 2.x corresponds to urllib.request.Request in Python 3.x

urllib is part of the Python standard library, so it requires no installation and can be used directly.

The urllib module provides the following functionality:

page requests (urllib.request)
URL parsing (urllib.parse)
proxy and cookie handling
exception handling (urllib.error)
a robots.txt parsing module (urllib.robotparser), sketched briefly below
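
urllib.robotparser is not used again in this article; as a minimal sketch (the URL is only illustrative), it reads a site's robots.txt and reports whether a given crawler may fetch a page:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.jb51.net/robots.txt')
rp.read()
# may a generic crawler ('*') fetch the front page?
print(rp.can_fetch('*', 'https://www.jb51.net/'))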
The urllib.request module
1. urllib.request.urlopen
urlopen has three commonly used parameters:

r = urllib.request.urlopen(url, data, timeout)

url: the link to open, in the format protocol://hostname:[port]/path

data: optional extra data to send; it must be a byte stream (bytes type), e.g. produced with bytes() or str.encode(). If data is supplied, the request is no longer a GET but a POST.

timeout: timeout in seconds
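
As a quick illustration of the timeout parameter (a deliberately tiny value is used here just to force the error path; the URL is illustrative), a request that exceeds the limit raises an exception:

import urllib.request
import urllib.error

try:
    r = urllib.request.urlopen('https://www.jb51.net/', timeout=0.01)
except urllib.error.URLError as e:
    print('request timed out:', e.reason)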
GET request

import urllib.request

r = urllib.request.urlopen('https://www.jb51.net/')
first_line = r.readline()  # read the first line of the HTML page
data = r.read()            # read the remaining content
f = open("./1.html", "wb") # save the page locally
f.write(first_line + data)
f.close()

The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): used exactly like the corresponding file-object methods
info(): returns an http.client.HTTPMessage object holding the header information returned by the remote server
getcode(): returns the HTTP status code; for HTTP requests, 200 means the request completed successfully and 404 means the URL was not found
geturl(): returns the URL that was requested
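
For instance, a minimal sketch of these accessor methods (the URL is illustrative):

import urllib.request

r = urllib.request.urlopen('https://www.jb51.net/')
print(r.getcode())  # e.g. 200 when the request succeeds
print(r.geturl())   # the URL that was actually fetched (after redirects)
print(r.info())     # the response headers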

urllib.parse.quote(url) and urllib.parse.quote_plus(url) encode a string (e.g. a keyword containing spaces or non-ASCII characters) into a form that urlopen can accept.
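
A minimal sketch of both functions (the keyword is just an example):

from urllib import parse

keyword = 'python 爬虫'            # contains a space and non-ASCII characters
print(parse.quote(keyword))       # 'python%20%E7%88%AC%E8%99%AB'
print(parse.quote_plus(keyword))  # 'python+%E7%88%AC%E8%99%AB'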

POST request

import urllib.request
import urllib.parse

url = 'https://passport.jb51.net/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

During operations such as registration and login, we transmit information to the server through a POST form.

In this case, we need to analyze the page structure and construct the POST form data, encode it with urlencode(), which returns a string, and then encode that string as 'utf-8', because the data argument must be a bytes object or a file object. Finally, we wrap postdata in a Request() object and send the request with urlopen().

2. urllib.request.Request
The urlopen() method can issue the most basic requests, but a few simple parameters are not enough to build a complete request. If the request needs headers (for example, a User-Agent to simulate a browser), we can use the more powerful Request class to build it:

import urllib.request
import urllib.parse

url = 'https://passport.jb51.net/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
# a browser-like User-Agent header (illustrative value)
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib.request.Request(url, postdata, headers)
r = urllib.request.urlopen(req)

3. urllib.request.BaseHandler
In the process above we can construct a Request, but what about more advanced operations, such as handling cookies or setting a proxy?

That calls for a more powerful tool: the Handler. The basic urlopen() function does not support authentication, cookies, proxies, or other advanced features. To support them, you must create a custom opener object with the build_opener() function.

Let us first look at urllib.request.BaseHandler: it is the parent class of all other handlers and provides the most basic handler methods.

HTTPDefaultErrorHandler handles HTTP response errors; errors are raised as HTTPError exceptions.

HTTPRedirectHandler handles redirects.

HTTPCookieProcessor handles cookies.

ProxyHandler sets a proxy; the default proxy is empty.

HTTPPasswordMgr manages passwords; it maintains a table of usernames and passwords.

HTTPBasicAuthHandler manages authentication; if a link requires authentication when opened, this handler can solve the authentication problem. Several handlers can also be combined in one opener, as the sketch below shows.
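
A minimal sketch of combining handlers (the proxy address is a placeholder):

import http.cookiejar
import urllib.request

# combine a cookie handler and a proxy handler in a single opener
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie),
    urllib.request.ProxyHandler({'http': '127.0.0.1:8080'}),  # placeholder proxy
)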
Proxy Settings

def use_proxy(proxy_addr, url):
    import urllib.request
    # build the proxy handler
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    # build the opener object
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # install it globally:
    # urllib.request.install_opener(opener)
    # data = urllib.request.urlopen(url).read().decode('utf8')  # open via the global opener
    data = opener.open(url).read()  # or open directly through the opener
    return data

proxy_addr = '61.163.39.70:9999'
data = use_proxy(proxy_addr, 'http://www.jb51.net')
print(len(data))

An opener object is usually created with build_opener().

install_opener(opener) installs that opener as the global opener used by urlopen().
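
In other words (a minimal sketch):

import urllib.request

opener = urllib.request.build_opener()
urllib.request.install_opener(opener)
# from here on, every urllib.request.urlopen() call goes through this opener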

Using cookies
Getting cookies into a variable

import http.cookiejar, urllib.request

# create a CookieJar object with http.cookiejar.CookieJar()
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
# use HTTPCookieProcessor to create a cookie handler, then build an opener from it
opener = urllib.request.build_opener(handler)
# install the opener globally
urllib.request.install_opener(opener)
response = urllib.request.urlopen('https://www.jb51.net')
# response = opener.open('https://www.jb51.net')
for item in cookie:
    print('Name = ' + item.name)
    print('Value = ' + item.value)

First declare a CookieJar object, then use HTTPCookieProcessor to build a handler, build an opener with the build_opener method, and call open(). Finally, loop over the CookieJar and print the cookies.

Saving cookies to a local file

import http.cookiejar
import urllib.request

# the file to save cookies in: cookie.txt in the same directory
filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = http.cookiejar.MozillaCookieJar(filename)
# use urllib's HTTPCookieProcessor to create a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
# make a request; this works like urlopen
response = opener.open("https://www.jb51.net")
# save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

Exception Handling
The exception-handling structure is as follows:

try:
    # the code to execute
    print(...)
except:
    # what to run if the code in the try block raises an exception
    print(...)
else:
    # runs if the try block did not raise an exception
    print(...)
finally:
    # the code in the finally block always runs, no matter what
    print(...)

Causes of URLError:

1. The network is not connected (i.e., no Internet access)

from urllib import request, error

try:
    r = request.urlopen('https://www.jb51.net')
except error.URLError as e:
    print(e.reason)

2. The requested page does not exist (HTTPError)

The client sends a request to the server; if the requested resource is fetched successfully, the returned status code 200 indicates a successful response. If the requested resource does not exist, the server usually returns a 404 error.

from urllib import request, error

try:
    response = request.urlopen('https://www.jb51.net')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
else:
    print('Request Successfully')

# use hasattr to test for an attribute before reading it, to determine the cause
from urllib import request, error

try:
    response = request.urlopen('http://blog.jb51.net')
except error.HTTPError as e:
    if hasattr(e, 'code'):
        print("the server couldn't fulfill the request")
        print('Error code:', e.code)
    elif hasattr(e, 'reason'):
        print('we failed to reach a server')
        print('Reason:', e.reason)
else:
    print('no exception was raised')
    # everything is ok

Below we list several representative urllib examples.

1. Importing urllib

import urllib.request
response = urllib.request.urlopen('http://jb51.net/')
html = response.read()

2. Using Request

import urllib.request
req = urllib.request.Request('http://jb51.net/')
response = urllib.request.urlopen(req)
the_page = response.read()

3. Sending data

#! /usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': '[email protected]',
    'login[password]': '123456'
}
# urlencode() returns a str; the data argument must be bytes
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'https://www.jb51.net/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

4. Sending data and headers

#! /usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': '[email protected]',
    'login[password]': '123456'
}
headers = {'User-Agent': user_agent}
# urlencode() returns a str; the data argument must be bytes
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

5. HTTP errors

#! /usr/bin/env python3
import urllib.request
import urllib.error

req = urllib.request.Request('https://www.jb51.net')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode("utf8"))

6. Exception handling

#! /usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("https://www.jb51.net/")
try:
    response = urlopen(req)
except HTTPError as e:
    print("The server couldn't fulfill the request.")
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))

7. Exception handling with hasattr

from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request("https://www.jb51.net/")
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    elif hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
else:
    print("good!")
    print(response.read().decode("utf8"))

8. HTTP authentication

#! /usr/bin/env python3
import urllib.request

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://www.jb51.net/"
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create an "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
a_url = "https://www.jb51.net/"
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)

9. Using a proxy

#! /usr/bin/env python3
import urllib.request

# ProxyHandler supports http/https proxies; SOCKS proxies are not supported
# by urllib itself and need a third-party library such as PySocks
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("http://www.jb51.net").read().decode("utf8")
print(a)

10. Timeouts

#! /usr/bin/env python3
import socket
import urllib.request

# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('https://www.jb51.net/')
a = urllib.request.urlopen(req).read()
print(a)

11. Creating your own opener with build_opener

import urllib.request

header = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
# create the opener object
opener = urllib.request.build_opener()
opener.addheaders = header
# install the opener as the global opener used by urlopen()
urllib.request.install_opener(opener)
response = urllib.request.urlopen('https://www.jb51.net/')
buff = response.read()
html = buff.decode("utf8")
response.close()
print(html)

12. Remote download with urllib.request.urlretrieve

import urllib.request

header = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
# create the opener object
opener = urllib.request.build_opener()
opener.addheaders = header
# install the opener as the global opener used by urlopen()
urllib.request.install_opener(opener)
# download the page into the current directory
urllib.request.urlretrieve('https://www.jb51.net/', 'baidu.html')
# clear the cache produced by urlretrieve
urllib.request.urlcleanup()

13. POST request

import urllib.request
import urllib.parse

url = 'https://www.jb51.net/mypost/'
# encode the data with urlencode(), then use encode() to turn it into utf-8 bytes
postdata = urllib.parse.urlencode({'name': '测试名', 'pass': '123456'}).encode('utf-8')
# urllib.parse.quote() accepts a string;
# urllib.parse.urlencode() accepts a dict or a list of 2-tuples [(a, b), (c, d)]
# and joins the key/value pairs with '&'
req = urllib.request.Request(url, postdata)
# urllib.request.Request(url, data=None, headers={}, origin_req_host=None,
#                        unverifiable=False, method=None)
# url: a string containing the URL.
# data: used in the HTTP request; if given, a POST rather than a GET request is sent.
# headers: a dict.
# the last two parameters relate to third-party cookies.
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data = urllib.request.urlopen(req).read()
# urlopen()'s data parameter defaults to None; when data is not None, urlopen() submits via POST.

14. Using cookies
1. Saving cookies into a variable

import urllib.request
import http.cookiejar

# declare a CookieJar instance to hold the cookies
cookie = http.cookiejar.CookieJar()
# use urllib's HTTPCookieProcessor to create a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
# this opener's open method works like urllib.request's urlopen and can also take a Request
urllib.request.install_opener(opener)
# use the opener or urlretrieve to pick up the site's cookies
urllib.request.urlretrieve('https://www.jb51.net/', 'baidu.html')
# data = urllib.request.urlopen('https://www.jb51.net/')

2. Saving cookies to a file

import http.cookiejar
import urllib.request

# the file to save cookies in: cookie.txt in the same directory
filename = 'cookie.txt'
# declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = http.cookiejar.MozillaCookieJar(filename)
# use urllib's HTTPCookieProcessor to create a cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener from the handler
opener = urllib.request.build_opener(handler)
# make a request; this works like urllib's urlopen
response = opener.open("https://www.jb51.net")
# save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

3. Loading cookies from a file and using them

import http.cookiejar
import urllib.request

# create a MozillaCookieJar instance
cookie = http.cookiejar.MozillaCookieJar()
# load the cookie contents from the file into the variable
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# build the request
req = urllib.request.Request("https://www.jb51.net")
# use build_opener to create an opener with a cookie processor
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print(response.read())

15. Proxy server settings

import socket
import time
from urllib import request

# set the socket connection timeout, which also determines urlopen's timeout
socket.setdefaulttimeout(1)
# proxy server information: the address used for the http proxy
starttime = time.time()
# set the http and https proxies
proxy = request.ProxyHandler({'https': '175.155.25.91:808', 'http': '175.155.25.91:808'})
opener = request.build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0'),
                     # ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
                     # ("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
                     # ("Accept-Encoding", "gzip, deflate, br"),
                     # ("Connection", "keep-alive"),
                     # ("Pragma", "no-cache"),
                     # ("Cache-Control", "no-cache")
                     ]
request.install_opener(opener)
# data = request.urlopen('https://www.jb51.net/find-ip-address').read()
data = request.urlopen('http://www.ipip.net/').read().decode('utf-8')
# data = gzip.decompress(data).decode('utf-8', 'ignore')
endtime = time.time()
delay = endtime - starttime
print(data)

Sometimes decoding the data returned by urlopen directly with decode('utf-8') fails, and the body must first be unpacked with gzip.decompress() before decoding with ('utf-8', 'ignore'). Presumably this depends on the request headers (with Accept-Encoding: gzip the server may return compressed content), and sometimes it works directly.
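A hedged sketch of that fallback (assuming the server announces compression via the Content-Encoding response header; the URL is illustrative):

import gzip
import urllib.request

response = urllib.request.urlopen('https://www.jb51.net/')
raw = response.read()
# if the body is gzip-compressed, decompress it before decoding
if response.info().get('Content-Encoding') == 'gzip':
    html = gzip.decompress(raw).decode('utf-8', 'ignore')
else:
    html = raw.decode('utf-8', 'ignore')
print(len(html))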


Source: blog.csdn.net/haoxun12/article/details/105081380