urllib's four modules:
urllib.request
urllib.error
urllib.parse
urllib.robotparser
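Of the four, urllib.robotparser is the only one not demonstrated below; a minimal offline sketch (the example.com rules here are made up for illustration) shows how it answers whether a crawler may fetch a URL:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call set_url(...) and read() to fetch a live robots.txt;
# parse() accepts the file's lines directly, so this runs offline.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/public/page"))   # True, no rule blocks it
print(rp.can_fetch("*", "http://example.com/private/page"))  # False, disallowed above
```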
Get page source
import urllib.request
response=urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))
# fetch the source of Baidu's homepage
POST request
import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({"name":"hello"}),encoding='utf-8')
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read().decode('utf-8'))
Here the data parameter is passed. It must be in byte-stream (bytes) format, so the urlencoded string needs a bytes() conversion; when data is supplied, urlopen sends the request as POST.
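The conversion itself can be checked offline: urlencode turns the dict into a query string, and bytes() encodes it into the form urlopen's data parameter expects.

```python
import urllib.parse

# urlencode turns a dict into a query string
payload = urllib.parse.urlencode({"name": "hello"})
print(payload)   # name=hello

# bytes() encodes it; this is the object passed as data= to urlopen
data = bytes(payload, encoding='utf-8')
print(data)      # b'name=hello'
```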
Timeout Testing
import urllib.request
import urllib.error
import socket
try:
    response=urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason,socket.timeout):
        print("TIME OUT")
Here we pass the timeout parameter (in seconds); 0.1s is too short, so the request times out.
Response
1. Response Type
import urllib.request
response=urllib.request.urlopen("http://httpbin.org/get")
print(type(response))
The result returned is: <class 'http.client.HTTPResponse'>
2. Status Code
3. Response header
4. Response Body
import urllib.request
response=urllib.request.urlopen("http://www.python.org")
print(response.status)# response status code
print(response.getheaders())# all response headers
print(response.getheader('Server'))# value of a single header
Building a Request with multiple parameters
from urllib import request,parse
url='http://httpbin.org/post'
headers={
'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
'Host':'httpbin.org'
}
dict={
'name':'Germey'
}
data=bytes(parse.urlencode(dict),encoding='utf-8')
req=request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))
Advanced Usage: Handler
Some requests need cookies or proxy settings; for those, use a Handler.
import urllib.request
proxy_handler=urllib.request.ProxyHandler({
'http':'http://127.0.0.1:9743',
'https':'https://127.0.0.1:9743'
})
opener=urllib.request.build_opener(proxy_handler)
response=opener.open('http://httpbin.org/get')
print(response.read())
Cookies
import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)
# Declare a CookieJar object, use HTTPCookieProcessor to build a Handler, and finally use build_opener and call its open method
Exception Handling
from urllib import request,error
try:
    response=request.urlopen('http://cuiqingcai.com/index.html')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n') # HTTPError is a subclass of URLError
except error.URLError as e:
    print(e.reason)
else:
    print('request successfully')
URL parsing:
Parts of a URL:
What parts does a URL (Uniform Resource Locator) contain? Take "http://www.baidu.com/index.html?name=mo&age=25#dowell" as an example; it can be divided into six parts:
1. transfer protocol (scheme): http, https
2. domain: www.baidu.com is the host name here; baidu.com is the domain, www is the server name
3. port: if omitted, the default port 80 is used for http
4. path: the part after the domain, e.g. /index.html; / denotes the root directory
5. query parameters: ?name=mo&age=25
6. anchor (fragment): #dowell
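The six parts above map directly onto the fields that urlparse returns; a quick check on the same example URL:

```python
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html?name=mo&age=25#dowell")
print(result.scheme)    # http            (1. protocol)
print(result.netloc)    # www.baidu.com   (2. domain)
print(result.port)      # None -> the default port 80 is implied (3. port)
print(result.path)      # /index.html     (4. path)
print(result.query)     # name=mo&age=25  (5. query parameters)
print(result.fragment)  # dowell          (6. anchor)
```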
----------------
Original link: https://blog.csdn.net/qq_38990351/article/details/83689928
urlparse: splits a URL into its components
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
urlstring: the URL to be parsed
scheme='': the default protocol to apply when the URL itself carries none; if the URL already contains a protocol, this parameter has no effect
allow_fragments=True: whether to split out the fragment (anchor); the default True keeps it, False ignores it
from urllib.parse import urlparse
result=urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)
# printed result: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
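A small illustration of allow_fragments: with False, the anchor is no longer split out and stays inside the path instead.

```python
from urllib.parse import urlparse

# allow_fragments=False: the #comment anchor is not separated out
res = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(res.fragment)  # '' (empty, the anchor was not split out)
print(res.path)      # /index.html#comment (anchor folded into the path)
```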
urlunparse (composition)
from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
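Running the composition produces the URL below; note that urlunparse inserts the leading slash before the path, so urlparse splits it back into the same components.

```python
from urllib.parse import urlparse, urlunparse

# the six components: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
url = urlunparse(data)
print(url)  # http://www.baidu.com/index.html;user?a=6#comment

# urlparse splits it back apart (path gains a leading '/')
print(urlparse(url))
```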
urljoin
The urljoin method takes a base_url (base link) as the first parameter and the new link as the second. It analyzes the scheme, netloc, and path of base_url, uses these three elements to fill in whatever the new link is missing, and returns the result.
from urllib.parse import urljoin
print(urljoin("http://www.baidu.com","FAQ.html"))
print(urljoin("http://www.baidu.com","https://cuiqinghua.com/FAQ.html"))
print(urljoin("http://www.baidu.com","?category=2"))
# printed results
#http://www.baidu.com/FAQ.html
#https://cuiqinghua.com/FAQ.html
#http://www.baidu.com?category=2
urlencode converts a dictionary of parameters into a query string
from urllib.parse import urlencode
params={
'name':'germey',
'age':22
}
base_url='http://www.baidu.com?'
url=base_url+urlencode(params)
print(url)
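urllib.parse also offers parse_qs, the inverse of urlencode; it is worth knowing that every value comes back as a list of strings.

```python
from urllib.parse import urlencode, parse_qs

query = urlencode({'name': 'germey', 'age': 22})
print(query)            # name=germey&age=22

# parse_qs reverses the operation; values come back as lists of strings
print(parse_qs(query))  # {'name': ['germey'], 'age': ['22']}
```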
quote
This method converts content into URL-encoded format; when a URL carries Chinese parameters, leaving them unencoded may produce garbled characters.
from urllib.parse import quote
keyword="壁纸"
url="https://www.baidu.com/s?wd="+quote(keyword)
print(url)
#https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
unquote
This method decodes a URL-encoded string.
from urllib.parse import unquote
url='https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
#https://www.baidu.com/s?wd=壁纸
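One detail of quote worth noting: by default it leaves '/' untouched (it is treated as a path separator); pass safe='' to encode it as well, and unquote(quote(s)) always round-trips the original string.

```python
from urllib.parse import quote, unquote

# '/' survives by default; the space and Chinese characters are encoded
print(quote('a/b 壁纸'))       # a/b%20%E5%A3%81%E7%BA%B8

# safe='' encodes the slash too
print(quote('a/b', safe=''))   # a%2Fb

# unquote reverses quote exactly
s = '壁纸'
print(unquote(quote(s)) == s)  # True
```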