Python crawler (IV): using and mastering the basics of the urllib library

The four modules of urllib

urllib.request
urllib.error
urllib.parse
urllib.robotparser

Getting the page source

import urllib.request
response=urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))
# fetch the source code of Baidu's homepage

POST request

import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({"name":"hello"}),encoding='utf-8')
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read().decode('utf-8'))

Here the data parameter is passed. urlopen requires data to be in bytes format, so the form dictionary is first serialized with urllib.parse.urlencode and then converted with bytes(). Supplying data switches the request method to POST.

Timeout Testing

import urllib.request
import urllib.error
import socket
try:
	response=urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
except urllib.error.URLError as e:
	if isinstance(e.reason,socket.timeout):
		print("TIME OUT")

Here the timeout parameter is passed; if the server does not respond within 0.1 seconds, a URLError wrapping socket.timeout is raised.

response

1. Response Type

import urllib.request
response=urllib.request.urlopen("http://httpbin.org/get")
print(type(response))

The result returned is: <class 'http.client.HTTPResponse'>

2. Status Code
3. Response header
4. Response Body

import urllib.request
response=urllib.request.urlopen("http://www.python.org")
print(response.status)  # response status code
print(response.getheaders())  # all response headers
print(response.getheader('Server'))  # a single header by name

Building a Request with multiple parameters

from urllib import request,parse
url='http://httpbin.org/post'
headers={
	'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
	'Host':'httpbin.org'
}
dict={
	'name':'Germey'
}
data=bytes(parse.urlencode(dict),encoding='utf-8')
req=request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))

Advanced usage: Handlers

Some requests need cookies or proxy settings; urllib handles these with Handler classes.

import urllib.request
proxy_handler=urllib.request.ProxyHandler({
	'http':'http://127.0.0.1:9743',
	'https':'https://127.0.0.1:9743'
})
opener=urllib.request.build_opener(proxy_handler)
response=opener.open('http://httpbin.org/get')
print(response.read())

Cookies

import http.cookiejar,urllib.request
cookie=http.cookiejar.CookieJar()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)
# Declare a CookieJar object, build a Handler with HTTPCookieProcessor, then call build_opener and open

Exception Handling

from urllib import request,error
try:
	response=request.urlopen('http://cuiqingcai.com/index.html')
except error.HTTPError as e:
	print(e.reason,e.code,e.headers,sep='\n') # HTTPError is a subclass of URLError, so it must be caught first
except error.URLError as e:
	print(e.reason)
else:
	print('request successfully')

URL parsing:

The parts of a URL

What parts does a URL (Uniform Resource Locator) contain? Take "http://www.baidu.com/index.html?name=mo&age=25#dowell" as an example; it can be divided into six parts:

1. Transmission protocol (scheme): http, https

2. Domain: www.baidu.com is the full host name; baidu.com is the domain and www the server

3. Port: if omitted, the default port number 80 (for http) is used

4. Path: the part after the host, here /index.html; / by itself denotes the root directory

5. Query parameters: ?name=mo&age=25

6. Hash value (anchor): #dowell

----------------
Original link: https://blog.csdn.net/qq_38990351/article/details/83689928
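The six parts listed above can be seen directly with urllib.parse.urlparse (a small added sketch, not in the original):

```python
from urllib.parse import urlparse

# Decompose the example URL from the list above into its components
result = urlparse("http://www.baidu.com/index.html?name=mo&age=25#dowell")
print(result.scheme)    # http            - transmission protocol
print(result.netloc)    # www.baidu.com   - domain / host
print(result.port)      # None            - no explicit port, so the default (80) applies
print(result.path)      # /index.html     - path
print(result.query)     # name=mo&age=25  - query parameters
print(result.fragment)  # dowell          - hash / anchor
```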

urlparse: splitting a URL into its components

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: the URL to be parsed
scheme='': a default scheme to assume when the URL carries none; if the URL already has a scheme, this parameter has no effect
allow_fragments=True: whether to parse the anchor; with the default True the fragment is split out, with False it is ignored and left attached to the preceding component

from urllib.parse import urlparse
result=urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)
# output: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
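The example above exercises only scheme; as an added sketch, allow_fragments=False keeps the anchor attached to the preceding component instead of splitting it off:

```python
from urllib.parse import urlparse

# With allow_fragments=False the '#comment' part is not parsed as a fragment;
# it stays inside the query component.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
                  allow_fragments=False)
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
#             params='user', query='id=5#comment', fragment='')
```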

urlunparse (composition, the inverse of urlparse)

from urllib.parse import urlunparse
data=['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
# http://www.baidu.com/index.html;user?a=6#comment

urljoin

The urljoin method takes a base_url (base link) as the first parameter and the new link as the second. It analyzes the scheme, netloc and path of base_url, uses these three elements to supplement whatever the new link is missing, and returns the final result.

from urllib.parse import urljoin
print(urljoin("http://www.baidu.com","FAQ.html"))
print(urljoin("http://www.baidu.com","https://cuiqinghua.com/FAQ.html"))
print(urljoin("http://www.baidu.com","?category=2"))
# output:
#http://www.baidu.com/FAQ.html
#https://cuiqinghua.com/FAQ.html
#http://www.baidu.com?category=2

urlencode converts a dictionary into URL query parameters

from urllib.parse import urlencode
params={
	'name':'germey',
	'age':22
	}
base_url='http://www.baidu.com?'
url=base_url+urlencode(params)
print(url)
# http://www.baidu.com?name=germey&age=22
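urlencode also percent-encodes values that are not URL-safe, which connects to quote below (a small added example, not in the original):

```python
from urllib.parse import urlencode

# Non-ASCII values are percent-encoded automatically
params = {'wd': '壁纸', 'page': 1}
query = urlencode(params)
print(query)  # wd=%E5%A3%81%E7%BA%B8&page=1
```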

quote

This method converts content into URL-encoded (percent-encoded) form; when a URL carries Chinese parameters, leaving them unencoded can produce garbled results.

from urllib.parse import quote
keyword="壁纸"
url="https://www.baidu.com/s?wd="+quote(keyword)
print(url)
# https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote

This method decodes a percent-encoded URL back into readable text.

from urllib.parse import unquote
url='https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
# https://www.baidu.com/s?wd=壁纸

Origin blog.csdn.net/qq_45353823/article/details/104167865