(1) URL
URL components
Component | Description |
scheme | Network protocol or download scheme |
net_loc | Server location (may contain user information) |
path | Slash-separated (/) path to a file or CGI application |
params | Optional parameters |
query | A sequence of key-value pairs separated by ampersands (&) |
fragment | Identifies a specific anchor (section) within the document |
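As a rough illustration of how the six components map onto a real address, the sketch below splits a made-up URL with urllib.parse.urlparse. The URL, its host www.example.com, and the query values are placeholders chosen for this example only.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that happens to use all six components.
sample = 'http://www.example.com/cgi-bin/app.py;type=a?name=alice&lang=en#section2'

parts = urlparse(sample)
print(parts.scheme)    # network protocol
print(parts.netloc)    # server location
print(parts.path)      # slash-separated path
print(parts.params)    # optional parameters written after ';'
print(parts.query)     # key-value pairs joined by '&'
print(parts.fragment)  # anchor inside the document

# The query string can be split further into a dict of lists.
query_pairs = parse_qs(parts.query)
print(query_pairs)
```

Note that parse_qs returns a list for every key, because the same key may appear more than once in a query string.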
net_loc components
Format: user:password@host:port
Component | Description |
user | User name or login |
password | User password |
host | Name or address of the computer running the web server (required) |
port | Port number (may be omitted when it is the default, 80 for HTTP) |
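The parts of net_loc can also be read individually from the result of urlparse. The following is a minimal sketch; the user name, password, host, and port are invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical address with every net_loc part filled in.
parts = urlparse('ftp://alice:secret@example.com:2121/pub/file.txt')

print(parts.netloc)    # the whole user:password@host:port string
print(parts.username)
print(parts.password)
print(parts.hostname)
print(parts.port)      # an int when a port is written in the URL

# When the URL has no explicit port, the attribute is None;
# the scheme's default port is not filled in automatically.
no_port = urlparse('http://example.com/index.html')
print(no_port.port)
```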
(2) urllib
Here we mainly describe urllib.request and urllib.parse.
(3) urllib.request
urllib.request functions
Function | Description |
urlopen(url, data=None) | Opens a URL and returns a file-like object, as if open() had opened a local file in binary read-only mode. url: a URL string or a Request object. data: optional bytes to send to the server; when it is supplied, an HTTP request is sent as a POST |
urlretrieve(url, filename=None) | Downloads the resource at url to a local file. filename: file name and path; a name without a directory is saved in the current working directory, and if filename is omitted a temporary file is used |
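To make the role of data more concrete, here is a small sketch that builds Request objects with and without a payload and checks which HTTP method each would use. Nothing is sent over the network; the endpoint and form fields are hypothetical.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration.
endpoint = 'http://www.example.com/cgi-bin/login'
form = {'user': 'alice', 'lang': 'en'}

# data must be bytes, so encode the form as key=value pairs joined by '&'.
payload = urllib.parse.urlencode(form).encode('ascii')

get_request = urllib.request.Request(endpoint)
post_request = urllib.request.Request(endpoint, data=payload)

print(get_request.get_method())   # no data, so GET
print(post_request.get_method())  # data present, so POST
print(post_request.data)

# Sending it would look like this (needs a reachable server, so it is commented out):
# with urllib.request.urlopen(post_request) as response:
#     body = response.read()
```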
Methods of the object returned by urlopen
Method | Description |
read() | Reads all the data |
readline() | Reads one line of data |
readlines() | Reads all lines and returns them as a list |
fileno() | Returns the file descriptor |
close() | Closes the URL connection (close() and the four methods above behave like the file object methods of the same name) |
info() | Returns the MIME (Multipurpose Internet Mail Extensions) headers. These headers tell the client what type of content is being returned and which kinds of applications can open it |
geturl() | Returns the real URL (for example, after a redirect it is the URL that was finally opened) |
getcode() | Returns the HTTP status code |
    import urllib.request

    url = 'https://tieba.baidu.com/p/5475267611'
    # Open the URL (much like opening a file in binary read-only mode) and read all the data
    html = urllib.request.urlopen(url).read()
    print(type(html))  # <class 'bytes'>

    url_file = 'https://imgsa.baidu.com/forum/w%3D580/sign=99114e38abec08fa260013af69ef3d4d/e549b13533fa828bc80c7764f61f4134960a5a85.jpg'
    # Download the file at the URL and save it locally (raw string keeps the backslashes literal)
    urllib.request.urlretrieve(url_file, r'C:\Temp\1.jpg')

    # Get the MIME headers of the response
    html_info = urllib.request.urlopen(url).info()
    print(html_info)
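The line-oriented methods can be tried without any network access. The sketch below assumes a data: URL, which urlopen can open directly because the content is carried inside the URL itself; the three lines of text are invented for the example. fileno() and getcode() are left out because a data: URL has no socket or HTTP status behind it.

```python
import urllib.request

# A data: URL carries its content inline, so no server is contacted.
inline_url = 'data:text/plain;charset=utf-8,line%20one%0Aline%20two%0Aline%20three%0A'

with urllib.request.urlopen(inline_url) as response:
    first_line = response.readline()    # one line, as bytes
    remaining = response.readlines()    # the rest, as a list of bytes
    content_type = response.info().get_content_type()
    opened_url = response.geturl()

print(first_line)
print(remaining)
print(content_type)
print(opened_url == inline_url)
```

Using the response in a with statement closes it automatically, which replaces an explicit call to close().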
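(4) urllib.parse
urllib.parse functions
Function | Description |
urlparse(urlstr) | Parses a URL into a tuple (scheme='', netloc='', path='', params='', query='', fragment='') |
urlunparse(urltup) | The reverse of urlparse: joins the URL components (a tuple) back into a complete URL |
urljoin(base, url) | Combines base and url into a complete URL. base: when url starts with /, only the net_loc of base and everything before it are kept; otherwise url is resolved relative to the path of base |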
    import urllib.parse

    url = 'https://www.cnblogs.com/cate/python/'
    newurl = '/cate/ruby/'
    # Parse the URL into a tuple (scheme='', netloc='', path='', params='', query='', fragment='')
    urlpar = urllib.parse.urlparse(url)
    print('urlparse example:', urlpar)
    # The reverse of urlparse: join the tuple (scheme, netloc, path, params, query, fragment) back into a complete URL
    urlunp = urllib.parse.urlunparse(urlpar)
    print('urlunparse example:', urlunp)
    # Join the netloc of url, and everything before it, with newurl
    url_ruby = urllib.parse.urljoin(url, newurl)
    print('urljoin example:', url_ruby)
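How urljoin combines the two arguments depends on the form of the second one. The sketch below reuses the base URL from the example above and adds a few variants for comparison; the relative paths and the www.example.com address are assumptions made for illustration.

```python
from urllib.parse import urljoin

base = 'https://www.cnblogs.com/cate/python/'

# Leading slash: only the scheme and net_loc of base are kept.
from_root = urljoin(base, '/cate/ruby/')
# No leading slash: resolved relative to the directory of base.
relative = urljoin(base, 'ruby/')
# '..' climbs one directory before appending.
parent = urljoin(base, '../ruby/')
# A complete URL replaces base entirely.
absolute = urljoin(base, 'https://www.example.com/other/')

print(from_root)  # https://www.cnblogs.com/cate/ruby/
print(relative)   # https://www.cnblogs.com/cate/python/ruby/
print(parent)     # https://www.cnblogs.com/cate/ruby/
print(absolute)   # https://www.example.com/other/
```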