Other Uses of the urllib Library

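Contents

Parsing and Rebuilding URLs
Joining URLs
Encoding and Decoding Query Strings
Chinese Characters in URLs
Checking Whether a Crawler May Fetch a Page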

Parsing and Rebuilding URLs

from urllib.parse import urlparse
result = urlparse('http://www.iplant.cn/info/Dendranthema%20morifolium?t=z')
print(type(result), result)

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.iplant.cn', path='/info/Dendranthema%20morifolium', params='', query='t=z', fragment='')

URL format:
scheme://netloc/path;params?query#fragment
protocol://domain/path;parameters?query#anchor

print(result.scheme, result[0])

http http

from urllib.parse import urlunparse
print(urlunparse(list(result)))
print(urlunparse(result))

http://www.iplant.cn/info/Dendranthema%20morifolium?t=z
http://www.iplant.cn/info/Dendranthema%20morifolium?t=z

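There are also urlsplit and urlunsplit, which work in much the same way. The difference is that they do not separate params from the path, so the result has five components instead of six.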
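To make the difference concrete, here is a minimal sketch comparing the two parsers. The `;v=1` parameter and the `#top` fragment are not part of the original link; they are added here only so that every component is populated.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

# Hypothetical URL: ';v=1' and '#top' are added only to fill every component.
url = 'http://www.iplant.cn/info/Dendranthema%20morifolium;v=1?t=z#top'

six = urlparse(url)    # ParseResult: params is a separate field
five = urlsplit(url)   # SplitResult: ';v=1' stays inside path

print(six.path, six.params)   # /info/Dendranthema%20morifolium v=1
print(five.path)              # /info/Dendranthema%20morifolium;v=1
print(len(six), len(five))    # 6 5

# urlunsplit is the inverse of urlsplit, just as urlunparse is of urlparse.
rebuilt = urlunsplit(five)
print(rebuilt == url)         # True
```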

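Joining URLs

urljoin takes the base URL as its first argument and the new URL as its second. It fills in whatever components the second argument is missing with the corresponding components of the base URL. If the second argument is already a complete URL, it is returned unchanged. Note that in the example below the second argument has no scheme, so it is treated as a relative path rather than as a host name, and is appended to the base.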

from urllib.parse import urljoin
urljoin('http://www.iplant.cn', 'www.iplant.cn/info/Dendranthema%20morifolium?t=z')

'http://www.iplant.cn/www.iplant.cn/info/Dendranthema%20morifolium?t=z'
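The following sketch shows how the form of the second argument changes the result. The base URL and the paths `/about` and `detail.html` are made up for illustration and are not real pages on the site.

```python
from urllib.parse import urljoin

# Hypothetical base URL, used only to show which parts get replaced.
base = 'http://www.iplant.cn/info/index.html?t=z'

# A complete URL in the second argument is returned unchanged.
full = urljoin(base, 'https://www.jianshu.com/p/3e44bbd511e0')

# A path starting with '/' replaces the whole base path; scheme and host are kept.
rooted = urljoin(base, '/about')

# A relative path replaces only the last segment of the base path.
relative = urljoin(base, 'detail.html')

# A leading '//' marks a host name, so only the scheme is inherited.
host_only = urljoin(base, '//www.iplant.cn/info/Dendranthema%20morifolium?t=z')

print(full)       # https://www.jianshu.com/p/3e44bbd511e0
print(rooted)     # http://www.iplant.cn/about
print(relative)   # http://www.iplant.cn/info/detail.html
print(host_only)  # http://www.iplant.cn/info/Dendranthema%20morifolium?t=z
```

If the intent is to join a host name, prefixing it with `//` (or the full scheme) avoids the doubled host seen in the output above.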

Encoding and Decoding Query Strings

from urllib.parse import urlencode
params = {
    'name':'siri',
    'word':'hello'
}
base_url = 'http://www.baidu.com?'
print(base_url + urlencode(params))

http://www.baidu.com?name=siri&word=hello

from urllib.parse import parse_qs
parse_qs(urlencode(params))

{'name': ['siri'], 'word': ['hello']}

from urllib.parse import parse_qsl
parse_qsl(urlencode(params))

[('name', 'siri'), ('word', 'hello')]
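parse_qs returns lists because a key may appear more than once in a query string. The sketch below, which assumes a made-up URL with a repeated `word` key, pulls the query out of a full URL and then turns the dictionary back into a query string. Passing `doseq=True` makes urlencode emit one pair per list item instead of encoding the list itself.

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Hypothetical URL with a repeated key, for illustration only.
url = 'http://www.baidu.com?name=siri&word=hello&word=hi'

query = urlparse(url).query
values = parse_qs(query)
print(values)    # {'name': ['siri'], 'word': ['hello', 'hi']}

# doseq=True expands each list into separate key=value pairs.
rebuilt = urlencode(values, doseq=True)
print(rebuilt)   # name=siri&word=hello&word=hi
```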

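Chinese Characters in URLs

Take a search for 苹果 (apple) as an example. In the address bar the link looks like this: http://www.iplant.cn/info/苹果?t=z , but once it is copied somewhere else it becomes http://www.iplant.cn/info/%E8%8B%B9%E6%9E%9C?t=z . This is because the URL standard permits only a subset of ASCII characters (letters, digits, and some symbols). Other characters, such as Chinese characters, are not valid in a URL, so they have to be URL-encoded: the text is encoded as UTF-8 and each byte is written as %XX.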

from urllib.parse import quote
url = 'http://www.iplant.cn/info/'+quote('苹果')+'?t=z'
print(url)

http://www.iplant.cn/info/%E8%8B%B9%E6%9E%9C?t=z

from urllib.parse import unquote
unquote(url)

'http://www.iplant.cn/info/苹果?t=z'
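As a rough check of the rule described above, the sketch below builds the percent-encoded form by hand from the UTF-8 bytes and compares it with quote. It also shows the `safe` parameter and quote_plus, using the throwaway string `'a/b c'` as input.

```python
from urllib.parse import quote, quote_plus, unquote

word = '苹果'

# Percent-encoding by hand: one %XX group per UTF-8 byte.
manual = ''.join('%{:02X}'.format(byte) for byte in word.encode('utf-8'))
print(manual)                  # %E8%8B%B9%E6%9E%9C
print(manual == quote(word))   # True
print(unquote(manual))         # 苹果

# quote leaves '/' alone by default; pass safe='' to escape it as well.
print(quote('a/b c'))            # a/b%20c
print(quote('a/b c', safe=''))   # a%2Fb%20c

# quote_plus escapes '/' and writes spaces as '+', as used in form data.
print(quote_plus('a/b c'))       # a%2Fb+c
```

Because quote also escapes characters such as `:`, it is usually applied to a single path segment or value, as in the example above, rather than to a whole URL.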

Checking Whether a Crawler May Fetch a Page

from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://www.jianshu.com/robots.txt')
rp.read()

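The contents of robots.txt are shown below. Each User-agent line names the crawler that the rules following it apply to (* matches any crawler), Disallow lists path prefixes that must not be crawled, and Allow lists path prefixes that may be crawled.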

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/
Disallow: /p/0826cf4692f9
Disallow: /p/d8b31d20a867
Disallow: /collections/*/recommended_authors
Disallow: /trial/*
Disallow: /keyword_notes
Disallow: /stats-2017/*

User-agent: trendkite-akashic-crawler
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: YisouSpider
Request-rate: 1/10 # load 1 page per 10 seconds
Crawl-delay: 60

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1 # load 2 page per 1 seconds
Crawl-delay: 10
Allow: /

User-agent: Mediapartners-Google
Allow: /

The can_fetch method reports whether a page may be fetched. Its two arguments are the User-agent and the URL.

print(rp.can_fetch('*', 'https://www.jianshu.com/p/3e44bbd511e0'))

False

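This result is not what we expect. My guess is that Jianshu applies anti-crawler measures, so the contents of robots.txt were never actually retrieved (when read() receives a 401 or 403 response, RobotFileParser treats every URL as disallowed). In that case we can download the file ourselves and pass its lines to the parse method.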

from urllib.request import urlopen, Request
rp = RobotFileParser()
# Add a User-Agent header to get past the anti-crawler check
headers = {
    'User-Agent': 'Mozilla/4.0(compatible; MSIE 5.5; Windows NT)'
}
req = Request('https://www.jianshu.com/robots.txt', headers=headers)
rp.parse(urlopen(req).read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'https://www.jianshu.com/p/3e44bbd511e0'))
print(rp.can_fetch('*', 'https://www.jianshu.com/p/0826cf4692f9'))

True
False
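The same checks can be tried without any network access by parsing a robots.txt string directly. The sketch below uses a shortened copy of the rules listed earlier, and also reads the Crawl-delay and Request-rate values through crawl_delay and request_rate (available since Python 3.6). The `/search` URL is only a test input.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt shown earlier, kept inline for offline use.
robots_txt = """\
User-agent: *
Disallow: /search
Disallow: /p/0826cf4692f9

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1
Crawl-delay: 10
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

site = 'https://www.jianshu.com'
print(rp.can_fetch('*', site + '/p/3e44bbd511e0'))         # True
print(rp.can_fetch('*', site + '/p/0826cf4692f9'))         # False
print(rp.can_fetch('Cliqzbot', site + '/p/3e44bbd511e0'))  # False
print(rp.can_fetch('Googlebot', site + '/search'))         # True

# Per-crawler throttling hints; None when the matching entry sets no value.
print(rp.crawl_delay('Googlebot'))   # 10
print(rp.crawl_delay('*'))           # None
rate = rp.request_rate('Googlebot')
print(rate.requests, rate.seconds)   # 2 1
```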
