Python crawler learning 11
Contents
- Parsing links

urlencode
As mentioned before, urlencode serializes a dictionary of parameters into a query string:

```python
from urllib import parse

params = {'name': 'germey', 'age': '25'}
base_url = 'http://www.baidu.com?'
url = base_url + parse.urlencode(params)
print(url)
```

Running result: `http://www.baidu.com?name=germey&age=25`. You can see that the parameters have been converted into GET request parameters.

parse_qs

parse_qs restores a serialized query string to a dictionary:

```python
# parse_qs
from urllib.parse import parse_qs

query = 'name=germey&age=25'
print(parse_qs(query))
```

Running result: `{'name': ['germey'], 'age': ['25']}`. Note that each value is a list, since a key may appear more than once in a query string.

parse_qsl

parse_qsl converts the query string into a list of tuples:

```python
# parse_qsl
from urllib.parse import parse_qsl

query = 'name=germey&age=25'
print(parse_qsl(query))
```

Running result: `[('name', 'germey'), ('age', '25')]`

quote

quote converts content into URL-encoded (percent-encoded) form, which allows Chinese characters to be embedded in a URL:

```python
# quote
from urllib.parse import quote

kw = '雪容融'
url = 'http://www.baidu.com/s?wd=' + quote(kw)
print(url)
```

Running result: `http://www.baidu.com/s?wd=%E9%9B%AA%E5%AE%B9%E8%9E%8D`

unquote

unquote is the inverse of quote and decodes percent-encoded text:

```python
# unquote
from urllib.parse import unquote

url = '%E9%9B%AA%E5%AE%B9%E8%9E%8D'
print('%E9%9B%AA%E5%AE%B9%E8%9E%8D decodes to:', unquote(url))
```

Running result: `%E9%9B%AA%E5%AE%B9%E8%9E%8D decodes to: 雪容融`

Analyzing the Robots Protocol

Robots Protocol

Also known as the crawler protocol or robot protocol, its full name is the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It is usually a text file named robots.txt, placed in the root directory of the website.
When a search crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it exists, the crawler follows the crawling scope defined there; if the file is not found, the crawler visits all directly accessible pages.

```
# Sample robots.txt
User-agent: *      # Name of the crawler the rules apply to; * means all crawlers
Disallow: /        # Directories the crawler may not crawl; / forbids every page
Allow: /public/    # Usually paired with Disallow to carve out exceptions
```

Taken together, this example forbids crawling of all pages except those under the /public/ directory.
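The checking step described above can be sketched with the standard library's urllib.robotparser. The robots.txt content and the example.com URLs below are made up for illustration; note that urllib.robotparser applies rule lines in file order (first match wins), so the more specific Allow line is placed before the blanket Disallow here:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (normally fetched from the site root)
robots_txt = """\
User-agent: *
Allow: /public/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # feed the rules without a network request

# can_fetch(useragent, url) answers: may this crawler fetch this URL?
print(rp.can_fetch('*', 'http://www.example.com/'))                   # blocked by Disallow: /
print(rp.can_fetch('*', 'http://www.example.com/public/index.html'))  # allowed by Allow: /public/
```

In real use you would call `rp.set_url('http://www.example.com/robots.txt')` followed by `rp.read()` instead of `parse()`, letting the parser download the file itself.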

Crawler names

- Each crawler has a fixed name. For example, Baidu's crawler is called BaiduSpider.

To be continued...