1. Use the python-whois library to view the owner of the website
import whois
print(whois.whois("baidu.com"))  # query WHOIS information for a domain
2. Use the builtwith library to identify the technologies the website is built with
import builtwith
print(builtwith.parse("https://www.baidu.com"))  # parse() expects a full URL
3. Check the website's robots.txt to learn what restrictions it places on crawlers
www.baidu.com/robots.txt
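As a minimal sketch, the standard library's urllib.robotparser can check these rules programmatically (the user-agent string "MyCrawler" is an arbitrary example):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()
# True if this user agent is allowed to fetch the URL
print(rp.can_fetch("MyCrawler", "https://www.baidu.com/"))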
4. Regardless of the user agent used, a 5-second crawl delay should be left between download requests; we need to follow that advice to avoid overloading the server
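A minimal sketch of honoring such a delay with time.sleep (the URL list is only illustrative):
import time
import requests

urls = ["https://www.baidu.com", "https://www.baidu.com/robots.txt"]
for u in urls:
    requests.get(u)
    time.sleep(5)  # wait 5 seconds between download requests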
1.1 Writing the first web crawler
- Crawl the sitemap
- Iterate through the database IDs of the web pages (see the sketch right after this list)
- Follow web links (a sketch appears at the end of this section)
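Sitemap crawling is demonstrated at the end of this section. As a hedged sketch, iterating through database IDs might look like the following; the URL pattern and the assumption that missing IDs return 404 are purely hypothetical:
import requests

# Hypothetical URL pattern; real sites number their pages differently
for page_id in range(1, 6):
    url = "https://example.com/view/" + str(page_id)
    res = requests.get(url)
    if res.status_code == 404:  # assume missing IDs return 404
        break
    print(url, "->", len(res.text), "characters")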
Download web page
To crawl a website, we first need to download its pages; the requests module can download a URL for us
import requests
res = requests.get("https://www.baidu.com")
print(res.text)  # the page's HTML source
Now let's enhance the robustness of the above code so that it retries after server errors
import requests

def download(url, num_retries=2):
    print("Downloading:", url)  # print the URL being downloaded
    try:
        res = requests.get(url)
        res.raise_for_status()  # raise an exception on 4xx/5xx status codes
    except requests.exceptions.RequestException as e:
        print("Download failed:", e)
        res = None
        if num_retries > 0:
            # e.response.status_code holds the error status code;
            # retry only on 5xx server errors, which may be transient
            if isinstance(e, requests.exceptions.HTTPError) and 500 <= e.response.status_code < 600:
                return download(url, num_retries - 1)
    return res.text if res is not None else None

url = "https://www.baidu.com"
download(url)
Set user agent
It is better for our web crawler to use an identifiable user agent to avoid problems. In addition, some websites block the default user agent outright, perhaps because of server overload caused by poorly written Python web crawlers. To download pages more reliably, we need to control the user-agent setting ourselves
import requests

def download(url, num_retries=2):
    print("Downloading:", url)  # print the URL being downloaded
    try:
        # replace 'user_agent' with an identifiable name for your crawler
        headers = {'User-Agent': 'user_agent'}
        res = requests.get(url, headers=headers)
        res.raise_for_status()  # raise an exception on 4xx/5xx status codes
    except requests.exceptions.RequestException as e:
        print("Download failed:", e)
        res = None
        if num_retries > 0:
            # retry only on 5xx server errors, which may be transient
            if isinstance(e, requests.exceptions.HTTPError) and 500 <= e.response.status_code < 600:
                return download(url, num_retries - 1)
    return res.text if res is not None else None

url = "https://www.baidu.com"
download(url)
Sitemap crawling
import re
import requests

def crawl_sitemap(url):
    sitemap = requests.get(url)  # download the sitemap page
    print(sitemap.text)
    # extract every quoted attribute value as a candidate link
    links = re.findall('="(.*?)"', sitemap.text)
    for link in links:
        print(link)

url = "http://www.sitemaps.org/protocol.html"
crawl_sitemap(url)
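The third approach from the list above, following web links, is not shown here; as a hedged sketch that assumes absolute href attributes are enough to discover new pages, it might look like this:
import re
import requests

def link_crawler(seed_url, max_pages=5):
    # Breadth-first crawl: download a page, collect its links, repeat
    crawled = set()
    queue = [seed_url]
    while queue and len(crawled) < max_pages:
        url = queue.pop(0)
        if url in crawled:
            continue
        res = requests.get(url)
        crawled.add(url)
        # only absolute links are collected here; a real crawler would
        # also resolve relative URLs with urllib.parse.urljoin
        for link in re.findall(r'href="(https?://.*?)"', res.text):
            queue.append(link)
    return crawled

print(link_crawler("https://www.baidu.com"))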