[Web Crawler] Getting Started - Understanding of Crawlers

1. Use the python-whois library to view the owner of a website

import whois
print(whois.whois("baidu.com"))  # pass the domain you want to look up
2. Use the builtwith library to identify the technology used by the website
import builtwith
print(builtwith.parse("https://www.baidu.com"))  # returns a dict of detected technologies

3. Check robots.txt to learn what restrictions exist when crawling the website

www.baidu.com/robots.txt

4. Regardless of which user agent is used, a 5-second crawl delay should be left between download requests; we need to follow that advice to avoid overloading the server (a sketch of reading robots.txt and honouring the delay follows).
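
For example, the standard library's urllib.robotparser can read robots.txt, tell us whether a given URL may be fetched, and report any Crawl-delay directive. A minimal sketch, assuming a hypothetical user-agent name and test URL:

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()  # download and parse robots.txt

user_agent = "MyCrawler"          # hypothetical user-agent name
page = "https://www.baidu.com/s"  # example URL to test

print(rp.can_fetch(user_agent, page))  # is this user agent allowed to fetch the page?

delay = rp.crawl_delay(user_agent)     # None if robots.txt sets no Crawl-delay
time.sleep(delay if delay else 5)      # otherwise fall back to a 5-second pause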

1.1 Writing the first web crawler

  In order to crawl a website, we first need to download the web pages containing the data of interest, a process generally known as crawling. There are many ways to crawl a website, and the most suitable method depends on the structure of the target site.
    Three common methods of crawling websites:
  • Crawling the sitemap
  • Iterating through the database ID of each web page
  • Following web page links (a minimal sketch follows this list)
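
The sitemap method is demonstrated at the end of this section. As a point of comparison, here is a minimal sketch of the link-following method, using requests plus a regular expression; the start URL and page limit below are placeholders:

import re
import requests

def crawl_links(start_url, max_pages=10):
    # breadth-first traversal: download a page, collect its links, then crawl those links
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url).text
        except requests.exceptions.RequestException:
            continue  # skip pages that fail to download
        # extract absolute links from href attributes and queue them for crawling
        queue.extend(re.findall('href="(https?://.*?)"', html))
    return seen

print(crawl_links("https://www.example.com"))  # placeholder start URL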

   Download web page

To crawl a website, we first download its web pages. The requests module can be used to download a URL:

import requests
res = requests.get("https://www.baidu.com")
print(res.text)

Let's enhance the robustness of the above code by handling request failures and retrying on server errors:

import requests

def downLoad(url, num=2):
    print("Downloading: " + str(url))  # print the URL being downloaded
    try:
        res = requests.get(url)
        res.raise_for_status()  # raise an exception for 4xx/5xx status codes
    except requests.exceptions.RequestException as e:
        print("Download failed")
        res = None
        if num > 0:
            # retry only on 5xx server errors (e.response.status_code holds the error status)
            if isinstance(e, requests.exceptions.HTTPError) and 500 <= e.response.status_code < 600:
                return downLoad(url, num - 1)
    return res.text if res is not None else None


url = "https://www.baidu.com"
downLoad(url)

Set user agent

It is good practice to use an identifiable user agent so that our web crawler does not cause problems. In addition, some websites block the default Python user agent, perhaps because of server overload caused by poorly written Python crawlers. To download pages more reliably, we need to control the user agent setting ourselves.

import requests

def downLoad(url, num=2):
    print("Downloading: " + str(url))  # print the URL being downloaded
    try:
        headers = {
            'User-Agent': 'user_agent'}  # replace 'user_agent' with your crawler's own user-agent string
        res = requests.get(url, headers=headers)
        res.raise_for_status()  # raise an exception for 4xx/5xx status codes
    except requests.exceptions.RequestException as e:
        print("Download failed")
        res = None
        if num > 0:
            # retry only on 5xx server errors (e.response.status_code holds the error status)
            if isinstance(e, requests.exceptions.HTTPError) and 500 <= e.response.status_code < 600:
                return downLoad(url, num - 1)
    return res.text if res is not None else None


url = "https://www.baidu.com"
downLoad(url)

Sitemap crawling

In this first simple crawler, we will use the sitemap found in the example site's robots.txt to download all of the pages. To parse the sitemap, we use a simple regular expression, ="(.*?)", to extract the URLs from quoted attribute values.
import re
import requests

def crawl_sitemap(url):
    # download the page and extract every quoted attribute value (URLs among them)
    sitemap = requests.get(url)
    print(sitemap.text)
    links = re.findall('="(.*?)"', sitemap.text)
    for link in links:
        print(link)

url = "http://www.sitemaps.org/protocol.html"
crawl_sitemap(url)
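
A real XML sitemap (for example, one listed in a site's robots.txt) keeps each page URL inside a <loc> tag, so the same approach would use the regular expression <loc>(.*?)</loc> instead. A minimal sketch, with a placeholder sitemap URL:

import re
import requests

def crawl_xml_sitemap(sitemap_url):
    # download the XML sitemap and pull every URL out of its <loc> tags
    sitemap = requests.get(sitemap_url)
    for link in re.findall('<loc>(.*?)</loc>', sitemap.text):
        print(link)

crawl_xml_sitemap("https://www.example.com/sitemap.xml")  # placeholder sitemap URL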
