Python爬虫一一网络爬虫简介

本分类参考书籍：用Python写网络爬虫

书中采用的是Python2.7，而我使用的Python版本是Python3.7，所以在一些代码使用上做了对应修改

1.识别网站所用技术一一builtwith模块

运行结果：

2.寻找网站所有者

3.下载网页

书中介绍Python2.7中使用urllib2模块下载URL，但是我们发现Python3.0中已经不存在urllib2模块了，Python3.0中将其集成到urllib.request这个库中

4.爬取网站的3种常见方法

(1).爬取网站地图

我们从示例网站robot.txt文件中发现的网站地图来下载所有网页，为了解析网站地图，使用简单的正则表达式，从<loc>标签中提取URL。该爬取方法依赖于网站地图sitemap.xml文件

网站地图sitemap.xml的部分链接如下所示：

爬取代码如下：

运行结果：

(2).遍历每个网页的数据库ID

通过遍历网页ID来爬取网站。

优点：便捷

缺点：由于一些网站使用非连续大数作为ID，或是不使用数值作为ID，此时遍历就比较困难。

运行结果：

(3).跟踪网页链接

通过跟踪所有链接的方式，可以很容易地下载整个网站的页面。但是，这种方法会下载大量我们并不需要的网页。本节中的链接爬虫将使用正则表达式来确定需要下载哪些页面。

有关正则表达式的描述可以参考此文：https://www.imooc.com/article/2100

我们从网页看到实际字符包含'/places/default'字样，所以我们将link_regex传参修改如下：

再次运行，则出现了与书中类似的问题故障：

问题出现在下载/places/default/index/1时，该链接只有网页的路径部分，没有协议和服务器部分。也就是说这是一个相对链接，我们需要将其转换成绝对链接，才可以供我们访问。

此处我们利用urllib.parse来创建绝对路径，link_crawler函数修改如下：

这样程序就可以爬取相应网页了，但是还存在一个问题，

从上图可以看出，网页被重复爬取，这是因为这些地点之间存在链接，解决方法就是我们需要记录哪些链接被爬取过，避免重复下载。

至此，我们得到了一个可用的爬虫，但是我们运行程序会发现另外一个问题点，如下所示：

我们看到当爬虫爬取一定数目之后，出现TOO MANY REQUESTS报错，这个是由于爬取时，我们的请求速度到达一定的阈值，触发反爬虫机制引起的。

有一些应对策略，可以供我们参考：https://my.oschina.net/u/3247166/blog/831327

这里，我们根据书本中方式一致，采用限速的方式进行优化。也就是一些爬虫的高级功能

首先，解析robots.txt文件，以避免下载禁止爬取的URL，使用python自带的urllib.robotparser可以实现这个功能。

支持代理

下载限速

因为如果爬取网站的速度过快，就会面临被封禁或是造成服务器过载的风险，为了降低这些风险，在两次下载之间添加延时，从而对爬虫限速。下面是实现该功能的类的代码：

运用：

避免爬虫陷阱

我们的爬虫会跟踪所有之前没有访问过的链接，但是，一些网站会动态生成页面内容，这样就会出现无限多的网页。这样页面就会无止境地链接下去，这种情况称为爬虫陷阱。

避免方法：记录到达当前网页经过了多少个链接，也就是深度。当到达最大深度时，爬虫就不再向队列中添加该网页中的链接了。下面我们需要改变seen的记录方式，采用字典的方式记录每个url对应的访问深度

完整爬虫代码：

#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import urllib.request
import urllib.parse # 将url链接从相对路径（浏览器可懂但python不懂）转为绝对路径（python也懂了）
import re #正则表达式
import urllib.robotparser # 爬取数据前解析网站robots.txt文件，避免爬取网站所禁止或限制的
import datetime # 下载限速功能所需模块

# 下载url网页,proxy是支持代理功能，初始值为None，想要设置就直接传参数即可
def download(url, user_agent = 'brain', proxy = None, num_retries = 2):
    print ('Downloading:', url)
    headers = {'User-agent':user_agent} # 设置用户代理，而不使用python默认的用户代理Python-urllib/3.6
    request = urllib.request.Request(url,headers=headers)

    opener = urllib.request.build_opener()
    if proxy: # 如果设置了proxy，那么就进行以下设置以实现支持代理功能
        proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib.request.ProxyHandler(proxy_params))

    try:
        html = opener.open(request).read()
    except urllib.request.URLError as e: # 下载过程中出现问题
        print('Download error:', e.reason)
        html = None
        if num_retries>0: # 错误4XX发生在请求存在问题，而5XX错误则发生在服务端存在问题，所以在发生5XX错误时重试下载
            if hasattr(e,'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, proxy, num_retries-1)
    return html

# 爬取网站的下载限速功能的类的实现，每次在download下载前使用
class Throttle:
    """Add a delay between downloads to the same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        #记录上次访问的时间，小知识timestamp：时间戳是指格林威治时间1970年01月01日00时00分00秒(北京时间1970年01月01日08时00分00秒)
        #起至现在的总秒数。
        self.domains = {}

    def wait(self, url):
        domain = urllib.parse.urlparse(url).netloc
        last_accessed = self.domains.get(domain) # 记录每个域名上次访问的时间

        if self.delay > 0 and last_accessed is not None:
            # 外部延时与访问时间间隔（当前访问时间及上次访问时间）比较
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_secs > 0: # 访问时间间隔小于外部时延的话，执行睡眠操作
                # domain has been accessed recently
                # so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()

"""先下载 seed_url 网页的源代码，然后提取出里面所有的链接URL，接着对所有匹配到的链接URL与link_regex 进行匹配，
如果链接URL里面有link_regex内容，就将这个链接URL放入到队列中，下一次 执行 while crawl_queue: 就对这个链接URL进行同样的操作。
反反复复，直到 crawl_queue 队列为空，才退出函数。"""
def link_crawler(seed_url, link_regex, max_depth = 2):
    """Crawl from the given seed URL following links matched by link_regex
    """
    crawl_queue = [seed_url]
    # keep track which URL's have seen before
    # seen = set(crawl_queue)
    seen = {seed_url: 0}  # 初始化seed_url访问深度为0
    while crawl_queue:
        url = crawl_queue.pop()

        # 爬取前解析网站robots.txt，检查是否可以爬取网站，避免爬取网站禁止或限制的
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(seed_url+'/robots.txt')
        rp.read()
        user_agent = 'brain'
        if rp.can_fetch(user_agent,url): # 解析后发现如果可以正常爬取网站，则继续执行

            # 爬取网站的下载限速功能的类的调用，每次在download下载前使用
            throttle = Throttle(delay = 5) # 这里实例网站robots.txt中的delay值为5
            throttle.wait(url)

            html = download(url)
            html = html.decode('utf-8') # 需要转换为UTF-8 / html = str(html)
            # filter for links matching our regular expression
            if html == None:
                continue
            depth = seen[url]  # 用于避免爬虫陷阱的记录爬取深度的depth
            if depth != max_depth:
                for link in get_links(html):
                    if re.match(link_regex, link):
                        link = urllib.parse.urljoin(seed_url,link)
                        if link not in seen:
                            # seen.add(link)
                            seen[link] = depth + 1  # 在之前的爬取深度上加1
                            crawl_queue.append(link)
        else:
            print("Blocked by %s robots,txt" % url)
            continue


def get_links(html):
    """Return a list of links from html
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


# 只想找http://example.webscraping.com/places/default/index... or http://example.webscraping.com/places/default/view...
link_crawler('http://example.webscraping.com', '/places/default'+'/(index|view)')

运行结果：

Python爬虫一一网络爬虫简介

猜你喜欢