Crawler Advanced: Crawling the images of the whole 169ee site with CrawlSpider

CrawlSpider

Earlier, we used Scrapy's CrawlSpider to crawl a large amount of data from Qiushibaike (the "Embarrassments Encyclopedia").

However, that qiubai crawler did not take full advantage of CrawlSpider. In fact, in the qiubai crawler we simply used CrawlSpider as if it were an ordinary Spider.

CrawlSpider inherits from Spider and adds Rule and LinkExtractor, which let the framework automatically extract all qualifying links from a Response according to our rules and follow them. This makes CrawlSpider very suitable for crawling sites with a fairly regular structure: we only need to write the appropriate Rule and LinkExtractor objects, and we avoid writing a lot of link-extraction logic in the parse method.

By using these CrawlSpider features (Rule and LinkExtractor), we only need to focus on the pages we are actually interested in and write the parse methods that extract item data from their Responses; following and crawling further pages is something CrawlSpider handles just fine on its own.

Let's look at what Rule and LinkExtractor do.

First, the main attributes and methods of CrawlSpider:

rules: a list of Rule objects that define how qualifying links are extracted from a Response.
parse_start_url: CrawlSpider's default callback for the responses of the start URLs. When using CrawlSpider we should override this method rather than parse, because CrawlSpider uses the parse method internally for its own logic, so we must not override parse (as sketched below).
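
For example, here is a minimal sketch (the spider name and URLs are placeholders, not part of the 169ee project) that overrides parse_start_url and leaves parse untouched:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MinimalCrawlSpider(CrawlSpider):
    name = 'minimal_crawl'
    start_urls = ['http://example.com']

    rules = (
        # no callback, so these links are simply followed
        Rule(LinkExtractor(allow=(r'/category/',))),
    )

    # do NOT override parse(): CrawlSpider needs it to apply the rules
    def parse_start_url(self, response):
        # handle the Responses of the start_urls here
        self.logger.info('start page: %s', response.url)
        return []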

Rule takes the following parameters:

link_extractor: a LinkExtractor object that defines which links to extract. A link extractor pulls the qualifying links out of the returned Response.
callback: the callback function. When link_extractor finds a matching link, that link is requested and the resulting Response is passed to this function. Because CrawlSpider uses the parse method internally, our own callback generally must not be named parse. When callback is set, follow defaults to False, meaning the Response of the extracted link is handed to the callback and not followed any further.
follow: a boolean that decides whether links extracted by link_extractor should be followed. Following means CrawlSpider requests the extracted link, matches the new Response against all the rules again, extracts and follows once more, and keeps repeating these steps. When callback is None, follow defaults to True; otherwise it defaults to False (illustrated in the sketch after this list).
process_links: a function (or function name) used to filter the links extracted by link_extractor.
process_request: a function (or function name) used to filter the requests produced by this Rule.
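
The interplay between callback and follow is easiest to see in a small sketch; the URL patterns below are made-up placeholders, not the rules used later for 169ee:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # no callback: follow defaults to True, so these links are only followed
    Rule(LinkExtractor(allow=(r'/list/',))),
    # callback set: follow defaults to False, so the Response goes to parse_item and is not followed further
    Rule(LinkExtractor(allow=(r'/item/\d+\.html$',)), callback='parse_item'),
    # callback set and follow=True: the Response is parsed AND matched against the rules again
    Rule(LinkExtractor(allow=(r'/album/',)), callback='parse_album', follow=True),
)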

The main parameters of LinkExtractor are listed below; a quick way to try them out in the scrapy shell follows the list:

allow: a string or tuple of regular expressions; only links matching them are extracted, and if it is empty every link matches.
deny: a string or tuple of regular expressions; links matching them are never extracted. Takes precedence over allow.
allow_domains: a string or tuple; only links from these domains are extracted.
deny_domains: a string or tuple; links from these domains are never extracted.
restrict_xpaths: a string or tuple of XPath expressions; links are only looked for inside the matching regions, in combination with allow.
restrict_css: a string or tuple of CSS expressions; like restrict_xpaths but using CSS selectors, also combined with allow.
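
A handy way to experiment with these parameters is to build a LinkExtractor in the scrapy shell and call its extract_links method on the current response; the patterns below are only illustrative:

# inside `scrapy shell "http://www.169ee.com"` the variable `response` is already defined
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(
    allow=(r'/20\d{2}/',),                      # keep only links matching this regex
    deny=(r'\.gif$',),                          # never extract these, even if allow matches
    allow_domains=('www.169ee.com',),           # stay on this domain
    restrict_xpaths=('//div[@id="content"]',),  # only look for links inside this region
)
for link in le.extract_links(response):
    print(link.url, link.text)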

Analyzing the page structure

Browsing 169ee, we find that the site has, broadly speaking, three types of links, as shown:

[Image: the three types of links found on the site]

Writing Rules and LinkExtractors to handle these link types

So we can define different Rules in the rules variable to extract and follow these three types of links:

  1. For album list page links, the page is just a list of albums, so we only need to follow the link;
  2. For album links, the page contains pictures, i.e. the data we are interested in, so we should set a callback to process its Response and extract the data we need, rather than simply follow the link;
  3. For album pagination links, the page after paging is essentially the same as an album page, so we handle this type of link the same way as in 2.

Finally, the rules list we set up looks like this:

rules = (Rule(LinkExtractor(allow=(r'^http://www.169ee.com/[a-z]+$',))),
         Rule(LinkExtractor(
             allow=(r'^http://www.169ee.com/[a-z]*?/20[0-9]*?/[0-9]*?/[0-9]*?\.html$',
                    r'^http://www.169ee.com/[a-z]*?/20[0-9]*?/[0-9]*?/\d*?_\d*?\.html$')),
             callback='parse_album'))

For more details, see the source code on GitHub.

Writing the parse_album function to extract the images in an album

With the rules in place, CrawlSpider follows links automatically, so we can concentrate on writing the code that extracts the images.

Looking at the structure of an album page, we find that every p tag inside the div matched by the XPath expression //div[@id="content"]/div[@class="big-pic"]/div[@class="big_img"] contains one image (this can be checked in the scrapy shell, as shown below). For example,

[Image: HTML structure of an album page]
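
A quick way to verify this XPath is to open any album page in the scrapy shell and run the expression against the response (the URL below is only a placeholder):

# scrapy shell "http://www.169ee.com/<some-album-page>.html"   <- placeholder, use a real album URL
response.xpath('//div[@id="content"]/div[@class="big-pic"]/div[@class="big_img"]/p/img/@src').extract()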

Defining the Item

So we extract all the image links inside that div, take the page title as the album name, parse the year, month/day and album id out of the album URL, and parse the image id out of the image URL. The item we define looks like this:

import scrapy


class OnesixnineItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    _id = scrapy.Field()  # id of this image inside its album
    title = scrapy.Field()  # name of the album
    year = scrapy.Field()  # year the image was uploaded
    folder_num = scrapy.Field()  # which album of that year the image belongs to
    month_day = scrapy.Field()  # month and day the album was created
    url = scrapy.Field()  # url of the image
    album_id = scrapy.Field()
    album_url = scrapy.Field()

Extracting the images

With the item defined, we go on to extract the image information. The extraction function is written as follows:

import os
import re

from scrapy import Request  # used later when following album pagination
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from onesixnine.items import OnesixnineItem  # assumption: adjust this import to the actual project package


class OnesixnineSpider(CrawlSpider):
    name = "onesixnine"

    allowed_domains = ['www.169ee.com', '724.169pp.net']

    start_urls = ['http://www.169ee.com']

    year_monthday_albumid_pattern = re.compile(r'http://www\.169ee\.com/[a-z]+/(\d*?)/(\d*?)/(\d*?)\.html')

    image_year_foldernum_imageid_pattern = re.compile(r'http://724\.169pp\.net/169mm/(\d*?)/(\d*?)/(\d*?)\.[a-z]*?')

    rules = (Rule(LinkExtractor(allow=(r'^http://www.169ee.com/[a-z]+$',))),
             Rule(LinkExtractor(
                 allow=(r'^http://www.169ee.com/[a-z]*?/20[0-9]*?/[0-9]*?/[0-9]*?\.html$',
                        r'^http://www.169ee.com/[a-z]*?/20[0-9]*?/[0-9]*?/\d*?_\d*?\.html$')),
                 callback='parse_album'))

    def parse_album(self, response):
        img_srcs = response.xpath(
            '//div[@id="content"]/div[@class="big-pic"]/div[@class="big_img"]/p/img/@src').extract()
        if not img_srcs:
            return
        page_r_index = response.url.rfind('_')
        slash_r_index = response.url.rfind('/')
        if page_r_index > slash_r_index:
            # the current page URL contains a pagination suffix; strip it from the URL
            album_url_prefix, album_url_suffix = os.path.split(response.url)
            album_url = response.urljoin(re.compile(r'_\d+').sub('', album_url_suffix))
        else:
            album_url = response.url
        # parse the album's year, month/day and album id from the current page URL
        year_monthday_albumid_match = self.year_monthday_albumid_pattern.search(album_url)
        if not year_monthday_albumid_match:
            return
        year = year_monthday_albumid_match.group(1)
        month_day = year_monthday_albumid_match.group(2)
        album_id = year_monthday_albumid_match.group(3)
        for img_src in img_srcs:
            item = OnesixnineItem()
            item['title'] = response.xpath('/html/head/title/text()').extract_first()
            item['year'] = year
            image_year_foldernum_imageid_match = self.image_year_foldernum_imageid_pattern.search(img_src)
            if not image_year_foldernum_imageid_match:
                continue
            item['folder_num'] = image_year_foldernum_imageid_match.group(2)
            image_local_id = image_year_foldernum_imageid_match.group(3)
            item['_id'] = year + month_day + album_id + item['folder_num'] + image_local_id
            item['month_day'] = month_day
            item['album_id'] = album_id
            item['album_url'] = album_url
            item['url'] = img_src
            yield item

Writing a pipeline to save the images

After the items are extracted, Scrapy hands the items produced by the Spider over to the pipeline, so we can save the images there. At the same time we can store the image information in MongoDB for later browsing.

The pipeline is written as follows:

import logging
import os
import urllib.request

import pymongo

from onesixnine import settings  # assumption: adjust this import to the actual project package


class OnesixninePipeline(object):
    logger = logging.getLogger('OnesixninePipeline')
    connection = pymongo.MongoClient('localhost', 27017)

    def __init__(self):
        self.logger.info('pipeline init')
        self.db = self.connection.scrapy  # switch to the scrapy database
        self.collection = self.db.onesixnine  # get the onesixnine collection

    def process_item(self, item, spider):
        self.save_image(item)
        return item

    def save_image(self, item):
        if not item:
            return
        if os.path.exists(os.path.dirname(settings.DEFAULT_OUTPUT_FOLDER)):
            save_folder = settings.DEFAULT_OUTPUT_FOLDER
        else:
            save_folder = settings.CANDIDATE_DEFAULT_OUTPUT_FOLDER
        if not os.path.exists(save_folder):
            os.mkdir(save_folder)

        # get the image file extension
        ext = os.path.splitext(item['url'])[1]
        # build the file name to save the image under
        image_save_path = os.path.join(save_folder,
                                       item['year'] + '_' + item['month_day'] + '_' + item['folder_num'] + '_' + item[
                                           'album_id'] + '_' + item['_id'] + ext)
        urllib.request.urlretrieve(item['url'], image_save_path)
        if not self.connection:
            return item
        self.collection.insert_one(dict(item))  # convert the Item to a plain dict before handing it to MongoDB

    def __del__(self):
        self.logger.info('pipeline exit!')
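
For the pipeline to run at all, it has to be enabled in the project's settings.py. A minimal sketch is shown below; the module path onesixnine.pipelines and the two folder values are assumptions to adjust to the actual project layout (only the setting names DEFAULT_OUTPUT_FOLDER and CANDIDATE_DEFAULT_OUTPUT_FOLDER are taken from the pipeline code above).

# settings.py (sketch; adjust the module path and folder values to your project)
ITEM_PIPELINES = {
    'onesixnine.pipelines.OnesixninePipeline': 300,
}

# folders tried by OnesixninePipeline.save_image(); assumed example values
DEFAULT_OUTPUT_FOLDER = '/data/onesixnine/images'
CANDIDATE_DEFAULT_OUTPUT_FOLDER = './onesixnine_images'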

Crawling across album pages

We have missed one step. On an album page we extract all the image information on that page, but an album has many pages, so we have to follow the pagination or we will miss images in the album. We therefore extract the next-page link with the XPath expression //div[@id="content"]/div[@class="big-pic"]/div[@class="dede_pages"]/ul/li[last()]/a/@href. The pagination handling ends up as:

next_page_in_album = response.xpath('//div[@id="content"]/div[@class="big-pic"]/div[@class="dede_pages"]/ul/li[last()]/a/@href').extract_first()
if next_page_in_album and next_page_in_album != '#':
    next_page_url = response.urljoin(next_page_in_album)
    yield Request(next_page_url, callback=self.parse_album)

Run and check the results

Run the crawler with scrapy crawl onesixnine and check the results for yourself.

Source code

The source code is on GitHub.

If you like it, feel free to star the repo; issues and pull requests are welcome.

If you like this, you can also follow my WeChat official account:
[Image: WeChat official account QR code]
