Scrapy spider templates: SitemapSpider

SitemapSpider lets us crawl a site by following the URLs listed in its Sitemap file. A Sitemap file lists the URLs of a site and can record, for each URL, the time of its last update, its update frequency, and its weight (importance). Common Sitemap file formats are TXT, XML, and HTML; most sites publish theirs as XML. Let's look at the Sitemap file of the CSDN website.
Figure: CSDN website Sitemap file
The nodes in the Sitemap file have the following meanings:

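  1. loc: the full URL of the page;
  2. lastmod: the time the page was last modified;
  3. changefreq: how often the page is expected to change;
  4. priority: the weight of this URL relative to the other URLs on the site.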
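
To make the four nodes concrete, here is a minimal sketch that reads them from a sitemap with Python's standard library. The sitemap fragment, its URL, and the helper name read_entries are assumptions made up for illustration; they are not taken from the CSDN file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap fragment; the URL and values are invented for illustration.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/article/1</loc>
    <lastmod>2020-01-09</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

# Namespace used by the sitemap protocol for its own elements.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_entries(xml_text):
    """Return one dict per <url> node holding the four fields described above."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


if __name__ == "__main__":
    for entry in read_entries(SITEMAP_XML):
        print(entry)
```

All values come back as strings (or None when a node is absent), so a real crawler would still need to convert lastmod and priority before comparing them.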

The commonly used attributes of SitemapSpider are:

  1. sitemap_urls: a list of URLs pointing to the Sitemaps whose URLs are to be crawled;
  2. sitemap_rules: a list of tuples in the format (regex, callback). regex is matched against the URLs extracted from the Sitemap and can be a compiled regular expression or a string. callback is the callback function used to process the URLs that match;
  3. sitemap_follow: a list of regular expressions specifying which Sitemaps should be followed;
  4. sitemap_alternate_links: specifies whether the alternate links of a url should also be followed; the default is not to follow them. An alternate link is another URL for the same page given inside the same url block. The general format is:
<url>
  <loc>http://aaa.com</loc>
  <!-- alternate URL / optional link -->
  <xhtml:link rel="alternate" hreflang="en" href="http://aaa.com/en"/>
</url>
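
The effect of following alternate links can be sketched with the standard library alone. This is not Scrapy's implementation, only an illustration of the idea: the function name candidate_urls and its alternate_links flag are hypothetical, and the namespace declarations on urlset are added here because a parser needs them to read the xhtml:link element.

```python
import xml.etree.ElementTree as ET

# The <url> block from the text, wrapped in a <urlset> that declares both namespaces.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://aaa.com</loc>
    <xhtml:link rel="alternate" hreflang="en" href="http://aaa.com/en"/>
  </url>
</urlset>
"""

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def candidate_urls(xml_text, alternate_links=False):
    """Collect <loc> URLs and, optionally, the href of each alternate link."""
    root = ET.fromstring(xml_text)
    urls = []
    for url in root.findall("sm:url", NS):
        urls.append(url.findtext("sm:loc", namespaces=NS))
        if alternate_links:
            # Hypothetical flag mirroring the idea behind the spider attribute.
            for link in url.findall("xhtml:link", NS):
                if link.get("rel") == "alternate" and link.get("href"):
                    urls.append(link.get("href"))
    return urls


if __name__ == "__main__":
    print(candidate_urls(SITEMAP_XML))
    print(candidate_urls(SITEMAP_XML, alternate_links=True))
```

With the flag off only http://aaa.com is collected; with it on, http://aaa.com/en is collected as well.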

Example

Let's look at how to use SitemapSpider to crawl CSDN through its Sitemap.

from scrapy.spiders import SitemapSpider
from ..items import CsdnItem


class CsdnSpider(SitemapSpider):
    name = 'csdn_spider'
    # Sitemap(s) from which the URLs to crawl are extracted
    sitemap_urls = ['https://www.csdn.net/sitemap.xml']
    # Only URLs matching the regex 'beautifulsoup4' are sent to parse()
    sitemap_rules = [
        ('beautifulsoup4', 'parse')
    ]

    def parse(self, response):
        # Each entry of the page's table of contents becomes one item
        docs = response.css('.local-toc li')
        for doc in docs:
            item = CsdnItem()
            item["title"] = doc.css(".reference::text").get()
            item["url"] = doc.css(".reference::attr(href)").get()
            yield item

# items.py: the item populated by the spider above
import scrapy


class CsdnItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
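
How a rule such as ('beautifulsoup4', 'parse') selects URLs can be sketched without Scrapy. The sketch assumes the behaviour described in the Scrapy documentation: rules are tried in order and only the first match is used. The function name pick_callback and the two URLs are hypothetical and exist only for illustration.

```python
import re

# The rule list from the example: (regex, callback name).
RULES = [("beautifulsoup4", "parse")]


def pick_callback(url, rules):
    """Return the callback name of the first rule whose regex is found in url."""
    for pattern, callback in rules:
        # The pattern is searched anywhere in the URL, not anchored at the start.
        if re.search(pattern, url):
            return callback
    return None  # no rule matched, so the URL would not be requested


if __name__ == "__main__":
    # Hypothetical URLs, not taken from the CSDN Sitemap.
    for url in (
        "https://example.com/docs/beautifulsoup4/index.html",
        "https://example.com/docs/other/index.html",
    ):
        print(url, "->", pick_callback(url, RULES))
```

Because only the first matching rule applies, more specific patterns should be listed before more general ones.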
