SitemapSpider lets us crawl a site through the Sitemap files it exposes. A Sitemap file lists the URLs throughout the site, along with each URL's last modification time, update frequency, and weight (importance). Common Sitemap formats are TXT, XML and HTML; most sites use XML. As an example, you can look at the Sitemap file of the CSDN website.
Let's explain the meaning of each node in the Sitemap file:
- loc: the full URL of the page;
- lastmod: the last modification time;
- changefreq: the update frequency;
- priority: the weight (importance) of the URL.
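Putting these nodes together, a single `<url>` entry in an XML Sitemap might look like the sketch below (the URL and the values are hypothetical, not taken from CSDN's actual Sitemap):

```xml
<url>
    <loc>https://example.com/page.html</loc>
    <lastmod>2019-01-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
</url>
```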
Now let's explain the commonly used attributes of SitemapSpider:
- sitemap_urls: a list of Sitemap URLs to crawl;
- sitemap_rules: a list of tuples in the format (regex, callback), where regex can be a regular expression or a string, and callback is the callback function used to process the URLs matched by regex;
- sitemap_follow: a list of regular expressions specifying which Sitemaps to follow;
- sitemap_alternate_links: whether to follow alternate links for a URL; they are not followed by default. Here an alternate link is an optional alternate URL for the same page, in the general format:
<url>
    <loc>http://aaa.com</loc>
    <!-- alternate URL / optional link -->
    <xhtml:link rel="alternate" hreflang="en" href="http://aaa.com/en"/>
</url>
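To see how sitemap_rules selects a callback, here is a minimal sketch of the matching logic in plain Python, using the re module. The rules and URLs are hypothetical; in Scrapy the first rule whose regex matches a URL wins:

```python
import re

# Hypothetical rules in the (regex, callback) format used by sitemap_rules.
sitemap_rules = [
    ('/blog/', 'parse_blog'),
    ('/news/', 'parse_news'),
    ('', 'parse'),  # an empty pattern matches every URL: a catch-all fallback
]

def pick_callback(url):
    """Return the callback name for the first rule whose regex matches url."""
    for regex, callback in sitemap_rules:
        if re.search(regex, url):
            return callback
    return None

print(pick_callback('https://example.com/blog/post-1'))  # -> parse_blog
print(pick_callback('https://example.com/about'))        # -> parse
```

Because rules are tried in order, more specific patterns should be listed before the catch-all.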
Example
Here we look at how to use SitemapSpider to crawl CSDN through its Sitemap.
from scrapy.spiders import SitemapSpider
from ..items import CsdnItem

class CsdnSpider(SitemapSpider):
    name = 'csdn_spider'
    sitemap_urls = ['https://www.csdn.net/sitemap.xml']
    sitemap_rules = [
        ('beautifulsoup4', 'parse'),  # URLs matching 'beautifulsoup4' go to parse()
    ]

    def parse(self, response):
        docs = response.css('.local-toc li')
        for doc in docs:
            item = CsdnItem()
            item["title"] = doc.css(".reference::text").extract_first()
            item["url"] = doc.css(".reference::attr(href)").extract_first()
            yield item
The corresponding Item is defined in items.py:

import scrapy

class CsdnItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()