Python Scrapy Whole-Site Crawler

Basic commands:

scrapy startproject test2                     create a project
scrapy genspider test www.abc.com             create a spider based on scrapy.Spider
scrapy genspider -t crawl test www.abc.com    create a spider based on CrawlSpider
scrapy crawl test -o test.json                run the spider "test" and save the scraped items to test.json
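For reference, scrapy genspider test www.abc.com produces a spider skeleton roughly like the one below (the exact template varies a little between Scrapy versions):

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.abc.com']
    start_urls = ['http://www.abc.com/']

    def parse(self, response):
        # genspider leaves this empty; parsing logic goes here
        pass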


The code for scraping app information from the Baidu App Store (as.baidu.com) is as follows:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['as.baidu.com']
    start_urls = ['https://as.baidu.com/']

    rules = (
        # List pages: follow them to find more links, but do not parse them.
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/',
                           deny=r'https://as\.baidu\.com/software/\d+\.html'),
             follow=True),
        # App detail pages: parse them with parse_item and keep following links.
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/\d+\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        return item
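
In practice parse_item would usually extract more than the URL. A minimal sketch of an extended callback; the CSS selectors here (h1.app-name, .download-num) are made-up placeholders, not verified against the real as.baidu.com markup:

    def parse_item(self, response):
        # NOTE: the selectors below are illustrative assumptions only
        yield {
            'url': response.url,
            'name': response.css('h1.app-name::text').extract_first(),
            'downloads': response.css('.download-num::text').extract_first(),
        }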

How the whole-site crawler works:

    1. Download the pages listed in start_urls; call the result content A.

    2. From content A, extract every link that matches one of the rules.

    3. Download each matched link to get content B. If the matching rule defines a callback, it is called on content B; if the rule has follow=True, the rules are applied to content B as well and step 3 repeats (see the LinkExtractor sketch below).
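
A minimal sketch of step 2, showing how a LinkExtractor pulls rule-matching links out of downloaded page content. The HTML here is an invented one-link page, not a real as.baidu.com response:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Fake "content A": a page containing a single link to an app detail page.
html = b'<html><body><a href="https://as.baidu.com/software/123.html">demo</a></body></html>'
response = HtmlResponse(url='https://as.baidu.com/', body=html, encoding='utf-8')

# The same pattern used in the detail-page Rule above.
extractor = LinkExtractor(allow=r'https://as\.baidu\.com/software/\d+\.html')
print([link.url for link in extractor.extract_links(response)])
# ['https://as.baidu.com/software/123.html']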



Note:

When rules contains several rules, a URL is handled only by the first rule whose LinkExtractor matches it; later rules never see that URL. The example above depends on this.

If the rules are written as follows, no item data is obtained: the detail-page URLs already match the first rule (which has no callback), so the second rule never gets to call parse_item.

    rules = (
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/'), follow=True),
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/\d+\.html'), callback='parse_item', follow=True),
    )
The data can be obtained by adding a deny pattern to the first rule, so that detail-page URLs are only matched by the second rule (this is the configuration used in the spider above):
    rules = (
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/',
                           deny=r'https://as\.baidu\.com/software/\d+\.html'),
             follow=True),
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/\d+\.html'), callback='parse_item', follow=True),
    )
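
Another fix, sketched here under the same first-match-wins behaviour (this variant is not from the original post): list the detail-page rule first, so those URLs are claimed by the rule that has the callback before the broader rule can take them.

    rules = (
        # Detail-page rule first: parse_item is called for these URLs.
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/\d+\.html'), callback='parse_item', follow=True),
        # Broader rule second: it only handles the remaining list pages.
        Rule(LinkExtractor(allow=r'https://as\.baidu\.com/software/'), follow=True),
    )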











