Basic commands:
scrapy startproject test2                    # create a project named test2
scrapy genspider test www.abc.com            # create a spider based on scrapy.Spider
scrapy genspider -t crawl test www.abc.com   # create a spider based on CrawlSpider
scrapy crawl test -o test.json               # run the spider "test" and save the scraped items to test.json
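For reference, scrapy startproject test2 generates the standard project skeleton described in the Scrapy tutorial:

test2/
    scrapy.cfg            # deploy configuration file
    test2/                # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py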
The code for scraping app information from the Baidu App Store is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['as.baidu.com']
    start_urls = ['https://as.baidu.com/']

    rules = (
        # Follow listing/category pages under /software/, but exclude the
        # app detail pages so they are left for the second rule to claim.
        Rule(LinkExtractor(allow=r'https://as.baidu.com/software/',
                           deny=r'https://as.baidu.com/software/\d+\.html'),
             follow=True),
        # App detail pages: parse them and keep following their links.
        Rule(LinkExtractor(allow=r'https://as.baidu.com/software/\d+\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        return item
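The parse_item above only records the URL. In practice it would extract real fields; here is a minimal sketch, assuming hypothetical CSS selectors for the app name and download count (the actual selectors depend on the page's HTML and must be checked in the browser):

    def parse_item(self, response):
        # The selectors below are illustrative placeholders, not the real
        # markup of as.baidu.com; inspect the page and adjust them.
        item = {}
        item['url'] = response.url
        item['name'] = response.css('h1.app-name::text').get()
        item['downloads'] = response.css('span.download-num::text').get()
        return item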
How a whole-site (CrawlSpider) crawl works:
1. First, fetch the content of each page listed in start_urls (call it page A).
2. Match the link patterns in rules against the links found in page A.
3. Fetch the content of each matched link (call it page B). If the matching rule has a callback set, that callback is invoked on page B. If the rule has follow=True, the rules are matched again against the links in page B, and step 3 repeats.
Note:
When a URL matches more than one rule in rules, only the first matching rule is applied; later rules never see that URL. This is why the example above is structured the way it is.
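Both the follow behaviour in step 3 and this first-match-wins behaviour can be seen in a simplified sketch of what CrawlSpider does for every downloaded page. This is loosely modelled on Scrapy's internal _requests_to_follow, not the actual source:

import scrapy

def requests_to_follow(spider, response):
    seen = set()
    for rule in spider.rules:                       # rules are tried in order
        for link in rule.link_extractor.extract_links(response):
            if link.url in seen:
                continue                            # already claimed by an earlier rule
            seen.add(link.url)
            # A matched link is downloaded; the rule's callback (if any) parses
            # the result, and with follow=True the rules are applied to that
            # page in turn. (Simplified: real Scrapy resolves string callbacks
            # like 'parse_item' to spider methods.)
            yield scrapy.Request(link.url, callback=rule.callback)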
If rules is instead written as follows, no data is extracted, because the first rule (with no deny pattern) also matches the detail pages https://as.baidu.com/software/\d+\.html; those URLs are claimed by a rule with no callback, so parse_item is never called:
rules = (
    Rule(LinkExtractor(allow=r'https://as.baidu.com/software/'), follow=True),
    Rule(LinkExtractor(allow=r'https://as.baidu.com/software/\d+\.html'),
         callback='parse_item', follow=True),
)
It must be written with a deny pattern on the first rule, so that the detail pages fall through to the second rule:
rules = (
    Rule(LinkExtractor(allow=r'https://as.baidu.com/software/',
                       deny=r'https://as.baidu.com/software/\d+\.html'),
         follow=True),
    Rule(LinkExtractor(allow=r'https://as.baidu.com/software/\d+\.html'),
         callback='parse_item', follow=True),
)
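To sanity-check which links each rule would claim on a given page, the extractors can be run by hand, for example inside scrapy shell https://as.baidu.com/, where response is predefined:

from scrapy.linkextractors import LinkExtractor

listing = LinkExtractor(allow=r'https://as.baidu.com/software/',
                        deny=r'https://as.baidu.com/software/\d+\.html')
detail = LinkExtractor(allow=r'https://as.baidu.com/software/\d+\.html')

# extract_links() returns the Link objects a Rule built on that extractor would see
print(len(listing.extract_links(response)), 'listing links')
print(len(detail.extract_links(response)), 'detail links')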