Scrapy commands

Create project:
    scrapy startproject myproject
    cd myproject
    Create a spider (name first, then the site's domain):
        scrapy genspider spidername spiderurl.com

See all commands:
    scrapy -h

Global commands (work with or without a project):
    startproject
    genspider
    settings
    runspider
    shell
    fetch
    view
    version

Project-only commands (must be run inside a project):
    crawl
    check
    list
    edit
    parse
    bench
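
Project-only commands have to be run from inside a project, which in practice means a directory that has a scrapy.cfg file in it or in one of its parents. A minimal standard-library sketch of that lookup follows; find_scrapy_cfg is a hypothetical helper written for this note, not part of Scrapy's public API.

```python
from pathlib import Path
from typing import Optional


def find_scrapy_cfg(start: Path) -> Optional[Path]:
    """Return the nearest scrapy.cfg at or above `start`, or None if there is none."""
    start = start.resolve()
    # Check the starting directory first, then walk up through its parents.
    for directory in (start, *start.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    cfg = find_scrapy_cfg(Path.cwd())
    print("inside a project" if cfg else "no project found: global commands only")
```

If the helper returns None, commands such as crawl or list are not expected to be available from that directory.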

Create a project (startproject):
    scrapy startproject myproject
View templates:
    scrapy genspider -l
    (basic, crawl, csvfeed, xmlfeed)
Create a spider in the current project (the default template is basic):
    scrapy genspider [-t template] <spiderName> <spiderUrl>
Run a spider:
    scrapy crawl myspidername
Save the scraped items to a file (.json, .xml, .jl, .csv, ...):
    scrapy crawl myspider -o fileName.json
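
The same export can also be configured in the project instead of on the command line. The fragment below is a sketch of a settings.py entry, assuming Scrapy 2.1 or newer (where the FEEDS setting exists); the file names are placeholders.

```python
# settings.py (fragment)
# Assumed to behave roughly like `scrapy crawl myspider -o fileName.json`.
# Each key is an output path; each value describes how to serialize the items.
FEEDS = {
    "fileName.json": {"format": "json", "encoding": "utf8"},
    "fileName.jl": {"format": "jsonlines"},
}
```

Note that -o appends to an existing file, which can leave a .json file invalid after a second run; recent Scrapy versions also accept -O to overwrite it.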
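Run the spiders' contract checks (-l only lists the contracts):
    scrapy check [-l] [spider]
Download a page with the Scrapy downloader and print its content:
    scrapy fetch <url>
Open a page in the browser as Scrapy sees it:
    scrapy view <url>
Interactive shell (the url is optional):
    scrapy shell [url]
Fetch a page and parse it with the spider that handles it:
    scrapy parse <url> [options]
View the value of a setting (settings):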
    scrapy settings --get BOT_NAME
    scrapy settings --get DOWNLOAD_DELAY
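
Inside a project, these lookups return what the project's settings.py defines; outside a project they fall back to Scrapy's defaults. A sketch of the matching settings.py lines, with hypothetical values:

```python
# settings.py (fragment, hypothetical values)
BOT_NAME = "myproject"   # printed by `scrapy settings --get BOT_NAME`
DOWNLOAD_DELAY = 0.5     # seconds to wait between requests to the same site
```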
Run a spider from a single Python file, without creating a project (runspider):
    scrapy runspider myspider.py

Selector usage (.get() and .getall() are the newer names for .extract_first() and .extract()):
    Get the text of the title tag (first match)
        response.selector.xpath('//title/text()').extract_first()
        response.css('title::text').extract_first()
    Get the text of the title tag (all matches)
        response.selector.xpath('//title/text()').extract()

    Get the text of a child tag, given this HTML:
        <div id="images">
            <a></a>
        </div>
        response.xpath('//div[@id="images"]/a/text()').extract_first()
    Get attributes
        href attribute of the base tag
            response.xpath('//base/@href').extract()
            response.css('base::attr(href)').extract()
        href attributes that contain "image"
            response.css('a[href*=image]::attr(href)').extract()
            response.xpath('//a[contains(@href,"image")]/@href').extract()
        src attribute of img tags inside a tags whose href contains "image"
            response.xpath('//a[contains(@href,"image")]/img/@src').extract()
            response.css('a[href*="image"] img::attr(src)').extract()
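    Regular expressions on selectors
        response.xpath(...).re('Name:(.*)') returns all matches as a list; .re_first() returns only the first one
    When nothing matches
        .extract_first() and .re_first() return None
        Pass a default to change that: .extract_first(default='custom return')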

  

