[Scrapy Notes] How to use

 

Installation:

  1. pip install wheel (installs wheel)

  2. Install Twisted
    a. Visit http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download the wheel that matches your Python version and architecture, e.g. Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    b. From the directory where the file was saved, run pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl (substitute the exact filename you downloaded)

  3. pip3 install scrapy (installs Scrapy)
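
  A quick sanity check that the installation worked (a minimal sketch; it just imports both packages and prints their versions):

    import scrapy
    import twisted

    print("Scrapy:", scrapy.__version__)
    print("Twisted:", twisted.__version__)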

 

Usage:

  1. scrapy startproject project name (create the project)
  2. cd project name (enter the project directory)
  3. scrapy genspider xxx xxx.com (create a crawler file; for example, to create an oppo crawler: scrapy genspider oppo www.oppo.cn; a sketch of the generated file follows this list)
  4. scrapy crawl xxx (run the crawler from inside the project directory; for example, to run the oppo crawler: scrapy crawl oppo)
  5. scrapy crawl xxx -o file.json (run the crawler and write the scraped items to the given file; useful for debugging)
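
  For reference, scrapy genspider oppo www.oppo.cn produces roughly the following oppo.py (a sketch; the exact template varies slightly between Scrapy versions):

    # -*- coding: utf-8 -*-
    import scrapy


    class OppoSpider(scrapy.Spider):
        name = 'oppo'                      # the name used with "scrapy crawl oppo"
        allowed_domains = ['www.oppo.cn']  # requests outside this domain are filtered out
        start_urls = ['http://www.oppo.cn/']

        def parse(self, response):
            # default callback, called once for each downloaded start URL
            pass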

  Note: in settings.py, comment out ROBOTSTXT_OBEY = True (it controls "identify whether crawling is allowed", i.e. whether Scrapy obeys the site's robots.txt; only disable it if the project does nothing otherwise infringing). If the
  site's domain name includes www, you need to include the www; and if the page uses SSL-encrypted transport, change http to https in oppo.py.
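
  The relevant line in settings.py then looks like this (a sketch; commenting the line out or setting it to False both stop Scrapy from checking robots.txt):

    # ROBOTSTXT_OBEY = True
    ROBOTSTXT_OBEY = False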

 

Debugging method:

   

  scrapy shell domain name
  scrapy shell -s USER_AGENT="Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5" domain name
    Note: this adds a request header while debugging; the value after USER_AGENT= must be wrapped in English double quotation marks, single quotes will raise an error
  scrapy shell https://www.oppo.cn
  scrapy shell "https://www.oppo.cn/topic/index/thread.json?page=1&limit=20&type=3&id=856"
    Note: a & in the URL is interpreted by the system shell, so wrap the whole URL in quotes. If the full URL still fails to connect, remove "https://" and try: scrapy shell "www.oppo.cn/topic/index/thread.json?page=1&limit=20&type=3&id=856"
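
  Once the shell opens, the downloaded page is bound to the name response. A sketch of a typical session (the title selector and the second URL are made-up examples):

    >>> response.status                              # HTTP status code of the fetched page
    >>> response.css('title::text').extract_first()  # try a selector interactively
    >>> fetch('https://www.oppo.cn/')                # download another URL in the same shell
    >>> view(response)                               # open the current response in a browser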

  CSS selector usage

  response.css("#ID dt::text").extract()                extract text information
  response.css('.class p::attr(href)').extract()        extract attribute information and return all matches
  response.css('.class p::attr(href)').extract_first()  extract attribute information and return only the first match
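
  Put together inside a spider callback, the selectors are used roughly like this (a sketch; #content and .post are made-up selectors, not taken from any real page):

    import scrapy


    class DemoSpider(scrapy.Spider):
        name = 'demo'
        start_urls = ['https://www.oppo.cn/']

        def parse(self, response):
            # all matching text nodes, returned as a list of strings
            titles = response.css('#content dt::text').extract()
            # first matching href, or None when nothing matches
            first_link = response.css('.post a::attr(href)').extract_first()
            yield {'titles': titles, 'first_link': first_link}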

 

  Spider note:
    allowed_domains = ['qiushibaike.com'] is the correct matching rule
    Wrong demonstrations:
    1. allowed_domains = ['www.qiushibaike.com'] adds www in front of the main domain name
    2. allowed_domains = ['qiushibaike.com/text'] adds an extra path after the main domain name
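
  A minimal sketch of a spider that follows this rule (the start URL is an assumed example):

    import scrapy


    class QiushiSpider(scrapy.Spider):
        name = 'qiushi'
        # correct: the bare registered domain only, no www and no path
        allowed_domains = ['qiushibaike.com']
        # schemes, subdomains and paths belong in start_urls instead
        start_urls = ['https://www.qiushibaike.com/text/']

        def parse(self, response):
            pass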

 

Create crawl_spider

  1. scrapy startproject project name
  2. cd project name
  3. scrapy genspider -t crawl xxx 'domain name' (a sketch of the generated template follows)
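
  The -t crawl template generates roughly the following spider (a sketch; the allow pattern is just the template's placeholder):

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class XxxSpider(CrawlSpider):
        name = 'xxx'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        # follow every link matched by the LinkExtractor and send each page to parse_item
        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = {}
            return item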

 

About how to call settings:

   In the spider.py file, you can read any setting directly with self.settings.get('XXX')
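
   For example (a sketch; DOWNLOAD_DELAY is just one setting you might read):

    import scrapy


    class SettingsDemoSpider(scrapy.Spider):
        name = 'settings_demo'
        start_urls = ['https://www.oppo.cn/']

        def parse(self, response):
            # self.settings is available once the spider is bound to its crawler
            delay = self.settings.get('DOWNLOAD_DELAY')
            self.logger.info('DOWNLOAD_DELAY = %s', delay)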

         

 

   

 
