1. Create a new project:
scrapy startproject myproject
2. Create a new spider file in the new project:
scrapy genspider mydomain mydomain.com
Here mydomain becomes the spider's name and mydomain.com is the domain of the site to crawl.
3. Global commands:
startproject genspider settings runspider shell fetch view version
4. Commands available only inside a project (local commands):
crawl check list edit parse bench
5. Run the spider file:
scrapy crawl <spider>
5.1 Run the spider without printing the log:
scrapy crawl <spider> --nolog
6. Check the spiders for errors (runs contract checks):
scrapy check
7. List the available spiders in the project:
scrapy list
8. Edit the spider file:
scrapy edit <spider>
This opens the spider in the editor set by the EDITOR setting (vim by default), which is rarely convenient; editing in an IDE is usually more comfortable.
9. Download a page and print the response body in the terminal, much like fetching it with requests or urllib:
scrapy fetch <url>
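The comparison with urllib holds: `scrapy fetch` essentially downloads the body and prints it, as in this stdlib sketch (a `data:` URL is used here only to keep the example offline; normally you would pass a real site's URL):

```python
from urllib.request import urlopen

# `scrapy fetch <url>` does roughly this: download the body and print it
body = urlopen("data:text/html,<title>Demo</title>").read()
print(body.decode())
```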
10. Download a page and open it in the browser, to visually inspect the page as Scrapy sees it:
scrapy view <url>
11. Open an interactive Scrapy shell (IPython-like), useful for testing selectors:
scrapy shell [url]
12. Fetch a URL and parse it with the matching spider, printing the extracted items and requests:
scrapy parse <url> [options]
13. Print the value of a Scrapy setting:
scrapy settings [options]
For example:
$ scrapy settings --get BOT_NAME
scrapybot
14. Run a self-contained spider file without creating a project:
scrapy runspider <spider_file.py>
15. Display the scrapy version:
scrapy version [-v]
With -v, also print the versions of Scrapy's dependencies.
16. Run a quick benchmark to test how fast Scrapy can crawl on this machine:
scrapy bench