Python 3 Crawler: Commonly Used Commands of the Scrapy Framework

Table of Contents

Global commands

fetch command

runspider command

settings command

shell command

startproject command

version command

view command

Project commands

bench command

genspider command

check command

crawl command

list command

edit command

parse command


Global commands

Global commands can be used both inside and outside a Scrapy project directory. The available global commands are listed below.

  •   bench         Run quick benchmark test
  •   fetch         Fetch a URL using the Scrapy downloader
  •   genspider     Generate new spider using pre-defined templates
  •   runspider     Run a self-contained spider (without creating a project)
  •   settings      Get settings values
  •   shell         Interactive scraping console
  •   startproject  Create new project
  •   version       Print Scrapy version
  •   view          Open URL in browser, as seen by Scrapy

fetch command

The fetch command is mainly used to display the process of the Scrapy downloader fetching a web page.
For example, we can crawl a given URL with a command of the form scrapy fetch <URL>. Take crawling the Baidu homepage (http://www.baidu.com) as an example, as shown below.
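A minimal sketch of that invocation (run from any directory where Scrapy is installed; the downloaded HTML and the crawl log are printed to the console):

scrapy fetch http://www.baidu.com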

At this point, if you use this command outside the Scrapy project directory, Scrapy's default crawler will be invoked to crawl the web page. If you use this command in a project directory of Scrapy, the crawler in the project will be called to crawl the web page.
When we use the fetch command, we can also use certain parameters for corresponding control.
You can use scrapy fetch -h to list all available fetch-related parameters.
For example, we can use the --headers parameter to display the response header information when the crawler fetches the web page, or use the --nolog parameter to suppress the log output. We can also use the --spider=SPIDER parameter to specify which crawler to use, the --logfile=FILE parameter to specify the file in which to store the log, and the --loglevel=LEVEL parameter to control the log level.
As shown below, we use the --headers and --nolog parameters together so that the crawler displays the header information of the Sina News homepage (http://news.sina.com.cn/) without printing any log output.
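A sketch of that invocation with the two flags described above:

scrapy fetch --headers --nolog http://news.sina.com.cn/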

runspider command

Through the runspider command in Scrapy, we can run a crawler file directly, without creating a Scrapy project.
Below is an example of using the runspider command to run a crawler file. First, write a Scrapy crawler file, as shown below.

from scrapy.spiders import Spider

class FirstSpider(Spider):
    # Name used to identify this crawler (e.g. by scrapy crawl / scrapy list)
    name = 'first'
    # Only URLs under these domains will be followed
    allowed_domains = ['baidu.com']
    # URLs the crawler starts from
    start_urls = ['https://www.baidu.com',]

    def parse(self, response):
        # Default callback: process the downloaded response here
        pass

Here, you only need a rough understanding of the crawler file, because we will learn how to write crawler files in detail later. For now, note that the name of the crawler is defined as first, and the starting URL of the crawl is defined as https://www.baidu.com.
Then, you can use the runspider command to run the crawler file directly. At this time, you don't need to rely on a complete Scrapy project to run, you only need to have the corresponding crawler file.
As shown below, we ran the crawler file through scrapy runspider and set the log level to INFO.
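A sketch of that invocation, assuming the crawler above is saved as first.py (the file name is an assumption for illustration):

scrapy runspider --loglevel=INFO first.py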

It can be seen that, through this command, the crawler file ran to completion successfully without relying on a Scrapy project.

settings command

We can view Scrapy's configuration information through the settings command.
If you use the settings command inside a Scrapy project directory, you will see the configuration of that project; if you use the settings command outside a Scrapy project directory, you will see Scrapy's default configuration.
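For example, a minimal sketch that reads a single setting (the --get option prints one value; BOT_NAME is simply a convenient setting to illustrate with):

scrapy settings --get BOT_NAME

Inside a project this prints the project's bot name; outside a project it falls back to the default value.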

shell command

Scrapy's interactive terminal (the Scrapy shell) can be started with the shell command.
The Scrapy shell is often used during development and debugging: it lets us debug a website's responses without starting a Scrapy crawler, and we can also write and test Python code directly in it.
For example, you can use the shell command to open an interactive session for crawling the Baidu homepage, with log output disabled, as shown below:
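A sketch of that invocation:

scrapy shell http://www.baidu.com --nolog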
 

It can be seen that after executing this command, the available Scrapy objects and shortcuts, such as item, response, settings, spider, and so on, are listed, and we enter interactive mode. After the ">>>" prompt, you can type interactive commands and the corresponding code.
In this interactive mode, we can extract the title of the crawled web page, in this case through an XPath expression. Readers may not yet be familiar with XPath expressions; we will explain the basics of XPath later. For now, you only need to know the meaning of the XPath expression "/html/head/title": extract the information in the <title> tag inside the <head> tag under the <html> tag of the web page. Since the information in that tag is the title of the web page, the purpose of the XPath expression "/html/head/title" is to extract the title of the crawled web page.
As shown below, we extract the corresponding information through sel.xpath and output it with Python code.

>>> ti = sel.xpath("/html/head/title")
>>> print(ti)
[<Selector xpath='/html/head/title' data='<title>Baidu, you will know</title>'>]
>>>
You can see that the content after data is the extracted data; the title "<title>Baidu, you will know</title>" has been successfully extracted.
In addition, we can also perform various development and debugging in the interactive terminal.
If we want to exit the interactive terminal, we can use exit() to achieve, as shown below:
>>> exit()
D:\Python35\myweb\part12>
Above we analyzed how to use the shell command in Scrapy. Learning to use the shell command can greatly facilitate the development and debugging of crawlers, because it lets us develop and debug a crawler directly without creating a Scrapy project.

startproject command

The startproject command was analyzed in detail in the previous section; it is mainly used to create projects.
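As a quick reminder, a minimal sketch that creates a project (using the myfirstpjt project name that appears later in this article):

scrapy startproject myfirstpjt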

version command

Through the version command, you can directly display Scrapy's version information.
For example, if you want to view the version of Scrapy, you can use the following command:
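For instance (the version number printed depends on your installation):

scrapy version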

If you also want to view other version information related to Scrapy (including, of course, the Scrapy version itself), such as the versions of related components like lxml, Twisted, Python, and the platform, you can add the -v parameter to the version command, as shown below:
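A sketch of that invocation (the exact component list and versions in the output depend on the installed environment):

scrapy version -v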

As you can see, the version information related to Scrapy has been displayed in detail at this time.

view command

Through the view command, we can download a web page and view it in a browser.
For example, we can download the Baidu homepage (http://www.baidu.com/) with the following command and automatically open the downloaded page in a browser.
(venv) D:\pycharmproject\pythonlearning>scrapy view http://www.baidu.com/
After executing this command, the browser opens automatically and displays the page that has been downloaded locally (note that because the web page has been downloaded to the local machine, the URL shown at this point is a local file address).

Project commands

Next, we analyze in detail the use of Scrapy project commands.
Since the Scrapy project commands can only be used inside a Scrapy crawler project, we first enter an already created Scrapy crawler project, as shown below.
 

(venv) D:\pycharmproject\pythonlearning>cd myfirstpjt

Then use scrapy -h to view the commands available within the project.

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

bench command

Use the bench command to test the performance of the local hardware.
When we run scrapy bench, a local server is created and crawled at the maximum possible speed. In order to test the performance of the local hardware and avoid the influence of too many other factors, the benchmark only follows links and does no content processing.
As shown below, we used scrapy bench to test the performance of local hardware.
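A sketch of the invocation (the benchmark spins up its own local server, so no URL is needed; the pages-per-minute figure in the output varies by machine):

scrapy bench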
 

It can be seen from the test results that, in terms of hardware performance alone, roughly several thousand web pages can be crawled per minute. This is only a reference figure: when a crawler project actually runs, the speed will differ due to various factors. Generally speaking, you can compare the actual running speed with this reference speed to help optimize and improve the crawler project.

genspider command

You can use the genspider command to create Scrapy crawler files; it is a quick way to create them. The command generates a new crawler file directly from an existing crawler template, which is very convenient. Although genspider also appears in the global command list, it is typically used inside a Scrapy project directory so that the new crawler file is created within the project.
You can use the -l parameter of this command to view the currently available crawler templates, as shown below.
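A sketch of that invocation and, assuming a standard Scrapy installation, the kind of output it produces:

scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed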

As you can see, the currently available crawler templates are basic, crawl, csvfeed, and xmlfeed.
At this point, a crawler file can be generated based on any one of these templates. For example, we can use the basic template to generate a crawler file with a command of the format "scrapy genspider -t <template> <new crawler name> <domain to crawl>", as shown below.
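A sketch of such an invocation, using the weisuen crawler name referenced in the next section; the domain here is only a placeholder assumption for illustration:

scrapy genspider -t basic weisuen baidu.com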
 

check command

Crawler testing can be troublesome, so Scrapy uses contracts to test crawlers.
Use the check command in Scrapy to perform contract checking on a crawler file.
For example, to run a contract check on the crawler file weisuen.py that we just created from the template, we can use "scrapy check <crawler name>". Note that the name after "check" is the crawler name, not the crawler file name, so it has no suffix, as shown below.
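A sketch of that check, assuming the crawler created above is named weisuen:

scrapy check weisuen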

It can be seen that the contract check of the crawler file passes, and the displayed result is "OK".

crawl command

You can start a crawler with the crawl command, in the format "scrapy crawl <crawler name>".
It should be noted that what follows crawl is the crawler name, not the crawler project name.
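For example, a minimal sketch that starts the crawler named first from the project used in this article:

scrapy crawl first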

list command

Through the list command in Scrapy, you can list the currently available crawler files.
For example, we can enter the directory of the crawler project myfirstpjt on the command line, and then use scrapy list to directly list the currently available crawler files, as shown below.
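A sketch of that invocation; the output is simply the crawler names, one per line (here the single crawler first mentioned below):

scrapy list
first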

As you can see, there is one crawler file available at this time: first.

edit command

Through the edit command in Scrapy, we can directly open the corresponding editor to edit a crawler file. This command can be a little problematic on Windows, where we generally use a Python IDE (such as PyCharm) to manage and edit the crawler project directly, but on Linux this command is very convenient.
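A sketch of the invocation, assuming the crawler named first from this project (which editor opens is determined by Scrapy's EDITOR setting or the EDITOR environment variable):

scrapy edit first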

parse command

Through the parse command, we can fetch a specified URL and process and analyze it with the corresponding crawler file.
For example, we can use "scrapy parse http://www.baidu.com" to fetch the Baidu homepage (http://www.baidu.com). Because no crawler file or processing function is specified, the default crawler file and default processing function are used for the processing, as shown below.
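A sketch of that invocation (the --nolog flag, one of the global options listed below, is optional and only keeps the output readable):

scrapy parse http://www.baidu.com --nolog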

The "scrapy parse" command has many parameters. We can check the specific parameters through scrapy parse -h, as shown below.

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider without looking for one
-a NAME=VALUE           set spider argument (may be repeated)
--pipelines             process items through pipelines
--nolinks               don't show links to follow (extracted requests)
--noitems               don't show scraped items
--nocolour              avoid using pygments to colorize the output
--rules, -r             use CrawlSpider rules to discover the callback
--callback=CALLBACK, -c CALLBACK
                        use this callback for parsing, instead looking for a
                        callback
--meta=META, -m META    inject extra meta into the Request, it must be a valid
                        raw json string
--cbkwargs=CBKWARGS     inject extra callback kwargs into the Request, it must
                        be a valid raw json string
--depth=DEPTH, -d DEPTH
                        maximum depth for parsing requests [default: 1]
--verbose, -v           print each depth level one by one

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
 

As you can see, these parameters fall roughly into two categories: command-specific parameters (Options) and global parameters (Global Options). We have already seen the global parameters with other commands, so here we mainly focus on the parameters specific to this command (Options).
The commonly used parameters and their meanings are summarized in the listing above.

From it, you can clearly see which common parameters the parse command supports.
