Debugging XPath with the Scrapy shell

Yesterday a friend asked me whether I use the Scrapy shell to check XPath expressions. I said I had never done it, which was a bit embarrassing. It turns out it isn't difficult at all, so I'm recording the method here.

First of all, let me plug the translated Scrapy documentation. The latest translated version is 1.0 while Scrapy itself has already reached 1.3, but the documentation is still very usable; just watch out for a few small pitfalls caused by the version gap.

  1. Type at the command line

    scrapy shell

    This opens the Scrapy shell; if you have IPython installed, it will start with IPython.
    Some Scrapy startup information and usage hints will be printed at this point.
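    You can also hand the shell a URL directly when starting it, and that page will be fetched right away (a small convenience; the URL below is just an example):

    scrapy shell 'https://www.baidu.com'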

  2. Use fetch() to fetch web pages

    In [2]: fetch('https://www.baidu.com')
    2017-01-17 10:32:55 [scrapy.core.engine] INFO: Spider opened
    2017-01-17 10:32:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com> (referer: None)

    At this point the web page is stored in the response object. Of course, it can also be assigned to a variable, but since we are only debugging XPath, saving it usually isn't necessary.
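    If you do want to keep the response around, ordinary assignment works; a minimal sketch (the name `page` is arbitrary):

    In [3]: page = response      # keep a reference to the current response
    In [4]: page.url             # the URL that was requested
    In [5]: page.status          # the HTTP status code, e.g. 200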

  3. Use xpath to match web page elements

    XPath is used exactly the same way as in a Scrapy project. Note that since Scrapy 1.2 you no longer need to declare a selector yourself; use response.xpath() or response.css() directly. The same applies in the Scrapy shell.

    In [2]: response.xpath('//*[@id="lg"]').extract()
    Out[2]: ['<div id="lg"> <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129"> </div>']

    Before Scrapy 1.2, the response and the selector returned by fetch() were separate objects, and the query was written like this:

    sel.xpath('//*[@id="lg"]')
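    If you only need an attribute rather than the whole tag, you can match it directly in the XPath. Continuing the Baidu example above (extract_first() returns the first match, or None if there is none):

    In [3]: response.xpath('//*[@id="lg"]/img/@src').extract_first()
    Out[3]: '//www.baidu.com/img/bd_logo1.png'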
  4. You can also debug layer by layer

    In [3]: a = response.xpath('//*[@id="lg"]')
    
    In [4]: a.xpath('./img').extract()
    Out[4]: ['<img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">']
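    The same trick helps when an XPath matches several nodes: save the outer match and query each node relative to it. A sketch (the class name "result" is made up for illustration):

    In [5]: for node in response.xpath('//div[@class="result"]'):
       ...:     print(node.xpath('./a/@href').extract_first())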
  5. view()

    Type this in the Scrapy shell:

    view(response)

    This command opens the page you just requested in your local browser. It is worth noting that:

    Open the given response in your local browser. A <base> tag will be added to the body of the response so that external resources (such as images and CSS) display correctly. Note that this operation creates a temporary file locally, and the file will not be deleted automatically. - Chinese documentation

    When you are sure your XPath is written correctly but it just won't match anything, take a look at what you actually downloaded :-).
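    If you would rather not open a browser, you can also dump the body to a file yourself and inspect it with whatever tool you like (a minimal alternative sketch; the filename is arbitrary):

    In [6]: with open('downloaded.html', 'wb') as f:
       ...:     f.write(response.body)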

4/12/17 Supplement:

How to set the User-Agent and request headers

When debugging Zhihu with scrapy shell:

fetch('http://www.zhihu.com')
2017-04-12 22:33:09 [scrapy.core.engine] INFO: Spider opened
2017-04-12 22:33:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.zhihu.com> (failed 1 times): 500 Internal Server Error
2017-04-12 22:33:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.zhihu.com> (failed 2 times): 500 Internal Server Error
2017-04-12 22:33:29 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.zhihu.com> (failed 3 times): 500 Internal Server Error
2017-04-12 22:33:29 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.zhihu.com> (referer: None)

As everyone who has just (Bai)(du)'d it knows, 500 is a server-side error status code, but my hunch was that the Scrapy shell's User-Agent is "Scrapy" plus its version number, and that is why the server rejected the request.

Configure the User-Agent and start the Scrapy shell with the following command:

scrapy shell -s USER_AGENT='Mozilla/5.0'

Then I fetch()ed the Zhihu homepage again and found it already returned 200.

fetch('http://www.zhihu.com')
2017-04-12 22:41:11 [scrapy.core.engine] INFO: Spider opened
2017-04-12 22:41:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.zhihu.com/> from <GET http://www.zhihu.com>
2017-04-12 22:41:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: None)
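To double-check which User-Agent was actually sent, you can inspect the request attached to the response (a quick sanity check; header values are stored as bytes):

response.request.headers.get('User-Agent')
# b'Mozilla/5.0'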

So how do you configure request headers for requests in the Scrapy shell? Like this:

$ scrapy shell
...
>>> from scrapy import Request
>>> req = Request('http://yoururl.com', headers={"header1": "value1"})  # the URL needs a scheme (http:// or https://)
>>> fetch(req)
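For example, the Zhihu request from above could be made this way instead of using the -s flag (a sketch; any extra headers are optional):

>>> from scrapy import Request
>>> req = Request('https://www.zhihu.com', headers={'User-Agent': 'Mozilla/5.0'})
>>> fetch(req)
>>> response.status   # 200 if the header did the trick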

Reference: http://stackoverflow.com/questions/37010524/set-headers-for-scrapy-shell-request
