Yesterday a friend asked me whether I use the scrapy shell to check XPath expressions, and I had to admit I never had, which was a little embarrassing. It turns out to be quite easy, so I'm recording the method here.
First, a recommendation: the Chinese translation of the Scrapy documentation. The latest translated version covers Scrapy 1.0 while Scrapy itself has reached 1.3, but the document is still very usable; just watch out for a few gotchas caused by version differences.
Type at the command line:
scrapy shell
to enter the Scrapy shell. If you have IPython installed, the shell will start with IPython.
At this point, Scrapy prints some startup information and a summary of the available objects and commands.
Use fetch() to fetch web pages
In [2]: fetch('https://www.baidu.com')
2017-01-17 10:32:55 [scrapy.core.engine] INFO: Spider opened
2017-01-17 10:32:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.baidu.com> (referer: None)
At this point, the page has been saved in the response object. You can of course also assign it to a variable, but since we're only debugging XPath, there's no need to.
Use xpath to match web page elements
XPath is used exactly as in a Scrapy project. Note that since Scrapy 1.2 you no longer need to declare a selector yourself; call response.xpath() or response.css() directly. The same holds in the scrapy shell.
In [2]: response.xpath('//*[@id="lg"]').extract()
Out[2]: ['<div id="lg"> <img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129"> </div>']
Before Scrapy 1.2, the response and the selector returned by fetch() were separate objects, and you would write:
sel.xpath('//*[@id="lg"]')
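If you don't have a Scrapy shell at hand, the same kind of match can be sketched with Python's stdlib xml.etree.ElementTree, which supports a limited XPath subset. The HTML fragment below is a hand-made stand-in for the Baidu page, not a real response:

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for the fragment matched above (not a real response body).
html = ('<html><body>'
        '<div id="lg"><img src="//www.baidu.com/img/bd_logo1.png" '
        'width="270" height="129"/></div>'
        '</body></html>')

root = ET.fromstring(html)
# ElementTree's XPath subset understands .//tag[@attr='value'] predicates.
div = root.find(".//div[@id='lg']")
print(div.get("id"))                        # -> lg
print(ET.tostring(div, encoding="unicode"))
```

This is only a sketch: ElementTree wants well-formed XML and knows just a subset of XPath, so for real pages the Scrapy shell (or its underlying parsel library) remains the right tool.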
You can also debug step by step:
In [3]: a = response.xpath('//*[@id="lg"]')
In [4]: a.xpath('./img').extract()
Out[4]: ['<img hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270" height="129">']
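The same step-by-step narrowing can be mimicked with the stdlib: keep the intermediate node in a variable and run a relative query on it. The fragment is again a made-up stand-in, not a real response:

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the page; not a real response.
html = ('<div id="lg"><img src="//www.baidu.com/img/bd_logo1.png" '
        'width="270" height="129"/></div>')

a = ET.fromstring(html)       # step 1: grab the outer node
img = a.find("./img")         # step 2: relative query, like a.xpath('./img')
print(img.get("width"), img.get("height"))  # -> 270 129
```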
view()
In the scrapy shell, type:
view(response)
This command opens the page you just requested in your local browser. It is worth noting:
Open the given response in the local browser. It will add a <base> tag to the body of the response so that external resources (such as images and CSS) display correctly. Note that this operation creates a temporary file locally, and the file is not deleted automatically. - Chinese documentation
When you're sure your XPath is written correctly but it just won't match, take a look at what you actually downloaded :-).
4/12/17 Supplement:
How to add a User-Agent and request headers
When debugging Zhihu with scrapy shell:
fetch('http://www.zhihu.com')
2017-04-12 22:33:09 [scrapy.core.engine] INFO: Spider opened
2017-04-12 22:33:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.zhihu.com> (failed 1 times): 500 Internal Server Error
2017-04-12 22:33:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.zhihu.com> (failed 2 times): 500 Internal Server Error
2017-04-12 22:33:29 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.zhihu.com> (failed 3 times): 500 Internal Server Error
2017-04-12 22:33:29 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.zhihu.com> (referer: None)
After a round of searching (well, Baidu-ing), I knew that 500 is a server-side error status code, but my intuition told me that the scrapy shell's User-Agent is something like "Scrapy" plus a version number, and that's why the server rejected the request.
Start the scrapy shell with a configured User-Agent using the following command:
scrapy shell -s USER_AGENT='Mozilla/5.0'
Then fetch() the Zhihu homepage again, and this time the status is 200.
fetch('http://www.zhihu.com')
2017-04-12 22:41:11 [scrapy.core.engine] INFO: Spider opened
2017-04-12 22:41:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.zhihu.com/> from <GET http://www.zhihu.com>
2017-04-12 22:41:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com/> (referer: None)
So how do you configure request headers for a request in the scrapy shell? Like this:
$ scrapy shell
...
>>> from scrapy import Request
>>> req = Request('http://yoururl.com', headers={"header1":"value1"})
>>> fetch(req)
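Outside Scrapy, the same idea (attaching headers, including the User-Agent, to a request before it is sent) can be sketched with the stdlib urllib. The URL and header values here are placeholders:

```python
from urllib.request import Request

# Placeholder URL and header values. urllib stores header keys in
# capitalized form, hence get_header('User-agent') below.
req = Request(
    "http://example.com",
    headers={"User-Agent": "Mozilla/5.0", "header1": "value1"},
)
print(req.get_header("User-agent"))  # -> Mozilla/5.0
print(req.get_header("Header1"))     # -> value1
```

Building the Request object does not touch the network, so you can inspect the headers before deciding to send anything, much like passing a prepared Request to fetch() in the scrapy shell.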
Reference: http://stackoverflow.com/questions/37010524/set-headers-for-scrapy-shell-request