I. Experimental environment
1. Windows 7 x64 SP1
2. Anaconda3 with Python 3.7.3 (bundled with Anaconda, so no separate installation is needed)
3. Scrapy 1.6.0
II. Usage examples
1. Open the Scrapy shell by entering the following command:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
After fetching the page, the shell exposes it as the response object used in the steps below.
2. Extract nodes
- XPath usage
result = response.xpath('//a')
The results are as follows:
[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]
- CSS usage
result = response.css('a')
The results are as follows:
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image 5 <'>]
3. Check the type of result
type(result)
The results are as follows:
scrapy.selector.unified.SelectorList
Note: result is a SelectorList, that is, a list of Selector objects. Both SelectorList and Selector provide xpath(), css(), and related methods, so you can keep calling them on a result to extract data further.
4. View the full content of the extracted result with the extract() method
result.extract()
The results are as follows:
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
5. Extract the text content of nodes
- XPath usage, with text()
response.xpath('//a/text()')
The results are as follows:
[<Selector xpath='//a/text()' data='Name: My image 1 '>, <Selector xpath='//a/text()' data='Name: My image 2 '>, <Selector xpath='//a/text()' data='Name: My image 3 '>, <Selector xpath='//a/text()' data='Name: My image 4 '>, <Selector xpath='//a/text()' data='Name: My image 5 '>]
View the extracted text
response.xpath('//a/text()').extract()
The results are as follows:
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
- CSS usage, with ::text
response.css('a::text').extract()
The results are as follows:
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
6. Extract attribute values
- XPath usage: append /@ followed by the attribute name (for example, /@href)
response.xpath('//a/@href').extract()
The results are as follows:
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
- CSS usage, with ::attr()
response.css('a::attr("href")').extract()
The results are as follows:
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
7. Extract child nodes inside a node
- XPath usage: append / followed by the child node name
response.xpath('//a/img').extract()
The results are as follows:
['<img src="image1_thumb.jpg">', '<img src="image2_thumb.jpg">', '<img src="image3_thumb.jpg">', '<img src="image4_thumb.jpg">', '<img src="image5_thumb.jpg">']
- CSS usage (a img is a descendant selector, so it also matches img elements nested deeper inside a)
response.css('a img').extract()
The results are as follows:
['<img src="image1_thumb.jpg">', '<img src="image2_thumb.jpg">', '<img src="image3_thumb.jpg">', '<img src="image4_thumb.jpg">', '<img src="image5_thumb.jpg">']
To go one step further and extract the src attribute value, use the same approach as in step 6.
- XPath usage
response.xpath('//a/img/@src').extract()
- CSS usage
response.css('a img::attr("src")').extract()
8. Common methods
- extract_first() # extracts the first matching element, or returns None if nothing matches
- extract_first('default value') # same as above, but returns the given default when nothing matches