Using xpath and css selectors in Scrapy

First, the test environment

1. Windows 7 x64 SP1

2. Anaconda3 with Python 3.7.3 (bundled with Anaconda, so no separate installation is needed)

3. Scrapy 1.6.0

Second, usage examples

1. Open the Scrapy shell by entering the following command:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Once the page has been fetched, the shell makes it available as the response object used in the steps below.

2. Select nodes

  • xpath usage

result = response.xpath('//a')

The results are as follows:

[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]

  • css usage

result = response.css('a')

The results are as follows:

[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image 5 <'>]

  

3. Check the type of result

type(result)

The results are as follows:

scrapy.selector.unified.SelectorList

Note: result is a SelectorList, that is, a list whose items are Selector objects. Both the list and its items support xpath(), css() and other methods, so the selection can be refined further.

 

4. View the full extracted content of result with the extract() method

result.extract()  

The results are as follows:

['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

 

5. Extract the text content of nodes

  • xpath usage, with the text() node test

response.xpath('//a/text()')

The results are as follows:

[<Selector xpath='//a/text()' data='Name: My image 1 '>,
 <Selector xpath='//a/text()' data='Name: My image 2 '>,
 <Selector xpath='//a/text()' data='Name: My image 3 '>,
 <Selector xpath='//a/text()' data='Name: My image 4 '>,
 <Selector xpath='//a/text()' data='Name: My image 5 '>]

 

View the extracted text as strings

response.xpath('//a/text()').extract()

The results are as follows:

['Name: My image 1 ',
 'Name: My image 2 ',
 'Name: My image 3 ',
 'Name: My image 4 ',
 'Name: My image 5 ']

  • css usage, with the ::text pseudo-element

response.css('a::text').extract()

The results are as follows:

['Name: My image 1 ',
 'Name: My image 2 ',
 'Name: My image 3 ',
 'Name: My image 4 ',
 'Name: My image 5 ']

    

6. Extract attribute values

  • xpath usage: append /@attribute-name (for example /@href)

response.xpath('//a/@href').extract()

The results are as follows:

['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

  • css usage, with ::attr(attribute-name)

response.css('a::attr("href")').extract()

The results are as follows:

['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html'] 

  

7. Extract child nodes

  • xpath usage: append /child-node-name

response.xpath('//a/img').extract()

The results are as follows:

['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']
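
  • css usage (the descendant form 'a img' also matches more deeply nested images; 'a > img' is the strict equivalent of /img)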

response.css('a img').extract()

The results are as follows:

['<img src="image1_thumb.jpg">',
 '<img src="image2_thumb.jpg">',
 '<img src="image3_thumb.jpg">',
 '<img src="image4_thumb.jpg">',
 '<img src="image5_thumb.jpg">']

  

  

To go one step further and extract the src attribute value, use the same approach as in step 6:

  • xpath usage

response.xpath('//a/img/@src').extract()

  • css usage

response.css('a img::attr("src")').extract()

  

8. Common methods

  • extract_first()  # returns the first extracted element, or None if nothing matched
  • extract_first('default value')  # same as above, but returns the given default if nothing matched


Origin www.cnblogs.com/hester/p/11371384.html