[Scrapy Framework] Version 2.4.0 Source Code: Selectors in Detail

Index of all source-code analysis articles:

[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

When crawling a web page, the most common task that needs to be performed is to extract data from an HTML source.

There are many ways to extract data according to your own habits.

BeautifulSoup is a very popular web-scraping library among Python programmers. It builds a tree of Python objects from the structure of the HTML and handles malformed markup quite well, but its main drawback is speed.
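As a point of comparison, here is a minimal BeautifulSoup sketch (assuming the third-party bs4 package is installed; the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = '<html><body><span>good</span></body></html>'

# BeautifulSoup parses the markup into a tree of Python objects
soup = BeautifulSoup(html, 'html.parser')

print(soup.span.get_text())  # -> good
```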

lxml is an XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree. (lxml is not part of the Python standard library.)
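A quick lxml sketch of the same idea (assuming lxml is installed; the HTML snippet is made up for illustration):

```python
from lxml import html

# Hypothetical HTML snippet for illustration
doc = html.fromstring('<html><body><span>good</span></body></html>')

# lxml exposes an ElementTree-style API with full XPath support
print(doc.xpath('//span/text()'))  # -> ['good']
```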

Scrapy has its own mechanism for extracting data, called selectors, which "select" certain parts of an HTML document.

Using selectors

Constructing selectors

The response object exposes a Selector instance through its .selector attribute:

>>> response.selector.xpath('//span/text()').get()
'good'

Querying with XPath and CSS is so common that responses provide two shortcut methods: response.xpath() and response.css():

>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

Constructing from text

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Selectors can also be constructed from a response object; HtmlResponse is a subclass of TextResponse:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
>>> Selector(response=response).xpath('//span/text()').get()
'good'

Querying with selectors

Open the page to be parsed and construct an XPath that selects the text inside the title tag:

>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]

To extract the text data, call the selector's .get() or .getall() method:

>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
>>> response.css('title::text').get()
'Example website'

The .xpath() and .css() methods return a SelectorList instance, so you can chain further queries on the result:

>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

To extract the first matching element, use .get() or .extract_first()

>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '

If no element matches, .get() returns None; a default value can be supplied instead:

>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True

>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'

Element attributes can also be read through the .attrib property:

>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
 
>>> response.css('img').attrib['src']
'image1_thumb.jpg'

Called on a SelectorList, .attrib returns the attributes of the first matching element, which is most useful when the selection is expected to be unique:

>>> response.css('base').attrib['href']
'http://example.com/'

Several ways to get the base URL and the image links:

>>> response.xpath('//base/@href').get()
'http://example.com/'

>>> response.css('base::attr(href)').get()
'http://example.com/'

>>> response.css('base').attrib['href']
'http://example.com/'

>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

CSS selector extension

Per the W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is essential in a web-scraping context, so Scrapy (via Parsel) implements a couple of non-standard pseudo-elements:

  1. To select a text node, use ::text
  2. To select an attribute value, use ::attr(name), where name is the name of the attribute

title::text selects the child text nodes of a descendant title element:

>>> response.css('title::text').get()
'Example website'

*::text selects all child text nodes of the current selector context

>>> response.css('#images *::text').getall()
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']

foo::text returns no result if the foo element exists but contains no text (that is, its text is empty):

>>> response.css('img::text').getall()
[]

>>> response.css('img::text').get()
>>> response.css('img::text').get(default='none')
'none'

a::attr(href) selects the href attribute value of descendant links:

>>> response.css('a::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

Nested selector

The selection methods (.xpath() and .css()) return a list of selectors of the same type, so you can call the same selection methods on those selectors too:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
...     href_xpath = link.xpath('@href').get()
...     img_xpath = link.xpath('img/@src').get()
...     print(f'Link number {index} points to url {href_xpath!r} and image {img_xpath!r}')
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

Select element attributes

XPath syntax

>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

CSS selector extension (::attr(…)) syntax

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

.attrib syntax

>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

# As a dictionary
>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'

# Empty result
>>> response.css('foo').attrib
{}

Selector with regular expression

Selector also has a .re() method for extracting data with regular expressions. But unlike .xpath() or .css(), .re() returns a list of strings, so you cannot construct nested .re() calls.

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

There is an additional helper for .re(), analogous to .get() (and its alias .extract_first()), named .re_first(). Use it to extract only the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'

extract() and extract_first()

The older .extract() and .extract_first() methods are equivalent to .getall() and .get():

SelectorList.get() is identical to SelectorList.extract_first():

>>> response.css('a::attr(href)').get()
'image1.html'
>>> response.css('a::attr(href)').extract_first()
'image1.html'

SelectorList.getall() is identical to SelectorList.extract():

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Selector.get() is identical to Selector.extract():

>>> response.css('a::attr(href)')[0].get()
'image1.html'
>>> response.css('a::attr(href)')[0].extract()
'image1.html'

On a single Selector, .getall() returns the result as a one-element list:

>>> response.css('a::attr(href)')[0].getall()
['image1.html']

Since the examples from the official documentation are rarely used as-is in daily work, this column will also include a set of practical applications for selective reading.


Origin blog.csdn.net/qq_20288327/article/details/113483971