Lecture 43: The usage of the flexible and easy-to-use Spider

In the last lesson, we learned the basic use of Scrapy through examples. In the process, we wrote the crawling logic in a Spider and used Selectors to extract the results.

In this lesson, we will summarize the basic usage of Spider and Selector.

Spider usage

In Scrapy, the links to crawl, the crawling logic, and the parsing logic for a website are all configured in the Spider. In the example from the previous lesson, the crawling logic was likewise written in the Spider. In this lesson, we will look at the basic usage of Spider in detail.

Spider running process

When implementing a Scrapy crawler project, the core class is the Spider class, which defines how a website is crawled and how its pages are analyzed. Put simply, a Spider has to do two things:

  • Define the action of crawling the website;
  • Analyze the crawled web pages.

For the Spider class, the entire crawling cycle is as follows.

  • Initialize Requests with the initial URLs and set a callback function. When a Request completes successfully, a Response is generated and passed to the callback function as a parameter.
  • Analyze the returned web page content in the callback function. The return value can take two forms: one is a dictionary or an Item object containing the parsed, valid result, which can then be processed further or saved directly; the other is a parsed link (for example, to the next page), from which we can construct a new Request with a new callback function and return it.
  • If the return value is a dictionary or an Item object, it can be stored in a file through Feed Exports and similar mechanisms; if an Item Pipeline is configured, it can be processed there (filtered, corrected, and so on) and then saved.
  • If the return value is a Request, the Request is scheduled and executed, and once its Response comes back it is passed to the callback function defined on that Request. There we can use a Selector again to analyze the newly obtained page content and generate Items from the parsed data.

By repeating the above steps, the crawl of the whole site is completed. The sketch below illustrates this cycle.
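
Here is a minimal sketch of the cycle, assuming a hypothetical quotes-style site (quotes.toscrape.com, the CSS rules, and the field names are illustrative and not taken from this lesson's project). The parse callback yields dictionaries as results and a new Request for the next page:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # parse the page and yield dictionaries (they could also be Item objects)
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # construct a Request for the next page and point it back to this
        # callback, so the cycle repeats until there is no next link
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)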

Spider class analysis

In the example from the previous lesson, the Spider we defined inherited from scrapy.spiders.Spider. This is the simplest and most basic Spider class; every other Spider must inherit from it, including the special Spider classes that will be explained later.

This class provides a default implementation of the start_requests method, which reads the start_urls attribute, requests each URL in it, and calls the parse method on the returned results. In addition, it has some basic attributes, explained below.

  • name: A string that defines the name of the Spider. The name is how Scrapy locates and initializes the Spider, so it must be unique; however, we can create multiple instances of the same Spider without restriction. name is the most important attribute of a Spider and is required. If the Spider crawls a single website, a common practice is to name it after the site's domain; for example, a Spider that crawls mywebsite.com will usually be named mywebsite.

  • allowed_domains: An optional list of domain names that the Spider is allowed to crawl. Links outside this range will not be followed.

  • start_urls: The list of start URLs. When we do not implement the start_requests method, crawling starts from this list by default.

  • custom_settings: A dictionary of settings that apply only to this Spider. It overrides the project's global settings and must be set before initialization, so it has to be defined as a class variable.

  • crawler: This attribute is set by the from_crawler method and represents the Crawler object that this Spider belongs to. The Crawler object contains many project components, and we can use it to obtain project configuration; the most common use is retrieving the project's Settings.

  • settings: A Settings object through which we can directly read the project's global settings. The sketch after this list shows how these attributes fit together.
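
Here is a short sketch, under assumed placeholder values, of how these attributes sit together in one Spider (the domain, the settings key, and the log message are only illustrative):

import scrapy


class MywebsiteSpider(scrapy.Spider):
    name = 'mywebsite'                       # required and unique
    allowed_domains = ['mywebsite.com']      # links outside this domain are ignored
    start_urls = ['http://mywebsite.com/']   # used by the default start_requests
    custom_settings = {
        # overrides the project's global settings for this Spider only
        'DOWNLOAD_DELAY': 2,
    }

    def parse(self, response):
        # the settings attribute gives access to the merged settings,
        # and self.crawler exposes the Crawler object they come from
        delay = self.settings.get('DOWNLOAD_DELAY')
        self.logger.info('Running %s with DOWNLOAD_DELAY=%s', self.name, delay)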

In addition to these basic attributes, Spider also has some commonly used methods, introduced below.

  • start_requests: This method generates the initial Requests and must return an iterable object. By default it constructs Requests from the URLs in start_urls, using the GET method. If we want to access a site with POST at startup, we can override this method and use FormRequest to send POST requests (see the sketch after this list).

  • parse: When a Request does not specify a callback function, this method is called by default on its Response. It is responsible for processing the Response, extracting the desired data and the next Requests from it, and returning them. It must return an iterable of Requests or Items.

  • closed: This method is called when the Spider is closed; it is where resource-releasing and other shutdown operations are usually placed.
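
As a sketch of the two override points mentioned above, the Spider below sends a POST login request at startup with FormRequest and logs a message in closed when it shuts down. The login URL and form fields are hypothetical placeholders:

import scrapy
from scrapy import FormRequest


class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        # replace the default GET requests with a single POST at startup
        yield FormRequest(
            'http://example.com/login',
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.parse,
        )

    def parse(self, response):
        # handle the response to the login POST
        yield {'status': response.status}

    def closed(self, reason):
        # called once when the Spider shuts down; release resources here
        self.logger.info('Spider closed: %s', reason)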

Selector usage

We previously introduced the use of Beautiful Soup, PyQuery, and regular expressions to extract web page data, which is very convenient. Scrapy also provides its own data extraction mechanism: the Selector.

Selector is built on top of lxml and supports XPath selectors, CSS selectors, and regular expressions. Its functionality is comprehensive, and its parsing speed and accuracy are high.

Next we will introduce the usage of Selector.

Direct use

Selector can also be used on its own. We can construct a Selector object directly and then call its methods, such as xpath and css, to extract data.

For example, for a piece of HTML code, we can construct a Selector object to extract data in the following way:

from scrapy import Selector

body = '<html><head><title>Hello World</title></head><body></body></html>'
selector = Selector(text=body)
title = selector.xpath('//title/text()').extract_first()
print(title)

The result:

Hello World

Here we did not run inside the Scrapy framework, but used Scrapy's Selector on its own. By passing the text parameter to the constructor, a Selector object is created, and we can then call xpath, css, and other methods to extract data, just as we did earlier inside Scrapy.

Here we look for the text inside the title node of the source code, adding text() at the end of the XPath expression to extract the text.

That is the direct use of Selector. Like Beautiful Soup and similar libraries, Selector is in fact a powerful web parsing library in its own right, and we can also use it directly to extract data in other projects when convenient.
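
As another small standalone sketch (with a made-up HTML snippet), the same Selector object also supports the css and re_first methods that we will meet again later:

from scrapy import Selector

body = '<html><body><a href="detail.html">Name: Demo</a></body></html>'
selector = Selector(text=body)
print(selector.css('a::attr(href)').extract_first())     # detail.html
print(selector.css('a::text').re_first(r'Name:\s(.*)'))  # Demo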

Next, we use examples to explain the use of Selector in detail.

Scrapy Shell

Selector is mainly used together with Scrapy; for example, the response parameter of a callback function can directly call the xpath() or css() method to extract data. So here we use the Scrapy Shell to simulate the Scrapy request process and explain the related extraction methods.

We use a sample page of the official document to demonstrate: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html .

Enter the following command on the command line to open the Scrapy Shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

We have now entered Scrapy Shell mode. In this process, Scrapy issues a request for the URL we just entered on the command line and then hands us a set of variables we can operate on, such as request and response.

We can type commands at the prompt to call methods on these objects, and the results are displayed as soon as we press Enter, much like Python's interactive command-line mode.

The following demo examples all use this page as the analysis target. Its source code is as follows:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

XPath selector

After entering the Scrapy Shell, we will mainly work with the response variable. Because what we are parsing is HTML, Selector automatically uses HTML parsing rules.

The Response object has a selector attribute; response.selector is equivalent to a Selector object constructed from the response text. Through this Selector we can call parsing methods such as xpath and css, passing XPath or CSS selector expressions as parameters to extract information.

Let's experience it with an example, as shown below:

>>> result = response.selector.xpath('//a')
>>> result
[<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a' data='<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a' data='<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//a' data='<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//a' data='<a href="image5.html">Name: My image 5 <'>]
>>> type(result)
scrapy.selector.unified.SelectorList

The printed result looks like a list of Selector objects; its actual type is SelectorList. Both SelectorList and Selector can keep calling methods such as xpath and css to extract data further.

In the above example we extracted the a nodes. Next, we continue calling the xpath method to extract the img nodes contained inside the a nodes, as shown below:

>>> result.xpath('./img')
[<Selector xpath='./img' data='<img src="image1_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image2_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image3_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image4_thumb.jpg">'>,
 <Selector xpath='./img' data='<img src="image5_thumb.jpg">'>]

We got all the img nodes inside the a nodes; there are 5 results.

It is worth noting the dot (.) at the front of the selector: it means the extraction is relative to the current element, whereas without it the extraction starts from the root node. Here we used ./img, which extracts from within each a node; if we used //img instead, we would still be extracting from the html root node. A small standalone example of the difference is sketched below.
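
This quick standalone sketch (using a made-up snippet) makes the difference visible: the same span query behaves differently with and without the leading dot:

from scrapy import Selector

body = '<div><p><span>inner</span></p><span>outer</span></div>'
p = Selector(text=body).xpath('//p')
print(p.xpath('.//span/text()').extract())  # ['inner']: relative to the p node
print(p.xpath('//span/text()').extract())   # ['inner', 'outer']: searched from the document root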

So far we have used the response.selector.xpath method to extract data. Scrapy provides two practical shortcuts, response.xpath and response.css, which are completely equivalent to response.selector.xpath and response.selector.css. For convenience, from now on we will call response.xpath and response.css directly.

Now what we get is a variable of the SelectorList type, which is a list of Selector objects. We can use the index to retrieve one of the Selector elements individually, as shown below:

>>> result[0]
<Selector xpath='//a' data='<a href="image1.html">Name: My image 1 <'>

We can manipulate this SelectorList like a list. However, what we have so far are Selector or SelectorList objects, not the actual text content. So how do we extract the specific content?
For example, to extract the HTML of the a nodes, we can use the extract method, as shown below:

>>> result.extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

Using the extract method here, we can get what we really need.

We can also rewrite the XPath expression to select the internal text and attributes of the node, as shown below:

>>> response.xpath('//a/text()').extract()
['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
>>> response.xpath('//a/@href').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

We only need to append /text() to get a node's internal text, or /@href to get its href attribute; the part after the @ symbol is the name of the attribute to retrieve. A sketch of how this could be used inside a parse callback follows.
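
Inside a real Spider, these two expressions are often combined per node. The following sketch shows how that might look in a parse callback (the Spider name and the field names are only illustrative):

import scrapy


class ImagesSpider(scrapy.Spider):
    name = 'images_example'
    start_urls = ['http://doc.scrapy.org/en/latest/_static/selectors-sample1.html']

    def parse(self, response):
        # pair each link's text with its href, one dictionary per a node
        for a in response.xpath('//a'):
            yield {
                'name': a.xpath('./text()').extract_first(),
                'url': a.xpath('./@href').extract_first(),
            }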

With such a rule we can get all the nodes that match, and the result is returned as a list.

But there is a question: if only one node matches, what will the result look like? Let's find out with another example, as shown below:

>>> response.xpath('//a[@href="image1.html"]/text()').extract()
['Name: My image 1 ']

We use an attribute to limit the match so that the XPath can only match one element. When we then call the extract method, the result is still a list, and the text we want is its first element. In many cases what we actually need is just that first element, and here we get it by adding an index, as shown below:

>>> response.xpath('//a[@href="image1.html"]/text()').extract()[0]
'Name: My image 1 '

However, this way of writing it is clearly risky: if something is wrong with the XPath, the result of extract may be an empty list, and indexing into it would raise an out-of-range error.
For this reason there is another method dedicated to extracting a single element: extract_first. We can rewrite the above example as follows:

>>> response.xpath('//a[@href="image1.html"]/text()').extract_first()
'Name: My image 1 '

In this way, extract_first directly returns the first matching result, and we no longer have to worry about indexing errors.

In addition, we can also set a default value parameter for the extract_first method, so that when the XPath rule cannot extract the content, the default value will be used directly. For example, change XPath to a non-existent rule, and re-execute the code, as shown below:

>>> response.xpath('//a[@href="image1"]/text()').extract_first()
>>> response.xpath('//a[@href="image1"]/text()').extract_first('Default Image')
'Default Image'

Here, when the XPath matches nothing, calling extract_first returns None and no error is raised. In the second command we also pass a parameter as the default value, Default Image; if the XPath fails to match, this parameter is returned instead, as the output shows.

So far we have covered the usage of XPath in Scrapy, including nested queries, extracting content, extracting a single result, and obtaining text and attributes.

CSS selector

Next, let's look at CSS selectors. Scrapy's Selector also supports CSS selectors; we can use the response.css() method to select elements with them.

For example, above we selected all the a nodes with XPath; a CSS selector can do the same, as shown below:

>>> response.css('a')
[<Selector xpath='descendant-or-self::a' data='<a href="image1.html">Name: My image 1 <'>, 
<Selector xpath='descendant-or-self::a' data='<a href="image2.html">Name: My image 2 <'>, 
<Selector xpath='descendant-or-self::a' data='<a href="image3.html">Name: My image 3 <'>, 
<Selector xpath='descendant-or-self::a' data='<a href="image4.html">Name: My image 4 <'>, 
<Selector xpath='descendant-or-self::a' data='<a href="image5.html">Name: My image 5 <'>]

Similarly, the node can be extracted by calling the extract method, as shown below:

>>> response.css('a').extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>', '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>', '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>', '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>', '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

The usage is exactly the same as with XPath selection. We can also perform attribute matching and nested selection, as shown below:

>>> response.css('a[href="image1.html"]').extract()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>']
>>> response.css('a[href="image1.html"] img').extract()
['<img src="image1_thumb.jpg">']

Here, [href="image1.html"] restricts the href attribute, and you can see there is only one matching result. To find the img node inside the a node, we only need to add a space followed by img; the selector is written exactly like a standard CSS selector.

We can also use the extract_first() method to extract the first element of the list, as shown below:

>>> response.css('a[href="image1.html"] img').extract_first()
'<img src="image1_thumb.jpg">'

The next two usages differ from their XPath counterparts: a node's internal text and attributes are obtained like this:

>>> response.css('a[href="image1.html"]::text').extract_first()
'Name: My image 1 '
>>> response.css('a[href="image1.html"] img::attr(src)').extract_first()
'image1_thumb.jpg'

To get text and attributes we use ::text and ::attr(), whereas libraries such as Beautiful Soup or PyQuery provide separate methods for this.

In addition, CSS selectors and XPath selectors can be nested freely: we can first use an XPath selector to select all the a nodes, then a CSS selector to select the img nodes, and finally an XPath selector to get the attribute. Let's try it with an example, as shown below:

>>> response.xpath('//a').css('img').xpath('@src').extract()
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

We successfully obtained the src attributes of all img nodes.
Therefore, we can freely use both xpath and css methods to achieve nested queries, and the two are completely compatible.

Regular match

Scrapy's Selector also supports regular-expression matching. For example, the text inside the a nodes of the sample page looks like Name: My image 1. If we only want to extract the content after Name:, we can use the re method, as shown below:

>>> response.xpath('//a/text()').re('Name:\s(.*)')
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

We pass a regular expression to the re method, where (.*) is the group to match; the output is the content captured by that group, listed for each node in turn.

If there are two groups at the same time, the results will still be output in order, as shown below:

>>> response.xpath('//a/text()').re('(.*?):\s(.*)')
['Name', 'My image 1 ', 'Name', 'My image 2 ', 'Name', 'My image 3 ', 'Name', 'My image 4 ', 'Name', 'My image 5 ']

Similar to the extract_first method, the re_first method can select the first element of the list. The usage is as follows:

>>> response.xpath('//a/text()').re_first('(.*?):\s(.*)')
'Name'
>>> response.xpath('//a/text()').re_first('Name:\s(.*)')
'My image 1 '

No matter how many groups are matched by the regex, the result will be equal to the first element of the list.

It is worth noting that the response object cannot directly call the re and re_first methods. If you want to perform regular matching on the full text, you can call the xpath method first and then perform regular matching, as shown below:

>>> response.re('Name:\s(.*)')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'HtmlResponse' object has no attribute 're'
>>> response.xpath('.').re('Name:\s(.*)<br>')
['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']
>>> response.xpath('.').re_first('Name:\s(.*)<br>')
'My image 1 '

The example above shows that calling the re method on the response directly raises an error saying there is no such attribute. Instead, we first call xpath('.') to select the full document and then call re and re_first on the result to perform the regular matching.

That covers the usage of Scrapy's Selector, including the two common selector types and regular matching. If you are proficient in XPath syntax, CSS selector syntax, and regular-expression syntax, your data-extraction efficiency can be greatly improved. To wrap up, the sketch below combines the three approaches in one callback.
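
This closing sketch is built around the sample page used above (the Spider name and the field names are only illustrative); one parse callback mixes CSS selection, XPath, and regular matching:

import scrapy


class SampleSpider(scrapy.Spider):
    name = 'sample_example'
    start_urls = ['http://doc.scrapy.org/en/latest/_static/selectors-sample1.html']

    def parse(self, response):
        for a in response.css('#images a'):                            # CSS selector
            yield {
                'name': a.xpath('./text()').re_first(r'Name:\s(.*)'),  # XPath plus a regex
                'href': a.xpath('./@href').extract_first(),            # XPath attribute
                'thumb': a.css('img::attr(src)').extract_first(),      # nested CSS attribute
            }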
