Spider data mining - 6: Scrapy framework (2)

1. Secondary-page crawling and data passing/merging (previously only the primary page was crawled)

A secondary page is not the same page as the primary page; it is another page reached by following a hyperlink on the primary page. Its data is not saved locally but is handed to the pipeline.

The link to each secondary page can be found in the source code of the primary page. The movie details on the secondary page can no longer be parsed by the primary page's parsing method, so a separate parsing method is registered with callback=self.get_detail.
The detail tags sit in different positions on different secondary pages, so instead of a fixed path use a // global search under the same parent tag and match an exact attribute, e.g. span with @property="v:summary".
When several parsing methods are set up, the order in which they run is not fixed (requests are handled asynchronously, in no particular order), which keeps crawling efficient.

Storage problem: because responses come back in no fixed order, the meta parameter is needed to keep the information belonging to the same movie together.
When building the request for the detail url, add meta={"info": item_pipe} (the key name is arbitrary; the value is the data that should travel with the request). This carries the movie-name data gathered on the primary page along with the request, so the other parsing method can join the movie details to the movie information in exactly the right place:
item = DbItem()
info = response.meta["info"]
item.update(info)
The update() method is what attaches the movie details accurately to the movie information collected earlier. The engine keeps the meta and sends it back to the spider together with the response it obtained, so the parsing function can receive it through response.meta.
yield is not only part of the loop; it also acts as a return that transfers the data onward for storage in the pipeline. The primary parse yields the request carrying the movie information to the engine; when the detail response comes back with its meta, the detail parsing method parses it for the spider, merges both parts, and yields the finished item, which the engine puts into the pipeline together.
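A minimal sketch of what the item class referenced above might look like; the field names movie_name and detail are assumptions for illustration, not taken from the original project:

# items.py - a minimal sketch; the field names are assumed for illustration
import scrapy

class DbItem(scrapy.Item):
    movie_name = scrapy.Field()   # filled on the primary page
    detail = scrapy.Field()       # filled on the secondary (detail) page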

Design:
Target data: movie information + movie introduction
Request flow: visit the primary page, extract the movie information + the secondary-page url, then visit the secondary-page url and extract the remaining data from it (a sketch follows below)
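A minimal spider sketch of this flow. The spider name, start URL and XPath expressions are illustrative assumptions; only the meta-passing pattern follows the description above:

# spider sketch - names, URLs and XPath expressions are assumptions for illustration
import scrapy
from ..items import DbItem   # assumes the DbItem sketched above lives in items.py

class MovieSpider(scrapy.Spider):
    name = "movie"
    start_urls = ["https://movie.douban.com/top250"]   # illustrative primary page

    def parse(self, response):
        # primary page: extract movie information + the secondary-page url
        for info in response.xpath('//div[@class="info"]'):
            item_pipe = {
                "movie_name": info.xpath('.//a/span[1]/text()').get(),
            }
            detail_url = info.xpath('.//a/@href').get()
            # carry the data collected so far in meta so it rejoins the detail data
            yield scrapy.Request(detail_url, callback=self.get_detail,
                                 meta={"info": item_pipe})

    def get_detail(self, response):
        # secondary page: merge the carried data with the detail text
        item = DbItem()
        info = response.meta["info"]
        item.update(info)
        item["detail"] = response.xpath(
            '//span[@property="v:summary"]/text()').get()
        yield item   # handed to the engine and then to the pipeline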

2. Scrapy shell
A command-line tool used to debug Scrapy project code.
Some Scrapy objects are predefined at startup.

The syntax of the command that starts the Scrapy shell is as follows:

scrapy shell [option] [url or file]

url is the URL you want to crawl

Note: to parse a local file, pass it as a path (for example ./page.html); otherwise the scrapy shell treats the argument as a url by default

Using the shell
The Scrapy shell is essentially an ordinary Python shell,
but it predefines some objects we need and provides shortcut methods that make debugging convenient.

Shortcut methods (useful shortcuts):
shelp(): print a help list of the available objects and shortcuts

fetch(url[, redirect=True]): pull a new response from the given url and update the related objects; while on the home page you can get another page's data by filling in that page's url; redirects are followed by default

fetch(request): pull a new response from the given request object

view(response): open the response in the local browser, so it needs a local browser environment

Scrapy objects:
crawler
spider
request
response
settings
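
A short interactive session as a sketch; the URL is illustrative, and the objects and shortcuts shown are the ones listed above:

# started from the terminal with:  scrapy shell 'https://movie.douban.com/top250'   (URL is illustrative)
shelp()                                                # list the available objects and shortcuts
response.url                                           # the response object for the fetched page
response.xpath('//title/text()').get()                 # parse it directly in the shell
fetch('https://movie.douban.com/subject/1292052/')     # pull another page; response is replaced
view(response)                                         # open the current response in the local browser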

3. Scrapy selectors
Scrapy provides its own parsing mechanism built on the lxml library (other common options are bs4, XPath, CSS and regular expressions).
These objects are called selectors because they "select" a certain part of the HTML document specified by XPath or CSS expressions.
The Scrapy selector API is very small and very simple.


First import the Selector class: from scrapy.selector import Selector

1. Construct from text (construct on the text)
Pass the HTML string through the text parameter:
selc_text = Selector(text=html_str)   # this plays the same role as a response object
print(selc_text.extract())   # returns the document wrapped in html and body tags (they are only added if not already present)
To pull out only part of it, select with a path first, e.g. selc_text.xpath('//div/...'), and then extract.
print(selc_text.xpath('//div[@class="info"]//div/a/span/text()').extract())

Parse with the xpath selector first, then call .extract() to pull the text out.
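A runnable sketch of construction on text; the html_str content is a made-up example:

# construction on text - the html_str content is a made-up example
from scrapy.selector import Selector

html_str = '<div class="info"><div><a><span>Movie title</span></a></div></div>'
selc_text = Selector(text=html_str)          # behaves like parsing a response body
print(selc_text.extract())                   # whole document, wrapped in <html><body>...
print(selc_text.xpath('//div[@class="info"]//div/a/span/text()').extract())
# ['Movie title']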
2. Construct from a response (construct on the response)
from scrapy.http import HtmlResponse
response = HtmlResponse(url="http~~~", body=html_str, encoding='utf-8')   # body takes the same kind of text as above, as a byte string (a str also works when encoding is given)
response1 = Selector(response=response)   # construct the selector object through the response parameter
print(response.selector.xpath('//div[@class="info"]//div/a/span/text()').extract())   # .selector can be omitted: response.xpath() is a shortcut

Nested expressions: another selector method can be used after a CSS selector to keep filtering.
Selector objects can mix css() and xpath() freely, because css() and xpath() return selectors rather than data, while re() returns data; so after using re(), css() and xpath() can no longer be used.
If you want to combine re() with css()/xpath(), put re() at the end; and since re() already extracts the data, there is no need to add extract() after it.
print(response.css("a").xpath('./span[1]/text()'))
css() can pick out tag elements faster, without worrying too much about path issues, and xpath() can then read the text to extract the data.
css() extracts the tag elements, xpath() traverses them (still not data; after adding text() it is a list of selectors, extracted with get() in a Scrapy spider or with extract() on a Selector).
After re() the complete data has already been extracted; the result is no longer a selector object, needs no further processing, and cannot be converted into other selectors.
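A runnable sketch of construction on the response, chaining css(), xpath() and re(); the URL and HTML are placeholders:

# construction on the response - URL and HTML are placeholders
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

html_str = '<div class="info"><div><a href="/1"><span>Name: Movie A</span></a></div></div>'
response = HtmlResponse(url="http://example.com", body=html_str, encoding='utf-8')

sel = Selector(response=response)            # explicit construction through the response parameter
print(response.xpath('//div[@class="info"]//div/a/span/text()').extract())
# shortcut: response.xpath() / response.css() go through response.selector

print(response.css("a").xpath('./span[1]/text()').extract())   # css then xpath
print(response.css("a span::text").re(r'Name:\s*(.*)'))        # re() last: returns plain strings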

Constructing a selector
A Scrapy selector is an instance constructed by passing text or a TextResponse object to the scrapy.Selector class.

It automatically chooses the best parsing rules (XML or HTML) according to the input type.

Using selectors
Selectors provide 2 methods to extract tags:
xpath(): based on XPath syntax rules
css(): based on CSS selector syntax rules
Shortcuts:
response.xpath()
response.css()
Both return a list of selectors.

Extracting text:
selector.extract() returns a list of text
selector.extract_first() is equivalent to selector.extract()[0], returning the text of the first selector; when there is no match it returns None instead of raising an error
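
A small sketch of the difference; the HTML is a made-up example:

# extract() vs extract_first() - made-up HTML for illustration
from scrapy.selector import Selector

sel = Selector(text='<p>first</p><p>second</p>')
print(sel.xpath('//p/text()').extract())         # ['first', 'second']
print(sel.xpath('//p/text()').extract_first())   # 'first'
print(sel.xpath('//h1/text()').extract_first())  # None (no match, no error)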

Nested selectors
Sometimes we need to call the selection methods (.xpath() or .css()) multiple times to get the tags we want:
response.css('img').xpath('@src')

Selector also has a .re() method for extracting data with regular expressions.
It returns a list of strings.
It is generally used after the xpath() and css() methods to filter text data.
re_first() returns the first matching string.
For example:
response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
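
A small sketch combining re() and re_first(); the HTML is a made-up example in the spirit of the line above:

# re() returns a list of strings, re_first() the first match - HTML is made up
from scrapy.selector import Selector

sel = Selector(text='<a href="image1.html">Name: Movie A</a>'
                    '<a href="image2.html">Name: Movie B</a>')
print(sel.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)'))
# ['Movie A', 'Movie B']
print(sel.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)'))
# 'Movie A'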

4. scrapy.Spider

yield inside the inner loop: each yield hands one result out from inside the for loop and the loop then continues, so a single parsing method can produce items and requests one at a time (see the sketch below).
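
A tiny sketch of yield inside the for loop of a parse method; the XPath and item key are illustrative:

# yield inside the inner loop - XPath and key are illustrative
import scrapy

class ListSpider(scrapy.Spider):
    name = "list_demo"
    start_urls = ["https://movie.douban.com/top250"]   # illustrative

    def parse(self, response):
        for info in response.xpath('//div[@class="info"]'):
            # each pass of the loop yields one result; the loop then continues
            yield {"movie_name": info.xpath('.//a/span[1]/text()').get()}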

Origin blog.csdn.net/qwe863226687/article/details/114117022