Spider data mining (5): Scrapy framework usage (1)

Scrapy framework: (compared with what we have learned before, the Scrapy framework is much more widely used)
Introduction: (developed in pure Python, with no other languages involved, so you can read and modify the source code with Python)
Scrapy is an application framework written for crawling website data and extracting structured data.
Scrapy has a wide range of uses: data mining, monitoring and automated testing. Scrapy uses the Twisted asynchronous networking library to handle network communication.

Reasons for using it:
1. To better focus our energy on requests and parsing (dealing with anti-crawling measures usually consumes a lot of effort, and Scrapy helps with such problems)
2. Enterprise-level requirements (many enterprise-level crawling frameworks are developed on top of Scrapy)

The crawling process follows the same basic crawler steps (hand-written crawlers follow them as well):
1. Find the target data
2. Analyze the request flow
3. Construct the HTTP request (without a proper request header it is easily blocked by anti-crawling mechanisms, which often leads to repeated redirects and finally an error)
4. Extract the data (XPath and BS4 both extract structured data (HTML); for unstructured content, such as data embedded in JavaScript, you can fall back to more basic extraction methods)
5. Persist the data

Module installation:
On Windows, installing scrapy directly often pulls in an incompatible Twisted build, which is troublesome to fix,
so it is easier to install it in an Anaconda environment.

Module introduction:
Component introduction
Scrapy Engine: the engine sits at the core of Scrapy.
It is responsible for controlling the data flow between all components of the system and for triggering events when certain actions occur.

Scheduler:
The scheduler receives requests from the engine and queues them so that the engine can ask for them later.

Downloader:
The downloader is responsible for fetching web pages and handing them to the engine, which then passes them on to the spider.

Spider:
A spider is a custom class written by the user to parse responses, extract data (items) from them, or produce further requests (URLs) to crawl.

Item pipeline:
The item pipeline is responsible for post-processing the data once the spider has extracted it. Typical tasks include cleaning, validation and persistence (such as storing the data in a database).

Downloader middleware: (performs certain processing on requests and responses)
Downloader middlewares are specific hooks that sit between the engine and the downloader. They process requests passing from the engine to the downloader, and responses passing from the downloader back to the engine.
Use a downloader middleware if you want to do any of the following:
process a request just before it is sent to the downloader (that is, right before Scrapy sends the request to the website);
change the received response before passing it to the spider;
send a new request instead of passing the received response to the spider;
pass a response to the spider without fetching the web page;
silently drop some requests.
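
For orientation, here is a minimal downloader middleware sketch (the class name and header value are illustrative assumptions; it would be enabled via DOWNLOADER_MIDDLEWARES in settings.py):

    class ExampleUserAgentMiddleware:
        # Downloader middleware: sits between the engine and the downloader.

        def process_request(self, request, spider):
            # Called for every request on its way to the downloader;
            # here we just attach an example User-Agent header.
            request.headers.setdefault("User-Agent", "Mozilla/5.0 (example)")
            return None  # None means: continue handling this request normally

        def process_response(self, request, response, spider):
            # Called for every response on its way back to the engine/spider.
            return response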

Spider middleware: (different from downloader middleware)
Spider middlewares are specific hooks that sit between the engine and the spider. They can process incoming responses as well as the items and requests the spider passes back out.
Use a spider middleware if you need to:
post-process the output of spider callbacks (requests or items);
process start_requests;
handle spider exceptions;
call errback instead of the callback for some requests, depending on the response content.

Running flow chart and basic commands:
Data flow
(Architecture diagram omitted: it shows the Scrapy framework, its components, and the data flow that occurs within the system, indicated by the red arrows.)
The data flow in Scrapy is controlled by the execution engine and proceeds as follows:

1. The engine first gets the initial requests from the spider.
2. The engine puts the requests into the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to crawl to the engine.
4. The engine sends the request to the downloader, passing through all the downloader middlewares in turn.
5. Once the page is downloaded, the downloader returns a response containing the page data, which again passes through all the downloader middlewares in turn.
6. The engine receives the response from the downloader and sends it to the spider for parsing, passing through all the spider middlewares in turn.
7. The spider processes the received response, parses out items and generates new requests, and sends them to the engine.
8. The engine sends the processed items to the item pipeline, sends the newly generated requests to the scheduler, and asks for the next request.
9. The process repeats (from step 2) until there are no more requests in the scheduler.
In step 8, if step 7 produced new requests, the loop continues from the beginning; if not, the data simply goes into the pipeline.

Simple usage:
Creating a new project
1. Project commands (the following are run on the console)
1. Create a project: this generates the __init__.py initialization file and the project directory (here named db) at the same level
scrapy startproject <project_name> [project_dir] (create the project directory)
ps: "<>" means required, "[]" means optional; [project_dir] is the project directory name, and if it is omitted the directory gets the same name as the project
scrapy startproject db

2. cd into the project: type cd <project name> in the console; the spider is then created inside the project.
Enter cd db to go into the outer db directory (the project root).
scrapy genspider [options] <name> <domain> (create the spider file)
[options] selects the spider template (the parent class to inherit from); if you leave it out, the default spider class is used.
scrapy genspider example example.com
The spider file is created under project/spiders (where all the spider files you write will live); example is the spider file name and example.com is the domain to crawl (a domain name, not a complete URL).
3. Run the project
scrapy crawl <spider name>  # watch the running process
4. Configure ROBOTSTXT_OBEY and DEFAULT_REQUEST_HEADERS in settings.py (see the sketch below)
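
As a minimal sketch, the settings.py entries for step 4 might look like this (the header values are illustrative assumptions, not taken from the original project):

    # settings.py (inside the db project)
    ROBOTSTXT_OBEY = False   # do not obey robots.txt (only if the site tolerates crawling)

    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # example value
    }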

Crawling Douban movie information:
Sending the requests: class DB250(scrapy.Spider):  (the spider inherits from the scrapy.Spider template)
a. name (must be unique; duplicates are not allowed)
b. allowed_domains: the domains the spider is allowed to crawl; more than one can be added (this limits which sites the spider may visit; one site may have many URLs under a domain, and if you leave it empty no domain filtering is applied)
c. start_urls: the initial URL list
Processing the data: def parse(self, response):  (a method we define; because of its return value it acts as the default callback, called once each initial URL has finished downloading, and it must do the following)
1. Parse the response (this is what the spider sends back to the engine in step 7), encapsulate the data into an item and return that object (it goes to the pipeline)
2. Extract any new URLs that need to be downloaded, build new requests and return them

With Scrapy we mainly focus on the parsing, so we only need to write the parse method.
movie_name = node.xpath("./div/a/span/text()").get()  Use get() to extract a single value and getall() to extract a list; get() is usually enough (it is equivalent to taking [0] of the extracted list).

item = {} defines an empty dictionary; in Python it can also be written as dict().
Then item['name'] = movie_name puts the movie name into the item dictionary as the value of the key 'name'.

json.dumps(item, ensure_ascii=False)  (ensure_ascii=False means the data is not escaped into ASCII codes)

f.write(data + '\n')  (write the data once, then a newline)
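
Putting the pieces above together, a minimal version of the spider might look like the sketch below (the XPath expressions, start URL and output file name are assumptions for illustration, not taken verbatim from the original project):

    import json
    import scrapy

    class DB250(scrapy.Spider):
        name = "db250"                          # must be unique within the project
        allowed_domains = ["douban.com"]        # domain only, not a full URL
        start_urls = ["https://movie.douban.com/top250"]

        def parse(self, response):
            # each node corresponds to one movie entry (XPath is illustrative)
            node_list = response.xpath('//div[@class="info"]')
            with open("douban.json", "a", encoding="utf-8") as f:
                for node in node_list:
                    item = {}
                    # get() returns a single string; getall() would return a list
                    item["name"] = node.xpath("./div/a/span/text()").get()
                    # ensure_ascii=False keeps the Chinese titles readable
                    f.write(json.dumps(item, ensure_ascii=False) + "\n")

Note that the file is opened inside parse; the item pipeline section below removes exactly this inefficiency.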

Following links

The spider above only crawls one page, which of course does not meet our requirements. We need to crawl the next page, and the one after that, until all the information has been downloaded.
We either extract the link from the page or build it according to a rule. Now let's modify our spider to recursively crawl the link of the next page and extract data from it.

We create a class variable page_num to record the page number currently being crawled, extract the movie information in the parse function,
then increment page_num on the spider object to construct the URL of the next page, create a scrapy.Request object and yield it.
If no movie information can be extracted from the response, we conclude that the last page has been reached and the parse function simply ends with return.
In class Db250Spider(scrapy.Spider):
set a class variable page_num = 0,
then add yield item as the last line of the loop in def parse (yield hands back one item at a time without ending the function, whereas ending with return would stop it immediately),
then add self.page_num += 1 (the first page has already been fetched above, so incrementing here moves on to the second page),
page_url = 'http~~{}~~'.format(self.page_num * 25)
yield scrapy.Request(page_url, callback=self.parse)

(callback=self.parse means the new request is handled by the same parse method defined above,) so the request is constructed again.
Finally:
Since crawling has to stop after the last page, wrap the for loop in an if statement:
if node_list:
    for ~~~~~ in node_list:
        ·····yield····
(if nothing can be sent to the pipeline here, the else branch is taken instead)
else:
    return  (returning with no value simply ends the parse call)
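
A sketch of the modified spider follows, assuming the usual Top 250 URL pattern with a start offset of 25 per page (the exact URL was elided in the original):

    import scrapy

    class Db250Spider(scrapy.Spider):
        name = "db250"
        allowed_domains = ["douban.com"]
        start_urls = ["https://movie.douban.com/top250"]
        page_num = 0  # class variable tracking the current page

        def parse(self, response):
            node_list = response.xpath('//div[@class="info"]')  # illustrative XPath
            if node_list:
                for node in node_list:
                    item = {}
                    item["name"] = node.xpath("./div/a/span/text()").get()
                    yield item  # hand each item to the engine (and on to the pipeline)
                # build the next page and feed it back through the same parse method
                self.page_num += 1
                page_url = "https://movie.douban.com/top250?start={}".format(self.page_num * 25)
                yield scrapy.Request(page_url, callback=self.parse)
            else:
                # no movie nodes found: we are past the last page, so stop
                return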

Defining the item pipeline

So far we have not really seen the advantages of a Scrapy crawler, and the spider above still has a serious problem: its file handling.
Every call to the parse method opens and closes the file. Ideally the file should be opened and closed once per crawl, but here it is opened and closed over and over again, which is a huge waste of resources.
After the parse function has extracted the information we need, it can pack it into a dictionary or a scrapy.Item object (usually an Item object; we will talk about that below) and return it.
That object is sent to the item pipeline, which processes it by running several components in sequence. Each item pipeline component is a Python class implementing a few simple methods.
Each component receives an item, performs an operation on it, and decides whether the item should continue through the pipeline or be dropped and processed no further.
Typical uses of the item pipeline are:
cleaning up HTML data,
validating scraped data (checking that the item contains certain fields),
checking for duplicates (and dropping them), and
persisting the scraped items.
We first modify the code as follows:
Default pipeline class: in def process_item(self, item, spider), the item argument is the parsed data passed into the pipeline;
return item passes it on (the default behaviour is simply to pass whatever comes in straight back out).

Change it to:
def open_spider(self, spider):
    self.f = open('~~', 'w', encoding='~~')
(once this is set up, the open()-based file-writing code in the spider written earlier can be removed)
def process_item(self, item, spider):
    jsonstr = json.dumps(dict(item), ensure_ascii=False) + '\n'
    self.f.write(jsonstr)
    return item
def close_spider(self, spider):  (this function runs only when the crawler is closed; don't forget to close the file at the end)
    self.f.close()  (this way the file is not opened and closed for every item; it stays open and is only closed after the crawler shuts down)
Because many pipelines can be defined and not all of them need to be used, you activate whichever ones
you need: find ITEM_PIPELINES = {'db.pipelines.DbPipeline': 300} in settings.py and uncomment it to activate the pipeline; the integer value assigned to each class determines the order in which they run, from low to high.
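
Put together, the pipeline class in pipelines.py might look like the following sketch (the file name and class name follow the db project created above and are assumptions):

    import json

    class DbPipeline:
        def open_spider(self, spider):
            # runs once when the spider starts: open the output file a single time
            self.f = open("douban.json", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # runs for every item yielded by the spider
            jsonstr = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.f.write(jsonstr)
            return item  # pass the item on to any later pipeline component

        def close_spider(self, spider):
            # runs once when the spider closes: close the file
            self.f.close()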

Defining items

The main goal of scraping is to extract structured data from unstructured sources (usually web pages).
A Scrapy spider can return the extracted data as Python dicts.
Although convenient and familiar, Python dicts lack structure: it is easy to mistype a field name or return inconsistent data, especially in large projects with many spiders.

To define a common output data format, Scrapy provides the Item class (a safer kind of dictionary).
Item objects are simple containers for collecting scraped data.
They provide a dictionary-like API and a convenient syntax for declaring their available fields.
The scrapy.Item object is used much like a Python dictionary.
Edit the items.py file in the project directory to define your items.

Then we only need to import the Item class we defined into the spider and, after instantiating it, use it to structure our data.

name = scrapy.Field()  (each returned field is declared with scrapy.Field(); this is what structures the data)
Import it in the spider file: from …items (the path to the items module) import DbItem (the item class name).
After importing the item class, the spider only needs to change to item = DbItem(), and the json.dumps file-writing code can be removed.
The instantiated item already provides the data structuring.
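
A sketch of items.py and its use in the spider, assuming the DbItem class name mentioned above and a single name field:

    # items.py
    import scrapy

    class DbItem(scrapy.Item):
        # declare every field the spider will fill in
        name = scrapy.Field()

    # in the spider file, the parse loop then becomes roughly:
    #     from db.items import DbItem   # or the relative "..items" form
    #     item = DbItem()
    #     item["name"] = node.xpath("./div/a/span/text()").get()
    #     yield item                    # the pipeline now receives a DbItem, not a dict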

2. The shell interactive platform
scrapy shell <url> (url is the site to fetch)
The response object and its methods are then available, for example response.xpath('//a/~~~').get() and response.status.
This gives us the same response we would get inside our project,
so we can test whether an XPath expression is correct and whether it actually extracts the data.
The parsing helpers come built into the shell, so there is no extra library to import.
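
A typical check might look like the following (the URL and XPath are illustrative placeholders):

    scrapy shell https://movie.douban.com/top250

    # inside the shell, the response object is ready to use:
    response.status                                                     # HTTP status code of the fetch
    response.xpath('//div[@class="info"]/div/a/span/text()').get()      # test a single value
    response.xpath('//div[@class="info"]/div/a/span/text()').getall()   # or all matches as a list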

Project notes:
1. Robots: the settings file defaults to ROBOTSTXT_OBEY = True; change it to False in order to crawl data, unless the target website permits crawling under its robots protocol anyway.
2. Headers: request headers can also be added in settings (DEFAULT_REQUEST_HEADERS).
3. Remember to enable the item pipeline (ITEM_PIPELINES).
4. File writing: use encoding='utf-8'.
5. Dumping JSON data: use ensure_ascii=False.
6. allowed_domains: if no domain is entered, requests are not filtered by domain.



Origin blog.csdn.net/qwe863226687/article/details/114116971