1: Review of the use of each file
1. Use of items
The items.py file is mainly used to define the data structure for storing crawled data, which facilitates passing data between the spider and the Item Pipeline.
items.py:
```python
import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
```
2. Use of pipelines
(1) Introduction
The pipelines.py file is mainly used to process the crawled data. Generally, one class is one pipeline; for example, you might create one pipeline class for storing to MySQL and another for storing to MongoDB. The process_item() method in the pipeline file is where the crawled data is actually processed.
(2) Commonly used pipeline methods
- process_item(self, item, spider): processes the specific data captured by the spider. process_item() must return item, because when there are multiple pipelines, the return value of this method is handed to the next pipeline for further processing.
- open_spider(self, spider): executed only once, when the crawler project starts; generally used for setup such as opening a database connection.
- close_spider(self, spider): executed only once, when the crawler project ends; generally used for cleanup, such as closing the database.
(3) Points to note about pipelines
- The smaller the value assigned to a pipeline in ITEM_PIPELINES, the higher its priority.
- The method name process_item cannot be changed to anything else.
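The three methods above can be sketched as one minimal pipeline class. This is an illustrative sketch, not code from Scrapy itself: Scrapy pipelines are plain Python classes, so it runs standalone, with stdlib SQLite standing in for MySQL/MongoDB; the class and table names are made up for the example.

```python
import sqlite3

class SQLitePipeline:
    """Minimal pipeline sketch showing open_spider/process_item/close_spider.
    SQLite stands in for MySQL; in a real project Scrapy passes the spider."""

    def open_spider(self, spider):
        # Runs once when the spider starts: set up the database connection.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE jobs (title TEXT, position TEXT, date TEXT)"
        )

    def process_item(self, item, spider):
        # Runs for every item; must return the item so that any
        # later pipeline in ITEM_PIPELINES receives it too.
        self.conn.execute(
            "INSERT INTO jobs VALUES (?, ?, ?)",
            (item["title"], item["position"], item["date"]),
        )
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: commit and release resources.
        self.conn.commit()
        self.conn.close()
```

In a real project you would register the class under ITEM_PIPELINES in settings.py and let Scrapy call these three methods for you.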
2: Workflow review
1. How to handle page turning
2. scrapy.Request knowledge points
scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)
Commonly used parameters:
- callback: specifies which parse function handles the response for the given URL.
- meta: passes data between different parse functions; meta also carries some information by default, such as the download delay and the request depth.
- dont_filter: tells scrapy's deduplication not to filter the current URL. Scrapy deduplicates URLs by default, so this parameter is important for URLs that need to be requested repeatedly.
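To make the callback/meta mechanics concrete, here is a toy sketch of how data placed in meta on one request is read back in the next callback. It deliberately avoids importing Scrapy so it runs standalone: the FakeRequest/FakeResponse classes are hypothetical stand-ins, not Scrapy's real ones, and the item values are invented for the example.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: url, callback, meta."""
    def __init__(self, url, callback=None, meta=None, dont_filter=False):
        self.url = url
        self.callback = callback
        self.meta = meta or {}
        self.dont_filter = dont_filter

class FakeResponse:
    """Hypothetical stand-in for a scrapy Response: exposes the request's meta."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta

class ListSpider:
    """Parse a listing page, then hand the half-built item to the
    detail-page callback through meta."""
    def parse(self, response):
        item = {"title": "Python Engineer"}
        # Pass the partially built item on to parse_detail via meta.
        yield FakeRequest("https://example.com/detail/1",
                          callback=self.parse_detail,
                          meta={"item": item})

    def parse_detail(self, response):
        item = response.meta["item"]   # retrieve the data passed in
        item["date"] = "2024-01-01"
        yield item
```

With real Scrapy the pattern is identical, only with scrapy.Request and the responses the engine delivers.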
Three: Scrapy download middleware
Download middleware provides hooks for modifying and extending Scrapy's Request and Response objects during the crawl. Usage:
- Write a downloader middleware just like we wrote a pipeline: define a class, then enable it in settings.
- Processing requests and processing responses correspond to two methods:
process_request(self, request, spider):
called when each request passes through the download middleware.
process_response(self, request, response, spider):
called when the downloader finishes the HTTP request and passes the response to the engine.
process_request is called when each Request object passes through the download middleware; the higher the middleware's priority, the earlier it is called. This method should return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception.
- Return None: scrapy continues to execute the corresponding methods of the other middlewares.
- Return a Response object: scrapy will not call the process_request methods of the other middlewares or start the download, but directly returns the Response object.
- Return a Request object: scrapy will not call the process_request() methods of the other middlewares; the request is placed in the scheduler to be scheduled for download.
- If this method raises an exception, the process_exception method will be called.
process_response(request, response, spider) is called when each Response passes through the download middleware; the higher the middleware's priority, the later it is called, the opposite of process_request(). This method returns one of the following: a Response object, a Request object, or raises an IgnoreRequest exception.
- Return a Response object: scrapy continues to call the process_response methods of the other middlewares.
- Return a Request object: the middleware chain stops and the request is placed in the scheduler to be scheduled for download.
- Raise an IgnoreRequest exception: the Request.errback function is called to handle it; if it is not handled, the request is ignored and not written to the log.
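The two hooks can be sketched with a toy middleware that stamps a header on the way out and reschedules the download on a server error on the way back. This is an illustrative sketch that runs without Scrapy: the Request/Response stand-in classes, the header, and the retry-on-5xx policy are all made up for the example.

```python
class Request:
    """Hypothetical stand-in for scrapy.Request (just url + headers)."""
    def __init__(self, url, headers=None):
        self.url = url
        self.headers = headers or {}

class Response:
    """Hypothetical stand-in for a scrapy Response (just url + status)."""
    def __init__(self, url, status=200):
        self.url = url
        self.status = status

class RetryHeaderMiddleware:
    def process_request(self, request, spider):
        # Modify the outgoing request, then return None so the other
        # middlewares and the downloader still run.
        request.headers.setdefault("Accept-Language", "en")
        return None

    def process_response(self, request, response, spider):
        # On a server error, return a Request so it goes back to the
        # scheduler; otherwise return the Response so later middlewares
        # (and finally the spider) can process it.
        if response.status >= 500:
            return Request(request.url, headers=request.headers)
        return response
```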
1. Middleware workflow
Download middleware works as follows:
- When the Scrapy engine receives a request that needs to be downloaded, it sends the request to the download middleware.
- After the download middleware receives the request, it can modify it, for example by adding headers or a proxy.
- The modified request is sent to the target server, and the target server returns response data.
- After the download middleware receives the response data, it can modify the response, for example by decrypting, decompressing, or changing the encoding.
- The modified response is returned to the Scrapy engine, which continues to process the response data.
2. Setting a random UA through middleware
When a crawler accesses pages frequently with an unchanging request header, the server can easily detect it and block that request header. We therefore randomly change the request header before each page access, so the crawler is less likely to be caught. This random rotation can be implemented in the download middleware: before a request is sent to the server, a request header is chosen at random, which avoids always using the same one.
Requirement: set a random UA through middleware
Core of the middleware:
```python
# Intercept all requests
def process_request(self, request, spider):
    # request is the request object; spider points to the current spider object
    # Called for each request that goes through the downloader middleware.
    # Must either:
    # - return None: continue processing this request
    #   (pass the request on towards the downloader for normal handling)
    # - or return a Response object
    #   (terminate the current flow and return this response to the
    #    spider via the engine)
    # - or return a Request object
    #   (terminate the current flow and hand the request back to the
    #    scheduler; usually this is a replacement request)
    # - or raise IgnoreRequest: process_exception() methods of installed
    #   downloader middleware will be called
    #   (if no handler deals with the exception, the request is simply
    #    ignored and no error is logged)
    return None
```
```python
# Custom download middleware
# Import the random-UA library
from fake_useragent import UserAgent

class UADownloaderMiddleware:
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.random
        request.headers['User-Agent'] = user_agent
```
Note: enable this middleware in settings:
```python
DOWNLOADER_MIDDLEWARES = {
    # 'mw.middlewares.MwDownloaderMiddleware': 543,
    'mw.middlewares.UADownloaderMiddleware': 543,
}
```
The spider (爬虫程序.py):
```python
import scrapy

class UaSpider(scrapy.Spider):
    name = 'ua'
    allowed_domains = ['httpbin.org']
    start_urls = ['https://httpbin.org/user-agent']

    def parse(self, response):
        print(response.text)
        # dont_filter=True disables scrapy's automatic URL deduplication,
        # so the same URL can be requested again
        yield scrapy.Request(url=self.start_urls[0],
                             callback=self.parse,
                             dont_filter=True)
```
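The deduplication that dont_filter bypasses is fingerprint-based: Scrapy remembers a fingerprint per request and drops repeats. The toy sketch below mimics that idea with a set of fingerprints, skipped when dont_filter is set. It is a simplified stand-in, not Scrapy's actual dupefilter, which hashes more fields than method and URL.

```python
import hashlib

class ToyDupeFilter:
    """Simplified stand-in for Scrapy's duplicate filter."""
    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url):
        # Scrapy's real fingerprint also covers body etc.; method+URL
        # is enough to illustrate the mechanism.
        return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

    def should_request(self, method, url, dont_filter=False):
        fp = self.fingerprint(method, url)
        if dont_filter:
            return True          # bypass deduplication entirely
        if fp in self.seen:
            return False         # already requested: filtered out
        self.seen.add(fp)
        return True
```

This is why the spider above needs dont_filter=True: it requests the same start URL over and over, which the default filter would otherwise drop after the first request.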
3. Scrapy downloads pictures
Scrapy provides reusable item pipelines for downloading the files attached to items. These pipelines share some common methods and structure; generally you would use the built-in Images Pipeline to download pictures.
Steps to use the images pipeline to download files:
- Define an Item, and define two attributes in it: image_urls and images. image_urls stores the links to the files that need to be downloaded; it must be given as a list of URLs.
- When the file download is complete, information about the download (such as the download path, the downloaded URL, and the image checksum) is stored in the item's images attribute.
- In the settings.py configuration file, configure IMAGES_STORE; this setting is the file download path.
- Enable the pipeline: in ITEM_PIPELINES, set scrapy.pipelines.images.ImagesPipeline: 1.
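Putting the steps together, a minimal sketch of the item and settings might look like this. It is a configuration sketch under the assumptions that a Scrapy project exists and Pillow is installed; the item class name and the IMAGES_STORE path are made up for the example.

```python
# items.py: the two fields the Images Pipeline expects
import scrapy

class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()   # list of URLs to download
    images = scrapy.Field()       # filled in by the pipeline after download

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'         # download directory (example path)
```

The spider then only has to yield ImageItem(image_urls=[...]) and the pipeline handles the downloads.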