12: Crawler - Scrapy Framework (Part 1)

1: Scrapy Introduction

1. What is Scrapy?

Scrapy is an application framework (an asynchronous crawler framework) written in Python for crawling website data and extracting structured data.
With Scrapy we can easily implement a crawler that scrapes the content or images of a target website.
Scrapy is built on Twisted, an asynchronous networking framework, which speeds up our downloads.

  • The difference between asynchronous and non-blocking


Asynchronous: after the call is issued, it returns immediately, regardless of whether a result is available yet.
Non-blocking: the focus is on the state of the program while it waits for the result; even when the result cannot be returned immediately, the call does not block the current thread.
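A minimal Python sketch of the two ideas (the function names here are illustrative, not part of Scrapy):

import threading
import time

def async_download(url, callback):
    # Asynchronous: start the work in the background and return immediately;
    # the result arrives later through the callback.
    def worker():
        time.sleep(1)                      # pretend this is network I/O
        callback(f"response for {url}")
    threading.Thread(target=worker).start()

async_download("https://example.com", print)
# Non-blocking: the call above has already returned, so the current thread
# keeps running instead of waiting for the result.
print("call returned, main thread keeps running")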

2. Advantages of Scrapy

An essential technology for crawler work:

  • Makes our crawler programs more stable and efficient (multi-threaded)
  • Highly configurable and extensible (very flexible)
  • The downloader (based on multi-threading) sends requests and fetches responses

3. Scrapy Learning References

Scrapy official learning site (Chinese translation, 1.0): https://scrapy-chs.readthedocs.io/zh_CN/1.0/intro/overview.html
Latest documentation: https://docs.scrapy.org/en/latest/

4. Installing Scrapy

pip install scrapy==2.5.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
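To check that the install worked, you can print the installed version (a quick sketch; 2.5.1 is simply the version pinned above):

import scrapy
print(scrapy.__version__)   # expect '2.5.1' with the pinned install above

Running "scrapy version" on the command line gives the same information.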

2: Scrapy Workflow

[Figure: one way to crawl]
[Figure: another way to crawl]

Workflow:

[Figure: Scrapy workflow]

1. Introduction to the functions of each component:

Scrapy Engine (engine): the "commander-in-chief"; passes data and signals between all the other modules. Already implemented by Scrapy.
Scheduler (scheduler): a queue that stores the requests sent over by the engine. Already implemented by Scrapy.
Downloader (downloader): downloads the requests sent by the engine and returns the responses to the engine. Already implemented by Scrapy.
Spider (spider): parses the responses handed over by the engine, extracts data and new URLs, and passes them back to the engine. Needs to be written by hand.
Item Pipeline (pipeline): processes the data passed on by the engine, e.g. stores it. Needs to be written by hand.
Downloader Middlewares (downloader middleware): customizable download extensions, e.g. setting a proxy. Generally not written by hand.
Spider Middlewares (spider middleware): customize requests and filter responses. Generally not written by hand.
1 Engine (engine)   already implemented by Scrapy
The core of Scrapy; it connects all the modules and manages the flow of data.

2 Scheduler (scheduler)   already implemented by Scrapy
Essentially a queue that holds the requests we are about to send; you can think of it as a container of URLs.
It decides which URL to crawl next, and URL deduplication is usually done here.

3 Downloader (downloader)   already implemented by Scrapy
Essentially the module that fires off the requests; beginners can simply think of it as a requests.get(),
except that it returns a response object.

4 Spider (spider)   needs to be written by hand
This is the first part we write ourselves; it parses the response object returned by the downloader and extracts the data we need.

5 Item Pipeline (pipeline)   needs to be written by hand
This is the second part we write ourselves; it is mainly responsible for storing the data and other persistence operations.

6 Downloader Middlewares   generally not written by hand
Customizable download extensions, e.g. setting a proxy; they handle the requests and responses that pass between the engine and the downloader (used quite often).

7 Spider Middlewares   generally not written by hand
Can customize requests and filter responses (they handle the spider's responses, its output, and its new requests).
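To make the spider's role concrete (parse the response, hand extracted data back to the engine, and yield new requests for the scheduler), here is a minimal sketch; the site and CSS selectors come from the common quotes.toscrape.com practice target, not from this post:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # extracted data is yielded back to the engine and flows on to the pipelines
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get(),
                   "author": quote.css("small.author::text").get()}
        # new requests are yielded back to the engine and queued by the scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)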

3: Scrapy Getting Started and Summary

1. Scrapy Getting Started

Prerequisite: switch into the right directory first (cd; "Copy path" copies the absolute path).

1. Create a Scrapy project
scrapy startproject mySpider
scrapy startproject (fixed)
mySpider (not fixed: the name of the project you want to create)

2. Enter the project directory: cd mySpider

3. Create a spider
scrapy genspider example example.com

scrapy genspider: fixed
example: the name of the spider (not fixed)
example.com: the allowed crawl domain (not fixed). It is chosen according to your target URL, it does matter, and it can be changed later.

Target URL: https://www.baidu.com/

scrapy genspider bd baidu.com

4. Run the spider
scrapy crawl bd
scrapy crawl: fixed
bd: the name of the spider to run

The project can also be run from a start.py file:
from scrapy import cmdline
cmdline.execute("scrapy crawl bd".split())

2. Scrapy Project File Description

bd.py (the spider file)
    # the name of the spider
    name = 'bd'
    # the allowed crawl domains
    # In practice the first page might be xxx.com while the third page turns into xxx.cn
    # or xxx.yy; in that case no data would be crawled,
    # so we may need to add more entries to the allowed_domains list
    allowed_domains = ['baidu.com']
    # start URLs; Scrapy generates them by completing the allowed_domains prefix,
    # but sometimes the generated URL is wrong, so we may need to edit it
    start_urls = ['https://www.baidu.com/']

    # dedicated to parsing the data
    def parse(self, response):
        pass
items.py: defines the containers (items) for the scraped data
middlewares.py: the middlewares (spider middleware and downloader middleware)
pipelines.py: the pipelines (for saving the data)

settings.py: the Scrapy configuration

# 1 Auto-generated settings; no need to pay attention to them or modify them
BOT_NAME = 'mySpider'
SPIDER_MODULES = ['mySpider.spiders']
NEWSPIDER_MODULE = 'mySpider.spiders'

# 2 Reduce logging output (warnings and above only)
LOG_LEVEL = 'WARNING'

# 3 Set the User-Agent; not commonly used here, it is usually added in a middleware instead
USER_AGENT = 'mySpider (+http://www.yourdomain.com)'

# 4 Obey the crawl rules in robots.txt; many people like to set this to False, and so do I....
ROBOTSTXT_OBEY = True

# 5 Maximum number of concurrent requests to a website; the default is 16
CONCURRENT_REQUESTS = 32

# 6 Delay between two requests to the same website; the default is 0s. Comparable to time.sleep()
DOWNLOAD_DELAY = 3

# 7 Disable cookies; the default is True (enabled)
COOKIES_ENABLED = False

# 8 Default request headers
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# 9 Enable spider middlewares; the key is the class path, the value is the priority
SPIDER_MIDDLEWARES = {
   'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
}

# 10 Enable downloader middlewares
DOWNLOADER_MIDDLEWARES = {
   'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
}

# 11 Enable pipelines, used to persist the data
ITEM_PIPELINES = {
   'mySpider.pipelines.MyspiderPipeline': 300,
}

More reference on the settings configuration items: https://www.cnblogs.com/seven0007/p/scrapy_setting.html
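The MyspiderPipeline enabled in ITEM_PIPELINES above has to exist in pipelines.py; a minimal sketch of what such a class can look like (writing to a JSON-lines file is just an illustrative choice):

import json

class MyspiderPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("items.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()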

3. Scrapy Summary

Scrapy is essentially a modular restructuring of the crawlers we normally write by hand: every piece of functionality is encapsulated separately, the modules do not depend on each other directly, and everything is coordinated by the engine. The idea worth taking away is decoupling: keeping the relationships between modules loose means that replacing any single module becomes easy and has no impact on the others.


Origin: blog.csdn.net/qiao_yue/article/details/135281490