Python Web Crawling with Scrapy (with an example macOS environment)
1. Concept
Scrapy is an open-source web crawling framework written in Python, designed to crawl web pages and extract structured data.
Scrapy is built on the Twisted asynchronous networking framework, which speeds up downloads considerably.
Official documentation: Scrapy
2. Workflow
- The start URLs in the spider (crawler) are built into request objects —> spider middleware —> engine —> scheduler
- The scheduler hands a request —> engine —> downloader middleware —> downloader
- The downloader sends the request and gets a response —> downloader middleware —> engine —> spider middleware —> spider
- The spider extracts new URLs and builds them into request objects —> spider middleware —> engine —> scheduler, repeating from step 2
- The spider extracts data —> engine —> item pipeline, which processes and saves the data
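The loop above can be sketched in plain Python. This is only a toy illustration of the idea, not Scrapy's real API: the scheduler is a FIFO queue that de-duplicates, and `parse()` yields either new URLs (strings) or data items (dicts).

```python
from collections import deque

def crawl(start_urls, download, parse):
    """Toy sketch of Scrapy's engine loop (illustrative names only)."""
    scheduler = deque(start_urls)        # step 1: start URLs enter the scheduler
    seen = set(start_urls)               # the scheduler also de-duplicates requests
    items = []
    while scheduler:
        url = scheduler.popleft()        # step 2: the engine takes a request
        response = download(url)         # step 3: the downloader fetches a response
        for result in parse(response):   # the spider parses the response
            if isinstance(result, str):  # step 4: a new URL goes back to the scheduler
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)
            else:                        # step 5: a data item goes to the pipeline
                items.append(result)
    return items

# tiny demo: page 'a' links to page 'b'; each page yields one item
pages = {'a': ['b', {'page': 'a'}], 'b': [{'page': 'b'}]}
print(crawl(['a'], download=lambda url: pages[url], parse=iter))
# → [{'page': 'a'}, {'page': 'b'}]
```

Real Scrapy does all of this asynchronously via Twisted, so many downloads are in flight at once; the synchronous loop above only shows the data flow.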
2.1 Basic crawler process
2.2 Basic crawler module relationship
2.3 Scrapy workflow
3. The specific role of each module in Scrapy
- Engine: the core; passes data and signals between all the other components
- Scheduler: a queue that stores request objects and de-duplicates them
- Downloader: sends the requests handed over by the engine and returns the responses
- Spider (crawler): parses responses, extracts data and new URLs
- Item pipeline: processes and saves the data extracted by the spider
- Downloader middleware and spider middleware: hooks for customizing requests, responses, and the spider's input and output
4. Example (crawling the "new 100" latest updates from Meijutt, an American TV series site)
4.1 Create a project
cd Desktop  # I like to keep projects on the desktop
scrapy startproject movie
If you see a folder with the following layout on the desktop, the project was created successfully.
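A freshly generated project looks roughly like this (layout for recent Scrapy versions):

```
movie/
├── scrapy.cfg            # deployment configuration
└── movie/
    ├── __init__.py
    ├── items.py          # data models for scraped items
    ├── middlewares.py    # downloader/spider middleware
    ├── pipelines.py      # item processing and saving
    ├── settings.py       # project-wide settings
    └── spiders/
        └── __init__.py   # spider modules go here
```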
4.2 Create a crawler
cd movie  # enter the movie directory first
scrapy genspider meiju meijutt.tv  # create a spider named meiju
Check the spiders directory and you will see a new file, meiju.py, which is the spider we just created.
4.3 Edit crawler
- pipelines.py and items.py are responsible for saving the data passed along by the spider; where to save it is up to the developer.
- middlewares.py defines and implements middleware.
- settings.py is the configuration file for the whole project.
- The spiders/ directory holds our spider files.
4.3.1 meiju.py
import scrapy
from movie.items import MovieItem

class MeijuSpider(scrapy.Spider):  # inherit from scrapy.Spider
    name = 'test'  # the spider's name, used by `scrapy crawl`
    allowed_domains = ['meijutt.tv']  # allowed domain
    start_urls = ['https://www.meijutt.tv/new100.html']  # complete start URL

    def parse(self, response):
        # select every li under the ul whose class is "top-list fn-clear"
        movies = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for each_movie in movies:
            item = MovieItem()
            item['number'] = each_movie.xpath('.//div[@class="lasted-num fn-left"]//text()').extract()[0]  # episode number
            item['name'] = each_movie.xpath('.//h5/a//text()').extract()[0]  # show name
            item['url'] = each_movie.xpath('.//h5/a/@href').extract()[0]  # show URL
            item['time'] = each_movie.xpath('.//div[@class="lasted-time new100time fn-right"]//text()').extract()[0]  # last update time
            yield item  # yield makes parse() a generator; Scrapy consumes one item at a time
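The extraction logic of the spider can be tried offline. The sketch below mirrors it using only the standard library's ElementTree (Scrapy itself uses the parsel selector library); the markup snippet is a fabricated stand-in whose structure is assumed from the spider's XPath expressions, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# simplified stand-in for one entry of the page's list markup
snippet = """
<ul class="top-list fn-clear">
  <li>
    <div class="lasted-num fn-left">01</div>
    <h5><a href="/content/99999.html">Some Show</a></h5>
    <div class="lasted-time new100time fn-right">2020-01-01</div>
  </li>
</ul>
"""

root = ET.fromstring(snippet)
items = []
for li in root.findall('li'):  # the spider's '//ul[...]/li' step
    items.append({
        'number': li.find("./div[@class='lasted-num fn-left']").text,
        'name': li.find('./h5/a').text,        # like './/h5/a//text()'
        'url': li.find('./h5/a').get('href'),  # like './/h5/a/@href'
        'time': li.find("./div[@class='lasted-time new100time fn-right']").text,
    })
print(items)
```

ElementTree only supports a small XPath subset, which is why attribute values and text are read via `.get()` and `.text` rather than `@href` and `text()` path steps.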
4.3.2 items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here:
    number = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    time = scrapy.Field()
4.3.3 Setting configuration file settings.py
Add this setting to register the pipeline (the number is its priority, from 0 to 1000; pipelines with lower numbers run first):
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,
}
4.3.4 Set the data processing script pipelines.py
import json

class MoviePipeline(object):
    def process_item(self, item, spider):
        # append each item to a JSON-lines file;
        # ensure_ascii=False keeps non-ASCII titles readable
        with open('movies.json', 'a', encoding='utf-8') as f:
            f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # returning the item lets later pipelines process it too
4.4 Start the crawler
Open a terminal in the project directory (the one containing scrapy.cfg), then run:
scrapy crawl test  # use the spider's name attribute — not the file name!!! and not the function name!!!
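As an alternative to writing a pipeline, Scrapy's built-in feed exports can save the yielded items straight from the command line:

```shell
scrapy crawl test -o movies.json  # write all yielded items to movies.json
```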
If you see output like the figure below, the content we need was crawled successfully.