Scrapy framework introduction
Scrapy is a framework for rapidly developing web crawlers: a working crawler takes only a small amount of code.
Advantages
- Reduces the amount of duplicate code
- Improves development efficiency
Scrapy components
- spiders folder: stores the crawlers; each crawler must have a unique name
- __init__.py: initialization file
- settings.py: configuration file that holds the crawler's settings
- middlewares.py: middleware file, used to intercept requests and configure middleware, e.g. setting cookies, the User-Agent, or proxy IPs
- pipelines.py: pipeline file, used to persist the scraped data
- items.py: item definitions, declaring the fields of the entities to be scraped
Developing with the Scrapy framework
As an example, we will crawl the Douban Top 250 movie list.
In a Python environment, open a command line in your development directory and install Scrapy with pip:

```shell
pip install Scrapy
```
After the installation completes, create a crawler project. The project name can be anything; here it is douban_scrapy:

```shell
scrapy startproject douban_scrapy
```
After creation, enter the generated project directory:

```shell
cd douban_scrapy
```
Inside the project directory, create a crawler. The crawler name can be anything; here it is douban. After the name comes the start URL of the target site:

```shell
scrapy genspider douban https://movie.douban.com/top250
```
After creating the crawler, you can see it in the spiders folder. Next, open the settings.py file in the project directory and add the following configuration:
```python
# Log level: only show warnings and above
LOG_LEVEL = "WARNING"

# Disguise the crawler with a browser User-Agent. You can copy one from your
# browser's developer tools; it does not have to match the one below.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200"

# Respect the site's robots.txt rules
ROBOTSTXT_OBEY = True

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 8

# Delay between requests (in seconds), to avoid the IP being blocked for
# requesting too frequently
DOWNLOAD_DELAY = 3
```
After configuring the crawler, open the items.py file and define the item class for the fields to be scraped:

```python
import scrapy


class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rank = scrapy.Field()
    subject = scrapy.Field()
```
Open the crawler file just created (douban.py) and modify the parse method, which parses the downloaded HTML:

```python
import scrapy

# Remember to import the item class this way (relative to the project package)
from ..items import MovieItem


class DoubanSpider(scrapy.Spider):
    name = "douban"

    # This method parses the HTML of each downloaded page
    def parse(self, response):
        sel = scrapy.Selector(response)
        list_items = sel.css('#content > div > div.article > ol > li')
        for list_item in list_items:
            movie_item = MovieItem()
            movie_item['title'] = list_item.css('span.title::text').extract_first()
            movie_item['rank'] = list_item.css('span.rating_num::text').extract_first()
            movie_item['subject'] = list_item.css('span.inq::text').extract_first()
            yield movie_item
```
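The selector logic above depends on Scrapy, but the underlying idea (walk the list items and pull out span texts) can be sketched with the standard library's html.parser. This is a simplified, hypothetical stand-in for the CSS selectors, not Scrapy's implementation; the class name TitleCollector and the sample HTML are ours:

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text inside <span class="title"> tags, mimicking in
    spirit what list_item.css('span.title::text') does in the spider."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and dict(attrs).get('class') == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)


# A tiny HTML fragment shaped like two entries of the Top 250 list
html = ('<ol><li><span class="title">The Shawshank Redemption</span></li>'
        '<li><span class="title">Farewell My Concubine</span></li></ol>')
parser = TitleCollector()
parser.feed(html)
print(parser.titles)  # ['The Shawshank Redemption', 'Farewell My Concubine']
```

In the real spider, Scrapy's selectors do this traversal for you; the sketch only shows why each `list_item` yields at most one title.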
Open the pipelines.py file and configure persistence.
Method 1: save to an Excel file
First install the openpyxl library:

```shell
pip install openpyxl
```

Then add an Excel pipeline class:
```python
import openpyxl


class ExcelPipeline:
    def __init__(self):
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.title = 'Top'
        self.ws.append(('Title', 'Rating', 'Subject'))

    def close_spider(self, spider):
        self.wb.save('movie_data.xlsx')

    def process_item(self, item, spider):
        title = item.get('title', '')
        rank = item.get('rank', '')
        subject = item.get('subject', '')
        self.ws.append((title, rank, subject))
        return item
```
Method 2: save to a MySQL database
First install the pymysql driver:

```shell
pip install pymysql
```

Then add a MySQL pipeline class:
```python
import pymysql


class MysqlPipeline:
    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='1234', database='spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()
        self.data = []

    def close_spider(self, spider):
        # Write any remaining buffered rows before closing the connection
        if len(self.data) > 0:
            self.__write_to_DB()
        self.conn.close()

    def process_item(self, item, spider):
        title = item.get('title', '')
        rank = item.get('rank', 0)
        subject = item.get('subject', '')
        self.data.append((title, rank, subject))
        # Flush to the database in batches of 100 rows
        if len(self.data) == 100:
            self.__write_to_DB()
            self.data.clear()
        return item

    def __write_to_DB(self):
        self.cursor.executemany(
            'insert into tb_douban_movie(title, rating, subject) values (%s, %s, %s)',
            self.data
        )
        self.conn.commit()
```
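The buffer-and-flush pattern above (accumulate 100 rows, write them in one executemany call, and flush the remainder when the spider closes) can be sketched without a database. BatchWriter below is a hypothetical stand-in in which a plain list plays the role of the table:

```python
class BatchWriter:
    """Minimal sketch of MysqlPipeline's batching, assuming a list can
    stand in for the database table (no pymysql involved)."""
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []   # rows waiting to be written
        self.written = []  # stands in for tb_douban_movie

    def process_item(self, item):
        self.buffer.append(item)
        # Flush once a full batch has accumulated, like process_item()
        if len(self.buffer) == self.batch_size:
            self._flush()
        return item

    def close(self):
        # Flush any remainder, mirroring close_spider()
        if self.buffer:
            self._flush()

    def _flush(self):
        # The executemany() + commit() equivalent
        self.written.extend(self.buffer)
        self.buffer.clear()


writer = BatchWriter(batch_size=100)
for i in range(250):
    writer.process_item(('title %d' % i, 9.0, 'subject'))
writer.close()
print(len(writer.written))  # 250: two full batches plus a final flush of 50
```

Batching like this trades a little memory for far fewer round trips to the database, which matters when each insert would otherwise be a separate network call.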
After adding one or both pipeline classes, open the settings.py file again and register them:

```python
ITEM_PIPELINES = {
    # Pipeline priority: lower numbers run earlier
    "douban_scrapy.pipelines.ExcelPipeline": 300,
    "douban_scrapy.pipelines.MysqlPipeline": 200,
}
```
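Scrapy sorts ITEM_PIPELINES by value and passes every item through each pipeline in ascending order, so with the numbers above MysqlPipeline (200) sees each item before ExcelPipeline (300). A minimal sketch of that ordering, with hypothetical stand-in functions rather than real pipeline classes:

```python
# Stand-ins for the two pipelines; each appends a marker so the
# traversal order is visible in the final item.
def mysql_stage(item):
    item.append('mysql')  # priority 200: runs first
    return item


def excel_stage(item):
    item.append('excel')  # priority 300: runs second
    return item


# Same shape as ITEM_PIPELINES: callable -> priority number
pipelines = {excel_stage: 300, mysql_stage: 200}

# Sort by priority value, then chain the item through each stage
order = [fn for fn, prio in sorted(pipelines.items(), key=lambda kv: kv[1])]
item = []
for stage in order:
    item = stage(item)
print(item)  # ['mysql', 'excel']
```

Order matters when one pipeline modifies or drops items: a filtering pipeline with a low number can discard an item before a slower export pipeline ever sees it.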
Finally, open a command line in the project directory and start the crawler:

```shell
cd douban_scrapy
scrapy crawl douban
```

When it finishes, you will find the Excel file (movie_data.xlsx) and/or the rows in the database, and the crawl is complete.