"Amazon Cloud Technology Product Review" call for papers | Crawling Douban book data with Scrapy
Authorization statement: This article grants Amazon Cloud Technology the right to repost and adapt it through official channels, including but not limited to the Developer Centre, Zhihu, self-media platforms, and third-party developer media.
Background
Hey friends, today I want to share an interesting and practical project with you~ Let's learn how to build a crawler for Douban Books on AWS Lightsail using Scrapy!
I have been studying Python crawlers recently. The company needs data from many sources, and that data has to be secure and highly available. Clearly, if we built everything from scratch, just getting the database right could eat a lot of time and still leave our data at risk. I had been considering a cloud database, and I happened to see that Lightsail on AWS can deploy a highly available database with one click, which is very efficient, and there are ready-made plans to subscribe to. Since Lightsail offers the first three months free, I had to try it!
With a high-performance, highly available database instance behind it, our crawling gets faster and faster, and distributed, concurrent crawling becomes far more effective.
Next, I will walk you through the whole process: setting up the basic environment, developing the Scrapy project, and deploying against AWS Lightsail. Once this system is running, you will be able to collect large amounts of book data easily and efficiently!
Introduction to Lightsail
Amazon Lightsail is a cloud computing service launched by AWS for developers and small businesses. It simplifies the process of deploying and managing virtual servers on the AWS cloud platform.
Lightsail provides preset virtual server instances, similar to traditional VPS rental plans, but it is fully managed on the AWS cloud, bringing greater flexibility and security to developers. Using Lightsail, we can complete the following tasks with simple mouse operations:
- Select the hardware configuration; memory options currently range from 512 MB to 8 GB, with matching vCPU counts.
- Create a public IP address or set network access rules.
- Select the operating system, supporting commonly used Linux distributions and Windows systems.
- Click "Start Instance" to quickly obtain virtual host resources.
In addition, Lightsail uses a simple pricing model: a flat monthly fee with a bundled data-transfer allowance. For individuals and small projects, the barrier to entry is low and the cost efficiency is high.
Crawling Douban book data based on Lightsail
Since this is a hands-on review, the service deserves a thorough workout. Here I start a Lightsail database instance to store Douban Books data, and use the scrapy framework to crawl the books on the Top 250 list together with their short reviews, writing everything into the Lightsail database instance. Follow my steps below to try it yourself~
1. Lightsail database instance construction
The console address is: https://lightsail.aws.amazon.com/ls/webapp/home/databases
Click "Create database".
Here we can use the minimum configuration. My configuration is as follows:
- Version: MySQL 8.0.35
- Type: Standard
- Architecture: Single node
- Compute: 1 GB RAM, 2 vCPUs
Unlike a self-hosted MySQL, we don't need to download and install anything on a server ourselves: purchasing the instance automatically provisions a high-performance cloud database, which is very convenient!
2. Connect with a client and build the data tables
For a client to reach the database instance, we first need to enable its public endpoint; public networking is disabled by default.
Once public mode is enabled, we can connect from the development environment using this host and port together with the account and password chosen at purchase time.
We connect through Navicat to prepare the database required for the crawler service.
- Create a new database named douban
- Create the books and book_comments tables
CREATE TABLE `book_comments` (
`id` int NOT NULL AUTO_INCREMENT,
`book_id` int NOT NULL,
`book_name` varchar(100) NOT NULL,
`username` varchar(30) NOT NULL,
`rating` int DEFAULT NULL,
`comment_time` datetime DEFAULT CURRENT_TIMESTAMP,
`useful_count` int DEFAULT '0',
`content` text,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
CREATE TABLE `books` (
`id` int NOT NULL AUTO_INCREMENT,
`book_id` int DEFAULT NULL,
`title` varchar(100) NOT NULL,
`cover` varchar(200) DEFAULT NULL,
`is_try_read` varchar(1) DEFAULT NULL,
`author` varchar(30) NOT NULL,
`publisher` varchar(50) DEFAULT NULL,
`publish_date` varchar(30) DEFAULT NULL,
`list_price` varchar(30) DEFAULT NULL,
`rating` float DEFAULT NULL,
`ratings_count` int DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=206 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Click "SQL" in Navicat to open a query window and run the statements above. With the base tables in place, we can move on to the development environment; AWS has already taken care of many database parameters and tuning details for us, so we just use the instance as-is.
3. Build a crawler project
3.1. Basic scrapy framework construction
- Install scrapy
pip install scrapy
- Create scrapy project
scrapy startproject douban_books
- Create a book crawler and review crawler
scrapy genspider <spider_name> <domain>
At this point, Scrapy has generated the standard project skeleton (spiders/, items.py, pipelines.py, settings.py, and so on).
- Dependency management
In order to facilitate dependency management and project migration, I defined a requirements.txt
file with the following content:
scrapy
pymysql
- Install dependencies
pip install -r requirements.txt
Here scrapy is our crawler framework, and pymysql is used to connect to the Lightsail MySQL database.
This completes the basic scaffolding. From here we mainly develop or modify the following files:
- book_spider.py book crawler (needs development)
- book_comment_spider.py book review crawler (needs development)
- items.py item (data model) definitions (needs development)
- pipelines.py pipeline file, used to store data into AWS database (requires development)
- settings.py configuration file (needs to be modified)
We first integrate Lightsail MySQL into the project.
3.2. Define items based on the data tables
Add two item classes mapping the tables in items.py, with the following code:
import scrapy


class DoubanBooksItem(scrapy.Item):
    book_id = scrapy.Field()
    title = scrapy.Field()
    cover_link = scrapy.Field()
    is_try_read = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    publish_date = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    rating_count = scrapy.Field()


class DoubanBookCommentItem(scrapy.Item):
    book_id = scrapy.Field()
    book_name = scrapy.Field()
    username = scrapy.Field()
    rating = scrapy.Field()
    comment_time = scrapy.Field()
    useful_count = scrapy.Field()
    content = scrapy.Field()
3.3. Integrate Lightsail MySQL
- Add database configuration
Add a DATABASE block at the end of settings.py:
DATABASE = {
'host': 'ls-f78481475a51804987f6ff06db2e3d675421989d.c8ugxo23bwey.ap-northeast-2.rds.amazonaws.com',
'port': 3306,
'user': 'dbmasteruser',
'passwd': '4(Be}Sm>hjxbgw0*<C+TeTC&*Y?p#[lZ',
'db': 'douban',
    'charset': 'utf8mb4',  # match the utf8mb4 charset of the tables
}
- Add pipelines configuration
ITEM_PIPELINES = {
'douban_books.pipelines.DoubanBookCommentAWSPipeline': 1,
}
This DoubanBookCommentAWSPipeline is the pipeline we develop next.
3.4. Development of DoubanBookCommentAWSPipeline
Because we have two different item types, process_item must distinguish between them:
from twisted.enterprise import adbapi

from douban_books.items import DoubanBooksItem, DoubanBookCommentItem
from douban_books.settings import DATABASE


class DoubanBookCommentAWSPipeline:
    def __init__(self):
        self.conn = adbapi.ConnectionPool('pymysql', **DATABASE)

    def do_insert_book(self, tx, item):
        # Run the book insert inside the transaction
        tx.execute(
            """insert into books (book_id, title, cover, is_try_read, author, publisher,
               publish_date, list_price, rating, ratings_count)
               values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""",
            (item['book_id'], item['title'], item['cover_link'], item['is_try_read'],
             item['author'], item['publisher'], item['publish_date'], item['price'],
             item['rating'], item['rating_count']))

    def do_insert_book_comment(self, tx, item):
        # Run the comment insert; note the target is book_comments, not books
        tx.execute(
            """insert into book_comments (book_id, book_name, username, rating,
               comment_time, useful_count, content)
               values (%s,%s,%s,%s,%s,%s,%s)""",
            (item['book_id'], item['book_name'], item['username'], item['rating'],
             item['comment_time'], item['useful_count'], item['content']))

    def process_item(self, item, spider):
        # Route each item type to its own insert
        if isinstance(item, DoubanBooksItem):
            query = self.conn.runInteraction(self.do_insert_book, item)
            query.addErrback(self.handle_error)
        elif isinstance(item, DoubanBookCommentItem):
            query = self.conn.runInteraction(self.do_insert_book_comment, item)
            query.addErrback(self.handle_error)
        return item

    def handle_error(self, failure):
        # Log errors raised by the asynchronous inserts
        print(failure)
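A side note on maintainability: each hand-written INSERT must track its table definition exactly, and any drift only fails at runtime. One option is to generate the statement from a single column list. A small refactoring sketch; the `build_insert` helper and `BOOK_COLUMNS` name are my own additions, not part of the original pipeline:

```python
def build_insert(table, columns):
    """Return a parameterized INSERT statement for the given table and columns."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"insert into {table} ({cols}) values ({placeholders})"


# Column list mirroring the books CREATE TABLE statement
BOOK_COLUMNS = ["book_id", "title", "cover", "is_try_read", "author",
                "publisher", "publish_date", "list_price", "rating", "ratings_count"]

# do_insert_book could then call:
#   tx.execute(build_insert("books", BOOK_COLUMNS), values_tuple)
```

Keeping the column lists next to the item definitions makes a schema change a one-place edit.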
3.5. Develop Douban book crawler
Here we crawl the books on the Top 250 list; the scraped data is ultimately written to Lightsail MySQL through the pipeline.
import re

import scrapy

from douban_books.items import DoubanBooksItem


class DoubanBookSpider(scrapy.Spider):
    name = 'douban_book_spider'
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        # Parse the book entries on the current page
        for book_tr in response.css('tr.item'):
            item = DoubanBooksItem()
            # Extract the book detail-page URL
            book_url = book_tr.css('div.pl2 > a::attr(href)').get()
            # The book ID is the last path segment of that URL
            item['book_id'] = book_url.split('/')[-2] if book_url else None
            item['title'] = book_tr.css('div.pl2 a::text').get().strip()
            item['cover_link'] = book_tr.css('td a.nbg img::attr(src)').get()
            item['is_try_read'] = "是" if book_tr.css('div.pl2 img[title="可试读"]') else "否"
            # Author, publisher, publish date and price share one slash-separated line
            details = book_tr.css('p.pl::text').get().strip().split(' / ')
            item['author'] = details[0]
            item['publisher'] = details[-3]
            item['publish_date'] = details[-2]
            item['price'] = details[-1]
            item['rating'] = book_tr.css('span.rating_nums::text').get()
            rating_count_text = book_tr.css('span.pl::text').get()
            item['rating_count'] = re.search(r'(\d+)人评价', rating_count_text).group(1) if rating_count_text else None
            yield item
        # Follow the "next page" link
        next_page = response.css('span.next a::attr(href)').get()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
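One caveat: the `p.pl` line does not always split into exactly four segments (translated books include a translator, and a few entries omit the price), so indexing the split list directly can mislabel fields. A more tolerant parsing sketch; the helper name and the padding behavior are my own assumptions, not part of the original spider:

```python
def parse_book_details(text):
    """Split Douban's 'author / publisher / date / price' line, treating
    everything before the last three segments as the author field."""
    parts = [p.strip() for p in text.split(' / ')]
    if len(parts) < 4:
        # Too few segments: keep positional order and pad the rest with None
        parts = parts + [None] * (4 - len(parts))
        return {'author': parts[0], 'publisher': parts[1],
                'publish_date': parts[2], 'price': parts[3]}
    return {
        'author': ' / '.join(parts[:-3]),  # may include translators
        'publisher': parts[-3],
        'publish_date': parts[-2],
        'price': parts[-1],
    }
```

With this helper, `item['author']` would come from `parse_book_details(...)['author']` instead of `details[0]`.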
Now we can run a test crawl with this spider:
scrapy crawl douban_book_spider
- Check through Navicat that the data was written correctly
3.6. Develop Douban book review crawler
Here we fetch the IDs of all crawled books from Lightsail MySQL, then collect the comments for each book.
import pymysql
import scrapy

from douban_books.items import DoubanBookCommentItem
from douban_books.settings import DATABASE


class DoubanBookCommentSpider(scrapy.Spider):
    name = 'douban_book_comment_spider'

    # Fetch the IDs of all crawled books from the Lightsail database
    db = pymysql.connect(**DATABASE)
    cursor = db.cursor()
    cursor.execute("SELECT book_id FROM books")
    book_ids = [row[0] for row in cursor.fetchall()]

    # Generate start_urls from the book IDs
    start_urls = [f'https://book.douban.com/subject/{book_id}/' for book_id in book_ids]

    def parse(self, response):
        self.logger.info(f"Parsing: {response.url}")
        # Extract the book title
        book_title = response.css('h1 span::text').get()
        # Construct the initial comments URL
        book_id = response.url.split("/")[4]
        comments_url = (f'https://book.douban.com/subject/{book_id}/comments/'
                        f'?start=0&limit=20&status=P&sort=new_score&comments_only=1')
        yield scrapy.Request(url=comments_url, callback=self.parse_comments,
                             meta={'book_title': book_title, 'book_id': book_id})

    def parse_comments(self, response):
        # The comments endpoint returns JSON whose 'html' field holds the markup
        html_content = response.json()['html']
        selector = scrapy.Selector(text=html_content)
        book_name = response.meta['book_title']
        book_id = response.meta['book_id']
        # Parse the individual comments
        for comment in selector.css('li.comment-item'):
            item = DoubanBookCommentItem()
            item['book_id'] = book_id
            item['book_name'] = book_name
            item['username'] = comment.css('a::attr(title)').get()
            item['rating'] = comment.css('.comment-info span.rating::attr(title)').get()
            item['comment_time'] = comment.css('span.comment-info > a.comment-time::text').get()
            item['useful_count'] = comment.css('span.vote-count::text').get()
            item['content'] = comment.css('span.short::text').get()
            yield item
        # Follow the "next page" link of the comments list
        base_url = f"https://book.douban.com/subject/{book_id}/comments/"
        next_page = selector.css('#paginator a[data-page="next"]::attr(href)').get()
        if next_page:
            next_page_url = base_url + next_page + '&comments_only=1'
            yield scrapy.Request(url=next_page_url, callback=self.parse_comments,
                                 meta={'book_title': book_name, 'book_id': book_id})
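Douban encodes a reviewer's star rating in the CSS class of the rating span (`allstar10` through `allstar50`), so when the `title` attribute is absent the class string can be parsed instead. A hedged sketch; the helper name is my own:

```python
import re


def parse_rating(rating_class):
    """Map a class string like 'user-stars allstar40 rating' to a 1-5 star count."""
    if not rating_class:
        return None
    m = re.search(r'allstar(\d)0', rating_class)
    return int(m.group(1)) if m else None
```

In the spider this would be used as a fallback: `item['rating'] = title or parse_rating(comment.css('span.rating::attr(class)').get())`.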
- Run the comment crawler
scrapy crawl douban_book_comment_spider
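Before running the spiders at scale, it is worth adding some politeness settings so Douban does not throttle or block the crawler. An illustrative settings.py fragment; the values are assumptions to tune for your own runs, not part of the original setup:

```python
# settings.py -- illustrative politeness settings, adjust to taste
DOWNLOAD_DELAY = 1                   # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
```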
This completes the development of the entire crawler service. The overall Lightsail MySQL experience was very good: we don't have to handle availability or performance tuning ourselves, AWS takes care of that kind of work for us, which greatly improves our productivity!