"Amazon Cloud Technology Product Review" event call for papers|Using scrapy to capture Douban book data

"Amazon Cloud Technology Product Review" event call for papers|Using scrapy to capture Douban book data

Authorization statement: This article grants Amazon Cloud Technology the right to forward and adapt it through official channels, including but not limited to the Developer Centre, Zhihu, self-media platforms, and third-party developer media.

Background

Hey friends, today I want to share an interesting and practical project with you: building a Douban Books crawler system with Scrapy in the AWS Lightsail environment!

I have been studying Python crawlers recently. The company needs data from many sources, and that data has to be secure and highly available. If we built everything from scratch ourselves, just getting the database right would take a lot of time and could still leave our data at risk, so I had been considering a cloud database. I then noticed that Lightsail on AWS can deploy a highly available database with one click, which is very efficient, and there are ready-made plans to order. Since Lightsail can be used free of charge for the first three months, I simply had to try it!

With a high-performance, highly available database instance, crawling becomes much more efficient, and distributed, concurrent crawling can raise throughput even further.

Next, I will walk you through the whole process, from setting up the basic environment, to developing the Scrapy project, to deploying on AWS Lightsail. Once this system is deployed, you will be able to collect large amounts of book data easily and efficiently!

Introduction to Lightsail

Amazon Lightsail is a cloud computing service from AWS aimed at developers and small businesses. It simplifies deploying and managing virtual private servers on the AWS cloud platform.

Lightsail provides preconfigured virtual server instances, similar to traditional VPS rental plans, but fully managed on the AWS cloud, which gives developers greater flexibility and security. Using Lightsail, we can complete the following tasks with a few mouse clicks:

  • Select the hardware configuration; several CPU/memory combinations are currently supported, with memory ranging from 0.5 GB to 8 GB.
  • Create a public IP address or set network access rules.
  • Select the operating system, supporting commonly used Linux distributions and Windows systems.
  • Click "Start Instance" to quickly obtain virtual host resources.

In addition, Lightsail uses a simple pricing model: a fixed monthly fee that bundles compute and a data transfer allowance. For individuals and small projects, the barrier to entry is low and the cost efficiency is high.

Insert image description here

Crawling Douban book data based on Lightsail

Since this is a hands-on review, the service deserves a proper workout. Here I start a Lightsail database instance to store Douban Books data, use the Scrapy framework to crawl the Top 250 books and their comments, and then write everything into the Lightsail database instance. Follow my steps and experience it for yourself~

1. Lightsail database instance construction

The console address is: https://lightsail.aws.amazon.com/ls/webapp/home/databases

Insert image description here
Click create database

Insert image description here

Insert image description here

Here we can use the minimum configuration. My configuration is as follows:

  • Version: MySQL 8.0.35
  • Type: Standard
  • Architecture: single node
  • Compute: 1 GB RAM, 2 vCPUs

Unlike the old way of downloading and installing MySQL on a server ourselves, buying the instance directly gives us a high-performance cloud database automatically, which is very convenient!
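
If you prefer the command line over the console, roughly the same thing can be done with the AWS CLI. This is only a sketch: the database name, blueprint ID and bundle ID below are my own illustrative values, and the real IDs can be listed with the two get-* commands first.

# List the available MySQL versions and instance sizes
aws lightsail get-relational-database-blueprints
aws lightsail get-relational-database-bundles

# Create a single-node MySQL 8 database (names and IDs below are illustrative)
aws lightsail create-relational-database \
  --relational-database-name douban-db \
  --relational-database-blueprint-id mysql_8_0 \
  --relational-database-bundle-id micro_2_0 \
  --master-database-name douban \
  --master-username dbmasteruser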

2. Connect to the instance from a client and build the data tables

For a client to connect to the database instance, we first need to enable the instance's public mode; external connections are disabled by default.

Insert image description here

This is the enabled state. We can now connect from the development environment using this host and port together with the account and password chosen at purchase time.
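
For reference, the same switch can also be flipped from the AWS CLI (a sketch; the database name is whatever you chose when creating the instance):

# Enable public mode so clients outside Lightsail can connect
aws lightsail update-relational-database \
  --relational-database-name douban-db \
  --publicly-accessible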

Insert image description here

We connect through Navicat and prepare the database required by the crawler service.
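
If you do not have Navicat at hand, the stock mysql client works just as well (a sketch; replace the endpoint placeholder with the one shown on your instance page):

mysql -h ls-xxxx.ap-northeast-2.rds.amazonaws.com -P 3306 -u dbmasteruser -p

-- inside the mysql shell: create the database used by the crawler
CREATE DATABASE douban DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;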

  • Create a new database douban
    Insert image description here

  • Create book table and book_comment table

CREATE TABLE `book_comments` (
  `id` int NOT NULL AUTO_INCREMENT,
  `book_id` int NOT NULL,
  `book_name` varchar(100) NOT NULL,
  `username` varchar(30) NOT NULL,
  `rating` int DEFAULT NULL,
  `comment_time` datetime DEFAULT CURRENT_TIMESTAMP,
  `useful_count` int DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `books` (
  `id` int NOT NULL AUTO_INCREMENT,
  `book_id` int DEFAULT NULL,
  `title` varchar(100) NOT NULL,
  `cover` varchar(200) DEFAULT NULL,
  `is_try_read` varchar(1) DEFAULT NULL,
  `author` varchar(30) NOT NULL,
  `publisher` varchar(50) DEFAULT NULL,
  `publish_date` varchar(30) DEFAULT NULL,
  `list_price` varchar(30) DEFAULT NULL,
  `rating` float DEFAULT NULL,
  `ratings_count` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Insert image description here

Click SQL to open a query window and run the statements above. Once the basic data tables are in place, we can move on to the development environment; AWS has already taken care of most of the database parameters and tuning for us, so we only need to use it.

3. Build a crawler project

3.1. Basic scrapy framework construction
  • Install scrapy
pip install scrapy 
  • Create scrapy project
scrapy startproject <project_name>
  • Create a book crawler and a review crawler (concrete commands for this project are shown right after this list)
scrapy genspider <spider_name> <domain>
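
For this project the commands look roughly like this. The project name matches the douban_books package imported later, and the spider names match the name attributes in the spider code; the domain argument is my own assumption.

scrapy startproject douban_books
cd douban_books
scrapy genspider douban_book_spider book.douban.com
scrapy genspider douban_book_comment_spider book.douban.com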

At this time, the entire project framework is as shown below:

Insert image description here
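
Since the screenshot does not reproduce here, the layout is roughly the standard Scrapy scaffolding plus the two spider files we are about to write (a sketch, not an exact listing of the original tree):

douban_books/
├── scrapy.cfg
└── douban_books/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── book_spider.py
        └── book_comment_spider.py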

  • Dependency management

In order to facilitate dependency management and project migration, I defined a requirements.txt file with the following content:

scrapy
pymysql
  • Install dependencies
pip install -r requirements.txt

scrapy is our crawler framework; pymysql is used to connect to our Lightsail MySQL database.

This completes the most basic framework construction, and then we mainly develop and modify the following files:

  • book_spider.py book crawler (needs development)
  • book_comment_spider.py book review crawler (needs development)
  • items.py ORM file (needs development)
  • pipelines.py pipeline file, used to store data into AWS database (requires development)
  • settings.py configuration file (needs to be modified)

We first integrate Lightsail MySQL into the project.

3.2. Build ORM based on data table

Add the two model mappings to items.py; the code is as follows:


import scrapy


class DoubanBooksItem(scrapy.Item):
    book_id = scrapy.Field()
    title = scrapy.Field()
    cover_link = scrapy.Field()
    is_try_read = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    publish_date = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
    rating_count = scrapy.Field()


class DoubanBookCommentItem(scrapy.Item):
    book_id = scrapy.Field()
    book_name = scrapy.Field()
    username = scrapy.Field()
    rating = scrapy.Field()
    comment_time = scrapy.Field()
    useful_count = scrapy.Field()
    content = scrapy.Field()

3.3. Integrate Lightsail MySQL
  • Add database configuration

Insert image description here

Add DATABASE configuration at the end of the file:

DATABASE = {
   'host': 'ls-f78481475a51804987f6ff06db2e3d675421989d.c8ugxo23bwey.ap-northeast-2.rds.amazonaws.com',
   'port': 3306,
   'user': 'dbmasteruser',
   'passwd': '4(Be}Sm>hjxbgw0*<C+TeTC&*Y?p#[lZ',
   'db': 'douban',
   'charset': 'utf8',
}
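
Hard-coding the endpoint and password in settings.py is fine for a quick experiment, but since these are live credentials, a safer sketch is to read them from environment variables (the variable names below are my own choice; I also use utf8mb4 here to match the table definitions):

import os

DATABASE = {
    'host': os.environ.get('LIGHTSAIL_DB_HOST', ''),
    'port': int(os.environ.get('LIGHTSAIL_DB_PORT', '3306')),
    'user': os.environ.get('LIGHTSAIL_DB_USER', 'dbmasteruser'),
    'passwd': os.environ.get('LIGHTSAIL_DB_PASSWORD', ''),
    'db': os.environ.get('LIGHTSAIL_DB_NAME', 'douban'),
    'charset': 'utf8mb4',
}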
  • Add pipelines configuration

Insert image description here

ITEM_PIPELINES = {
    'douban_books.pipelines.DoubanBookCommentAWSPipeline': 1,
}

This DoubanBookCommentAWSPipeline is the pipeline we develop next.
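
Not shown in the original post, but worth mentioning: Douban tends to reject requests that use Scrapy's default user agent, so I would also consider adjusting a few standard Scrapy settings along these lines (a sketch; the exact values are up to you):

# settings.py (optional, additional tweaks)
ROBOTSTXT_OBEY = False          # Douban's robots.txt may block the crawl otherwise
DOWNLOAD_DELAY = 2              # be gentle: wait between requests
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # a browser-like user agent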

3.4. Development of DoubanBookCommentAWSPipeline

Because we have two different item types, we distinguish between them in process_item:

from twisted.enterprise import adbapi

from douban_books.items import DoubanBooksItem, DoubanBookCommentItem
from douban_books.settings import DATABASE

class DoubanBookCommentAWSPipeline:

    def __init__(self):
        self.conn = adbapi.ConnectionPool('pymysql', **DATABASE)

    def do_insert_book(self, tx, item):
        # Insert one book record
        tx.execute("""insert into books (book_id, title, cover, is_try_read, author, publisher, publish_date, list_price, rating, ratings_count) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""", (item['book_id'], item['title'], item['cover_link'], item['is_try_read'], item['author'], item['publisher'], item['publish_date'], item['price'], item['rating'], item['rating_count']))

    def do_insert_book_comment(self, tx, item):
        # Insert one comment record (note: the target table is book_comments, not books)
        tx.execute("""insert into book_comments (book_id, book_name, username, rating, comment_time, useful_count, content) values (%s,%s,%s,%s,%s,%s,%s)""", (item['book_id'], item['book_name'], item['username'], item['rating'], item['comment_time'], item['useful_count'], item['content']))

    def process_item(self, item, spider):
        if isinstance(item, DoubanBooksItem):
            print('Start writing book')
            query = self.conn.runInteraction(self.do_insert_book, item)
            query.addErrback(self.handle_error)
        elif isinstance(item, DoubanBookCommentItem):
            print(item)
            print('Start writing book comment')
            query = self.conn.runInteraction(self.do_insert_book_comment, item)
            query.addErrback(self.handle_error)

        return item

    def handle_error(self, failure):
        # Handle errors from asynchronous inserts
        print(failure)
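
One small addition I would consider (it is not in the original pipeline): closing the Twisted connection pool when the spider finishes, using Scrapy's standard close_spider hook inside the same pipeline class:

    def close_spider(self, spider):
        # Called by Scrapy when the spider closes; release the adbapi connection pool
        self.conn.close()
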
3.5. Develop Douban book crawler

Here we mainly crawl the Top 250 books; the crawled data is finally written into Lightsail MySQL through the pipeline.

import scrapy
from douban_books.items import DoubanBooksItem
import re


class DoubanBookSpider(scrapy.Spider):
    name = 'douban_book_spider'
    start_urls = ['https://book.douban.com/top250']

    def parse(self, response):
        # Parse the book entries on this page
        for book_tr in response.css('tr.item'):
            item = DoubanBooksItem()
            # Extract the book URL
            book_url = book_tr.css('div.pl2 > a::attr(href)').get()
            # Extract the book ID from the URL
            item['book_id'] = book_url.split('/')[-2] if book_url else None

            item['title'] = book_tr.css('div.pl2 a::text').get().strip()
            item['cover_link'] = book_tr.css('td a.nbg img::attr(src)').get()
            item['is_try_read'] = "是" if book_tr.css('div.pl2 img[title="可试读"]') else "否"

            # Extract the author, publisher, publish date and price
            details = book_tr.css('p.pl::text').get().strip().split(' / ')
            item['author'] = details[0]
            item['publisher'] = details[-3]
            item['publish_date'] = details[-2]
            item['price'] = details[-1]

            item['rating'] = book_tr.css('span.rating_nums::text').get()
            rating_count_text = book_tr.css('span.pl::text').get()
            item['rating_count'] = re.search(r'(\d+)人评价', rating_count_text).group(1) if rating_count_text else None
            yield item

        # Handle pagination
        next_page = response.css('span.next a::attr(href)').get()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

Now we can run a test crawl with this crawler:

scrapy crawl douban_book_spider

Insert image description here

  • Check through Navicat whether the data has been written correctly

Insert image description here
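
If you would rather check from the SQL window or the mysql client instead of Navicat, a couple of quick sanity queries (a sketch) will do:

-- how many of the Top 250 books have landed in the table so far
SELECT COUNT(*) FROM books;

-- eyeball a few rows
SELECT book_id, title, author, rating FROM books ORDER BY rating DESC LIMIT 10;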

3.6. Develop Douban book review crawler

Here we need to read the IDs of all crawled books from Lightsail MySQL and then fetch the comments for each book.

import pymysql
import scrapy
from douban_books.items import DoubanBookCommentItem
from douban_books.settings import DATABASE


class DoubanBookCommentSpider(scrapy.Spider):
    name = 'douban_book_comment_spider'

    # Read the ids of all crawled books from the Lightsail MySQL instance
    db = pymysql.connect(**DATABASE)
    cursor = db.cursor()
    cursor.execute("SELECT book_id FROM books")
    results = cursor.fetchall()

    # Extract the book_id values and build one start URL per book
    book_ids = [result[0] for result in results]
    start_urls = [f'https://book.douban.com/subject/{book_id}/' for book_id in book_ids]

    def parse(self, response):
        self.logger.info(f"Parsing: {response.url}")

        # Extract the book title
        book_title = response.css('h1 span::text').get()
        print(book_title)

        # Construct the initial comments URL
        book_id = response.url.split("/")[4]
        comments_url = f'https://book.douban.com/subject/{book_id}/comments/?start=0&limit=20&status=P&sort=new_score&comments_only=1'
        print(comments_url)
        yield scrapy.Request(url=comments_url, callback=self.parse_comments, meta={'book_title': book_title, 'book_id': book_id})

    def parse_comments(self, response):
        # The comments endpoint returns JSON; the page fragment sits in its 'html' field
        html_content = response.json()['html']
        selector = scrapy.Selector(text=html_content)
        book_name = response.meta['book_title']
        book_id = response.meta['book_id']

        # Parse every comment on this page
        for comment in selector.css('li.comment-item'):
            item = DoubanBookCommentItem()
            item['book_id'] = book_id
            item['book_name'] = book_name
            item['username'] = comment.css('a::attr(title)').get()
            item['rating'] = comment.css('.comment-info span.rating::attr(title)').get()
            item['comment_time'] = comment.css('span.comment-info > a.comment-time::text').get()
            item['useful_count'] = comment.css('span.vote-count::text').get()
            item['content'] = comment.css('span.short::text').get()
            yield item

        # Follow the "next" link of the comments pagination
        base_url = f"https://book.douban.com/subject/{book_id}/comments/"
        next_page = selector.css('#paginator a[data-page="next"]::attr(href)').get()
        if next_page:
            next_page_url = base_url + next_page + '&comments_only=1'
            yield scrapy.Request(url=next_page_url, callback=self.parse_comments, meta={'book_title': book_name, 'book_id': book_id})
  • Running a comment crawler
scrapy crawl douban_book_comment_spider

Insert image description here
This completes the development of the whole crawler service. The overall Lightsail MySQL experience has been very good: we do not need to worry about availability or performance ourselves, because AWS handles that kind of work for us, which greatly improves our efficiency!

Origin blog.csdn.net/2301_79448738/article/details/134527600