scrapy crawls Douban top250 and inserts it into the MySQL database (entry level)

  1. Install Python locally: download the version you want from the official website, and after installation verify it by entering python on the command line:
C:\Users\jrkj-kaifa08\PycharmProjects\fzreport>python
Python 3.7.8 (tags/v3.7.8:4b47a5b6ba, Jun 28 2020, 08:53:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

If you see output like the above, the installation was successful.

  2. Install the Scrapy framework by entering pip install scrapy on the command line. After installation, enter scrapy to verify:
C:\Users\jrkj-kaifa08\PycharmProjects\fzreport>scrapy
Scrapy 2.2.0 - project: fzreport

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

If you see output like the above, the installation was successful.

  3. In this tutorial the code is written in PyCharm. You can download the free Community edition from the official website; the installer needs no special configuration.

  4. Now to the main topic: creating the project.

    (1) Pick a folder on your computer to store your Python project, type cmd into the Explorer address bar of that folder, and press Enter.

This opens a command prompt at that path. Then enter scrapy startproject DoubanMovie to create a project called DoubanMovie:

d:\>scrapy startproject DoubanMovie
New Scrapy project 'DoubanMovie', using template directory 'c:\users\jrkj-kaifa08\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
    d:\DoubanMovie

You can start your first spider with:
    cd DoubanMovie
    scrapy genspider example example.com

Then execute cd DoubanMovie (the Tab key completes the path here, just like on Linux).

After entering the project folder, execute scrapy genspider douban movie.douban.com to create a spider scoped to movie.douban.com:

d:\>cd DoubanMovie
d:\DoubanMovie>scrapy genspider douban  movie.douban.com
Created spider 'douban' using template 'basic' in module:
  DoubanMovie.spiders.douban

(2) Open the created project in PyCharm and you will see the following structure:

(screenshot: the project structure in PyCharm)
First modify the configuration file settings.py:

ROBOTSTXT_OBEY = False

Change the original True to False. This setting controls whether the crawler obeys the site's robots.txt; for this learning exercise we set it to False.

ITEM_PIPELINES = {
    'DoubanMovie.pipelines.Dou': 300,
}

Also uncomment this block. The number is the pipeline's priority: items pass through pipelines in ascending order of this value (0 to 1000), so a lower value runs earlier, not faster.

(3) Then open the Douban Top250 movie page (https://movie.douban.com/top250).

Press F12 to open the developer tools; you can see that all the movie information is contained in the <ol class="grid_view"> tag.
(screenshot: the <ol class="grid_view"> element in the developer tools)

Because this is just a test, we only take four fields: ranking, movie name, rating, and number of ratings.
(screenshot: a single movie's li element)

You can see that each movie's information is contained in an li tag, so its XPath is

'//ol[@class="grid_view"]/li'

The ranking information is in the em tag, and its xpath is

'.//div[@class="pic"]/em/text()'

The movie name information is in the first span under the a tag, and its xpath is

'.//div[@class="hd"]/a/span[1]/text()'

The movie rating is in the span with class="rating_num" under the div with class="star", and its XPath is

'.//div[@class="star"]/span[@class="rating_num"]/text()'

The number of ratings is in a span under the same div with class="star", and its XPath is

'.//div[@class="star"]/span/text()'

Then you can start writing items.py

import scrapy

class DouItem(scrapy.Item):
    # define the fields for your item here like:
    # ranking
    ranking = scrapy.Field()
    # movie name
    movie_name = scrapy.Field()
    # score
    score = scrapy.Field()
    # number of ratings
    score_num = scrapy.Field()

As I understand it, this class is similar to a Java bean (POJO): it just declares the fields. When you open the items.py file there is an example comment at the top; simply follow it.

After that comes the main part, douban.py:

# -*- coding: utf-8 -*-
import scrapy
from DoubanMovie.items import DouItem


class DouSpider(scrapy.Spider):
    name = 'douban'

    # allowed_domains = ['movie.douban.com']
    def start_requests(self):
        start_urls = 'https://movie.douban.com/top250'
        yield scrapy.Request(start_urls)

    def parse(self, response):
        item = DouItem()
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]  # Selector also has a .re() method
            print(item['movie_name'], "------------------------")
            yield item
        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield scrapy.Request(next_url)

Then comes pipelines.py, which connects to the database.
Here you need to install the pymysql package; run on the command line:

pip install pymysql

Next, create a table: open the MySQL database and enter the following statement. Because this is just a test, all data types are set to varchar.

create table movieTable(
	ranking				varchar(5),
	movie_name			varchar(100),
	score				varchar(10),
	score_num			varchar(10)
)

Next is the code of pipelines.py

import pymysql
import pymysql.cursors


class Dou(object):

    def __init__(self):
        # Connect to the MySQL database
        self.connect = pymysql.connect(host='localhost', user='zds', password='zds', db='zds', port=3306)
        self.cursor = self.connect.cursor()
        print("______________database connection established")

    def process_item(self, item, spider):
        # Write the item into the database
        print("--------------inserting data")
        # Use a parameterized query; building the SQL with str.format() would
        # break on movie names that contain quotes
        self.cursor.execute(
            'insert into movieTable(ranking, movie_name, score, score_num) values (%s, %s, %s, %s)',
            (item['ranking'], item['movie_name'], item['score'], item['score_num']))
        self.connect.commit()
        return item

    # Close the database connection
    def close_spider(self, spider):
        print("============closing database connection")
        self.cursor.close()
        self.connect.close()
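If you don't have MySQL handy, the same process_item logic can be exercised with the standard-library sqlite3 module as a stand-in for pymysql (note that sqlite3 uses ? placeholders where pymysql uses %s; the table and columns follow the CREATE TABLE above):

```python
import sqlite3

# In-memory stand-in for the MySQL table above (sqlite3 is in the standard library)
connect = sqlite3.connect(':memory:')
cursor = connect.cursor()
cursor.execute(
    'create table movieTable('
    ' ranking varchar(5), movie_name varchar(100),'
    ' score varchar(10), score_num varchar(10))')

# One sample item, shaped like what the spider yields
item = {'ranking': '1', 'movie_name': '肖申克的救赎', 'score': '9.7', 'score_num': '2000000'}

# Parameterized insert -- sqlite3 uses ? where pymysql uses %s
cursor.execute(
    'insert into movieTable(ranking, movie_name, score, score_num) values (?, ?, ?, ?)',
    (item['ranking'], item['movie_name'], item['score'], item['score_num']))
connect.commit()

rows = cursor.execute('select * from movieTable').fetchall()
print(rows)
```

This lets you verify the insert statement and field mapping before pointing the pipeline at a real MySQL instance.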

After all the code is written and saved, run on the command line:

C:\Users\jrkj-kaifa08\PycharmProjects\DoubanMovie>scrapy crawl douban

After it finishes without errors, check the database and you will find the data has been inserted.
(screenshot: the inserted rows in MySQL)
If you want to store the data in a file instead, you can run:

C:\Users\jrkj-kaifa08\PycharmProjects\DoubanMovie>scrapy crawl douban -o douban.csv

The -o flag stores the scraped data in the douban.csv file under the current path.
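Alternatively, the same export can be configured once in settings.py via the FEEDS setting (available since Scrapy 2.1, so it works with the 2.2.0 version shown above); the file name here is just an example:

```python
# settings.py -- equivalent to passing -o on the command line
FEEDS = {
    'douban.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}
```

With this in place, a plain scrapy crawl douban writes the CSV on every run.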

Origin blog.csdn.net/qq_37823979/article/details/107201422