Scrapy crawls the Douban Top 250 and inserts the results into a MySQL database (entry level)
- Install Python locally: download the version you want from the official website, then verify the installation by entering python on the command line:
C:\Users\jrkj-kaifa08\PycharmProjects\fzreport>python
Python 3.7.8 (tags/v3.7.8:4b47a5b6ba, Jun 28 2020, 08:53:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
If you see output like the above, the installation succeeded.
- To install the Scrapy framework, run pip install scrapy on the command line. After installation, enter scrapy to verify:
C:\Users\jrkj-kaifa08\PycharmProjects\fzreport>scrapy
Scrapy 2.2.0 - project: fzreport
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
If you see output like the above, the installation succeeded.
- In this walkthrough the code is written in PyCharm; you can download the free Community Edition from the official website, and the installation is straightforward.
Now to the main topic: creating the project.
(1) Pick a folder on your computer to hold your Python project, type cmd into the address bar of that folder, and press Enter.
This opens a command prompt at that path. Then run scrapy startproject DoubanMovie (this creates a project called DoubanMovie):
d:\>scrapy startproject DoubanMovie
New Scrapy project 'DoubanMovie', using template directory 'c:\users\jrkj-kaifa08\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
d:\DoubanMovie
You can start your first spider with:
cd DoubanMovie
scrapy genspider example example.com
Then run cd DoubanMovie (as on Linux, you can use the Tab key for completion).
Once inside the project folder, run scrapy genspider douban movie.douban.com (this generates a spider scoped to movie.douban.com):
d:\>cd DoubanMovie
d:\DoubanMovie>scrapy genspider douban movie.douban.com
Created spider 'douban' using template 'basic' in module:
DoubanMovie.spiders.douban
(2) Open the created project through PyCharm and the following structure will appear:
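For reference, the layout generated by Scrapy's default project template typically looks like this (names per the standard template; your view in PyCharm should match):

```
DoubanMovie/
    scrapy.cfg            # deployment configuration
    DoubanMovie/
        __init__.py
        items.py          # item (field) definitions
        middlewares.py
        pipelines.py      # item pipelines (database writing goes here)
        settings.py       # project settings
        spiders/
            __init__.py
            douban.py     # the spider generated above
```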
First, modify the configuration file settings.py:
ROBOTSTXT_OBEY = False
Change the default True to False (this controls whether Scrapy obeys the site's robots.txt; for learning purposes we set it to False).
ITEM_PIPELINES = {
    'DoubanMovie.pipelines.Dou': 300,
}
Also uncomment this block. The number (0-1000) sets the pipeline's order, not its speed: when several pipelines are enabled, lower values run earlier.
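One practical note: Douban tends to reject requests that carry Scrapy's default user agent, so if the crawl later returns 403 responses, adding a browser-like USER_AGENT to settings.py usually fixes it (the exact string below is only an example, not a requirement):

```python
# settings.py -- example only; any reasonably current browser UA string works
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/83.0.4103.116 Safari/537.36')
```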
(3) Now open the Douban Top 250 page in a browser.
Press F12 to open the developer tools; you can see that all the movie entries are contained in the <ol class="grid_view"> tag.
Since this is just a test, we only take four fields: ranking, movie name, rating, and number of raters.
You can see that each movie's information is contained in an li tag, so its XPath is
'//ol[@class="grid_view"]/li'
The ranking information is in the em tag, and its xpath is
'.//div[@class="pic"]/em/text()'
The movie name information is in the first span under the a tag, and its xpath is
'.//div[@class="hd"]/a/span[1]/text()'
The movie rating information is in span class="rating_num" under the div class=star tag, and its xpath is
'.//div[@class="star"]/span[@class="rating_num"]/text()'
The rating-count information sits in one of the spans under the div class="star" tag; several spans share that parent, so the spider below narrows it down with a regular expression. Its xpath is
'.//div[@class="star"]/span/text()'
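You can sanity-check these XPath expressions outside Scrapy. The sketch below uses lxml (Scrapy's own selectors are built on the compatible parsel library) against a hand-written snippet that mimics Douban's markup; the HTML here is a simplified stand-in, not the real page:

```python
from lxml import html

# A simplified stand-in for one entry of Douban's Top 250 markup.
SNIPPET = """
<ol class="grid_view">
  <li>
    <div class="pic"><em>1</em></div>
    <div class="hd"><a href="#"><span>The Shawshank Redemption</span></a></div>
    <div class="star">
      <span class="rating_num">9.7</span>
      <span>2000000 people rated</span>
    </div>
  </li>
</ol>
"""

doc = html.fromstring(SNIPPET)
movies = doc.xpath('//ol[@class="grid_view"]/li')
for movie in movies:
    ranking = movie.xpath('.//div[@class="pic"]/em/text()')[0]
    name = movie.xpath('.//div[@class="hd"]/a/span[1]/text()')[0]
    score = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()')[0]
    print(ranking, name, score)
```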
Then you can start writing items.py
import scrapy


class DouItem(scrapy.Item):
    # define the fields for your item here like:
    # ranking
    ranking = scrapy.Field()
    # movie name
    movie_name = scrapy.Field()
    # rating
    score = scrapy.Field()
    # number of raters
    score_num = scrapy.Field()
My own understanding is that an Item is roughly analogous to a bean/POJO class in Java: it just declares the fields. When you open items.py there is an example in the comments; just follow it.
Next comes the main part, douban.py:
# -*- coding: utf-8 -*-
import scrapy

from DoubanMovie.items import DouItem


class DouSpider(scrapy.Spider):
    name = 'douban'
    # allowed_domains = ['movie.douban.com']

    def start_requests(self):
        start_url = 'https://movie.douban.com/top250'
        yield scrapy.Request(start_url)

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            item = DouItem()  # create a fresh item for every movie
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            # Selectors also have a .re() method for regex extraction
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            print(item['movie_name'], "------------------------")
            yield item

        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield scrapy.Request(next_url)
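A detail worth noting: the next-page href comes back relative (something like ?start=25&filter=), which is why the spider prepends the base URL. A more robust alternative is response.urljoin(next_url[0]); the snippet below shows the equivalent stdlib behavior (the example href is an assumption about the page's markup):

```python
from urllib.parse import urljoin

base = 'https://movie.douban.com/top250'
href = '?start=25&filter='  # typical relative href from the "next" link

# urljoin resolves the relative href against the base URL,
# which is what response.urljoin() does inside Scrapy.
next_page = urljoin(base, href)
print(next_page)  # https://movie.douban.com/top250?start=25&filter=
```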
Then comes pipelines.py, which connects to the database.
You need the pymysql package here; install it on the command line:
pip install pymysql
Next, create a table: open the mysql database and enter the following statement. Because this is just a test, all columns are declared as varchar.
create table movieTable(
    ranking varchar(5),
    movie_name varchar(100),
    score varchar(10),
    score_num varchar(10)
);
Next is the code of pipelines.py
import pymysql


class Dou(object):
    def __init__(self):
        # Connect to the MySQL database
        self.connect = pymysql.connect(host='localhost', user='zds', password='zds', db='zds', port=3306)
        self.cursor = self.connect.cursor()
        print("______________database connection established")

    def process_item(self, item, spider):
        # Write the item into the database.
        # Use placeholders instead of string formatting to avoid SQL injection.
        print("--------------inserting data")
        self.cursor.execute(
            'insert into movieTable(ranking, movie_name, score, score_num) values (%s, %s, %s, %s)',
            (item['ranking'], item['movie_name'], item['score'], item['score_num']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        # Close the database connection
        print("============closing database connection")
        self.cursor.close()
        self.connect.close()
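The same DB-API insert pattern can be tried without a running MySQL server. The sketch below uses Python's built-in sqlite3 as a stand-in for pymysql (sqlite3 uses ? placeholders where pymysql uses %s), and the sample row is made up:

```python
import sqlite3

# In-memory database as a stand-in for the MySQL instance above.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table movieTable(ranking text, movie_name text, score text, score_num text)')

item = {'ranking': '1', 'movie_name': 'The Shawshank Redemption',
        'score': '9.7', 'score_num': '2000000'}

# Parameterized insert: the driver handles quoting and escaping.
cur.execute('insert into movieTable(ranking, movie_name, score, score_num) values (?, ?, ?, ?)',
            (item['ranking'], item['movie_name'], item['score'], item['score_num']))
conn.commit()

rows = cur.execute('select * from movieTable').fetchall()
print(rows)
```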
Once all the code is written and saved, run it on the command line:
C:\Users\jrkj-kaifa08\PycharmProjects\DoubanMovie>scrapy crawl douban
After the run completes without errors, check the database; you will find the data has been inserted.
If you want to store the data in a file instead, you can run:
C:\Users\jrkj-kaifa08\PycharmProjects\DoubanMovie>scrapy crawl douban -o douban.csv
In this case, the data will be written to the douban.csv file under the current path.