Storing Scraped Data in MySQL with Scrapy on Python 3.x


Douban Top 250 movies: https://movie.douban.com/top250

Note: this assumes Python and the rest of the required environment are already installed.

1. Install Scrapy

pip install scrapy

If you get the error "No module named 'win32api'", fix it as follows:

pip install pypiwin32

pip install scrapy
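
The pipeline code in step 8 imports pymysql to talk to MySQL, so if that driver is not already present, install it the same way:

pip install pymysql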

2. Create the project (cd into the directory where you want the project to live)

scrapy startproject [project name]
For example: scrapy startproject douban_scrapy

3. Create the spider (run this inside the project directory you just created)

scrapy genspider [spider name] "[domain]"
For example: scrapy genspider douban_spider "douban.com"

4. Open the newly created project in an IDE (I use PyCharm)

douban_spider.py is where the spider code will be written.

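If everything went well, the generated project should look roughly like this (douban_spider.py is the file created by genspider in step 3):

douban_scrapy/
    scrapy.cfg
    douban_scrapy/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban_spider.py
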
5. Open settings.py and make a few simple changes

ROBOTSTXT_OBEY = False   # do not obey robots.txt
DOWNLOAD_DELAY = 3       # wait 3 seconds between requests

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}

ITEM_PIPELINES = {
    # Scraped items are handed to pipelines.py for storage; 300 is the pipeline's priority
    'douban_scrapy.pipelines.DoubanScrapyPipeline': 300,
}

6. Define the fields to scrape in items.py

import scrapy


class DoubanScrapyItem(scrapy.Item):

    title = scrapy.Field()   # title
    daoyan = scrapy.Field()  # director
    bianju = scrapy.Field()  # screenwriter
    zhuyan = scrapy.Field()  # starring
    type = scrapy.Field()    # genre
    time = scrapy.Field()    # runtime

7. Write the spider code in douban_spider.py

First, change start_urls to the URL you want to crawl, e.g. start_urls = ['https://movie.douban.com/top250']. The full spider:

import scrapy
from douban_scrapy.items import DoubanScrapyItem

class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Links to every movie's detail page on the current list page
        urls = response.xpath('//ol[@class="grid_view"]/li//div[@class="pic"]/a/@href').getall()
        print(len(urls))
        for url in urls:
            yield scrapy.Request(
                url=url,
                callback=self.detail
            )
        # The last page has no "next" link, so check for None before building the URL
        next_urls = response.xpath('//span[@class="next"]/a/@href').get()
        if next_urls:
            next_url = "https://movie.douban.com/top250" + next_urls
            print("next page:", next_url)
            yield scrapy.Request(
                url=next_url,
                callback=self.parse
            )

    def detail(self,response):
        item = DoubanScrapyItem()
        item["title"] = response.xpath('//div[@id="content"]/h1/span[1]/text()').get() #标题
        item["daoyan"] = response.xpath('//div[@id="info"]/span[1]/span[@class="attrs"]/a/text()').get() #导演
        item["bianju"] = "".join(response.xpath('//div[@id="info"]/span[2]/span[@class="attrs"]//a/text()').getall()) #导演
        item["zhuyan"] = "".join(response.xpath('//div[@id="info"]/span[3]/span[@class="attrs"]//text()').getall()) #导演
        item["type"] = "".join(response.xpath('//div[@id="info"]//span[@property="v:genre"]//text()').getall()) #类型
        item["time"] = response.xpath('//div[@id="info"]//span[@property="v:runtime"]/text()').get() #时长
        yield item
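
If one of these XPath expressions comes back empty, it is easiest to test it interactively with scrapy shell before running the whole spider (run it from inside the project directory so the project settings are used). For example:

scrapy shell "https://movie.douban.com/top250"
# then, inside the shell:
response.xpath('//ol[@class="grid_view"]/li//div[@class="pic"]/a/@href').getall()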

8. Write the pipelines.py code

import pymysql

class DoubanScrapyPipeline(object):

    def __init__(self):
        username = "root"
        password = "root"
        dbname = "python_test"
        host = "localhost"
        # Use keyword arguments; recent versions of pymysql no longer accept positional ones
        self.db = pymysql.connect(host=host, user=username, password=password, database=dbname, charset="utf8")

    def open_spider(self,spider):
        # Recreate the table on every run
        cursor = self.db.cursor()
        cursor.execute("drop table if exists test1")
        sql = """
            create table test1(
            id int primary key auto_increment,
            title varchar(255),
            daoyan varchar(255),
            bianju varchar(255),
            zhuyan text,
            type varchar(255),
            time varchar(255)
            )character set utf8
        """
        cursor.execute(sql)

    def process_item(self, item, spider):
        try:
            cursor = self.db.cursor()
            value = (item["title"], item["daoyan"], item["bianju"], item["zhuyan"], item["type"], item["time"])
            sql = "insert into test1(title,daoyan,bianju,zhuyan,type,time) value (%s,%s,%s,%s,%s,%s)"
            cursor.execute(sql,value)
            self.db.commit()
        except Exception as e:
            self.db.rollback()
            print("存储失败",e)
        return item

    def close_spider(self,spider):
        self.db.close()
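
After a crawl finishes, you can quickly check that the rows actually made it into MySQL. A minimal sketch, assuming the same local credentials and database as the pipeline above:

import pymysql

# Assumes the same MySQL credentials, database, and table as the pipeline
db = pymysql.connect(host="localhost", user="root", password="root", database="python_test", charset="utf8")
cursor = db.cursor()
cursor.execute("select count(*) from test1")
print("rows stored:", cursor.fetchone()[0])  # roughly 250 if every detail page was parsed
db.close()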


9. Create a start.py file

from scrapy import cmdline

cmdline.execute("scrapy crawl douban_spider".split())

10. Right-click start.py and choose "Run Start" to run the spider

Steps 9 and 10 are optional: you can also cd into the project directory and run scrapy crawl douban_spider there directly.
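
For example, from the directory that contains scrapy.cfg:

cd douban_scrapy
scrapy crawl douban_spider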

If this post helped you, please give it a like. Your encouragement keeps me going!


Reposted from blog.csdn.net/weixin_45167444/article/details/108636658