Scrapy: scraping Dangdang book information and storing it in MySQL

MySQL is used here; it is fairly beginner-friendly.

There are of course other databases as well:

  • redis, mongodb (non-relational / NoSQL databases)
  • influxdb (a time-series database), usually used in monitoring stacks; the single-node version is free, worth a look

Enough preamble, let's get to the main topic.

1. First, create the Scrapy project

scrapy startproject dangdang

2. Create a spider (template options include basic and crawl; basic is used here)

scrapy genspider -t basic dd dangdang.com

3. Get to know the relevant project files


items.py   defines the item containers; they are filled in dd.py and handed to pipelines.py for processing

settings.py   project-wide settings, e.g. the user agent, the item pipelines, and so on

middlewares.py   the downloader/spider middlewares


The usual editing order is items -> spider (dd.py) -> pipelines -> settings (adjust to personal preference).
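
After the two commands above, the generated project looks roughly like this (dd.py is the spider created by genspider):

dangdang/
    scrapy.cfg
    dangdang/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            dd.py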


items.py

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()      # book title
    link = scrapy.Field()       # link to the product page
    comment = scrapy.Field()    # review-count text shown for the book

dd.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from dangdang.items import DangdangItem


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    # start_urls = ['http://dangdang.com/']
    ua = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

    def start_requests(self):
        # first result page of a search for "python", sent with a browser user agent
        return [Request('http://search.dangdang.com/?key=python&act=input&show=big&page_index=1#J_tab',
                        headers=self.ua, callback=self.parse)]

    def parse(self, response):
        item = DangdangItem()
        # each field is a list; entries at the same index belong to the same book
        item['title'] = response.xpath("//a[@class='pic']/@title").extract()
        item['link'] = response.xpath("//a[@class='pic']/@href").extract()
        item['comment'] = response.xpath("//a[@dd_name='单品评论']/text()").extract()
        yield item
        # queue result pages 2-32; Scrapy's dupefilter drops the duplicates
        # that every parsed page re-emits here
        for i in range(2, 33):
            url = 'http://search.dangdang.com/?key=python&act=input&show=big&page_index=' + str(i) + '#J_tab'
            yield Request(url, callback=self.parse, headers=self.ua)
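
Before wiring up the pipeline, the XPath expressions can be sanity-checked interactively with scrapy shell (a quick check; this assumes the site returns the same HTML to Scrapy's default user agent, otherwise pass a browser UA via -s USER_AGENT=...):

scrapy shell "http://search.dangdang.com/?key=python&act=input&show=big&page_index=1#J_tab"
>>> response.xpath("//a[@class='pic']/@title").extract()[:3]
>>> response.xpath("//a[@dd_name='单品评论']/text()").extract()[:3]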

pipelines.py

import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # a new connection per item keeps the demo simple; a real project would
        # usually open it once in open_spider() and close it in close_spider()
        con = pymysql.connect(host='127.0.0.1', user='root', password='123',
                              database='dangdang', charset='utf8')
        cursor = con.cursor()
        sql = "INSERT INTO books(title, link, comment) VALUES (%s, %s, %s)"
        # the three fields are parallel lists scraped from one result page
        for title, link, comment in zip(item['title'], item['link'], item['comment']):
            cursor.execute(sql, (title, link, comment))
        con.commit()        # one commit per page instead of one per row
        cursor.close()
        con.close()
        return item
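
The pipeline assumes that the dangdang database and the books table already exist. A minimal way to create them (the column types here are my assumption; the original post does not show its schema):

import pymysql

con = pymysql.connect(host='127.0.0.1', user='root', password='123', charset='utf8')
with con.cursor() as cursor:
    # create the database and the books table the pipeline inserts into
    cursor.execute("CREATE DATABASE IF NOT EXISTS dangdang DEFAULT CHARACTER SET utf8")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS dangdang.books (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            link VARCHAR(255),
            comment VARCHAR(64)
        ) DEFAULT CHARSET=utf8
    """)
con.commit()
con.close()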

settings.py

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,        # uncomment this block
}
USER_AGENT = 'xxxxxxxxxx'   # set your own user-agent string
ROBOTSTXT_OBEY = False      # stop obeying robots.txt

4. Run it

Run the dd spider on its own (--nolog suppresses the log output for a cleaner console):

scrapy crawl dd
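
With the log output suppressed, the same command becomes:

scrapy crawl dd --nolog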

Result in the database (screenshot omitted):
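
If a MySQL GUI client is not handy, a few lines of pymysql are enough to inspect the result (a minimal sketch; the connection parameters mirror the ones used in the pipeline above):

import pymysql

con = pymysql.connect(host='127.0.0.1', user='root', password='123',
                      database='dangdang', charset='utf8')
with con.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM books")
    print("rows:", cursor.fetchone()[0])
    # print a small sample of the stored records
    cursor.execute("SELECT title, comment FROM books LIMIT 5")
    for title, comment in cursor.fetchall():
        print(title, comment)
con.close()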

Reposted from blog.csdn.net/nonoroya_zoro/article/details/80149371