A Scrapy Crawler for Dangdang

Foreword

It's been a long time since I wrote a proper post: I started an internship a few months ago, so the blog fell behind, and now the epidemic is keeping me from returning to school. Still, I can't let the skills get rusty, so today's article is a hands-on exercise in crawling Dangdang with Scrapy.


Creating the Scrapy project

Target site: http://search.dangdang.com/?key=python&category_path=01.00.00.00.00.00&page_index=1 This is the Dangdang search results page for the keyword python.

The first step, as usual, is to switch to the working directory on the command line and create the Scrapy project:

  • D:\pythonwork\cnblog> scrapy startproject cnblog_dangdang
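
For reference, the directory layout generated by startproject should look roughly like this (using the project name above):

cnblog_dangdang/
    scrapy.cfg              # deploy configuration
    cnblog_dangdang/        # the project's Python package
        __init__.py
        items.py            # item definitions
        middlewares.py      # spider and downloader middlewares
        pipelines.py        # item pipelines
        settings.py         # project settings
        spiders/            # spider files live here
            __init__.py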

 

 

Then use the cd command to enter the spiders folder inside the project and run the genspider command to create the spider file (note: the URL in this command is the domain name of the target site, not the full URL):

  • D:\pythonwork\cnblog\cnblog_dangdang\cnblog_dangdang\spiders> scrapy genspider dangdang_spider dangdang.com

 

At this point the project skeleton and the spider file have been created. Next, open the project in PyCharm and start writing code.


 

Content analysis

Open the target site and work out what we need to crawl.

 

For the books on the target site, we want to crawl five pieces of information: title, price, author, star rating, and description.

 

So first we declare the fields we want to crawl in the project's items.py file.

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CnblogDangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    author = scrapy.Field()
    star = scrapy.Field()
    detail = scrapy.Field()

Accordingly, the SQL statement that creates the table for our data looks like this:

CREATE TABLE IF NOT EXISTS dangdang_item (
id INT UNSIGNED AUTO_INCREMENT,
title CHAR(100) NOT NULL,
price CHAR(100) NOT NULL,
author CHAR(100) NOT NULL,
star CHAR(10) NOT NULL,
detail VARCHAR(1000),
PRIMARY KEY (id)
)ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
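
If you'd rather create the table from Python instead of the MySQL client, a minimal pymysql sketch could look like this (the connection values are placeholders, and the database itself is assumed to already exist):

import pymysql

# placeholder credentials; the target database must already exist
db = pymysql.connect(host='localhost', user='your mysql user',
                     password='your mysql password', database='your database name',
                     charset='utf8mb4')
cursor = db.cursor()

create_sql = """
CREATE TABLE IF NOT EXISTS dangdang_item (
    id INT UNSIGNED AUTO_INCREMENT,
    title CHAR(100) NOT NULL,
    price CHAR(100) NOT NULL,
    author CHAR(100) NOT NULL,
    star CHAR(10) NOT NULL,
    detail VARCHAR(1000),
    PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
"""

cursor.execute(create_sql)  # create the table if it is not there yet
db.commit()
cursor.close()
db.close()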

 


 

Writing the spider file

With the content analysis done, we come to the most important part: writing the spider file. First we should test whether the site has any anti-crawling measures.

For this step we only need to make a small change to dangdang_spider.py in the spiders folder so that it prints the response body of the target site.

dangdang_spider.py

# -*- coding: utf-8 -*-
import scrapy


class DangdangSpiderSpider(scrapy.Spider):
    name = 'dangdang_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&category_path=01.00.00.00.00.00&page_index=1']

    def parse(self, response):
        print(response.text)
        pass

To make debugging easier, we create a main.py file in the project root that launches the crawler, so we don't have to run the scrapy command on the command line every time.

main.py

from scrapy import cmdline
cmdline.execute('scrapy crawl dangdang_spider'.split())
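
As an aside, Scrapy can also be launched from a plain script via CrawlerProcess instead of cmdline; a rough sketch (assuming the script sits in the project root so the project settings are found):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from cnblog_dangdang.spiders.dangdang_spider import DangdangSpiderSpider

# load settings.py so pipelines and other project settings still apply
process = CrawlerProcess(get_project_settings())
process.crawl(DangdangSpiderSpider)
process.start()  # blocks until the crawl is finished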

Run main.py, and the HTML source of the target site is printed, so the site has no anti-crawling measures and we can fetch the content directly. Time to start extracting.

The five fields are extracted with XPath. The page structure is simple, so the XPath expressions can be worked out straight from the source.

Now write the real spider file, dangdang_spider.py:

# -*- coding: utf-8 -*-
import scrapy
import re
from cnblog_dangdang.items import CnblogDangdangItem

str_re = re.compile(r'\d+')

class DangdangSpiderSpider(scrapy.Spider):
    name = 'dangdang_spider'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&category_path=01.00.00.00.00.00&page_index=1']


    def parse(self, response):
        book_item = CnblogDangdangItem()
        items = response.xpath("//ul[@class='bigimg']/li")  # no .get() here; we want a selector object for each <li>
        for item in items:
            book_item['title'] = item.xpath("./a/@title").get()
            book_item['price'] = item.xpath("./p[@class='price']").xpath("string(.)").get()  # string(.) collects the text of all child nodes under the target node
            book_item['author'] = item.xpath("./p[@class='search_book_author']").xpath("string(.)").get()
            book_item['star'] = int(str_re.findall(item.xpath("./p[@class='search_star_line']/span/span/@style").get())[0])/20
            book_item['detail'] = item.xpath("./p[@class='detail']//text()").get()
            print(book_item)
            yield book_item

        next_url_end = response.xpath("//li[@class='next']/a/@href").get()
        # if a next-page link exists, follow it
        if next_url_end:
            next_url ='http://search.dangdang.com/'+ next_url_end
            yield scrapy.Request(next_url,callback=self.parse)
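
The only slightly tricky field is star: the rating lives in the style attribute of the rating span as a percentage width, and dividing that number by 20 turns it into a 0 to 5 score. A quick standalone check of that logic (the style string below is just a made-up sample):

import re

str_re = re.compile(r'\d+')

# hypothetical sample of the style attribute on the rating span
style = 'width: 90%;'
star = int(str_re.findall(style)[0]) / 20
print(star)  # 4.5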

Run the spider again, and now the extracted data is printed.

 

 

That means our spider file works. Next comes processing the data we've scraped.


 

Storing the data

This time we'll use MySQL to store the data. So what do we do first? Go straight to writing pipelines.py? Not quite; there's an important piece we haven't dealt with yet: settings.py.

To have pipelines.py process the scraped data, we first need to enable the pipeline in settings.py. It's simple: just uncomment the ITEM_PIPELINES setting in settings.py, as shown below.
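
After uncommenting, the setting should look roughly like this; note that the entry has to point at the pipeline class we actually write next (DangdangPipeline), so adjust the default class name if needed:

# settings.py
ITEM_PIPELINES = {
    'cnblog_dangdang.pipelines.DangdangPipeline': 300,
}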

 

Now we can write pipelines.py to handle our data.

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql

number = 0
class DangdangPipeline(object):

    # open_spider() runs once, when the spider starts
    def open_spider(self, spider):
        # connect to the database
        print("Connecting to the database, getting ready to write data")
        self.db = pymysql.connect(host='localhost', user='your mysql user',
                                  password='your mysql password', database='your database name')
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        global number
        number = number+1
        print('Writing item No. ' + str(number))
        # escape quotes in the data so they don't clash with the SQL statement
        title=str(item['title']).replace("'","\\'").replace('"','\\"')
        price=str(item['price']).replace("'","\\'").replace('"','\\"')
        author=str(item['author']).replace("'","\\'").replace('"','\\"')
        star=str(item['star']).replace("'","\\'").replace('"','\\"')
        detail=str(item['detail']).replace("'","\\'").replace('"','\\"')
        sql = f'INSERT INTO dangdang_item (title,price,author,star,detail) VALUES (\'{title}\',\'{price}\',\'{author}\',\'{star}\',\'{detail}\');'
        # execute the SQL statement
        self.cursor.execute(sql)
        # commit the change to the database
        self.db.commit()
        return item

    # close_spider() runs once, after the spider closes
    def close_spider(self, spider):
        print('Finished writing; ' + str(number) + ' items in total')
        # close the connection
        self.cursor.close()
        self.db.close()
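
One note on the design choice here: escaping quotes by hand with replace() works, but letting pymysql substitute the values is generally more robust. A hedged sketch of how process_item could pass parameters instead of building the string itself:

    def process_item(self, item, spider):
        # let pymysql handle quoting instead of escaping by hand
        sql = ('INSERT INTO dangdang_item (title, price, author, star, detail) '
               'VALUES (%s, %s, %s, %s, %s)')
        self.cursor.execute(sql, (item['title'], item['price'], item['author'],
                                  str(item['star']), item['detail']))
        self.db.commit()
        return item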

Now run main.py again and see how the crawler does.

 

 

Let's check the database and look at the data we just scraped.

 

OK, that's it. Our Scrapy crawler for Dangdang is done.

 

 

 

 

 


Origin www.cnblogs.com/CYHISTW/p/12377124.html