Python Crawler in Practice: Crawling Product Information with Scrapy and Writing It to a Database

This article describes using the Scrapy framework to crawl book information from Dangdang and write the results to a MySQL database.



You need to install the following packages:

  • pip install wheel
  • pip install twisted
  • pip install lxml
  • pip install scrapy

Installing them from the Anaconda Prompt is recommended.


1. The Scrapy framework

1.1 Introduction

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used for a wide range of tasks, such as data mining, data processing, or archiving historical data.

It was originally designed for web scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.


1.2 Common commands

  • scrapy startproject tutorial : create a crawler project;
  • scrapy genspider -l : list the available spider templates;
  • scrapy genspider -t basic example example.com : create a spider from the basic template, with spider name example and domain example.com;
  • scrapy crawl <spider> : run a spider;
  • scrapy list : list all available spiders in the current project, one per line.

2. Basic use

1. Open a command line and enter the following command to create a Scrapy project:

scrapy startproject FirstProject

2. Project directory structure
[Screenshot: project directory structure]
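For reference, scrapy startproject typically generates a layout like the following (exact files may vary slightly across Scrapy versions):

FirstProject/
    scrapy.cfg            # deployment configuration
    FirstProject/         # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code goes here
            __init__.py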
3. Enter the project directory and view the available spider templates:

>cd FirstProject
>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

4. Create a spider. Note that the domain should not include the host-name prefix (such as www.).
For example, suppose we want to crawl the CCTV5 World Football video page at http://tv.cctv.com/lm/txzq/videoset/ . The domain can also be changed later in the generated file, so getting it wrong here does not matter; just remember that the field to edit in the spider is the allowed domain (allowed_domains).

>scrapy genspider -t basic first cctv.com/lm/txzq/videoset/
Created spider 'first' using template 'basic' in module:
  FirstProject.spiders.first

The spider file is generated:
[Screenshots: the spiders directory and the generated first.py]
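For reference, a spider generated from the basic template looks roughly like this (the exact boilerplate varies with the Scrapy version):

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['cctv.com/lm/txzq/videoset/']
    start_urls = ['http://cctv.com/lm/txzq/videoset//']

    def parse(self, response):
        pass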
5. Run the spider

>scrapy crawl first

6. List the spiders currently available

>scrapy list

7. View the available scrapy commands

>scrapy

3. Scrapy in practice: crawling product data from Dangdang

The workflow for writing a Scrapy crawler project:

  • Create a crawler project;
  • Write items.py;
  • Create a spider file;
  • Write the spider file;
  • Write pipelines.py;
  • Configure settings.py.

Dangdang computer books category page: http://category.dangdang.com/cp01.54.26.00.00.00.html

3.1 Analyzing the book title field

Open the page, right-click an empty area, and view the page source. Searching the source and comparing it with the results shown on the page, we find the following field and can confirm that it contains the book title. You can also locate the element through a product's comment count and then confirm that the same field contains the product name.
[Screenshot: page source showing the book title field]

3.2 Analyzing the comment-count field

After analyzing the book title field, we analyze the comment-count field. By searching the source, we can confirm that the following field contains the number of product reviews, and we can use it to build the extraction expression.
[Screenshot: page source showing the comment-count field]

3.3 Analyzing the product link field

From the analysis we can see that the link's href attribute sits in the same element as the product title field, so we can use this to build the extraction expression.
[Screenshot: page source showing the product link field]
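Before writing the spider, these expressions can be checked interactively with scrapy shell (a quick sketch; the XPath expressions are the ones used later in the spider in section 4.3):

>scrapy shell "http://category.dangdang.com/cp01.54.26.00.00.00.html"
>>> response.xpath("//a[@name='itemlist-title']/@title").extract()   # book titles
>>> response.xpath("//a[@name='itemlist-title']/@href").extract()    # product links
>>> response.xpath("//a[@name='itemlist-review']/text()").extract()  # comment counts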

3.4 Analyzing how the URL changes across pages

Dangdang shows 60 results per page, so to handle pagination we need to analyze how the URL changes with the page number. Paste the URLs of pages 1 to 5 into a document for easy comparison; the URLs of the different pages are shown below:
[Screenshot: URLs of pages 1 to 5]
You can see that only the pg field changes, but the first page does not contain this field. We can therefore guess that the URL of the first page is http://category.dangdang.com/pg1-cp01.54.26.00.00.00.html . If opening this URL shows the same page as the original first-page URL, our guess is confirmed. It does, so we can build the URL for each page from this pattern.
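As a quick check, the per-page URLs can be generated from this pattern (a minimal sketch; the page range here is arbitrary):

base = "http://category.dangdang.com/pg{}-cp01.54.26.00.00.00.html"
for page in range(1, 6):
    print(base.format(page))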


4. The complete code

4.1 Create the crawler project

scrapy startproject dangdang

cd dangdang

scrapy genspider dd_books dangdang.com

4.2 Modify items.py

import scrapy


class DangdangItem(scrapy.Item):
    # define the fields for your item here:
    title = scrapy.Field()    # book title
    link = scrapy.Field()     # product link
    comment = scrapy.Field()  # number of comments

4.3 Write the spider file dd_books.py

import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request  # used to request further pages (pagination)

class DdBooksSpider(scrapy.Spider):
    name = 'dd_books'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cp01.54.26.00.00.00.html']

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@name='itemlist-title']/@title").extract()
        item["link"] = response.xpath("//a[@name='itemlist-title']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        yield item

        # only page 2 is requested here; widen the range to crawl more pages
        for i in range(2, 3):
            url = "http://category.dangdang.com/pg"+str(i)+"-cp01.54.26.00.00.00.html"
            yield Request(url, callback=self.parse)

4.4 Write pipelines.py

This section uses PyMySQL (install it with pip install pymysql) to write the results to a MySQL database. If you want this step, configure MySQL first; if you do not want to write to a database, simply comment this part out.

import pymysql

class DangdangPipeline(object):
    def process_item(self, item, spider):
        # connect to the local MySQL database "ddbooks"
        conn = pymysql.connect(host="localhost", user="root", password="mysql105",
                               database="ddbooks", charset='utf8')

        with conn.cursor() as cursor:
            for i in range(0, len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                comment = item["comment"][i]
                # parameterized query avoids breakage when a title contains quotes
                sql = "insert into dangdang(title, link, comment) values(%s, %s, %s)"
                cursor.execute(sql, (title, link, comment))
            conn.commit()
        conn.close()

        return item
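The pipeline assumes that a MySQL database named ddbooks with a dangdang table already exists. A minimal sketch for creating the table with pymysql (the column types and lengths are assumptions; adjust them to your data):

import pymysql

# credentials match the pipeline above; the schema below is an assumption
conn = pymysql.connect(host="localhost", user="root", password="mysql105",
                       database="ddbooks", charset="utf8")
with conn.cursor() as cursor:
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS dangdang ("
        "  title   VARCHAR(255),"
        "  link    VARCHAR(255),"
        "  comment VARCHAR(64)"
        ")"
    )
conn.commit()
conn.close()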

4.5 Modify settings.py

[Screenshot: settings.py modifications]
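The screenshot is not reproduced here; the usual edits for a project like this are enabling the item pipeline and turning off robots.txt compliance (a sketch, assuming the project layout created above):

# settings.py
ROBOTSTXT_OBEY = False  # tutorials like this typically disable robots.txt checking

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}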

4.6 Start the crawler from the command line

(keras)D:\Project\05 Python\02 爬虫\05 Scrapy\dangdang>scrapy crawl dd_books

Output:

[Screenshots: crawl log output]
Database results:
[Screenshot: rows in the dangdang table]


Note: when setting up the pymysql connection, set charset to utf8 (not utf-8) to avoid garbled characters.


Reference:
Scrapy official document: https://doc.scrapy.org/en/latest/intro/tutorial.html
Scrapy Chinese document: https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html
Related courses: https://edu.aliyun.com/course/1994

