This article describes how to use the Scrapy framework to crawl Dangdang book information and write the results to a MySQL database.
You need to install the following packages:
- pip install wheel
- pip install twisted
- pip install lxml
- pip install scrapy
Installing from the Anaconda Prompt is recommended.
1 Scrapy framework
1.1 Introduction
Scrapy is an application framework written for crawling websites and extracting structured data. Applications include data mining, data processing, and archiving historical data.
It was originally designed for web scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.
1.2 Common commands
- scrapy startproject tutorial: create a crawler project;
- scrapy genspider -l: list the available spider templates;
- scrapy genspider -t basic example example.com: create a spider from the template basic, with spider name example and allowed domain example.com;
- scrapy crawl <spider>: run a spider;
- scrapy list: list all available spiders in the current project, one per line.
2. Basic usage
1. Open a command line and enter the following command to create a Scrapy project:
scrapy startproject FirstProject
2. Project directory structure
3. Go into the project directory and view the spider templates:
>cd FirstProject
>scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
4. Create a spider. Note that the domain must not include the host name (such as www.).
For example, suppose we want to crawl the CCTV5 world football video site at http://tv.cctv.com/lm/txzq/videoset/. You can generate the spider now and modify the domain later; if it is wrong it does not matter, just remember that the allowed_domains field in the spider file holds the domain name.
>scrapy genspider -t basic first cctv.com/lm/txzq/videoset/
Created spider 'first' using template 'basic' in module:
FirstProject.spiders.first
The generated spider file:
5. Run the spider
>scrapy crawl first
6. List the currently available spiders
>scrapy list
7. View the scrapy command help
>scrapy
3. Scrapy in practice: crawling Dangdang product data
The workflow for writing a Scrapy crawler project:
- Create a crawler project;
- Write items;
- Create a spider file;
- Write the spider code;
- Write pipelines;
- Configure settings.
Dangdang's computer books category page: http://category.dangdang.com/cp01.54.26.00.00.00.html
3.1 Book name field analysis
Open the page, right-click a blank area, and view the page source. By searching and analyzing the source, we can find the field shown below and determine that it contains the book title information. You can also locate the relevant markup via the product comment count, and from there determine which field contains the product name.
3.2 Review count field analysis
After analyzing the book name field, we analyze the comment count field. Through searching, we can determine that the field below contains the number of product reviews, and we can use it to build a regular expression.
3.3 Product link field analysis
The analysis shows that the href link attribute appears in the same tag as the product name field, so we can use this to build a regular expression as well.
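The field analysis in sections 3.1-3.3 can be sketched with stdlib regular expressions. The HTML snippet below is hypothetical, modeled on the itemlist-title / itemlist-review markup described above; the real Dangdang page carries more attributes.

```python
import re

# Hypothetical sample of the list-page markup (simplified from the
# itemlist-title / itemlist-review anchors described above).
html = (
    '<a title="Python Crash Course" href="http://product.dangdang.com/123.html" '
    'name="itemlist-title">Python Crash Course</a>'
    '<a href="#" name="itemlist-review">1234 comments</a>'
)

# The title and link live on the same <a name="itemlist-title"> tag,
# so one pattern can capture both at once.
pat_item = re.compile(r'<a title="(.*?)" href="(.*?)"\s+name="itemlist-title"')
# The review count sits in the text of the itemlist-review anchor.
pat_comment = re.compile(r'name="itemlist-review">(.*?)</a>')

titles_links = pat_item.findall(html)
comments = pat_comment.findall(html)
print(titles_links)  # [('Python Crash Course', 'http://product.dangdang.com/123.html')]
print(comments)      # ['1234 comments']
```

The non-greedy `(.*?)` groups stop at the first closing quote or tag, which is what makes this pattern safe on a page containing many products.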
3.4 URL changes across different pages
Dangdang shows 60 results per page, so to handle pagination we need to analyze how the URL changes with the page number. Paste the URLs of pages 1-5 into a document to compare them; the URLs of the different pages are shown in the figure below:
As you can see, only the pg
field changes, but the first page does not contain this field. We can therefore guess that the first page's URL is http://category.dangdang.com/pg1-cp01.54.26.00.00.00.html. Opening this URL shows the same page as the original first page, which confirms the guess. With this pattern verified, we can construct the URL for any page number.
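The pg<n> pattern identified above can be turned into a small URL builder (the helper name page_urls is just for illustration):

```python
# Build the category list-page URLs from the pg<n> pattern identified above.
base = "http://category.dangdang.com/pg{}-cp01.54.26.00.00.00.html"

def page_urls(first, last):
    """Return the category URLs for pages first..last (inclusive)."""
    return [base.format(n) for n in range(first, last + 1)]

urls = page_urls(1, 3)
print(urls[0])  # http://category.dangdang.com/pg1-cp01.54.26.00.00.00.html
```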
4. The complete code
4.1 Create the crawler project
scrapy startproject dangdang
cd dangdang
scrapy genspider dd_books dangdang.com
4.2 Modify the items.py file
import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()
4.3 Modify / create the spider file dd_books.py
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request  # used for pagination

class DdBooksSpider(scrapy.Spider):
    name = 'dd_books'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://category.dangdang.com/pg1-cp01.54.26.00.00.00.html']

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@name='itemlist-title']/@title").extract()
        item["link"] = response.xpath("//a[@name='itemlist-title']/@href").extract()
        item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        yield item
        for i in range(2, 3):
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.54.26.00.00.00.html"
            yield Request(url, callback=self.parse)
4.4 Write the pipelines.py file
This section uses PyMySQL to write the items to the database. If you need it, configure MySQL first; if you do not want to write to a database, simply comment this part out.
import pymysql

class DangdangPipeline(object):
    def process_item(self, item, spider):
        conn = pymysql.connect("localhost", "root", "mysql105", "ddbooks", charset='utf8')
        cursor = conn.cursor()
        for i in range(0, len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            # use a parameterized query to avoid quoting errors and SQL injection
            sql = "insert into dangdang(title, link, comment) values(%s, %s, %s)"
            cursor.execute(sql, (title, link, comment))
        conn.commit()
        conn.close()
        return item
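The insert logic of the pipeline can be exercised without a MySQL server. The sketch below uses the stdlib sqlite3 module in place of PyMySQL (an assumption purely for portability; with PyMySQL the placeholder style is %s instead of ?), with the same dangdang table layout as above:

```python
import sqlite3

# In-memory SQLite stands in for MySQL so this sketch runs anywhere;
# the table mirrors the dangdang(title, link, comment) table used above.
conn = sqlite3.connect(":memory:")
conn.execute("create table dangdang(title text, link text, comment text)")

rows = [
    ("Book A", "http://product.dangdang.com/1.html", "532 comments"),
    ("Book B", "http://product.dangdang.com/2.html", "88 comments"),
]
# Parameterized inserts let the driver handle quoting for us.
conn.executemany("insert into dangdang(title, link, comment) values(?, ?, ?)", rows)
conn.commit()

count = conn.execute("select count(*) from dangdang").fetchone()[0]
print(count)  # 2
```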
4.5 Modify settings.py
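For items to reach the pipeline, settings.py must register it. A minimal sketch of the relevant fragment (the priority 300 is Scrapy's conventional default; disabling ROBOTSTXT_OBEY is an assumption common in tutorials like this one, adjust to your needs):

```python
# settings.py (fragment)
# Register the pipeline so Scrapy passes scraped items to it
# (lower number = runs earlier when several pipelines are enabled).
ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}
# The target site's robots.txt may disallow crawling; tutorials
# typically turn the check off for testing.
ROBOTSTXT_OBEY = False
```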
4.6 Start the crawler from the command line
(keras)D:\Project\05 Python\02 爬虫\05 Scrapy\dangdang>scrapy crawl dd_books
Output:
Database results:
Note: when configuring the PyMySQL connection, set charset to utf8, not utf-8, to prevent garbled characters.
Reference:
Scrapy official document: https://doc.scrapy.org/en/latest/intro/tutorial.html
Scrapy Chinese document: https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html
Related courses: https://edu.aliyun.com/course/1994