Foreword
Crawl the Douban Books Top 250: title, link, author, publisher, publication date, price, rating, and review, and store the data in a CSV file.
This article organizes the code, walks through the approach, and verifies that the code works. -- 2019.12.15
Environment:
Python 3 (Anaconda3)
PyCharm
Chrome browser
Main modules:
lxml
requests
csv
1.
Crawl the Douban Books Top 250 home page
2.
Analyze the URL pattern
https://book.douban.com/top250?          # page 1
https://book.douban.com/top250?start=25  # page 2
https://book.douban.com/top250?start=50  # page 3
https://book.douban.com/top250?start=75  # page 4
...
The home-page URL looks different from the others, but testing shows that https://book.douban.com/top250?start=0 also reaches the home page, so the whole URL list can be built with a single list comprehension:
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]
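As a quick sanity check, the comprehension yields one URL per page, stepping the start parameter by 25:

```python
# Build the ten page URLs; range(0, 250, 25) steps the start parameter by 25.
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]

print(len(urls))   # 10
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225
```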
3.
Crawl the title, link, author, publisher, publication date, price, rating, and review
Inspect the page source and parse it with XPath:
# all fields sit inside <tr class="item">; extract that block first for further parsing
infos = selector.xpath('//tr[@class="item"]')
for info in infos:
    name = info.xpath('td/div/a/@title')[0]        # title
    url = info.xpath('td/div/a/@href')[0]          # link
    book_infos = info.xpath('td/p/text()')[0]
    author = book_infos.split('/')[0]              # author
    publisher = book_infos.split('/')[-3]          # publisher
    date = book_infos.split('/')[-2]               # publication date
    price = book_infos.split('/')[-1]              # price
    rate = info.xpath('td/div/span[2]/text()')[0]  # rating
    comments = info.xpath('td/p/span/text()')      # review
    comment = comments[0] if len(comments) != 0 else 'empty'
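The split logic can be checked offline against a made-up info string (the sample text below is an assumption shaped like the page's "author / publisher / date / price" line, not actual page content; `.strip()` is added here to drop the spaces around each slash):

```python
# Hypothetical sample shaped like Douban's book-info line: author / publisher / date / price
book_infos = 'J. K. Rowling / 人民文学出版社 / 2000-9 / 19.50元'
parts = book_infos.split('/')
author = parts[0].strip()      # everything before the first slash
publisher = parts[-3].strip()  # third field from the end
date = parts[-2].strip()       # second field from the end
price = parts[-1].strip()      # last field
print(author, publisher, date, price)
```

Indexing from the end (`[-3]`, `[-2]`, `[-1]`) keeps publisher, date, and price correct even when the author field itself contains extra slashes, for example multiple authors or a translator.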
4.
Store the data in a CSV file
Storing the data is simple: like putting an elephant into the refrigerator, it takes three steps.
- "Open the refrigerator"
# create the CSV file
fp = open('doubanbook.csv', 'wt', newline='', encoding='utf-8')
- "The elephant loaded into"
# write a row of data
writer.writerow((name, url, author, publisher, date, price, rate,comment))
- "Close the refrigerator"
# close the CSV file
fp.close()
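The same three steps can also be written with a with block, which closes the file automatically even if an error is raised mid-write (a minimal sketch reusing the article's doubanbook.csv filename and header row):

```python
import csv

# "Open the refrigerator": the with block opens the file and guarantees it is closed
with open('doubanbook.csv', 'wt', newline='', encoding='utf-8') as fp:
    writer = csv.writer(fp)
    # "Put the elephant in": write the header row (data rows use the same call)
    writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))
# "Close the refrigerator" happens automatically when the block exits
```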
With that, crawling the Douban Books Top 250 data is complete.
5.
The complete code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# import the required libraries
from lxml import etree
import requests
import csv

# create the CSV file
fp = open('doubanbook.csv', 'wt', newline='', encoding='utf-8')
# write the header row
writer = csv.writer(fp)
writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))

# build the list of page URLs
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]

# add request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # grab the enclosing <tr class="item"> of each book and loop over them
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]        # title
        url = info.xpath('td/div/a/@href')[0]          # link
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]              # author
        publisher = book_infos.split('/')[-3]          # publisher
        date = book_infos.split('/')[-2]               # publication date
        price = book_infos.split('/')[-1]              # price
        rate = info.xpath('td/div/span[2]/text()')[0]  # rating
        comments = info.xpath('td/p/span/text()')      # review
        comment = comments[0] if len(comments) != 0 else 'empty'
        # write a row of data
        writer.writerow((name, url, author, publisher, date, price, rate, comment))

# close the CSV file
fp.close()