Crawler practice - Scraping the Douban Books Top 250

Foreword

Crawl the Douban Books Top 250 data - title, link, author, publisher, publication date, price, rating, and review - and store it in a CSV file.

This article organizes the code, walks through the approach, and verifies that the code works. --2019.12.15


Environment:
Python 3 (Anaconda3)
PyCharm
Chrome browser

Main modules:
lxml
requests
csv

1.

Crawl the Douban Books Top 250 home page

2.

URL pattern analysis

https://book.douban.com/top250?  # home page
https://book.douban.com/top250?start=25  # page 2
https://book.douban.com/top250?start=50  # page 3
https://book.douban.com/top250?start=75  # page 4
...

The home page URL does not follow the same format as the other pages, but testing shows that https://book.douban.com/top250?start=0 also reaches the home page, so we can build the whole URL list with a list comprehension:

urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0,250,25)]
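
A quick sanity check (a minimal sketch) confirms the comprehension yields the ten expected page URLs:

# sanity check: 10 pages of 25 books each
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
print(len(urls))   # 10
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225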

3.

Crawl the title, link, author, publisher, publication date, price, rating, and review data.
Inspect the page source, then extract the fields with XPath:

# All the fields sit inside tr class="item"; extract that block first to simplify parsing
infos = selector.xpath('//tr[@class="item"]')

for info in infos:
    name = info.xpath('td/div/a/@title')[0]  # title
    link = info.xpath('td/div/a/@href')[0]  # link
    book_infos = info.xpath('td/p/text()')[0]
    author = book_infos.split('/')[0]  # author
    publisher = book_infos.split('/')[-3]  # publisher
    date = book_infos.split('/')[-2]  # publication date
    price = book_infos.split('/')[-1]  # price
    rate = info.xpath('td/div/span[2]/text()')[0]  # rating
    comments = info.xpath('td/p/span/text()')  # review
    comment = comments[0] if len(comments) != 0 else "空"  # "空" ("empty") when there is no review
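
The snippet above assumes a selector already built from a fetched page. A minimal, self-contained sketch follows; the sample info string is a typical Douban format shown for illustration only:

from lxml import etree
import requests

# minimal sketch: fetch the first page and build the selector used above
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('https://book.douban.com/top250?start=0', headers=headers)
selector = etree.HTML(html.text)
infos = selector.xpath('//tr[@class="item"]')
print(len(infos))  # 25 books per page

# worked example of the split logic on a typical (illustrative) info string
sample = '[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元'
parts = sample.split('/')
print(parts[0])   # author: '[美] 卡勒德·胡赛尼 '
print(parts[-3])  # publisher: ' 上海人民出版社 '
print(parts[-2])  # publication date: ' 2006-5 '
print(parts[-1])  # price: ' 29.00元'

Negative indices are used because the number of leading fields varies (a translator may or may not be listed); note the fields keep their surrounding spaces, which .strip() would remove.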

4.

Store the data in a CSV file.
The process is simple - the classic three steps of "putting an elephant into the refrigerator" (an equivalent with-block sketch follows the steps):

  1. "Open the refrigerator"
# 创建csv
fp = open('doubanbook.csv', 'wt', newline='', encoding='utf-8')
  1. "The elephant loaded into"
# 写入数据
writer.writerow((name, url, author, publisher, date, price, rate,comment))
  1. "Close the refrigerator."
# 关闭csv文件
fp.close()
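
An equivalent and slightly more idiomatic sketch wraps the three steps in a with block, which closes the file automatically even if an error occurs:

import csv

# 'with' replaces the explicit fp.close()
with open('doubanbook.csv', 'wt', newline='', encoding='utf-8') as fp:
    writer = csv.writer(fp)
    writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))
    # ... one writer.writerow((name, link, author, publisher, date, price, rate, comment)) per book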

With that, crawling the Douban Books Top 250 data is complete.


5.

The complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# import the required libraries
from lxml import etree
import requests
import csv

# create the csv file
fp = open('doubanbook.csv', 'wt', newline='', encoding='utf-8')

# write the header row
writer = csv.writer(fp)
writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))

# build the list of urls
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]

# add a request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # grab the outer "item" rows and loop over them
    infos = selector.xpath('//tr[@class="item"]')

    for info in infos:
        name = info.xpath('td/div/a/@title')[0]  # title
        link = info.xpath('td/div/a/@href')[0]  # link; named 'link' so it does not shadow the loop variable url
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]  # author
        publisher = book_infos.split('/')[-3]  # publisher
        date = book_infos.split('/')[-2]  # publication date
        price = book_infos.split('/')[-1]  # price
        rate = info.xpath('td/div/span[2]/text()')[0]  # rating
        comments = info.xpath('td/p/span/text()')  # review
        comment = comments[0] if len(comments) != 0 else "空"  # "空" ("empty") when there is no review
        # write a row of data
        writer.writerow((name, link, author, publisher, date, price, rate, comment))

# 关闭csv文件
fp.close()
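
To spot-check the result, a short sketch (assuming the script above has already run in the same directory) reads the header and first data row back:

import csv

# read the csv back to verify the header and the first data row
with open('doubanbook.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    print(next(reader))  # ['name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment']
    print(next(reader))  # the first book's row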

Origin blog.csdn.net/weixin_44835732/article/details/103546841