More web scraping, friends.

Today's assignment happens to be a scraper, so let's keep yesterday's momentum going and continue talking about web scraping.

The task:

Basically, the task is to crawl the information for Douban's Top 250 books, write it to a file, and download the cover images.

Scraper analysis

(screenshot: a page of the Douban Top 250 book list)

  • As the screenshot shows, every list page already contains all the information we need, so there is no need to crawl each book's detail page.
  • The next question is pagination: the URL for page n is https://book.douban.com/top250?start=(n - 1) * 25 (see the snippet after this list).
  • The results are written to a CSV file using the csv module.
  • For parsing I use XPath; I find XPath and regular expressions the most familiar, and CSS selectors work just as well. Mastering three or four parsing methods is basically enough.
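To make the pagination concrete, here is a minimal sketch that builds the ten list-page URLs; start takes the values 0, 25, ..., 225:

# start = (page - 1) * 25 for pages 1 through 10
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]
print(urls[0])   # https://book.douban.com/top250?start=0
print(urls[-1])  # https://book.douban.com/top250?start=225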

Code section:

import requests
from lxml import etree
import csv
import os

# Request a list page and parse out the fields we want
def get_informations(url):
    res = requests.get(url, headers=headers)
    selector = etree.HTML(res.text)
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        image = info.xpath("td/a[@class='nbg']/img/@src")[0]  # cover image URL
        pic_list.append(image)
        name = info.xpath('td/div/a/@title')[0]  # title
        names.append(name)
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0].strip()  # author
        publisher = book_infos.split('/')[-3].strip()  # publisher
        date = book_infos.split('/')[-2].strip()  # publication date
        price = book_infos.split('/')[-1].strip()  # price
        num = info.xpath('td/div/span[3]/text()')[0]  # number of ratings
        rate = info.xpath('td/div/span[2]/text()')[0]  # rating score
        comments = info.xpath('td/p/span/text()')  # one-line blurb
        comment = comments[0] if len(comments) != 0 else "空"
        writer.writerow((author, publisher, date, price, comment, num, rate, name))

# Download the cover images
def get_image():
    savePath = './豆瓣图书250图片'
    # Create the folder that will hold the images
    if not os.path.exists(savePath):
        os.makedirs(savePath)
    # Walk through the image URLs and save each picture, named after its book
    for i in range(len(pic_list)):
        html = requests.get(pic_list[i], headers=headers)
        if html.status_code == 200:
            with open(savePath + "/%s.jpg" % names[i], "wb") as f:
                f.write(html.content)
        elif html.status_code == 404:
            continue

# Main function: call the functions above
def main():
    for url in urls:
        get_informations(url)
    get_image()


if __name__ == '__main__':
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }

    # The ten list-page URLs to crawl
    urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]
    # Create and open the CSV file, then write the header row
    fp = open(r"./豆瓣图书.csv", 'wt', newline="", encoding="utf-8")
    writer = csv.writer(fp)
    writer.writerow(('author', 'publisher', 'press_time', 'price', 'produce', 'rating_num', 'rating_score', 'title'))
    # Keep the image URLs and book titles so the downloads can be matched up and named
    pic_list = []
    names = []
    main()
    fp.close()
    print("文件和图片都爬取完毕!")

The result:

(screenshots of the generated CSV file and the downloaded cover images)
That completes the assignment. Writing it took about ten minutes; if you're interested, give it a try yourself.

Origin: blog.csdn.net/shelgi/article/details/103639433