Today's job happened to involve web scraping, so let's keep yesterday's momentum going and continue talking about crawlers.
Task:
Crawl the information for Douban's Top 250 books, write it to a CSV file, and download the cover images.
Crawler analysis
- Every listing page already contains all the information we need, so there is no need to crawl each book's detail page.
- Pagination follows a simple pattern: the URL for page N is https://book.douban.com/top250?start=(N-1)*25.
- The results are written out with the csv module.
- For parsing I use XPath; I'm a bit more comfortable with XPath and re, though CSS selectors work just as well (see the sketch after this list). Mastering three or four parsing methods is basically enough.
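For comparison, here is a minimal sketch of how the same title field could be pulled with a CSS selector instead of XPath. It uses BeautifulSoup, which is my own choice for the illustration; the actual script below sticks to lxml + XPath.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://book.douban.com/top250?start=0', headers=headers)
soup = BeautifulSoup(res.text, 'lxml')
# CSS-selector equivalent of the XPath 'td/div/a/@title' used in the script below
for a in soup.select('tr.item td div a[title]'):
    print(a['title'])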
Code section:
import requests
from lxml import etree
import csv
import os

# Request a listing page and parse out the fields we need
def get_informations(url):
    res = requests.get(url, headers=headers)
    selector = etree.HTML(res.text)
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        image = info.xpath("td/a[@class='nbg']/img/@src")[0]  # cover image URL
        pic_list.append(image)
        name = info.xpath('td/div/a/@title')[0]  # title
        names.append(name)
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]  # author
        publisher = book_infos.split('/')[-3]  # publisher
        date = book_infos.split('/')[-2]  # publication date
        price = book_infos.split('/')[-1]  # price
        num = info.xpath('td/div/span[3]/text()')[0]  # number of ratings
        rate = info.xpath('td/div/span[2]/text()')[0]  # rating score
        coments = info.xpath('td/p/span/text()')  # one-line blurb
        coment = coments[0] if len(coments) != 0 else "空"  # "空" = empty
        writer.writerow((author, date, price, coment, num, rate, name))

# Download the cover images
def get_image():
    savePath = './豆瓣图书250图片'
    # Create the folder that will hold the images
    if not os.path.exists(savePath):
        os.makedirs(savePath)
    # Walk the collected image URLs and save each picture
    for i in range(len(pic_list)):
        html = requests.get(pic_list[i], headers=headers)
        if html.status_code == 200:
            with open(savePath + "/%s.jpg" % names[i], "wb") as f:
                f.write(html.content)
        elif html.status_code == 404:
            continue

# Main entry point: scrape every listing page, then download the images
def main():
    for url in urls:
        get_informations(url)
    get_image()

if __name__ == '__main__':
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'}
    # The ten listing pages to crawl (start = 0, 25, ..., 225)
    urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 226, 25)]
    # Create and open the csv file, write the header row
    fp = open(r"./豆瓣图书.csv", 'wt', newline="", encoding="utf-8")
    writer = csv.writer(fp)
    writer.writerow(('author', 'press_time', 'price', 'produce', 'rating_num', 'rating_score', 'title'))
    # Collected image URLs and book titles, used when saving the pictures
    pic_list = []
    names = []
    main()
    fp.close()
    print("文件和图片都爬取完毕!")  # "Files and images all scraped!"
Result:
That completes the task. It took roughly ten minutes to write; if you're interested, give it a try yourself.