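Python Web Scraper: Douban Books

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/csdnlinyongsheng/article/details/85045133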

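Preparation

The Douban Books URL is: https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=880&type=T

The items marked with red arrows in the screenshot are the information we want to collect. With the target defined, the next step is to get the page's HTML source and parse it for that data. To fetch the source we can use the requests library: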

import requests
 
resp = requests.get('https://book.douban.com/top250?start=0')
print(resp.text)
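
The request above sends no headers and does not check the response status. As a minimal sketch of a slightly more defensive version (the helper name `fetch_html`, the `HEADERS` constant, and the 10-second timeout are assumptions for illustration, not part of the original post), the same call could be wrapped like this:

```python
import requests

# Browser-like User-Agent string, the same one the full script uses later on
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


def fetch_html(url, session=None, timeout=10):
    """Return the page text; raise requests.HTTPError on a 4xx/5xx status."""
    # Hypothetical helper: an existing session can be passed in for reuse
    session = session or requests.Session()
    resp = session.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    html = fetch_html('https://book.douban.com/top250?start=0')
    print(html[:200])  # show only the beginning of the page
```

Raising on a bad status makes a blocked or failed request visible immediately instead of passing an error page on to the parser.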

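Running this program prints the page's HTML. The next question is how to extract the target information from it.

Open the page in a browser, press F12 to open the developer tools, and locate the target information in the page source:

The book title is stored in the a tag inside the h2 tag of the element with class='info'. Once we know where the target is, we can use BeautifulSoup to parse the HTML into an object, which can also display the HTML with standard indentation: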

# If bs4 and lxml are not installed in your Python environment, install them first: pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
 
soup = BeautifulSoup(resp.text, 'lxml')
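
To make the selector concrete, here is a small sketch that runs on a hand-written HTML fragment instead of the live page. The fragment, the book titles, and the helper name `extract_titles` are hypothetical; only the class='info' / h2 / a nesting comes from the description above. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; the real page is much larger
SAMPLE_HTML = """
<ul>
  <li>
    <div class="info">
      <h2><a href="https://example.com/book/1" title="Book One">Book One</a></h2>
    </div>
  </li>
  <li>
    <div class="info">
      <h2><a href="https://example.com/book/2" title="Book Two">Book Two</a></h2>
    </div>
  </li>
</ul>
"""


def extract_titles(html, parser="html.parser"):
    """Collect the title attribute of every <a> under div.info > h2."""
    soup = BeautifulSoup(html, parser)
    return [a["title"] for a in soup.select("div.info h2 a[title]")]


if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
    # prettify() returns the parsed document with one tag per line, indented
    print(BeautifulSoup(SAMPLE_HTML, "html.parser").prettify())
```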


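I recommend the lxml parser because it is fast. If you would rather not install anything extra, you can use html.parser from Python's standard library instead; it makes no difference to the results we get here.

The complete Douban Books scraper is as follows: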
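
One way to act on this advice, sketched under the assumption that you want the script to run whether or not lxml is installed (the helper name `make_soup` is hypothetical), is to try lxml first and fall back to html.parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable
        return BeautifulSoup(html, "html.parser")


if __name__ == "__main__":
    soup = make_soup('<div class="info"><h2><a title="T">T</a></h2></div>')
    print(soup.find("a")["title"])
```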

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}


def get_html(url):
    # Fetch a page with a browser-like User-Agent and return its HTML text
    return requests.get(url, headers=HEADERS).text


def all_page():
    # Step through the listing 20 books at a time: start=0, 20, ..., 880
    base_url = "https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start="
    return [base_url + str(page) for page in range(0, 900, 20)]


def text_of(tag):
    # Stripped text of a tag, or an empty string when the tag is missing
    return tag.get_text().strip() if tag is not None else ""


def html_parser(f):
    for url in all_page():
        soup = BeautifulSoup(get_html(url), 'lxml')
        # Read every field from the book's own <div class="info"> instead of
        # zipping four page-wide lists, which shifts the data out of line
        # whenever a book has no rating or no summary. This assumes the .pub
        # and .rating_nums elements are nested inside that div.
        for info in soup.find_all('div', class_='info'):
            name = info.find("a")["title"]                     # title
            version = text_of(info.find(class_="pub"))         # author, publisher, date, price
            rating = text_of(info.find(class_="rating_nums"))  # rating
            summary = text_of(info.find("p"))                  # summary
            data = ("Title: " + name + "\n"
                    + "Author: " + version + "\n"
                    + "Rating: " + rating + "\n"
                    + "Summary: " + summary + "\n")
            f.write(data + "==================" + "\n")


if __name__ == "__main__":
    filename = "豆瓣读书.txt"
    # The with block closes the file even if a request or parse step fails
    with open(filename, 'w', encoding="utf-8") as f:
        html_parser(f)
    print("Saved successfully")
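
The `all_page` function in the script hard-codes the tag as `%E5%B0%8F%E8%AF%B4`, which is the percent-encoded form of 小说 ("fiction"). As a sketch of how the same URL list could be built for any tag (the function name `tag_page_urls` and its parameters are hypothetical, and other tags are assumed to follow the same URL pattern), the standard library's `urllib.parse.quote` can do the encoding:

```python
from urllib.parse import quote


def tag_page_urls(tag="小说", total=900, page_size=20):
    """Build the listing-page URLs for a tag, one URL per page of results."""
    # quote() percent-encodes the UTF-8 bytes of the tag name
    base_url = "https://book.douban.com/tag/" + quote(tag) + "?start="
    return [base_url + str(start) for start in range(0, total, page_size)]


if __name__ == "__main__":
    urls = tag_page_urls()
    print(len(urls))
    print(urls[0])
    print(urls[-1])
```

With the default arguments this reproduces the 45 URLs of the original loop, from start=0 to start=880.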
