python爬虫 — 爬取豆瓣最受关注图书榜

一个简单的爬取豆瓣最受关注图书榜的小爬虫,在爬取相关信息后,将结果保存在 mongo 中

整个流程分为以下几步:

(1)构造url

(2)分析网页

(3)编写程序,提取信息

解析,将分别介绍以上几步

一 构造url

首先打开网页,可以看到下面的图片


从图片中,可以看到其分为虚构类作品榜和非虚构类作品榜两个榜单,分别点击这两个榜单,可以看到其下面的变化

非虚构类作品榜:


虚构类作品榜:


因此,根据这两个变化,可以构造相应的榜单的 url。

接下来分析,源代码页面

二 分析网页

分析源代码页面,可以看到,每一本书的信息,都是包含在

 <li class="media clearfix">
标签中的,如图


因此,便可以提取对应的信息了

三 编写程序,提取信息
经过上面的分析,便可以编写程序,提取信息了。源码如下:
import requests
from pyquery import PyQuery as pq 
import pymongo



client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['douban']
collection = db['douban']

def html_get(url_start, key_list):

    for keyword in key_list:       #构造url
        url = url_start + keyword

        headers={'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',\
                'Accept-Encoding':'gzip, deflate, sdch, br',\
                'Accept-Language':'zh-CN,zh;q=0.8',\
                'Cache-Control':'max-age=0',\
                'Connection':'keep-alive',\
                'Cookie':'ll="129059"; bid=LGKvV4eiPRY; gr_user_id=5e12d333-3fef-4dec-9a13-492827d22a9f; __yadk_uid=mdRyurjdvN9YpaTediph1fGvXm5kFwDv; _pk_ref.100001.3ac3=%5B%22%22%2C%22%22%2C1530035830%2C%22http%3A%2F%2Fwww.so.com%2Flink%3Fm%3DatcFavfrIqqbjFSuBtImGs6LQuG1bzV%252BwXIUv9RvMyu76xYAgQ1zKuR6VXzhExd6NGqKRNCgBJUSrQwjAFlBd9UixoDzpOIwO%22%5D; Hm_lvt_16a14f3002af32bf3a75dfe352478639=1530035976; Hm_lpvt_16a14f3002af32bf3a75dfe352478639=1530035976; viewed="30155720_26830570_30230525"; gr_session_id_22c937bbd8ebd703f2d8e9445f7dfd03=62136c96-12c5-410f-975e-954ef37fe69b_true; _vwo_uuid_v2=DB79F3914C55D5BDCBACD8F77796784E0|a988dff47123f572c5efd29563436c85; _pk_id.100001.3ac3=bca92b247691e53a.1529798027.2.1530037201.1529798070.; _pk_ses.100001.3ac3=*; __utma=30149280.317649996.1529798022.1529798022.1530035826.2; __utmb=30149280.29.10.1530035826; __utmc=30149280; __utmz=30149280.1530035826.2.2.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utma=81379588.1306981855.1529798027.1529798027.1530035830.2; __utmb=81379588.29.10.1530035830; __utmc=81379588; __utmz=81379588.1530035830.2.2.utmcsr=so.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; ap=1',\
                'Host':'book.douban.com',\
                'Referer':'https://book.douban.com/chart?subcat=F',\
                'Upgrade-Insecure-Requests':'1',\
                'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',}
        try:                                                  
            r = requests.get(url, headers=headers)  #请求网页
            html = r.text
        except:
            print('error')
        else:
            html_parse(html)         #调用提取信息的函数

def html_parse(html):
    item = {}
    doc = pq(html)
    for media in doc.find('.media').items():   #提取信息
        item = {
            'book_name': media.find('h2').text(),
            'abstract': media.find('.subject-abstract').text(),
            'score': media.find('.font-small').text(),
            'score_people': media.find('.ml8').text().replace('(','').replace(')',''),
        }
        save_to_mongo(item)         #调用存储函数

def save_to_mongo(item):
    collection.insert(item)         #存储信息

def main():
    url_start = 'https://book.douban.com/chart?subcat='
    kwy_list = ['I','F']           #关键词
    html_get(url_start, key_list)
    client.close()

if __name__ == '__main__':
    main()

猜你喜欢

转载自blog.csdn.net/qq_34246164/article/details/80823190