Web Scraping for Beginners, Part 5: Scraping Read Counts from Autohome News

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_18505209/article/details/99883942


Practice exercise

Sharing a good site for learning pyecharts:
Python: Data Visualization with pyecharts

Scraping read counts from Autohome news

```python
import requests
from bs4 import BeautifulSoup
from pyecharts import Bar  # pyecharts 0.5.x API

url = "https://www.autohome.com.cn/news/"
response = requests.get(url)
# The page is served in GB2312/GBK encoding, so decode before parsing
soup = BeautifulSoup(response.content.decode('gb2312'), 'lxml')
# The news list sits in this lazy-loaded container
all_news = soup.find('div', id="auto-channel-lazyload-article")
# Collect one record per article
all_news_info = []
for each_news in all_news.find_all('a'):
    news = each_news.find('h3').text
    bandc = each_news.find_all('em')  # [read count, comment count]
    time = each_news.find('span', class_="fn-left").text
    # Counts of ten thousand or more appear as e.g. "1.2万"; convert to integers
    if '万' not in bandc[0].text:
        browse = int(bandc[0].text)
    else:
        browse = int(float(bandc[0].text.replace('万', '')) * 10000)
    comment = bandc[1].text
    # Skip articles older than one day ("天" marks a "days ago" timestamp)
    if '天' not in time:
        all_news_info.append({'name': news, 'browse': browse, 'time': time})
# Sort by read count, ascending
sort_by_browse = sorted(all_news_info, key=lambda x: int(x['browse']))
# The last ten entries are the ten most-read articles
ten_news = sort_by_browse[-10:]
names = [i['name'] for i in ten_news]
browse = [i['browse'] for i in ten_news]
# Horizontal bar chart of the top ten
browse_rank = Bar('Top 10 news read counts in the last 24 hours')
browse_rank.add('Reads', names, browse, is_convert=True, is_label_show=True, label_pos='right')
# Writes the chart to render.html
browse_rank.render()
```
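The "万" (ten-thousand) branch in the loop above is worth isolating into a small helper so it can be tested on its own. This is a minimal stand-alone sketch of that conversion; the helper name `parse_read_count` is my own, not from the original script:

```python
def parse_read_count(text):
    """Convert an Autohome-style count string to an integer.

    "万" means ten thousand, so "1.2万" becomes 12000;
    plain digit strings pass through unchanged.
    """
    if '万' in text:
        return int(float(text.replace('万', '')) * 10000)
    return int(text)

print(parse_read_count('1.2万'))  # -> 12000
print(parse_read_count('530'))    # -> 530
```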
Results:

[Figure 1: bar chart rendered by pyecharts]
[Figure 2: bar chart rendered by pyecharts]
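A side note on the sort-and-slice step in the script: the standard library's `heapq.nlargest` selects the top entries in one call and returns them largest-first. A minimal sketch with made-up records in the same shape the script collects:

```python
import heapq

# Hypothetical records, same shape as all_news_info in the script
news = [
    {'name': 'A', 'browse': 12000},
    {'name': 'B', 'browse': 530},
    {'name': 'C', 'browse': 48000},
    {'name': 'D', 'browse': 7600},
]

top = heapq.nlargest(2, news, key=lambda x: x['browse'])
print([n['name'] for n in top])  # -> ['C', 'A']
```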

