[Day 5] Project Practice: Scraping Popular CSDN Articles

Other 2019-11-20 15:34:07

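import os
import urllib.parse as up
import urllib.request as ur
import lxml.etree as le
import user_agent  # helper module providing get_user_agent_pc(); not shown in this post

keyword = input('Enter a keyword: ')
pn_start = int(input('Start page: '))
pn_end = int(input('End page: '))

def getRequest(url):
    # Build a request that carries a desktop browser User-Agent header.
    return ur.Request(
        url=url,
        headers={
            'User-Agent':user_agent.get_user_agent_pc(),
        }
    )

def getProxyOpener():
    # Fetch a fresh proxy address from the proxy API on every call.
    proxy_address = ur.urlopen('http://api.ip.data5u.com/dynamic/get.html?order=d314e5e5e19b0dfd19762f98308114ba&sep=4').read().decode('utf-8').strip()
    proxy_handler = ur.ProxyHandler(
        {
            # urllib only proxies the schemes listed here. The pages requested
            # below are https, so map both schemes (this assumes the proxy
            # supports HTTPS tunneling).
            'http':proxy_address,
            'https':proxy_address,
        }
    )
    return ur.build_opener(proxy_handler)

# Make sure the output directory exists before writing into it.
os.makedirs('blog', exist_ok=True)

for pn in range(pn_start,pn_end+1):
    request = getRequest(
        # Percent-encode the keyword so non-ASCII search terms form a valid URL.
        'https://so.csdn.net/so/search/s.do?p=%s&q=%s&t=blog&domain=&o=&s=&u=&l=&f=&rbg=0' % (pn,up.quote(keyword))
    )
    try:
        response = getProxyOpener().open(request).read()
        # Collect the article links from the search result page.
        href_s = le.HTML(response).xpath('//span[@class="down fr"]/../span[@class="link"]/a/@href')
        for href in href_s:
            try:
                response_blog = getProxyOpener().open(
                    getRequest(href)
                ).read()
                title = le.HTML(response_blog).xpath('//h1[@class="title-article"]/text()')[0]
                print(title)
                # Save the raw HTML of the article under its title.
                with open('blog/%s.html' % title,'wb') as f:
                    f.write(response_blog)
            except Exception as e:
                print(e)
    except Exception as e:
        # Report a failed search page instead of silently skipping it.
        print(e)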
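
The script writes each article to a file named after its title, and a title may contain characters that are not valid in file names (for example `/`, `:` or `?`), in which case `open` fails and the article is skipped. A minimal sketch of one way to handle this is shown below. The helper `safe_filename` and its `replacement` and `max_length` parameters are hypothetical names introduced here for illustration; they are not part of the original script.

```python
import re

def safe_filename(title, replacement='_', max_length=100):
    """Return a version of `title` that can be used as a file name."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    cleaned = re.sub(r'\s+', ' ', title).strip()
    # Replace characters that common file systems reject, plus control characters.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, cleaned)
    # Leading or trailing dots and spaces cause trouble on Windows.
    cleaned = cleaned.strip(' .')
    # Truncate long titles and fall back to a fixed name if nothing is left.
    return cleaned[:max_length] or 'untitled'
```

Under this assumption, the write step could use `'blog/%s.html' % safe_filename(title)` instead of the raw title. Two different titles can still map to the same file name, so a real crawler might also append a counter or an article ID.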

Reposted from www.cnblogs.com/zsczsc/p/11897987.html
