Crawler: gushiwen.org (regular expressions)

Category: Other. Posted 2018-08-06 12:06:41

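The content my program receives differs from what the browser shows for the same page, although both are classical poems. I don't know whether the request headers are the cause, that is, whether the site recommends different poems depending on them.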
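One thing worth ruling out when testing this hypothesis is whether the User-Agent reaches the server at all. In requests, the second positional parameter of `requests.get` is `params`, so a header has to be passed as a dict through the `headers=` keyword. The sketch below builds both variants offline, without sending anything, and prints what each would carry. The names `UA`, `URL`, `as_params` and `as_headers` are illustrative only.

```python
import requests

UA = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')
URL = 'https://www.gushiwen.org/default_1.aspx'

# Same argument placement as requests.get(URL, UA):
# the string is treated as `params` and ends up in the query string.
as_params = requests.Request('GET', URL, params=UA).prepare()

# Same argument placement as requests.get(URL, headers={'User-Agent': UA}).
as_headers = requests.Request('GET', URL, headers={'User-Agent': UA}).prepare()

print(as_params.url)                      # URL with the UA string appended as a query
print('User-Agent' in as_params.headers)  # whether the caller set a User-Agent
print(as_headers.url)                     # URL left without a query string
print(as_headers.headers['User-Agent'])   # the browser-like User-Agent
```

A real `requests.get` call goes through a Session, which fills in default headers, including a `python-requests/<version>` User-Agent when the caller sets none. Whether a different User-Agent actually changes which poems the site returns cannot be shown by this sketch; that still has to be checked against the live site.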

import re

import requests

# requests expects headers as a dict, not a bare string.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}


def get_html(url):
    """Fetch a page and return its text, or None if the request fails."""
    try:
        # Pass headers by keyword: the second positional parameter of
        # requests.get is params (the query string), not headers.
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = response.apparent_encoding
        return response.text
    except requests.RequestException as exc:
        print('get_html({0}) failed: {1}'.format(url, exc))
        return None


def parse_html(html):
    """Extract title, dynasty, author and text for every poem on a page."""
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html, re.DOTALL)
    dynasties = re.findall(r'<p class="source"><a.*?>(.*?)</a>', html, re.DOTALL)
    authors = re.findall(r'<p class="source"><a.*?><a.*?>(.*?)</a>', html, re.DOTALL)
    content_tags = re.findall(r'<div class="contson".*?>(.*?)</div>', html, re.DOTALL)
    # Strip the remaining tags (for example <br />) from each poem body.
    contents = [re.sub(r'<.*?>', '', content).strip() for content in content_tags]
    poems = []
    # zip stops at the shortest of the four lists.
    for title, dynasty, author, content in zip(titles, dynasties, authors, contents):
        poem = {
            'title': title,
            'dynasties': dynasty,
            'authors': author,
            'content': content
        }
        print(poem)
        poems.append(poem)
    return poems


def main():
    page_num = 10
    for i in range(1, page_num + 1):
        url = 'https://www.gushiwen.org/default_{0}.aspx'.format(i)
        html = get_html(url)
        if html is None:
            continue  # skip pages that could not be fetched
        parse_html(html)


if __name__ == '__main__':
    main()
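To see what each pattern captures without touching the network, the same regular expressions can be run against a small hand-written fragment. This is only a sketch: `SAMPLE_HTML` is hypothetical markup shaped to fit the patterns in the listing (the live site's HTML may differ), and `extract_poems` is an illustrative helper name, not part of the original program.

```python
import re

# Hypothetical markup shaped to fit the listing's patterns.
SAMPLE_HTML = '''
<div class="cont">
<p><a href="/poem_1.aspx"><b>静夜思</b></a></p>
<p class="source"><a href="/dynasty_1.aspx">唐代</a><span>：</span><a href="/author_1.aspx">李白</a></p>
<div class="contson" id="contson1">
床前明月光，疑是地上霜。<br />举头望明月，低头思故乡。
</div>
</div>
<div class="cont">
<p><a href="/poem_2.aspx"><b>春晓</b></a></p>
<p class="source"><a href="/dynasty_1.aspx">唐代</a><span>：</span><a href="/author_2.aspx">孟浩然</a></p>
<div class="contson" id="contson2">
春眠不觉晓，处处闻啼鸟。<br />夜来风雨声，花落知多少。
</div>
</div>
'''


def extract_poems(html):
    """Apply the listing's patterns and pair the results up with zip."""
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html, re.DOTALL)
    dynasties = re.findall(r'<p class="source"><a.*?>(.*?)</a>', html, re.DOTALL)
    # The first `<a.*?>` lazily expands until a '>' that is directly followed
    # by the second '<a', so it skips over the dynasty link and the <span>.
    authors = re.findall(r'<p class="source"><a.*?><a.*?>(.*?)</a>', html, re.DOTALL)
    bodies = re.findall(r'<div class="contson".*?>(.*?)</div>', html, re.DOTALL)
    # Remove leftover tags such as <br /> and trim surrounding whitespace.
    contents = [re.sub(r'<.*?>', '', body).strip() for body in bodies]
    return [
        {'title': t, 'dynasties': d, 'authors': a, 'content': c}
        for t, d, a, c in zip(titles, dynasties, authors, contents)
    ]


poems = extract_poems(SAMPLE_HTML)
for poem in poems:
    print(poem)
```

Because the four lists are matched independently and then combined with `zip`, a pattern that misses one poem would silently misalign or drop entries. Comparing the four list lengths before zipping is a cheap sanity check.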

Output: the program prints one dict per poem, with the keys title, dynasties, authors and content.

Reposted from www.cnblogs.com/MC-Curry/p/9429442.html
