python 爬虫(五)爬取多页内容

import urllib.request
import ssl
import re

def ajaxCrawler(url):
    """Fetch *url* with a browser-like User-Agent and return the body as a UTF-8 string.

    Despite the name, this simply performs a plain HTTP GET; certificate
    verification is disabled via an unverified SSL context.

    :param url: the page URL to download
    :return: the decoded response body (str)
    :raises urllib.error.URLError: on network/HTTP failure
    """
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}
    req = urllib.request.Request(url,headers=headers)

    # Use ssl to create an unverified context (skips certificate validation).
    context = ssl._create_unverified_context()

    # BUG FIX: the original never closed the response, leaking the socket.
    # urlopen's result is a context manager in Python 3 — use `with`.
    with urllib.request.urlopen(req, context=context) as response:
        jsonStr = response.read().decode("utf-8")

    return jsonStr

url = "https://www.qiushibaike.com/text/page/1/" # then loop over page/2/ ... to crawl more pages
#filePath = "qiushi.html"

# Each post sits between these two HTML class markers; re.S makes '.' match newlines.
par1 = r'''article block untagged mb15(.*?)class="stats-comments'''
re_ob = re.compile(par1,re.S)
listStr = re_ob.findall(ajaxCrawler(url))

# PERF FIX: compile both patterns once, outside the loop — the original
# recompiled them on every iteration.
# Keep the patterns loose at first so nothing silently fails to match.
re_Content = re.compile(r'''class="content".*?<span>(.*?)</span>''',re.S)
re_name = re.compile(r'''<h2>(.*?)</h2>''',re.S)

# Maps user name -> post content. Renamed from the misleading `jsonStr`
# (it is a plain dict, not JSON text).
results = {}

for ss in listStr:
    # findall returns a list; take the first hit for each field.
    contents = re_Content.findall(ss)
    names = re_name.findall(ss)
    # BUG FIX: the original indexed [0] unconditionally and raised
    # IndexError on any post where a pattern found nothing — skip instead.
    if not contents or not names:
        continue
    results[names[0]] = contents[0]

for k,v in results.items():
    print(k+":说"+v)

猜你喜欢

转载自blog.csdn.net/weixin_40938748/article/details/85310881