Scraping Web Page Images with Python 3.6

Category: Other. Posted 2018-09-17 13:22:02

Target URL = https://tieba.baidu.com/p/5316245951

Look at the page source:

You can see that the image links in this Tieba thread are all contained in <img class="BDE_Image"> tags, for example:

<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=10191d3660600c33f079dec02a4d5134/ee1b9d16fdfaaf5188a45f9d875494eef01f7a49.jpg" size="219669" changedsize="true" width="560" height="320" size="219669">

This leads to the following regular expression:

r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
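As a quick offline check, here is a minimal sketch that applies the expression to the sample tag quoted above, with no network access. The second tag, with class `BDE_Smiley` and a `.png` source, is a made-up counterexample and is not taken from the page.

```python
import re

# Sample tag copied from the page source quoted above.
sample = ('<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/'
          'sign=10191d3660600c33f079dec02a4d5134/'
          'ee1b9d16fdfaaf5188a45f9d875494eef01f7a49.jpg" size="219669" '
          'changedsize="true" width="560" height="320" size="219669">')

# Hypothetical tag that should NOT match: different class, not a .jpg.
other = '<img class="BDE_Smiley" src="https://example.com/face.png">'

pattern = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'

match = re.search(pattern, sample)
# The lazy .*? stops at the first ">", so the match covers exactly one tag.
matched_whole_tag = match is not None and match.group(0) == sample
rejected_other = re.search(pattern, other) is None

print(matched_whole_tag)
print(rejected_other)
```

The lazy quantifier `.*?` matters here: a greedy `.*` could run past the end of one tag when several tags sit on the same line.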

Test it with the following code:

import urllib.request
import re

# Fetch the thread page and decode it to text
response = urllib.request.urlopen("http://tieba.baidu.com/p/3823765471")
html = response.read().decode('utf-8')
# Find every complete <img class="BDE_Image" ...> tag whose src ends in .jpg
p = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
imglist = re.findall(p, html)
for each in imglist:
    print(each)

Output:

<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=f9cf09409c25bc312b5d01906ede8de7/8f0ede0735fae6cdafb377ef0ab30f2443a70fda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=35c4709bb9315c6043956be7bdb0cbe6/cc223ffae6cd7b894b6be60d0a2442a7d8330eda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">

...

To download the images we need their exact addresses. How do we extract the image address from the strings above?

The solution is as follows:

p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'

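This simply wraps the image address in parentheses to form a capture group. When a pattern contains exactly one group, re.findall returns the text captured by that group instead of the whole match.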
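The difference between the two patterns can be seen in a small offline sketch. The HTML fragment and the `example.com` addresses below are invented for illustration and are not from the Tieba page.

```python
import re

# Hypothetical fragment with two image tags (illustrative URLs only).
html = (
    '<img class="BDE_Image" src="https://example.com/a.jpg" width="560">'
    '<img class="BDE_Image" src="https://example.com/b.jpg" width="560">'
)

without_group = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
with_group = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'

# No group: findall returns the full text of each match (the whole tag).
tags = re.findall(without_group, html)
# One group: findall returns only what the parentheses captured (the URL).
urls = re.findall(with_group, html)

print(tags[0])
print(urls)
```

With the grouped pattern, each list element can be passed directly to `urllib.request.urlopen`.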

Finally, tidy up the code to get the complete program:

import urllib.request
import re


def open_url(url):
    # Fetch the page and return its HTML decoded as UTF-8 text
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html


def get_image(html):
    p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'
    imglist = re.findall(p, html)
    num = 1
    for each in imglist:
        # Read the image data
        response = urllib.request.urlopen(each)
        # Image data is raw bytes: it must not be decoded as UTF-8,
        # so open_url() cannot be used here
        image = response.read()
        with open('%s.jpg' % num, 'wb') as fp:
            fp.write(image)
        print("Downloading image %s" % num)
        num = num + 1


url = "https://tieba.baidu.com/p/5316245951"
get_image(open_url(url))
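The comment in get_image about not decoding the image data can be illustrated with a short offline sketch. It assumes a typical JPEG opening byte sequence (FF D8 FF E0) as stand-in data. The helper names `can_decode_utf8` and `save_binary` are hypothetical and are not part of the program above.

```python
import os
import tempfile

# Opening bytes of a typical JPEG (JFIF) file, used here as stand-in image data.
jpeg_start = bytes([0xFF, 0xD8, 0xFF, 0xE0])


def can_decode_utf8(data):
    """Return True if the bytes form valid UTF-8 text, False otherwise."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


def save_binary(data, path):
    """Write raw bytes unchanged, as the 'wb' mode does in get_image."""
    with open(path, 'wb') as fp:
        fp.write(data)


# Write the bytes to a temporary file and read them back unchanged.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '1.jpg')
    save_binary(jpeg_start, path)
    with open(path, 'rb') as fp:
        round_trip = fp.read()

print(can_decode_utf8(jpeg_start))
print(round_trip == jpeg_start)
```

The byte 0xFF never appears in valid UTF-8, so calling `.decode('utf-8')` on JPEG data raises `UnicodeDecodeError`. This is why the image bytes are written directly in binary mode.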

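Result of running: the program prints one progress line per image and saves the images in the current directory as 1.jpg, 2.jpg, and so on.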

Reposted from blog.csdn.net/qq_21905401/article/details/77935209
