Scraping Web Page Images with Python 3.6

Target URL = https://tieba.baidu.com/p/5316245951

Viewing the page source shows that the image links in this Tieba thread are all contained in <img class="BDE_Image"> tags, for example:

<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=10191d3660600c33f079dec02a4d5134/ee1b9d16fdfaaf5188a45f9d875494eef01f7a49.jpg" size="219669" changedsize="true" width="560" height="320" size="219669">

This leads to the following regular expression:

r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
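Before fetching a live page, the pattern can be checked offline. The following is a minimal sketch that runs it against the sample tag shown earlier; the surrounding <div>, the second tag with class="other", and the example.com address are made up for illustration.

```python
import re

pattern = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'

# The first tag is the sample from the page source; the second is a
# hypothetical tag that should NOT match because its class differs.
sample = (
    '<div><img class="BDE_Image" '
    'src="https://imgsa.baidu.com/forum/w%3D580/sign=10191d3660600c33f079dec02a4d5134/'
    'ee1b9d16fdfaaf5188a45f9d875494eef01f7a49.jpg" '
    'size="219669" changedsize="true" width="560" height="320"></div>'
    '<img class="other" src="https://example.com/skip.jpg">'
)

matches = re.findall(pattern, sample)
print(len(matches))   # number of matching tags
print(matches[0])     # the whole matched <img ...> tag
```

Note that the pattern expects class to appear before src and both to sit on one line, since "." does not match a newline by default.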

Test it with the following code:

import urllib.request
import re

# Fetch the page and decode the raw bytes into text
response = urllib.request.urlopen("http://tieba.baidu.com/p/3823765471")
html = response.read().decode('utf-8')

# Find every <img class="BDE_Image"> tag whose src ends in .jpg
p = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
imglist = re.findall(p, html)
for each in imglist:
    print(each)

Output:

<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=f9cf09409c25bc312b5d01906ede8de7/8f0ede0735fae6cdafb377ef0ab30f2443a70fda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=35c4709bb9315c6043956be7bdb0cbe6/cc223ffae6cd7b894b6be60d0a2442a7d8330eda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">

...
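If the request returns an error or an unexpected page, one possible cause is that the server treats the default urllib User-Agent differently from a browser. Whether Tieba does so is an assumption here, not something the article confirms. The sketch below shows how a request with an explicit header and a timeout could be built; build_request, fetch_html and the header value 'Mozilla/5.0' are hypothetical names and values chosen for illustration.

```python
import urllib.request


def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Fetch a page and decode it as UTF-8, replacing undecodable bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


# Building the request does not touch the network, so it is safe to inspect.
req = build_request('http://tieba.baidu.com/p/3823765471')
print(req.get_header('User-agent'))  # urllib stores header names capitalized
```

Calling fetch_html(url) would then take the place of the urlopen/read/decode lines in the test code.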

To download the images we need the exact address of each one. How can it be extracted from the matched tag strings?

The solution is as follows:

p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'

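This simply wraps the image address in parentheses to form a capture group. When a pattern contains exactly one group, re.findall returns the text captured by that group instead of the whole match.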
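The difference between the two patterns can be seen side by side. This is a small offline sketch; the tag and its example.com address are invented for the demonstration.

```python
import re

# Hypothetical tag in the same shape as the ones on the page
tag = '<img class="BDE_Image" src="https://example.com/pic/1.jpg" width="560" height="497">'

# Without a group, findall returns each whole match
whole = re.findall(r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>', tag)

# With one group, findall returns only what the group captured
grouped = re.findall(r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>', tag)

print(whole)    # list containing the entire tag
print(grouped)  # list containing only the image address
```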

Finally, tidying up the code gives the complete program:

import urllib.request
import re


def open_url(url):
    # Fetch the page and return its HTML decoded as UTF-8 text
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html


def get_image(html):
    # The group captures only the image address from each matching tag
    p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'
    imglist = re.findall(p, html)
    num = 1
    for each in imglist:
        # Read the image data
        response = urllib.request.urlopen(each)
        # Image data is binary: it must not be decoded as UTF-8,
        # so open_url() cannot be reused here
        image = response.read()

        # Save as 1.jpg, 2.jpg, ... in the current directory
        with open('%s.jpg' % num, 'wb') as fp:
            fp.write(image)
            print("Downloading image %s" % num)
            num = num + 1


url = "https://tieba.baidu.com/p/5316245951"
get_image(open_url(url))
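The saving logic can also be tried without network access. The sketch below is a variation on get_image(), not the article's exact code: it takes the fetching step as a parameter so that a stand-in can supply the bytes. save_images, fake_fetch, the sample HTML and the example.com addresses are all hypothetical.

```python
import os
import re
import tempfile

IMG_PATTERN = re.compile(r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>')


def save_images(html, fetch, out_dir):
    """Save every matched image into out_dir as 1.jpg, 2.jpg, ...

    `fetch` is any callable that takes a URL and returns raw bytes.
    Returns the list of file paths that were written.
    """
    paths = []
    for num, img_url in enumerate(IMG_PATTERN.findall(html), start=1):
        data = fetch(img_url)  # raw bytes; image data is never decoded as text
        path = os.path.join(out_dir, '%s.jpg' % num)
        with open(path, 'wb') as fp:  # 'wb' because the content is binary
            fp.write(data)
        paths.append(path)
    return paths


# Offline demonstration with made-up HTML and a stand-in fetcher
sample_html = (
    '<img class="BDE_Image" src="https://example.com/a.jpg" width="560">'
    '<img class="BDE_Image" src="https://example.com/b.jpg" width="560">'
)


def fake_fetch(url):
    # Pretend download: two marker bytes followed by the URL itself
    return b'\xff\xd8' + url.encode('ascii')


demo_dir = tempfile.mkdtemp()
saved = save_images(sample_html, fake_fetch, demo_dir)
print([os.path.basename(p) for p in saved])
```

For a real run, fetch could be something like lambda u: urllib.request.urlopen(u).read(), with html coming from open_url(url).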
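
Running the complete program saves the images to the current directory as 1.jpg, 2.jpg, and so on, printing a progress line for each one.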





Reposted from blog.csdn.net/qq_21905401/article/details/77935209