Target URL = https://tieba.baidu.com/p/5316245951
View the page source:
The image links in this thread are all contained in <img class="BDE_Image"> tags, for example:
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=10191d3660600c33f079dec02a4d5134/ee1b9d16fdfaaf5188a45f9d875494eef01f7a49.jpg" size="219669" changedsize="true" width="560" height="320" size="219669">
This leads to the following regular expression:
r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
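Before touching the network, the pattern can be checked offline against the sample tag shown above. The sketch below is only an illustration: the second tag (`other`, with class "avatar" on example.com) is a made-up counterexample, not something taken from the page. It shows that the lazy `.*?` stops at the first `>` after the `src` attribute, and that tags without the BDE_Image class are skipped.

```python
import re

# Sample tag copied from the page source shown above.
sample = ('<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/'
          'sign=10191d3660600c33f079dec02a4d5134/'
          'ee1b9d16fdfaaf5188a45f9d875494eef01f7a49.jpg" size="219669" '
          'changedsize="true" width="560" height="320" size="219669">')

# Hypothetical tag without the BDE_Image class, used only for contrast.
other = '<img class="avatar" src="https://example.com/face.jpg">'

p = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'

# Search both tags joined into one string, as they would appear in a page.
matches = re.findall(p, sample + other)

print(len(matches))          # how many tags matched
print(matches[0] == sample)  # whether the match is exactly the sample tag
```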
Test it with the following code:
import urllib.request
import re

# Fetch the page and decode it as text
response = urllib.request.urlopen("http://tieba.baidu.com/p/3823765471")
html = response.read().decode('utf-8')

# Find every <img class="BDE_Image"> tag whose src ends in .jpg
p = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'
imglist = re.findall(p, html)
for each in imglist:
    print(each)
Output:
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=f9cf09409c25bc312b5d01906ede8de7/8f0ede0735fae6cdafb377ef0ab30f2443a70fda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">
<img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=35c4709bb9315c6043956be7bdb0cbe6/cc223ffae6cd7b894b6be60d0a2442a7d8330eda.jpg" pic_ext="jpeg" changedsize="true" width="560" height="497">
...
To download the images we need each image's exact URL. How do we pull the URL out of the strings above?
The solution is:
p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'
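This simply wraps the image URL in parentheses to form a capture group. When a pattern contains one group, re.findall returns the text captured by that group instead of the whole match.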
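The difference between the two patterns is easy to see on a small hand-written string. The following is a minimal sketch: the HTML fragment and the example.com URLs are invented for illustration and do not come from the thread.

```python
import re

# Hypothetical HTML fragment with two matching tags (not from the real page).
html = ('<img class="BDE_Image" src="https://example.com/a.jpg" width="560">'
        '<img class="BDE_Image" src="https://example.com/b.jpg" width="560">')

whole_tag = r'<img class="BDE_Image".*?src="[^"]*\.jpg".*?>'   # no group
url_only = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'  # one group

tags = re.findall(whole_tag, html)  # each item is a full <img ...> tag
urls = re.findall(url_only, html)   # each item is just the captured URL

print(tags[0])
print(urls)
```

With the group in place, each element of the result can be passed directly to urllib.request.urlopen.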
Finally, tidying up the code gives the complete program:
import urllib.request
import re

def open_url(url):
    # Fetch the page and return its HTML as text
    req = urllib.request.Request(url)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def get_image(html):
    # The capture group returns only the src URL of each BDE_Image tag
    p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'
    imglist = re.findall(p, html)
    num = 1
    for each in imglist:
        # Read the image data
        response = urllib.request.urlopen(each)
        # Raw bytes: must not be decoded as 'utf-8', so open_url() cannot be reused here
        image = response.read()
        with open('%s.jpg' % num, 'wb') as fp:
            fp.write(image)
        print("Downloading image %s" % num)
        num = num + 1
    return

url = "https://tieba.baidu.com/p/5316245951"
get_image(open_url(url))
Result of running the program: