Python Image Scraping: Download Ten Thousand Images in One Click

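After the previous post on batch-scraping every image from a single URL, you may still feel the script isn't powerful enough, and that doing it by hand would be faster than writing code. Today I'll introduce an advanced version: we batch-download thousands of photos from hundreds of URLs, sorted into folders by topic name, in about two minutes.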

First, a quick recap of the complete code from last time:

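    # -*- coding: utf-8 -*-
    from urllib.request import urlretrieve
    import requests

    url = 'http://blog.sina.com.cn/s/blog_4d79089f0102xojl.html'
    S = requests.Session()
    html = S.get(url)
    c = html.content

    def down_image(url):
        # Name the file after the last 12 characters of the image URL
        name = url[-12:] + '.jpg'
        urlretrieve(url, name)

    rs = str(c)
    e = 1
    while e > 0:
        # Every image URL on the page ends with '&690'
        e = rs.find('&690')
        if e < 0:
            break  # no more images on the page
        s = rs[:e].rfind('real_src')
        ad = rs[s+11:e] + '&690'
        if len(ad) <= 80:
            down_image(ad)
            rs = rs[e+1:]
        else:
            break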

I revised the code above to use the BeautifulSoup library, which makes it more fault-tolerant. The core behavior is unchanged. The new code:

    # -*- coding: utf-8 -*-
    from urllib.request import urlretrieve
    import requests
    import os
    from bs4 import BeautifulSoup

    os.chdir(r'E:\python-code\photo')
    S = requests.Session()

    def down_image(im_url):
        name = im_url[-12:] + '.jpg'
        urlretrieve(im_url, name)

    def get_path(url):
        html = S.get(url)
        c = html.content
        soup = BeautifulSoup(c, 'lxml')
        # Use the page title as the folder name
        title = soup.find("title").string
        if not os.path.isdir(title):
            os.makedirs(title)
        os.chdir(title)
        # The article body holds the images; the real URL is in 'real_src'
        path = soup.find("div", attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
        for i in path:
            x = i.attrs['real_src']
            down_image(x)

    url = 'http://blog.sina.com.cn/s/blog_4d79089f0102xojl.html'
    get_path(url)
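As an aside, the selection step in get_path (find the article container, then read `real_src` from each image inside it) can be tried offline. The sketch below is not the BeautifulSoup code above: it swaps in the standard library's `html.parser` so it runs without extra packages, and the class name `RealSrcCollector`, the sample HTML, and the example.com URLs are all made up for illustration.

```python
from html.parser import HTMLParser


class RealSrcCollector(HTMLParser):
    """Collect the real_src attribute of every <img> inside one <div id=...>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0   # > 0 while the parser is inside the target div
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.depth > 0:
                self.depth += 1          # nested div inside the container
            elif attrs.get('id') == self.container_id:
                self.depth = 1           # entering the container
        elif tag == 'img' and self.depth > 0 and 'real_src' in attrs:
            self.urls.append(attrs['real_src'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1


# Hypothetical page fragment, shaped like the blog's article container
SAMPLE_HTML = '''
<div id="other"><img real_src="http://example.com/img/skip"></div>
<div id="sina_keyword_ad_area2">
  <div><img real_src="http://example.com/img/a001"></div>
  <img src="placeholder.gif" real_src="http://example.com/img/b002">
  <img src="no-real-src.gif">
</div>
'''


def extract_real_src(html_text, container_id='sina_keyword_ad_area2'):
    collector = RealSrcCollector(container_id)
    collector.feed(html_text)
    return collector.urls
```

Calling `extract_real_src(SAMPLE_HTML)` returns only the two URLs inside the container; the image outside it and the image without `real_src` are skipped.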

The revised code also adds one feature: images are saved in a folder named after the article title. The result:

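Good. Now for the big move: using the same blog as last time, we'll scrape the images from all of its posts with this approach (this is for technical exchange only; do not use it for anything illegal):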

Step 1: Get the URLs of all the blog posts

We can reuse the approach we took for collecting image URLs in bulk. The code:

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup
    from multiprocessing.dummy import Pool as ThreadPool

    S = requests.Session()
    pool = ThreadPool(5)  # thread pool
    no = range(1, 11)     # list pages 1 to 10

    def get_url(url):
        html = S.get(url)
        c = html.content
        soup = BeautifulSoup(c, 'lxml')
        # Each post title on a list page sits in <span class="atc_title">
        url = soup.find_all("span", attrs={'class': 'atc_title'})
        for i in url:
            x = i.a.attrs['href']
            print(x)

    url = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html' for j in no]
    data = pool.map(get_url, url)

Result:

This version uses multithreading.
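To make the thread-pool step easier to follow, here is a minimal offline sketch of the same pattern. It differs from the code above in one respect: the worker returns the links it finds instead of printing them, so `pool.map` can collect them in input order. The helper names `list_page_urls`, `fake_get_url`, and `collect_post_urls` are hypothetical, and `fake_get_url` is a stand-in that fabricates two links per list page rather than fetching anything.

```python
from multiprocessing.dummy import Pool as ThreadPool


def list_page_urls(blog_id, pages):
    # Build the article-list page URLs in the same shape the post uses
    return ['http://blog.sina.com.cn/s/articlelist_%s_0_%d.html' % (blog_id, j)
            for j in pages]


def fake_get_url(list_url):
    # Stand-in for the real fetch-and-parse step (assumption: no network here);
    # it pretends every list page links to two posts
    return [list_url + '#post1', list_url + '#post2']


def collect_post_urls(list_urls, fetch=fake_get_url, workers=5):
    # pool.map runs fetch concurrently but returns results in input order
    with ThreadPool(workers) as pool:
        per_page = pool.map(fetch, list_urls)
    # Flatten the list of lists into a single list of post URLs
    return [u for page in per_page for u in page]
```

Returning values rather than printing them is what later lets a second stage (downloading each post's images) consume the collected URLs.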

Step 2: Merge the code

    # -*- coding: utf-8 -*-
    from urllib.request import urlretrieve
    import requests
    import os
    from bs4 import BeautifulSoup

    os.chdir(r'E:\python-code\photo')
    S = requests.Session()
    no = range(1, 11)

    def down_image(im_url):
        name = im_url[-12:] + '.jpg'
        urlretrieve(im_url, name)

    def get_path(url):
        html = S.get(url)
        c = html.content
        soup = BeautifulSoup(c, 'lxml')
        title = soup.find("h2", attrs={'class': 'titName SG_txta'}).string
        # Go back to the base folder before checking, so the check is not
        # made relative to the previous post's folder
        os.chdir(r'E:\python-code\photo')
        if not os.path.isdir(title):
            os.makedirs(title)
        os.chdir(title)
        path = soup.find("div", attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
        for i in path:
            x = i.attrs['real_src']
            down_image(x)

    def get_url(url):
        html = S.get(url)
        c = html.content
        soup = BeautifulSoup(c, 'lxml')
        url = soup.find_all("span", attrs={'class': 'atc_title'})
        for i in url:
            x = i.a.attrs['href']
            get_path(x)

    for j in no:
        url = 'http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html'
        get_url(url)

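Result:

Multithreading caused images to be saved in the wrong location, because os.chdir changes the working directory for the whole process rather than for a single thread. I therefore removed the multithreading used above.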
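A small experiment shows why threads and `os.chdir` do not mix: the working directory belongs to the process, so a change made in one thread is visible in every other thread. The function name `cwd_is_shared` is made up for this sketch, and a temporary directory stands in for the photo folder.

```python
import os
import tempfile
import threading


def cwd_is_shared():
    """Return True if a chdir done in a worker thread is seen by the caller."""
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        # The worker thread changes directory; the main thread never does
        worker = threading.Thread(target=os.chdir, args=(tmp,))
        worker.start()
        worker.join()
        # Compare resolved paths so symlinked temp locations still match
        changed = os.path.realpath(os.getcwd()) == os.path.realpath(tmp)
        os.chdir(original)  # restore before the temporary directory is deleted
    return changed
```

With several download threads each calling `os.chdir(title)`, whichever call ran last decides where every thread's relative file names end up.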

Outcome:

Of the 500 posts, only 39 were scraped before the script raised an error. The error log was:

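My hunch was that a colon in the title was to blame, since Windows does not allow colons in folder names. I changed the code to replace : in title with -, and that fixed the problem: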

title=soup.find("h2",attrs={'class':'titName SG_txta'}).string.replace(':','-')
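The colon is only one of the characters Windows rejects in file and folder names; the others are `\ / * ? " < > |`. A slightly more defensive variant of the same fix is sketched below. The helper name `safe_folder_name` is hypothetical, and replacing with `-` simply follows the choice made above.

```python
import re

# Characters that Windows does not allow in file or folder names
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_folder_name(title, replacement='-'):
    # Swap every forbidden character for the replacement
    cleaned = _FORBIDDEN.sub(replacement, title)
    # Windows also dislikes names ending in a space or a dot
    return cleaned.strip().rstrip(' .')
```

In get_path this would be applied to the string taken from the h2 element, in place of the single `.replace(':','-')` call.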

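Without multithreading the code was too slow, so I brought it back, this time building absolute paths so that images cannot end up in the wrong folder. The final code: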

    # -*- coding: utf-8 -*-
    from urllib.request import urlretrieve
    import requests
    import os
    from bs4 import BeautifulSoup
    from multiprocessing.dummy import Pool as ThreadPool

    p = r'E:\python-code\photo'
    S = requests.Session()
    pool = ThreadPool(10)      # threads for the list pages
    pool_url = ThreadPool(20)  # threads for the individual posts
    no = range(1, 11)

    def down_image(im_url, s):
        # Absolute target path, independent of the current working directory
        name = os.path.join(s, im_url[-12:] + '.jpg')
        urlretrieve(im_url, name)

    def get_path(url):
        html = S.get(url)
        c = html.content
        soup = BeautifulSoup(c, 'lxml')
        title = soup.find("h2", attrs={'class': 'titName SG_txta'}).string.replace(':', '-')
        s = os.path.join(p, title)
        # exist_ok avoids an error if the folder is already there
        os.makedirs(s, exist_ok=True)
        path = soup.find("div", attrs={'id': 'sina_keyword_ad_area2'}).find_all('img')
        for i in path:
            down_image(i.attrs['real_src'], s)

    def get_url(url):
        html = S.get(url)
        c = html.content
        soup = BeautifulSoup(c, 'lxml')
        url = soup.find_all("span", attrs={'class': 'atc_title'})
        url_l = []
        for i in url:
            x = i.a.attrs['href']
            url_l.append(x)
        data = pool_url.map(get_path, url_l)

    url = ['http://blog.sina.com.cn/s/articlelist_1299777695_0_' + str(j) + '.html' for j in no]
    data = pool.map(get_url, url)

The final run took about 120 s and scraped 494 posts and 7,480 images.
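The absolute-path idea behind the final code can be looked at in isolation. The sketch below only computes where an image would be saved and creates the folder; it downloads nothing. The function name `image_target` and the example values are assumptions for illustration, while the `[-12:]` slice and the colon replacement mirror the code above.

```python
import os


def image_target(base_dir, title, im_url):
    """Return the absolute file path for one image and make sure its folder exists."""
    # Same clean-up as in the post: colons are not valid in Windows folder names
    folder = os.path.join(os.path.abspath(base_dir), title.replace(':', '-'))
    # exist_ok=True makes repeated calls for the same post harmless
    os.makedirs(folder, exist_ok=True)
    # File name: last 12 characters of the image URL plus '.jpg'
    return os.path.join(folder, im_url[-12:] + '.jpg')
```

Because the returned path never depends on the current working directory, any number of threads can call this at once without stepping on each other's save locations.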
