Downloading Images with a Python Crawler


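This post gives a brief implementation of a Python crawler that downloads the images on a Baidu Tieba page. The page below is the one this post crawls; what you see is of course only part of its images.

The figure below shows the images that were finally downloaded: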



Here is a brief walk through the whole crawling process:

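First you need a good programming tool. I use PyCharm, which I find fairly easy to use; it can be downloaded from the official website (PyCharm download), so pick the version that suits your machine. I won't go into setting up the environment; there are plenty of guides online to refer to.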

Next comes the crawler code itself. Without further ado, here it is:

 
 
import urllib.request
import re
import os

# These are the three modules the script needs.
def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'mozilla/5.0 (windows nt 6.3; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/65.0.3325.181 safari/537.36')
    # The User-Agent header carries identifying information (browser type, operating system, etc.)
    # that the crawler sends to imitate a browser visit.
    page = urllib.request.urlopen(req)
    html = page.read().decode('utf-8')
    return html

def get_img(html):
    # Regular expression matching the src link of each post image (class "BDE_Image") on the page.
    p = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'
    imglist = re.findall(p, html)
    try:
        os.mkdir("NewPics")
    except FileExistsError:
        # The folder already exists, so reuse it.
        pass
    # Switch into the folder whether or not it was just created.
    os.chdir("NewPics")
    for each in imglist:
        filename = each.split("/")[-1]
        urllib.request.urlretrieve(each,filename,None)
        print(each)

if __name__ == '__main__':
    url = "https://tieba.baidu.com/p/3823765471"
    get_img(open_url(url))
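
To see what the regular expression in get_img captures, here is a minimal sketch that runs the same pattern against a hand-written HTML fragment. The fragment and its example.com URLs are made up for illustration; the real Tieba markup may differ.

```python
import re

# Same pattern as in get_img: capture the src of every
# <img class="BDE_Image"> tag whose link ends in .jpg.
PATTERN = r'<img class="BDE_Image".*?src="([^"]*\.jpg)".*?>'

# Hypothetical page fragment, written by hand for this example.
sample_html = (
    '<img class="BDE_Image" width="560" src="https://example.com/forum/a1.jpg" size="100">'
    '<img class="other" src="https://example.com/forum/skip.jpg">'
    '<img class="BDE_Image" src="https://example.com/forum/b2.jpg">'
)

# findall returns only the captured group, i.e. the image links.
urls = re.findall(PATTERN, sample_html)

# The file name is the last path segment of each link.
filenames = [u.split("/")[-1] for u in urls]

print(urls)
print(filenames)
```

The image whose class is not BDE_Image is skipped, which is how the script limits itself to pictures posted in the thread.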

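The three modules imported above (urllib.request, re, os) ship with the Python standard library, so they need no separate installation. Third-party packages, should you need any, can be installed from the command line, and PyCharm also offers a way to install them; see the figure below.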

In PyCharm, go to Settings > Project python_frist > Project Interpreter, click the plus sign, search for the package you need, and install it.
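
Before installing anything, it can help to check whether a module is already importable. The sketch below uses importlib.util.find_spec from the standard library; the helper name is_importable is made up for this example. The three modules used by the crawler belong to the standard library, so on a normal Python 3 install each should be found.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the module in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# The modules imported by the crawler script.
for name in ("urllib.request", "re", "os"):
    print(name, is_importable(name))
```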


Run the corresponding .py file:

C:\Users\dell\PycharmProjects\python_frist\venv\Scripts\python.exe C:/Users/dell/PycharmProjects/python_frist/imgget.py
https://imgsa.baidu.com/forum/w%3D580/sign=f9cf09409c25bc312b5d01906ede8de7/8f0ede0735fae6cdafb377ef0ab30f2443a70fda.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=35c4709bb9315c6043956be7bdb0cbe6/cc223ffae6cd7b894b6be60d0a2442a7d8330eda.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=4f1f558f596034a829e2b889fb1249d9/2ddfeccd7b899e51d989e69a47a7d933c9950dda.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=6b0bb5de31a85edffa8cfe2b795509d8/fee871899e510fb3d81eab19dc33c895d0430cda.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=845add165bdf8db1bc2e7c6c3922dddb/63ac94510fb30f249a9d308dcd95d143ac4b03da.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=ed92b76188b1cb133e693c1bed5556da/867405b30f2442a70009212bd443ad4bd01302da.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=84e5640fce1349547e1ee86c664f92dd/1796052442a7d93312af38fda84bd11372f001da.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=568b22ad4c540923aa696376a259d1dc/170148a7d933c8950a7944f5d41373f0830200da.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=1729a0ea0c23dd542173a760e108b3df/5a82d333c895d143717138ad76f082025baf07da.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=946ee09dd854564ee565e43183df9cde/c116c295d143ad4b0c299a4e87025aafa50f06da.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=ca99eeb62d381f309e198da199004c67/de791a385343fbf2e6a77aadb57eca8064388f5e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=bbdf24af778b4710ce2ffdc4f3cfc3b2/2e5fd0b44aed2e738aaa5aa28201a18b86d6faa6.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=f93d44f5d41373f0f53f6f97940e4b8b/25485ffbb2fb4316455ca6f425a4462308f7d3a6.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=b04a7eb4354e251fe2f7e4f09787c9c2/af6049a98226cffcef69633cbc014a90f703eaa7.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=13c9233e60380cd7e61ea2e59145ad14/2bdf888ba61ea8d35ea8a88a920a304e241f585f.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=0bef1fbd768da9774e2f86238050f872/492ad3f9d72a605964523d942d34349b023bbaa7.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=f0f46e68227f9e2f70351d002f31e962/22a2e350352ac65c4147bafdfef2b21192138aa7.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=9d7c7831a5ec08fa260013af69ef3d4d/b48a24dda3cc7cd9e76fe8573c01213fb90e91a7.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=5d95fc7033d3d539c13d0fcb0a86e927/51710323dd54564e062738b7b6de9c82d0584fa7.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=7f172576a286c91708035231f93f70c6/06328082b9014a901cde8733ac773912b21bee06.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=d814e3fba8345982c58ae59a3cf5310b/59119d0a304e251fda039377a286c9177e3e5369.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=de38dd9002082838680ddc1c8898a964/2d0fcc5c1038534302b5f1ae9613b07ecb8088e8.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=08ad11d9d5160924dc25a213e406359b/12f468d9f2d3572caf8fea538f13632763d0c3f1.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=779b85e3153853438ccf8729a312b01f/c97dc6bf6c81800a5b69ff81b43533fa838b4751.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=79e70e18acec8a13141a57e8c7019157/c88706f431adcbef58c254f1a9af2edda2cc9f73.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=2bc464330fd162d985ee621421dda950/8ff8ab44ad3459829b7bcf6509f431adcaef847c.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=6107261cb3fd5266a72b3c1c9b1a9799/1a2d71f40ad162d9121f48eb14dfa9ec8b13cd7d.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=70a3c20354da81cb4ee683c56264d0a4/762317950a7b0208a639151667d9f2d3562cc87e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=b235e42aae64034f0fcdc20e9fc17980/5e0603f7905298228bc01334d2ca7bcb0b46d47f.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=b60024009a3df8dca63d8f99fd1372bf/2137b91bb051f81940d729bddfb44aed2f73e778.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=b4e8a9c73ff33a879e6d0012f65e1018/4e6b9858d109b3defd53ce9fc9bf6c81810a4c79.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=27cf5f22e7c4b7453494b71efffd1e78/1263f81fbe096b63dbbe875b00338744ebf8ac3f.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=8d6a14228a18367aad897fd51e738b68/30113e9b033b5bb5f4d1f9f53ad3d539b600bc58.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=9d8007ffbb4543a9f51bfac42e168a7b/ea3ab4096b63f62445f9e0088b44ebf81a4ca33e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=3f4109e6eb1190ef01fb92d7fe1a9df7/9fb1aec27d1ed21bb480f7eea16eddc451da3f11.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=3a401ee85b2c11dfded1bf2b53266255/41a7b8b7d0a20cf4693427d47a094b36acaf993e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=7f6a3bf177cb0a4685228b315b62f63e/ff349aef76c6a7efbd82712df1faaf51f3de663e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=34ae5ac65e66d0167e199e20a72ad498/63d100d162d9f2d35a59fde4a5ec8a136327cc12.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=187622209116fdfad86cc6e6848e8cea/03f0a76eddc451da3e52d5e0bafd5266d016323e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=c6415005bd3533faf5b6932698d2fdca/1b6a72f0f736afc3a6a8e66ebf19ebc4b745123e.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=c67c8758fa246b607b0eb27cdbf91a35/8533f7faaf51f3deccfa606f98eef01f3a297912.jpg


Process finished with exit code 0

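The output above lists the image links that were found; by this point the images behind those links have also been downloaded into the corresponding folder.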
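
A variation on the download loop, as a minimal sketch: it writes into the folder with os.path.join instead of changing the working directory, and it skips an image that fails to download rather than aborting the whole run. The function name save_images and its fetch parameter are assumptions for this example, not part of the original script.

```python
import os
import urllib.request


def save_images(urls, folder="NewPics", fetch=urllib.request.urlretrieve):
    """Download each URL into folder and return the paths that were saved."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in urls:
        # Name the file after the last path segment of the link.
        path = os.path.join(folder, url.split("/")[-1])
        try:
            fetch(url, path)
        except OSError as err:
            # urllib's URLError and HTTPError are subclasses of OSError.
            print("skipped", url, err)
            continue
        saved.append(path)
    return saved
```

With the original script this could be called as save_images(re.findall(p, html)) in place of the for loop in get_img.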



Finally, a word on how to obtain the User-Agent used in the code above.

Open the page you want to crawl and use the browser's developer tools (F12) to look up the User-Agent.


There is also another way: write an HTML file like this one and open it in the browser.

<html>
<body>
<SCRIPT> document.write( navigator.userAgent.toLowerCase() );  </SCRIPT>
</body>
</html>

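That gives you the value. Without it, the site may treat your crawler as non-human traffic and block it. I am just getting started, so this post stays fairly shallow; I have simply been interested in Python crawlers lately and wanted to learn a bit about them.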
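
To make the role of the User-Agent concrete, the sketch below prints the header urllib would send by default and then builds a request carrying the browser-style value from the script. Nothing is sent over the network here; the request object is only constructed and inspected.

```python
import urllib.request

# Headers urllib attaches by default when none are set;
# the value identifies the client as Python-urllib.
default_headers = urllib.request.build_opener().addheaders
print(default_headers)

# Build (but do not send) a request with a browser-style User-Agent.
req = urllib.request.Request("https://tieba.baidu.com/p/3823765471")
req.add_header(
    "User-Agent",
    "mozilla/5.0 (windows nt 6.3; win64; x64) applewebkit/537.36 "
    "(khtml, like gecko) chrome/65.0.3325.181 safari/537.36",
)

# Request stores header names capitalized as "User-agent",
# so that is the spelling to use when reading the value back.
print(req.get_header("User-agent"))
```

The default value names Python rather than a browser, which is one easy signal a site can use to tell a script from a person.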







Reposted from blog.csdn.net/qq_38047600/article/details/80670439