A simple image crawler in Python 3

Copyright notice: this article was originally written by monkey. Please credit the source when reposting! https://blog.csdn.net/weixin_44143222/article/details/86614965

The result looks like this:
[screenshots of the downloaded images]
The approach:
1. The user enters the URL of the page whose images should be crawled, via input().
2. Import the re module and use a regular expression to check that the URL is well-formed; otherwise ask for it again.

ret = re.match(r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", website)
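
For a quick sanity check of this pattern (my own test snippet, not part of the original post), you can run it against a couple of sample strings:

import re

pattern = r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"

# A well-formed URL matches, a bare word does not
print(bool(re.match(pattern, "https://www.douyu.com/g_yz")))  # True
print(bool(re.match(pattern, "not a url")))                   # False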

3. Import the urllib.request module, send the request, and read the page content.

req = urllib.request.Request(url=website, headers=headers)
web_content = urllib.request.urlopen(req)
content = web_content.read()
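
One thing to keep in mind (my addition, not in the original code): if the site rejects the request, urlopen raises an exception, so you may want to wrap the call, reusing the website and headers variables from the snippet above:

import urllib.error
import urllib.request

try:
    req = urllib.request.Request(url=website, headers=headers)
    content = urllib.request.urlopen(req).read()
except urllib.error.URLError as e:
    # URLError also covers HTTPError, i.e. both network failures and bad HTTP status codes
    print("Request failed:", e)
    content = b""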

4. Use a regular expression to pull the image URLs out of the HTML. (Not every site uses the same markup; the pattern here is written for Douyu, so change the regex to crawl whatever site you like.) Demo URL: https://www.douyu.com/g_yz

a = re.findall(r'data-original="(.+\.jpg)" src=', content.decode("utf-8"))
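
To make the capture group concrete, here is a toy example with a made-up HTML fragment in the same style (the real Douyu markup may differ):

import re

html = '<img data-original="https://example.com/cover1.jpg" src="loading.gif">'
print(re.findall(r'data-original="(.+\.jpg)" src=', html))
# ['https://example.com/cover1.jpg']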

5. Put the extracted image links into a list; I also use gevent coroutines here to speed up the downloads (an alternative without gevent is sketched after the snippet).

mylist = list()
for x in a:
    print(x)
    mylist.append(gevent.spawn(downloader, "%s.jpg" % num, x))
    num += 1
gevent.joinall(mylist)
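
gevent only speeds this up because monkey.patch_all() (see the full code below) makes urllib's blocking socket calls cooperative. If you would rather not depend on gevent at all, a rough equivalent using the standard library's thread pool could look like this (my sketch, not the author's code):

from concurrent.futures import ThreadPoolExecutor

# Download the images concurrently with plain threads instead of gevent coroutines
with ThreadPoolExecutor(max_workers=8) as pool:
    for num, img_url in enumerate(a):
        pool.submit(downloader, "%s.jpg" % num, img_url)
# Leaving the with-block waits for all submitted downloads to finish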

6. Finally, a while loop writes all the images to your computer, and you are done.

def downloader(img_name, img_url):
    req = urllib.request.urlopen(img_url)
    with open(r"C:\Users\monkey\Desktop\%s" % img_name, "wb") as f:
        while True:
            img_content = req.read(1024)
            if img_content:
                f.write(img_content)
            else:
                break
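
The manual read/write loop above works fine; for reference, the same chunked copy can also be done with shutil.copyfileobj from the standard library (just an equivalent variant, not the original author's code):

import shutil
import urllib.request

def downloader(img_name, img_url):
    # Stream the response straight into the file; copyfileobj copies in chunks internally
    with urllib.request.urlopen(img_url) as resp, \
            open(r"C:\Users\monkey\Desktop\%s" % img_name, "wb") as f:
        shutil.copyfileobj(resp, f)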

The complete implementation:

from gevent import monkey

# Monkey-patch the standard library (sockets in particular) before anything else
# is imported, so urllib's blocking I/O becomes cooperative under gevent.
monkey.patch_all()

import re
import urllib.request

import gevent


def downloader(img_name, img_url):
    req = urllib.request.urlopen(img_url)
    with open(r"C:\Users\monkey\Desktop\%s" % img_name, "wb") as f:
        while True:
            # Read the image in 1 KB chunks until the stream is exhausted
            img_content = req.read(1024)
            if img_content:
                f.write(img_content)
            else:
                break

def main():
    num = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/51.0.2704.63 Safari/537.36'}

    while True:
        website = input("Enter the URL of the site you want to crawl: ")
        ret = re.match(r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?", website)
        if ret:
            print("URL looks valid, please wait while I crawl: %s" % ret.group())

            # Fetch the page with a browser-like User-Agent, then pull every
            # lazily-loaded cover image out of the HTML (Douyu-specific markup)
            req = urllib.request.Request(url=website, headers=headers)
            web_content = urllib.request.urlopen(req)
            content = web_content.read()
            a = re.findall(r'data-original="(.+\.jpg)" src=', content.decode("utf-8"))
            mylist = list()
            for x in a:
                print(x)
                # One coroutine per image; they all download concurrently
                mylist.append(gevent.spawn(downloader, "%s.jpg" % num, x))
                num += 1
            gevent.joinall(mylist)
        else:
            print("That doesn't look like a valid URL, please try again.")

if __name__ == '__main__':
    main()
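
One last note: gevent is a third-party package, so install it first (for example with pip install gevent) before running the script; everything else used here comes from the standard library.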
