A little crawler that grabs kitten pictures and saves them to a local folder

I am a graduate student in the School of Electrical and Information Engineering at Anhui University of Technology. With the start of the semester delayed, I have had to teach myself machine learning and computer vision at home, and studying that way can feel dry and pointless. So, bored at home, I crawled some kitten pictures into a local folder to use as material for my image-processing practice. Besides, they are really cute! Okay, let's start our hands-on exercise!

1. Some libraries needed
Five libraries are used here; of course, if you want to keep things simpler, you can drop a few of them.
gevent package: indispensable for multi-coroutine crawling; if you are not using multiple coroutines, you can skip it.
time package: used to time the crawl and to set an interval between requests; otherwise we are not being friendly to the server.
requests package: needed to request the URLs and handle the responses.
BeautifulSoup package (bs4): parses the HTML of the responses.
os package: creates the local folder.

from gevent import monkey
monkey.patch_all()   # patch the standard library so blocking I/O becomes cooperative; must run before the other imports
from gevent.queue import Queue
import requests
from bs4 import BeautifulSoup
import gevent
import time
import os

2. After importing the required libraries, we have to analyze the web page to see where the pictures are hiding, so that we can crawl them. The URL I used here is the ivsky kitten gallery (https://www.ivsky.com/tupian/xiaomao_t3023/).
When you open it, you will find lots of cute kittens that make it hard to leave (haha, off-topic). Each picture's URL is hidden in an img tag's src attribute.
The trick is to locate the smallest parent element that wraps each picture, and the smallest parent I found is

'div',class_='il_img'
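
To see how this selector behaves, here is a minimal, runnable sketch; the HTML fragment is made up, but it mimics the structure of the listing page described above:

from bs4 import BeautifulSoup

# a made-up fragment in the shape of the listing page
html = '<div class="il_img"><a href="/tupian/1.html"><img src="//img.example.com/xiaomao-001.jpg"></a></div>'
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', class_='il_img'):
    src = div.find_all('a')[0].find('img')['src']
    print('https:' + src)  # -> https://img.example.com/xiaomao-001.jpg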

Without further ado, the crawling code is shown below; it is quite simple and clear at a glance.

def pachong():
    while not work.empty():
        url = work.get_nowait()                        # take the next listing page off the queue
        res = requests.get(url, headers=headers)
        jiexi = BeautifulSoup(res.text, 'html.parser') # parse the page
        fuxi = jiexi.find_all('div', class_='il_img')  # the smallest parent divs found above
        for i in fuxi:
            photo = i.find_all('a')[0].find('img')['src']
            transform = str(photo)
            add = 'https:' + transform                 # the src is protocol-relative, so prepend 'https:'
            image.append(add)

3. The next step is to store the pictures in a local folder. First use the os module to create the folder; after that, each picture is downloaded and written into the folder in binary ('wb') mode.

dir_name = 'catimage'
# create the folder in the current directory
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

i = 0
for img in image:
    # give the server a little breathing room, otherwise we might bring it down...
    time.sleep(0.1)
    picture_name = img.split('/')[-1]  # keep only the filename at the end of the image URL -- important!
    response = requests.get(img, headers=headers)

    with open(dir_name + '/' + picture_name, 'wb') as f:
        f.write(response.content)
    i = i + 1
    print('Crawling picture No. ' + str(i))
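
By the way, the split('/')[-1] trick simply keeps whatever comes after the last slash; here is a tiny sketch with a made-up image URL:

# a made-up URL, just to illustrate the filename extraction
img = 'https://img.example.com/img/tupian/xiaomao-001.jpg'
print(img.split('/')[-1])  # -> xiaomao-001.jpg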
                 

4. In addition, I spawn five crawler coroutines for asynchronous crawling, which makes things faster; the relevant lines are shown just below.
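
Taken from the full source in step 6 below, the five-coroutine pattern looks like this:

task_list = []
for z in range(5):
    task = gevent.spawn(pachong)   # each greenlet runs the same crawler, draining the shared queue
    task_list.append(task)
gevent.joinall(task_list)          # block until all five greenlets have finished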
5. Okay, after saying so much, it's time to see what we finally got. I went to the local directory and found the catimage folder I had created; opening it, you can see the crawled pictures. So cute!
6. The complete source code is provided below!

# libraries to import
# I use the multi-coroutine approach
from gevent import monkey
monkey.patch_all()
from gevent.queue import Queue
import requests
from bs4 import BeautifulSoup
import gevent
import time
import os


# use time.time() to record how long the crawl takes
starttime = time.time()
work = Queue()
start_url = 'https://www.ivsky.com/tupian/xiaomao_t3023/index_{page}.html'
# the headers still have to be added, otherwise the server blocks us and nothing gets crawled
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}

# don't be too hard on the server, so crawling 4 pages is enough...
for x in range(1, 5):
    real_url = start_url.format(page = x)
    work.put_nowait(real_url)


image = []

# the main crawling work happens here
def pachong():
    while not work.empty():
        url = work.get_nowait()
        res = requests.get(url, headers = headers)
        jiexi = BeautifulSoup(res.text, 'html.parser')
        fuxi = jiexi.find_all('div', class_='il_img')
        for i in fuxi:
            photo = i.find_all('a')[0].find('img')['src']
            transform = str(photo)
            add = 'https:' + transform
            image.append(add)


# five coroutines share the queue and collect the image URLs
task_list = []
for z in range(5):
    task = gevent.spawn(pachong)
    task_list.append(task)
gevent.joinall(task_list)

# download only after all pages have been parsed, so each picture is saved exactly once
dir_name = 'catimage'
# create the folder in the current directory
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

i = 0
for img in image:
    # give the server a little breathing room, otherwise we might bring it down...
    time.sleep(0.1)
    picture_name = img.split('/')[-1]  # keep only the filename at the end of the image URL -- important!
    response = requests.get(img, headers = headers)

    with open(dir_name + '/' + picture_name, 'wb') as f:
        f.write(response.content)
    i = i + 1
    print('Crawling picture No. ' + str(i))

endtime = time.time()
print('Crawl duration:', endtime - starttime)

Alright! If anything is unclear, feel free to leave me a comment!


Origin blog.csdn.net/Jshauishaui/article/details/105576502