Python -- crawling images (Shutterstock)

This crawler program consists of three modules:

1. Crawler scheduler: starts the crawler, stops it, and monitors its operation.

2. Crawler module: contains three sub-modules: the URL manager, the web page downloader, and the web page parser.

(1) URL manager: manages both the URLs waiting to be crawled and the URLs already crawled; a URL to be crawled is taken out of the manager and passed to the downloader.

(2) Web page downloader: downloads the page specified by the URL, stores it as a string, and passes it to the parser.

(3) Web page parser: parses the page string; it extracts the required data and can also extract URLs of further pages, which are added back into the URL manager.

3. Data output module: stores the crawled images. (A minimal sketch of how these modules fit together is shown right after this list.)
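The original post gives no code for this three-module structure, so here is a minimal sketch under the assumption that each module is a small class; the class and method names are illustrative, not from the post:

import requests

class UrlManager:
    """Tracks URLs still to crawl and URLs already crawled."""
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()
    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)
    def has_new(self):
        return bool(self.new_urls)
    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

class Downloader:
    def download(self, url):
        # Download the page specified by the URL and return it as a string
        return requests.get(url, timeout=10).text

class Parser:
    def parse(self, html):
        # Return (extracted data, follow-up URLs); details depend on the site
        return [], []

def schedule(seed_url, limit=10):
    """Scheduler: start the crawl, drive the loop, stop at the limit."""
    urls, downloader, parser = UrlManager(), Downloader(), Parser()
    urls.add(seed_url)
    collected = []
    while urls.has_new() and len(collected) < limit:
        page = downloader.download(urls.get())
        data, new_urls = parser.parse(page)
        collected.extend(data)  # the data output module would store these
        for u in new_urls:
            urls.add(u)
    return collected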

The concrete idea here is simpler: use a regular expression to find the image URLs, then download them.
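For instance, a single non-greedy pattern can pull every thumbnail URL out of the raw HTML (the sample string below is made up for illustration):

import re

html = '..."thumbnail":"https://image.shutterstock.com/image-photo/example-260nw-1.jpg",...'
urls = re.findall(r'"thumbnail":"(.*?)",', html)
print(urls)  # ['https://image.shutterstock.com/image-photo/example-260nw-1.jpg']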

Development environment

 

IDE: Sublime Text 3

Python version: Python 3.7

 

Objective analysis

Goal: starting from https://www.shutterstock.com/zh/search/, crawl the first 10 pictures in each of several categories.

(1) Initial URL: "https://www.shutterstock.com/zh/search/"

 

(2) The URL format of a category entry page:

https://www.shutterstock.com/zh/search?searchterm=Architecture&image_type=photo

 

(3) Use backgrounds, Architecture, business, kids, food, portrait, flowers, travel, and other category names as search terms.

The code is as follows:

import requests
import re
import urllib.request
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

url_base = 'https://www.shutterstock.com/zh/search/'

for i in ['backgrounds', 'Architecture', 'business', 'kids', 'food', 'portrait', 'flowers', 'travel', 'skyline']:
    url = url_base + i + '?image_type=photo'

    res = requests.get(url, headers=headers).text

    # Thumbnail URLs are embedded in the page as "thumbnail":"...",
    cop = re.compile('"thumbnail":"(.*?)",', re.S)
    result = re.findall(cop, res)[:10]  # keep only the first 10 per category
    for each in result:
        filename = each.split('/')[-1]

        response = urllib.request.urlopen(each)
        img = response.read()
        with open(filename, 'wb') as f:
            f.write(img)
        print("Downloaded:", each)

        time.sleep(5)  # sleep five seconds between downloads

print("Download finished")

Result: (the result screenshots are omitted here)

Problems encountered


(1) After crawling a few pictures, no new pictures arrive. The reason: the site is hosted abroad, and access is restricted when too many requests are made in a short time.
Workaround: call the sleep function to pause a few seconds between requests before continuing.
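Beyond a fixed sleep, one possible refinement (not in the original post) is to retry with an exponential backoff when a request is rejected; a minimal sketch, assuming the same requests library and headers as above:

import time
import requests

def fetch_with_backoff(url, headers, max_retries=5):
    """Fetch a URL, doubling the wait each time the server pushes back."""
    delay = 5  # initial wait in seconds
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            return resp.text
        # Status codes like 429 or 403 usually mean we are rate-limited
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("Gave up on " + url + " after " + str(max_retries) + " attempts")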


(2) Do not let the browser translate the page into English, or the source labels will not match!
Solution: check whether the browser has automatically translated the page, fetch the raw page yourself, and then look for the label you want. If the label does not exist, there are two possibilities: the content is loaded dynamically, or the page was translated.
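A quick way to check is to fetch the raw HTML with requests (which bypasses any browser translation) and test whether the label appears; a small sketch, with the category URL chosen as an example:

import requests

url = 'https://www.shutterstock.com/zh/search/flowers?image_type=photo'
headers = {'User-Agent': 'Mozilla/5.0'}

html = requests.get(url, headers=headers).text
if '"thumbnail":"' in html:
    print("Label found in the raw page source.")
else:
    # Either the data is injected by JavaScript at runtime,
    # or the page we received differs from what the browser shows.
    print("Label missing: content may be dynamic or translated.")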
