Bulk-downloading Baidu images with Python

 

For a small image-classification project, I needed to build my own dataset. That meant downloading a large number of pictures from the Internet and then processing them uniformly.

 

Saving each picture by hand quickly becomes tedious. So, is there a way to download image search results directly to your local machine?

There is: Python!

Using "Teddy", "Corgi", and "Labrador" as keywords, I downloaded 500 images for each. Next, I plan to write a puppy classifier; suggestions are welcome!

Sample results (the original post's screenshots are omitted here).

Approach:

 

1. Get the image URL links

First, open the Baidu Images home page and note the URL of the index page (see the figure below).

Next, switch to the traditional paged version (the flip interface), because it is much easier to crawl.

Comparing several of these URLs shows that pn is the paging offset parameter. By modifying pn and inspecting the returned data, we find that each page contains only 60 images.

Note: the gsm parameter is just the hexadecimal representation of pn, and it can be removed without affecting the result.
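As a minimal sketch of the URL pattern just described (the helper name `flip_url` and its defaults are my own illustration, not part of the article's code, and Baidu's actual query format may change over time):

```python
from urllib.parse import quote

def flip_url(word, page, page_size=60):
    """Build the URL of one result page of the traditional (flip) interface.

    Illustrative helper: name and defaults are assumptions, not Baidu API facts.
    """
    pn = page * page_size            # pn is the image offset of the requested page
    gsm = format(pn, 'x')            # gsm is simply pn in hexadecimal
    return ('https://image.baidu.com/search/flip'
            '?tn=baiduimage&ie=utf-8&word={}&pn={}&gsm={}'
            .format(quote(word), pn, gsm))

print(flip_url('corgi', 2))   # third page: pn=120, gsm=78
```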

Then right-click to view the page source and search it (Ctrl+F) for objURL.

 

This is where we find the URLs of the pictures we need.

 

 

2. Save the images locally

Now all we have to do is crawl this information out.

Note: the page contains objURL, hoverURL, and other fields, but we use objURL because it points to the original full-size image.

So how do we get objURL? With a regular expression, and it takes only one line of code:

 

results = re.findall('"objURL":"(.*?)",', html) 
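To see that regex in action, here it is run against a fabricated fragment of the page source (the sample URLs are invented for illustration):

```python
import re

# Fabricated fragment of the page source; the URLs are made up for this demo.
html = ('{"thumbURL":"http://t.example.com/small.jpg",'
        '"objURL":"http://n.sinaimg.cn/sports/transform/20170406/demo.jpg",'
        '"hoverURL":"http://t.example.com/hover.jpg",')

# The non-greedy group captures everything between "objURL":" and the closing ",
results = re.findall('"objURL":"(.*?)",', html)
print(results)   # ['http://n.sinaimg.cn/sports/transform/20170406/demo.jpg']
```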

Core code:

1. Code to get the image URLs:

 

# Get the image URL links
def get_parse_page(pn, name):

    for i in range(int(pn)):
        # 1. Fetch the page
        print('Fetching page {}'.format(i+1))

        # URL of the Baidu Images flip interface
        # name is the keyword to search for
        # pn is the number of pages to download
        # each flip page holds 60 images, so the offset steps by 60
        url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%s&pn=%d' % (name, i*60)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4843.400 QQBrowser/9.7.13021.400'}

        # Send the request and get the response
        response = requests.get(url, headers=headers)
        html = response.content.decode()
        # print(html)

        # 2. Parse the page with a regular expression
        # "objURL":"http://n.sinaimg.cn/sports/transform/20170406/dHEk-fycxmks5842687.jpg"
        results = re.findall('"objURL":"(.*?)",', html)  # returns a list

        # Save the images locally using the extracted links
        save_to_txt(results, name, i)

2. Code to save the images locally:

 

# Save the images locally
def save_to_txt(results, name, i):

    j = 0
    # Create a folder in the current directory
    if not os.path.exists('./' + name):
        os.makedirs('./' + name)

    # Download the images
    for result in results:
        print('Saving image {}'.format(j))
        try:
            pic = requests.get(result, timeout=10)
            time.sleep(1)
        except Exception:
            print('Cannot download the current image')
            j += 1
            continue

        # Can be skipped; this commented-out code has a bug
        # file_name = result.split('/')
        # file_name = file_name[len(file_name) - 1]
        # print(file_name)
        #
        # end = re.search('(.png|.jpg|.jpeg|.gif)$', file_name)
        # if end == None:
        #     file_name = file_name + '.jpg'

        # Save the image into the folder
        file_full_name = './' + name + '/' + str(i) + '-' + str(j) + '.jpg'
        with open(file_full_name, 'wb') as f:
            f.write(pic.content)

        j += 1

Core code:
pic = requests.get(result, timeout=10)
f.write(pic.content)

3. Main function code:

 

# Main function
if __name__ == '__main__':

    name = input('Enter the keyword you want to download: ')
    pn = input('Enter the number of pages to download (60 images per page): ')
    get_parse_page(pn, name)

Instructions for use:

# Required modules
import requests
import re
import os
import time

# 1. Run the .py source file
# 2. Enter the keyword you want to search for, e.g. "Corgi" or "Teddy"
# 3. Enter the number of pages to download, e.g. 5, which gives 5 x 60 = 300 images
 


Origin www.cnblogs.com/7758520lzy/p/11988856.html