Python Crawler Project in Practice: Batch-Downloading a Site's Images with a Crawler

Foreword

The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial use. Copyright belongs to the original author; if you have any questions, please contact us to have them handled.


1. Get the image url links

First, open the Baidu Images home page and note the url shown in the figure below (this is the index version of the page).

Next, switch the page to the traditional paged version (flip), because this makes the images easier to crawl!

Comparing several of the urls, we find that pn is the parameter that controls how many results are requested. By modifying pn and inspecting the returned data, we find that each page contains only 60 images.

Note: the gsm parameter is just the hexadecimal representation of pn and can safely be removed.
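To make the paging pattern concrete, here is a minimal sketch that prints the request url for the first few pages. The keyword is a placeholder, and treating pn as the offset of the first result on each page (advancing by 60 per page, per the observation above) is an assumption for illustration:

# Sketch: how the pn parameter pages through results (keyword is a placeholder)
base = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%s&pn=%d'
for page in range(3):
    # assumption: pn is the offset of the first result on the page
    print(base % ('corgi', page * 60))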

Then, right-click to view the page source and search it directly (Ctrl + F) for objURL.

In this way, we can find the urls of the images we need.

2. Save the images locally

Now, all we have to do is crawl this information out.

Note: the page contains objURL, hoverURL, and other fields, but we use objURL, because it points to the original image.

So, how do we get objURL? With a regular expression!

How do we do that with a regular expression? It actually takes only one line of code:

results = re.findall('"objURL":"(.*?)",', html) 
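To see what this line returns, here is a tiny self-contained demo on a made-up fragment of page source (the urls are placeholders, not real results):

# Demo: extracting objURL values with the one-line regular expression
import re

html = '"objURL":"http://example.com/a.jpg","fromURL":"x","objURL":"http://example.com/b.png",'
results = re.findall('"objURL":"(.*?)",', html)
print(results)  # ['http://example.com/a.jpg', 'http://example.com/b.png']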

Core code:

1. Code to get the image urls:

# Get the image url links
def get_parse_page(pn, name):

    for i in range(int(pn)):

        # 1. Get the web page
        print('Fetching page {}'.format(i + 1))

        # Baidu Images home page url
        # name is the keyword you want to search for
        # pn is the number of pages you want to download

        url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%s&pn=%d' % (name, i * 20)

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4843.400 QQBrowser/9.7.13021.400'}

        # Send the request and get the response
        response = requests.get(url, headers=headers)
        html = response.content.decode()
        # print(html)

        # 2. Parse the page with a regular expression
        # "objURL":"http://n.sinaimg.cn/sports/transform/20170406/dHEk-fycxmks5842687.jpg"
        results = re.findall('"objURL":"(.*?)",', html)  # returns a list

        # Save the images locally from the links we obtained
        save_to_txt(results, name, i)

2. Code to save the images locally:

# Save the images locally
def save_to_txt(results, name, i):

    j = 0
    # Create a folder in the current directory
    if not os.path.exists('./' + name):
        os.makedirs('./' + name)

    # Download the images
    for result in results:
        print('Saving image {}'.format(j))
        try:
            pic = requests.get(result, timeout=10)
            time.sleep(1)
        except:
            print('Current image could not be downloaded')
            j += 1
            continue

        # Can be ignored: this block has a bug
        # file_name = result.split('/')
        # file_name = file_name[len(file_name) - 1]
        # print(file_name)
        #
        # end = re.search('(.png|.jpg|.jpeg|.gif)$', file_name)
        # if end == None:
        #     file_name = file_name + '.jpg'

        # Save the image into the folder
        file_full_name = './' + name + '/' + str(i) + '-' + str(j) + '.jpg'
        with open(file_full_name, 'wb') as f:
            f.write(pic.content)

        j += 1
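For reference, here is one way the buggy commented-out extension logic could be fixed (an illustrative sketch; safe_file_name is a name introduced here, not part of the original code). The main fixes are escaping the dots in the pattern and taking only the last path segment:

# Sketch: a fixed version of the commented-out extension logic
import re

def safe_file_name(url, default_ext='.jpg'):
    # take the last path segment as the candidate file name
    name = url.split('/')[-1] or 'image'
    # append a default extension when no known image extension matches;
    # note the escaped dots, which the original pattern lacked
    if re.search(r'\.(png|jpg|jpeg|gif)$', name, re.IGNORECASE) is None:
        name += default_ext
    return name

print(safe_file_name('http://example.com/pics/photo'))      # photo.jpg
print(safe_file_name('http://example.com/pics/photo.png'))  # photo.png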

3. Main function code:

# Main function
if __name__ == '__main__':

    name = input('Enter the keyword you want to download: ')
    pn = input('How many pages do you want to download (60 images per page): ')
    get_parse_page(pn, name)
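If you would rather drive the crawler from other code than from the interactive prompts, a direct call works too (the keyword and page count here are example values):

# Example of a direct, non-interactive call (hypothetical values)
get_parse_page('2', 'corgi')  # download the first 2 result pages for "corgi"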

Usage instructions:

# Modules required
import requests
import re
import os
import time

# 1. Run the .py source file
# 2. Enter the keyword you want to search for, e.g. "Corgi" or "Teddy"
# 3. Enter the number of pages you want to download, e.g. 5, which downloads 5 x 60 = 300 images


Source: blog.csdn.net/FHGFHFYUUGY/article/details/104524533