For a small image classification project, I needed to build my own dataset. That means downloading a large number of pictures from the internet and then processing them uniformly. Saving every downloaded picture by hand quickly becomes tedious. So, is there a way to search for images and download them straight to the local machine?
There is: Python!
As a demo, I downloaded 500 images each for the keywords "Teddy", "Corgi", and "Labrador". Next I plan to write a puppy classifier; suggestions are welcome!
Here is a demo of the results:
Approach:
1. Get the image url links
First, open the Baidu Images home page and note the url of the index page, shown in the figure below.
Next, switch to the traditional flip-style pagination ("flip"), because that version is much easier to crawl.
Comparing the urls of several pages, we find that pn is the parameter controlling the request offset. By modifying pn and inspecting the returned data, we find there are only 60 images per page.
NOTE: the gsm parameter is just the hexadecimal representation of pn and can simply be removed.
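That relationship between gsm and pn is easy to check in the interpreter (a minimal sketch; the pn values below are the page offsets seen in the urls):

```python
# gsm appears to be just pn rendered in hexadecimal,
# so it adds no information and can be dropped from the url.
for pn in (0, 20, 40, 60):
    gsm = format(pn, 'x')  # e.g. 60 -> '3c'
    print(pn, '->', gsm)
```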
Then right-click to view the page source and search (Ctrl+F) for objURL.
That is how we find the urls of the images we need.
2. Save the images locally from the links
Now all we have to do is crawl this information out.
Note: the page contains objURL, hoverURL, and so on, but we use objURL because it points to the original image.
So how do we extract the objURL? With a regular expression!
How do we use a regular expression here? It actually takes only one line of code:
results = re.findall('"objURL":"(.*?)",', html)
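To see what that one line does, here is a minimal sketch run against a made-up HTML fragment (the snippet below only imitates the shape of the Baidu flip-page source; it is not real output):

```python
import re

# A made-up fragment in the same shape as the Baidu flip-page source
html = (
    '"thumbURL":"http://img.example.com/thumb/a.jpg",'
    '"objURL":"http://img.example.com/full/a.jpg",'
    '"hoverURL":"http://img.example.com/hover/a.jpg",'
    '"objURL":"http://img.example.com/full/b.png",'
)

# Non-greedy capture of everything between "objURL":" and the closing ",
results = re.findall('"objURL":"(.*?)",', html)
print(results)  # only the objURL entries, i.e. the original images
```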
Core code:
1. Code to get the image urls:

import re
import requests

# Get the image url links
def get_parse_page(pn, name):
    for i in range(int(pn)):
        # 1. Fetch the page
        print('Fetching page {}'.format(i + 1))
        # url of the Baidu Images flip-style search page
        # name is the keyword you want to search for
        # pn is the number of pages you want to download
        url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%s&pn=%d' % (name, i * 20)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4843.400 QQBrowser/9.7.13021.400'}
        # Send the request and get the response
        response = requests.get(url, headers=headers)
        html = response.content.decode()
        # print(html)
        # 2. Parse the page with a regular expression
        # "objURL":"http://n.sinaimg.cn/sports/transform/20170406/dHEk-fycxmks5842687.jpg"
        results = re.findall('"objURL":"(.*?)",', html)  # returns a list
        # Save the images locally using the extracted links
        save_to_txt(results, name, i)
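One caveat with the %s substitution in the url above: non-ASCII keywords (for example Chinese ones like the ones I used) should be percent-encoded first. A minimal sketch with the standard library:

```python
from urllib.parse import quote

# Non-ASCII keywords must be percent-encoded before being
# substituted into the word= query parameter of the url.
name = '泰迪'  # "Teddy"
word = quote(name)
url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%s&pn=%d' % (word, 0)
print(url)
```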
2. Code to save the images locally:

import os
import time
import requests

# Save the images locally
def save_to_txt(results, name, i):
    j = 0
    # Create a folder in the current directory
    if not os.path.exists('./' + name):
        os.makedirs('./' + name)
    # Download the images
    for result in results:
        print('Saving image {}'.format(j))
        try:
            pic = requests.get(result, timeout=10)
            time.sleep(1)
        except Exception:
            print('The current image cannot be downloaded')
            j += 1
            continue
        # Can be skipped -- this code has a bug
        # file_name = result.split('/')
        # file_name = file_name[len(file_name) - 1]
        # print(file_name)
        #
        # end = re.search('(.png|.jpg|.jpeg|.gif)$', file_name)
        # if end == None:
        #     file_name = file_name + '.jpg'
        # Save the image into the folder
        file_full_name = './' + name + '/' + str(i) + '-' + str(j) + '.jpg'
        with open(file_full_name, 'wb') as f:
            f.write(pic.content)
        j += 1
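The commented-out filename logic above is buggy (its result is never used, and it ignores query strings in the url). One way to fix it, sketched with the standard library; the function name pick_file_name is my own, not part of the original code:

```python
import os
from urllib.parse import urlparse

# Allowed image extensions; anything else falls back to .jpg
ALLOWED = {'.png', '.jpg', '.jpeg', '.gif'}

def pick_file_name(url):
    # Take the last path segment of the url, ignoring any query string
    base = os.path.basename(urlparse(url).path)
    root, ext = os.path.splitext(base)
    if ext.lower() in ALLOWED:
        return base
    return base + '.jpg'  # no usable extension: default to .jpg

print(pick_file_name('http://e.com/a/photo.png'))  # photo.png
print(pick_file_name('http://e.com/a/photo?x=1'))  # photo.jpg
```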
Core code:

pic = requests.get(result, timeout=10)
f.write(pic.content)
3. Main function code:

# Main function
if __name__ == '__main__':
    name = input('Please enter the keyword you want to download: ')
    pn = input('How many pages do you want to download (60 images per page): ')
    get_parse_page(pn, name)
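Since the same picture often appears under several urls, a deduplication pass over the downloaded files helps before training the classifier. A minimal sketch using content hashes (hashlib is in the standard library; the byte strings below stand in for downloaded image data):

```python
import hashlib

def dedupe(blobs):
    """Return blobs with byte-identical duplicates removed, order preserved."""
    seen = set()
    unique = []
    for blob in blobs:
        digest = hashlib.md5(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique

downloads = [b'dog-1', b'dog-2', b'dog-1']  # stand-ins for image bytes
print(len(dedupe(downloads)))  # 2
```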