Python 3 Web Crawler in Practice: Crawling Beautiful Wallpapers

In an earlier blog post we learned how to crawl text with a Python 3 crawler. This post uses a worked example to show how to fetch images in bulk with a Python 3 crawler.

(1) Background

URL:https://unsplash.com/

The site is called Unsplash, a free site for sharing high-definition photographs. It is updated with high-quality images every day, many of them scenes from everyday life with a fresh feel, and the pictures work well as desktop wallpaper or in whatever other setting needs them.

Seeing pictures this beautiful, don't you want to download them as wallpaper? I like every one of them, so let's download them in bulk. There is no need to crawl many; 50 will do.

(2) Hands-on practice

An <a> tag holds a hyperlink, and an image lives in an <img> tag. So let's take one <img> tag from the Unsplash site and analyze it:

<img alt="Snow-capped mountain slopes under blue sky" src="https://images.unsplash.com/photo-1428509774491-cfac96e12253?dpr=1&

As you can see, the <img> tag has many attributes, including alt, src, class and style. The src attribute stores the address of the image we want to save, and with that address we can download the image.
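To make the role of src concrete, here is a minimal sketch that pulls the src value out of an <img> tag. It uses the standard library's html.parser rather than XPath, only so that it runs without extra packages. The class name ImgSrcParser, the sample markup and the placeholder URL are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Hypothetical markup: the URL is a placeholder, not a real Unsplash address
sample_html = (
    '<a href="/photos/1">'
    '<img alt="Snow-capped mountain" class="photo" '
    'src="https://images.example.com/photo-1.jpg">'
    '</a>'
)

parser = ImgSrcParser()
parser.feed(sample_html)
print(parser.srcs)
```

Each address collected this way could then be passed to requests.get to download the image.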

So the crawling process is:

  • Use requests to fetch the HTML of the entire page;
  • Parse the HTML with XPath, find every <img> tag, and extract its src attribute to get the image address;
  • Download each image from its address.

Following this plan, let's try crawling Unsplash with the code below:

import requests

if __name__ == '__main__':
     url = 'https://unsplash.com/'
     req = requests.get(url)
     print(req.text)

We expected to find many <img> tags. Instead, apart from some <script> tags and some code that is hard to make sense of, there is nothing: not a single <img> tag!

That is because all the images on this site are loaded dynamically. Websites are either static or dynamic: the site crawled in the previous post was static, whereas this one is dynamic, and one purpose of dynamic loading is to hinder crawlers.
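One rough way to confirm that a page is rendered by JavaScript is to count the tags in the raw response text. The sketch below does this offline on a made-up response body. The helper count_tags and the sample HTML are assumptions for illustration, and a regex count is only a heuristic, not a real HTML parse.

```python
import re


def count_tags(html, tag):
    """Count opening tags named `tag` in an HTML string (rough, regex-based)."""
    pattern = r"<{}\b".format(re.escape(tag))
    return len(re.findall(pattern, html, flags=re.IGNORECASE))


# Hypothetical response body resembling a page that builds its content with JavaScript
static_html = (
    "<html><head><script src='/app.js'></script></head>"
    "<body><div id='app'></div><script>window.__DATA__ = {}</script></body></html>"
)

print("img tags:", count_tags(static_html, "img"))
print("script tags:", count_tags(static_html, "script"))
```

With a real page, the string returned by requests.get(url).text would take the place of static_html.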

You can use the Network panel in the browser's built-in developer tools. The browser runs the JavaScript for us, so the panel shows the content that the scripts load once they have executed.

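However, things are not that simple. Looking closely, we find that the page source contains only a handful of img tags, which means we can get the paths of only a few images, while we want a large number of them. Scroll down the page and more img tags appear, so this is clearly a site that loads its content dynamically via Ajax[1].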
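Loading more content on scroll usually means the page requests data one page at a time. The complete code in this post calls https://unsplash.com/napi/photos with page and per_page parameters, so the requested URLs can be sketched as follows. The helper build_page_url is a hypothetical name, and whether the endpoint still responds this way has not been verified here.

```python
from urllib.parse import urlencode

# Endpoint taken from the crawler code in this post; not verified to still be served
BASE_URL = "https://unsplash.com/napi/photos"


def build_page_url(page, per_page=12):
    """Return the URL for one page of photo metadata."""
    query = urlencode({"page": page, "per_page": per_page})
    return "{}?{}".format(BASE_URL, query)


# Print the URLs for the first three pages
for page in range(1, 4):
    print(build_page_url(page))
```

Passing params to requests.get, as the crawler does, builds the same query string automatically.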

(3) The complete code is as follows:

'''
Download beautiful images
'''
import requests
import json
import time
import random
from contextlib import closing

class Imgdownloader(object):

    # Initialization
    def __init__(self):
        self.headers = {
            'Referer': 'https://unsplash.com/',
            # Adjacent string literals are joined, so no stray whitespace ends up in the header
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/74.0.3729.108 Safari/537.36',
        }
        self.base_url = 'https://unsplash.com/napi/photos'
        self.img_urls = []
        self.page = 0
        self.per_page = 12

    def get_img_urls(self):
        '''Collect the image link addresses
        '''
        # The loop count reuses per_page, so 12 pages of 12 links each are requested
        for i in range(self.per_page):
            # Advance one page at a time: 1, 2, 3, ...
            self.page = i + 1
            params = {
                "page": self.page,
                "per_page": self.per_page,
            }
            res = requests.get(self.base_url, params=params)
            # Convert the JSON text into Python objects (a list of dicts)
            res_dict = json.loads(res.text)

            for item in res_dict:
                url = item['urls']['regular']
                self.img_urls.append(url)

            # Sleep for a random interval after each page of links
            time.sleep(random.random())
        print("Image URLs fetched successfully")

    def img_download(self, urls):
        '''
        Download the images to local storage
        '''
        num = 0
        for url in urls:
            num = num + 1
            img_path = str(num) + ".jpg"
            # Stream the response so large images are written chunk by chunk
            with closing(requests.get(url, headers=self.headers, stream=True)) as res:
                with open(img_path, 'wb') as f:
                    print("Downloading image {}".format(num))
                    for chunk in res.iter_content(chunk_size=1024):
                        f.write(chunk)
                print("{}.jpg downloaded".format(num))


if __name__ == '__main__':
    # Image downloader object
    imgdl = Imgdownloader()
    print("Fetching image links...")
    # Collect the image links
    imgdl.get_img_urls()

    print("Starting image download:")
    # Download the images
    imgdl.img_download(imgdl.img_urls)

    print('All images downloaded')

The download speed is acceptable; some images download slowly because they are large.
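The link-extraction step can be tried offline, without touching the network. The sketch below reads item['urls']['regular'] from a JSON array, as the crawler does, and stops after a limit, in line with the goal of about 50 images stated at the start. The function extract_regular_urls, the limit parameter and the sample payload are assumptions for illustration, and the real response may contain many more fields.

```python
import json


def extract_regular_urls(payload, limit=50):
    """Return up to `limit` values of item['urls']['regular'] from a JSON array string."""
    urls = []
    for item in json.loads(payload):
        # Skip entries that lack a 'regular' URL instead of raising KeyError
        url = item.get("urls", {}).get("regular")
        if url:
            urls.append(url)
        if len(urls) >= limit:
            break
    return urls


# Hypothetical payload: only the fields the crawler reads, with placeholder URLs
sample = json.dumps([
    {"id": "a", "urls": {"regular": "https://images.example.com/a.jpg"}},
    {"id": "b", "urls": {}},
    {"id": "c", "urls": {"regular": "https://images.example.com/c.jpg"}},
])

print(extract_regular_urls(sample, limit=50))
```

Using .get here is a design choice: it tolerates entries with missing fields, whereas the direct indexing in the crawler would stop with a KeyError.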


Origin blog.csdn.net/qq_42415326/article/details/94478308