A Python crawler for the GIF出处 (gifjia5.com) site

The blogger only started touching Python a month ago. As a glue language, Python has its own unique advantages. I am a programming novice; after a day of hard work I wrote a barely usable Python spider, and after improving the code to catch RequestException errors, the crawler can in practice run indefinitely. It has already crawled down a pile of rather dirty things...
Guys, if you copy the code, please leave me a comment; it is pretty quiet around here. Thanks!
The crawling results are shown below:
(Screenshot: the folder of downloaded files)
The source code is as follows:

import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
import time
from requests.exceptions import RequestException


def Download_gif(url, path):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
        'Connection': 'close'
    }
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html.parser')
    gif_url = soup.find_all('a', class_='focus')        # find every post link on this listing page; returns a list of Tag objects that support dict-style access
    for gif_url in gif_url:                              # iterate over the Tags
        gif_url = gif_url['href']                        # the value under the 'href' key is the post link
        html = requests.get(gif_url, headers=headers)    # request the extracted post link
        soup = BeautifulSoup(html.text, 'html.parser')   # parse it with BeautifulSoup
        page = soup.find('div', class_='article-paging').find_all('span')   # each post is paginated, so look up its total page count
        page = page[-1].text                             # the last <span> inside <div class="article-paging"> holds the page count
        each_url = gif_url                               # keep the two URLs separate (each_url vs. gif_url), otherwise odd errors show up while debugging
        for i in range(1, int(page) + 1):                # loop over every page of the post and download its pictures/GIFs
            pic = each_url + str(i)                      # URL of one page of the post
            html = requests.get(pic, headers=headers)    # request that page
            soup = BeautifulSoup(html.text, 'html.parser')          # parse it
            pic_url = soup.find_all('img', class_='aligncenter')    # <img class="aligncenter"> tags hold the picture/GIF URLs
            for a_url in pic_url:                        # iterate over every picture/GIF tag
                os.chdir(path)                           # switch to the target directory on your machine
                a_url = a_url.get('src')                 # the 'src' attribute is the actual image URL (None if missing)
                if a_url is None:                        # if the image has no link, skip it so the crawler keeps running
                    continue
                try:
                    html = requests.get(a_url, headers=headers)     # request the image URL to get the picture/GIF file stream
                    requests.adapters.DEFAULT_RETRIES = 5           # allow repeated request attempts
                    file_name = a_url.split('/')[-1]     # derive the file name from the URL
                    f = open(file_name, 'wb')            # write the file
                    f.write(html.content)
                    time.sleep(0.000001)
                    f.close()
                    time.sleep(0.2)
                except RequestException:
                    return None

if __name__ == '__main__':
    path = 'C://Users/panenmin/Desktop/GIF/'    # define path; change it to a folder on your own machine, and be sure to use forward slashes
    start_url = 'https://www.gifjia5.com/category/neihan/page/'    # define start_url
    pool = Pool(6)                              # build a process pool
    for i in range(1, 23):                      # loop over the listing pages
        url = start_url + str(i)                # build each listing-page URL
        pool.apply_async(Download_gif, args=(url, path))    # hand the function and its arguments to the pool
    pool.close()
    pool.join()
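If you just want to try the single-page function before wiring in the process pool, it can also be called directly on one listing page; the page number and path below are simply the example values used above:

Download_gif('https://www.gifjia5.com/category/neihan/page/1', 'C://Users/panenmin/Desktop/GIF/')    # crawl listing page 1 only, sequentially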



The logic of the code is a little convoluted. The main idea is to define a function that crawls a single listing page. That function contains three nested loops: the first loop iterates over every post link on the listing page; for each post it looks up the total number of pages, builds the per-page URLs, loops over every page, filters out the real picture/GIF links, and for every gif or jpg link issues a request to get the file stream and writes it to a file.
In the main block, the idea is simply to run Download_gif for every listing page of the category, with a multiprocessing Pool added on top.
Inside the function, time.sleep() from the time module is used so that the spider hits the server less often, to avoid nasty surprises such as getting the IP banned.
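As an illustration only (this helper is not part of the original script), a request wrapper that enforces a delay and a few retries might look like the sketch below; the delay and retry count are arbitrary example values:

import time
import requests
from requests.exceptions import RequestException

def polite_get(url, headers, delay=0.2, retries=3):
    # Sketch of a throttled GET: pause before each attempt, retry a few times on failure.
    for attempt in range(retries):
        try:
            time.sleep(delay)                                    # be gentle with the server
            return requests.get(url, headers=headers, timeout=10)
        except RequestException:
            if attempt == retries - 1:                           # out of retries, give up
                raise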
Although this spider is rather crude, it still took me quite a while to write, and I hope friends who are just getting started with crawlers can learn a little something from it. Tomorrow I will write up the concrete development process. (Your poor blogger is an office worker and rather busy.)
Let me attach my favourite saying: talk is cheap, show me the code.
Development process:
1: First, go to the home page of the GIF出处 site and look at the page layout.
(Screenshot: the home page of the 内涵/neihan category)

Using Chrome's developer tools, we find the following:
(Screenshot: the developer-tools view of the home page)
You can see that the <a> tags whose class attribute is focus contain all the post links on the home page; each link is the corresponding href value.
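Extracting those hrefs on their own takes only a few lines with requests and BeautifulSoup (a minimal sketch using the same selectors as the full script; the shortened User-Agent is just for brevity):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}                    # shortened UA, only for this sketch
html = requests.get('https://www.gifjia5.com/category/neihan/page/1', headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')
post_links = [a['href'] for a in soup.find_all('a', class_='focus')]    # one href per post
print(post_links)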

Opening the first link on the home page gives the following:
(Screenshot: the first post page)

Scrolling further down, we find that each post is paginated:
(Screenshot: the pagination links of a post)

Open the developer tools again to find the jpg/gif links and the maximum page number.

The img tags whose class contains aligncenter are the ones holding the picture links.

Find the maximum number of pages:

The div tag with class article-paging is the paging section; its span tags hold the page numbers, and the last one is the total page count.
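Taken together, the two lookups described above (image links and page count) come down to a few lines (again a sketch; the post URL here is a placeholder for any href collected from the home page):

import requests
from bs4 import BeautifulSoup

post_url = 'https://www.gifjia5.com/example-post/'         # placeholder: one href from the a.focus links
soup = BeautifulSoup(requests.get(post_url).text, 'html.parser')

# the last <span> inside <div class="article-paging"> holds the total page count
page_count = int(soup.find('div', class_='article-paging').find_all('span')[-1].text)

# every <img class="aligncenter"> on the page is a picture or GIF
image_urls = [img['src'] for img in soup.find_all('img', class_='aligncenter')]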

After finding the links for every page of a post, loop over them, request each picture or GIF link to get the file stream, and write it to a file.
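The download-and-write step itself can be sketched as a small helper (not the author's exact code, which uses os.chdir instead of os.path.join; the example URL in the commented call is hypothetical):

import os
import requests

def save_image(img_url, path):
    # Sketch: fetch one picture/GIF and write the raw bytes into the target folder.
    resp = requests.get(img_url, timeout=10)
    file_name = img_url.split('/')[-1]                     # e.g. 'something.gif'
    with open(os.path.join(path, file_name), 'wb') as f:
        f.write(resp.content)                              # resp.content is the raw file stream

# save_image('https://www.gifjia5.com/wp-content/uploads/example.gif', 'D:/GIF/')    # hypothetical call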

That covers the crawling logic inside the function, i.e. how a single listing page is crawled. The idea for the listing pages themselves is very simple: there are 23 of them in total, so just loop over them:

if __name__ == '__main__':
    path = 'C://Users/panenmin/Desktop/GIF/'    # define path; change it to a folder on your own machine, and be sure to use forward slashes
    start_url = 'https://www.gifjia5.com/category/neihan/page/'    # define start_url
    pool = Pool(6)                              # build a process pool
    for i in range(1, 23):                      # loop over the listing pages
        url = start_url + str(i)                # build each listing-page URL
        pool.apply_async(Download_gif, args=(url, path))    # hand the function and its arguments to the pool
    pool.close()
    pool.join()

Pass the single-page crawling function and its arguments into the pool, and that is it. However, the blogger found that this crawler still has plenty of room for improvement: for example, after crawling for a while the remote host will forcibly drop the crawler's connections (a temporary IP ban), so adding a proxy pool is worth considering. Plenty of problems come up during development; you can learn as you go and combine knowledge with practice. I hope friends who are just starting out with crawlers will push themselves to write something of their own, so they understand it more deeply.
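requests can route a single call through a proxy via its proxies parameter, so a very rough first step towards a proxy pool might look like this (the proxy addresses are placeholders, not working proxies):

import random
import requests

# placeholder proxy list; in practice these would come from a real proxy pool
PROXIES = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']

def get_with_proxy(url, headers):
    # Sketch: pick a random proxy for each request so a ban only hits one address.
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=10)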

Update: an OS error on file_name appeared while crawling, so the source code below has been updated with a check on file_name.

import os
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
import time
from requests.exceptions import RequestException


def Download_gif(url, path):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
        'Connection': 'close'
    }
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html.parser')
    gif_url = soup.find_all('a', class_='focus')        # find every post link on this listing page; returns a list of Tag objects that support dict-style access
    for gif_url in gif_url:                              # iterate over the Tags
        gif_url = gif_url['href']                        # the value under the 'href' key is the post link
        html = requests.get(gif_url, headers=headers)    # request the extracted post link
        soup = BeautifulSoup(html.text, 'html.parser')   # parse it with BeautifulSoup
        page = soup.find('div', class_='article-paging').find_all('span')   # each post is paginated, so look up its total page count
        page = page[-1].text                             # the last <span> inside <div class="article-paging"> holds the page count
        each_url = gif_url                               # keep the two URLs separate (each_url vs. gif_url), otherwise odd errors show up while debugging
        for i in range(1, int(page) + 1):                # loop over every page of the post and download its pictures/GIFs
            pic = each_url + str(i)                      # URL of one page of the post
            html = requests.get(pic, headers=headers)    # request that page
            soup = BeautifulSoup(html.text, 'html.parser')          # parse it
            pic_url = soup.find_all('img', class_='aligncenter')    # <img class="aligncenter"> tags hold the picture/GIF URLs
            for a_url in pic_url:                        # iterate over every picture/GIF tag
                os.chdir(path)                           # switch to the target directory on your machine
                a_url = a_url.get('src')                 # the 'src' attribute is the actual image URL (None if missing)
                if a_url is None:                        # if the image has no link, skip it so the crawler keeps running
                    continue
                file_name = a_url.split('/')[-1]
                if file_name[-4:] != '.gif' and file_name[-4:] != '.jpg' and file_name[-4:] != 'jpeg':
                    return None                          # give up on this page if the name does not look like .gif/.jpg/.jpeg (for '.jpeg' the last four characters are 'jpeg')
                try:
                    html = requests.get(a_url, headers=headers)     # request the image URL to get the picture/GIF file stream
                    requests.adapters.DEFAULT_RETRIES = 5           # allow repeated request attempts
                    f = open(file_name, 'wb')
                    f.write(html.content)
                    time.sleep(0.000001)
                    f.close()
                    time.sleep(0.2)
                except RequestException:
                    return None

if __name__ == '__main__':
    path = 'D://GIF/'                           # define path; change it to a folder on your own machine, and be sure to use forward slashes
    start_url = 'https://www.gifjia5.com/category/neihan/page/'    # define start_url
    pool = Pool(6)                              # build a process pool
    for i in range(1, 23):                      # loop over the listing pages
        url = start_url + str(i)                # build each listing-page URL
        pool.apply(Download_gif, args=(url, path))    # hand the function and its arguments to the pool (apply blocks until the page is done, unlike apply_async above)
        print('Finished crawling page {}'.format(i))
    pool.close()
    pool.join()
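Incidentally, the extension check could be written a little more defensively with str.endswith (an alternative sketch, not the author's code); combined with continue, it would skip only the offending file instead of giving up on the whole page:

def is_image_name(file_name):
    # Sketch: accept only names ending in .gif/.jpg/.jpeg, case-insensitively.
    return file_name.lower().endswith(('.gif', '.jpg', '.jpeg'))

# inside the innermost loop one could then write:
#     if not is_image_name(a_url.split('/')[-1]):
#         continue        # skip this file only and keep crawling the rest of the page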

One more note: be sure to add comments when you write code. Not only does that make the code easier for other people to read, but when a bug shows up later you can quickly get a clear picture of the code again, which makes it much easier to maintain. Unless you are a god, of course...
