A preliminary crawler case study, with a speed comparison of the parsers supported by Beautiful Soup

Today I worked through an example of downloading and saving images from well2049's blog.

I modified and optimized that code to compare the parsing speed of html.parser and lxml (the parser table below is from Cui's crawler tutorial), and also added a cap on the number of downloads.


For the original source code, refer to the link above.

The modified code is shown below.

import os
import time
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image

start = time.perf_counter()  # timing added for the speed test (time.clock() was removed in Python 3.8)

url = "http://www.yestone.com/gallery/1501754333627"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')  # changed from 'html.parser' in the original; 'xml' or 'html5lib' also work
# soup = BeautifulSoup(r.content, 'html.parser')  # original version

items = soup.find_all('img', class_='img-responsive')

folder_path = './photo'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

for index, item in enumerate(items):
    src = item.get('data-src')
    if src:  # skip tags without a data-src attribute
        response = requests.get(src)
        img_path = os.path.join(folder_path, str(index + 1) + '.png')  # was 'F:\Python\photo' + folder_path + ..., a doubled path
        image = Image.open(BytesIO(response.content))
        image.save(img_path)
        print('Image %d downloaded' % (index + 1))
        if index == 9:  # cap at 10 downloads to avoid fetching too much
            break
        time.sleep(1)

end = time.perf_counter()  # timing added for the speed test
print('Crawling completed', '\n', 'Time elapsed:', end - start)
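The timing screenshots did not survive, so here is a minimal, self-contained sketch of the same comparison that runs offline. The synthetic markup and the tag count of 2000 are my own assumptions, not data from the original page; absolute numbers depend on the machine, but it shows how the two parsers can be timed on identical input.

```python
import time
from bs4 import BeautifulSoup

# Synthetic page: 2000 image tags, similar in shape to the real gallery page.
markup = "<html><body>" + "".join(
    '<img class="img-responsive" data-src="img%d.png">' % i for i in range(2000)
) + "</body></html>"

def time_parser(parser):
    """Parse the markup with the given parser and return (tag count, seconds)."""
    start = time.perf_counter()
    soup = BeautifulSoup(markup, parser)
    n = len(soup.find_all('img', class_='img-responsive'))
    return n, time.perf_counter() - start

for parser in ('html.parser', 'lxml'):
    try:
        n, elapsed = time_parser(parser)
        print('%-12s found %d tags in %.4f s' % (parser, n, elapsed))
    except Exception as exc:  # lxml is an optional dependency and may be absent
        print('%-12s skipped (%s)' % (parser, exc))
```

Both parsers should find the same 2000 tags; only the elapsed time differs.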

The timing result when parsing with BeautifulSoup(markup, "lxml") (the lxml HTML parser) is as follows:

The timing result when parsing with BeautifulSoup(markup, "html.parser") (the Python standard library parser) is as follows:


There is a real difference. (The absolute times are not fixed; they depend on the machine and the network speed. But across runs, one parser was consistently faster than the other.)
