A basic Python crawler targets websites whose anti-crawling mechanisms are relatively simple, and building one is a good way to get familiar with the overall crawling workflow and with crawling strategies.
A crawler breaks down into four steps: request the page, parse the data, extract the data, and store the data. This article walks through basic crawler examples from these four angles.
1. Crawling simple static web pages
Our goal is to crawl all the wallpapers from a wallpaper website:
http://www.netbian.com/dongman/
1.1 Choosing a crawler strategy - thumbnails
First, open developer tools and inspect the page structure to find the img tag that corresponds to each image. It turns out we only need to grab that img tag and request its src to get a preview (thumbnail) of each wallpaper.
The site has more than one page, so open the first three pages and look for a pattern in the URLs.
http://www.netbian.com/dongman/index.htm # page 1
http://www.netbian.com/dongman/index_2.htm # page 2
http://www.netbian.com/dongman/index_3.htm # page 3
Except for the first page, the URLs follow a fixed pattern, so we first build a list containing the URL of every page.
import os

url_start = 'http://www.netbian.com/dongman/'
url_list = ['http://www.netbian.com/dongman/index.htm']
if not os.path.exists('./exercise'):
    os.mkdir('./exercise')
for i in range(2, 133):
    url = url_start + 'index_' + str(i) + '.htm'
    url_list.append(url)
At this point our basic crawler strategy has been determined.
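The page-list construction above can also be factored into a small helper; this is only a sketch, and the function name is ours, not from the original code:

```python
def build_page_urls(base, last_page):
    """Build the list of page URLs for netbian-style pagination:
    page 1 is index.htm, page n (n >= 2) is index_n.htm."""
    urls = [base + 'index.htm']
    for i in range(2, last_page + 1):
        urls.append(base + 'index_' + str(i) + '.htm')
    return urls

pages = build_page_urls('http://www.netbian.com/dongman/', 132)
```

Called with 132 as the last page, this reproduces the 132-entry list built by the loop above.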
Request the page
for url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers).text
Parse the data
Here we use lxml's etree to parse the HTML.
tree \= etree.HTML(response)
Extract the data
Here we use an XPath expression to extract the image URLs.
leaf = tree.xpath('//div[@class="list"]//ul/li/a/img/@src')
for l in leaf:
    print(l)
    h = requests.get(url=l, headers=headers).content
Store the data
    name = 'exercise/' + l.split('/')[-1]
    with open(name, 'wb') as fp:
        fp.write(h)
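The save path is simply the last path segment of the image URL appended to the exercise folder. As a small sketch (the helper name is ours), that logic can be isolated and reused:

```python
import os

def filename_from_url(url, folder='exercise'):
    """Derive a local save path from an image URL by taking the
    last path segment, e.g. '.../small123.jpg' -> 'exercise/small123.jpg'."""
    return os.path.join(folder, url.split('/')[-1])
```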
Complete code
import requests
from lxml import etree
import os

url_start = 'http://www.netbian.com/dongman/'
url_list = ['http://www.netbian.com/dongman/index.htm']
# subsequent pages look like http://www.netbian.com/dongman/index_2.htm
if not os.path.exists('./exercise'):
    os.mkdir('./exercise')
for i in range(2, 133):
    url = url_start + 'index_' + str(i) + '.htm'
    url_list.append(url)
print(url_list)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
for url in url_list:
    response = requests.get(url=url, headers=headers).text
    tree = etree.HTML(response)
    # src attribute of every thumbnail on the list page
    leaf = tree.xpath('//div[@class="list"]//ul/li/a/img/@src')
    for l in leaf:
        print(l)
        h = requests.get(url=l, headers=headers).content
        name = 'exercise/' + l.split('/')[-1]
        with open(name, 'wb') as fp:
            fp.write(h)
1.2 Choosing a crawler strategy - high-definition images
The crawler above only fetched thumbnails. To get the high-definition versions we need a different strategy. Reopening developer tools, we find that the img tag we crawled earlier sits inside an a tag whose href points to a detail page, and that detail page contains the high-definition image.
So the strategy becomes: extract the href value, request the detail page it points to, then find the img tag on that page and make one more request.
We use regular expressions to extract the href values; for data this simple, a regular expression is terser than the equivalent XPath. See the documentation of Python's re module for the syntax.
for url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers).text
    leaf = re.findall(r"desk/\d*\.htm", response, re.S)
    for l in leaf:
        detail_url = "http://www.netbian.com/" + str(l)
        h = requests.get(url=detail_url, headers=headers).text
        leaf_ = re.findall(r'<div class="pic">.*?(http://img.netbian.com/file/\d*/\d*/\w*\.jpg)', h, re.S)
leaf_ now holds the URLs of the high-definition images; all that is left is to send one more request per URL and save the data.
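As a quick sanity check, here are the two regular expressions above applied to a hand-written HTML fragment (the fragment is our own illustration, not the site's actual markup):

```python
import re

# Illustrative stand-ins for a list page and a detail page
list_html = '<li><a href="/desk/23100.htm"><img src="small.jpg"></a></li>'
detail_html = ('<div class="pic"><p>'
               '<img src="http://img.netbian.com/file/2021/0101/abc123.jpg">'
               '</p></div>')

# Step 1: detail-page links on a list page
links = re.findall(r"desk/\d*\.htm", list_html, re.S)
# Step 2: full-size image URL on a detail page
imgs = re.findall(r'<div class="pic">.*?(http://img.netbian.com/file/\d*/\d*/\w*\.jpg)',
                  detail_html, re.S)
print(links)  # ['desk/23100.htm']
print(imgs)   # ['http://img.netbian.com/file/2021/0101/abc123.jpg']
```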
Store the data
for l_ in leaf_:
    print(l_)
    img = requests.get(url=l_, headers=headers).content
    name = 'exercise/' + l_.split('/')[-1]
    with open(name, 'wb') as fp:
        fp.write(img)
Complete code
import requests
import os
import re

url_start = 'http://www.netbian.com/dongman/'
url_list = ['http://www.netbian.com/dongman/index.htm']
if not os.path.exists('./exercise'):
    os.mkdir('./exercise')
for i in range(2, 133):
    url_list.append(url_start + 'index_' + str(i) + '.htm')
print(url_list)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
for url in url_list:
    response = requests.get(url=url, headers=headers).text
    # detail-page links, of the form desk/<digits>.htm
    leaf = re.findall(r"desk/\d*\.htm", response, re.S)
    for l in leaf:
        detail_url = "http://www.netbian.com/" + str(l)
        h = requests.get(url=detail_url, headers=headers).text
        # full-size image URL inside the "pic" div
        leaf_ = re.findall(r'<div class="pic">.*?(http://img.netbian.com/file/\d*/\d*/\w*\.jpg)', h, re.S)
        for l_ in leaf_:
            print(l_)
            img = requests.get(url=l_, headers=headers).content
            name = 'exercise/' + l_.split('/')[-1]
            with open(name, 'wb') as fp:
                fp.write(img)
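One caveat with re.findall: it returns every occurrence, so if the same detail link appears more than once in the page source, the image gets downloaded twice. A minimal order-preserving de-duplication helper (our own addition, not part of the original code) can be applied to leaf before the inner loop:

```python
def dedupe(seq):
    """Drop duplicates from a list while preserving first-seen order."""
    seen = set()
    out = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out
```

With it, `for l in dedupe(leaf):` replaces `for l in leaf:`.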
2. Crawling dynamically loaded websites
This time the goal is to crawl all the wallpapers from another wallpaper website:
https://sucai.gaoding.com/topic/9080?
2.1 Choosing a crawler strategy - selenium
First, open developer tools and inspect the page structure. You will notice that not all wallpapers are loaded at once: as you drag the scroll bar down, more content loads in real time. Inspecting the elements also reveals lazy-image tags, which indicate dynamic loading.
Because the content is loaded dynamically, we cannot crawl it by sending plain requests as before. Instead, we need to drive a real browser and scroll the page down so that the lazily loaded content actually appears.
After examining the page structure, look at the pagination again; the pattern is easy to spot, so we won't repeat the analysis.
url_list = []
for i in range(1, 4):
    url = 'https://sucai.gaoding.com/topic/9080?p={}'.format(i)
    url_list.append(url)
Request the page
Here we use selenium, an automated-testing framework that drives a real browser.
for url in url_list:
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    time.sleep(2)
    i = 0
    while i < 10:  # scroll down step by step to trigger lazy loading
        i += 1
        driver.execute_script("window.scrollBy(0,500)")
        driver.implicitly_wait(5)  # implicit wait of up to 5 seconds
Parse and extract data
# Selenium 4 removed find_elements_by_xpath; use find_elements(By.XPATH, ...) instead
items = driver.find_elements(By.XPATH, "//*[@class='gdd-lazy-image__img gdd-lazy-image__img--loaded']")
for item in items:
    href = item.get_attribute('src')
    print(href)
To store the data, we just send a request to each src URL we collected, exactly as in section 1.
Complete code
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import os

if not os.path.exists('./exercise'):
    os.mkdir('./exercise')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36'
}
url_list = []
url_f_list = []
for i in range(1, 4):
    url_list.append('https://sucai.gaoding.com/topic/9080?p={}'.format(i))
for url in url_list:
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    time.sleep(2)
    i = 0
    while i < 10:  # scroll down step by step to trigger lazy loading
        i += 1
        driver.execute_script("window.scrollBy(0,500)")
        driver.implicitly_wait(5)  # implicit wait of up to 5 seconds
    items = driver.find_elements(By.XPATH, "//*[@class='gdd-lazy-image__img gdd-lazy-image__img--loaded']")
    for item in items:
        href = item.get_attribute('src')
        print(href)
        url_f_list.append(href)  # collect the image URLs for downloading
    driver.quit()  # close the browser window before the next page
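The loop above only prints the src URLs it finds. Saving them works as in section 1, with one wrinkle: a src served by a CDN may carry a query string, which should be stripped before using the last path segment as a file name. A sketch with a helper of our own naming (the commented-out loop shows where requests would come in):

```python
import os

def local_path(src_url, folder='exercise'):
    """Map an image src URL to a local file path: take the last path
    segment and strip any query string so the name is filesystem-safe."""
    name = src_url.split('/')[-1].split('?')[0]
    return os.path.join(folder, name)

# Download step, mirroring section 1:
# for src in url_f_list:
#     data = requests.get(src, headers=headers, timeout=10).content
#     with open(local_path(src), 'wb') as fp:
#         fp.write(data)
```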