Introduction to web crawling with Python

A basic Python crawler mainly targets websites with relatively simple anti-crawling mechanisms. Building one is a good way to become familiar with the overall crawling workflow and with common crawling strategies.

A crawler works in four steps: request the page, parse the data, extract the data, and store the data. This article walks through basic crawler examples from these four perspectives.
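As a rough sketch of what those four steps look like in code (using requests and lxml, which the examples in this article rely on; the URL and XPath here are placeholders, not a real target):

import requests
from lxml import etree

url = 'http://example.com/page.htm'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a normal browser

# 1. request: fetch the page HTML
response = requests.get(url=url, headers=headers).text
# 2. parse: build an element tree from the HTML
tree = etree.HTML(response)
# 3. extract: pull out the data we care about (placeholder XPath)
links = tree.xpath('//img/@src')
# 4. store: write the results to disk
with open('links.txt', 'w') as fp:
    fp.write('\n'.join(links))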

1. Crawling simple static web pages

We want to crawl all the wallpapers from a wallpaper website:

http://www.netbian.com/dongman/

1.1 Choosing a crawling strategy - thumbnails

First, open the browser's developer tools and look at the page structure to find the img tag corresponding to each image. You will see that we only need the img tag for each wallpaper: sending a request to its src gives us a preview (thumbnail) of the wallpaper.

The site has more than one page, so I opened the first three pages and looked for a pattern in the URLs.

http://www.netbian.com/dongman/index.htm    # page 1
http://www.netbian.com/dongman/index_2.htm  # page 2
http://www.netbian.com/dongman/index_3.htm  # page 3

Except for the first page, the URLs follow a fixed pattern, so we first build a list containing the URLs of all the pages.

url_start = 'http://www.netbian.com/dongman/'
url_list = ['http://www.netbian.com/dongman/index.htm']

if not os.path.exists('./exercise'):
    os.mkdir('./exercise')

for i in range(2, 133):
    url = url_start + 'index_' + str(i) + '.htm'
    url_list.append(url)

At this point our basic crawler strategy has been determined.

Web page request

for url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers).text
    

Parsing the data

Here we use lxml's etree to parse the HTML:

    tree = etree.HTML(response)

Extract data

Here we use an XPath expression to extract the thumbnail URLs:

    leaf = tree.xpath('//div[@class="list"]//ul/li/a/img/@src')
    for l in leaf:
        print(l)
        h = requests.get(url=l, headers=headers).content
    

Storing data

        i = 'exercise/' + l.split('/')[-1]
        with open(i, 'wb') as fp:
            fp.write(h)
    

Complete code

import requests
from lxml import etree
import os

url_start = 'http://www.netbian.com/dongman/'
url_list = ['http://www.netbian.com/dongman/index.htm']
# http://www.netbian.com/dongman/index_2.htm

if not os.path.exists('./exercise'):
    os.mkdir('./exercise')

# build the list of page URLs (pages 2-132 follow the index_N.htm pattern)
for i in range(2, 133):
    url = url_start + 'index_' + str(i) + '.htm'
    url_list.append(url)
print(url_list)

for url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    # request the page and parse it
    response = requests.get(url=url, headers=headers).text
    tree = etree.HTML(response)
    # extract the thumbnail URLs
    leaf = tree.xpath('//div[@class="list"]//ul/li/a/img/@src')
    for l in leaf:
        print(l)
        # download each thumbnail and save it under ./exercise
        h = requests.get(url=l, headers=headers).content
        i = 'exercise/' + l.split('/')[-1]
        with open(i, 'wb') as fp:
            fp.write(h)
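A small optional refinement, not part of the original code: since we send many requests in a row, it can help to check the response status and pause briefly between downloads. A sketch of what the inner loop could look like with these additions (the 0.5-second delay is an arbitrary choice):

import time

for l in leaf:
    r = requests.get(url=l, headers=headers)
    if r.status_code != 200:  # skip thumbnails that fail to download
        continue
    name = 'exercise/' + l.split('/')[-1]
    with open(name, 'wb') as fp:
        fp.write(r.content)
    time.sleep(0.5)  # small pause between downloads to stay polite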

1.2 Choosing a crawling strategy - high-definition images

The crawler we just wrote only grabs the wallpaper thumbnails. To get the high-definition versions, we need to change our strategy. Reopen the developer tools and look again: the img tag we crawled before sits inside an a tag with an href attribute, and following that link leads to a page with the high-definition image.

So our crawling strategy becomes: extract the content of that href attribute, send a request to the page it points to, then find the img tag on that page and send one more request for the actual image.

We use regular expressions to extract the href values. For a pattern like this, regular expressions can be a simpler way to extract data than XPath syntax; for the details of the syntax, see the documentation of Python's re module.
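As a quick illustration (a toy example with made-up page numbers, not real page content): re.findall returns every non-overlapping match of the pattern, and the re.S flag lets . also match newlines:

import re

html = '<a href="/desk/23083.htm"><img src="small1.jpg"></a>\n<a href="/desk/23084.htm"><img src="small2.jpg"></a>'
print(re.findall(r"desk/\d*.htm", html, re.S))
# ['desk/23083.htm', 'desk/23084.htm']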

for url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers).text
    leaf = re.findall(r"desk/\d*.htm", response, re.S)
    for l in leaf:
        url = "http://www.netbian.com/" + str(l)
        h = requests.get(url=url, headers=headers).text
        leaf_ = re.findall(r'<div class="pic">.*?(http://img.netbian.com/file/\d*/\d*/\w*.jpg)', h, re.S)

The leaf_ extracted this way contains the URLs of the high-definition images we are after. Now we only need to send one more request for each of them and save the data.

Storing data

        for l_ in leaf_:
            print(l_)
            h = requests.get(url=l_, headers=headers).content
            i = 'exercise/' + l_.split('/')[-1]
            with open(i, 'wb') as fp:
                fp.write(h)

Complete code

import requests
import os
import re

url_start = 'http://www.netbian.com/dongman/'
url_list = ['http://www.netbian.com/dongman/index.htm']

if not os.path.exists('./exercise'):
    os.mkdir('./exercise')

for i in range(2, 133):
    url = url_start + 'index_' + str(i) + '.htm'
    url_list.append(url)
print(url_list)

for url in url_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers).text
    # extract the links to the detail pages
    leaf = re.findall(r"desk/\d*.htm", response, re.S)
    for l in leaf:
        url = "http://www.netbian.com/" + str(l)
        h = requests.get(url=url, headers=headers).text
        # extract the high-definition image URL from the detail page
        leaf_ = re.findall(r'<div class="pic">.*?(http://img.netbian.com/file/\d*/\d*/\w*.jpg)', h, re.S)
        for l_ in leaf_:
            print(l_)
            h = requests.get(url=l_, headers=headers).content
            i = 'exercise/' + l_.split('/')[-1]
            with open(i, 'wb') as fp:
                fp.write(h)

2. Crawling dynamically loaded websites

We want to crawl all the wallpapers from another wallpaper website:

https://sucai.gaoding.com/topic/9080?

2.1 Choosing a crawling strategy - Selenium

First, open the developer tools and observe the page structure. You will find that not all the wallpapers on the page are loaded at once: as you drag the scroll bar down, more content keeps loading in real time. Inspecting the page elements, you can also see the lazy-image tags that indicate dynamic (lazy) loading.

Because the content is loaded dynamically, we cannot crawl the data by directly sending requests as before. Instead, we need to simulate a browser sending the request and scroll the page down, so that the content that loads in real time actually appears before we scrape it.

After looking at the page structure, look at the pagination again. I won't go into detail this time; you can easily spot the pattern yourselves.

url_list = []
for i in range(1, 4):
    url = 'https://sucai.gaoding.com/topic/9080?p={}'.format(i)
    url_list.append(url)

Web page request

Here we use Selenium, the browser automation and testing framework.

for url in url_list:
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    time.sleep(2)
    i = 0
    while i < 10:  # drag the scroll bar down to load the page
        i += 1
        driver.execute_script("window.scrollBy(0,500)")
        driver.implicitly_wait(5)  # implicit wait
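A note on the wait here: implicitly_wait only applies when Selenium looks up elements, so it does not actually pause this scrolling loop. If the lazily loaded images need more time to appear, a simple alternative is a fixed sleep after each scroll (a sketch; the 1-second delay is an assumption):

    while i < 10:
        i += 1
        driver.execute_script("window.scrollBy(0,500)")
        time.sleep(1)  # give the lazily loaded images time to appear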

Parse and extract data

    items = driver.find_elements_by_xpath("//*[@class='gdd-lazy-image__img gdd-lazy-image__img--loaded']")
    for item in items:
        href = item.get_attribute('src')
        print(href)

As for storing the data, we only need to send a request to each src URL we extracted, just as in section 1; a sketch follows.
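A minimal sketch of that save step, mirroring section 1 (it assumes requests is imported and the ./exercise directory already exists):

    for item in items:
        href = item.get_attribute('src')
        img = requests.get(url=href, headers=headers).content
        name = 'exercise/' + href.split('/')[-1].split('?')[0]  # strip any query string (assumption about the CDN URLs)
        with open(name, 'wb') as fp:
            fp.write(img)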

Complete code

from selenium import webdriver
import time
import os

if not os.path.exists('./exercise'):
    os.mkdir('./exercise')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36'
}

url_list = []
url_f_list = []
for i in range(1, 4):
    url = 'https://sucai.gaoding.com/topic/9080?p={}'.format(i)
    url_list.append(url)

for url in url_list:
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    time.sleep(2)
    i = 0
    while i < 10:  # drag the scroll bar down to load the page
        i += 1
        driver.execute_script("window.scrollBy(0,500)")
        driver.implicitly_wait(5)  # implicit wait
    items = driver.find_elements_by_xpath("//*[@class='gdd-lazy-image__img gdd-lazy-image__img--loaded']")
    for item in items:
        href = item.get_attribute('src')
        print(href)
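One last note: each pass through the loop above opens a new Chrome window and never closes it. Calling driver.quit() once a page has been processed (or creating the driver once, before the loop) keeps things tidy; a minimal sketch:

for url in url_list:
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # ... scroll and extract as above ...
    finally:
        driver.quit()  # close the browser window when this page is done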
