In this blog post, we will learn how to deal with websites' anti-crawler strategies so that we can scrape the required data smoothly. We will discuss the following:
Table of contents
1. Anti-crawler strategy and its reasons
2. Set the request header (User-Agent)
3. Use proxy IP
4. Request frequency limit
5. Captcha processing
6. Use distributed crawlers
7. Dynamic web crawling
8. Ethical issues of crawlers
1. Anti-crawler strategy and its reasons
An anti-crawler strategy is a set of technical measures a website adopts to prevent web crawlers from obtaining its content. The main reasons for these strategies include:
- Protect data privacy and intellectual property
- Prevent excessive load on the server
- Protect against malicious behavior (e.g. data theft, data grabbing by competitors)
To deal with these strategies, we need to take corresponding measures when writing web crawlers.
2. Set the request header (User-Agent)
User-Agent is an HTTP request header sent to the server to identify the type of client (such as a browser). By setting different User-Agent values, we can simulate different browsers and devices, which reduces the risk of the crawler being identified.
Here is an example of setting request headers using the Python requests library:
import requests

url = 'https://www.example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html_content = response.text
else:
    print(f'Error {response.status_code}: Could not fetch the webpage.')
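As mentioned above, we can simulate different browsers by varying the User-Agent. A common pattern is to rotate through a small pool of User-Agent strings, picking one at random per request. Here is a minimal sketch, assuming a hypothetical user_agents list with placeholder example strings:

import random
import requests

# Placeholder pool of User-Agent strings; replace with real, up-to-date ones
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

url = 'https://www.example.com'
# Pick a different User-Agent for each request
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)
print(response.status_code)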
3. Use proxy IP
Using a proxy IP means accessing the target website through a third-party server (a proxy server) to hide our real IP address. This makes it harder for the target website to block our crawler.
Here's an example using the Python requests library and a free proxy IP:
import requests

url = 'https://www.example.com'
proxy = {
    'http': 'http://123.45.67.89:8080',
    'https': 'http://123.45.67.89:8080'
}
response = requests.get(url, proxies=proxy)
if response.status_code == 200:
    html_content = response.text
else:
    print(f'Error {response.status_code}: Could not fetch the webpage.')
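A single free proxy is often unreliable, so in practice it helps to keep a small pool of proxies and try them in turn, skipping any that fail. Here is a minimal sketch, assuming a hypothetical proxy_pool list with placeholder addresses:

import requests

url = 'https://www.example.com'
# Placeholder proxy pool; replace with working proxy addresses
proxy_pool = [
    'http://123.45.67.89:8080',
    'http://98.76.54.32:3128',
]

html_content = None
for proxy_address in proxy_pool:
    proxies = {'http': proxy_address, 'https': proxy_address}
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        if response.status_code == 200:
            html_content = response.text
            break
    except requests.RequestException:
        # This proxy failed or timed out; try the next one
        continue

if html_content is None:
    print('All proxies failed.')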
4. Request frequency limit
In order to comply with the website's crawler policy and avoid putting excessive load on the server, we need to limit the frequency of crawler requests. This can be achieved by adding a delay between requests. Here is an example that uses the Python time and random modules to add a random delay:
import requests
import time
import random

url = 'https://www.example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
time.sleep(random.uniform(1, 3))
if response.status_code == 200:
    html_content = response.text
else:
    print(f'Error {response.status_code}: Could not fetch the webpage.')
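The example above issues only one request; when crawling many pages, the delay belongs between consecutive requests. Here is a minimal sketch of that loop (the page URLs are placeholders):

import requests
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
urls = [
    'https://www.example.com/page1',
    'https://www.example.com/page2',
    'https://www.example.com/page3',
]

for url in urls:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    # Wait 1-3 seconds before the next request to avoid overloading the server
    time.sleep(random.uniform(1, 3))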
5. Captcha processing
To deter crawlers, some websites add captchas (verification codes) to their pages. In this case, we need OCR (optical character recognition) technology or third-party services (such as captcha-solving platforms) to recognize and handle the captcha.
Here is an example that recognizes a simple captcha using the Python pytesseract library:
import requests
from PIL import Image
import pytesseract

url = 'https://www.example.com/captcha'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    with open('captcha.png', 'wb') as f:
        f.write(response.content)
    image = Image.open('captcha.png')
    captcha_text = pytesseract.image_to_string(image)
    print(f'Captcha text: {captcha_text}')
else:
    print(f'Error {response.status_code}: Could not fetch the captcha.')
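Raw captcha images are often noisy, so OCR accuracy usually improves if the image is converted to grayscale and thresholded before being passed to pytesseract. Here is a minimal sketch, assuming the captcha has already been saved as captcha.png and that the threshold value 140 is just an illustrative starting point:

from PIL import Image
import pytesseract

# Load the saved captcha and convert it to grayscale
image = Image.open('captcha.png').convert('L')
# Binarize: pixels darker than the threshold become black, the rest white
threshold = 140
image = image.point(lambda p: 0 if p < threshold else 255)
# Treat the image as a single line of text and restrict to alphanumeric characters
captcha_text = pytesseract.image_to_string(
    image,
    config='--psm 7 -c tessedit_char_whitelist=0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
)
print(captcha_text.strip())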
6. Use distributed crawlers
Distributed crawling means spreading crawling tasks across multiple machines, which reduces the risk of any single IP address being blocked and increases crawling speed. There are many technologies for implementing distributed crawlers, such as message queues (e.g. RabbitMQ) and shared data stores (e.g. Redis).
The following is a simple example of a concurrent crawler implemented using the Python multiprocessing library:
import requests
from multiprocessing import Pool

def fetch_url(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    return response.text if response.status_code == 200 else None

# The main guard is required so worker processes can be spawned safely
if __name__ == '__main__':
    urls = ['https://www.example.com/page1', 'https://www.example.com/page2', 'https://www.example.com/page3']
    with Pool(processes=3) as pool:
        html_contents = pool.map(fetch_url, urls)
    print(html_contents)
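For a truly distributed setup spanning several machines, the Redis option mentioned above can serve as a shared task queue: one process pushes URLs onto a Redis list, and crawler workers on different machines pop them off. Here is a minimal sketch using the redis-py client; the Redis host and the queue name url_queue are assumptions for illustration:

import redis
import requests

# Connect to a shared Redis instance (host and port are placeholders)
r = redis.Redis(host='localhost', port=6379, db=0)

# Producer: push the URLs to crawl onto a shared list (run once, from any machine)
for url in ['https://www.example.com/page1', 'https://www.example.com/page2']:
    r.lpush('url_queue', url)

# Worker: each crawler machine runs this loop, popping one URL at a time
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
while True:
    task = r.brpop('url_queue', timeout=5)  # blocks until a URL is available
    if task is None:
        break  # queue drained
    url = task[1].decode('utf-8')
    response = requests.get(url, headers=headers)
    print(url, response.status_code)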
7. Dynamic web crawling
For dynamically loaded web content, we need to use browser automation libraries such as Selenium or Puppeteer to simulate browser behavior.
Here's an example of using the Python selenium library to fetch dynamically loaded content:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.example.com'
# Selenium 4+ locates a matching ChromeDriver automatically via Selenium Manager
driver = webdriver.Chrome()
driver.get(url)
dynamic_content = driver.find_element(By.ID, 'dynamic-content').text
print(dynamic_content)
driver.quit()
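Dynamically loaded elements may not exist the moment the initial page load finishes, so it is usually safer to wait explicitly for them to appear. Here is a minimal sketch using Selenium's WebDriverWait; the element id dynamic-content follows the example above and is an assumption:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.example.com')
# Wait up to 10 seconds for the dynamically loaded element to appear
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'dynamic-content'))
)
print(element.text)
driver.quit()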
8. Ethical issues of crawlers
When performing web crawling, we need to abide by the website's crawling policy and follow these principles (a simple robots.txt check is sketched after the list):
- Do not crawl prohibited pages and content
- Limit the crawling speed to avoid excessive burden on the server
- Respect data privacy and intellectual property rights
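One simple way to respect a site's crawling policy is to consult its robots.txt before fetching a page. The Python standard library's urllib.robotparser can do this; the URL and crawler name below are placeholders:

from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt
robots = RobotFileParser()
robots.set_url('https://www.example.com/robots.txt')
robots.read()

user_agent = 'MyCrawler/1.0'  # placeholder crawler name
url = 'https://www.example.com/some/page'
if robots.can_fetch(user_agent, url):
    print('Allowed to crawl:', url)
else:
    print('Disallowed by robots.txt:', url)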
Summary
In this article, we discussed how to deal with website anti-crawling strategies, including setting request headers, using proxy IPs, request frequency limits, captcha processing, using distributed crawlers, and dynamic web crawling. At the same time, we also need to pay attention to the ethical issues of crawlers and abide by the crawler policy of the website.
I hope this article will help you write web crawlers in practical applications!