Python crawler: camouflage and anti-"anti-crawling"

Foreword

Crawler camouflage and countering anti-crawling measures are important topics in web scraping. Camouflage makes your crawler look more like an ordinary browser or application, reducing the risk of being blocked by the server; anti-"anti-crawling" means responding to the increasingly strict anti-crawler mechanisms that servers deploy. The following introduces some common camouflage and anti-"anti-crawling" techniques in detail, each with a corresponding code example.


1. User-Agent disguise

User-Agent is part of the HTTP request header and identifies the browser or application making the request. If a crawler uses the default User-Agent (such as the one requests sends) or one commonly associated with crawlers, the server can easily recognize it as a bot, so we disguise the User-Agent. The header can be set easily with Python's requests library.

import requests

# Set the User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Target URL
url = 'https://www.example.com'

# Send the request
response = requests.get(url, headers=headers)

# Print the response body
print(response.text)
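
A single hard-coded User-Agent is still a fingerprint if it never changes, so a common refinement is to rotate through a pool of real browser User-Agent strings. A minimal sketch, where the short pool below is illustrative and should be replaced with strings matching current browsers:

import random
import requests

# Illustrative pool of real browser User-Agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
]

# Pick a different User-Agent for each request
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://www.example.com', headers=headers)
print(response.status_code)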
2. IP Proxy

Frequent access to a server from a single IP address is easily banned, so we can route requests through an IP proxy instead. Proxies come in free and paid varieties; the address in the example below is a placeholder for your own proxy server. Setting a proxy is easy with the requests library.

import requests

# Configure the proxy server (replace 127.0.0.1:1080 with your own proxy;
# note that the proxy for HTTPS traffic is also addressed with http://)
proxies = {
    'http': 'http://127.0.0.1:1080',
    'https': 'http://127.0.0.1:1080'
}

# Target URL
url = 'https://www.example.com'

# Send the request through the proxy
response = requests.get(url, proxies=proxies)

# Print the response body
print(response.text)
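
With more than one proxy available, you can rotate among them and skip any that fail. A minimal sketch, assuming the pool below is a placeholder for your own proxy addresses:

import random
import requests

# Placeholder proxy pool -- replace with your own proxy addresses
PROXY_POOL = [
    'http://127.0.0.1:1080',
    'http://127.0.0.1:1081',
]

url = 'https://www.example.com'

# Try proxies in random order until one succeeds
for proxy in random.sample(PROXY_POOL, len(PROXY_POOL)):
    proxies = {'http': proxy, 'https': proxy}
    try:
        response = requests.get(url, proxies=proxies, timeout=5)
        print(response.status_code)
        break
    except requests.RequestException:
        # Proxy unreachable or timed out; try the next one
        continue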
3. Random access interval

Requesting a server at high frequency is easily identified as bot behavior, so we simulate human browsing by waiting a random interval between requests. Python's time and random libraries make this straightforward.

import requests
import time
import random

# Set the User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Target URL
url = 'https://www.example.com'

# Sleep for a random interval (1 to 3 seconds) before sending the request
time.sleep(random.uniform(1, 3))

# Send the request
response = requests.get(url, headers=headers)

# Print the response body
print(response.text)
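
Beyond a fixed random delay, it also helps to back off when the server explicitly signals rate limiting. A minimal sketch of exponential backoff with jitter; the status codes (429, 503) and the retry limit are assumptions to tune for the target site:

import time
import random
import requests

url = 'https://www.example.com'

# Retry with exponentially growing, jittered delays on rate-limit responses
for attempt in range(5):
    response = requests.get(url)
    if response.status_code not in (429, 503):
        break
    # Wait 2^attempt seconds plus random jitter before retrying
    time.sleep(2 ** attempt + random.uniform(0, 1))
print(response.status_code)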
4. Cookie masquerading

Some websites require login before their pages can be accessed, so the crawler must carry cookies to simulate a logged-in state. Cookies can be set conveniently with the requests library.

import requests

# Set the User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Set the cookie (the value is a placeholder for a real session ID)
cookies = {
    'sessionid': 'xxxx'
}

# Target URL
url = 'https://www.example.com'

# Send the request with the cookie attached
response = requests.get(url, headers=headers, cookies=cookies)

# Print the response body
print(response.text)
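
Instead of copying cookies by hand, a requests.Session can log in once and then carry the resulting cookies on every later request automatically. A minimal sketch; the login URL, form field names, and profile path are assumptions that vary per site:

import requests

# A Session persists cookies across requests
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'})

# Hypothetical login endpoint and form fields -- adjust for the real site
login_url = 'https://www.example.com/login'
session.post(login_url, data={'username': 'user1', 'password': 'password1'})

# Subsequent requests automatically carry the session cookies
response = session.get('https://www.example.com/profile')
print(response.text)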
5. Use a captcha recognition library

Some websites gate pages behind a verification code (captcha), which we can try to recognize with OCR. Here we use pytesseract, a Python wrapper for the Tesseract-OCR engine (the Tesseract binary must be installed separately).

import requests
import pytesseract
from PIL import Image

# Set the User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Request the captcha image
url = 'https://www.example.com/captcha.png'
response = requests.get(url, headers=headers)

# Save the captcha image to disk
with open('captcha.png', 'wb') as f:
    f.write(response.content)

# Run OCR on the captcha image
captcha_image = Image.open('captcha.png')
captcha_text = pytesseract.image_to_string(captcha_image)

# Print the recognized captcha text
print(captcha_text)
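
Tesseract often misreads noisy captcha images, so recognition usually improves with simple preprocessing such as grayscale conversion and binarization. A minimal sketch using Pillow; the threshold value and the character whitelist are assumptions to tune per captcha style:

import pytesseract
from PIL import Image

# Load the saved captcha and convert it to grayscale
image = Image.open('captcha.png').convert('L')

# Binarize: pixels brighter than the threshold become white, the rest black
threshold = 140  # tune per captcha style
image = image.point(lambda p: 255 if p > threshold else 0)

# Treat the image as a single line of text, restricted to alphanumerics
text = pytesseract.image_to_string(
    image,
    config='--psm 7 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')
print(text.strip())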
6. Dynamically loaded pages

Some websites load their data asynchronously with JavaScript on the front end. In that case a browser-automation tool such as Selenium is needed to render the page before parsing it. Here we use the Selenium library to simulate a browser visiting the website.

from selenium import webdriver

# Set the User-Agent header through Chrome options
options = webdriver.ChromeOptions()
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3')

# Target URL
url = 'https://www.example.com'

# Open the page with Selenium
driver = webdriver.Chrome(options=options)
driver.get(url)

# Execute JavaScript to scroll to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Get the rendered page source
response = driver.page_source

# Print the page source
print(response)

# Close the browser
driver.quit()
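
Scrolling alone does not guarantee the asynchronous data has arrived. Selenium's explicit waits block until a target element appears; a minimal sketch, where the '.item' CSS selector is a hypothetical placeholder for the element the real site loads:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.example.com')

# Wait up to 10 seconds for the asynchronously loaded elements to appear;
# '.item' is a hypothetical selector -- replace with the real one
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.item')))

print(driver.page_source)
driver.quit()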
7. Rotate multiple accounts

If a single account gets blocked for accessing the site too often, we can rotate across multiple accounts. Here we use Python's random library to pick an account at random.

import requests
import random

# Account list
users = [
    {'username': 'user1', 'password': 'password1'},
    {'username': 'user2', 'password': 'password2'},
    {'username': 'user3', 'password': 'password3'}
]

# Randomly pick an account
user = random.choice(users)

# Build the login form data
data = {
    'username': user['username'],
    'password': user['password']
}

# Post to the login URL
login_url = 'https://www.example.com/login'
response = requests.post(login_url, data=data)

# Print the response body
print(response.text)
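
To actually spread traffic across the accounts, each one needs its own session so their cookies do not mix. A minimal sketch that logs every account in once and then cycles through the sessions; the login endpoint, form fields, and data path are assumptions:

import itertools
import requests

users = [
    {'username': 'user1', 'password': 'password1'},
    {'username': 'user2', 'password': 'password2'},
]

# One Session per account keeps each account's cookies separate
sessions = []
for user in users:
    s = requests.Session()
    # Hypothetical login endpoint and form fields
    s.post('https://www.example.com/login', data=user)
    sessions.append(s)

# Cycle through the logged-in sessions for successive requests
for session in itertools.islice(itertools.cycle(sessions), 4):
    response = session.get('https://www.example.com/data')
    print(response.status_code)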

Summary

In general, the purpose of camouflage is to make a crawler's behavior look more human, while the purpose of anti-"anti-crawling" is to cope with increasingly sophisticated anti-crawler mechanisms. In a real crawling project, choose the camouflage and anti-"anti-crawling" techniques appropriate to the specific situation.

Origin blog.csdn.net/wq10_12/article/details/132229478