How to Handle Multiple Types of Sliding Verification Codes in a Python Crawler

Background: Sliding verification codes (slider CAPTCHAs) are a common anti-crawling mechanism. They block automated programs by requiring the user to drag a slider on the page to prove they are human. For developers, handling the many variants of sliding verification code in a Python crawler is a real challenge. This article shares some observations and suggestions for dealing with them.
Our goal is to develop a crawler that can automatically handle multiple types of sliding verification codes. By observing and analyzing each type, we design an algorithm that simulates how a user drags the slider, so that the crawler passes verification. The following cases explain how to deal with different verification codes.
Case 1: Use Selenium to simulate user operations. On some websites, the sliding verification code is completed simply by dragging the slider. In this case, we can use the Selenium library to automate the browser: load the page, drag the slider, and pass the verification.

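from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

# 16yun (亿牛云) crawler proxy settings
proxyHost = "u6205.5.tp.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Create the browser instance
# Note: Chrome typically ignores credentials embedded in --proxy-server, so an
# authenticated proxy may need IP whitelisting or a proxy-auth extension.
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://%s:%s@%s:%s' % (proxyUser, proxyPass, proxyHost, proxyPort))
driver = webdriver.Chrome(options=options)

try:
    # Open the target website
    driver.get("https://example.com")

    # Simulate the slide: press the slider, drag it 200 px to the right, release
    slider = driver.find_element(By.ID, "slider")
    ActionChains(driver).click_and_hold(slider).move_by_offset(200, 0).release().perform()

    # Continue with the rest of the crawl
    # ...
finally:
    # Close the browser even if a step above fails
    driver.quit()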
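
The listing above moves the slider with a single `move_by_offset(200, 0)` call, which is one instantaneous jump. As a minimal sketch of how the drag could instead be split into many small steps that start fast and slow down near the end, the helper below produces a list of horizontal offsets. The function name `build_track`, its parameters, and the cubic ease-out curve are assumptions made for illustration; they are not taken from the original article or from Selenium.

```python
import random


def build_track(distance, steps=30, jitter=2, seed=None):
    """Split `distance` pixels into `steps` integer offsets (hypothetical helper).

    The offsets follow a cubic ease-out curve (fast start, slow finish) and
    always sum to exactly `distance`.
    """
    rng = random.Random(seed)

    # Cumulative position after each step along the ease-out curve.
    positions = []
    for i in range(1, steps + 1):
        t = i / steps
        eased = 1 - (1 - t) ** 3
        positions.append(round(distance * eased))

    # Convert cumulative positions into per-step offsets.
    track = []
    previous = 0
    for pos in positions:
        track.append(pos - previous)
        previous = pos

    # Shift a few pixels between neighbouring steps; the total is unchanged.
    for i in range(len(track) - 1):
        delta = rng.randint(-jitter, jitter)
        track[i] += delta
        track[i + 1] -= delta
    return track


track = build_track(200, seed=1)
print(len(track), sum(track))
```

Each offset could then be passed to `move_by_offset(dx, 0)` between `click_and_hold(slider)` and `release()`, so that the same 200-pixel drag is performed as a sequence of small moves. Whether this is enough for a given site depends on how that site evaluates the drag.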

Case 2: Sliding verification code recognition. On some websites, the sliding verification code cannot be passed by a simple Selenium drag, because the site uses a more complex algorithm to verify the user. In this case, we can use a third-party library to recognize the CAPTCHA image. Here is sample code using Tesseract OCR through the pytesseract library. Note that OCR only reads text in the image, so this code applies when the challenge includes characters to read:

import requests
from PIL import Image
import pytesseract

# 16yun (亿牛云) crawler proxy settings
proxyHost = "u6205.5.tp.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Configure the proxy for both HTTP and HTTPS requests
proxyUrl = "http://%s:%s@%s:%s" % (proxyUser, proxyPass, proxyHost, proxyPort)
proxies = {
    "http": proxyUrl,
    "https": proxyUrl
}

# Download the CAPTCHA image
response = requests.get("https://example.com/captcha.jpg", proxies=proxies, timeout=10)
response.raise_for_status()
with open("captcha.jpg", "wb") as f:
    f.write(response.content)

# Recognize the text in the CAPTCHA image
captcha_image = Image.open("captcha.jpg")
captcha_text = pytesseract.image_to_string(captcha_image).strip()

# Submit the recognized text and continue with the rest of the crawl
data = {
    "captcha": captcha_text,
    # other form fields
}
response = requests.post("https://example.com/submit", data=data, proxies=proxies, timeout=10)

# Process the response data
# ...
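
For slider challenges where the task is to find how far to drag rather than to read text, a common alternative to OCR is to compare the complete background image with the version that shows the gap. The sketch below illustrates that idea on plain nested lists of grayscale values so that it runs without any image library. It assumes both images are available and the same size, which the original article does not state; the function name `find_gap_offset` and the threshold values are hypothetical.

```python
def find_gap_offset(full, gapped, threshold=50, min_rows=3):
    """Return the first column where the two images differ strongly, or None.

    `full` and `gapped` are equal-sized grids (lists of rows) of grayscale
    values in the range 0-255. A column counts as part of the gap when at
    least `min_rows` pixels in it differ by more than `threshold`.
    """
    height = len(full)
    width = len(full[0])
    for x in range(width):
        differing = sum(
            1 for y in range(height)
            if abs(full[y][x] - gapped[y][x]) > threshold
        )
        if differing >= min_rows:
            return x
    return None


# Synthetic example: an 8x12 light-grey image with a dark 4x4 gap
# whose left edge is at column 6.
full = [[200] * 12 for _ in range(8)]
gapped = [row[:] for row in full]
for y in range(2, 6):
    for x in range(6, 10):
        gapped[y][x] = 40

gap_x = find_gap_offset(full, gapped)
print(gap_x)  # 6
```

With Pillow, such grids could be read from real images via `Image.open(path).convert("L")` and `getpixel((x, y))`. The returned column would then serve as the drag distance, possibly after subtracting the slider's starting position.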

Case 3: Sliding verification code defense strategy. As website developers, we can also adopt strategies that make it harder for crawlers to bypass the sliding verification code. For example, randomize the required sliding distance, or record the mouse trajectory during the slide and check that it resembles human movement. Both make it harder for a crawler to pass verification. In addition, human verification services such as reCAPTCHA can be used to further improve security.
This article walked through practical cases of handling sliding verification codes in a Python crawler. By simulating the slide or recognizing the CAPTCHA image, a crawler can pass verification and collect the required data. We also outlined defense strategies that protect websites from malicious crawlers. I hope these cases and suggestions help developers cope with sliding verification codes, and encourage everyone to stay observant, thoughtful, and innovative in crawler development.
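
To make the mouse-trajectory idea concrete, here is a minimal sketch of a server-side check that flags drags which finish too quickly or move at a nearly constant speed. It assumes the page reports the drag as a list of `(timestamp_ms, x)` samples; that format, the function name `looks_automated`, and both thresholds are hypothetical choices for illustration, not values from the article or from any CAPTCHA product.

```python
from statistics import pstdev


def looks_automated(samples, min_duration_ms=300, min_speed_stdev=0.05):
    """Heuristically flag a drag trajectory as automated (hypothetical check).

    `samples` is a list of (timestamp_ms, x) points recorded during the drag.
    """
    if len(samples) < 3:
        return True  # too few points to resemble a real drag

    duration = samples[-1][0] - samples[0][0]
    if duration < min_duration_ms:
        return True  # finished faster than the chosen minimum

    speeds = []
    for (t0, x0), (t1, x1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            return True  # timestamps must strictly increase
        speeds.append((x1 - x0) / dt)

    # Nearly constant speed suggests a scripted, linear movement.
    return pstdev(speeds) < min_speed_stdev


# A perfectly linear drag: 10 px every 20 ms.
linear = [(i * 20, i * 10) for i in range(21)]
# A drag that accelerates and then slows down near the target.
varied = [(0, 0), (50, 5), (100, 30), (150, 80),
          (200, 140), (260, 180), (330, 195), (420, 200)]

print(looks_automated(linear), looks_automated(varied))
```

A real deployment would likely combine several signals and tune the thresholds against recorded traffic; this sketch only shows the shape of such a check.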

Origin blog.csdn.net/Z_suger7/article/details/132540931