A Python web crawler problem: getting past a verification code (method 1)

Hello everyone, I am Pippi.

1. Introduction

A few days ago, in the Python "Strongest King" group, [鶏啊鶏] asked a Python question about a web crawler, and I am sharing it with you here.

[Screenshot: the question as posted in the group]

Here is his code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from PIL import Image
import ddddocr

ocr = ddddocr.DdddOcr()

options = webdriver.ChromeOptions()
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)

# Open the target page
driver.get('https://sol.sinosure.com.cn')
time.sleep(5)
driver.maximize_window()
# Locate the captcha image element and hover over it so the image loads
yanzhengma = driver.find_element(By.CSS_SELECTOR, '.pass-form-item.pass-form-item-code')
captcha_element = yanzhengma.find_element(By.CSS_SELECTOR, '.pass-label-img')
webdriver.ActionChains(driver).move_to_element(captcha_element).perform()
time.sleep(5)

# Get the position and size of the captcha image element
location = captcha_element.location
size = captcha_element.size
print(location)
print(size)
# Take a screenshot of the whole page
driver.save_screenshot('screenshot.png')

# Crop the captcha out of the page screenshot using the element's position and size
left = int(location['x'])
top = int(location['y'])
right = int(location['x'] + size['width'])
bottom = int(location['y'] + size['height'])
captcha_screenshot = Image.open('screenshot.png').crop((left, top, right, bottom))
print(left)
print(top)
print(location)
print(bottom)
# Save the cropped captcha image and run OCR on it
captcha_screenshot.save('captcha.png')
with open('captcha.png', 'rb') as f:
    img_bytes = f.read()
res = ocr.classification(img_bytes)
print('Recognized captcha: ' + res)

The basic approach is sound, and the script does capture a screenshot of the page. However, the crop region is slightly offset from the actual position of the verification code, so the cropped image does not contain the code and it cannot be recognized correctly.

The following lines get the position and size of the captcha image element:

location = captcha_element.location
size = captcha_element.size

As I understand the documentation, this should return the position of the located element. Before cropping, I printed the coordinates and roughly checked where they land: the crop ends up near the password input, even though the element I located is the one holding the verification code. I also tried locating the larger container at that position first and then the captcha image inside it, and the problem still persists.

Those are the fan's questions. Let's take a look at the solutions.
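One possible cause of this kind of offset, which the group discussion does not confirm for this site, is display scaling: on a high-DPI screen the page screenshot is measured in device pixels, while location and size are reported in CSS pixels. Here is a minimal sketch under that assumption. crop_box is a hypothetical helper, and the ratio would come from driver.execute_script('return window.devicePixelRatio').

```python
def crop_box(location, size, pixel_ratio=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location and size are the dicts Selenium returns for an element.
    pixel_ratio is an assumption: the browser's devicePixelRatio.
    """
    left = location["x"] * pixel_ratio
    top = location["y"] * pixel_ratio
    right = (location["x"] + size["width"]) * pixel_ratio
    bottom = (location["y"] + size["height"]) * pixel_ratio
    # PIL's crop() expects integer pixel coordinates
    return tuple(int(round(v)) for v in (left, top, right, bottom))


# Hypothetical values: an element at (10, 20), 100x40 CSS pixels, ratio 2
box = crop_box({"x": 10, "y": 20}, {"width": 100, "height": 40}, 2.0)
print(box)
```

With a ratio of 2, the element above maps to the box (20, 40, 220, 120) in the screenshot. Selenium can also save just the element with captcha_element.screenshot('captcha.png'), which avoids the manual crop altogether.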

2. Implementation process

[Brother Wei] tried running the code, but the following error occurred:

[Screenshot: the error message]

This error is very common, and anyone who uses Selenium regularly will have seen it. It occurs because the local browser driver does not match the installed version of Google Chrome, so the driver needs to be replaced.

The fix is to download the driver that matches your browser version, put it in a local folder, and make sure that folder is on the PATH environment variable. This is also covered in earlier articles on the official account, and there are plenty of tutorials online, so I won't repeat it here.
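To check whether a mismatch is really the cause, you can compare the major version printed by chromedriver --version with the one shown on Chrome's "About" page. This is a small sketch of that comparison. The helper names are hypothetical and the version strings are made-up examples, not the ones from the error above.

```python
import re


def major_version(version_text):
    """Extract the major version from text such as 'ChromeDriver 114.0.5735.90 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def same_major(browser_text, driver_text):
    """True when the browser and the driver share a major version."""
    return major_version(browser_text) == major_version(driver_text)


# Made-up example strings for illustration only
print(same_major("Google Chrome 115.0.5790.102", "ChromeDriver 114.0.5735.90 (abc)"))
```

If the two major versions differ, replacing the driver as described above is the usual fix. Selenium 4.6 and later also include Selenium Manager, which can download a matching driver automatically when none is found on the PATH.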

Back to the problem at hand. [Classmate Ning] suggested a different approach: find the URL of the captcha image directly, request it with requests, and pass the response's .content straight to ocr.classification(). That way there is no need to save the image to disk and read it back as bytes with open(). The code looks like this:

[Screenshot: the code]
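The code in the screenshot is not available as text here, so the following is only a rough sketch of the same idea, not [Classmate Ning]'s actual code. The three function names are hypothetical. It also assumes the captcha is tied to the browser session, which is why the browser's cookies are passed along; whether this site needs that is not confirmed by the discussion.

```python
from urllib.parse import urljoin


def cookies_from_selenium(cookie_list):
    """Convert driver.get_cookies() output (a list of dicts) to a name -> value dict."""
    return {c["name"]: c["value"] for c in cookie_list}


def resolve_captcha_url(page_url, src):
    """Turn a possibly relative <img src> into an absolute URL."""
    return urljoin(page_url, src)


def recognize_captcha(captcha_url, cookies=None, user_agent=None):
    """Download the captcha bytes and pass them straight to ddddocr."""
    # Third-party imports stay local so the helpers above work without them
    import requests
    import ddddocr

    headers = {"User-Agent": user_agent} if user_agent else {}
    resp = requests.get(captcha_url, cookies=cookies, headers=headers, timeout=10)
    resp.raise_for_status()
    ocr = ddddocr.DdddOcr()
    # resp.content is already bytes, so no file needs to be written or reopened
    return ocr.classification(resp.content)


# Possible use with the Selenium objects from the script above:
# src = captcha_element.get_attribute("src")
# url = resolve_captcha_url(driver.current_url, src)
# text = recognize_captcha(url, cookies_from_selenium(driver.get_cookies()))
```

On many sites each request to the captcha URL generates a new code, so the text recognized this way may differ from the image currently shown in the browser. Sharing the session cookies is what lets the server accept the newly fetched code.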

This solved the fan's problem. If you are not familiar with requests and BeautifulSoup, it may take a little more effort to follow.

This is only one of the methods. Another method is covered in the next article, so stay tuned!

3. Summary

Hello everyone, I am Pippi. This article reviewed a Python web crawler problem about getting past verification codes, and gave a concrete analysis and code to help the fan solve it.

Finally, thanks to [鶏啊鶏] for asking the question, to [Classmate Ning] and [Brother Wei] for their ideas and code analysis, and to [Ineverleft] and others for taking part in the discussion.

[Tips for asking questions] A friendly reminder for when you ask questions in the group: if large data files are involved, desensitize the data and send a small demo file; paste the relevant code as text that can be copied; and include a complete screenshot of the error. If the code is short, send it as text directly. If it exceeds 50 lines, send a .py file.

Origin blog.csdn.net/pdcfighting/article/details/131335787