CAPTCHAs are said to be a hurdle for crawlers, but I can get past them with just five lines of code.

Foreword

I believe many beginners who are new to web crawlers run into CAPTCHAs (verification codes) while learning. A CAPTCHA is essentially an anti-crawling measure, because running a crawler puts extra load on the website. This example is therefore intended for learning and discussion only.

A while ago, I shared a way to recognize CAPTCHAs with Python code.

At that time I used pillow + pytesseract, which is free and easy to use, but its recognition accuracy is only average. Beginners who need higher accuracy have had little choice but to use the Baidu API.

However, both the Baidu API and pytesseract have to be configured in advance, which is not very beginner-friendly.

Moreover, the Baidu API requires an Internet connection, so anyone who cannot get online has to give up on it.
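For comparison, the pillow + pytesseract approach might look roughly like the sketch below. This is not the code from the earlier post: the function names clean_text and recognize_with_tesseract, the threshold of 140, and the "--psm 7" option (treat the image as a single line of text) are all assumptions made for illustration. It also assumes the Tesseract program itself has been installed separately.

```python
import string

# Characters we expect in a simple CAPTCHA (an assumption for this sketch).
ALLOWED = set(string.ascii_letters + string.digits)


def clean_text(raw):
    """Keep only ASCII letters and digits from the raw OCR output."""
    return "".join(ch for ch in raw if ch in ALLOWED)


def recognize_with_tesseract(image_path, threshold=140):
    """Binarize the image, run Tesseract on it, and return the cleaned text."""
    # Imported lazily so clean_text() works even without these packages.
    from PIL import Image
    import pytesseract

    gray = Image.open(image_path).convert("L")  # convert to grayscale
    # Pixels brighter than the threshold become white, the rest black.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    raw = pytesseract.image_to_string(binary, config="--psm 7")
    return clean_text(raw)
```

Calling recognize_with_tesseract('1.png') would return the recognized text. The threshold usually has to be tuned for each kind of image.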

Recently, a member of my group chat shared a new library. I tried it out, found it very useful, and am sharing it with you today.

GitHub address: follow the public account Python Gu Muzi to get it.

The library's name is also interesting: ddddocr (from the pinyin initials of "dai dai di di OCR", roughly "give little brother a hand with OCR").

Environment requirements:

python >= 3.8
Windows/Linux/macOS

The library can be installed with the following command:

pip install ddddocr

Parameter Description:

I picked a random CAPTCHA image from the Internet and tried the library on it.

Source: Baidu search

import ddddocr
ocr = ddddocr.DdddOcr()              # load the recognition model once
with open('1.png', 'rb') as f:       # read the image as raw bytes
    img_bytes = f.read()
res = ocr.classification(img_bytes)  # returns the recognized text
print(res)

Successfully recognized the verification code text!

The advantages are obvious. First, the code is very compact: unlike the two methods mentioned above, there is no need to set extra environment variables, and a CAPTCHA image can be recognized with about five lines of code. Second, timing the code with the %%time magic command in a notebook shows that recognition is very fast.
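The %%time magic only works in Jupyter/IPython. In a plain Python script, a rough equivalent can be sketched with time.perf_counter. The helpers time_call and time_ddddocr are hypothetical names introduced here, and the sketch assumes ddddocr is installed and the image file exists.

```python
import time


def time_call(func, *args, repeats=5):
    """Call func(*args) `repeats` times; return (last result, best time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return result, best


def time_ddddocr(image_path, repeats=5):
    """Time ocr.classification on one image (model loading is not included)."""
    import ddddocr  # imported lazily; assumes the package is installed

    ocr = ddddocr.DdddOcr()
    with open(image_path, "rb") as f:
        img_bytes = f.read()
    return time_call(ocr.classification, img_bytes, repeats=repeats)
```

For example, text, seconds = time_ddddocr('1.png') gives the recognized text and the fastest of five runs. Taking the best of several runs reduces noise from other processes on the machine.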

Let's continue testing with more verification code pictures:

I tested another 6 CAPTCHA images. The results show that simple CAPTCHAs of this type can basically be recognized quickly. There are some problems, though: letter case is not always distinguished (see the 6th image).
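A test like this can be scripted so that the case problem shows up as a number. The sketch below is only an illustration: recognize_files and accuracy are hypothetical helpers, and the expected answers have to be typed in by hand. The classify argument is any function that takes image bytes and returns text, such as ocr.classification.

```python
from pathlib import Path


def recognize_files(paths, classify):
    """Return {file name: recognized text} for every image path."""
    results = {}
    for path in paths:
        data = Path(path).read_bytes()  # the recognizer expects raw bytes
        results[Path(path).name] = classify(data)
    return results


def accuracy(predictions, labels, ignore_case=False):
    """Fraction of labelled images whose prediction matches the label exactly."""
    if not labels:
        return 0.0
    hits = 0
    for name, expected in labels.items():
        got = predictions.get(name, "")
        if ignore_case:
            got, expected = got.lower(), expected.lower()
        if got == expected:
            hits += 1
    return hits / len(labels)
```

Comparing accuracy(predictions, labels) with accuracy(predictions, labels, ignore_case=True) shows how many of the misses are caused by letter case alone. If the target website does not check case, the second number is the one that matters.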

All in all, if you need CAPTCHA recognition and your accuracy requirements are not too high, this library is a good choice.

Some people will surely say: you are just using someone else's library, so what would you do without it? In fact, much of Python's appeal lies in calling existing modules and frameworks. It is nice to get the desired result without reinventing the wheel.

Remember to follow the official account Python Gu Muzi to get the complete project code.


Origin blog.csdn.net/TZ45678/article/details/124781081