Python3 Crawler in Practice 41: Recognizing Graphical CAPTCHAs

In this section we first try to recognize the simplest kind of verification code, the graphical CAPTCHA. It is the oldest kind and is still very common, and it usually consists of four letters or digits. The registration page of CNKI (HowNet) uses such a CAPTCHA. The link is http://my.cnki.net/elibregister/commonRegister.aspx, and the page is shown in Figure 8-1:

Figure 8-1 HowNet registration page

The last field of the form is a graphical CAPTCHA. We must enter the characters in the picture exactly before the registration can be completed.

1. Objective of this section

In this section we take the CNKI (HowNet) CAPTCHA as an example and explain how to use OCR to recognize this kind of graphical CAPTCHA.

2. Preparation

Recognizing graphical CAPTCHAs requires the Tesserocr library. If it is not installed yet, refer to the installation instructions in the installation chapter.

3. Getting the CAPTCHA

To make the experiment easier, we first save the CAPTCHA image locally for testing.

Open the browser's Developer Tools and find the CAPTCHA element. It is an image whose src attribute is CheckCode.aspx. Open this link directly and you will see a CAPTCHA image. Right-click it and save it under the name code.jpg, as shown in Figure 8-2:

![Figure 8-2](https://upload-images.jianshu.io/upload_images/17885815-4e576633baeea957.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Figure 8-2 The CAPTCHA

This gives us a CAPTCHA image to use in the recognition tests below.
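Instead of saving the image by hand, the download can also be scripted. The following is only a minimal sketch. It assumes that the relative src value CheckCode.aspx resolves against the registration page URL, that the server needs no cookies or special headers, and that the page is still available. The helper names captcha_url and save_captcha are hypothetical and chosen for illustration.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Registration page from this section; the image src is relative to it.
PAGE_URL = 'http://my.cnki.net/elibregister/commonRegister.aspx'


def captcha_url(page_url, src):
    """Resolve the (relative) src attribute against the page URL."""
    return urljoin(page_url, src)


def save_captcha(url, path, timeout=10):
    """Download the image at url, write it to path, and return the byte count."""
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    with open(path, 'wb') as f:
        f.write(data)
    return len(data)


# Example usage (performs a network request, so it is left commented out):
# save_captcha(captcha_url(PAGE_URL, 'CheckCode.aspx'), 'code.jpg')
```

Each request will most likely return a different CAPTCHA, so the saved file will not match the figures in this section.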

4. Recognition test

Next we create a new project, put the CAPTCHA image in the project root directory, and try to recognize it with the Tesserocr library, as follows:


import tesserocr
from PIL import Image

image = Image.open('code.jpg')
result = tesserocr.image_to_text(image)
print(result)

Here we first create an Image object and then call Tesserocr's image_to_text() method, passing in the Image object, to perform the recognition. The implementation is very simple. The recognition result is as follows:

JR42
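OCR output may carry trailing newlines or stray spaces, so it can help to normalize the string before submitting it. The sketch below is an illustration rather than part of Tesserocr. The function name clean_captcha_text is hypothetical, and the default length of four follows the description of these CAPTCHAs at the start of this section.

```python
import re


def clean_captcha_text(raw, expected_length=4):
    """Keep only letters and digits; return None if the length is unexpected."""
    text = re.sub(r'[^0-9A-Za-z]', '', raw)
    return text if len(text) == expected_length else None


print(clean_captcha_text('JR42\n\n'))  # whitespace around the code is dropped
print(clean_captcha_text('FFK'))       # too short, so the result is rejected
```

Returning None for an unexpected length makes it easy to detect a failed recognition and fetch a new CAPTCHA.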

In addition, Tesserocr offers an even simpler method that converts an image file directly to a string and achieves the same effect, as follows:

import tesserocr

print(tesserocr.file_to_text('image.png'))

However, in testing, the recognition results of this method were not as good as those of the first method.

5. Processing the CAPTCHA

The recognition above posed essentially no difficulty: we just created an Image object and called the image_to_text() method to get the recognition result.

Next, we try another CAPTCHA, saved as code2.jpg and shown in Figure 8-3:

Figure 8-3 The CAPTCHA

Test again with the following code:


import tesserocr
from PIL import Image

image = Image.open('code2.jpg')
result = tesserocr.image_to_text(image)
print(result)

Then you can see the following output:


FFKT

The recognition result deviates from the actual content, because the extra lines in the CAPTCHA interfere with recognition.

In such cases we need some additional processing, such as grayscale conversion and binarization.

We can call the Image object's convert() method with the argument L to convert the image to grayscale, as follows:


image = image.convert('L')
image.show()

Passing 1 instead binarizes the image:

image = image.convert('1')
image.show()

In addition, we can specify the binarization threshold ourselves. The conversion above uses a default threshold of 127 (note that Pillow also applies dithering by default when converting to mode 1). A custom threshold cannot be applied by converting the original image directly. Instead, we first convert the image to grayscale and then binarize it with the chosen threshold, as follows:


image = image.convert('L')
threshold = 80
# Lookup table: gray levels below the threshold map to 0 (black), the rest to 1 (white)
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
image = image.point(table, '1')
image.show()

Here the variable threshold holds the binarization threshold, which is set to 80. The result after processing is shown in Figure 8-4:

Figure 8-4 Processing result
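The thresholding step can be wrapped in a small reusable helper. The sketch below is one possible way to do this, not the only one. The name binarize is hypothetical, and it writes 255 for white where the code in this section writes 1. A synthetic gradient image stands in for a real CAPTCHA so that the effect of the threshold can be inspected without any image file.

```python
from PIL import Image


def binarize(image, threshold=127):
    """Return a black-and-white (mode '1') copy of image.

    Gray levels below threshold become black (0); all others become white.
    """
    gray = image.convert('L')
    table = [0 if level < threshold else 255 for level in range(256)]
    return gray.point(table, '1')


# Synthetic 256x1 gradient: the pixel at x has gray level x.
gradient = Image.new('L', (256, 1))
gradient.putdata(list(range(256)))

result = binarize(gradient, threshold=80)
print(result.mode)               # the output is a bilevel image
print(result.getpixel((79, 0)))  # gray level 79 is below the threshold
print(result.getpixel((80, 0)))  # gray level 80 is not
```

With a real CAPTCHA, binarize(Image.open('code2.jpg'), threshold=80) would replace the conversion, table, and point() steps shown above.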

After processing, the interfering lines in the original CAPTCHA are gone and the whole image is black and white. Now we recognize the CAPTCHA again, as follows:


import tesserocr
from PIL import Image

image = Image.open('code2.jpg')

# Convert to grayscale, then binarize with an explicit threshold
image = image.convert('L')
threshold = 127
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
image = image.point(table, '1')

# Run OCR on the cleaned-up image
result = tesserocr.image_to_text(image)
print(result)

Running it, we find that the result becomes:

PFRT

The CAPTCHA is now recognized correctly.

For CAPTCHA images with visible interference, grayscale conversion and binarization improve the recognition accuracy.
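Which threshold works best depends on the image, so one option is to try a few values and keep the first plausible result. The sketch below illustrates that idea under stated assumptions and is not a method from this section. The name recognize_with_thresholds, the candidate thresholds, and the four-character check are all hypothetical. The recognize argument is any callable that takes an image and returns text, for example tesserocr.image_to_text.

```python
import re

from PIL import Image


def binarize(image, threshold):
    """Grayscale the image, then map levels below threshold to black."""
    table = [0 if level < threshold else 255 for level in range(256)]
    return image.convert('L').point(table, '1')


def recognize_with_thresholds(image, recognize, thresholds=(127, 100, 80)):
    """Return (threshold, text) for the first threshold whose OCR output
    cleans up to exactly four letters or digits, else (None, None)."""
    for threshold in thresholds:
        raw = recognize(binarize(image, threshold))
        text = re.sub(r'[^0-9A-Za-z]', '', raw)
        if len(text) == 4:
            return threshold, text
    return None, None


# Example usage (requires tesserocr and the image file, so it is commented out):
# import tesserocr
# print(recognize_with_thresholds(Image.open('code2.jpg'), tesserocr.image_to_text))
```

A four-character result is only a plausibility check. It does not guarantee that the recognized characters are correct.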

6. Conclusion

In this section we learned how to use Tesserocr to recognize CAPTCHAs. For simple graphical CAPTCHAs we can use it directly to get the result. To improve recognition accuracy, we can also preprocess the CAPTCHA image first.

Origin blog.51cto.com/14445003/2427542