In this section we start with the simplest kind of verification code: the graphical verification code. It is the oldest type and is still very common, usually consisting of four letters or digits. The registration page of the Chinese HowNet (CNKI) uses such a code. The link is http://my.cnki.net/elibregister/commonRegister.aspx , and the page is shown in Figure 8-1:
Figure 8-1 HowNet registration page
The last field of the form is a graphical verification code (CAPTCHA). We must enter the characters shown in the image correctly before we can complete the registration.
1. The objective of this section
In this section we use the HowNet verification code as an example to explain how to recognize this kind of graphical verification code with OCR technology.
2. Preparation
Recognizing graphical verification codes requires the Tesserocr library. If it is not installed yet, refer to the installation instructions chapter.
3. Get verification code
To make the experiment easier, we first save the verification code image locally for testing.
Open the developer tools and find the verification code element. It is an image whose src attribute is CheckCode.aspx. Open this link directly and you will see a verification code; right-click it and save it under the name code.jpg, as shown in Figure 8-2:
Figure 8-2 Verification code
This gives us a verification code image to use in the recognition tests below.
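Saving the image by hand is enough for a one-off test. As a rough sketch, the same step could be scripted with the standard library. It assumes the relative src value CheckCode.aspx resolves against the registration page URL and that the site still serves the image there; the helper names captcha_url and save_captcha are hypothetical.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

REGISTER_URL = 'http://my.cnki.net/elibregister/commonRegister.aspx'


def captcha_url(page_url=REGISTER_URL, src='CheckCode.aspx'):
    # Resolve the relative src attribute against the page URL
    return urljoin(page_url, src)


def save_captcha(path='code.jpg'):
    # Hypothetical helper: download one verification code image to disk.
    # Assumption: the image is still served at the resolved address.
    urlretrieve(captcha_url(), path)
    return path
```

Each request will probably return a different code, so an image saved this way need not match the one shown in the browser.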
4. Recognition test
Next we create a new project, put the verification code image in the project root directory, and try to recognize it with the Tesserocr library, as follows:
import tesserocr
from PIL import Image
image = Image.open('code.jpg')
result = tesserocr.image_to_text(image)
print(result)
Here we first create an Image object and then call Tesserocr's image_to_text() method, passing in the Image object, to complete the recognition. The implementation is very simple, and the recognition result is as follows:
JR42
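OCR output often carries trailing line breaks or stray spaces around the recognized characters. A minimal sketch of a cleanup step is shown here, assuming the verification code contains only ASCII letters and digits; the helper name clean_captcha_text is hypothetical and not part of Tesserocr.

```python
import re


def clean_captcha_text(raw):
    # Drop everything except ASCII letters and digits
    # (assumption: the code uses only these characters)
    return re.sub(r'[^0-9A-Za-z]', '', raw)
```

It would be applied to the recognition result before comparing or submitting it, for example clean_captcha_text(result).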
Tesserocr also has a simpler method that converts an image file directly to a string and serves the same purpose, as follows:
import tesserocr
print(tesserocr.file_to_text('image.png'))
However, in testing, this method recognizes less accurately than the previous one.
5. Processing the verification code
As shown above, basic image recognition is not difficult: we just create an Image object and call the image_to_text() method to obtain the recognition result.
Next, we try another verification code, saved as code2.jpg and shown in Figure 8-3:
Figure 8-3 Verification code
Re-test with the following code:
import tesserocr
from PIL import Image
image = Image.open('code2.jpg')
result = tesserocr.image_to_text(image)
print(result)
Then you can see the following output:
FFKT
The recognition result differs from the actual content. This is because the extra lines in the verification code interfere with the recognition.
In such cases we need some additional processing, such as grayscale conversion and binarization.
We can use the Image object's convert() method with the parameter L to convert the image to grayscale, as follows:
image = image.convert('L')
image.show()
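To make the grayscale step more concrete, the sketch below computes a single gray level from one RGB pixel with the ITU-R 601-2 luma weights that the Pillow documentation gives for conversion to mode L. The function name to_gray is hypothetical, and the integer division is a simplification that may differ from the library's internal rounding.

```python
def to_gray(r, g, b):
    # ITU-R 601-2 luma weights, as documented for Pillow's convert('L');
    # integer division here is a simplification of the library's rounding
    return (r * 299 + g * 587 + b * 114) // 1000
```

Green contributes most to the gray level and blue least, so a pure blue interference line ends up much darker than a pure green one.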
Passing 1 binarizes the image:
image = image.convert('1')
image.show()
We can also choose the binarization threshold ourselves. The method above uses the default threshold of 127 (in Pillow this plain cutoff applies when dithering is disabled; by default convert('1') also applies Floyd-Steinberg dithering). To use a different threshold we cannot convert the original image directly. Instead, we first convert it to grayscale and then binarize it with the chosen threshold, as follows:
image = image.convert('L')
threshold = 80
# Lookup table: gray levels below the threshold map to 0, the rest to 1
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
image = image.point(table, '1')
image.show()
Here the variable threshold represents the binarization threshold, which we set to 80. The result after processing is shown in Figure 8-4:
Figure 8-4 Processing result
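The lookup-table approach above can be wrapped in a small reusable function. This is a sketch only: the names threshold_table and binarize are hypothetical, and the image argument is assumed to be a PIL Image object.

```python
def threshold_table(threshold):
    # 256-entry lookup table: 0 below the threshold, 1 otherwise
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image, threshold=127):
    # image is assumed to be a PIL Image object; no import is needed here
    # because only its convert() and point() methods are called
    return image.convert('L').point(threshold_table(threshold), '1')
```

With this in place, binarize(image, 80) performs the same grayscale conversion and thresholding as the listing above.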
After processing, the interfering lines in the original verification code have been removed and the whole image is black and white. We then recognize the verification code again, as follows:
import tesserocr
from PIL import Image
image = Image.open('code2.jpg')
# Convert to grayscale before binarizing
image = image.convert('L')
threshold = 127
# Lookup table: gray levels below the threshold map to 0, the rest to 1
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
image = image.point(table, '1')
result = tesserocr.image_to_text(image)
print(result)
Running it now gives:
PFRT
The verification code is now recognized correctly.
For images with visible interference, grayscale conversion and binarization improve the recognition accuracy.
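Because the best threshold depends on the image, one possible extension is to try several thresholds and keep the first result that looks like a four-character code. The sketch below is an assumption-laden illustration, not part of Tesserocr: first_plausible is a hypothetical helper, and recognize is a caller-supplied function that binarizes the image with the given threshold and returns the raw OCR text.

```python
import re


def first_plausible(recognize, thresholds, pattern=r'[0-9A-Za-z]{4}'):
    # recognize: caller-supplied function mapping a threshold to raw OCR text
    for threshold in thresholds:
        # Remove whitespace such as trailing line breaks before matching
        text = re.sub(r'\s+', '', recognize(threshold))
        if re.fullmatch(pattern, text):
            return threshold, text
    # No threshold produced text of the expected shape
    return None
```

A result with the expected shape is only plausible, not necessarily correct, so it still has to be checked against the site's response.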
6. Conclusion
In this section we learned how to recognize verification codes with Tesserocr. For simple graphical verification codes we can use it directly to get the result, and to improve the recognition accuracy we can preprocess the verification code image first.