2020-1-19 Web crawler introduction 21: recognizing verification codes (simple typed-in codes, Tesseract-OCR)

While crawling, you will sometimes run into a page that asks for a verification code.
The luckiest case is a code like the one in the picture below:
[picture]
This type of verification code is the easiest to recognize.

Posts online describe doing this with pytesser, which is based on Google's open-source OCR project.
I tried it for a long time and could not get it to work; I do not know why. Anyone interested can look into it. Its page looks like this:
[picture]
You need to download two files; search Baidu for them yourself. Posts from many years ago all use this approach.

The method described below is the one that worked for me. It also relies on Tesseract-OCR.

My environment: 64-bit Win10 + Python 2.7

step1. Install the PIL library
Image recognition requires image processing, so PIL (Python Imaging Library) must be installed.
I recommend installing it through pillow, the maintained fork of PIL, as follows:

pip install pillow

For an introduction to pillow, see the link.

step2. Install Tesseract-OCR
Download it from https://github.com/UB-Mannheim/tesseract/wiki
I downloaded tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe
Installing with the default options is enough.

step3. Configuration
3-1. Append the directory where you just installed tesseract-ocr to the PATH environment variable.
If everything is set up correctly, the following is displayed:

C:\>tesseract -v
tesseract v5.0.0-alpha.20191030
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

3-2. Add an environment variable named TESSDATA_PREFIX whose value is your Tesseract-OCR installation directory\tessdata. This is the directory that contains the file eng.traineddata.
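The two environment settings from 3-1 and 3-2 can be checked from Python. The sketch below is only an illustration: the function names find_on_path and check_tesseract_env are made up for this example, and it assumes the executable is called tesseract.exe and that eng.traineddata sits directly inside the TESSDATA_PREFIX directory, as described above.

```python
import os

def find_on_path(exe_name, path_value):
    """Return the full path of exe_name in the first PATH directory that has it, else None."""
    for directory in path_value.split(os.pathsep):
        directory = directory.strip().strip('"')
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, exe_name)
        if os.path.isfile(candidate):
            return candidate
    return None

def check_tesseract_env(environ, exe_name="tesseract.exe"):
    """Return a list of problems found in the given environment mapping (empty if none)."""
    problems = []
    if find_on_path(exe_name, environ.get("PATH", "")) is None:
        problems.append("%s not found in any PATH directory" % exe_name)
    tessdata = environ.get("TESSDATA_PREFIX", "")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isfile(os.path.join(tessdata, "eng.traineddata")):
        problems.append("eng.traineddata not found in " + tessdata)
    return problems

if __name__ == "__main__":
    # Check the real environment of the current process.
    for problem in check_tesseract_env(os.environ) or ["no problems found"]:
        print(problem)
```

Environment variables changed in the Windows dialog are only seen by newly started programs, so run the check from a freshly opened command prompt.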

3-3. Edit pytesseract.py in Lib\site-packages\pytesseract under your Python installation.
This file contains the following line:

tesseract_cmd = 'tesseract'

Change it to

tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'

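That is the full path to tesseract.exe.
Note: write the path with forward slashes (/) as shown, otherwise it fails. In an ordinary Python string a backslash starts an escape sequence, so the \t in \tesseract.exe would be read as a tab. If you do not believe it, try it yourself.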
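Why the slash direction matters can be seen in a few lines of plain Python. This is a small sketch, not part of the original setup; the variable names are made up for the example. It also shows an alternative to editing the installed file: pytesseract exposes the same tesseract_cmd setting as pytesseract.pytesseract.tesseract_cmd, so it can be assigned from your own script instead.

```python
# "\t" inside an ordinary string literal is a TAB character, so the
# "\tesseract.exe" part of a backslash path is silently damaged.
damaged_tail = "\tesseract.exe"
print(repr(damaged_tail))

# Two safe spellings of the same location: a raw string, or forward slashes.
raw_path = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
slash_path = "C:/Program Files/Tesseract-OCR/tesseract.exe"
same_location = raw_path.replace("\\", "/") == slash_path
print(same_location)

# Optional: set the path at runtime instead of editing pytesseract.py.
try:
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = slash_path
except ImportError:
    print("pytesseract is not installed")
```

Setting the path at runtime has the advantage that it survives a reinstall or upgrade of pytesseract, whereas an edit inside site-packages is overwritten.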

step4. Usage
Let's try it on the verification code below.
[picture]
Code:

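#coding:utf-8
from PIL import Image, ImageEnhance
import pytesseract

# Open the saved verification code picture
im = Image.open("yzm.aspx.jfif")
# Convert to grayscale to simplify the image
image = im.convert('L')
# Build a contrast enhancer and double the contrast
im2 = ImageEnhance.Contrast(image)
im3 = im2.enhance(2.0)

# Run Tesseract OCR on the enhanced image and print the recognized text
text = pytesseract.image_to_string(im3)
print(text)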
  

yzm.aspx.jfif is the file name under which the verification picture from the site was saved.
Run result:

PFRD
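The script above converts the picture to grayscale and raises the contrast. A further step that is sometimes applied to verification codes is binarization, where every pixel becomes pure black or pure white. This is not something the script above does; the sketch below is one possible way to add it with Pillow's Image.point and a lookup table. The function names and the threshold of 128 are assumptions chosen for illustration, and the best threshold depends on the picture.

```python
def threshold_table(threshold):
    """Build a 256-entry lookup table: gray values below threshold become 0, others 255."""
    return [0 if value < threshold else 255 for value in range(256)]

def binarize(image, threshold=128):
    """Return a black-and-white copy of a PIL image (128 is an assumed default)."""
    # convert("L") gives one 8-bit gray band, so a 256-entry table fits point().
    return image.convert("L").point(threshold_table(threshold))

def recognize(path, threshold=128):
    """Open a picture, binarize it and pass it to Tesseract through pytesseract."""
    # Imported here so the two helpers above can be used without these packages.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(binarize(Image.open(path), threshold))
```

With the file from this step, calling recognize("yzm.aspx.jfif") would run the whole chain. Whether binarization helps or hurts has to be tried per site.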

Supplementary
I then tested a verification code from another site (irregular spacing between characters, and the characters are slanted).
[picture]
The result was 6823, so this one can be recognized too.
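When a site's codes are known to contain only certain characters, such as the digits in this second test, the recognition can be narrowed down. The sketch below is a hedged illustration, not what was done above: build_config and clean_code are made-up helper names. It relies on two Tesseract options, --psm 7 (treat the image as a single line of text) and tessedit_char_whitelist, and whether the whitelist is honored can depend on the Tesseract version and engine mode. The second helper filters the returned text, since OCR output may contain spaces or trailing line breaks.

```python
import string

def build_config(allowed_chars, psm=7):
    """Build a Tesseract option string limiting output to allowed_chars."""
    return "--psm %d -c tessedit_char_whitelist=%s" % (psm, allowed_chars)

def clean_code(raw_text, allowed_chars):
    """Keep only the characters a verification code may contain."""
    return "".join(ch for ch in raw_text if ch in allowed_chars)

DIGITS = string.digits
print(build_config(DIGITS))               # --psm 7 -c tessedit_char_whitelist=0123456789
print(clean_code(" 68 23\n\x0c", DIGITS))  # 6823
```

The option string would be passed through the config parameter, for example pytesseract.image_to_string(im3, config=build_config(DIGITS)), and the returned text then run through clean_code.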


Origin blog.csdn.net/weixin_42555985/article/details/104041891