Chapter 4 of Advanced Web Crawling: CAPTCHA Recognition Techniques

Recognizing image CAPTCHAs:

CAPTCHAs get in the way of our crawlers: a site may present one at login or when certain data is requested. This chapter therefore introduces a technique for turning an image into text. Converting images to text is generally called Optical Character Recognition, or OCR for short. Few libraries implement OCR, and open-source ones are rarer still, because the field has real technical barriers (it requires large amounts of data plus knowledge of algorithms, machine learning, and deep learning) and a library that does it well has high commercial value. Here we introduce a relatively good open-source image recognition library: Tesseract.

Tesseract:

Tesseract is an OCR library whose development is currently sponsored by Google. It is widely regarded as the best and most accurate open-source OCR library. Besides its high recognition accuracy, it is also very flexible: it can be trained to recognize any font.

Installation:

Windows System:

Download the installer from the link below, then click Next through the wizard to install it (choose an install path that contains only English characters and does not require elevated permissions):
https://github.com/tesseract-ocr/

Linux systems:

You can download the source code and compile it yourself by following the link below.
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
Or, on Ubuntu, install it with the following command:

sudo apt install tesseract-ocr

Mac System:

Homebrew makes installation easy:

brew install tesseract

Set the environment variable:

After installation, set the environment variables if you want to use Tesseract from the command line. On Mac and Linux this is already done by default at install time. On Windows, add the directory containing tesseract.exe to the PATH environment variable.

One more environment variable is needed: the path to the trained data files.
Add TESSDATA_PREFIX=C:\path_to_tesseractdata\teseractdata to the environment variables.
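
A quick way to confirm both settings is to look them up from Python. The following is a minimal sketch using only the standard library; the function name check_tesseract_environment is made up for this illustration and is not part of any Tesseract tooling.

```python
import os
import shutil


def check_tesseract_environment():
    """Report where the tesseract executable and its data directory are found."""
    return {
        # Full path of the tesseract executable if it is on PATH, else None.
        "executable": shutil.which("tesseract"),
        # Value of TESSDATA_PREFIX if it is set, else None.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    for name, value in check_tesseract_environment().items():
        print(f"{name}: {value or 'not set'}")
```

If either entry prints as not set, open a new terminal after changing the variables, since a terminal that is already running keeps the old values.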

Recognizing an image with tesseract on the command line:

To use the tesseract command in cmd, the directory containing tesseract.exe must be on the PATH environment variable. Then run the command as tesseract <image path> <output file path>.
Example:

tesseract a.png a

This recognizes the text in a.png and writes it to a.txt. To print the result to the terminal instead of writing a file, pass stdout in place of the output file name.
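
The same command can also be launched from a script. This is only a sketch, assuming tesseract is on PATH; the helper names build_tesseract_command and ocr_via_cli are invented for this example. It relies on two real features of the command line tool: the special output name stdout, which prints the text instead of writing a file, and the -l option, which selects the language.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang=None):
    """Build the argument list for the tesseract command line tool."""
    command = ["tesseract", image_path, output_base]
    if lang:
        command += ["-l", lang]  # -l selects the recognition language
    return command


def ocr_via_cli(image_path, lang=None):
    """Run tesseract and return the recognized text from standard output."""
    completed = subprocess.run(
        build_tesseract_command(image_path, "stdout", lang),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout
```

For example, build_tesseract_command("a.png", "a") produces the same arguments as the command shown above, and ocr_via_cli("a.png") returns the text that would otherwise go to a.txt.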

Using Tesseract for image recognition in code:

To drive tesseract from Python code, install a library called pytesseract. It can be installed with pip:

pip install pytesseract

 

Reading the image requires the third-party library PIL, which is now distributed under the package name Pillow. Use pip list to check whether it is installed. If not, install it with pip:

pip install Pillow
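
The check can also be done from Python itself. The sketch below uses importlib from the standard library; the function name missing_modules is hypothetical. Note that the import name is PIL even though the package published on PyPI is named Pillow.

```python
import importlib.util


def missing_modules(module_names=("pytesseract", "PIL")):
    """Return the names from module_names that cannot be imported."""
    return [
        name
        for name in module_names
        # find_spec returns None when no importable module has this name.
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "nothing")
```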

 

Sample code that uses pytesseract to convert the text in an image to a string:

# Import the pytesseract library
import pytesseract
# Import the Image module
from PIL import Image

# Specify the path to tesseract.exe (needed when it is not on PATH)
pytesseract.pytesseract.tesseract_cmd = r'D:\ProgramApp\TesseractOCR\tesseract.exe'

# Open the image
image = Image.open("a.png")
# Call image_to_string to convert the picture to text
text = pytesseract.image_to_string(image)
print(text)
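
CAPTCHA images usually carry background noise, so recognition often improves if the image is converted to pure black and white first. The following is a rough sketch of that idea, not part of the original sample: the threshold value 140 and both function names are assumptions chosen for illustration, and the right threshold depends on the images being processed.

```python
def build_threshold_table(threshold=140):
    """Return a 256-entry lookup table mapping gray levels to black or white."""
    # Levels above the threshold become white (255), the rest black (0).
    return [255 if level > threshold else 0 for level in range(256)]


def recognize_binarized(path, threshold=140):
    """Grayscale and binarize the image, then hand it to Tesseract."""
    # Imported here so the helper above works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    gray = Image.open(path).convert("L")                    # 8-bit grayscale
    binary = gray.point(build_threshold_table(threshold))   # apply lookup table
    return pytesseract.image_to_string(binary, lang="eng")
```

Calling recognize_binarized("a.png") then behaves like the sample above, except that Tesseract sees the cleaned-up image.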

 

Handling Lagou's image CAPTCHA with pytesseract:

import pytesseract
from urllib import request
from PIL import Image
import time


pytesseract.pytesseract.tesseract_cmd = r"D:\ProgramApp\TesseractOCR\tesseract.exe"


while True:
    captchaUrl = "https://passport.lagou.com/vcode/create?from=register&refresh=1513081451891"
    request.urlretrieve(captchaUrl,'captcha.png')
    image = Image.open('captcha.png')
    text = pytesseract.image_to_string(image,lang='eng')
    print(text)
    time.sleep(2)
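
The string returned by image_to_string often contains spaces, line breaks, or stray punctuation. A small cleanup step can filter these out before the guess is used. This is a sketch under two assumptions that the original code does not state: that the CAPTCHA consists only of letters and digits, and that its length is known in advance. The function name clean_captcha_text is made up for this example.

```python
import re


def clean_captcha_text(raw_text, expected_length=None):
    """Keep only ASCII letters and digits from OCR output.

    Returns None when expected_length is given and the cleaned text has a
    different length, which signals that the guess should be discarded.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw_text)
    if expected_length is not None and len(cleaned) != expected_length:
        return None
    return cleaned
```

Inside the loop above, the result of image_to_string could be passed through clean_captcha_text, and a new image fetched whenever it returns None.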

 

Origin www.cnblogs.com/lcy0302/p/11019266.html