CAPTCHA recognition:
CAPTCHAs get in the way of our crawlers. Sometimes a site demands a CAPTCHA when we log in or request certain data. This section explains a technique for translating pictures into text. Translating a picture into text is generally called Optical Character Recognition, or OCR for short.
Not many libraries implement OCR, and open source ones are rarer still. The field has real technical barriers (it requires large amounts of data, algorithms, and knowledge of machine learning and deep learning), and a library that does it well has high commercial value, so little of this work is open source. One relatively good open source image recognition library is Tesseract.
Tesseract:
Tesseract is an OCR library, currently sponsored by Google. It is widely regarded as one of the best and most accurate open source OCR libraries. Tesseract has high recognition accuracy and is also very flexible: it can be trained to recognize any font.
Installation:
Windows:
Download the installer from the link below, then click Next through the wizard to install it. Choose an install path that contains only English characters and does not require special permissions:
https://github.com/tesseract-ocr/
Linux:
You can download the source code and compile it yourself by following the link below:
https://github.com/tesseract-ocr/tesseract/wiki/Compiling
Or, on Ubuntu, install it with the following command:
sudo apt install tesseract-ocr
Mac:
Homebrew makes installation easy:
brew install tesseract
Setting environment variables:
After installation, if you want to use Tesseract from the command line, you need to set an environment variable. On Mac and Linux this is already set up by default at install time. On Windows, add the directory that contains tesseract.exe to the PATH environment variable.
One more environment variable is needed: the path to the trained data files must also be set.
Add the following environment variable:
TESSDATA_PREFIX=C:\path_to_tesseractdata\teseractdata
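A quick way to confirm that both settings took effect is to look them up from Python. The following is a minimal sketch, not part of Tesseract itself: the helper name check_tesseract_env is made up for illustration, and it only reports what it finds without validating the data files.

```python
import os
import shutil


def check_tesseract_env(environ=None):
    """Report where tesseract was found and what TESSDATA_PREFIX is set to.

    check_tesseract_env is a hypothetical helper name used for illustration.
    """
    env = os.environ if environ is None else environ
    return {
        # shutil.which searches the directories listed in PATH
        "executable": shutil.which("tesseract", path=env.get("PATH")),
        # None means the variable has not been set
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    for name, value in check_tesseract_env().items():
        print(name, "->", value if value else "not found")
```

If "executable" comes back empty on Windows, the directory holding tesseract.exe is probably still missing from PATH.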
Recognizing an image with tesseract on the command line:
To use the tesseract command in cmd, the directory containing tesseract.exe must be in the PATH environment variable. Then use the command tesseract <image path> <output file path>.
Example:
tesseract a.png a
This recognizes the text in a.png and writes it to a.txt. If you would rather see the result in the terminal than write it to a file, leave out the output file name, or pass stdout as the output name, which Tesseract treats as standard output.
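The same command can also be launched from a script. Below is a rough sketch using Python's subprocess module; the function names build_tesseract_command and run_tesseract are illustrative, and it assumes the tesseract executable is reachable through PATH. It passes stdout as the output name so the text comes back to the caller instead of going into a file, and uses the -l option to select a language.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang=None):
    """Build the argument list for: tesseract <image> <output> [-l <lang>]."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def run_tesseract(image_path, lang=None):
    """Run tesseract and return the recognized text (assumes it is on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang=lang),
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


if __name__ == "__main__":
    print(run_tesseract("a.png"))
```

Calling build_tesseract_command("a.png", "a") yields the same command as the example above.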
Using Tesseract to recognize images in code:
To drive tesseract from Python code, you need to install a library called pytesseract. It can be installed with pip:
pip install pytesseract
Reading the image requires a third-party library called PIL. Run pip list to see whether it is already installed. If not, install it with pip (the maintained PIL package is published under the name Pillow):
pip install Pillow
Sample code that uses pytesseract to convert the text in an image into a string:
# Import the pytesseract library
import pytesseract
# Import the Image module
from PIL import Image

# Specify the path to tesseract.exe
pytesseract.pytesseract.tesseract_cmd = r'D:\ProgramApp\TesseractOCR\tesseract.exe'
# Open the image
image = Image.open("a.png")
# Call image_to_string to convert the picture to text
text = pytesseract.image_to_string(image)
print(text)
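image_to_string also accepts a config string that is handed to the tesseract command. As a sketch that goes beyond what this article covers, the snippet below builds such a string for a one-line image such as a CAPTCHA: --psm 7 tells Tesseract to treat the image as a single text line, and tessedit_char_whitelist restricts the characters it may output. The helper names build_config and read_single_line are made up for illustration, and whether the whitelist is honored depends on the Tesseract version and engine in use.

```python
def build_config(psm=7, whitelist=None):
    """Build a config string for pytesseract.image_to_string.

    psm=7 asks Tesseract to treat the image as a single line of text.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_single_line(image_path, whitelist=None):
    """Recognize one line of text in the image at image_path."""
    # Imported here so build_config can be used without the OCR dependencies
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, config=build_config(7, whitelist))


if __name__ == "__main__":
    print(read_single_line("a.png", whitelist="0123456789"))
```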
Using pytesseract to process Lagou (lagou.com) CAPTCHAs:
import pytesseract
from urllib import request
from PIL import Image
import time

# Specify the path to tesseract.exe
pytesseract.pytesseract.tesseract_cmd = r"D:\ProgramApp\TesseractOCR\tesseract.exe"

while True:
    captchaUrl = "https://passport.lagou.com/vcode/create?from=register&refresh=1513081451891"
    # Download the CAPTCHA image to a local file
    request.urlretrieve(captchaUrl, 'captcha.png')
    image = Image.open('captcha.png')
    # Recognize the text using the English language data
    text = pytesseract.image_to_string(image, lang='eng')
    print(text)
    # Wait two seconds before fetching the next CAPTCHA
    time.sleep(2)
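The raw string returned by image_to_string often contains spaces, line breaks, or stray punctuation, and the loop above never stops. One possible refinement is sketched below. It assumes the CAPTCHA consists of exactly four ASCII letters and digits, which is a guess for illustration and not something confirmed about the site; clean_captcha_text and recognize_with_retries are hypothetical helper names, and read_text stands for any function that downloads and recognizes one CAPTCHA.

```python
import re


def clean_captcha_text(raw, expected_length=4):
    """Keep only ASCII letters and digits; return None if the length is off.

    expected_length=4 is an assumption about the CAPTCHA format.
    """
    candidate = re.sub(r"[^A-Za-z0-9]", "", raw)
    return candidate if len(candidate) == expected_length else None


def recognize_with_retries(read_text, attempts=5, expected_length=4):
    """Call read_text() up to `attempts` times until a plausible result appears."""
    for _ in range(attempts):
        cleaned = clean_captcha_text(read_text(), expected_length)
        if cleaned is not None:
            return cleaned
    return None  # every attempt produced an implausible result


if __name__ == "__main__":
    # Stand-in for a real OCR call, used here only to show the flow
    samples = iter(["??", " x7Kq\n"])
    print(recognize_with_retries(lambda: next(samples)))
```

In the loop above, read_text could wrap the download, Image.open, and image_to_string steps, with the time.sleep(2) pause kept between attempts.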