1. OCR
OCR (Optical Character Recognition) is the process of scanning characters and translating them, based on their shapes, into electronic text. A graphical CAPTCHA consists of irregular characters, which are in fact ordinary characters that have been slightly distorted.
For example, for the CAPTCHAs shown in Figures 1-22 and 1-23, we can use OCR to convert the image into electronic text. The crawler then submits the recognition result to the server, which automates the recognition and verification of the CAPTCHA.
Figure 1-22 CAPTCHA
Figure 1-23 CAPTCHA
tesserocr is a Python OCR library, but it is really a Python API wrapper around tesseract, so tesseract is its core. Therefore, before installing tesserocr, we need to install tesseract.
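Because tesserocr is only a wrapper, it can help to check that both pieces are present before debugging anything else. The following is a minimal sketch, not part of either project: the helper name `check_ocr_environment` is hypothetical, and it assumes the tesseract executable is named `tesseract` and can be found on the PATH.

```python
import importlib.util
import shutil


def check_ocr_environment(binary="tesseract", module="tesserocr"):
    """Report whether the tesseract executable and the tesserocr module can be found.

    Both default names are assumptions; adjust them if your system differs.
    """
    return {
        # shutil.which returns the full path of the executable, or None if it is not on PATH
        "tesseract_path": shutil.which(binary),
        # find_spec returns None when a top-level module is not installed
        "tesserocr_installed": importlib.util.find_spec(module) is not None,
    }


if __name__ == "__main__":
    print(check_ocr_environment())
```

If `tesseract_path` is `None`, install tesseract first (or add it to the PATH) before installing tesserocr.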
2. Related Links
- tesserocr GitHub: https://github.com/sirfz/tesserocr
- tesserocr PyPI: https://pypi.python.org/pypi/tesserocr
- tesseract download: http://digi.bib.uni-mannheim.de/tesseract
- tesseract GitHub: https://github.com/tesseract-ocr/tesseract
- tesseract language packs: https://github.com/tesseract-ocr/tessdata
- tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki/Documentation
3. Installation on Windows
On Windows, you first need to download tesseract, which tesserocr depends on.
The download page lists a number of .exe installers. Here you can choose a 3.0-series version; Figure 1-24 shows version 3.05.
Figure 1-24 Download Page
File names that contain dev are development builds, while those without dev are stable releases. Choose one without dev, for example tesseract-ocr-setup-3.05.01.exe.
After the download completes, double-click the installer. The page shown in Figure 1-25 appears.
Figure 1-25 Installation page
At this point you can check the Additional language data (download) option to install the OCR language support packages, so that OCR can recognize multiple languages. Then click Next through the remaining steps.
Next, install tesserocr directly with pip:

    pip3 install tesserocr pillow
4. Installation on Linux
On Linux, each distribution provides its own package, which may be named tesseract-ocr or tesseract. Install it with the corresponding command.
Ubuntu, Debian, and Deepin
On Ubuntu, Debian, and Deepin, the install command is:

    sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
CentOS and Red Hat
On CentOS and Red Hat, the install command is:

    yum install -y tesseract

Running the command above that matches your distribution completes the tesseract installation.
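The choice between the two commands can be illustrated with a short Python sketch that reads the `ID` field of an os-release file. This is only an illustration: the function names are hypothetical, and the `ID` values in the table (`ubuntu`, `debian`, `deepin`, `centos`, `rhel`) are assumptions about what each distribution reports, so check your own `/etc/os-release` before relying on it.

```python
from pathlib import Path

APT_COMMAND = "sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev"
YUM_COMMAND = "yum install -y tesseract"

# Assumed os-release ID values mapped to the commands given in this section
INSTALL_COMMANDS = {
    "ubuntu": APT_COMMAND,
    "debian": APT_COMMAND,
    "deepin": APT_COMMAND,
    "centos": YUM_COMMAND,
    "rhel": YUM_COMMAND,
}


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes
        fields[key] = value.strip().strip("\"'")
    return fields


def install_command(os_release_text):
    """Return the matching install command, or None for an unknown distribution."""
    distro_id = parse_os_release(os_release_text).get("ID", "").lower()
    return INSTALL_COMMANDS.get(distro_id)


if __name__ == "__main__":
    os_release = Path("/etc/os-release")
    if os_release.exists():
        print(install_command(os_release.read_text()))
```

The sketch only prints the command rather than running it, since package installation usually needs elevated privileges.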
After installation, the tesseract command is available.
Next, list the languages it supports:

    tesseract --list-langs
Example output:

    List of available languages (3):
    eng
    osd
    equ
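If a script needs to know which languages are installed, this output is easy to parse. The sketch below assumes the format shown above (a header line followed by one language code per line); the function name `parse_language_list` is hypothetical.

```python
def parse_language_list(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header
        if not line or line.lower().startswith("list of available languages"):
            continue
        codes.append(line)
    return codes


sample = """List of available languages (3):
eng
osd
equ
"""
print(parse_language_list(sample))  # ['eng', 'osd', 'equ']
```

To feed it real output, the command could be run with `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)`. Depending on the tesseract version, the list may arrive on stdout or stderr, so passing both through the parser is a cautious choice.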
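The output shows that only a few languages are supported. To add more, you need to install the language packs, which the project calls tessdata (download link: https://github.com/tesseract-ocr/tessdata).
Clone the repository with Git and move its contents into the appropriate directory. The commands differ by distribution, as shown below.
On Ubuntu, Debian, and Deepin:

    git clone https://github.com/tesseract-ocr/tessdata.git
    sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata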
On CentOS and Red Hat:

    git clone https://github.com/tesseract-ocr/tessdata.git
    sudo mv tessdata/* /usr/share/tesseract/tessdata

This installs all of the downloaded language packs.
Now run the command that lists the languages again:

    tesseract --list-langs

The output is as follows:

    List of available languages (107):
    afr
    amh
    ara
    asm
    aze
    aze_cyrl
    bel
    ben
    bod
    bos
    bul
    cat
    ceb
    ces
    chi_sim
    chi_tra
    ...

Many more languages are now listed. For example, chi_sim stands for Simplified Chinese. This confirms that the language packs were installed successfully.
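Each language pack is a file named `<code>.traineddata` in the tessdata directory, so the result of the move can also be checked from Python. This is a minimal sketch; the helper name `missing_languages` is hypothetical, and the directory in the usage lines is the Ubuntu, Debian, and Deepin path used earlier, which may differ on your system.

```python
from pathlib import Path


def missing_languages(tessdata_dir, wanted):
    """Return the codes in `wanted` that have no .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [
        code for code in wanted
        if not (directory / f"{code}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Assumed location; on CentOS and Red Hat the text uses /usr/share/tesseract/tessdata
    print(missing_languages("/usr/share/tesseract-ocr/tessdata", ["eng", "chi_sim"]))
```

An empty list means every requested language pack was found.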
Then install tesserocr, again directly with pip:

    pip3 install tesserocr pillow
5. Installation on Mac
On a Mac, first use Homebrew to install ImageMagick and tesseract:

    brew install imagemagick
    brew install tesseract --all-languages

Then install tesserocr:

    pip3 install tesserocr pillow

This completes the installation of tesserocr.
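6. Verifying the installation
Next, we can test tesseract and tesserocr separately.
We use the image shown in Figure 1-26 as the test sample.
Figure 1-26 Test sample
The image is available at https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png and can be saved or downloaded directly.
First, test from the command line. Download the image, save it as image.png, and run the tesseract command:

    tesseract image.png result -l eng && cat result.txt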
The output is as follows:

    Tesseract Open Source OCR Engine v3.05.01 with Leptonica
    Python3WebSpider
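Here we invoke the tesseract command. The first argument is the image file name. The second argument, result, is the base name of the output file (the text is written to result.txt). The -l option specifies the language pack to use, in this case English (eng). The cat command then prints the result.
The last line of the output, Python3WebSpider, is the recognition result for the image. The text in the image has been successfully converted into electronic text.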
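The same command line can be driven from a script. The sketch below simply mirrors the arguments described above using `subprocess`; the function names are hypothetical, and it assumes the `tesseract` executable is on the PATH and writes its result to `<output_base>.txt`, as in the command shown earlier.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize_with_cli(image_path, output_base="result", lang="eng"):
    """Run tesseract and return the recognized text read from <output_base>.txt."""
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(f"{output_base}.txt", encoding="utf-8") as handle:
        return handle.read().strip()


print(build_tesseract_command("image.png", "result"))
# ['tesseract', 'image.png', 'result', '-l', 'eng']
```

With tesseract installed and image.png in the current directory, `recognize_with_cli("image.png")` would run the command and return the text from result.txt.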
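We can also test with Python code, which is where the tesserocr library comes in. The test code is as follows:

    import tesserocr
    from PIL import Image

    # Open the test image with Pillow and run OCR on it
    image = Image.open('image.png')
    print(tesserocr.image_to_text(image))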
We first read the image file with Image, then call tesserocr's image_to_text() method and print the recognition result.
The output is as follows:

    Python3WebSpider
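Alternatively, we can call the file_to_text() method directly, which achieves the same result:

    import tesserocr

    # Pass the file path directly; no Pillow image object is needed
    print(tesserocr.file_to_text('image.png'))

Output:

    Python3WebSpider

If the result is printed successfully, both tesseract and tesserocr have been installed correctly.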
Reposted from:
https://cuiqingcai.com/
https://my.oschina.net/u/3273360/blog/1845039
OCR download link:
https://digi.bib.uni-mannheim.de/tesseract/