Installing tesserocr for OCR

1. OCR

OCR (Optical Character Recognition) is the process of scanning characters and translating them, by their shapes, into electronic text. Graphical CAPTCHAs consist of irregular characters, which are in fact ordinary characters that have been slightly distorted.

For example, for the CAPTCHAs shown in Figure 1-22 and Figure 1-23, we can use OCR to convert them into electronic text. The crawler then submits the recognition result to the server, which automates the CAPTCHA recognition step.

Figure 1-22 CAPTCHA

Figure 1-23 CAPTCHA

tesserocr is a Python OCR library, but it is really a Python API wrapper around tesseract, so tesseract is its core. Therefore, before installing tesserocr, we need to install tesseract.

2. Related Links

3. Installation on Windows

On Windows, you first need to download tesseract, which provides the underlying support for tesserocr.

On the download page, you can see a list of .exe files available for download; here you can choose a 3.0 release. Figure 1-24 shows version 3.05.

Figure 1-24 Download Page

File names containing dev are development versions, and those without dev are stable versions. Choose a stable one, for example tesseract-ocr-setup-3.05.01.exe.

After the download completes, double-click the file; the page shown in Figure 1-25 appears.

Figure 1-25 Installation page

At this point you can check the Additional language data (download) option to install the OCR language support packages, so that OCR can recognize multiple languages. Then click Next all the way through.

Next, install tesserocr directly with pip:

pip3 install tesserocr pillow

4. Installation on Linux

On Linux, different distributions ship different packages, which may be called tesseract-ocr or tesseract. Install it with the corresponding command.

Ubuntu, Debian and Deepin

On Ubuntu, Debian and Deepin, the install command is as follows:

sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev

CentOS and Red Hat

On CentOS and Red Hat, the install command is as follows:

yum install -y tesseract

Run the command for your distribution to complete the installation of tesseract.

After installation is complete, you can invoke the tesseract command.
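To confirm from a script that the tesseract command is actually reachable, something like the following minimal sketch could be used. The helper name tesseract_version is hypothetical (it is not part of tesseract or tesserocr); it only relies on the standard library and on the tesseract --version command.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not found.

    Hypothetical helper for illustration only.
    """
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the version to stderr
        universal_newlines=True,
    )
    lines = proc.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "tesseract was not found on PATH")
```

If this prints the "not found" message right after installing, the install directory is probably not on PATH yet.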

Next, let's look at the languages it supports:

tesseract --list-langs

Example output:

List of available languages (3):

eng

osd

equ

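The output shows that it supports only a few languages. To install more languages, you also need the language packs, officially called tessdata (download link: https://github.com/tesseract-ocr/tessdata).

Download it with Git and move it into the relevant directory. The move commands for different distributions are shown below.

On Ubuntu, Debian and Deepin, the commands are as follows: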

git clone https://github.com/tesseract-ocr/tessdata.git

sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata

On CentOS and Red Hat, the commands are as follows:

git clone https://github.com/tesseract-ocr/tessdata.git

sudo mv tessdata/* /usr/share/tesseract/tessdata

This installs all of the downloaded language packs.
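If only a few languages are needed, the move step can also be done selectively. The following is a minimal sketch, assuming the cloned repository keeps its language files as *.traineddata in its top-level directory; the function name install_language_data and its parameters are hypothetical, and the destination directory is whichever tessdata path applies to your distribution.

```python
import shutil
from pathlib import Path


def install_language_data(src_dir, dest_dir, langs=None):
    """Copy *.traineddata files from src_dir into dest_dir.

    Hypothetical helper. If langs is given, only those language codes
    (for example {"chi_sim"}) are copied. Returns the copied file names.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.traineddata")):
        # path.stem is the language code, e.g. "chi_sim"
        if langs is not None and path.stem not in langs:
            continue
        shutil.copy2(str(path), str(dest / path.name))
        copied.append(path.name)
    return copied
```

For example, install_language_data("tessdata", "/usr/share/tesseract-ocr/tessdata", langs={"chi_sim"}) would copy only the Simplified Chinese data; writing to a system directory like that requires root privileges.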

Now rerun the command that lists all languages:

tesseract --list-langs

The result is as follows:

List of available languages (107):

afr

amh

ara

asm

aze

aze_cyrl

bel

ben

bod

bos

bul

cat

ceb

ces

chi_sim

chi_tra

...

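Many more languages are listed now; chi_sim, for example, stands for Simplified Chinese. This shows that the language packs were installed successfully.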
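The same check can be scripted by parsing the text that tesseract --list-langs prints. This is a minimal sketch that works on the output format shown above; parse_langs is a hypothetical helper name, and the sample string is simply the three-language output from earlier in this section.

```python
def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Hypothetical helper for illustration only.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


sample = """List of available languages (3):
eng
osd
equ
"""

langs = parse_langs(sample)
print(langs)               # ['eng', 'osd', 'equ']
print("chi_sim" in langs)  # False: the language pack is not installed yet
```

In practice the output string would come from running the command, for example through subprocess, rather than from a hard-coded sample.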

Next, install tesserocr, again directly with pip:

pip3 install tesserocr pillow

5. Installation on Mac

On Mac, first use Homebrew to install ImageMagick and tesseract:

brew install imagemagick

brew install tesseract --all-languages

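Note that current Homebrew releases no longer accept formula options such as --all-languages; there, the additional language data is provided by the separate tesseract-lang formula.

Then install tesserocr: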

pip3 install tesserocr pillow

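This completes the installation of tesserocr.

6. Verifying the installation

Next, we can test with tesseract and tesserocr separately.

We use the image shown in Figure 1-26 as the test sample.

Figure 1-26 Test sample

The image is available at https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png and can be saved or downloaded directly.

First, test from the command line: download the image, save it as image.png, and run the tesseract command: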

tesseract image.png result -l eng && cat result.txt

The output is as follows:

Tesseract Open Source OCR Engine v3.05.01 with Leptonica

Python3WebSpider

 

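Here we call the tesseract command. The first argument is the image name, the second argument, result, is the name of the output file (tesseract appends the .txt extension), and -l specifies the language pack, here English (eng). The cat command then prints the result.

The output is the recognition result for the image: Python3WebSpider. The text in the image has been successfully converted into electronic text.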
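The same command-line call can be driven from Python without tesserocr, which is sometimes handy for checking that the command itself works. The sketch below simply mirrors the command above; build_command and ocr_with_cli are hypothetical helper names, and it assumes tesseract writes its result to the output base name plus .txt, as in the run shown here.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_with_cli(image_path, output_base="result", lang="eng"):
    """Run tesseract and return the text it wrote to OUTPUT_BASE.txt."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    text = Path(str(output_base) + ".txt").read_text(encoding="utf-8")
    return text.strip()


if __name__ == "__main__":
    # Only attempt the run when both the command and the image are present.
    if shutil.which("tesseract") and Path("image.png").exists():
        print(ocr_with_cli("image.png"))
    else:
        print("tesseract or image.png is missing; skipping the run")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.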

We can also test with Python code, which requires the tesserocr library. The test code is as follows:

import tesserocr

from PIL import Image

image = Image.open('image.png')

print(tesserocr.image_to_text(image))

We first read the image file with Image, then call tesserocr's image_to_text() method, and print the recognition result.

The output is as follows:

Python3WebSpider

Alternatively, we can call the file_to_text() method directly, which achieves the same result:

import tesserocr

print(tesserocr.file_to_text('image.png'))

Output:

Python3WebSpider

If the result is printed successfully, both tesseract and tesserocr have been installed correctly.
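The individual checks above can be gathered into one small report. The following is a minimal sketch: check_tesserocr and the layout of its report dictionary are hypothetical, while tesseract_version(), get_languages() and file_to_text() are functions exposed by tesserocr. The import is guarded so that the script also runs, and reports installed as False, when tesserocr is missing.

```python
import os


def check_tesserocr(image_path="image.png", lang="eng"):
    """Return a small report on the tesserocr installation.

    Hypothetical helper for illustration only.
    """
    report = {"installed": False, "version": None, "languages": [], "text": None}
    try:
        import tesserocr
    except ImportError:
        return report  # tesserocr is not installed (or failed to load)
    report["installed"] = True
    report["version"] = tesserocr.tesseract_version()
    # get_languages() returns a (tessdata_path, language_list) pair.
    _, report["languages"] = tesserocr.get_languages()
    # Only run OCR when the image exists and the language data is available.
    if os.path.exists(image_path) and lang in report["languages"]:
        report["text"] = tesserocr.file_to_text(image_path, lang=lang).strip()
    return report


if __name__ == "__main__":
    for key, value in check_tesserocr().items():
        print(key, "=", value)
```

With the sample image in the current directory and a working installation, the text entry should match the recognition result shown above.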

 

Reposted from:

https://cuiqingcai.com/

https://my.oschina.net/u/3273360/blog/1845039

OCR download page:

https://digi.bib.uni-mannheim.de/tesseract/

 
