Tess4J 4.0.2 on Linux in practice [resolved: Tess4J - Native library (linux-x86-64/libtesseract.so) not found in resource path]

Tess4J is a Java JNA wrapper for Tesseract. This article describes the steps and precautions for using Tess4J on CentOS 7. Before getting started, here is a brief introduction to the technologies involved.

A little bit of background

Tesseract

Tesseract is a well-known open source OCR engine that supports more than 100 languages out of the box, and further languages can be added through training. Tesseract was born at HP in 1984 and was open-sourced by HP in 2005. Google has sponsored its development since 2006. At the time of writing, the latest stable release is 3.05.01, released on June 1, 2017. A more actively developed version 4.0, based on LSTM (long short-term memory, a type of recurrent neural network), is still in development; its latest release is 4.0.0-beta.3, dated June 26, 2018. Tesseract is written in C++.

Site:

https://github.com/tesseract-ocr/tesseract

Leptonica

As an OCR engine, Tesseract cannot avoid image processing. Most of the image processing in Tesseract is done by Leptonica, a library that provides many image processing and image analysis functions.

Site:

http://www.leptonica.com/

Java JNA Wrapper

JNA stands for Java Native Access. As the name suggests, it lets Java applications call native libraries of the operating system. Native calls from Java naturally bring JNI to mind, but working with JNI is complicated and daunting. JNA gives Java applications a more natural way to call native code.

Site:

https://github.com/java-native-access/jna

Tess4J

Tess4J uses JNA to expose the Tesseract API to Java. It also bundles the 32-bit and 64-bit Windows DLLs of Tesseract, along with some sample images, so using Tesseract from Java on Windows is very convenient. On other operating systems, such as Linux and macOS, you need to build Tesseract yourself before Tess4J can use it.

In other words, Tess4J is not cross-platform out of the box; only Windows works without extra steps.

That is the motivation for this article: to record the steps for using Tess4J on Linux and the pitfalls encountered along the way.

Versions used in this article

Why emphasize versions? Anyone who has spent time digging out of open source pitfalls knows that the quality of most open source projects (usability, documentation accuracy, timeliness of updates) is fairly average. To get by in this world you need the ability to sift through a mass of conflicting information, the skill to get attention in the community, solid hands-on ability, and an indomitable spirit.

For some technical problems, a large share of the search results on Google do not work, which wastes a lot of time or even leads you astray. From talking to many people, I found that in the vast majority of cases the cause is that the article describes the solution incompletely or not rigorously enough.

So I believe every practice write-up should meet the basic requirement of being reproducible. I will therefore state the exact software versions and environment I used, in the hope that it helps.

Tess4J:4.0.2

Tesseract:4.0.0-beta.1

Leptonica:1.76.0

JDK:1.8 Update 102 64bit

Operating environment: CentOS 7 (kernel: 3.10.0-862.3.3.el7.x86_64) 64bit

                  GCC:4.8.5

                  Clang:3.4

Development Environment: Windows 10 64bit

Why this combination?

The pitfall here is that you cannot simply pick the newest version. I first used Tesseract 4.0.0-beta.3, but the JVM reported a fatal error at runtime and exited. Judging from the error report, some function signatures in Tesseract had probably changed and no longer matched the signatures Tess4J expects.

Remember the official documentation (https://github.com/tesseract-ocr/tesseract/wiki)?

The official wiki mentions that on Linux the .so libraries can be installed from pre-built packages. Following the wiki automatically installs the latest version. I also tried installing an older version found via yum list, but even that so-called older version caused errors in Tess4J at runtime. (The packages yum downloads and installs automatically are rpm packages.)

The Tesseract version must therefore be chosen to match the one Tess4J was adapted to.

Tess4J's versionchanges.txt describes the changes in the last few versions:

Version 4.0.0 (28 April 2018)
- Upgrade to Tesseract 4.0.0-beta.1 (45bb942)
- Update Lept4J to 1.9.3 (Leptonica 1.75.3)

Version 4.0.1 (2 May 2018)
- Fix a path issue when extracting resources from JAR to temp directory on Windows server

Version 4.0.2 (3 May 2018)
- Replace JNA string constant Platform.RESOURCE_PREFIX
- Update jai-imageio url
- Update Lept4J to 1.9.4

As you can see, the latest Tess4J has only been adapted to Tesseract 4.0.0-beta.1, which leads to the version combination above. Since there is no way to tell which pre-built package was built from Tesseract 4.0.0-beta.1, the only option is to build from source.

Building Tesseract

1 Modify the yum repo

Perhaps because my network environment is very poor, yum's default mirror selection kept picking mirrors that could not be reached. The download process then made a large number of failed attempts against those mirrors, which wasted a lot of time.

So I disabled yum's fastestmirror plugin (under /etc/yum/pluginconf.d) and also modified CentOS-Base.repo.

2 Install the required packages

yum -y update
yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel
yum group install -y "Development Tools"

Note: autoconf-archive is required, but the official list of prerequisites does not mention it.

3 Download Source

Leptonica

http://www.leptonica.com/source/leptonica-1.76.0.tar.gz

Tesseract

https://codeload.github.com/tesseract-ocr/tesseract/tar.gz/4.0.0-beta.1

After downloading, you should have two archives: leptonica-1.76.0.tar.gz and 4.0.0-beta.1.tar.gz.

4 Install Leptonica

tar -zxvf leptonica-1.76.0.tar.gz
cd leptonica-1.76.0
./autobuild
./configure
make -j
make install

5 Install Tesseract

tar -zxvf 4.0.0-beta.1.tar.gz
cd tesseract-4.0.0-beta.1/
./autogen.sh
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
make install
ldconfig

6 Confirm Installation

After a lengthy compilation, Tesseract is installed.

Run the following command:

tesseract -v

If the Tesseract version information is displayed, the installation is complete.
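Because the version match matters so much here, the same check can also be made from Java at application startup. The following is only a minimal sketch: the class name TesseractVersionCheck and its methods are hypothetical, and it assumes that the first line printed by tesseract -v has the form "tesseract <version>" and that the tesseract binary is on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tess4J.
final class TesseractVersionCheck {

    // The Tesseract version this article builds against.
    static final String EXPECTED_VERSION = "4.0.0-beta.1";

    /** Extracts the version from a banner line such as "tesseract 4.0.0-beta.1". */
    static String parseVersion(String bannerLine) {
        String line = bannerLine.trim();
        String prefix = "tesseract ";
        if (!line.startsWith(prefix)) {
            throw new IllegalArgumentException("Unexpected banner: " + bannerLine);
        }
        return line.substring(prefix.length()).trim();
    }

    /** Runs "tesseract -v" and returns the first line it prints. */
    static String readBannerLine() throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder("tesseract", "-v");
        // Merge stderr into stdout, in case a build prints the banner to stderr.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String first = reader.readLine();
            // Drain the remaining output so the process can exit cleanly.
            while (reader.readLine() != null) {
                // ignore
            }
            process.waitFor();
            if (first == null) {
                throw new IOException("tesseract -v printed nothing");
            }
            return first;
        }
    }

    public static void main(String[] args) throws Exception {
        String version = parseVersion(readBannerLine());
        if (EXPECTED_VERSION.equals(version)) {
            System.out.println("Tesseract " + version + " matches the expected version");
        } else {
            System.err.println("Found Tesseract " + version + ", expected " + EXPECTED_VERSION);
        }
    }
}
```

Failing early with a readable message is usually easier to diagnose than a JVM crash caused by a mismatched native library.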

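Getting the .so libraries

In /usr/local/lib you can find libtesseract.so, the library that Tess4J depends on. libtesseract.so is actually a link that points to libtesseract.so.4.0.0. liblept.so is Leptonica's library, which Tesseract also needs to call.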
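Since libtesseract.so is a link rather than a regular file, it helps to know what it resolves to before copying it anywhere. The sketch below uses only java.nio.file; the class SharedLibLinks and its methods are hypothetical names chosen for illustration. It relies on the documented behavior that Files.copy follows symbolic links by default, so the copy is a regular file that keeps the link's name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for inspecting and copying symlinked shared libraries.
final class SharedLibLinks {

    /** Returns the name of the real file a library path resolves to, following symlinks. */
    static String realFileName(Path library) throws IOException {
        return library.toRealPath().getFileName().toString();
    }

    /**
     * Copies the file behind a (possibly symlinked) library into targetDir,
     * keeping the link's own file name. The result is a regular file.
     */
    static Path copyResolved(Path library, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(library.getFileName());
        // Without NOFOLLOW_LINKS, Files.copy copies the final target of the link.
        return Files.copy(library, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For example, realFileName(Paths.get("/usr/local/lib/libtesseract.so")) would be expected to return libtesseract.so.4.0.0 on the system described above.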

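Setting the library (.so) location for Tess4J

The Tess4J JAR contains the Windows DLLs, stored in directories at the root of the JAR.

At runtime, Tess4J extracts the libraries required by the operating system from the JAR and uses them. The same applies to .so files on Linux. By Tess4J's convention, the Linux libraries are stored in linux-x86-64 under the classpath root.

Now that we have the .so files, there are two obvious options:

1 Modify tess4j-4.0.2.jar and store the .so files at the conventional path, so that Tess4J extracts and uses them automatically at runtime.

However, this modifies the officially released Tess4J JAR and will cause trouble in later upgrades, so it is not recommended.

2 Put linux-x86-64 in the classpath root of the Java project, so that Tess4J can find the libraries at runtime. My Java project is built with Gradle, so I put these files under src/main/resources.

The files in the linux-x86-64 directory were copied from /usr/local/lib on the Linux system, with some of the link files removed.

At this point, the library files for Tess4J are ready.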
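Before running any OCR code, it can be worth confirming that the library really is visible on the classpath, since a missing file is exactly what produces the "Native library (linux-x86-64/libtesseract.so) not found in resource path" error from the title. The following is a rough sketch using only the JDK; NativeLibCheck is a hypothetical class, and the directory name linux-x86-64 is hard-coded as an assumption for 64-bit Linux rather than derived the way JNA derives it.

```java
import java.net.URL;

// Hypothetical diagnostic helper, not part of Tess4J or JNA.
final class NativeLibCheck {

    // Resource directory for 64-bit Linux as described in this article (assumed fixed here).
    static final String LINUX_DIR = "linux-x86-64";

    /** Builds the classpath resource path, e.g. "linux-x86-64/libtesseract.so". */
    static String resourcePath(String libraryFileName) {
        return LINUX_DIR + "/" + libraryFileName;
    }

    /** Returns the URL of the resource on the classpath, or null if it is missing. */
    static URL locate(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourcePath);
    }

    public static void main(String[] args) {
        String path = resourcePath("libtesseract.so");
        URL url = locate(path);
        if (url == null) {
            System.err.println(path + " is NOT on the classpath");
        } else {
            System.out.println(path + " found at " + url);
        }
    }
}
```

With a Gradle project, files under src/main/resources end up at the classpath root, so a file at src/main/resources/linux-x86-64/libtesseract.so should be reported as found.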

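Setting the Tesseract data directory (tessdata) in Tess4J

At runtime, Tesseract needs to load the trained data for each language. By convention, the trained data is placed under tessdata. However, Tess4J 4.0.2 handles this directory inconsistently on Windows and Linux.

In the code that initializes Tesseract, setDatapath is what sets the tessdata directory.

ITesseract instance = new Tesseract();
// Set the tessdata directory
instance.setDatapath("/path/to/tessdata");

Tesseract trained data files are named "<language name>.traineddata".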
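Given that naming rule, the languages available in a tessdata directory can be listed by scanning file names. This is a small sketch with JDK classes only; TrainedDataScanner is a hypothetical helper and does not use any Tess4J API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that derives language names from *.traineddata file names.
final class TrainedDataScanner {

    static final String SUFFIX = ".traineddata";

    /** Returns the sorted language names found as "<name>.traineddata" in the directory. */
    static List<String> availableLanguages(Path tessdataDir) throws IOException {
        List<String> languages = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Skip a file that consists of the suffix only.
                if (name.length() > SUFFIX.length()) {
                    languages.add(name.substring(0, name.length() - SUFFIX.length()));
                }
            }
        }
        Collections.sort(languages);
        return languages;
    }
}
```

An empty result is a quick hint that the path points at the wrong directory level.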

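In my tests, on Windows the path must point directly to the directory that contains the .traineddata files, while on Linux it must point to a directory that contains a folder named tessdata, with the .traineddata files inside that folder.

For example:

On Windows: instance.setDatapath("lngData/tessdata");

On Linux: instance.setDatapath("lngData");

This is clearly a bug, but bugs in open source projects are commonplace.

So in practice I added a small adaptation based on the operating system.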

ITesseract instance = new Tesseract();
File tessTrainedDataLoc = null;
if (SystemDetector.isWindows())
{
    // On Windows, the data path points directly to the directory containing *.traineddata
    tessTrainedDataLoc = new File(System.getProperty("user.dir"), "lngData\\tessdata");
}
else
{
    // On Linux (e.g. CentOS 7), the data path points to the parent directory of tessdata
    tessTrainedDataLoc = new File(System.getProperty("user.dir"), "lngData");
}
instance.setDatapath(tessTrainedDataLoc.getAbsolutePath());

The SystemDetector code used above:

import java.util.Properties;

public class SystemDetector {

    private static boolean isWindows = false;
    private static boolean isLinux = false;

    // Detect the operating system once, when the class is loaded
    static {
        Properties props = System.getProperties();
        String systemName = props.getProperty("os.name");
        if (systemName.toLowerCase().indexOf("windows") != -1) {
            isWindows = true;
        }
        if (systemName.toLowerCase().indexOf("linux") != -1) {
            isLinux = true;
        }
    }

    public static boolean isWindows()
    {
        return isWindows;
    }

    public static boolean isLinux()
    {
        return isLinux;
    }
}
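As a variation on the adaptation above, the path decision can be kept in a pure function that takes the operating system as a parameter, which makes it easy to unit-test on either platform and avoids a hard-coded backslash. This is only a sketch of that idea; DatapathResolver is a hypothetical class, and the rule it encodes is simply the behavior observed in this article for Tess4J 4.0.2.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical alternative to the SystemDetector-based adaptation.
final class DatapathResolver {

    /**
     * Encodes the behavior observed in this article for Tess4J 4.0.2:
     * on Windows the data path is the tessdata directory itself,
     * elsewhere it is the parent directory of tessdata.
     */
    static Path resolve(Path languageRoot, boolean windows) {
        return windows ? languageRoot.resolve("tessdata") : languageRoot;
    }

    /** Returns true when the os.name system property mentions Windows. */
    static boolean isWindows() {
        return System.getProperty("os.name", "").toLowerCase(Locale.ROOT).contains("windows");
    }

    public static void main(String[] args) {
        Path root = Paths.get(System.getProperty("user.dir"), "lngData");
        // The printed path is what would be passed to setDatapath.
        System.out.println(resolve(root, isWindows()).toAbsolutePath());
    }
}
```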

At this point everything is done. A project developed on Windows with Tess4J can now run normally on Linux.

 

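About Tesseract trained data

Tesseract's biggest advantage is that it works out of the box and has trained data for a large number of languages. In practice, you can add trained data according to the type of content you need to recognize.

But do not be greedy: the wider the recognition range, the slower the recognition, and accuracy may even suffer. When using Tesseract, specify the language and type of the content to be recognized as far as possible, to strike a reasonable balance between accuracy and efficiency.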
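Tesseract accepts several languages combined with a plus sign, for example eng+chi_sim, and Tess4J's setLanguage takes the language as a string. A small helper can keep that string limited to the languages actually needed. The sketch below is an illustration only; LanguageSpec is a hypothetical class, and passing its result to setLanguage is an assumption about how you would wire it in.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that builds a Tesseract language string such as "eng+chi_sim".
final class LanguageSpec {

    /** Joins language codes with '+', dropping blanks and duplicates while keeping order. */
    static String of(List<String> codes) {
        Set<String> unique = new LinkedHashSet<>();
        for (String code : codes) {
            if (code != null && !code.trim().isEmpty()) {
                unique.add(code.trim());
            }
        }
        if (unique.isEmpty()) {
            throw new IllegalArgumentException("At least one language is required");
        }
        return String.join("+", unique);
    }
}
```

For instance, LanguageSpec.of(Arrays.asList("eng", "chi_sim")) yields a string that could then be handed to instance.setLanguage(...).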

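Tesseract trained data can be obtained here:

https://github.com/tesseract-ocr/tessdata

Tesseract 4 uses LSTM, so there is also a repo called tessdata_best, which contains the trained data with the highest recognition accuracy for each language, trained with the LSTM model. (Recommended.)

https://github.com/tesseract-ocr/tessdata_best

If you are not satisfied with Tesseract's results, you can also prepare your own data and train it. All Tesseract projects are on GitHub at:

https://github.com/tesseract-ocr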

Origin www.cnblogs.com/socketqiang/p/10960800.html