Running Tess4J for OCR on Linux

1. Introduction to Tess4j & Tesseract OCR

Tess4J is a Java wrapper, built on JNA, for the Tesseract OCR engine and is used to recognize text in images. Tesseract is an open source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, that can recognize text in many languages. Tess4J pairs the recognition capabilities of Tesseract with the portability and ease of use of Java.
Tess4J provides an API for image processing and text recognition. It supports image files in a variety of formats, including BMP, PNG, JPEG, GIF, and TIFF. Tess4J can also perform image preprocessing, such as cropping, scaling, and binarization, to improve recognition accuracy.
Tess4j official website: https://tess4j.sourceforge.net/usage.html
Tesseract OCR source repositories: https://github.com/tesseract-ocr/
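The preprocessing steps mentioned above (cropping, scaling, binarization) can be sketched with nothing but java.awt. This is only an illustration of the idea, not Tess4J's own helper API: the class name OcrPreprocess, the fixed threshold, and the luminance weights are all assumptions made for this example.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Illustrative preprocessing helpers built only on java.awt (not Tess4J's API). */
class OcrPreprocess {

    /** Crops the region (x, y, w, h); the result shares pixel data with the source. */
    static BufferedImage crop(BufferedImage src, int x, int y, int w, int h) {
        return src.getSubimage(x, y, w, h);
    }

    /** Scales the image by the given factor using bilinear interpolation. */
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    /** Binarizes with a fixed luminance threshold (0..255): brighter pixels become white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, gr = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                int luma = (299 * r + 587 * gr + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

A typical order would be crop, then scale up small text, then binarize; which steps actually help depends on the image, so it is worth comparing OCR results with and without each one.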

2. A simple Tess4J example

Add the dependency

First, add the Maven dependency:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.0.0</version>
</dependency>

Code example

Here is a minimal example:

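// Create the OCR engine object
ITesseract iTesseract = new Tesseract();
// Set the recognition language; Simplified Chinese here
iTesseract.setLanguage("chi_sim");
// Set the directory that contains the trained data files
iTesseract.setDatapath("D:\\tessdata");
// Extract the text from the image (doOCR throws TesseractException)
String result = iTesseract.doOCR(new File("D:\\test.png"));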

The language set above is Simplified Chinese, so you need to download the corresponding trained data from the tesseract-ocr language data repository.

Download the language data

Language data download address: https://github.com/tesseract-ocr/tessdata
Download the Simplified Chinese trained data (chi_sim.traineddata) from this repository and place it in the directory passed to setDatapath.
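A missing or misplaced traineddata file is an easy mistake to make, so it can help to check for it before calling doOCR. The following is a minimal sketch that uses only the standard library; the class and method names are hypothetical, and it assumes a single language code (not a combined one) and that the data directory holds the .traineddata files directly.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: checks that <language>.traineddata exists in the data directory. */
class TessdataCheck {

    /** Expected location of the trained data file for one language code. */
    static Path trainedDataFile(String datapath, String language) {
        return Paths.get(datapath).resolve(language + ".traineddata");
    }

    /** True if the language file exists as a readable regular file. */
    static boolean isLanguageInstalled(String datapath, String language) {
        Path file = trainedDataFile(datapath, language);
        return Files.isRegularFile(file) && Files.isReadable(file);
    }

    public static void main(String[] args) {
        // Defaults match the example above; override them on the command line.
        String datapath = args.length > 0 ? args[0] : "D:\\tessdata";
        String language = args.length > 1 ? args[1] : "chi_sim";
        System.out.println(trainedDataFile(datapath, language)
                + (isLanguageInstalled(datapath, language) ? " found" : " is missing"));
    }
}
```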

3. Run ITesseract class on Windows

After going through the above steps, you can run the code example directly on Windows, because Tess4J is available out-of-the-box on Windows. Let’s take a look at why.

Code call chain analysis

First, let's look at the core call chain for OCR recognition:

String result = iTesseract.doOCR(new File("D:\\test.png"));

The following are the relevant methods of the Tesseract class:

public String doOCR(File imageFile) throws TesseractException {
    return doOCR(imageFile, null);
}
public String doOCR(File imageFile, Rectangle rect) throws TesseractException {
    try {
        return doOCR(ImageIOHelper.getIIOImageList(imageFile), imageFile.getPath(), rect);
    } catch (Exception e) {
        logger.error(e.getMessage(), e);
        throw new TesseractException(e);
    }
}

The following doOCR method initializes the TessAPI object through the init method:

public String doOCR(List<IIOImage> imageList, String filename, Rectangle rect) throws TesseractException {
    init();
    setTessVariables();
    ......
}

TessAPI object initialization:

protected void init() {
    api = TessAPI.INSTANCE;
    handle = api.TessBaseAPICreate();
    StringArray sarray = new StringArray(configList.toArray(new String[0]));
    PointerByReference configs = new PointerByReference();
    configs.setPointer(sarray);
    api.TessBaseAPIInit1(handle, datapath, language, ocrEngineMode, configs, configList.size());
    if (psm > -1) {
        api.TessBaseAPISetPageSegMode(handle, psm);
    }
}

TessAPI.INSTANCE is created by loading the native library by name through the loadLibrary method of JNA's Native class:

// In TessAPI
public static final TessAPI INSTANCE = LoadLibs.getTessAPIInstance();

// In LoadLibs
public static TessAPI getTessAPIInstance() {
    return (TessAPI) Native.loadLibrary(getTesseractLibName(), TessAPI.class);
}

The following is the processing logic of the getTesseractLibName method:

public static final String LIB_NAME = "libtesseract400";
public static final String LIB_NAME_NON_WIN = "tesseract";

public static String getTesseractLibName() {
    return Platform.isWindows() ? LIB_NAME : LIB_NAME_NON_WIN;
}

The code above shows that on Windows the TessAPI object is loaded from a native library named libtesseract400, while on all other systems the library name is tesseract.
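To make the name selection concrete, here is a small standalone sketch of how a logical library name usually maps to a file name on each platform. It is not the Tess4J or JNA implementation: the class NativeLibNames is hypothetical, the OS detection is simplified, and version suffixes such as .so.4 are ignored.

```java
import java.util.Locale;

/** Hypothetical illustration of per-platform native library naming. */
class NativeLibNames {

    /** Mirrors the selection shown above: one logical name on Windows, another elsewhere. */
    static String logicalName(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "libtesseract400" : "tesseract";
    }

    /** Conventional file name for a logical library name (version suffixes ignored). */
    static String fileName(String osName, String logicalName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("windows")) {
            return logicalName + ".dll";
        }
        if (os.startsWith("mac")) {
            return "lib" + logicalName + ".dylib";
        }
        return "lib" + logicalName + ".so";
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        String name = logicalName(os);
        System.out.println(os + ": " + name + " -> " + fileName(os, name));
    }
}
```

Under these assumptions, Windows looks for libtesseract400.dll and Linux for libtesseract.so, which is the file that must be discoverable in the Linux section later on.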
The following are the relevant methods of JNA's Native class:

public static Object loadLibrary(String name, Class interfaceClass) {
    return loadLibrary(name, interfaceClass, Collections.EMPTY_MAP);
}

public static Object loadLibrary(String name, Class interfaceClass, Map options) {
    Library.Handler handler = new Library.Handler(name, interfaceClass, options);
    ClassLoader loader = interfaceClass.getClassLoader();
    Library proxy = (Library) Proxy.newProxyInstance(loader, new Class[] {interfaceClass}, handler);
    cacheOptions(interfaceClass, options, proxy);
    return proxy;
}

The code above uses a JDK dynamic proxy to create an object that implements interfaceClass; method calls on that object are routed through the Library.Handler.
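For readers unfamiliar with JDK dynamic proxies, the mechanism can be shown in isolation. In the sketch below, FakeOcrApi is a hypothetical stand-in for an interface like TessAPI, and the handler returns canned values where a real handler would call into the native library.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Standalone demonstration of java.lang.reflect.Proxy; not JNA code. */
class ProxyDemo {

    /** Hypothetical stand-in for a native API interface such as TessAPI. */
    interface FakeOcrApi {
        String version();
        int add(int a, int b);
    }

    /** Names of the methods intercepted so far. */
    static final List<String> CALLS = new ArrayList<>();

    /** Builds a proxy whose every method call lands in the handler below. */
    static FakeOcrApi load() {
        InvocationHandler handler = (proxy, method, args) -> {
            CALLS.add(method.getName());
            // A real handler would look up and invoke the native function here.
            switch (method.getName()) {
                case "version":
                    return "fake-1.0";
                case "add":
                    return (Integer) args[0] + (Integer) args[1];
                default:
                    throw new UnsupportedOperationException(method.getName());
            }
        };
        return (FakeOcrApi) Proxy.newProxyInstance(
                FakeOcrApi.class.getClassLoader(),
                new Class<?>[] {FakeOcrApi.class},
                handler);
    }

    public static void main(String[] args) {
        FakeOcrApi api = load();
        System.out.println(api.version() + ", 2 + 3 = " + api.add(2, 3));
        System.out.println("intercepted: " + CALLS);
    }
}
```

No class implementing FakeOcrApi is ever written by hand, which is the same reason no concrete TessAPI implementation appears in the call chain above.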

Why Windows works right out of the box

From the call chain analysis above, we know that on Windows the code needs to load a native library named libtesseract400 from the library search path. Earlier we added the following dependency:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.0.0</version>
</dependency>

Open the local Maven repository (mine is D:\m2\repository), find the tess4j jar, and unzip tess4j-4.0.0.jar. Inside the extracted win32-x86-64 directory you can see that tess4j-4.0.0.jar already contains the DLLs required on Windows, which is why it works out of the box on Windows systems.
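The same inspection can be done without unzipping anything by listing the jar entries from Java. This is a small sketch using java.util.jar; the class name BundledNatives is hypothetical, and the path to the jar is whatever your local repository uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Hypothetical helper: lists the files stored under a directory prefix in a jar. */
class BundledNatives {

    static List<String> list(String jarPath, String prefix) {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                    .filter(entry -> !entry.isDirectory())
                    .map(entry -> entry.getName())
                    .filter(name -> name.startsWith(prefix))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: BundledNatives <path-to-tess4j-4.0.0.jar>");
            return;
        }
        // Prints the bundled 64-bit Windows libraries, if any.
        list(args[0], "win32-x86-64/").forEach(System.out::println);
    }
}
```

Running it against the jar with a Linux-style prefix is a quick way to confirm whether any Linux shared libraries are bundled in the version you are using.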

4. Run ITesseract class on Linux

Install tesseract

To run Tess4J code on a Linux system, you first need to install leptonica and tesseract. For the specific installation process, refer to:
https://blog.csdn.net/weixin_47914635/article/details/128715110 "Linux installation of tesseract supports tess4j image recognition"
Here the installation directory is /usr/local. In the lib directory of the tesseract installation you can see the tesseract .so (shared object) file.

Configure environment variables

If you run the code at this point, it fails with an error (shown as a screenshot in the original post) which means that the tesseract .so file cannot be found on the library search path. From the call chain analysis above, we know the failure originates in the following code:

protected void init() {
    api = TessAPI.INSTANCE;
    handle = api.TessBaseAPICreate();
    StringArray sarray = new StringArray(configList.toArray(new String[0]));
    PointerByReference configs = new PointerByReference();
    configs.setPointer(sarray);
    api.TessBaseAPIInit1(handle, datapath, language, ocrEngineMode, configs, configList.size());
    if (psm > -1) {
        api.TessBaseAPISetPageSegMode(handle, psm);
    }
}

That is, the system cannot initialize the TessAPI object. The TessAPI object is created by the following code:

public static final String LIB_NAME = "libtesseract400";
public static final String LIB_NAME_NON_WIN = "tesseract";

public static final TessAPI INSTANCE = LoadLibs.getTessAPIInstance();
public static TessAPI getTessAPIInstance() {
    return (TessAPI) Native.loadLibrary(getTesseractLibName(), TessAPI.class);
}
public static String getTesseractLibName() {
    return Platform.isWindows() ? LIB_NAME : LIB_NAME_NON_WIN;
}

In other words, the tesseract .so file is loaded from the library search path, so the directory that contains it must be added to that path.
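Before changing any configuration, it can be useful to see whether the library is already visible from the JVM's point of view. The sketch below scans a path list for libtesseract.so. The class name LibraryPathCheck is hypothetical, and the check is deliberately incomplete: the system linker and JNA also consult default locations that are not covered here, so an empty result is a hint rather than proof.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical diagnostic: looks for a library file in a path list. */
class LibraryPathCheck {

    /** Returns the first existing match for fileName in a list like LD_LIBRARY_PATH. */
    static Optional<Path> find(String pathList, String fileName) {
        if (pathList == null || pathList.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathList.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.exists(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // LD_LIBRARY_PATH is read by the system linker, jna.library.path by JNA.
        String[] sources = {
            System.getenv("LD_LIBRARY_PATH"),
            System.getProperty("jna.library.path"),
        };
        for (String source : sources) {
            System.out.println(source + " -> " + find(source, "libtesseract.so"));
        }
    }
}
```

As an alternative to editing environment variables, JNA also honors the jna.library.path system property, so starting the JVM with -Djna.library.path pointing at the tesseract lib directory is another option to try.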
Edit the environment variables (vim /etc/profile):

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/leptonica/lib:/usr/local/tesseract/lib
export LD_LIBRARY_PATH
LIBRARY_PATH=$LIBRARY_PATH:/usr/local/leptonica/lib:/usr/local/tesseract/lib
export LIBRARY_PATH
LIBLEPT_HEADERSDIR=/usr/local/leptonica/include/leptonica
export LIBLEPT_HEADERSDIR

PATH=$PATH:/usr/local/tesseract/bin
export PATH
export TESSDATA_PREFIX=/usr/local/tesseract/share/tessdata 
export PATH=$PATH:$TESSDATA_PREFIX

Make environment variables take effect:

source /etc/profile

Then rerun the Tess4j code.
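Note that the example from section 2 hard-codes Windows paths (D:\\tessdata and D:\\test.png), which need to change on Linux. One way to avoid editing the code per machine is to derive the data directory from TESSDATA_PREFIX. The helper below is a sketch under that assumption; OcrPaths is a hypothetical name, and the fallback paths are simply the ones used in this article.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: picks the tessdata directory for the current machine. */
class OcrPaths {

    /** Uses TESSDATA_PREFIX when set, otherwise falls back to this article's paths. */
    static String datapath(Map<String, String> env, String osName) {
        String prefix = env.get("TESSDATA_PREFIX");
        if (prefix != null && !prefix.trim().isEmpty()) {
            return prefix.trim();
        }
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "D:\\tessdata" : "/usr/local/tesseract/share/tessdata";
    }

    public static void main(String[] args) {
        System.out.println(datapath(System.getenv(), System.getProperty("os.name")));
    }
}
```

The returned value can then be passed to iTesseract.setDatapath(...) in place of the hard-coded string, and the image path should likewise point to a file that exists on the Linux machine.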


Origin blog.csdn.net/Princeliu999/article/details/130953176