[Step-by-step tutorial] Compiling, installing, and testing Tesseract OCR 5 on Windows 10 (tested successfully)


Tesseract is an excellent text recognition (OCR) library, but it did not start out as open source software. It was proprietary software developed at HP Labs from 1985 to 1994. In 2005, HP and the University of Nevada, Las Vegas released it as open source, and Google began sponsoring its development in 2006. Tesseract 4 added an LSTM-based OCR engine and many new models, bringing support to 116 languages and 37 scripts. After more than two years of development and testing, Tesseract 5 was released in 2021.

Tesseract is distributed in two forms. One is a precompiled program that can be downloaded and installed directly with the Tesseract installer for Windows. Download address:
https://github.com/UB-Mannheim/tesseract/wiki
The other is an open source library for developers, which lets you embed Tesseract in your own programs. This article focuses on the latter.

The installation process is mainly divided into 4 steps:

  1. Install MSYS2
  2. Install the mingw-w64 toolchain and various dependencies
  3. Compile and install Tesseract with cmake
  4. Compile the test program to verify the installation

1. Install MSYS2

Download address:
https://www.msys2.org/
(After the download completes, run the installer and click through with the default options. Note that the installation may appear to freeze for a long time at around 50%. Wait patiently and do not interrupt it.)
After installation, MSYS2 is located in c:\msys64 by default. Open this directory and you will find several launcher files; choose mingw64.exe to enter the environment.
The first thing to do after entering MSYS2 is to update its packages with the pacman package manager:
$ pacman -Syu
(You may be asked to restart the shell during the update. If so, reopen mingw64.exe and run pacman -Syu again.)
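
The remaining steps assume the shell started from mingw64.exe. MSYS2 records the active environment in the MSYSTEM variable, so a quick check can catch the mistake of opening a different launcher. The following is a minimal sketch; the function name and messages are hypothetical, not part of MSYS2.

```shell
# Hypothetical helper: succeed only when the environment name is MINGW64.
# MSYS2 sets MSYSTEM itself (for example MSYS, MINGW64, UCRT64).
# An explicit argument overrides MSYSTEM, which is handy for trying it out.
require_mingw64() {
    env_name=${1:-${MSYSTEM:-unset}}
    if [ "$env_name" = "MINGW64" ]; then
        echo "ok: MINGW64 environment"
    else
        echo "wrong environment: $env_name (start mingw64.exe instead)"
        return 1
    fi
}
```

Calling require_mingw64 with no argument inside the MSYS2 shell checks the current environment.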


2. Install the toolchain and dependencies

After the update is complete, install the build environment (mainly the MinGW-w64 C++ toolchain):
$ pacman -S base-devel msys2-devel mingw-w64-x86_64-toolchain git
Then install the dependencies required by Tesseract:
$ pacman -S mingw-w64-x86_64-asciidoc mingw-w64-x86_64-cairo mingw-w64-x86_64-curl mingw-w64-x86_64-icu mingw-w64-x86_64-leptonica mingw-w64-x86_64-libarchive mingw-w64-x86_64-pango mingw-w64-x86_64-zlib mingw-w64-x86_64-autotools mingw-w64-x86_64-cmake
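
Since later steps fail in confusing ways when a dependency is missing, it can help to confirm the installation before moving on. The sketch below relies on pacman -Q, which exits with a non-zero status for a package that is not installed; the function name is hypothetical.

```shell
# Print every package name from the argument list that pacman does not
# report as installed. Prints nothing when all of them are present.
missing_packages() {
    for pkg in "$@"; do
        pacman -Q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
```

For example, missing_packages mingw-w64-x86_64-leptonica mingw-w64-x86_64-cmake should print nothing after a successful install. Pass individual package names; group names such as the toolchain may not be queryable this way.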


3. Compile and install Tesseract

$ cd ~
(return to the home directory)
$ git clone https://github.com/tesseract-ocr/tesseract tesseract
(clone the source code from GitHub to the local machine)
$ cd tesseract
$ mkdir build && cd build
$ cmake .. -G"MinGW Makefiles" -DSW_BUILD=0 -DCMAKE_INSTALL_PREFIX=/usr/local
(-DSW_BUILD=0 disables building with SW. SW, short for Software Network, is another open source build and packaging tool; I tried it and found it hard to use from within China. -DCMAKE_INSTALL_PREFIX=/usr/local installs the program to /usr/local.)

Because all required dependencies were installed in advance, configuration and compilation should go smoothly, without errors or even warnings. If you get stuck here, check carefully that every dependency above was installed successfully.

The cmake command above only generates the build files. The following command compiles Tesseract and installs it:
$ cmake --build . --config Release --target install
After installation, the files are located in:
Header files:
/usr/local/include/tesseract
[Physical path: C:\msys64\usr\local\include\tesseract]
Library files:
/usr/local/lib
[Physical path: C:\msys64\usr\local\lib]
In addition, Tesseract uses another open source library called Leptonica. It was installed earlier with pacman and is located by default in:
header: C:\msys64\mingw64\include\leptonica
library: C:\msys64\mingw64\lib\libleptonica.a
(Bindings for languages other than C++ may need these two paths when referring to Tesseract, so they are listed here as well.)
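
The physical paths above follow one rule: a path inside the MSYS2 tree is relative to the install root, C:\msys64 by default. Inside MSYS2, cygpath -w performs this conversion properly. The sketch below only mirrors the simple case for illustration; the function name and the MSYS_ROOT override are hypothetical, and drive mounts such as /c/... are not handled.

```shell
# Illustrative only: map a path inside the MSYS2 tree to its Windows
# location, assuming the default install root C:\msys64.
# MSYS_ROOT is a made-up override for non-default install locations.
msys_to_win() {
    root=${MSYS_ROOT:-'C:\msys64'}
    # Turn forward slashes into backslashes and prepend the root.
    printf '%s%s\n' "$root" "$(printf '%s' "$1" | tr '/' '\\')"
}
```

For example, msys_to_win /usr/local/include/tesseract reproduces the header path listed above.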

Finally, we need to provide trained language data for Tesseract. The download address for "eng" (English) is:

https://github.com/tesseract-ocr/tessdata/blob/main/eng.traineddata

The file is about 22 MB. Use the download button on that page to get the file itself rather than saving the web page, then copy it to /usr/local/share/tessdata
(i.e. C:\msys64\usr\local\share\tessdata)
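
A common slip at this point is to end up with a saved web page or a truncated file in place of the real data. The following is a rough sanity check under stated assumptions: the function name is hypothetical, and the 1 MB threshold is an arbitrary value chosen well below the roughly 22 MB mentioned above.

```shell
# Illustrative check: the file must exist and be larger than 1 MB.
# This only catches obviously wrong downloads; it does not validate
# the contents of the file.
check_traineddata() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 1
    fi
    # Arithmetic expansion strips any padding that wc may add.
    size=$(($(wc -c < "$file")))
    if [ "$size" -lt 1000000 ]; then
        echo "too small ($size bytes): $file"
        return 1
    fi
    echo "looks plausible: $file"
}
```

It could be run as check_traineddata /usr/local/share/tessdata/eng.traineddata after copying the file.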

Once that is done, set the environment variable that tells Tesseract where the trained data is:

$ export TESSDATA_PREFIX=/usr/local/share/tessdata

(Note that this variable only lasts for the current shell session. Add the line to ~/.bashrc to make it permanent.)
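
One way to make the setting permanent without adding duplicate lines on repeated runs is sketched below. The function name is hypothetical; the target file defaults to ~/.bashrc and can be overridden with an argument.

```shell
# Append the export line to a shell startup file, but only if that
# exact line is not already present.
persist_tessdata_prefix() {
    line='export TESSDATA_PREFIX=/usr/local/share/tessdata'
    rcfile=${1:-"$HOME/.bashrc"}
    touch "$rcfile"
    grep -qxF "$line" "$rcfile" || printf '%s\n' "$line" >> "$rcfile"
}
```

After running persist_tessdata_prefix, open a new shell (or run source ~/.bashrc) for the variable to take effect.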
With all of this done, you can move on to the next stage: testing.


4. Test

$ cd ~
(return to the home directory)
$ mkdir test && cd test
$ nano test.cpp
(create the test program with the following content)

#include <cstdio>   // fprintf, printf
#include <cstdlib>  // exit
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    // (Init returns non-zero on failure)
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("test.tif");
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used object and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

This is the standard example from the official Tesseract documentation. Before compiling, we need to prepare an image: open Paint, select the text tool, and type some text (in English, since only the eng data is installed), such as: Hello World! Save the file in .tif format, then copy it (test.tif in this example) to the test subdirectory of the home directory (~/test).

Compile with the following command:

g++ test.cpp `pkg-config --cflags --libs lept tesseract` -o test

(This is the step most likely to produce a series of "undefined reference to XXX" errors. If that happens, carefully recheck all the previous steps; if you followed the tutorial strictly, these errors should not appear.)
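
When the link step does fail, a useful first check is whether pkg-config can find both libraries at all. The sketch below uses pkg-config --exists, which exits with a non-zero status when a module's .pc file is not on the search path; the function name is hypothetical.

```shell
# Print every pkg-config module from the argument list that cannot be
# found. Prints nothing when all of them resolve.
missing_pc_modules() {
    for mod in "$@"; do
        pkg-config --exists "$mod" || echo "$mod"
    done
}
```

Run it as missing_pc_modules lept tesseract. If tesseract is reported, one possible cause (an assumption about where the install step placed tesseract.pc) is that /usr/local/lib/pkgconfig is not listed in PKG_CONFIG_PATH; adding it with export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH and retrying may help.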
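If everything went well, run:
$ ./test.exe
(The ./ prefix makes the shell run the program in the current directory rather than another test command found on the PATH.)
The program displays: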

OCR output:
Hello World!

Congratulations!


5. Postscript

Tesseract project address:
https://github.com/tesseract-ocr/tesseract

References:
https://tesseract-ocr.github.io/tessdoc/Compiling.html#windows
https://tesseract-ocr.github.io/tessdoc/Examples_C++.html
https://medium.com/building-a-simple-text-correction-tool/basic-ocr-with-tesseract-and-opencv-34fae6ab3400
https://packages.msys2.org

Origin blog.csdn.net/rockage/article/details/131336101