Image text recognition with tesseract-ocr and Java

    OCR stands for Optical Character Recognition, a technology that recognizes images containing text and extracts the text from them.

    The tesseract-ocr tool is easy to install on Windows, Linux, and Mac.

    Windows installer: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.1.20220118.exe

    After installing the tool, set the environment variable TESSDATA_PREFIX to the location of the tessdata directory inside the tesseract installation directory.
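    A quick way to confirm the variable is set correctly is to check that the directory it points to actually contains a trained data file. The sketch below does this in plain Java. The class name TessdataCheck and the helper hasLanguageData are made up for illustration, and it assumes the English data file is named eng.traineddata, which is the usual name in a standard installation.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {

    /**
     * Returns true if the given directory contains "<lang>.traineddata".
     * A null or empty directory name counts as "not found".
     */
    public static boolean hasLanguageData(String dir, String lang) {
        if (dir == null || dir.isEmpty()) {
            return false;
        }
        Path trained = Paths.get(dir, lang + ".traineddata");
        return Files.isRegularFile(trained);
    }

    public static void main(String[] args) {
        // Read the same environment variable that was set after installation
        String prefix = System.getenv("TESSDATA_PREFIX");
        System.out.println("TESSDATA_PREFIX = " + prefix);
        System.out.println("eng.traineddata found: " + hasLanguageData(prefix, "eng"));
    }
}
```

    If the second line prints false, the variable is probably missing or points at the wrong directory.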

    In order to use the tesseract executable program from the command line, it is best to add the tesseract-ocr installation path to the Path environment variable.

    We can then use the tesseract command from the command line and check that the installation is correct.
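    The installation can also be checked from a Java program. The following is a minimal sketch, assuming the tesseract executable is on the Path and accepts the --version option. The class name TesseractVersionCheck and the method versionLine are hypothetical names for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Optional;

public class TesseractVersionCheck {

    /**
     * Runs "<executable> --version" and returns the first line it prints.
     * Returns an empty Optional if the executable cannot be started.
     */
    public static Optional<String> versionLine(String executable) {
        ProcessBuilder pb = new ProcessBuilder(executable, "--version");
        // Merge stderr into stdout so the version banner is captured either way
        pb.redirectErrorStream(true);
        try {
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String first = r.readLine();
                while (r.readLine() != null) {
                    // drain the remaining output so the process can exit
                }
                p.waitFor();
                return Optional.ofNullable(first);
            }
        } catch (IOException e) {
            // Typically means the executable was not found on the Path
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(versionLine("tesseract")
                .orElse("tesseract could not be started; check the Path variable"));
    }
}
```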

    We can use a picture containing characters for verification. The picture here is hello.png, with the words "hello.tesseract" on it.

    On the command line, running tesseract images\hello.png hello recognizes the hello.png image in the images directory and saves the extracted text in the file hello.txt.


    The above extracts the characters in the picture directly with the tesseract-ocr tool. The following extracts them from a program instead. Here we take a Java program as an example and add the net.sourceforge.tess4j dependency to the Maven pom.xml:

<dependency>
      <groupId>net.sourceforge.tess4j</groupId>
      <artifactId>tess4j</artifactId>
      <version>4.6.0</version>
</dependency>
    The Java code is also extremely simple:

package ocr;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import java.io.File;

public class TestOCR {
    public static void main(String[] args) {
        Tesseract instance = new Tesseract();
        // Only needed when TESSDATA_PREFIX is not set:
        //instance.setDatapath("D:\\Program Files\\Tesseract-OCR\\tessdata");
        File imageFile = new File("D:\\yofc\\python\\images\\hello.png");
        try {
            // Recognize the text in the image and print it
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}

    Run the program and the printed information is as follows:

    This code creates a new Tesseract instance and then recognizes the image file. In just four lines of code, it recognizes the image and prints the result. Some tutorials mention that the location of the training data must be set. If TESSDATA_PREFIX was set during installation, there is no need to set it here. In this example, the following line is commented out:

instance.setDatapath("D:\\Program Files\\Tesseract-OCR\\tessdata");

    The program still runs successfully, because when Tesseract is instantiated, the value of the environment variable TESSDATA_PREFIX is read and used as the data path:

public Tesseract() {
    try {
        this.datapath = System.getenv("TESSDATA_PREFIX");
    } catch (Exception var5) {
    } finally {
        if (this.datapath == null) {
            ...

    That completes the OCR example. For Java development, the code is extremely simple. Some tutorials run the tesseract executable from the program by simulating the command line, which is unnecessary. If that kind of code is ported to a Linux environment, the installation path has to be configured there as well, which is very troublesome.

Copyright statement: This article is an original article by CSDN blogger "luffy5459" and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/feinifi/article/details/124697094







 

 





Origin blog.csdn.net/qq_33209777/article/details/130425614