Using Tess4J in Java for OCR image text recognition

Introduction

Tess4J is a Java API wrapper library for Tesseract OCR. With a few Java calls you can recognize images and extract the text they contain, that is, perform OCR.

Tess4J can recognize the following input formats:

  • TIFF, JPEG, GIF, PNG and BMP image formats
  • Multi-page TIFF image
  • PDF document format
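If you process a whole folder of files, it can help to filter by extension before handing anything to OCR. The following is a minimal sketch under my own assumptions: SupportedFormats is a hypothetical helper, not part of Tess4J, and the extension list is simply derived from the formats listed above (with "tif" and "jpeg" added as common aliases).

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Tess4J): filters files by extension before OCR. */
final class SupportedFormats {

    // Extensions for the formats listed above; "tif" and "jpeg" are common aliases.
    private static final Set<String> EXTENSIONS =
            Set.of("tif", "tiff", "jpg", "jpeg", "gif", "png", "bmp", "pdf");

    private SupportedFormats() {}

    /** Returns true if the file name ends with one of the listed extensions (case-insensitive). */
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("F:\\image\\test1.jpg"));
        System.out.println(isSupported("notes.txt"));
    }
}
```

This only looks at the file name, so a renamed file with the wrong extension would still slip through.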

Tesseract OCR official GitHub repository

Tesseract OCR Manual

Tess4J official website

1. Add the Maven dependencies

        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>5.7.0</version>
        </dependency>
        <!-- Fixes the slf4j error reported when output is printed -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.8.0-beta4</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.8.0-beta4</version>
        </dependency>

2. Download the language model

If you cannot access GitHub, or the language model download is slow or fails because of network problems, you can skip ahead to step 2, "Baidu cloud download".

1. Download the language model

Tesseract supports recognition of more than 100 languages. From the Traineddata language model description and download page, download the .traineddata language model file for each language you want to recognize.

(1) Special model

(2) Language model

Tesseract keeps its language models in three separate GitHub repositories: tessdata, tessdata-best, and tessdata-fast. They differ as follows:

| Repository | How it is trained | Speed | Recognition accuracy | Supports the legacy (old) engine | Supports retraining |
|---|---|---|---|---|---|
| tessdata | Legacy + LSTM (integer version of tessdata-best) | Faster than tessdata-best | Slightly less accurate than tessdata-best | Yes | No |
| tessdata-best | LSTM only (based on langdata) | Slowest | Most accurate | No | Yes |
| tessdata-fast | Integer version of a smaller LSTM network than tessdata-best | Fastest | Least accurate | No | No |

When I tested recognition on an image containing a lot of text, tessdata-best gave the best result but took about 10 seconds, while tessdata took about 3 seconds with a slightly worse result. Choose the language model files that fit your own needs. Here I downloaded chi_sim.traineddata (Simplified Chinese) and eng.traineddata (English) from the tessdata-best repository.
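To repeat this kind of comparison on your own images, you can time the same OCR call once per model folder. Here is a minimal sketch; OcrTimer and Timed are hypothetical names of mine, and the OCR call is passed in as a Callable so the helper itself does not depend on Tess4J.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: measures how long a single OCR call takes. */
final class OcrTimer {

    /** The recognized text together with the elapsed wall-clock time. */
    static final class Timed {
        final String text;
        final long millis;

        Timed(String text, long millis) {
            this.text = text;
            this.millis = millis;
        }
    }

    private OcrTimer() {}

    /** Runs the given call once and records its duration in milliseconds. */
    static Timed time(Callable<String> ocrCall) throws Exception {
        long start = System.nanoTime();
        String text = ocrCall.call();
        long millis = (System.nanoTime() - start) / 1_000_000;
        return new Timed(text, millis);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real OCR call, so the sketch runs without any model files.
        Timed t = time(() -> "stand-in result");
        System.out.println(t.millis + " ms: " + t.text);
    }
}
```

With a configured Tesseract instance and an image File, the call would look like OcrTimer.time(() -> tesseract.doOCR(image)). A single run is only a rough indication, so it is worth repeating the measurement a few times per model.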

If network problems keep you from reaching GitHub or make the download slow, you can download tessdata and tessdata-best from my Baidu cloud; each archive contains all language model files. (If you only need the Chinese and English models, see step 2.)

Baidu cloud download: tessdata-4.1.0.zip, about 635 MB (link: https://pan.baidu.com/s/1e2UKTpMqnfhpCoq6NquIAQ , extraction code: jc9p)

Baidu cloud download: tessdata_best-4.1.0.zip, about 1.29 GB (link: https://pan.baidu.com/s/1dcHpukvaH6Rtma_drfqD9g , extraction code: w3gh)

(3) Create a new tessdata folder under the project's resources folder, then copy the .traineddata language model files downloaded above into tessdata.
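Before running OCR, it is worth checking that the model files really ended up in that folder. The sketch below uses only the JDK; TessdataCheck is a hypothetical helper of mine. It also assumes that several languages may be combined with "+" (for example "chi_sim+eng"), which is the syntax Tesseract's command line uses; this article itself only sets a single language.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: verifies that the needed .traineddata files are present. */
final class TessdataCheck {

    private TessdataCheck() {}

    /**
     * Returns the language codes whose .traineddata file is missing from tessdataDir.
     * languageSpec may name one language ("eng") or several joined by "+" (assumed syntax).
     */
    static List<String> missingModels(Path tessdataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String lang : languageSpec.split("\\+")) {
            String code = lang.trim();
            if (code.isEmpty()) {
                continue; // ignore empty parts such as in "eng+"
            }
            if (!Files.isRegularFile(tessdataDir.resolve(code + ".traineddata"))) {
                missing.add(code);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path dir = Paths.get("src/main/resources/tessdata");
        List<String> missing = missingModels(dir, "chi_sim+eng");
        System.out.println(missing.isEmpty() ? "All models found" : "Missing models: " + missing);
    }
}
```

Running the check first gives a clearer message than waiting for the OCR call itself to fail on a missing file.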

2. Baidu cloud download

If you cannot access GitHub, or the download is slow or fails because of network problems, you can download my tessdata.zip from Baidu cloud; it contains the Chinese and English language models. After unzipping it, copy the tessdata folder into your resources folder:

Baidu cloud download: tessdata.zip, about 27 MB (link: https://pan.baidu.com/s/1nXHJ_e4kzOGHbFwh95ijEg , extraction code: k1qu)

3. Test

1. Test code


import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import java.io.File;
// Needed only for the region-based doOCR call shown after this class
import java.awt.Rectangle;

public class MainServer {

   public static void main(String[] args) throws TesseractException {
       long start = System.currentTimeMillis();
       System.out.println("Starting OCR text recognition, please wait...");
       // Load the image to recognize
       File image = new File("F:\\image\\test1.jpg");
       // Set the data folder location, recognition language, and recognition mode
       Tesseract tesseract = new Tesseract();
       tesseract.setDatapath("src/main/resources/tessdata");
       // Recognize Simplified Chinese (change to "eng" for English)
       tesseract.setLanguage("chi_sim");
       // Page segmentation mode 1: automatic page segmentation with OSD
       tesseract.setPageSegMode(1);
       // Engine mode 1: neural-network LSTM engine
       tesseract.setOcrEngineMode(1);
       // Recognize the text in the whole image
       String result = tesseract.doOCR(image);
       long time = System.currentTimeMillis() - start;
       System.out.println("Recognition finished in " + time + " ms. Result:");
       System.out.println();
       System.out.println(result);
   }
}

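The example above recognizes the text in the whole picture. If you only want to recognize the text in one part of the picture, pass a Rectangle that limits recognition to that region. The two-argument constructor used here describes an area 300 pixels wide and 200 pixels high, starting at the top-left corner (0, 0):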

String result = tesseract.doOCR(image, new Rectangle(300, 200));
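For a region that does not start at the top-left corner, java.awt.Rectangle also has a four-argument constructor (x, y, width, height). I have not checked how Tess4J treats a rectangle that extends past the image edge, so the sketch below defensively clips the requested region to the image size first. RegionHelper is a hypothetical helper of mine, and the image dimensions are made-up example values.

```java
import java.awt.Rectangle;

/** Hypothetical helper: keeps an OCR region inside the image bounds. */
final class RegionHelper {

    private RegionHelper() {}

    /**
     * Clips the requested region to an image of the given size.
     * Returns an empty rectangle if the region lies completely outside the image.
     */
    static Rectangle clamp(Rectangle requested, int imageWidth, int imageHeight) {
        Rectangle bounds = new Rectangle(0, 0, imageWidth, imageHeight);
        Rectangle clipped = requested.intersection(bounds);
        return clipped.isEmpty() ? new Rectangle() : clipped;
    }

    public static void main(String[] args) {
        // Region starting at (250, 150), 300 wide and 200 high, on a 400 x 300 image.
        Rectangle region = clamp(new Rectangle(250, 150, 300, 200), 400, 300);
        System.out.println(region);
    }
}
```

The clipped rectangle can then be used in place of new Rectangle(300, 200) in the call above; if it comes back empty, there is nothing to recognize and the OCR call can be skipped.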

2. Test picture

[Image: the test picture]

3. Result

[Image: the recognition output]


reference:

Optical Character Recognition with Tesseract

Origin blog.csdn.net/qq_33697094/article/details/131438114