Training a custom font with Tesseract 4.0 and recognizing images with the trained model


The download links for each tool are at the bottom of the article!

Important: first create an empty folder (any name will do) to hold the trained model. Inside it, create a subfolder that must be named tessdata.

You can run a test training with the downloaded files first (just replace the file paths in the Java file with the location where you stored the download package). The images must be the ones from the download package, because those are the images the model is trained on.




1. Run the tesseract-ocr-w64-setup-v4.0.0.20181030.exe installer. After installation, configure the system environment variables (the basics are not covered here)

D:\Program Files (x86)\Tesseract-OCR is just the path I configured; it is the installation root directory.
It is used later to generate the .box file.


2. Use jTessBoxEditor to generate the merged tif image of the training samples (the images are already prepared, or you can prepare your own)

  1. Open jTessBoxEditor, select Tools->Merge TIFF, enter the folder where the training samples are located, and select the sample images to participate in the training:

  2. Click "Open" to bring up a save dialog, choose to save in the current path, and name the file "zwp.test.exp0.tif". "TIFF" is the only format available.

  3. Note: the naming format of the tif file is [lang].[fontname].exp[num].tif
    lang is the language, fontname is the font name, and num is a number of your choice.
    For example, to train a custom language zwp with the font name test, name the image file zwp.test.exp0.tif
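
The naming rule above can be checked in code before training starts. The following is a minimal sketch, not part of the original article: the class name TrainingFileName is hypothetical, and the pattern assumes that neither lang nor fontname contains a dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a training image name into lang, font name and number. */
final class TrainingFileName {
    // [lang].[fontname].exp[num].tif, assuming lang and fontname contain no dots
    private static final Pattern NAME =
            Pattern.compile("^([^.]+)\\.([^.]+)\\.exp(\\d+)\\.tif$");

    final String lang;
    final String fontName;
    final int num;

    private TrainingFileName(String lang, String fontName, int num) {
        this.lang = lang;
        this.fontName = fontName;
        this.num = num;
    }

    /** Parses a file name such as zwp.test.exp0.tif, rejecting anything else. */
    static TrainingFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Expected [lang].[fontname].exp[num].tif but got: " + fileName);
        }
        return new TrainingFileName(m.group(1), m.group(2), Integer.parseInt(m.group(3)));
    }

    /** Base name shared by the .tif, .box and .tr files of one sample. */
    String baseName() {
        return lang + "." + fontName + ".exp" + num;
    }
}
```

For the file used in this article, TrainingFileName.parse("zwp.test.exp0.tif") yields lang zwp, font name test and number 0.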



3. Use tesseract to generate the .box file

Open the command line program in the directory where the "zwp.test.exp0.tif" file generated in the previous step is located, execute the following command, and the zwp.test.exp0.box file will be generated after execution.

Command: tesseract zwp.test.exp0.tif zwp.test.exp0 batch.nochop makebox
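
If you would rather start this step from Java than from the command line, a sketch along the following lines may work. It is only an illustration: the class MakeBox and its methods are hypothetical names, and it assumes tesseract is reachable through the PATH environment variable configured in step 1.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the makebox command from this step. */
final class MakeBox {
    /** Builds the same argument list as the command line shown in the article. */
    static List<String> command(String baseName) {
        return Arrays.asList(
                "tesseract", baseName + ".tif", baseName, "batch.nochop", "makebox");
    }

    /** Runs a command in workDir, forwards its output to the console, returns the exit code. */
    static int run(File workDir, List<String> command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(workDir)   // directory that contains the .tif file
                .inheritIO()          // show tesseract's own messages
                .start();
        return process.waitFor();
    }
}
```

A call such as MakeBox.run(new File("D:\\OCR"), MakeBox.command("zwp.test.exp0")) would be the equivalent of the command above; the directory here is only an example, and an exit code other than 0 suggests the command failed.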

4. Use jTessBoxEditor to correct errors in the .box file

The .box file records the position of each character in the image together with the recognized text. Because recognition may merge two characters into one box or split one character into several, use jTessBoxEditor to correct the positions and contents of the characters before training.

Steps for usage:

Open jTessBoxEditor and click Box Editor -> Open, then open the "zwp.test.exp0.tif" generated in step 2. It is automatically associated with the "zwp.test.exp0.box" file, so the two files must be in the same directory. After making your adjustments, click "Save" to save the changes.
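
To see what jTessBoxEditor is editing, it helps to know that each line of a .box file holds one symbol followed by its left, bottom, right and top pixel coordinates (measured from the bottom-left corner of the image) and a page number. The sketch below parses one such line; the class BoxEntry is a hypothetical name, and it assumes the symbol itself contains no whitespace.

```java
/** Hypothetical value class for one line of a .box file. */
final class BoxEntry {
    final String symbol;
    final int left;
    final int bottom;
    final int right;
    final int top;
    final int page;

    private BoxEntry(String symbol, int left, int bottom, int right, int top, int page) {
        this.symbol = symbol;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
        this.page = page;
    }

    /** Parses a line of the form "symbol left bottom right top page". */
    static BoxEntry parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields but got: " + line);
        }
        return new BoxEntry(parts[0],
                Integer.parseInt(parts[1]), Integer.parseInt(parts[2]),
                Integer.parseInt(parts[3]), Integer.parseInt(parts[4]),
                Integer.parseInt(parts[5]));
    }

    /** Box width in pixels. */
    int width() {
        return right - left;
    }

    /** Box height in pixels; top is larger than bottom because y grows upward. */
    int height() {
        return top - bottom;
    }
}
```

A box that is much wider than its neighbours is a typical sign that two characters were merged and need to be split in the editor.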

5. Generate the font_properties file

  1. Execute the following command: echo test 0 0 0 0 0 >font_properties

  2. Alternatively, create a text file named font_properties by hand and enter the line "test 0 0 0 0 0". The five numbers are flags for properties of the font test (italic, bold, fixed, serif, fraktur). The name "test" must match the font name "test" in "zwp.test.exp0.box".
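
The same file can also be produced from Java. This is a small sketch with hypothetical names (FontProperties, line, write); the order of the flags (italic, bold, fixed, serif, fraktur) follows the Tesseract training documentation, and 1 means the property applies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that builds and writes a font_properties line. */
final class FontProperties {
    /** Returns e.g. "test 0 0 0 0 0" for a plain font named test. */
    static String line(String fontName, boolean italic, boolean bold,
                       boolean fixed, boolean serif, boolean fraktur) {
        return fontName + " " + flag(italic) + " " + flag(bold) + " "
                + flag(fixed) + " " + flag(serif) + " " + flag(fraktur);
    }

    private static int flag(boolean value) {
        return value ? 1 : 0;
    }

    /** Writes the line to a file named font_properties (no extension) in dir. */
    static void write(Path dir, String line) throws IOException {
        Files.write(dir.resolve("font_properties"),
                (line + "\n").getBytes(StandardCharsets.UTF_8));
    }
}
```

For this article, FontProperties.line("test", false, false, false, false, false) gives the same content as the echo command above.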

6. Use tesseract to generate the .tr training file

Execute the following command. After execution, the zwp.test.exp0.tr file will be generated in the current directory.

Command: tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train

7. Generate the character set file

Execute the following command. After execution, a file named "unicharset" will be generated in the current directory.

Command: unicharset_extractor zwp.test.exp0.box

8. Generate the shape file

Execute the following command. After execution, two files, shapetable and zwp.unicharset, will be generated.

Command: shapeclustering -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr

9. Generate the clustered character feature files

Executing the following command will generate four files: inttemp, pffmtable, shapetable and zwp.unicharset.

Command: mftraining -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr

10. Generate the character normalization feature file

Execute the following command to generate a normproto file.

Command: cntraining zwp.test.exp0.tr
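
Steps 6 to 10 always use the same file names, so the command lines can be derived from the language, font name and number. The sketch below only builds the commands in order and does not run them; the class TrainingCommands and its method are hypothetical names, and the arguments are copied from the commands in this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper listing the commands of steps 6 to 10 in order. */
final class TrainingCommands {
    static List<List<String>> steps6To10(String lang, String fontName, int num) {
        String base = lang + "." + fontName + ".exp" + num;   // e.g. zwp.test.exp0
        String tr = base + ".tr";
        String outUnicharset = lang + ".unicharset";

        List<List<String>> commands = new ArrayList<>();
        // Step 6: generate the .tr training file
        commands.add(Arrays.asList("tesseract", base + ".tif", base, "nobatch", "box.train"));
        // Step 7: generate unicharset
        commands.add(Arrays.asList("unicharset_extractor", base + ".box"));
        // Step 8: generate shapetable and [lang].unicharset
        commands.add(Arrays.asList("shapeclustering", "-F", "font_properties",
                "-U", "unicharset", "-O", outUnicharset, tr));
        // Step 9: generate inttemp and pffmtable
        commands.add(Arrays.asList("mftraining", "-F", "font_properties",
                "-U", "unicharset", "-O", outUnicharset, tr));
        // Step 10: generate normproto
        commands.add(Arrays.asList("cntraining", tr));
        return commands;
    }
}
```

Each inner list can be passed to java.lang.ProcessBuilder to run the step from the training directory. Stopping at the first non-zero exit code is advisable, since every step depends on the files produced by the previous one.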

11. Rename the files

Rename the four files inttemp, pffmtable, shapetable and normproto to [lang].xxx.

Here they become zwp.inttemp, zwp.pffmtable, zwp.shapetable and zwp.normproto.

Run the following commands in order:

rename normproto zwp.normproto

rename inttemp zwp.inttemp

rename pffmtable zwp.pffmtable

rename shapetable zwp.shapetable
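
The four renames can also be done in one loop from Java, which avoids forgetting one of them. This is a minimal sketch with hypothetical names (PrefixTrainingFiles, addPrefix); it assumes all four files are in the same directory and that no file with the target name exists yet.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that prefixes the four training files with the language name. */
final class PrefixTrainingFiles {
    static final String[] NAMES = {"normproto", "inttemp", "pffmtable", "shapetable"};

    /** Renames e.g. inttemp to zwp.inttemp inside dir. */
    static void addPrefix(Path dir, String lang) throws IOException {
        for (String name : NAMES) {
            Path source = dir.resolve(name);
            // Files.move fails if the source is missing or the target already exists
            Files.move(source, source.resolveSibling(lang + "." + name));
        }
    }
}
```

Calling PrefixTrainingFiles.addPrefix(trainingDir, "zwp") has the same effect as the four rename commands above, where trainingDir stands for the directory that holds the generated files.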

12. Merge the training files

Execute the following command to generate the zwp.traineddata file (the trailing dot after zwp is part of the command).

Command: combine_tessdata zwp.

Copy the generated "zwp.traineddata" language pack file into the tessdata folder inside the folder you created at the start. You can then use the trained language pack for image text recognition.

13. Code test

  1. Add the dependency to pom.xml
     <!-- tess4j dependency -->
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>3.4.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.sun.jna</groupId>
                    <artifactId>jna</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
  2. The code

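import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Main {
    public static void main(String[] args) {
        System.out.println("Hello world!");
        // Image to recognize. 1.png is my own file name; it is one of the test images above.
        File imageFile = new File("D:\\OCR\\1.png");

        ITesseract instance = new Tesseract();
        // Path to your trained data. The model folder must be named tessdata.
        instance.setDatapath("D:\\OCR\\Test2\\tessdata");
        // zwp is the language pack that was just trained.
        instance.setLanguage("zwp");
        // chi_sim is the bundled Simplified Chinese language pack.
        // instance.setLanguage("chi_sim");

        try {
            // Run OCR on the image and print the recognized text.
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            throw new RuntimeException(e);
        }
    }
}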
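
A wrong data path or a missing language pack is a common reason for the test above to fail, so it can be worth checking the folder before calling doOCR. The following sketch is an assumption-laden illustration, not part of tess4j: the class TessdataCheck and the method isUsable are hypothetical, and the rule that the folder is named tessdata comes from this article.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-flight check for the folder passed to setDatapath. */
final class TessdataCheck {
    /** True if dataPath is a folder named tessdata that contains [lang].traineddata. */
    static boolean isUsable(Path dataPath, String lang) {
        Path folderName = dataPath.getFileName();
        return folderName != null
                && "tessdata".equals(folderName.toString())
                && Files.isRegularFile(dataPath.resolve(lang + ".traineddata"));
    }
}
```

With the paths from the test code, TessdataCheck.isUsable(Paths.get("D:\\OCR\\Test2\\tessdata"), "zwp") should return true once zwp.traineddata has been copied into place (Paths is java.nio.file.Paths).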

If your network connection is unreliable, you can download the source code and work from it:
Demo source code


Used to configure environment variables and generate .box files
tesseract-ocr official website


Used to adjust the content and position of the text on the picture
jTessBoxEditor tool official website


Used to set the language pack (instance.setLanguage)
Download address for other language packs

Origin blog.csdn.net/weixin_41620505/article/details/129195725