Tesseract-OCR font training steps (training your own library), an illustrated OCR walkthrough, including Tesseract installation

Introduction: I read a lot of tutorials on the Internet, but none of them was complete, so I spent a day putting the steps together. This is the first article I have written; I hope it will be helpful to everyone.
1. Install Tesseract
Download address: https://github.com/UB-Mannheim/tesseract/wiki
1. Run the installer downloaded from the page above (the installation is a bit slow).
2. Click Next.
3. Tick the check box before "I accept the terms of the License Agreement" and click Next.
4. Tick the option "Install for anyone using this computer" and click Next.
5. Select the components to install and click Next. (For the language data, I selected everything, which made the installation very slow; it is better to select only the languages you need.)
6. Uncheck the check box before Show README, and click Finish.
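Before moving on, it can help to confirm that the installation worked. The following is a minimal Python sketch, assuming the Tesseract installation directory has been added to PATH (the installer does not necessarily do this for you). The function names are my own and are only for illustration; `--list-langs` is a standard tesseract option.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


def installed_languages(executable="tesseract"):
    """Return the language codes reported by `tesseract --list-langs`."""
    result = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True, text=True,
    )
    # Some versions print the list to stderr instead of stdout, so fall back to it.
    lines = (result.stdout or result.stderr).splitlines()
    # The first line is a header; the remaining lines are language codes.
    return [line.strip() for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; add the installation directory to PATH")
    else:
        print("tesseract found at", path)
        print("languages:", installed_languages(path))
```

If the language you selected during installation is missing from the printed list, the language data was probably not installed.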
7. Install jTessBoxEditor
Installation package: it is available from many sites online. If you can't find it, send me a private message.

8. Double-click jTessBoxEditor.jar to run it.
9. If the jTessBoxEditor main window appears, the installation is successful.
10. Click Merge TIFF in Tools.
11. Select All Image Files as the file type, select the sample image, and click Open.
12. Enter num.font.exp0.tif as the file name, select TIFF as the file type, and click Save.
13. Click OK.
14. Copy the num.font.exp0.tif file to the Tesseract-OCR installation directory.
15. Open cmd in the Tesseract-OCR directory and execute the following command to generate num.font.exp0.box:
tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox
The box file generated by makebox is num.font.exp0.box. It contains the characters recognized by Tesseract and their coordinates.
Note: The file name used with makebox must follow a fixed format, so you can't choose the name at random. The command format is:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

where lang is the language name, fontname is the font name, and num is a serial number; all three can be defined at will.
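The naming rule above can be checked before running the command. This is a small Python sketch, not part of Tesseract itself: the regular expression and the function names are my own assumptions, written only to illustrate the [lang].[fontname].exp[num].tif format and the matching makebox command.

```python
import re

# Pattern for [lang].[fontname].exp[num].tif as described in the note above.
TRAINING_NAME = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def parse_training_name(filename):
    """Split a training image name into (lang, fontname, num)."""
    match = TRAINING_NAME.match(filename)
    if match is None:
        raise ValueError(
            f"not in [lang].[fontname].exp[num].tif format: {filename!r}"
        )
    return match.group("lang"), match.group("font"), int(match.group("num"))


def makebox_command(filename):
    """Build the makebox command line for a valid training image name."""
    parse_training_name(filename)  # validate before building the command
    base = filename[: -len(".tif")]
    return ["tesseract", filename, base, "batch.nochop", "makebox"]


if __name__ == "__main__":
    print(parse_training_name("num.font.exp0.tif"))
    print(" ".join(makebox_command("num.font.exp0.tif")))
```

A name such as sample.tif is rejected with a ValueError, which is easier to spot than a training run that fails later.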
16. Click Box Editor, click Open, and select num.font.exp0.tif (the TIFF image generated earlier). The image is displayed together with the recognized character boxes.
17. Some of the recognized characters and box positions are not accurate. Use this tool to correct the wrong characters and positions on each page manually, then save.
Note: The wrong characters must be corrected here, otherwise the traineddata file will be wrong. You can make the corrections in the Box Editor and save, or edit the .box file directly, since it is a plain text file.
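Since the box file is plain text, it can also be inspected with a script before training. Each line has the form "character left bottom right top page", with coordinates measured from the bottom-left corner of the image. The sketch below is only an illustration: the helper names are my own, and the default set of expected characters (digits) is an assumption based on the "num" language name used in this article.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box file line: character, left, bottom, right, top, page."""
    # Split the five numeric fields off from the right so the glyph stays intact.
    char, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


def suspicious_boxes(lines, expected_chars="0123456789"):
    """Return boxes with an unexpected character or a non-positive size."""
    expected = set(expected_chars)
    flagged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        box = parse_box_line(line)
        bad_size = box.right <= box.left or box.top <= box.bottom
        if box.char not in expected or bad_size:
            flagged.append(box)
    return flagged


if __name__ == "__main__":
    with open("num.font.exp0.box", encoding="utf-8") as handle:
        for box in suspicious_boxes(handle):
            print("check this box:", box)
```

Anything the script flags still has to be fixed by hand, in the Box Editor or in the .box file.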
18. Define the font properties file. Create a file named font_properties. The file must not contain a BOM header, and its content format is as follows:
[fontname] [italic] [bold] [fixed] [serif] [fraktur]
fontname is the font name, which must be consistent with the name in [lang].[fontname].exp[num].box. The values of [italic], [bold], [fixed], [serif], and [fraktur] are each 1 or 0, indicating whether the font has that attribute. Here, create a file named font_properties in the directory where the sample picture is located, open it with Notepad, and enter the following content:
font 0 0 0 0 0
The values here are all 0, which means the font is not italic, bold, and so on.
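Some editors add a BOM when saving as UTF-8, so one way to avoid the problem is to write the file from a script. The following Python sketch is only one possible approach; the function name and its parameters are my own, and the defaults reproduce the "font 0 0 0 0 0" line used in this article.

```python
from pathlib import Path


def write_font_properties(directory, fontname="font", italic=0, bold=0,
                          fixed=0, serif=0, fraktur=0):
    """Write a one-line font_properties file without a BOM and return its path."""
    line = f"{fontname} {italic} {bold} {fixed} {serif} {fraktur}\n"
    path = Path(directory) / "font_properties"
    # Encoding with plain "utf-8" (not "utf-8-sig") never emits a BOM, and
    # writing bytes keeps the line ending as a single "\n".
    path.write_bytes(line.encode("utf-8"))
    return path


if __name__ == "__main__":
    created = write_font_properties(".")
    print(created.read_bytes())
```

The fontname argument has to match the fontname part of the .box and .tif file names.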
19. In D:\tesseract-ocr, create a batch script named num.bat with the following content:
echo Run Tesseract for Training...
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train
echo Compute the Character Set...
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr
echo Clustering...
cntraining.exe num.font.exp0.tr
echo Rename Files…
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable
echo Create Tessdata…
combine_tessdata.exe num.
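For readers who prefer Python to a batch file, the same sequence can be sketched as below. This is an assumption-laden illustration rather than a replacement for num.bat: the function names are my own, the commands simply mirror the script above, and the training tools are assumed to be on PATH. Unlike the batch file, it stops at the first command that fails.

```python
import os
import subprocess

# Files produced by mftraining/cntraining that must be prefixed with the language name.
OUTPUTS = ("normproto", "inttemp", "pffmtable", "shapetable")


def training_commands(lang="num", font="font", num=0):
    """Return the training commands from num.bat as argument lists."""
    base = f"{lang}.{font}.exp{num}"
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]


def run_training(workdir, lang="num", font="font", num=0):
    """Run every step in workdir, rename the outputs, and combine them."""
    for command in training_commands(lang, font, num):
        subprocess.run(command, cwd=workdir, check=True)  # raise on failure
    for name in OUTPUTS:
        os.replace(os.path.join(workdir, name),
                   os.path.join(workdir, f"{lang}.{name}"))
    # The trailing dot is required: combine_tessdata treats "num." as a prefix.
    subprocess.run(["combine_tessdata", f"{lang}."], cwd=workdir, check=True)
```

Calling run_training(r"D:\tesseract-ocr") would correspond to running num.bat in that directory.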
20. Run num.bat. It generates several files; num.traineddata is the trained font file.
21. Test the trained font file

num1.txt contains the recognition result.
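The exact test command is not shown in the text above, so the following Python sketch is only my assumption of what it looks like. It assumes num.traineddata is first copied into the tessdata directory of the installation (Tesseract looks there for the language given with -l), and it uses a hypothetical test image named num1.png; the output base name num1 is chosen so that the result lands in num1.txt.

```python
import shutil
import subprocess
from pathlib import Path


def install_traineddata(traineddata, tessdata_dir):
    """Copy the trained file into the tessdata directory and return the new path."""
    target = Path(tessdata_dir) / Path(traineddata).name
    shutil.copyfile(traineddata, target)
    return target


def recognize_command(image, output_base="num1", lang="num"):
    """Build the command that writes the recognized text to output_base + '.txt'."""
    return ["tesseract", str(image), output_base, "-l", lang]


def recognize(image, output_base="num1", lang="num"):
    """Run the recognition and return the text Tesseract wrote."""
    subprocess.run(recognize_command(image, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With these defaults, recognize("num1.png") runs "tesseract num1.png num1 -l num" and returns the content of num1.txt.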

Origin blog.csdn.net/HAC12/article/details/107174477