Tesseract-OCR font training steps (train your own font library)
Introduction: I read many tutorials on the Internet, but none of them were very complete, so I spent a day sorting the process out. This is the first article I have written, and I hope it will be helpful to everyone.
1. Install Tesseract
Download address: https://github.com/UB-Mannheim/tesseract/wiki
Run the downloaded installer (installation is a bit slow).
2. Click Next.
3. Tick the check box before "I accept the terms of the License Agreement" and click Next.
4. Tick the check box before "Install for anyone using this computer" and click Next.
5. Select the components to install and click Next. (This is where you select the languages. I selected all of them and the installation was very slow, so try to select only the languages you need.)
6. Uncheck the check box before "Show README" and click Finish.
7. Install jTessBoxEditor
Installation package: it is available from many sites online. If you can't find it, send me a private message.
8. Double-click jTessBoxEditor.jar to run it.
9. If the jTessBoxEditor window opens, the installation was successful.
10. Click Merge TIFF in Tools.
11. Select All Image Files as the file type, select the sample image, and click Open.
12. Enter num.font.exp0.tif as the file name, select TIFF as the file type, and click Save.
13. Click OK.
14. Copy the num.font.exp0.tif file to the Tesseract-OCR installation directory.
15. Open cmd in the Tesseract-OCR directory and execute the following command to generate num.font.exp0.box:
tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox
The box file generated by makebox is num.font.exp0.box. It contains the characters recognized by Tesseract and their coordinates.
Note: The file name used with makebox must follow a fixed format, so you can't choose the name arbitrarily. The command format is:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
where lang is the language name, fontname is the font name, and num is a serial number; all three can be chosen freely.
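The naming rule above can be checked before calling Tesseract. The following is a minimal Python sketch, not part of the original workflow: the helper names parse_training_name and makebox_command are made up for illustration, and it assumes that num is written as decimal digits and that lang and fontname contain no dots.

```python
import re

# Pattern for [lang].[fontname].exp[num].tif as described in the note above.
# Assumption: lang and fontname contain no dots, and num is one or more digits.
NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<fontname>[^.]+)\.exp(?P<num>\d+)\.tif$")


def parse_training_name(filename):
    """Return (lang, fontname, num), or raise ValueError if the name is malformed."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not in [lang].[fontname].exp[num].tif form: {filename!r}")
    return match.group("lang"), match.group("fontname"), int(match.group("num"))


def makebox_command(filename):
    """Build the makebox command line for a validated training image name."""
    parse_training_name(filename)          # raises if the name is wrong
    base = filename[: -len(".tif")]        # output base is the name without .tif
    return ["tesseract", filename, base, "batch.nochop", "makebox"]


print(makebox_command("num.font.exp0.tif"))
```

The returned list can be passed to subprocess.run from the directory that contains the TIFF file; a name such as sample.tif is rejected before Tesseract is started.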
16. Click Box Editor, click Open, and select num.font.exp0.tif (the TIFF image generated earlier). The image is displayed together with the recognized boxes.
17. Some of the recognized characters and their positions are inaccurate. Use this tool to correct the wrong characters and positions on each page manually, and save when you are done.
Note: The wrong characters must be corrected here, otherwise the resulting traineddata file will be wrong. You can make the corrections and save them in the Box Editor interface, or you can edit the .box file directly in a text editor.
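If you edit the box file by hand, a quick sanity check helps. The sketch below is only an illustration and assumes the usual Tesseract 3.x box layout, one character per line as "char left bottom right top page" with integer coordinates; the names Box, parse_box_line and suspicious_boxes are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box line; assumes 'char left bottom right top page'."""
    # Split from the right so that the character itself is kept intact.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    char, *numbers = parts
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


def suspicious_boxes(lines):
    """Return boxes whose width or height is zero or negative."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [b for b in boxes if b.right <= b.left or b.top <= b.bottom]


sample = ["7 12 30 28 55 0", "3 40 30 40 55 0"]
print(suspicious_boxes(sample))
```

Here the second sample line has equal left and right values, so it is reported. A check like this only finds broken coordinates; wrong characters still have to be compared against the image by eye.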
18. Define the font feature file. Create a font feature file named font_properties. The file must not contain a BOM header, and each line has the following format:
fontname italic bold fixed serif fraktur
fontname is the font name, which must be consistent with the name in [lang].[fontname].exp[num].box. The values of italic, bold, fixed, serif, and fraktur are 1 or 0, indicating whether the font has that attribute. Here, create a file named font_properties in the directory where the sample image is located, open it with Notepad, and enter the following content:
font 0 0 0 0 0
All values here are 0, which means the font is not italic, bold, and so on.
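Some editors add a BOM when saving, so it can be safer to generate the file from a script. This is a minimal sketch, not the author's method: the function names are hypothetical, and the attribute order italic, bold, fixed, serif, fraktur is taken from the Tesseract 3.x training documentation.

```python
import codecs
from pathlib import Path


def write_font_properties(path, fontname, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Write a one-line font_properties file without a BOM."""
    line = f"{fontname} {italic} {bold} {fixed} {serif} {fraktur}\n"
    # The "utf-8" codec writes no BOM (unlike "utf-8-sig");
    # newline="\n" keeps the line ending from being translated to CRLF.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(line)


def has_bom(path):
    """Return True if the file starts with a UTF-8 byte order mark."""
    return Path(path).read_bytes().startswith(codecs.BOM_UTF8)
```

Calling write_font_properties("font_properties", "font") produces the line font 0 0 0 0 0 used in this article, and has_bom can be used to check a file that was saved from an editor.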
19. Write a batch script named num.bat in D:\tesseract-ocr with the following content:
echo Run Tesseract for Training...
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train
echo Compute the Character Set...
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr
echo Clustering...
cntraining.exe num.font.exp0.tr
echo Rename Files...
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable
echo Create Tessdata...
combine_tessdata.exe num.
20. Run num.bat. It generates several files; num.traineddata is the trained font file.
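The batch script keeps going even if one step fails, which can hide errors. As a hedged alternative, the same steps could be driven from Python so that the run stops at the first failing command. This is only a sketch: training_commands and run_training are hypothetical names, and it assumes the Tesseract training tools are on PATH and that the .tif, .box and font_properties files are in workdir.

```python
import os
import subprocess

# Output files that num.bat renames with the language prefix.
RENAMES = ["normproto", "inttemp", "pffmtable", "shapetable"]


def training_commands(lang="num", fontname="font", num=0):
    """Return the training commands from num.bat as argument lists."""
    base = f"{lang}.{fontname}.exp{num}"
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]


def run_training(lang="num", fontname="font", num=0, workdir="."):
    """Run every step in order; check=True raises on the first failure."""
    for command in training_commands(lang, fontname, num):
        subprocess.run(command, cwd=workdir, check=True)
    for name in RENAMES:
        os.replace(os.path.join(workdir, name),
                   os.path.join(workdir, f"{lang}.{name}"))
    # The trailing dot is required: combine_tessdata takes the file prefix.
    subprocess.run(["combine_tessdata", f"{lang}."], cwd=workdir, check=True)
```

Calling run_training(workdir=r"D:\tesseract-ocr") would correspond to running num.bat in that directory.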
21. Test the trained font file.
num1.txt contains the recognition result.
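The exact test command is not shown above, so the following is only a sketch of one way to run the test. It assumes that num.traineddata has been copied into the tessdata directory of the Tesseract installation, and the image name test.png is a placeholder; the helper names are hypothetical. With the output base num1, Tesseract writes its text to num1.txt.

```python
from pathlib import Path


def recognition_command(image, output_base="num1", lang="num"):
    """Build a command that recognizes an image with the trained language."""
    # "-l num" selects num.traineddata from the tessdata directory.
    return ["tesseract", image, output_base, "-l", lang]


def read_result(output_base="num1"):
    """Read the text that Tesseract wrote to <output_base>.txt."""
    return Path(f"{output_base}.txt").read_text(encoding="utf-8").strip()


print(recognition_command("test.png"))
```

The printed list can be passed to subprocess.run; after it finishes, read_result() returns the contents of num1.txt.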