Tesseract-OCR font training

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reproducing it, please include a link to the original source and this statement.
Original link: https://blog.csdn.net/wsp_1138886114/article/details/84098903

First, set up the environment

Tesseract-OCR can be installed with the Chinese recognition data as an optional download, but in actual use its recognition results were poor.
To improve them, I decided to train a font on the kind of content that needs to be recognized.
Font training is best carried out in the Tesseract-OCR installation directory.

1. Download the Tesseract-OCR engine.

2. Download jTessBoxEditor: https://www.softpedia.com/get/Multimedia/Graphic/Graphic-Others/jTessBoxEditor.shtml
https://github.com/tesseract-ocr/tesseract/wiki/AddOns
Or: https://dl.pconline.com.cn/download/1060986.html

3. Download the chi_sim.traineddata language file, which is required for recognizing Chinese. Once it is downloaded, put it in the tessdata folder of the Tesseract-OCR installation (it can also be selected during installation).
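
Before training, it can help to confirm that the language data really is in the tessdata folder. The following is a minimal Python sketch of such a check, not part of the original procedure. The function name missing_traineddata and the install path in the usage section are assumptions for illustration; adjust the path to your own machine.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes whose <lang>.traineddata file is absent."""
    tessdata = Path(tessdata_dir)
    return [lang for lang in langs
            if not (tessdata / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # Hypothetical install location (an assumption, not from the article).
    tessdata_dir = r"C:\Program Files\Tesseract-OCR\tessdata"
    missing = missing_traineddata(tessdata_dir, ["chi_sim", "eng"])
    if missing:
        print("Missing language data:", ", ".join(missing))
    else:
        print("All required language data found.")
```

Both chi_sim and eng are checked here because the commands in this article pass each of them to the -l option.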

Second, automatically train the 3500 commonly used Chinese characters with jTessBoxEditor

The steps are summarized as follows:

Third, train the font

First, prepare the sample images for training.

  1. Change into the Tesseract-OCR installation directory:
    cd Program Files\Tesseract-OCR

  2. Open jTessBoxEditor and select Tools -> Merge TIFF. In the Open dialog box, go to the folder that holds the training samples and select all of the sample images that will take part in training. Note that the dialog's "File Type" must be set to png.

  3. A second dialog box then appears. Enter "chi_my.font.exp0.tif" as the file name, with tiff as the format. The "chi_my" part can be changed to a name of your own. This generates the chi_my.font.exp0.tif file.

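  4. Generate the "chi_my.font.exp0.box" file by running one of the following on the command line, choosing the -l language that matches the sample text (chi_sim for Chinese, eng for English):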
    tesseract chi_my.font.exp0.tif chi_my.font.exp0 -l chi_sim batch.nochop makebox
    tesseract chi_my.font.exp0.tif chi_my.font.exp0 -l eng batch.nochop makebox

  5. Open jTessBoxEditor, click Box Editor -> Open, and select the chi_my.font.exp0.tif file.

  6. Correct any misrecognized characters. This needs particular care when there are many images or many characters.
    Note: to keep a correction, first click the settings icon button for the modified character in the editing interface, and then click the Save button.

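  7. Create the font properties file
    echo font 0 0 0 0 0 > font_properties
    This generates the "font_properties" file containing the line "font 0 0 0 0 0": the font name (which must match the "font" part of chi_my.font.exp0.tif) followed by the italic, bold, fixed, serif, and fraktur flags. Keep the space before ">". Without it, cmd reads "0>" as a redirect of handle 0 and the file is left empty at 0 bytes.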

  8. Generate the training feature file
    tesseract chi_my.font.exp0.tif chi_my.font.exp0 -l eng -psm 7 nobatch box.train
    This generates the "chi_my.font.exp0.tr" file.

  9. Generate the character set file
    unicharset_extractor chi_my.font.exp0.box
    This generates the "unicharset" file.

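  10. Generate four files: the shape table, the clustered character features, and the character normalization features. In these commands, langyp stands for the language name and fontyp for the font name; substitute the names used in your own file names (chi_my and font in the steps above). The font name must match the entry in font_properties.

    • Command shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
      generates the "shapetable" file.
    • Command mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
      generates the "inttemp" and "pffmtable" files, and writes "langyp.unicharset" through the -O option.
    • Command cntraining langyp.fontyp.exp0.tr
      generates the "normproto" file.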
  11. Rename and merge the training files
    rename normproto langyp.normproto
    rename inttemp langyp.inttemp
    rename pffmtable langyp.pffmtable
    rename shapetable langyp.shapetable
    The unicharset file needs no renaming here: mftraining already wrote langyp.unicharset through its -O option.
    Then merge the training files (the trailing dot is part of the argument):
    combine_tessdata langyp.
    This generates the langyp.traineddata file.

  12. Copy the resulting "langyp.traineddata" language pack into the tessdata directory of Tesseract.
    It can then be used for Chinese character recognition by passing -l langyp.
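
The command-line steps above all follow the lang.fontname.expN naming pattern, so they can be generated from a language name and a font name. Below is a minimal Python sketch that only builds and prints the commands; it does not run them. The function names font_properties_line, training_commands, and rename_pairs are hypothetical helpers introduced for illustration, and the flags are copied from the steps above (Tesseract 3.x style, for example -psm). The box file still has to be corrected by hand in jTessBoxEditor between the makebox and box.train commands.

```python
def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one font_properties entry: font name, then five 0/1 style flags."""
    return f"{font} {italic} {bold} {fixed} {serif} {fraktur}"


def training_commands(lang, font, exp=0, box_lang="chi_sim"):
    """Return the command lines of steps 4 and 8-11 as argument lists."""
    base = f"{lang}.{font}.exp{exp}"          # e.g. chi_my.font.exp0
    tr_file = f"{base}.tr"
    out_unicharset = f"{lang}.unicharset"
    return [
        # Step 4: draft box file (correct it in jTessBoxEditor before step 8)
        ["tesseract", f"{base}.tif", base, "-l", box_lang,
         "batch.nochop", "makebox"],
        # Step 8: feature file (.tr) from the corrected box file
        ["tesseract", f"{base}.tif", base, "-l", "eng", "-psm", "7",
         "nobatch", "box.train"],
        # Step 9: character set
        ["unicharset_extractor", f"{base}.box"],
        # Step 10: shape table, feature files, normalization prototypes
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset",
         "-O", out_unicharset, tr_file],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", out_unicharset, tr_file],
        ["cntraining", tr_file],
        # Step 11: merge, to be run after the renames from rename_pairs()
        ["combine_tessdata", f"{lang}."],
    ]


def rename_pairs(lang):
    """(old, new) names for the step 11 renames of the generated files."""
    # <lang>.unicharset is omitted: the mftraining -O option above writes it.
    return [(name, f"{lang}.{name}")
            for name in ("normproto", "inttemp", "pffmtable", "shapetable")]


if __name__ == "__main__":
    print(font_properties_line("font"))
    for cmd in training_commands("chi_my", "font"):
        print(" ".join(cmd))
    for old, new in rename_pairs("chi_my"):
        print(f"rename {old} {new}")
```

Generating the names from one place makes it harder for the language name, the font name, and the font_properties entry to drift apart between steps.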

Example:
G:\Program Files (x86)\jTessBoxEditorFX\samples\pinyin>unicharset_extractor pinyin.font.exp0.box
Extracting unicharset from pinyin.font.exp0.box
Wrote unicharset file ./unicharset.
G:\Program Files (x86)\jTessBoxEditorFX\samples\pinyin>shapeclustering -F font_properties -U unicharset -O pinyin.unicharset pinyin.font.exp0.tr
Reading pinyin.font.exp0.tr …
G:\Program Files (x86)\jTessBoxEditorFX\samples\pinyin>mftraining -F font_properties -U unicharset -O pinyin.unicharset pinyin.font.exp0.tr
Read shape table shapetable of 27 shapes
G:\Program Files (x86)\jTessBoxEditorFX\samples\pinyin>cntraining pinyin.font.exp0.tr
Reading pinyin.font.exp0.tr …
Clustering …

G:\Program Files (x86)\jTessBoxEditorFX\samples\pinyin>combine_tessdata pinyin.
Combining tessdata files
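
Before running combine_tessdata as in the session above, it may be worth checking that every prefixed file is present in the working directory. The following Python sketch is one way to do that under stated assumptions: the helper name missing_combine_inputs and the EXPECTED_SUFFIXES tuple are hypothetical, and the list of suffixes is taken from the files named in step 11 rather than from the tool's own documentation.

```python
from pathlib import Path

# Files named in step 11, each expected with a "<lang>." prefix.
EXPECTED_SUFFIXES = ("unicharset", "inttemp", "pffmtable",
                     "normproto", "shapetable")


def missing_combine_inputs(workdir, lang):
    """Return the expected <lang>.<suffix> files that are not in workdir."""
    work = Path(workdir)
    return [f"{lang}.{suffix}" for suffix in EXPECTED_SUFFIXES
            if not (work / f"{lang}.{suffix}").is_file()]


if __name__ == "__main__":
    # "pinyin" is the language name used in the example session.
    missing = missing_combine_inputs(".", "pinyin")
    if missing:
        print("Not ready to combine, missing:", ", ".join(missing))
    else:
        print("All expected files are present.")
```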

Acknowledgments
https://www.cnblogs.com/zhongtang/p/5555950.html
Automatic training of the 3500 commonly used Chinese characters: https://blog.csdn.net/woaipangruimao/article/details/78741022
https://blog.csdn.net/duanshao/article/details/79835651
https://blog.csdn.net/woaipangruimao/article/details/78685727
http://www.cnblogs.com/wzben/p/5930538.html

https://blog.csdn.net/sylsjane/article/details/83751297
