Training Tesseract-OCR to recognize Chinese: generating a language pack for custom characters

An earlier post described recognizing Chinese (handwriting and common fonts) with Tesseract-OCR, using only the official language data:

https://blog.csdn.net/weixin_37794901/article/details/83343092

To improve recognition of particular characters, you can train on your own text and generate a language pack. What follows is a fairly crude, manual way of doing that training.

 

1. Tools:

     1) Tesseract-OCR, installed

     2) The training tool jTessBoxEditor (requires a Java environment); for details on how to use it, search online.

2. Demo (Windows 10 environment)

    1) Convert the test image (containing Chinese text) to TIFF format: https://www.aconvert.com/cn/image/jpg-to-tiff/

    2) File naming format:

       The tif file is named [lang].[fontname].exp[num].tif,

       where lang is the language, fontname is the font, and num is a sequence number. For example, to train a custom language mjorcen with the font name normal, rename the image file to mjorcen.normal.exp0.jpg and then convert it to tif.
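
The naming pattern above can be sketched as a small helper. This is only an illustration: the function name `training_image_name` and its checks are assumptions of this sketch, not part of Tesseract or jTessBoxEditor.

```python
def training_image_name(lang, fontname, num=0, ext="tif"):
    """Build a file name of the form [lang].[fontname].exp[num].[ext].

    Hypothetical helper for illustration; only the naming pattern itself
    comes from the text above.
    """
    if num < 0:
        raise ValueError("num must be non-negative")
    for part in (lang, fontname):
        # A dot or a space inside a part would break the dotted pattern.
        if not part or "." in part or " " in part:
            raise ValueError(f"invalid name part: {part!r}")
    return f"{lang}.{fontname}.exp{num}.{ext}"


print(training_image_name("mjorcen", "normal"))  # prints mjorcen.normal.exp0.tif
```

Additional sample images for the same font would use exp1, exp2, and so on.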

   3) Generate the box file

       Go to the Tesseract installation directory and run this DOS command:

       tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
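
The makebox step writes mjorcen.normal.exp0.box, which is the file you correct in the next step. As a rough sketch, assuming the usual Tesseract 3.x box layout of one line per character (`symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image), a line can be parsed as below. The `Box` type, the `parse_box_line` function and the sample values are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One line of a box file (illustrative type, not a Tesseract API)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse 'symbol left bottom right top page' into a Box."""
    # Split from the right so the five numeric fields are peeled off first.
    parts = line.strip().rsplit(None, 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    symbol, *numbers = parts
    return Box(symbol, *(int(n) for n in numbers))


# Made-up sample line; "\u4f60" is the code point U+4F60, a Chinese character.
box = parse_box_line("\u4f60 10 20 40 60 0")
print(box.right - box.left, box.top - box.bottom)  # width and height in pixels
```

A script like this can help spot boxes with implausible sizes before you open the file in jTessBoxEditor.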

   4) Open jTessBoxEditor to check and correct the boxes, then train the characters

   DOS commands:

tesseract  mjorcen.normal.exp0.jpg mjorcen.normal.exp0  nobatch box.train

unicharset_extractor mjorcen.normal.exp0.box

   5) Create a new font_properties file

   DOS command: echo normal 0 0 0 0 0 >font_properties
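
Each line of font_properties is the font name followed by five 0/1 flags which, in Tesseract 3.x training, stand for italic, bold, fixed, serif and fraktur. The font name must match the fontname part of the training file name (normal here). A minimal sketch for generating the file, where the helper names are assumptions made for illustration:

```python
def font_properties_line(fontname, italic=False, bold=False,
                         fixed=False, serif=False, fraktur=False):
    """Return one font_properties line: '<fontname> <five 0/1 flags>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return fontname + " " + " ".join("1" if flag else "0" for flag in flags)


def write_font_properties(path, fontnames):
    """Write one all-zero line per font name (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for name in fontnames:
            handle.write(font_properties_line(name) + "\n")


print(font_properties_line("normal"))  # prints normal 0 0 0 0 0
```

Calling `write_font_properties("font_properties", ["normal"])` would produce the same content as the echo command.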

   6) Generate the language pack

     DOS commands:

     shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr

     mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr

     cntraining mjorcen.normal.exp0.tr

     Add the prefix normal. to the five generated files (unicharset, inttemp, pffmtable, shapetable, normproto) so that they can be combined:

     combine_tessdata normal.

     This finally produces the language pack normal.traineddata.
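
The renaming before combine_tessdata is easy to get wrong by hand, so here is a hedged sketch of how it could be scripted. The function `add_prefix` and the constant `TRAINING_OUTPUTS` are names invented for this example. The demonstration at the bottom runs on empty placeholder files in a temporary directory, not on real training output.

```python
import os
import tempfile

# The five files that need the prefix, as listed in the step above.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory, prefix, names=TRAINING_OUTPUTS):
    """Rename each existing file <name> to <prefix><name>; return the new names."""
    renamed = []
    for name in names:
        source = os.path.join(directory, name)
        if not os.path.isfile(source):
            # Skip anything an earlier training step did not produce.
            continue
        os.replace(source, os.path.join(directory, prefix + name))
        renamed.append(prefix + name)
    return renamed


if __name__ == "__main__":
    # Demonstration with placeholder files only.
    with tempfile.TemporaryDirectory() as workdir:
        for name in TRAINING_OUTPUTS:
            open(os.path.join(workdir, name), "w").close()
        print(add_prefix(workdir, "normal."))
```

In real use you would call `add_prefix(".", "normal.")` in the training directory and then run `combine_tessdata normal.` as shown above.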

    

 

Origin blog.csdn.net/weixin_37794901/article/details/83501160