A previous post described using Tesseract-OCR to recognize Chinese text (handwritten or in common fonts) with only the official language data:
https://blog.csdn.net/weixin_37794901/article/details/83343092
To improve recognition of particular characters, you can train your own language pack from sample text. What follows is the fairly crude, manual way of training it.
1. Tools:
1) An installed Tesseract-OCR 2) The training tool jTessBoxEditor (requires a Java environment); for details on how to use it, search online.
2. Demo (Windows 10 environment)
1) Convert the test image (containing Chinese text) to TIFF format: https://www.aconvert.com/cn/image/jpg-to-tiff/
2) File naming format:
The tif file is named [lang].[fontname].exp[num].tif,
where lang is the language name and fontname is the font name. For example, to train a custom language pack mjorcen with the font name normal, rename the image file to mjorcen.normal.exp0.jpg and then convert it to tif.
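As a small illustration of the naming rule, here is a Python sketch that builds and checks names of the form [lang].[fontname].exp[num].[ext]. The function names training_name and parse_training_name are made up for this example and are not part of Tesseract.

```python
import re

# Pattern for [lang].[fontname].exp[num].[ext].
# Assumption of this sketch: lang and fontname contain no dots.
NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>\w+)$"
)


def training_name(lang: str, fontname: str, num: int = 0, ext: str = "tif") -> str:
    """Build a training image name such as mjorcen.normal.exp0.tif."""
    return f"{lang}.{fontname}.exp{num}.{ext}"


def parse_training_name(name: str) -> dict:
    """Split a training image name into its parts, or raise ValueError."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a [lang].[fontname].exp[num] name: {name!r}")
    parts = match.groupdict()
    parts["num"] = int(parts["num"])
    return parts


if __name__ == "__main__":
    print(training_name("mjorcen", "normal"))
    print(parse_training_name("mjorcen.normal.exp0.jpg"))
```

Checking names up front is cheap, and it helps keep the later commands consistent because they all reuse the same base name.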
3) Generate the box file
Go to the Tesseract installation directory and run the DOS command:
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
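The command above writes mjorcen.normal.exp0.box. To my understanding of the Tesseract 3.x box format, each line holds one symbol followed by its left, bottom, right and top pixel coordinates (origin at the bottom-left corner of the image) and a page number. The Python sketch below reads such lines so that odd-looking boxes can be spotted before correcting them by hand. The names Box, parse_box_line and is_degenerate, and the sample lines, are invented for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line: symbol left bottom right top page."""
    # Split from the right so the symbol itself is kept intact.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"malformed box line: {line!r}")
    left, bottom, right, top, page = map(int, numbers)
    return Box(symbol, left, bottom, right, top, page)


def is_degenerate(box: Box) -> bool:
    """True when the box has no width or no height."""
    return box.right <= box.left or box.top <= box.bottom


if __name__ == "__main__":
    # Made-up sample lines, not output from a real run.
    sample = ["你 36 92 60 116 0", "好 64 92 88 116 0"]
    for text in sample:
        box = parse_box_line(text)
        print(box, "degenerate" if is_degenerate(box) else "ok")
```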
4) Open the tool (jTessBoxEditor) to check the box file, then train on the text
DOS commands:
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
unicharset_extractor mjorcen.normal.exp0.box
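The commands in steps 3 and 4 all derive from the same image name, so they can be generated rather than typed. The following is a minimal Python sketch, assuming the Tesseract tools are on PATH. The helper names training_commands and run are hypothetical, and only the arguments already shown above are used.

```python
import subprocess


def training_commands(image: str, ocr_lang: str = "chi_sim") -> dict:
    """Return the step 3 and step 4 command lines for one training image."""
    base = image.rsplit(".", 1)[0]  # e.g. mjorcen.normal.exp0
    return {
        "makebox": ["tesseract", image, base, "-l", ocr_lang,
                    "batch.nochop", "makebox"],
        "box_train": ["tesseract", image, base, "nobatch", "box.train"],
        "unicharset": ["unicharset_extractor", base + ".box"],
    }


def run(command: list) -> None:
    """Run one command and stop with an error if it fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run(...) on each to execute them.
    for step, command in training_commands("mjorcen.normal.exp0.jpg").items():
        print(step, ":", " ".join(command))
```

Remember that the box file should be corrected by hand between makebox and box.train, so the three commands are not meant to run back to back without a pause.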
5) Create the font_properties file
DOS command: echo normal 0 0 0 0 0 >font_properties
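As far as I know, each line of this file is the font name followed by five 0/1 flags in the order italic, bold, fixed, serif, fraktur, and the font name has to match the fontname part of the training file name (normal here). A Python sketch that writes the same file is given below. The function names font_properties_line and write_font_properties are made up for this example.

```python
def font_properties_line(fontname: str, italic: bool = False, bold: bool = False,
                         fixed: bool = False, serif: bool = False,
                         fraktur: bool = False) -> str:
    """Format one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return fontname + " " + " ".join(str(int(flag)) for flag in flags)


def write_font_properties(path: str, fontnames: list) -> None:
    """Write one all-zero line per font name to the given file."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for fontname in fontnames:
            handle.write(font_properties_line(fontname) + "\n")


if __name__ == "__main__":
    write_font_properties("font_properties", ["normal"])
    print(font_properties_line("normal"))
```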
6) Generate the language pack
DOS commands:
shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
cntraining mjorcen.normal.exp0.tr
Prefix the five generated files unicharset, inttemp, pffmtable, shapetable and normproto with normal. so that they can be combined:
combine_tessdata normal.
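The renaming that has to happen before combine_tessdata can also be scripted. Here is a minimal Python sketch, assuming the five files sit in the current directory. The function name add_prefix is hypothetical, and the default prefix normal. is the one used in this post.

```python
from pathlib import Path

# The five files produced by unicharset_extractor, mftraining and cntraining.
GENERATED_FILES = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_prefix(directory: str = ".", prefix: str = "normal.") -> list:
    """Rename each generated file to <prefix><name> and return the new names."""
    folder = Path(directory)
    # Check everything first so a missing file does not leave a half-renamed set.
    missing = [name for name in GENERATED_FILES if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError("missing training output: " + ", ".join(missing))
    renamed = []
    for name in GENERATED_FILES:
        target = folder / (prefix + name)
        # replace() overwrites an existing target, including on Windows.
        (folder / name).replace(target)
        renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    try:
        print(add_prefix("."))
    except FileNotFoundError as error:
        print(error)
```

After the rename, combine_tessdata normal. picks up every file that starts with the normal. prefix.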
Finally you get: