How to update Tesseract traineddata files

Tesseract occasionally updates its training data, usually as an incremental update; the current version 4.0 training data, for example, was updated this way. An incremental update contains only the components that changed, so it has to be merged with the previous training data. The combine_tessdata command does this. The steps are as follows:

Environment preparation

  1. Download the traineddata files from
    https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
    You need both "Data Files for Version 4.00 (November 29, 2016)" and
    "Updated Data Files for Version 4.00 (September 15, 2017)".

  2. Create a directory to hold the unpacked traineddata components
    PS E:\tesseract4> mkdir chi_sim

  3. Example directory structure

PS E:\tesseract4> tree.com /F
Folder PATH listing
Volume serial number is 76B2-83BC
E:.
│  chi_sim.traineddata
│  eng.traineddata
│  equ.traineddata
│  tesseract-ocr-w32-setup-v4.0.0-beta.4.20180912.exe
│
├─20170915-Updated Data Files for Version 4.00
│      chi_sim.traineddata
│      chi_sim_vert.traineddata
│
└─chi_sim

Unpack and repackage traineddata

  • Unpack the original traineddata into the directory
PS E:\tesseract4> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' -u .\chi_sim.traineddata .\chi_sim\chi_sim.
Extracting tessdata components from .\chi_sim.traineddata
Wrote .\chi_sim\chi_sim.config
Wrote .\chi_sim\chi_sim.unicharset
Wrote .\chi_sim\chi_sim.unicharambigs
Wrote .\chi_sim\chi_sim.inttemp
Wrote .\chi_sim\chi_sim.pffmtable
Wrote .\chi_sim\chi_sim.normproto
Wrote .\chi_sim\chi_sim.punc-dawg
Wrote .\chi_sim\chi_sim.word-dawg
Wrote .\chi_sim\chi_sim.number-dawg
Wrote .\chi_sim\chi_sim.freq-dawg
Wrote .\chi_sim\chi_sim.shapetable
Wrote .\chi_sim\chi_sim.lstm
Wrote .\chi_sim\chi_sim.lstm-punc-dawg
Wrote .\chi_sim\chi_sim.lstm-word-dawg
Wrote .\chi_sim\chi_sim.lstm-number-dawg
Wrote .\chi_sim\chi_sim.version
Version string:Pre-4.0.0
0:config:size=1930, offset=192
1:unicharset:size=382937, offset=2122
2:unicharambigs:size=1, offset=385059
3:inttemp:size=39926030, offset=385060
4:pffmtable:size=50194, offset=40311090
5:normproto:size=618655, offset=40361284
6:punc-dawg:size=290, offset=40979939
7:word-dawg:size=652386, offset=40980229
8:number-dawg:size=74, offset=41632615
9:freq-dawg:size=1042, offset=41632689
13:shapetable:size=455944, offset=41633731
17:lstm:size=9924750, offset=42089675
18:lstm-punc-dawg:size=18, offset=52014425
19:lstm-word-dawg:size=648082, offset=52014443
20:lstm-number-dawg:size=74, offset=52662525
23:version:size=9, offset=52662599
  • Unpack the incrementally updated traineddata into the same directory; its components overwrite the originals of the same name
PS E:\tesseract4> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' -u '.\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata' .\chi_sim\chi_sim.
Extracting tessdata components from .\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata
Wrote .\chi_sim\chi_sim.config
Wrote .\chi_sim\chi_sim.lstm
Wrote .\chi_sim\chi_sim.lstm-punc-dawg
Wrote .\chi_sim\chi_sim.lstm-word-dawg
Wrote .\chi_sim\chi_sim.lstm-number-dawg
Wrote .\chi_sim\chi_sim.lstm-unicharset
Wrote .\chi_sim\chi_sim.lstm-recoder
Wrote .\chi_sim\chi_sim.version
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335
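
The effect of unpacking both files into the same directory can be sketched in Python. This is only an illustration built from the component sizes printed in the two listings above; merge_components is a hypothetical helper, not part of Tesseract. Components in the update replace originals of the same name, and everything else is kept.

```python
# Component sizes copied from the two combine_tessdata listings above.
original = {
    "config": 1930, "unicharset": 382937, "unicharambigs": 1,
    "inttemp": 39926030, "pffmtable": 50194, "normproto": 618655,
    "punc-dawg": 290, "word-dawg": 652386, "number-dawg": 74,
    "freq-dawg": 1042, "shapetable": 455944, "lstm": 9924750,
    "lstm-punc-dawg": 18, "lstm-word-dawg": 648082,
    "lstm-number-dawg": 74, "version": 9,
}
update = {
    "config": 1966, "lstm": 12152851, "lstm-punc-dawg": 282,
    "lstm-word-dawg": 590634, "lstm-number-dawg": 82,
    "lstm-unicharset": 258834, "lstm-recoder": 72494, "version": 84,
}


def merge_components(base, patch):
    """Hypothetical helper: files written later overwrite same-named ones."""
    merged = dict(base)
    merged.update(patch)
    return merged


merged = merge_components(original, update)
replaced = sorted(set(original) & set(update))  # present in both files
added = sorted(set(update) - set(original))     # new in the update

print("replaced:", replaced)
print("added:", added)
print("components in merged set:", len(merged))
```

The merged set is what the repacking step in the next bullet writes into a single file.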
  • Pack the files in the directory into a complete traineddata file
    This generates a complete traineddata file in that directory.
PS E:\tesseract4> cd .\chi_sim\
PS E:\tesseract4\chi_sim> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' chi_sim
Combining tessdata files
Output chi_sim.traineddata created successfully.
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
1:unicharset:size=382937, offset=2158
2:unicharambigs:size=1, offset=385095
3:inttemp:size=39926030, offset=385096
4:pffmtable:size=50194, offset=40311126
5:normproto:size=618655, offset=40361320
6:punc-dawg:size=290, offset=40979975
7:word-dawg:size=652386, offset=40980265
8:number-dawg:size=74, offset=41632651
9:freq-dawg:size=1042, offset=41632725
13:shapetable:size=455944, offset=41633767
17:lstm:size=12152851, offset=42089711
18:lstm-punc-dawg:size=282, offset=54242562
19:lstm-word-dawg:size=590634, offset=54242844
20:lstm-number-dawg:size=82, offset=54833478
21:lstm-unicharset:size=258834, offset=54833560
22:lstm-recoder:size=72494, offset=55092394
23:version:size=84, offset=55164888
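
The listing can be sanity-checked with a little arithmetic: each component should start exactly where the previous one ends (offset + size = next offset). The Python sketch below does that check on the lines copied from the output above; parse_listing and is_contiguous are hypothetical helper names chosen for this example.

```python
import re

# Component lines copied from the combined file's listing above.
listing = """\
0:config:size=1966, offset=192
1:unicharset:size=382937, offset=2158
2:unicharambigs:size=1, offset=385095
3:inttemp:size=39926030, offset=385096
4:pffmtable:size=50194, offset=40311126
5:normproto:size=618655, offset=40361320
6:punc-dawg:size=290, offset=40979975
7:word-dawg:size=652386, offset=40980265
8:number-dawg:size=74, offset=41632651
9:freq-dawg:size=1042, offset=41632725
13:shapetable:size=455944, offset=41633767
17:lstm:size=12152851, offset=42089711
18:lstm-punc-dawg:size=282, offset=54242562
19:lstm-word-dawg:size=590634, offset=54242844
20:lstm-number-dawg:size=82, offset=54833478
21:lstm-unicharset:size=258834, offset=54833560
22:lstm-recoder:size=72494, offset=55092394
23:version:size=84, offset=55164888
"""

ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Turn 'index:name:size=N, offset=M' lines into (index, name, size, offset)."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            index, name, size, offset = match.groups()
            entries.append((int(index), name, int(size), int(offset)))
    return entries


def is_contiguous(entries):
    """True if every component begins where the previous one ends."""
    return all(prev[3] + prev[2] == cur[3]
               for prev, cur in zip(entries, entries[1:]))


entries = parse_listing(listing)
print("components:", len(entries))
print("contiguous:", is_contiguous(entries))
```

The same check can be run on the listings of the two input files to confirm that the sizes of the untouched components (unicharset, inttemp and so on) carried over unchanged.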
  • Copy the new traineddata file to the tessdata directory under the Tesseract installation path
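
If the update has to be repeated for several languages, the four steps above could be scripted. The following Python sketch is one possible way to do it, not an official tool: build_commands and update_traineddata are hypothetical names, and the only combine_tessdata invocations used are the two forms shown in the transcripts above (-u to unpack, a bare prefix to repack).

```python
import os
import shutil
import subprocess


def build_commands(tool, original, update, workdir, lang):
    """Return the three combine_tessdata invocations used above, in order."""
    prefix = os.path.join(workdir, lang + ".")  # e.g. chi_sim\chi_sim.
    return [
        [tool, "-u", original, prefix],  # unpack the original file
        [tool, "-u", update, prefix],    # unpack the update on top of it
        [tool, lang],                    # repack; must run inside workdir
    ]


def update_traineddata(tool, original, update, workdir, lang, tessdata_dir):
    """Merge an incremental update and copy the result into tessdata_dir.

    Assumption: `tool` is an absolute path to combine_tessdata, because the
    repack step runs with workdir as its current directory.
    """
    os.makedirs(workdir, exist_ok=True)
    unpack_old, unpack_new, repack = build_commands(
        tool, original, update, workdir, lang)
    subprocess.run(unpack_old, check=True)
    subprocess.run(unpack_new, check=True)
    subprocess.run(repack, check=True, cwd=workdir)
    merged = os.path.join(workdir, lang + ".traineddata")
    return shutil.copy2(merged, os.path.join(tessdata_dir, lang + ".traineddata"))


# Show the commands for the paths used in this article without running them.
commands = build_commands(
    r"C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe",
    "chi_sim.traineddata",
    os.path.join("20170915-Updated Data Files for Version 4.00",
                 "chi_sim.traineddata"),
    "chi_sim",
    "chi_sim",
)
for command in commands:
    print(command)
```

Calling update_traineddata with the same arguments plus a tessdata directory would perform the whole update. The tessdata location (for example a tessdata folder under C:\Program Files (x86)\Tesseract-OCR) is an assumption here and should be checked against the actual installation; writing there may also require administrator rights.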

Origin blog.csdn.net/huzhenwei/article/details/82705544