How to update Tesseract traineddata

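Tesseract occasionally updates its trained data, usually by publishing an incremental update; the current trained data for version 4.0, for example, is such an update. The combine_tessdata command can merge an incremental update with the earlier trained data. The steps are as follows: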

Environment setup

  1. Download the traineddata
    Go to: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
    and download "Data Files for Version 4.00 (November 29, 2016)" as well as
    "Updated Data Files for Version 4.00 (September 15, 2017)".

  2. Create a directory to hold the unpacked traineddata components
    PS E:\tesseract4> mkdir chi_sim

  3. Example directory layout

PS E:\tesseract4> tree.com /F
Folder PATH listing
Volume serial number is 76B2-83BC
E:.
│  chi_sim.traineddata
│  eng.traineddata
│  equ.traineddata
│  tesseract-ocr-w32-setup-v4.0.0-beta.4.20180912.exe
│
├─20170915-Updated Data Files for Version 4.00
│      chi_sim.traineddata
│      chi_sim_vert.traineddata
│
└─chi_sim

Unpacking and repacking the traineddata

  • Unpack the original traineddata into the working directory
PS E:\tesseract4> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' -u .\chi_sim.traineddata .\chi_sim\chi_sim.
Extracting tessdata components from .\chi_sim.traineddata
Wrote .\chi_sim\chi_sim.config
Wrote .\chi_sim\chi_sim.unicharset
Wrote .\chi_sim\chi_sim.unicharambigs
Wrote .\chi_sim\chi_sim.inttemp
Wrote .\chi_sim\chi_sim.pffmtable
Wrote .\chi_sim\chi_sim.normproto
Wrote .\chi_sim\chi_sim.punc-dawg
Wrote .\chi_sim\chi_sim.word-dawg
Wrote .\chi_sim\chi_sim.number-dawg
Wrote .\chi_sim\chi_sim.freq-dawg
Wrote .\chi_sim\chi_sim.shapetable
Wrote .\chi_sim\chi_sim.lstm
Wrote .\chi_sim\chi_sim.lstm-punc-dawg
Wrote .\chi_sim\chi_sim.lstm-word-dawg
Wrote .\chi_sim\chi_sim.lstm-number-dawg
Wrote .\chi_sim\chi_sim.version
Version string:Pre-4.0.0
0:config:size=1930, offset=192
1:unicharset:size=382937, offset=2122
2:unicharambigs:size=1, offset=385059
3:inttemp:size=39926030, offset=385060
4:pffmtable:size=50194, offset=40311090
5:normproto:size=618655, offset=40361284
6:punc-dawg:size=290, offset=40979939
7:word-dawg:size=652386, offset=40980229
8:number-dawg:size=74, offset=41632615
9:freq-dawg:size=1042, offset=41632689
13:shapetable:size=455944, offset=41633731
17:lstm:size=9924750, offset=42089675
18:lstm-punc-dawg:size=18, offset=52014425
19:lstm-word-dawg:size=648082, offset=52014443
20:lstm-number-dawg:size=74, offset=52662525
23:version:size=9, offset=52662599
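  • Unpack the incremental-update traineddata using the same output prefix, so that its components overwrite the older files of the same name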
PS E:\tesseract4> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' -u '.\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata' .\chi_sim\chi_sim.
Extracting tessdata components from .\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata
Wrote .\chi_sim\chi_sim.config
Wrote .\chi_sim\chi_sim.lstm
Wrote .\chi_sim\chi_sim.lstm-punc-dawg
Wrote .\chi_sim\chi_sim.lstm-word-dawg
Wrote .\chi_sim\chi_sim.lstm-number-dawg
Wrote .\chi_sim\chi_sim.lstm-unicharset
Wrote .\chi_sim\chi_sim.lstm-recoder
Wrote .\chi_sim\chi_sim.version
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335
  • Pack the files in the directory into a complete traineddata file
    This writes the complete traineddata file into that same directory.
PS E:\tesseract4> cd .\chi_sim\
PS E:\tesseract4\chi_sim> & 'C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe' chi_sim
Combining tessdata files
Output chi_sim.traineddata created successfully.
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
1:unicharset:size=382937, offset=2158
2:unicharambigs:size=1, offset=385095
3:inttemp:size=39926030, offset=385096
4:pffmtable:size=50194, offset=40311126
5:normproto:size=618655, offset=40361320
6:punc-dawg:size=290, offset=40979975
7:word-dawg:size=652386, offset=40980265
8:number-dawg:size=74, offset=41632651
9:freq-dawg:size=1042, offset=41632725
13:shapetable:size=455944, offset=41633767
17:lstm:size=12152851, offset=42089711
18:lstm-punc-dawg:size=282, offset=54242562
19:lstm-word-dawg:size=590634, offset=54242844
20:lstm-number-dawg:size=82, offset=54833478
21:lstm-unicharset:size=258834, offset=54833560
22:lstm-recoder:size=72494, offset=55092394
23:version:size=84, offset=55164888
  • Copy the new traineddata file into the tessdata directory under the Tesseract installation path
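
The four steps above can be scripted. The following is a minimal Python sketch, not part of the original procedure: the function names (build_steps, run_steps, install) are hypothetical, the combine_tessdata.exe path is simply the one used in the commands above, and keeping a .bak copy of the previously installed file is an extra precaution that the original steps do not include.

```python
import os
import shutil
import subprocess
from pathlib import Path

# Assumed install location, copied from the commands above; adjust as needed.
COMBINE = r"C:\Program Files (x86)\Tesseract-OCR\combine_tessdata.exe"


def build_steps(lang, base, update, workdir, combine=COMBINE):
    """Return (argument list, working directory) pairs for the three calls."""
    # Output prefix such as chi_sim\chi_sim. (note the trailing dot).
    prefix = os.path.join(str(workdir), lang + ".")
    return [
        ([combine, "-u", str(base), prefix], None),    # unpack the original
        ([combine, "-u", str(update), prefix], None),  # unpack the update on top
        ([combine, lang], str(workdir)),               # repack inside workdir
    ]


def run_steps(steps):
    """Run each step in order and stop at the first failure."""
    for args, cwd in steps:
        subprocess.run(args, cwd=cwd, check=True)


def install(lang, workdir, tessdata_dir):
    """Copy the repacked file into tessdata, backing up any existing file."""
    src = Path(workdir) / f"{lang}.traineddata"
    dst = Path(tessdata_dir) / src.name
    if dst.exists():
        # Extra precaution (not in the original steps): keep the old file.
        shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
    shutil.copy2(src, dst)
    return dst


# Example usage, mirroring the session above (paths are assumptions):
# steps = build_steps(
#     "chi_sim",
#     r".\chi_sim.traineddata",
#     r".\20170915-Updated Data Files for Version 4.00\chi_sim.traineddata",
#     r".\chi_sim",
# )
# run_steps(steps)
# install("chi_sim", r".\chi_sim", r"C:\Program Files (x86)\Tesseract-OCR\tessdata")
```

Passing argument lists to subprocess.run avoids quoting problems with the spaces in the install and download folder names. Writing into the tessdata directory under Program Files may require an elevated prompt.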

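As a sanity check on the merge, the three component listings above can be compared arithmetically. The sketch below is an illustration only: the sizes are copied from the logs, the starting offset of 192 is read off those listings rather than taken from the file format specification, and the helper names (overlay, layout) are hypothetical. It models the update as replacing same-index components and adding new ones, then lays the result out back to back.

```python
FIRST_OFFSET = 192  # offset of the first component in every listing above

# index -> (name, size in bytes), copied from the original file's listing
base = {
    0: ("config", 1930), 1: ("unicharset", 382937), 2: ("unicharambigs", 1),
    3: ("inttemp", 39926030), 4: ("pffmtable", 50194), 5: ("normproto", 618655),
    6: ("punc-dawg", 290), 7: ("word-dawg", 652386), 8: ("number-dawg", 74),
    9: ("freq-dawg", 1042), 13: ("shapetable", 455944), 17: ("lstm", 9924750),
    18: ("lstm-punc-dawg", 18), 19: ("lstm-word-dawg", 648082),
    20: ("lstm-number-dawg", 74), 23: ("version", 9),
}

# index -> (name, size in bytes), copied from the update file's listing
update = {
    0: ("config", 1966), 17: ("lstm", 12152851), 18: ("lstm-punc-dawg", 282),
    19: ("lstm-word-dawg", 590634), 20: ("lstm-number-dawg", 82),
    21: ("lstm-unicharset", 258834), 22: ("lstm-recoder", 72494),
    23: ("version", 84),
}


def overlay(old, new):
    """Components from the update replace or extend those of the original."""
    merged = dict(old)
    merged.update(new)
    return merged


def layout(components, start=FIRST_OFFSET):
    """Place components back to back in index order, tracking each offset."""
    rows, offset = [], start
    for index in sorted(components):
        name, size = components[index]
        rows.append((index, name, size, offset))
        offset += size
    return rows


for row in layout(overlay(base, update)):
    print("{}:{}:size={}, offset={}".format(*row))
# Last line printed: 23:version:size=84, offset=55164888
```

The printed lines reproduce the listing of the repacked file: the legacy components (unicharset, inttemp, and so on) keep their 2016 sizes, while config, the lstm components, and version come from the 2017 update.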
Reposted from blog.csdn.net/huzhenwei/article/details/82705544