Pitfalls encountered when fine-tuning a Tesseract-OCR LSTM model

My environment:

  • Windows 10
  • Tesseract 4.1.0
  • jTessBoxEditor 2.2.1

The training process refers to the following article:

https://blog.csdn.net/Hu_helloworld/article/details/100923215

  • Pit 1. makebox:

When I ran the following command:

tesseract nml.num.exp0.tif nml.num.exp0 -l eng --psm 6 batch.nochop makebox

Box information was generated only for the first image in the TIFF. It turned out that I had apparently been using jTessBoxEditor's Merge TIFF incorrectly: I had originally fed it multiple JPG files to build the TIFF. After converting every JPG to TIFF first (using OpenCV) and then running Merge TIFF on the converted files, the resulting multi-page TIFF produced text information for all of the images.
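To confirm whether a box file really covers every page of the merged TIFF, the page index in the last column of each entry can be counted. The following is a minimal sketch, not part of the original workflow: the function names pages_in_box_file and missing_pages are hypothetical, and it assumes the usual box layout of "symbol left bottom right top page" with pages numbered from 0.

```python
def pages_in_box_file(box_text):
    """Return the sorted page indices that appear in a Tesseract box file."""
    pages = set()
    for line in box_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Assumed entry layout: "<symbol> <left> <bottom> <right> <top> <page>"
        fields = line.rstrip().rsplit(" ", 5)
        if len(fields) != 6:
            continue  # skip lines that do not look like box entries
        pages.add(int(fields[5]))
    return sorted(pages)


def missing_pages(box_text, expected_page_count):
    """List the page indices (0-based) that have no entry in the box file."""
    found = set(pages_in_box_file(box_text))
    return [page for page in range(expected_page_count) if page not in found]


if __name__ == "__main__":
    # A box file that only describes page 0 of a 3-page TIFF.
    sample = "1 148 127 160 151 0\n2 162 127 175 151 0\n"
    print(missing_pages(sample, 3))
```

If the returned list is not empty, the TIFF was probably merged incorrectly, as happened in this pit.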

  • Pit 2. Compute CTC targets failed!

When I ran the following command:

lstmtraining --model_output="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\output\output" --continue_from="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\eng.lstm" --train_listfile="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\eng.training_files.txt" --traineddata="F:\Test\AMyWork\ImgSampleLib\nomal\samples\CTCCB24\eng.traineddata" --debug_interval -1 --max_iterations 2000


The message Compute CTC targets failed! was printed in an endless loop. After a Google search, I found that the data format in the box file was incorrect. In LSTM training mode, the box file works at the level of whole text lines, not individual character boxes. So for every character that belongs to one text line, change the coordinates from the box of that single character to the box of the whole line, and end each text line with an extra entry whose symbol is \t, such as the following:

1 148 127 268 151 0
2 148 127 268 151 0
3 148 127 268 151 0
4 148 127 268 151 0
5 148 127 268 151 0
6 148 127 268 151 0
7 148 127 268 151 0
8 148 127 268 151 0
     148 127 268 151 0

The symbol at the start of the last entry is \t, i.e. a tab character.
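The manual edit described above can be sketched in a few lines of Python. This is only an illustration under a strong assumption: each page of the TIFF contains exactly one text line, as in the sample above, so all entries of a page are merged into one line box. The function name to_lstm_box is hypothetical, and the input is assumed to be a per-character box file with entries of the form "symbol left bottom right top page".

```python
def to_lstm_box(box_text):
    """Rewrite per-character boxes as line-level boxes ending in a tab entry.

    Assumption: every page holds exactly one text line.
    """
    entries_by_page = {}
    for line in box_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        symbol, left, bottom, right, top, page = line.rstrip().rsplit(" ", 5)
        if symbol == "\t":
            continue  # drop existing end-of-line markers, they are re-added below
        entries_by_page.setdefault(int(page), []).append(
            (symbol, int(left), int(bottom), int(right), int(top))
        )

    out = []
    for page in sorted(entries_by_page):
        entries = entries_by_page[page]
        # Union of all character boxes (box origin is the bottom-left corner).
        left = min(e[1] for e in entries)
        bottom = min(e[2] for e in entries)
        right = max(e[3] for e in entries)
        top = max(e[4] for e in entries)
        coords = f"{left} {bottom} {right} {top} {page}"
        for symbol, *_ in entries:
            out.append(f"{symbol} {coords}")
        out.append(f"\t {coords}")  # end-of-line marker: a tab as the symbol
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    per_char = "1 10 20 30 40 0\n2 32 18 50 42 0\n"
    print(to_lstm_box(per_char), end="")
```

For images with several text lines per page, the grouping step would have to split entries by line first; that case is not handled here.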

The official documentation actually explains this, but it is only available in English, and few people read it.

Official description of box format

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

The correct command for generating an LSTM box file should look similar to the following:

tesseract cn.my.exp0.tif cn.my.exp1 -l chi_sim lstmbox
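Before starting a long training run, it may be worth checking that a box file has the two properties described in this pit: all entries of a text line share one box, and each text line ends with a tab entry. The checker below is a rough sketch with a hypothetical name (check_lstm_box); it only looks at the text of the box file and does not call Tesseract.

```python
def check_lstm_box(box_text):
    """Return a list of human-readable problems found in an LSTM box file."""
    problems = []
    current = None  # coordinates shared by the text line currently being read
    for number, line in enumerate(box_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip().rsplit(" ", 5)
        if len(fields) != 6:
            problems.append(f"line {number}: expected 6 fields")
            continue
        symbol, coords = fields[0], tuple(fields[1:])
        if symbol == "\t":
            # End-of-line marker: it should carry the same box as its line.
            if current is not None and coords != current:
                problems.append(f"line {number}: tab entry has a different box")
            current = None
        elif current is None:
            current = coords  # first character of a new text line
        elif coords != current:
            problems.append(
                f"line {number}: box differs from the rest of its text line "
                "(or a tab entry is missing before it)"
            )
            current = coords
    if current is not None:
        problems.append("last text line is not terminated by a tab entry")
    return problems


if __name__ == "__main__":
    good = "1 148 127 268 151 0\n2 148 127 268 151 0\n\t 148 127 268 151 0\n"
    print(check_lstm_box(good))
```

An empty list means the file passed both checks; it does not guarantee that training will succeed.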

 

Then you can happily start the training process.

 

Origin blog.csdn.net/qq_19313495/article/details/102977915