Training Tesseract on new fonts

A recent project required accurate recognition of specific fonts. With the officially downloaded eng recognition data, the error rate was too high to meet the requirements, so I set out to train on the fonts myself. Below are the training methods I looked into.

Note: Tesseract-OCR and Tesseract are treated as two different things in this article.

1. Manual training with Tesseract-OCR. You generate or collect sample images yourself, use jTessBoxEditor to correct the recognition errors by hand, and then type commands to generate the various files. Overall it is slow, tedious, and error-prone, and the results are not good.

2. LSTM training, still with Tesseract-OCR. This is more automated: it needs no images and no manual error correction, but you must provide the training text as a TXT file (the words, characters, and so on to be recognized) and then type commands to generate the training files. The results are still not good: the error rate is high and does not meet production needs.

3. Training with Tesseract. The steps are almost the same as Tesseract-OCR's LSTM training. The difference is that the samples come from the official training data, and there are ready-made training scripts written by experienced users, which make training much more convenient. The results are better than the first two methods and much better than the official eng data.

Summary: The training procedure may be the same, but the training samples matter just as much. Well-chosen samples get twice the result with half the effort.

The rest of this article covers only the simplest and most effective method: Tesseract LSTM training (Windows).

Preparation

1. Download and compile Tesseract. A compiled build of Tesseract is provided at the end of this article for download.

2. Add Tesseract/bin to the PATH environment variable.

3. Download the scripts required for training. The download address is given later in this article.

4. Prepare the font samples you want to train.

5. Copy the font files to be trained to the tesstrain/fonts path. tesstrain.sh supports training on multiple font files at the same time.

6. Copy the language data for the font to be trained to tesstrainsh-win/langdata_lstm. The font trained in this article is English, so download all the files under the langdata_lstm/eng folder and put them in tesstrainsh-win/langdata_lstm/eng. To train Simplified Chinese instead, create a chi_sim folder under tesstrainsh-win/langdata_lstm, download all the files under langdata_lstm/chi_sim, and place them in tesstrainsh-win/langdata_lstm/chi_sim.

7. Copy the base .traineddata file for the font to be trained to tesstrainsh-win/tessdata. In this article the file is tesstrainsh-win/tessdata/eng.traineddata. This .traineddata file must be downloaded from the tesseract-ocr/tessdata_best project on GitHub.

8. Copy lstm.train to tesstrainsh-win/tessdata/configs. The file is located in C:/Program Files/tesseract/tessdata/configs under the Tesseract installation path. A copy is already included in tesstrainsh-win, but it comes from Tesseract 4.1. If you use a different version of the training tools, keep the file at the same version: find it in your installation path and copy it to tesstrainsh-win/tessdata/configs, overwriting the existing file.

9. The three files tesstrain.sh, tesstrain_utils.sh, and language-specific.sh in the tesstrainsh-win project were copied from the Tesseract source code (Tesseract/src/training) of the Tesseract 4.1 release. If you use a different version of the training tools, keep these three files at the same version as well: find them in the source tree and copy them to tesstrainsh-win, overwriting the existing files.

10. Download radical-stroke.txt from langdata_lstm and copy it to tesstrainsh-win/langdata_lstm. A copy is already included in tesstrainsh-win. radical-stroke.txt has not been updated for two years, so the same file should work for all current Tesseract versions with LSTM.
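Before starting, it can help to confirm that the files from the steps above are in place. The following is a minimal Python sketch, not part of the tesstrainsh-win project. The function names are hypothetical, the `{lang}.traineddata` pattern generalizes from the eng.traineddata example above, and the list of command-line tools to look for on PATH is an assumption based on the standard Tesseract training tools.

```python
from pathlib import Path
import shutil


def required_paths(lang="eng"):
    """Paths relative to the tesstrainsh-win folder, taken from the preparation steps."""
    return [
        f"langdata_lstm/{lang}",              # step 6: language data folder
        f"tessdata/{lang}.traineddata",       # step 7: base model (assumed naming pattern)
        "tessdata/configs/lstm.train",        # step 8
        "tesstrain.sh",                       # step 9
        "tesstrain_utils.sh",                 # step 9
        "language-specific.sh",               # step 9
        "langdata_lstm/radical-stroke.txt",   # step 10
    ]


def missing_paths(root, lang="eng"):
    """Return the required paths that do not exist under root."""
    root = Path(root)
    return [p for p in required_paths(lang) if not (root / p).exists()]


def missing_tools(tools=("tesseract", "text2image", "lstmtraining"),
                  which=shutil.which):
    """Return the tools that cannot be found on PATH (step 2).

    The default tool list is an assumption; adjust it to what your scripts call.
    """
    return [t for t in tools if which(t) is None]
```

For example, `missing_paths("tesstrainsh-win")` returns an empty list when every file from steps 6 to 10 is present, and `missing_tools()` returns an empty list when Tesseract/bin is on PATH.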

Start training

With the preparation above done, we can start training.

1. Open the command prompt (cmd.exe) as an administrator and change to the folder where tesstrainsh-win is located.

2. Run sh tesstrainDone.sh. If the sh command is not available, run it from a Git Bash window instead.
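The two steps above can also be driven from a script. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, `sh` is looked up on PATH, and the script name tesstrainDone.sh is the one given in step 2. It only wraps the same command and adds nothing to the training itself.

```python
import shutil
import subprocess


def build_command(script="tesstrainDone.sh", which=shutil.which):
    """Return the command list for running the training script with sh."""
    sh = which("sh")
    if sh is None:
        # Mirrors the advice in step 2: without sh, use a Git Bash window.
        raise FileNotFoundError(
            "sh not found on PATH; run the script from a Git Bash window instead"
        )
    return [sh, script]


def run_training(workdir, script="tesstrainDone.sh"):
    """Run the training script inside workdir (the tesstrainsh-win folder).

    check=True raises subprocess.CalledProcessError on a non-zero exit code.
    """
    return subprocess.run(build_command(script), cwd=workdir, check=True)
```

Calling `run_training("tesstrainsh-win")` is then equivalent to changing into that folder and running sh tesstrainDone.sh by hand.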

Then wait for the training to complete. It does not take too long.

After training completes, the output files are in the output folder. You can also run sh eval.sh to evaluate the training results.
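As a separate sanity check, independent of eval.sh (whose output is not described here), one could compare the text recognized by the new model against a known ground-truth string. The sketch below computes a character error rate from the Levenshtein edit distance. It is an illustration only, the function names are hypothetical, and it is not necessarily the metric that eval.sh reports.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def char_error_rate(truth, recognized):
    """Edit distance divided by the length of the ground-truth text."""
    if not truth:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(truth, recognized) / len(truth)
```

Running the same sample images through the official eng data and through the newly trained model, then comparing the two rates, gives a rough idea of whether the training helped.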

Reference: Train Tesseract LSTM with tesstrain.sh on Windows - LiveZingy

Origin blog.csdn.net/baoolong/article/details/122231259