1. Download the jTessBoxEditor tool
jTessBoxEditor is a Java-based tool for training Tesseract OCR on sample images. It lets you train Tesseract on your own samples, build a custom language library, and improve the recognition rate and accuracy for pictures and text.
Official download page:
https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/
2. How to use
- Configure the Java runtime environment, unzip the downloaded file, and start the tool using the two files shown in the figure below
- The interface after a successful startup
- Operation steps
Make picture --> generate box file --> word training operation --> make new library
- Generate box file
- Word training operation
- After the run finishes, a box file is generated in the same directory as the picture
- Alternatively, open the picture in jTessBoxEditor, which shows the following interface
- Correct the wrongly recognized characters
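To make the correction step more concrete, it can help to see what a box file contains. In the Tesseract box format, each line holds one symbol followed by its left, bottom, right, and top pixel coordinates (origin at the bottom-left of the image) and a page number. The sketch below parses such lines and swaps a wrong symbol; the names `Box`, `parse_box_text`, and `fix_symbol`, and the sample data, are illustrative assumptions and not part of jTessBoxEditor.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Box:
    """One line of a Tesseract box file (hypothetical helper type)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the five numeric fields are always isolated.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"not a valid box line: {line!r}")
    symbol, *numbers = parts
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def parse_box_text(text: str) -> List[Box]:
    # Blank lines are skipped; every other line must be a box entry.
    return [parse_box_line(l) for l in text.splitlines() if l.strip()]


def fix_symbol(boxes: List[Box], index: int, new_symbol: str) -> List[Box]:
    # Return a new list in which one wrongly recognized symbol is corrected.
    fixed = list(boxes)
    fixed[index] = replace(fixed[index], symbol=new_symbol)
    return fixed


# Made-up sample data: the second symbol was misread as "l" instead of "1".
sample = "P 10 20 30 60 0\nl 35 20 50 60 0\n"
boxes = fix_symbol(parse_box_text(sample), 1, "1")
```

Correcting a character in the jTessBoxEditor interface amounts to the same kind of change: the coordinates stay as they are and only the symbol at the start of the line is replaced.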
- Make the new library
- After the new library is created, a tessdata directory is generated under the picture folder, and the new library is placed in that tessdata directory
- Use the new library
- Copy the new library to the Tesseract-OCR\tessdata directory and use it:
- When using the new library in Python code, remember to change the lang parameter to the name of the new library:
text = pytesseract.image_to_string(im, lang='pingan_ocr')
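The copy-and-use steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names `install_traineddata` and `recognize` and all paths in the usage comments are hypothetical, the new library is assumed to be a single `<lang>.traineddata` file, and `pytesseract` and Pillow are assumed to be installed for the recognition step.

```python
import shutil
from pathlib import Path


def install_traineddata(src_file, tessdata_dir) -> str:
    """Copy <lang>.traineddata into tessdata_dir and return the lang name."""
    src = Path(src_file)
    dest_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {dest_dir}")
    shutil.copy2(src, dest_dir / src.name)
    # Tesseract refers to the library by its file name without the extension.
    return src.stem


def recognize(image_path, lang, tessdata_dir=None) -> str:
    # Imported lazily so the copy helper works without these packages.
    import pytesseract
    from PIL import Image

    # Optionally point Tesseract at a specific tessdata directory.
    config = f'--tessdata-dir "{tessdata_dir}"' if tessdata_dir else ""
    with Image.open(image_path) as im:
        return pytesseract.image_to_string(im, lang=lang, config=config)


# Usage with hypothetical paths (adjust to your own installation):
# lang = install_traineddata(r"pictures\tessdata\pingan_ocr.traineddata",
#                            r"C:\Program Files\Tesseract-OCR\tessdata")
# print(recognize(r"pictures\sample.png", lang))
```

Returning the language name from the copy step keeps the `lang` argument in sync with the file that was actually installed, which avoids the common mistake of copying the library but still calling Tesseract with the default language.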