Tesseract OCR training data steps

1. Download the jTessBoxEditor tool

jTessBoxEditor is a box file editor and trainer for Tesseract OCR, written in Java. With it you can train Tesseract on your own samples and build a custom language library, which can improve recognition accuracy for your images and text.

Official download page:
https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

2. How to use

  1. Set up a Java runtime environment, extract the downloaded archive, and double-click the two files shown in the figure below to start the tool.
    [Figure: the two startup files in the extracted folder]
    The main interface appears once startup succeeds.
    [Figure: jTessBoxEditor main interface]
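
If double-clicking does not start the tool, launching it from a script can make the cause easier to see. The following is a minimal sketch, assuming the extracted folder contains a jar named jTessBoxEditor.jar (the name and the folder path are assumptions; check your download) and that a Java runtime is on PATH.

```python
import shutil
import subprocess
from pathlib import Path


def launch_command(install_dir, jar_name="jTessBoxEditor.jar"):
    """Build the command that starts the editor from its extracted folder.

    jar_name is an assumption; use the jar file actually present in your download.
    """
    return ["java", "-jar", str(Path(install_dir) / jar_name)]


def launch(install_dir):
    """Start the editor, failing early if Java is not installed."""
    if shutil.which("java") is None:
        raise RuntimeError("Java was not found on PATH; install a Java runtime first")
    # Run from the install directory so the tool can find its bundled files.
    return subprocess.Popen(launch_command(install_dir), cwd=install_dir)


# Show the command without running it ("jTessBoxEditor" is a hypothetical folder).
print(launch_command("jTessBoxEditor"))
```

Calling launch("jTessBoxEditor") would then open the editor window, provided the folder and jar exist.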

  2. Operation steps
    Prepare the image --> generate the box file --> correct the characters (training) --> build the new library

  3. Generate the box file
    [Figure: generating the box file]
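
The box file can also be generated from the command line with Tesseract's makebox config. The sketch below only builds the command, so it runs without Tesseract installed. The image name pingan_ocr.font.exp0.tif is a hypothetical example that follows the usual [lang].[fontname].exp[num] naming convention; it is not taken from the original post.

```python
from pathlib import Path


def makebox_command(image_path, lang=None):
    """Build the tesseract command that writes <image name>.box next to the image."""
    image = Path(image_path)
    # Tesseract appends ".box" to the output base, so drop only the image extension.
    output_base = image.with_suffix("")
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # Optional: an existing language to produce the initial guesses.
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


# Hypothetical training image name.
print(" ".join(makebox_command("pingan_ocr.font.exp0.tif")))
```

To actually run it, pass the returned list to subprocess.run(..., check=True) on a machine where the tesseract executable is on PATH.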

  4. Character training (correcting the box file)

  • After the previous step runs, a box file is generated in the same directory as the image.
  • Alternatively, open the image in jTessBoxEditor, which shows the following interface.
  • Correct any wrongly recognized characters.
    [Figure: box editing interface in jTessBoxEditor]
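
For reference, each line of a standard box file holds one symbol followed by its left, bottom, right, and top pixel coordinates (origin at the bottom-left of the image) and a page number. Correcting a character in the editor amounts to changing the symbol on that line. The sketch below illustrates the idea; the Box type and helper names are hypothetical and the sample line is made up.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the five numeric fields are always taken last.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def format_box(box):
    """Serialize a Box back into box file syntax."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Example: the letter "O" was recognized where the digit "0" was meant.
original = parse_box_line("O 12 30 25 52 0")
fixed = original._replace(symbol="0")
print(format_box(fixed))  # prints: 0 12 30 25 52 0
```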
  5. Make the new library
    [Figure: building the new library]
  6. After the new library is created, a tessdata directory is generated under the image folder, and the new library is inside that tessdata directory.
  7. Use the new library
  • Copy the new library to the Tesseract-OCR\tessdata directory.
  • When using the new library in Python code, remember to change the lang setting to the new library's name:
text = pytesseract.image_to_string(im, lang='pingan_ocr')
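
A slightly fuller sketch of these last two steps follows. It assumes the trained file is named pingan_ocr.traineddata (Tesseract loads a language from <lang>.traineddata) and that pytesseract and Pillow are installed. The helper names and all paths are hypothetical.

```python
import shutil
from pathlib import Path


def install_traineddata(src, tessdata_dir):
    """Copy a .traineddata file into Tesseract's tessdata directory."""
    src = Path(src)
    dest_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {src.name}")
    if not dest_dir.is_dir():
        raise FileNotFoundError(f"tessdata directory not found: {dest_dir}")
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    return dest


def recognize(image_path, lang="pingan_ocr", tesseract_cmd=None):
    """Run OCR with the custom library; lang must match the .traineddata file name."""
    # Imported here so the copy helper above works without these packages.
    import pytesseract
    from PIL import Image

    if tesseract_cmd:
        # Only needed when the tesseract executable is not on PATH.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    with Image.open(image_path) as im:
        return pytesseract.image_to_string(im, lang=lang)
```

For example, install_traineddata(r"samples\tessdata\pingan_ocr.traineddata", r"C:\Program Files\Tesseract-OCR\tessdata") followed by recognize("test.png") would use the new library; both paths here are placeholders for your own locations.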

Origin blog.csdn.net/zhuan_long/article/details/131844042