Successfully Configuring Tesseract-OCR in Anaconda3
Introduction to Tesseract-OCR
Tesseract-OCR is an OCR engine originally developed at Hewlett-Packard (HP) Laboratories starting in 1985. By 1995 it ranked among the three most accurate OCR engines, but soon afterwards HP gave up its development and maintenance and later contributed it to the open source community. In 2005 Tesseract was released as open source by HP together with the University of Nevada, Las Vegas, and it was then improved and optimized with Google's support. It remains one of the more accurate recognition engines today. Tesseract-OCR was originally written in C (later versions use C++), and the Pytesseract library is a Python wrapper around Tesseract-OCR.
By installing the Pytesseract library and calling its functions, you can use Tesseract-OCR for OCR text recognition in a Python environment.
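As a rough sketch of what "calling its functions" looks like once everything in this guide is installed, the snippet below wraps pytesseract.image_to_string in a small helper. The helper name ocr_image and the file name sample.png are made up for illustration, and lang="eng" assumes the English language data is present.

```python
from pathlib import Path


def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract-OCR recognizes in the image at image_path."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such image: {path}")

    # Imported lazily so the file check above works even before installation.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


# Usage (only after Tesseract-OCR is installed and configured):
# print(ocr_image("sample.png", lang="eng"))
```

The call only works after the engine itself is installed and pytesseract can find it, which is what the rest of this guide sets up.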
The following are commonly used Tesseract URLs:
Download address: https://digi.bib.uni-mannheim.de/tesseract/
Official website: https://github.com/tesseract-ocr/tesseract
Official documentation: https://github.com/tesseract-ocr/tessdoc
Language pack address: https://github.com/tesseract-ocr/tessdata
Precautions:
1. Avoid pre-release versions: dev (still in development), alpha (internal test builds, generally not released publicly and often buggy), beta (public test builds open to all users), and similar.
2. Download the latest stable version (at the time of writing, tesseract-ocr-w64-setup-5.3.1.20230401.exe). In testing, installing an older version with Chinese checked under Additional language data may cause the installer to report an error.
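If you have a list of installer file names from the download page, the first precaution can be applied mechanically. The following is a minimal sketch that treats any name containing dev, alpha, or beta as a pre-release. The function name is hypothetical, and the two pre-release file names are illustrative examples rather than a listing of the actual download directory.

```python
import re

# Tags named in the precautions above.
PRERELEASE_TAGS = ("dev", "alpha", "beta")


def is_stable_installer(filename):
    """Return True if the file name carries none of the pre-release tags."""
    name = filename.lower()
    for tag in PRERELEASE_TAGS:
        # Match the tag only as a standalone word, not inside another word.
        if re.search(rf"(?<![a-z]){tag}(?![a-z])", name):
            return False
    return True


candidates = [
    "tesseract-ocr-w64-setup-5.3.1.20230401.exe",
    "tesseract-ocr-w64-setup-v5.0.0-alpha.20210506.exe",  # illustrative name
    "tesseract-ocr-w64-setup-v4.0.0-beta.1.exe",          # illustrative name
]
stable = [name for name in candidates if is_stable_installer(name)]
print(stable)
```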
Installation and configuration environment steps
1. Install the pytesseract third-party library in the Anaconda virtual environment
Installing the Pytesseract library works basically the same way as installing the OpenCV library: enter the "pip install pytesseract" command directly in the Anaconda Prompt.
pip install pytesseract
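If you run a program right after this installation, it fails with an error, because pytesseract is only a wrapper and the Tesseract-OCR engine itself is not installed yet.
Next, we need to install the engine and configure the environment.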
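To see which state your environment is in, you can ask pytesseract for the engine version. pytesseract raises TesseractNotFoundError when it cannot launch the tesseract executable, so the sketch below catches that case and reports it. The function name tesseract_status and the message wording are my own.

```python
def tesseract_status():
    """Describe whether pytesseract and the Tesseract-OCR engine are usable."""
    try:
        import pytesseract
    except ImportError:
        return "pytesseract is not installed in this environment"

    try:
        version = pytesseract.get_tesseract_version()
    except pytesseract.TesseractNotFoundError:
        return "pytesseract is installed, but the Tesseract-OCR engine was not found"

    return f"Tesseract-OCR {version} is available"


print(tesseract_status())
```

Right after "pip install pytesseract" you should expect the second message; after finishing the steps below, the third.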
2. Download the tesseract-ocr installation package and install it
(1) The Tesseract-OCR installation package can be obtained from the official website or other open source projects. Download the version that matches your system architecture (32-bit or 64-bit).
(2) Double-click the downloaded Tesseract-OCR installation package to open it, enter the installation interface, and click the "Next" button to proceed to the next step.
The latest version has a language selection interface.
(3) In the "License Agreement" window, click the "I Agree" button to accept the agreement and proceed to the next step.
(4) Select the installation type and click the "Next" button to proceed to the next step.
(5) Tesseract-OCR recognizes English by default. To recognize Chinese or other languages, expand "Additional language data (download)" in the "Choose Components" window, check "Chinese (Simplified)" and "Chinese (Simplified Vertical)" in the options below it, and click "Next" to proceed to the next step.
(6) You can keep the default installation location or click "Browse" to choose your own. Remember this location, because the path is needed later when configuring the environment. Click the "Next" button to proceed to the next step.
(7) In the "Choose Start Menu Folder" window, keep the default and click "Install" to start the installation.
(8) After the Tesseract-OCR installation is complete, click the "Next" button, and finally click the "Finish" button to end the installation.
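After the installer finishes, a quick way to confirm steps (5) and (6) is to look for the executable and the language data files in the installation folder. This sketch assumes the usual layout of tesseract.exe in the installation folder and one .traineddata file per language under tessdata. The function name is hypothetical, and the path in the usage comment is the default location used later in this guide.

```python
from pathlib import Path


def missing_install_files(install_dir, langs=("eng",)):
    """Return the expected files that are absent from install_dir."""
    root = Path(install_dir)
    expected = [root / "tesseract.exe"]
    # Each language is stored as tessdata/<lang>.traineddata.
    expected += [root / "tessdata" / f"{lang}.traineddata" for lang in langs]
    return [str(path) for path in expected if not path.is_file()]


# Usage: an empty list means everything was found.
# print(missing_install_files(r"C:\Program Files\Tesseract-OCR",
#                             langs=("eng", "chi_sim", "chi_sim_vert")))
```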
3. Environment configuration
Open your computer's advanced system settings.
Click Environment Variables, find Path under System variables, and add the installation path of Tesseract-OCR to it.
Then create a new system variable named TESSDATA_PREFIX whose value is the tessdata path:
C:\Program Files\Tesseract-OCR\tessdata
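The two settings above can be double-checked from Python. The following sketch compares Path entries and TESSDATA_PREFIX against the installation folder using Windows path rules (case-insensitive, trailing backslash ignored). The function name and the returned keys are my own, and the default path is the one assumed in this guide.

```python
import ntpath
import os


def check_environment(install_dir, env=None, pathsep=";"):
    """Report whether install_dir is on PATH and TESSDATA_PREFIX points to its tessdata."""
    env = os.environ if env is None else env

    def norm(path):
        # Windows-style comparison: ignore case, quotes, and trailing separators.
        return ntpath.normcase(ntpath.normpath(path.strip().strip('"')))

    entries = [norm(p) for p in env.get("PATH", "").split(pathsep) if p.strip()]
    prefix = env.get("TESSDATA_PREFIX")
    expected_prefix = ntpath.join(install_dir, "tessdata")
    return {
        "on_path": norm(install_dir) in entries,
        "tessdata_prefix_ok": prefix is not None and norm(prefix) == norm(expected_prefix),
    }


print(check_environment(r"C:\Program Files\Tesseract-OCR"))
```

Environment variables are read when a program starts, so run this from a newly opened Anaconda Prompt after changing them.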
Check if the installation was successful
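Open a new Anaconda Prompt so that it picks up the updated environment variables, and activate the virtual environment you use (enter activate followed by the environment name); by default you are in the base environment.
Switch to the Tesseract-OCR installation path (unless it is already on Path in this prompt); otherwise you will be told that tesseract is "not an internal or external command":
cd C:\Program Files\Tesseract-OCR
Then enter:
tesseract --version
tesseract --list-langs
If the version information and the list of installed languages are printed, the installation was successful.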
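The same check can be scripted. The sketch below runs tesseract --list-langs through subprocess and turns the output into a Python list. It assumes the output is a header line beginning with "List of available languages" followed by one language code per line. The function names are hypothetical, and the sample text in the final comment is illustrative rather than captured from a real run.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by tesseract --list-langs."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    header = "list of available languages"
    return [line for line in lines if not line.lower().startswith(header)]


def list_langs(executable="tesseract"):
    """Return installed language codes, or None if the executable is not on PATH."""
    exe = shutil.which(executable)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some older versions print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)


print(list_langs())
# Illustrative input for parse_langs:
# 'List of available languages in "C:\\...\\tessdata/" (3):\nchi_sim\neng\nosd\n'
```

A result of None means the prompt cannot find tesseract, which points back to the Path setting in the previous step.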
Modify the pytesseract.py file (very important!)
In the pytesseract library of the Anaconda virtual environment you are using, open the pytesseract.py file, find the line tesseract_cmd = 'tesseract', and change it to
tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
(replace the path with your own installation path)
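As an alternative to editing the library file, pytesseract also lets you set the same variable from your own script through pytesseract.pytesseract.tesseract_cmd. The change then lives in your code and is not lost when the package is reinstalled or upgraded. A minimal sketch follows; the function name is my own, and the default path is an assumption that you should replace with your own installation path.

```python
import os

# Assumed default installation path; replace with your own.
DEFAULT_CMD = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def configure_tesseract(cmd_path=DEFAULT_CMD):
    """Point pytesseract at cmd_path for this process; return True on success."""
    if not os.path.isfile(cmd_path):
        return False
    try:
        import pytesseract
    except ImportError:
        return False
    # Same variable as in pytesseract.py, but set at runtime.
    pytesseract.pytesseract.tesseract_cmd = cmd_path
    return True


print(configure_tesseract())
```

Call this once at the top of your program, before any OCR call.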
Finally, run the program. Success!