Successfully configure Tesseract-OCR in Anaconda3 for OCR text recognition: a super detailed tutorial (Win7, Win10)

Introduction to Tesseract-OCR

Tesseract-OCR is an OCR engine originally developed at Hewlett-Packard (HP) Laboratories starting in 1985. By 1995 it ranked among the three most accurate OCR engines, but HP soon gave up its development and maintenance. In 2005 HP released Tesseract as open source together with the University of Nevada, Las Vegas (UNLV), and Google later took over improving and optimizing it. Tesseract-OCR remains one of the more accurate open-source recognition engines. The original Tesseract-OCR was written in C (the code base was later moved to C++), and the pytesseract library is a Python wrapper around Tesseract-OCR.
By installing the pytesseract library and calling its functions, you can use Tesseract-OCR for OCR text recognition in a Python environment.
The following are commonly used URLs for Tesseract:
Download address: https://digi.bib.uni-mannheim.de/tesseract/
Official website: https://github.com/tesseract-ocr/tesseract
Official documentation: https://github.com/tesseract-ocr/tessdoc
Language pack address: https://github.com/tesseract-ocr/tessdata

Precautions:
1. Avoid dev (in-development), alpha (internal test builds, generally not released publicly and often buggy), and beta (public test builds open to all users) versions.
2. Download the latest stable version (at the time of writing, tesseract-ocr-w64-setup-5.3.1.20230401.exe). In testing, installing an older version with Chinese checked under Additional language data could report an error during installation.
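
As a rough illustration of the first precaution, the sketch below filters installer file names by the pre-release markers named above. The function name is_stable_installer is hypothetical, the second file name is invented for the example, and the check is a simple substring match, so it is only a convenience and not a rule used by the download site.

```python
# Markers named in the precautions above; other pre-release tags are not covered.
PRERELEASE_MARKERS = ("dev", "alpha", "beta")


def is_stable_installer(filename):
    """Return True if the installer name contains no pre-release marker."""
    name = filename.lower()
    return not any(marker in name for marker in PRERELEASE_MARKERS)


if __name__ == "__main__":
    candidates = [
        "tesseract-ocr-w64-setup-5.3.1.20230401.exe",  # name taken from this tutorial
        "tesseract-ocr-w64-setup-5.0.0-alpha.exe",     # hypothetical example name
    ]
    for name in candidates:
        label = "stable" if is_stable_installer(name) else "pre-release"
        print(name, "->", label)
```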

Installation and configuration environment steps

1. Install the pytesseract third-party library in the Anaconda virtual environment

The pytesseract library is installed in much the same way as the OpenCV library: enter the following command in the Anaconda Prompt.

pip install pytesseract

After installation, running a program that calls pytesseract will still report an error, because the Tesseract-OCR engine itself has not been installed yet.
Next we need to configure the environment.
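
The error at this stage usually means that pytesseract, which is only a wrapper, cannot find the tesseract executable (pytesseract reports this as TesseractNotFoundError). Below is a minimal sketch of a check that uses only the standard library; the function name check_ocr_setup is hypothetical.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report whether the Python wrapper and the OCR engine can be found."""
    return {
        # True if the pytesseract package is importable in this environment.
        "pytesseract_installed": importlib.util.find_spec("pytesseract") is not None,
        # Full path of the tesseract executable found on Path, or None.
        "tesseract_executable": shutil.which("tesseract"),
    }


if __name__ == "__main__":
    for name, value in check_ocr_setup().items():
        print(f"{name}: {value}")
```

If tesseract_executable is None, the engine is either not installed or not yet on Path, which the following steps address.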

2. Download the tesseract-ocr installation package and install it

(1) The Tesseract-OCR installation package can be obtained from the official website or from other open-source projects. Choose the build that matches your system (32-bit or 64-bit).
(2) Double-click the downloaded Tesseract-OCR installation package to open the installer, then click the "Next" button to proceed. The latest version first shows a language selection screen.
(3) In the "License Agreement" window, click the "I Agree" button to accept the agreement and proceed.
(4) Select the installation type and click the "Next" button to proceed.

(5) The default recognition language in Tesseract-OCR is English. To recognize Chinese or other scripts, expand "Additional language data (download)" in the "Choose Components" window, check "Chinese (Simplified)" and "Chinese (Simplified Vertical)" in the options below it, and click "Next" to proceed.
(6) Keep the default installation location, or click "Browse" to choose another one. Remember this path, because it is used in the environment configuration below. Click the "Next" button to proceed.
(7) In the "Choose Start Menu Folder" window, keep the default and click "Install" to start the installation.
(8) After the installation completes, click the "Next" button, then click the "Finish" button to close the installer.

3. Environment configuration

Open your computer's advanced system settings.
Click Environment Variables, find Path under System variables, and add the Tesseract-OCR installation path to it.
Then create a new system variable named TESSDATA_PREFIX whose value is the tessdata path:
C:\Program Files\Tesseract-OCR\tessdata
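
To confirm that the two variables are visible to Python, a check along the following lines may help. It is only a sketch: the function name check_tesseract_env is hypothetical, and the install directory is the default path used in this tutorial. Open a new Anaconda Prompt first, since an already open prompt does not pick up changed environment variables.

```python
import os


def _norm(path):
    """Normalize a path so case and trailing slashes do not affect comparison."""
    return os.path.normcase(os.path.normpath(path.strip()))


def check_tesseract_env(install_dir, environ=None):
    """Check Path and TESSDATA_PREFIX against a Tesseract-OCR install directory."""
    env = os.environ if environ is None else environ
    # On Windows, os.environ is case-insensitive, so "PATH" also matches "Path".
    entries = [_norm(p) for p in env.get("PATH", "").split(os.pathsep) if p.strip()]
    tessdata = env.get("TESSDATA_PREFIX", "")
    return {
        "install_dir_on_path": _norm(install_dir) in entries,
        "tessdata_prefix_set": bool(tessdata),
        "tessdata_prefix_value": tessdata,
    }


if __name__ == "__main__":
    # Default install location from this tutorial; replace it with your own.
    print(check_tesseract_env(r"C:\Program Files\Tesseract-OCR"))
```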

Check if the installation was successful

Open Anaconda Prompt and activate the virtual environment you use (enter activate followed by the environment name); by default you are in the base environment.
Switch to the Tesseract-OCR installation path; otherwise you may be told that tesseract is "not an internal or external command". Enter
cd C:\Program Files\Tesseract-OCR
and then enter
tesseract --version
tesseract --list-langs
The first command prints the installed version, and the second lists the installed language packs.
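
The same check can be run from Python with the standard subprocess module. This is a sketch under the assumption that tesseract is reachable through Path; the function name run_tesseract is hypothetical.

```python
import subprocess


def run_tesseract(*args, exe="tesseract"):
    """Run the tesseract CLI and return its text output, or None if not found."""
    try:
        result = subprocess.run(
            [exe, *args], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # The executable is not on Path (the "not an internal or external
        # command" situation described above).
        return None
    # Combine both streams in case the output is written to stderr.
    return (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    for flag in ("--version", "--list-langs"):
        output = run_tesseract(flag)
        print(f"tesseract {flag}:")
        print(output if output is not None else "tesseract was not found on Path")
```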

Modify the pytesseract.py file (very important!)

In the pytesseract library of your Anaconda virtual environment, open the pytesseract.py file, find tesseract_cmd = 'tesseract', and change it to
tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
(replace the path with your own installation path)
Finally, run the program. Success!
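
An edit made inside the installed library is lost when pytesseract is reinstalled or upgraded. As an alternative sketch, the same variable can be set from your own script through pytesseract.pytesseract.tesseract_cmd. The helper name ocr_image and the default path below are assumptions for illustration, and the example also needs the Pillow package for opening images.

```python
import os

# Default install location from this tutorial; replace it with your own path.
TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def ocr_image(image_path, lang="eng", tesseract_exe=TESSERACT_EXE):
    """Recognize the text in an image using an explicit tesseract executable."""
    if not os.path.isfile(tesseract_exe):
        raise FileNotFoundError(f"tesseract executable not found: {tesseract_exe}")

    # Imported after the path check so a wrong path is reported first.
    import pytesseract
    from PIL import Image

    # Point pytesseract at the executable without editing pytesseract.py.
    pytesseract.pytesseract.tesseract_cmd = tesseract_exe
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)
```

For example, ocr_image("test.png", lang="chi_sim") would recognize Simplified Chinese text, assuming test.png exists and the Chinese (Simplified) language data was selected during installation.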

Origin blog.csdn.net/weixin_42149550/article/details/131512759