pytesseract usage

Linux

1. Download tesseract-ocr source

git clone -b master https://github.com/tesseract-ocr/tesseract.git tesseract-ocr

 

2. Install gcc, g++ and make

yum install gcc gcc-c++ make

 

3. Install autoconf, automake, libtool, libjpeg-devel, libpng-devel, libtiff-devel and zlib-devel

yum install autoconf automake libtool

yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

 

4. Install leptonica

wget http://www.leptonica.org/source/leptonica-1.76.0.tar.gz

Extract the archive, enter the resulting directory, and run the following commands in order:

./configure

make

make install

 

After compiling, use vim to add the following three variables to /etc/profile:

vim /etc/profile

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

 

After saving the file, apply the changes with: source /etc/profile

5. Go to the tesseract-ocr directory downloaded in step 1 and run the following commands in order:

./autogen.sh

./configure

make

make install

 

6. Install pytesseract

pip3 install pytesseract

7. Configure the language environment variables

cp /usr/share/tesseract/tessdata/* /usr/local/share/tessdata/
vim /etc/profile
Add the following lines:
export TESSDATA_PREFIX=/usr/local/share/tessdata/
export PATH=$PATH:$TESSDATA_PREFIX
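
To confirm that the copy in this step worked, the language files in the tessdata directory can be listed from Python. This is only a minimal sketch, assuming the directory holds one <lang>.traineddata file per language (such as eng.traineddata); the helper name installed_languages is made up for illustration and is not part of tesseract or pytesseract.

```python
import os


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    if not os.path.isdir(tessdata_dir):
        return []
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


if __name__ == "__main__":
    # Falls back to the path used in step 7 when TESSDATA_PREFIX is unset.
    prefix = os.environ.get("TESSDATA_PREFIX", "/usr/local/share/tessdata/")
    print(installed_languages(prefix))
```

An empty list suggests that the copy did not land where TESSDATA_PREFIX points.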

 



Windows

 Reference: https://blog.csdn.net/showgea/article/details/82656515

1. Install Tesseract
OCR (Optical Character Recognition) is the process of scanning characters and translating their shapes into electronic text. Graphic verification codes are made of irregular characters, which are in fact ordinary characters that have been slightly distorted.

 

Tesseract download: https://digi.bib.uni-mannheim.de/tesseract/

 

The download page lists a variety of .exe files; here you can choose to download a 3.0 release.

 

 

 

File names containing dev are development versions, and those without dev are stable versions. Choose one without dev, for example tesseract-ocr-setup-3.05.02.exe.

 

After the download completes, double-click the file to start the installer.

 

 

 

At this point you can check the Additional language data (download) option to install the OCR language packs, so that OCR can recognize multiple languages. Then click Next all the way through.

 

Next, to use tesseract from Python code, install pytesseract with pip:

 

pip install pytesseract

 

2. Configure the environment variables
For convenient global use, add the installation path (for example D:\Program Files (x86)\Tesseract-OCR) to the Path environment variable.

 

 

 

After configuring, run tesseract -v at the command line. If the version information is printed, the environment variable has been configured successfully.
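
The same check can be done from Python, which is handy before running any OCR code. The following is a rough sketch that only assumes tesseract accepts the -v flag used above; the function name tesseract_version is hypothetical, and stderr is merged into stdout in case the version banner is printed there.

```python
import subprocess


def tesseract_version(cmd="tesseract"):
    """Return the first line printed by `<cmd> -v`, or None if cmd cannot be run."""
    try:
        proc = subprocess.run(
            [cmd, "-v"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture the banner wherever it is printed
            text=True,
        )
    except (FileNotFoundError, PermissionError):
        # The executable is not on PATH (or cannot be executed).
        return None
    lines = proc.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(tesseract_version() or "tesseract not found on PATH")
```

A None result points to the PATH problem described in this step rather than to a problem with pytesseract itself.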

 

 

 

3. Verify the installation
Next, we can test tesseract and pytesseract separately.

We use the following picture as the test sample.

 

 

 

The picture is available at https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png and can be saved or downloaded directly.

First test from the command line: download the picture to the chromeDownload folder on the D drive, save it as image.png, then open a command line in that folder and run the tesseract command:

 

tesseract image.png result 

 

Results are as follows:

 

D:\chromeDownload>tesseract image.png result
Tesseract Open Source OCR Engine v3.05.02 with Leptonica

 

Here we call the tesseract command: the first argument is the name of the picture, and the second argument, result, is the base name of the file in which the result is saved.
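
The two-argument call described above can also be driven from a script. This is a small sketch under the assumption that tesseract is on the PATH and writes its output to <output_base>.txt; build_command and ocr_to_text are illustrative names, and the optional -l language flag is not used in the original command.

```python
import subprocess


def build_command(image_path, output_base, lang=None):
    """Build the argument list for `tesseract <image> <output_base> [-l lang]`."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


def ocr_to_text(image_path, output_base="result", lang=None):
    """Run tesseract and return the contents of <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read().strip()
```

Calling ocr_to_text("image.png") from the chromeDownload folder should return the same text that the command line run saves in result.txt.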

 

The recognition result for the picture is: Python3WebSpider. A result.txt file now appears in the chromeDownload folder, which shows that the text in the picture has been successfully converted into electronic text.

 

The test can also be run from Python code with the help of the pytesseract library. The test code is as follows:

 

from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open(r'D:\chromeDownload\image.png'))
print(text)

We first open the picture file with Image, then call pytesseract's image_to_string() method, and finally print the recognition result.

 

Results are as follows:

 

Python3WebSpider

 

If this output appears, tesseract and pytesseract have both been installed successfully.
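
If the additional language data was selected during installation, more than one language can be requested in a single call. The sketch below assumes the relevant language packs are present (chi_sim is only an example code) and relies on image_to_string's lang parameter, which takes codes joined with "+"; join_langs and recognize are made-up helper names.

```python
def join_langs(langs):
    """Combine language codes into the 'a+b' form accepted by tesseract."""
    return "+".join(langs)


def recognize(image_path, langs=("eng",)):
    """OCR image_path with the given languages via pytesseract."""
    # Imported here so join_langs can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=join_langs(langs)).strip()
```

For example, recognize(r'D:\chromeDownload\image.png', langs=("eng", "chi_sim")) would ask tesseract to use both language packs.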

 

4. Pitfalls encountered
When testing tesseract from the command line, you may get the following error:

 

Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 

The error means that the TESSDATA_PREFIX environment variable is missing, so no language could be loaded and tesseract could not be initialized.

The solution is simple: add a TESSDATA_PREFIX environment variable.

 

 

 

Note: the variable value is "D:/Program Files (x86)/Tesseract-OCR", written with forward slashes "/". A path copied from Windows uses backslashes "\" by default.

 

Once configured, reopen the command line and tesseract can be used normally.
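
For a quick experiment, the same variable can be set from inside a Python script before tesseract is launched. This is a sketch only: it affects the current process and the programs it starts, not the system settings, and it reuses the article's value (the install directory) without checking which directory your tesseract version expects. The helper names are hypothetical.

```python
import os


def to_forward_slashes(path):
    """Convert a copied Windows path to the forward-slash form from the note."""
    return path.replace("\\", "/")


def set_tessdata_prefix(install_dir, env=os.environ):
    """Set TESSDATA_PREFIX in env (by default, this process's environment)."""
    env["TESSDATA_PREFIX"] = to_forward_slashes(install_dir)
    return env["TESSDATA_PREFIX"]


if __name__ == "__main__":
    print(set_tessdata_prefix(r"D:\Program Files (x86)\Tesseract-OCR"))
```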

 

The second pitfall appears when using pytesseract, which raises the following error:

 

Traceback (most recent call last):
  File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 170, in run_tesseract
    proc = subprocess.Popen(cmd_args, **subprocess_args())
  File "D:\Python36\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "D:\Python36\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified.

 

During handling of the above exception, another exception occurred:

 

Traceback (most recent call last):
  File "D:/python/20180911.py", line 4, in <module>
    text = pytesseract.image_to_string(Image.open(r'D:\chromeDownload\image.png'))
  File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 294, in image_to_string
    return run_and_get_output(*args)
  File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 202, in run_and_get_output
    run_tesseract(**kwargs)
  File "D:\Python36\lib\site-packages\pytesseract\pytesseract.py", line 172, in run_tesseract
    raise TesseractNotFoundError()
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

 

This one is quite a trap: the global environment variable has already been added, yet the error still says that tesseract is not installed or not in the PATH.

After a search on Baidu, the solution is as follows.

After pytesseract is installed, a pytesseract folder is created under site-packages in Python's Lib directory (path: D:\Python36\Lib\site-packages\pytesseract). Find pytesseract.py in that folder, open it with an editor such as Notepad, and locate the following two lines:

 

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'

 

Change tesseract_cmd = 'tesseract' to: tesseract_cmd = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

tesseract_cmd holds the absolute path of your tesseract installation, so that the executable can be found. After saving the change, run the Python code again and it should succeed.
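
Editing a file inside site-packages is lost whenever pytesseract is reinstalled or upgraded. As an alternative, the same tesseract_cmd variable can be assigned from your own script at runtime. The sketch below assumes the install location from this article; the CANDIDATES list and both function names are illustrative and should be adjusted to your machine.

```python
import os
import shutil

# Hypothetical install locations; adjust to your machine.
CANDIDATES = [
    r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def resolve_tesseract(candidates=CANDIDATES):
    """Return a tesseract path: PATH lookup first, then the candidate files."""
    found = shutil.which("tesseract")
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(candidates=CANDIDATES):
    """Point pytesseract at the resolved binary without editing pytesseract.py."""
    import pytesseract  # imported lazily so resolve_tesseract works without it

    path = resolve_tesseract(candidates)
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once, before image_to_string(), has the same effect as the manual edit but keeps the change in your own code.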


Origin www.cnblogs.com/xdlzs/p/10954520.html