pytesseract: a Python package that does OCR text recognition in a few lines of code!


OCR text recognition is now quite mature, and both its accuracy and its speed are good enough for everyday needs. Today I will introduce a Python package whose main job is OCR: pytesseract. With a few lines of code using this package, you can quickly recognize the text in an image.

pytesseract.jpg

pytesseract is a Python wrapper around the open source tool Tesseract, which was developed at Hewlett-Packard Labs and released as open source in 2005; since 2006, it has been jointly developed and maintained by Google and some outstanding open source contributors.

Tesseract gradually matured from the 3.x versions onward, supporting multiple image formats and steadily adding multilingual text recognition. The 3.x versions, however, are still based on traditional computer vision algorithms. Deep learning has iterated rapidly over the past few years, and its accuracy and speed are now better than those of the traditional algorithms, so from version 4.0 on Tesseract includes a deep learning module whose recognition is based on LSTM. LSTM is a kind of RNN (recurrent neural network).

The experiments in this article use Tesseract 3.05, and the accuracy for Chinese turned out to be somewhat low, possibly because I was not using 4.0+. I only learned later that 4.0+ and even 5.0+ (not too stable yet) exist and are built on the deep learning module, but I was too lazy to switch.

First, the experimental environment:

  • os: Win10;

  • Python 3.8;

  • pytesseract 0.3.8;

  • Tesseract 3.05;

pytesseract installation

1. Install the tesseract tool

Compared with other packages, installing pytesseract is a bit more cumbersome, because its recognition is actually performed by the open source tool Tesseract. So the first step is to install Tesseract itself; the installers can be downloaded from:

https://digi.bib.uni-mannheim.de/tesseract/

Snipaste_2020-09-12_14-30-55.jpg

Versions 3.0+, 4.0+ and 5.0+ are available. Run the installer after downloading (it is a simple click-through installation).

After Tesseract is installed, add the folder that contains tesseract.exe to the PATH environment variable. As shown in the figure below, my tesseract.exe is in F:/Program Files/Tesseract-OCR, so that is the folder I added to the environment variable;
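To confirm that the environment variable took effect, a quick check from Python can help. The sketch below is only an illustration built on the standard library; the helper names tesseract_on_path and path_contains are my own and are not part of pytesseract or Tesseract.

```python
import os
import shutil


def tesseract_on_path(executable="tesseract"):
    """Return the full path of the executable if it can be found on PATH, else None."""
    return shutil.which(executable)


def path_contains(directory, path_value=None):
    """Check whether a directory is listed in a PATH-style string.

    If path_value is omitted, the current PATH environment variable is used.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [
        os.path.normcase(os.path.normpath(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return wanted in entries


# Prints the resolved location of tesseract, or None if it is not on PATH.
print(tesseract_on_path())
```

If this prints None in a freshly opened terminal, the folder was probably not added to the environment variable correctly, or the terminal was opened before the change was saved.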

Snipaste_2020-09-12_14-35-04.jpg

2. pip install pytesseract

On the command line, use pip to install the pytesseract package

pip install pytesseract

3. Modify the pytesseract.py script

After step 2, find where pytesseract was installed. If Python was installed through Anaconda, this is generally under the Anaconda/Lib/site-packages folder; once there, find the pytesseract.py script file inside the pytesseract folder.

Snipaste_2020-09-12_15-54-20.jpg

Open pytesseract.py with Notepad, use Ctrl+F to search for tesseract_cmd, and change the file path assigned to it (replace it with the path of the tesseract.exe installed above);
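As an alternative to editing the installed file, the pytesseract README describes setting pytesseract.pytesseract.tesseract_cmd at runtime. The following is a minimal sketch of that approach, assuming the install folder mentioned above; the helper names resolve_tesseract_cmd and configure_pytesseract are hypothetical and exist only for this illustration.

```python
import shutil
from pathlib import Path


def resolve_tesseract_cmd(candidates):
    """Return the first candidate path that exists as a file.

    Falls back to whatever "tesseract" resolves to on PATH (may be None).
    """
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return shutil.which("tesseract")


def configure_pytesseract(cmd):
    """Point pytesseract at the given executable for the current process only."""
    import pytesseract  # imported lazily so the helper above works without it

    pytesseract.pytesseract.tesseract_cmd = cmd


# Example usage (the path is the install folder used in this article):
# cmd = resolve_tesseract_cmd(["F:/Program Files/Tesseract-OCR/tesseract.exe"])
# if cmd is not None:
#     configure_pytesseract(cmd)
```

The setting applies only to the running script, so it does not have to be redone by hand after pytesseract is upgraded or reinstalled.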

Snipaste_2020-09-12_15-54-30.jpg

Use of pytesseract

Using the package is relatively simple and takes only a few lines of code. The following code recognizes the text in an image and prints it as a string. The recognition language here is English (change the lang='eng' parameter to select another language).

import pytesseract
import cv2

img_path = "G:/Coding/One_hundred_days/Data/orc_image2.jpg"

# The following line is very important: it tells Tesseract where the language data lives
tessdata_dir_config = '--tessdata-dir "F://Program Files//Tesseract-OCR//tessdata"'

# OpenCV reads images in BGR order, so convert to RGB before recognition
im = cv2.imread(img_path)
img = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)

text = pytesseract.image_to_string(img, lang='eng', config=tessdata_dir_config)
print(text)

Effect preview, before recognition

orc_image1.jpg

After recognition

Snipaste_2020-09-12_12-54-01.png

pytesseract accepts images read by either OpenCV or PIL as input, but the image must be in RGB mode. OpenCV reads images as BGR, which is why the code adds a line after reading to convert the image from BGR to RGB.

Another thing to note: the following line in the example above cannot be removed (it supplies the config parameter passed to image_to_string() later)

tessdata_dir_config = '--tessdata-dir "F://Program Files//Tesseract-OCR//tessdata"'

Otherwise, the tessdata folder cannot be located and the following error is reported:

Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.

The tessdata folder stores the language pack files used to recognize different languages in an image; which one is used is selected with the lang parameter. Note that Tesseract initially supports only eng (English) by default. To recognize other languages, you need to download the corresponding language pack files and put them in the tessdata folder.

For example, the case above used English. To recognize the Chinese characters in an image instead, I need to download the Chinese language pack into tessdata. The language packs can be downloaded from https://github.com/tesseract-ocr/tessdata

Then set the lang parameter in image_to_string() in the code to chi_sim
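Since a missing language pack is the usual cause of the "Failed loading language" error, it can be worth checking the tessdata folder before calling image_to_string(). Below is a minimal sketch under the assumption that language packs are stored as <lang>.traineddata files, as they are in the tessdata repository; build_tessdata_config is a hypothetical helper name, not a pytesseract function.

```python
from pathlib import Path


def build_tessdata_config(tessdata_dir, lang):
    """Build the --tessdata-dir config string after checking the language files.

    lang may combine several languages with "+", e.g. "eng+chi_sim".
    Raises FileNotFoundError naming any language pack that is missing.
    """
    tessdata = Path(tessdata_dir)
    missing = [
        code
        for code in lang.split("+")
        if not (tessdata / f"{code}.traineddata").is_file()
    ]
    if missing:
        raise FileNotFoundError(
            f"Missing language pack(s) in {tessdata}: {', '.join(missing)}"
        )
    # Quote the path because it may contain spaces (e.g. "Program Files")
    return f'--tessdata-dir "{tessdata.as_posix()}"'


# Example usage with the folder from this article:
# config = build_tessdata_config("F:/Program Files/Tesseract-OCR/tessdata", "chi_sim")
# text = pytesseract.image_to_string(img, lang="chi_sim", config=config)
```

This way a forgotten download shows up as a clear Python exception that names the missing language, rather than as a Tesseract initialization failure.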

Effect preview, before recognition

orc_image2.jpg

After recognition. The result for Chinese is not very good; my guess is that the Tesseract version is the cause:

Snipaste_2020-09-12_15-57-06.jpg

Other pytesseract usage

1. Besides converting the recognized content of an image directly into a string, as above, pytesseract can also export it directly as a PDF file

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
with open('test.pdf', 'w+b') as f:
    f.write(pdf) # pdf type is bytes by default

Snipaste_2020-09-12_15-35-49.jpg

2. Get the bounding box of each recognized character, that is, its coordinate range within the image:

print(pytesseract.image_to_boxes(img_path,lang = 'chi_sim',config= tessdata_dir_config))

Snipaste_2020-09-12_15-38-26.jpg
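image_to_boxes() returns plain text in Tesseract's box format: one line per character, with the character followed by its left, bottom, right and top coordinates and a page number, measured from the bottom-left corner of the image. The sketch below parses that text and flips the y axis so the boxes match OpenCV's top-left origin. CharBox, parse_boxes and the sample string are made up for illustration and are not real OCR output.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int
    right: int
    bottom: int


def parse_boxes(boxes_text: str, image_height: int) -> List[CharBox]:
    """Parse box-format text into boxes with a top-left origin."""
    boxes = []
    for line in boxes_text.splitlines():
        # Split from the right so the character itself is kept intact
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(
            CharBox(
                char=char,
                left=int(left),
                top=image_height - int(top),
                right=int(right),
                bottom=image_height - int(bottom),
            )
        )
    return boxes


# Hypothetical sample for an image that is 100 pixels high
sample = "H 10 20 30 50 0\ni 32 20 40 48 0"
for box in parse_boxes(sample, image_height=100):
    print(box)
```

With real output, image_height would be img.shape[0] for an image loaded by OpenCV, and each box could then be drawn with cv2.rectangle using (left, top) and (right, bottom).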

3. pytesseract has many other uses that are not covered here. Interested readers can find them on the project page:

https://pypi.org/project/pytesseract/


Origin blog.csdn.net/weixin_42512684/article/details/108702142