[AI test] python text image recognition tesseract
Official GitHub repository: https://github.com/tesseract-ocr/tesseract
Python wrapper: https://github.com/madmaze/pytesseract
OCR (Optical Character Recognition) is the process of scanning characters and translating them into electronic text based on their shapes. Graphic verification codes (CAPTCHAs) are a typical hard case: their characters are deliberately irregular, produced by slightly distorting normal characters.
Tesseract OCR is an open source OCR engine that can recognize more than 100 languages. It is built specifically to extract text from images. Its main weakness is that its handwriting recognition is relatively poor.
Tesseract supports a variety of image formats, including PNG, JPEG, and TIFF.
List of recognized languages: Languages/Scripts supported in different versions of Tesseract | tessdoc (tesseract-ocr.github.io)
(I’m fascinated by so many forks)
Download and install
The first step is to install the Tesseract OCR engine.
The second step requires installing the pytesseract library that supports python and its related dependencies.
Tesseract OCR engine download
Install the Tesseract OCR engine: pytesseract depends on the Tesseract OCR engine.
Official documentation: Introduction | tessdoc (tesseract-ocr.github.io)
According to the official introduction we need to know:
- There are two parts that need to be installed: the engine itself and the trained data for each language.
- The language data packages are called "tesseract-ocr-langcode" and "tesseract-ocr-script-scriptcode", where langcode is the three-letter language code and scriptcode is the four-letter script code.
- For example: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin script), tesseract-ocr-script-deva (Devanagari script), etc.
- Data set download address: Traineddata Files for Version 4.00 + | tessdoc (tesseract-ocr.github.io)
Install tesseract on Mac
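1. There are four ways to install:
"Install tesseract together with the training tools"
brew install --with-training-tools tesseract
"Install tesseract together with all languages"
brew install --all-languages tesseract
"Install tesseract with the add-ons (all languages and the training tools)"
brew install --all-languages --with-training-tools tesseract
"Install tesseract without the training tools; this is usually enough"
brew install tesseract
Note: the --with-training-tools and --all-languages options come from older Homebrew releases. If your Homebrew rejects them, use plain brew install tesseract, and brew install tesseract-lang for the additional languages.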
2. After installing tesseract, test:
tesseract -v
3. Install the language data (the command below is for MacPorts, not Homebrew)
sudo port install tesseract-<langcode>
Supported languages: https://ports.macports.org/search/?q=tesseract-&name=on
Install tesseract on Windows
1. Download the tesseract installation package
- Download address of the tesseract installation package: https://digi.bib.uni-mannheim.de/tesseract/
- Pay attention to the difference between the 32-bit and 64-bit installers.
- I downloaded the latest version available at the time. For a 64-bit system you can download tesseract-ocr-w64-setup-5.3.1.20230401.exe directly.
- If your connection is slow, you can download it from my network disk.
- Link: https://pan.baidu.com/s/1B5CyYZ5D5qwCXzZ9dnSGpQ?pwd=mwj6 Extraction code: mwj6
2. Install
- (1) Double-click the downloaded exe. Running it as administrator (right-click) is recommended.
- (2) Click Next.
- (3) Click I Agree.
- (4) Choose according to your needs: the first option installs for all users of this computer, the second only for the current user.
- (5) This screen configures the language data download. Expand the two options that begin with Additional to see the languages available. If you only want Chinese, find Chinese and tick it. After selecting, click Next.
- (6) Select the installation path. Note that if you do not use the default path, later code will raise FileNotFoundError: [WinError 2] The system cannot find the file specified. The solution is to give pytesseract the absolute path of tesseract.exe. Here I install to the default path.
- (7) Click Install.
- (8) After the installation completes, click Next, then click Finish.
If the language data download fails during installation, you can download the corresponding language data yourself from the following official link; it is tens of megabytes in size.
https://github.com/tesseract-ocr/tessdata_best
- For slow connections, use this link: https://pan.baidu.com/s/11k5od_fd3_THN2YiGgmH3w?pwd=mwj6 Extraction code: mwj6
3. Configure environment variables
- If you installed to the default location, just add C:\Program Files\Tesseract-OCR to the Path environment variable.
- My Computer (This PC) -> right-click Properties -> Advanced system settings -> Environment Variables -> under System variables, select Path and click Edit -> New -> enter your installation directory
- # For the default installation directory, enter the following: C:\Program Files\Tesseract-OCR
4. Verify that the installation succeeded
- Press Win+R, type cmd and press Enter
- Enter tesseract -v. If version information is displayed, the installation succeeded. If you are told that tesseract is not recognized as an internal or external command, the environment variable was not set correctly; configure it again.
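The same check can be scripted. The sketch below uses only the Python standard library: it looks for the tesseract executable on PATH, runs it with --version, and returns the first line of whatever it prints, or None when the executable cannot be found. The function name tesseract_version is my own choice for illustration, not part of any library.

```python
import shutil
import subprocess


def tesseract_version(cmd="tesseract"):
    """Return the first line printed by `<cmd> --version`, or None if cmd is not on PATH.

    Hypothetical helper name, used here for illustration only.
    """
    exe = shutil.which(cmd)  # resolves the command on PATH, like the shell does
    if exe is None:
        return None
    proc = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Read both streams so the check does not depend on which one is used.
    output = (proc.stdout or "") + (proc.stderr or "")
    lines = output.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract was not found on PATH; check the environment variable")
    else:
        print(version)
```

A None result corresponds to the "not recognized as an internal or external command" case above.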
Install pytesseract
pip install pytesseract
Installation of other related dependencies
pip install opencv-python
pip install pillow
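If Tesseract was installed outside the default path, or the Path variable is not set, pytesseract cannot find the executable and fails with the file-not-found error mentioned in the installation steps. pytesseract lets you point it at the executable through pytesseract.pytesseract.tesseract_cmd. The sketch below picks the first existing executable from a list of candidates. The helper names and the D:\Tools path are illustrative assumptions; replace them with your own.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates):
    """Return the first candidate that is an existing file, else the one on PATH, else None.

    Hypothetical helper for illustration; not part of pytesseract.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("tesseract")


def configure_pytesseract(candidates):
    """Point pytesseract at the resolved executable and return the path used."""
    import pytesseract  # imported lazily so resolve_tesseract_cmd works without it

    cmd = resolve_tesseract_cmd(candidates)
    if cmd is None:
        raise FileNotFoundError("tesseract executable not found")
    # tesseract_cmd is the pytesseract setting that holds the executable path.
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd


# Example candidates: the default install path, then a made-up custom one.
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"D:\Tools\Tesseract-OCR\tesseract.exe",  # assumption: replace with your own path
]
```

Call configure_pytesseract(CANDIDATES) once, before the first image_to_string call.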
Code demo
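from PIL import Image
import pytesseract
# Raw string, so the backslash in the Windows path is not treated as an escape
im = Image.open(r'imgs\csdn_homepage.png')
# Recognize the text, specifying the language
string = pytesseract.image_to_string(im, lang='chi_sim')
print(string)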
The image being recognized is shown below:
The output is as follows:
Seeing the recognized content, I was speechless and even wanted to punch the computer! I've written so much content, and that's all you recognized???
Adjusting the approach (did not work)
After consulting relevant information, I found that the Chinese language data downloaded by the installer is relatively small and its accuracy is not high.
I learned from the official website that the language data under tessdata_best has the highest recognition accuracy, so I downloaded it directly.
It was mentioned earlier: https://github.com/tesseract-ocr/tessdata_best, and the network disk link is given above as well.
Unzip the downloaded package and copy its contents to the C:\Program Files\Tesseract-OCR\tessdata directory (first delete everything already in that directory).
Then run the code again.
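Before rerunning, it may be worth confirming that the copy actually worked, since Tesseract cannot load a language whose .traineddata file is missing. The sketch below lists the *.traineddata files in a tessdata folder and builds a lang string from the languages that are present. The helper names are my own. Joining codes with + (for example chi_sim+eng) is Tesseract's syntax for recognizing several languages together, which can help on pages that mix Chinese and English.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def pick_lang(wanted, installed):
    """Join the wanted codes that are installed with '+', keeping the wanted order."""
    available = [code for code in wanted if code in installed]
    if not available:
        raise ValueError(f"none of {wanted} found; installed: {installed}")
    return "+".join(available)


if __name__ == "__main__":
    # Default Windows location used in this article; adjust for your system.
    langs = installed_languages(r"C:\Program Files\Tesseract-OCR\tessdata")
    print(langs)
    if langs:
        # Ask for Chinese and English, use whichever of them is present.
        print(pick_lang(["chi_sim", "eng"], langs))
```

The resulting string can be passed as the lang argument of pytesseract.image_to_string.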
(A dozen curse words go here…)
Calm down. It's because I'm not capable enough, because I don't know how to train the model, because I shouldn't have just picked it up and used it as-is.
A few minutes later, more swear words…
Model training
You can search for information online; I have included one article in the references.
Search keywords for model training: tesseract-ocr training method
I won't go further down this path. This is what happens when you don't research the options against your needs first: you rush at the first tool you see and fail.
Change of plan
Everyone, remember: when studying something new, do the research first and only then dive in.
A simple GitHub search:
After some investigation, I found the following:
Tesseract OCR
- Advantages: Supports supplementary training
- Disadvantages: Chinese recognition is extremely poor! Extremely poor! (angry roar)
EasyOCR
- Advantages: OCR recognition is okay, better than general open source models
- Disadvantages: Recognition speed is very slow and training is not supported
Paddle OCR
- Advantages: supports supplementary training, recognition quality is good, execution is fast, the documentation is complete, and there is plenty of material available
- Disadvantages: Occasionally some content may be lost
CnOCR
- Advantages: Supports training your own model, fast execution, and good recognition results
- Disadvantages: training is more troublesome than PaddleOCR, and rarely updated and maintained
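The judgments above are impressions. To compare engines on your own images, one rough option is to type out the expected text of a test image by hand and score each engine's output against it. The sketch below uses difflib.SequenceMatcher from the Python standard library after removing whitespace. The function name, the sample strings, and the choice of this metric are my assumptions, not something any of the tools above provide.

```python
import difflib


def ocr_similarity(expected, recognized):
    """Return a 0..1 similarity ratio between the expected text and the OCR output.

    Whitespace is removed first, because engines differ in where they
    insert spaces and line breaks, especially for Chinese text.
    """
    a = "".join(expected.split())
    b = "".join(recognized.split())
    if not a and not b:
        return 1.0
    # autojunk=False keeps long texts from having frequent characters ignored.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    truth = "梦无矶的测试开发之路"
    # Hypothetical outputs, made up for illustration only.
    outputs = {
        "engine_a": "梦无矶的测试开发之路",
        "engine_b": "梦 无 的 测试 之路",
    }
    for name, text in outputs.items():
        print(name, round(ocr_similarity(truth, text), 2))
```

A score of 1.0 means the output matches the expected text apart from whitespace; lower values indicate missing or wrong characters.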
Existing code
Although the attempt failed, I'm still publishing the code for anyone who needs it.
Get text only (official code)
import cv2
import pytesseract
from PIL import Image
im = r'imgs\csdn_homepage.png'  # raw string keeps the backslash literal
img_cv = cv2.imread(im)
# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
# we need to convert from BGR to RGB format/mode:
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb, lang='chi_sim'))
# OR
# PIL expects the size as (width, height), while img_cv.shape is (height, width, channels)
img_rgb = Image.frombytes('RGB', img_cv.shape[1::-1], img_cv.tobytes(), 'raw', 'BGR', 0, 0)
print(pytesseract.image_to_string(img_rgb, lang='chi_sim'))
(The output is quite poor, and a lot of text is lost)
Recognize text and return the corresponding coordinates
# -*- coding: utf-8 -*-
'''
@Time : 2023/8/18 13:01
@Email : [email protected]
@WeChat official account : 梦无矶的测试开发之路
@File : python文字识别.py
'''
__author__ = "梦无矶小仔"
import cv2
import pytesseract
# Set the language data directory
# The following line is important
tessdata_dir_config = r'--tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"'
# 1. Load and preprocess the image
# Replace with your own image path; note that the file name must not contain Chinese characters
image = cv2.imread(r'imgs\csdn_homepage.png')
# Depending on the complexity of the image, extra preprocessing such as thresholding,
# denoising or edge detection can be applied here to improve accuracy.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # convert the image to grayscale
# 2. Run text recognition and coordinate extraction (use 'eng' for English)
results = pytesseract.image_to_data(gray, lang='chi_sim', config=tessdata_dir_config, output_type=pytesseract.Output.DICT)
text_coords = []
for i, text in enumerate(results['text']):
    if text.strip():
        x = results['left'][i]
        y = results['top'][i]
        width = results['width'][i]
        height = results['height'][i]
        text_coords.append({
            'text': text, 'x': x, 'y': y, 'width': width, 'height': height})
# Print the results
for coord in text_coords:
    print(coord['text'], '-> coordinates: [', coord['x'], ",", coord['y'], "], ", "width/height: [", coord['width'], coord['height'], "]")
Output style:
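The dictionary returned by image_to_data also carries a conf column (the recognition confidence, with -1 on rows that are not words) and block_num, par_num and line_num columns. A possible follow-up, sketched below, drops low-confidence words and regroups the rest into lines. The threshold of 60 and the helper name are arbitrary assumptions. conf may arrive as a string or a number depending on the pytesseract version, so it is converted with float().

```python
from collections import OrderedDict


def lines_from_data(results, min_conf=60.0):
    """Group words from pytesseract's image_to_data DICT output into text lines.

    Empty entries and words whose confidence is below min_conf are dropped.
    Hypothetical helper for illustration; not part of pytesseract.
    """
    lines = OrderedDict()
    for i, text in enumerate(results["text"]):
        if not text.strip():
            continue
        if float(results["conf"][i]) < min_conf:
            continue
        # Words sharing block, paragraph and line numbers belong to one line.
        key = (results["block_num"][i], results["par_num"][i], results["line_num"][i])
        lines.setdefault(key, []).append(text)
    return [" ".join(words) for words in lines.values()]


if __name__ == "__main__":
    # Hand-written sample in the same shape as the DICT output (not real OCR data).
    sample = {
        "text": ["", "Hello", "world", "n0ise", "Next"],
        "conf": ["-1", "96", "91", "12", "88"],
        "block_num": [1, 1, 1, 1, 1],
        "par_num": [1, 1, 1, 1, 1],
        "line_num": [0, 1, 1, 1, 2],
    }
    print(lines_from_data(sample))
```

For Chinese text you may prefer to join the words with an empty string instead of a space.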
Related references
# Official documentation
https://tesseract-ocr.github.io/tessdoc/
# Mentions recognition of artistic fonts
https://www.jianshu.com/p/3326c7216696
# A simple installation tutorial
https://zhuanlan.zhihu.com/p/186225362
# A more detailed installation tutorial and basic pytesseract usage
https://zhuanlan.zhihu.com/p/341306710
# Installing pytesseract on Mac
https://blog.csdn.net/wodedipang_/article/details/84585914
# Model training
https://www.cnblogs.com/cnlian/p/5765871.html
# OCR survey report
https://blog.csdn.net/weixin_41021342/article/details/127203654
Next update: PaddleOCR. Wish me success!