[AI test] python text image recognition tesseract

Official GitHub repository: https://github.com/tesseract-ocr/tesseract

Python wrapper (pytesseract): https://github.com/madmaze/pytesseract

OCR (Optical Character Recognition) is the process of scanning characters and converting them into electronic text based on their shapes. Graphic verification codes (CAPTCHAs) use irregular characters, which are really just ordinary characters that have been slightly distorted.

Tesseract OCR is an open-source OCR engine that can recognize more than 100 languages. It is designed for extracting text from images. Its main weakness is that it recognizes handwriting relatively poorly.

Tesseract supports a variety of image formats, including PNG, JPEG, and TIFF.

List of recognized languages: Languages/Scripts supported in different versions of Tesseract | tessdoc (tesseract-ocr.github.io)

(I’m fascinated by so many forks)

Download and install

The first step is to install the Tesseract OCR engine.

The second step is to install pytesseract, the Python wrapper library, together with its dependencies.

Tesseract OCR engine download

Install the Tesseract OCR engine: pytesseract depends on the Tesseract OCR engine.

Official documentation: Introduction | tessdoc (tesseract-ocr.github.io)

According to the official introduction we need to know:

  • There are two parts that need to be installed, the engine itself and the training data for the language.
  • The language data packages are named "tesseract-ocr-langcode" and "tesseract-ocr-script-scriptcode", where langcode is the three-letter language code and scriptcode is the four-letter script code.
  • For example: tesseract-ocr-eng (English), tesseract-ocr-ara (Arabic), tesseract-ocr-chi-sim (Simplified Chinese), tesseract-ocr-script-latn (Latin script), tesseract-ocr-script-deva (Devanagari script), etc.
  • Data set download address: Traineddata Files for Version 4.00 + | tessdoc (tesseract-ocr.github.io)

Install tesseract on Mac

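1. There are four ways to install:

"Install tesseract together with the training tools"
brew install --with-training-tools tesseract

"Install tesseract together with all languages"
brew install --all-languages tesseract

"Install tesseract with all languages and the training tools"
brew install --all-languages --with-training-tools tesseract

"Install tesseract without the training tools; this is enough in most cases"
brew install tesseract

Note: current Homebrew releases no longer accept formula options such as --with-training-tools and --all-languages. If brew rejects them, install the engine with brew install tesseract and add the language data with brew install tesseract-lang.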

2. After installing tesseract, test:

tesseract -v

3. Install the language data set (note that this command uses MacPorts, not Homebrew)

sudo port install tesseract-<langcode>

Supported languages: https://ports.macports.org/search/?q=tesseract-&name=on

Install tesseract on Windows

1. Download the tesseract installation package

  • Download page for the tesseract installer: https://digi.bib.uni-mannheim.de/tesseract/

  • Make sure to pick the right build for your system, 32-bit or 64-bit.

  • I downloaded the latest version available at the time, the 64-bit installer tesseract-ocr-w64-setup-5.3.1.20230401.exe.

  • If your connection is slow, you can download it from my network disk.

    • Link: https://pan.baidu.com/s/1B5CyYZ5D5qwCXzZ9dnSGpQ?pwd=mwj6
      Extraction code: mwj6
      

2. Install

  • (1) Double-click the downloaded exe. Running it as administrator (right-click -> Run as administrator) is recommended.
  • (2) Click Next.
  • (3) Click I Agree.
  • (4) Choose who to install for: the first option installs for all users of this computer, the second only for the current user.
  • (5) This screen configures which language packs are downloaded. Expand the two options that begin with "Additional" to see the available languages. If you only need Chinese, find the Chinese entries and tick them. After selecting, click Next.
  • (6) Select the installation path. Note that if you do not use the default path, later code may fail with FileNotFoundError: [WinError 2] The system cannot find the file specified. The fix is to give pytesseract the absolute path to tesseract.exe. I installed to the default path.
  • (7) Click Install.
  • (8) When the installation finishes, click Next and then Finish.
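The absolute-path fix mentioned in step (6) can be sketched roughly as follows. This is a minimal sketch rather than the only way to do it: first_existing and configure_pytesseract are hypothetical helper names, and the candidate path is simply the default install location assumed in this article. The one pytesseract setting it relies on is pytesseract.pytesseract.tesseract_cmd, which tells the wrapper which executable to launch.

```python
from pathlib import Path

# Assumption: the default install location used in this article.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def first_existing(candidates):
    """Return the first candidate path that exists as a file, else None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract(candidates=(DEFAULT_WINDOWS_PATH,)):
    """Point pytesseract at an explicit tesseract executable, if one is found."""
    # Imported here so that first_existing() can be used without pytesseract.
    import pytesseract

    path = first_existing(candidates)
    if path is not None:
        pytesseract.pytesseract.tesseract_cmd = path
    return path


# Usage (requires pytesseract to be installed):
#   configure_pytesseract([r"D:\Tools\Tesseract-OCR\tesseract.exe"])
# The D:\ path above is a made-up example of a non-default install location.
```

Call configure_pytesseract() once before the first OCR call. If it returns None, none of the candidate paths exist and pytesseract will keep looking for tesseract on PATH.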

If the language download fails during installation, you can download the language data you need yourself from the official repository below. Each file is tens of megabytes in size.

https://github.com/tesseract-ocr/tessdata_best
  • If your connection is slow, use this instead:
    Link: https://pan.baidu.com/s/11k5od_fd3_THN2YiGgmH3w?pwd=mwj6
    Extraction code: mwj6
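After downloading language data by hand, it can be worth checking that the files actually ended up where Tesseract looks for them. The following is a minimal sketch, assuming each language is a single <code>.traineddata file placed directly in the tessdata directory (for the default Windows install that would be C:\Program Files\Tesseract-OCR\tessdata). installed_languages and missing_languages are hypothetical helper names.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(path.stem for path in directory.glob("*.traineddata"))


def missing_languages(tessdata_dir, wanted):
    """Return the codes from `wanted` that have no .traineddata file yet."""
    available = set(installed_languages(tessdata_dir))
    return [code for code in wanted if code not in available]


# Assumption: the default Windows install location from this article.
TESSDATA_DIR = r"C:\Program Files\Tesseract-OCR\tessdata"
print("Missing:", missing_languages(TESSDATA_DIR, ["chi_sim", "eng"]))
```

If the directory does not exist at all, every requested language is reported as missing, which usually means the path itself is wrong.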
    

3. Configure environment variables

  • If you installed to the default location, just add C:\Program Files\Tesseract-OCR to the Path environment variable.

  • This PC (My Computer) -> right-click Properties -> Advanced system settings -> Environment Variables -> under System variables select Path and click Edit -> New -> enter your installation directory

  • # For the default installation directory, enter the following
    C:\Program Files\Tesseract-OCR
    

4. Verify whether the installation is successful

  • Press Win+R, type cmd and press Enter.
  • Type tesseract -v. If version information is printed, the installation succeeded. If you are told that tesseract is not recognized as an internal or external command, the environment variable is not set correctly; configure it again.
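The same check can be run from Python, which is handy because pytesseract relies on the same PATH lookup by default. This is a small sketch using only the standard library; tesseract_version is a hypothetical helper name.

```python
import shutil
import subprocess


def tesseract_version(command="tesseract"):
    """Return the first line printed by `<command> --version`,
    or None if the command cannot be found on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


version = tesseract_version()
if version is None:
    print("tesseract was not found on PATH; check the environment variable")
else:
    print("Found:", version)
```

If this prints the "not found" message in a terminal that was already open before you edited Path, open a new terminal first, since environment variable changes only apply to newly started processes.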

Install pytesseract

pip install pytesseract

Installation of other related dependencies

pip install opencv-python
pip install pillow

Code demo

from PIL import Image
import pytesseract

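# Raw string so the backslash in the Windows path is not treated as an escape
im = Image.open(r'imgs\csdn_homepage.png')

# Recognize the text, specifying the language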
string = pytesseract.image_to_string(im, lang='chi_sim')
print(string)
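The lang argument also accepts several language codes joined with +, for example chi_sim+eng, which may be worth trying on pages that mix Chinese and English. Whether it helps on this particular screenshot is untested here. The sketch below only builds that string; build_lang is a hypothetical helper name.

```python
def build_lang(*codes):
    """Join language codes into the 'a+b' form accepted by the lang argument,
    keeping the given order and dropping duplicates and empty entries."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


lang = build_lang("chi_sim", "eng")
print(lang)

# With pytesseract and an opened image `im`, the call would then be:
#   pytesseract.image_to_string(im, lang=lang)
```

Every language named in the string needs its own traineddata file installed, otherwise Tesseract reports an error when loading it.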

The image being recognized is the screenshot imgs\csdn_homepage.png.

The recognition result contained only a small part of the text in it.

Seeing the recognized content, I was speechless and even wanted to punch the computer! I have written so much content, and that is all you recognized???

A change of approach (did not help)

After consulting relevant information, I found that the Chinese language pack downloaded by the installer is relatively small and not very accurate.

I learned from the official website that the language data under tessdata_best has the highest recognition accuracy, so I downloaded it directly.

It was mentioned earlier in this article: https://github.com/tesseract-ocr/tessdata_best, and the network disk link is given there as well.

Unzip the downloaded package and copy its contents into the C:\Program Files\Tesseract-OCR\tessdata directory (delete everything in that directory first).
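An alternative that avoids deleting the installed data is to keep the tessdata_best files in their own folder and point Tesseract at it with the --tessdata-dir option through pytesseract's config argument. This is a minimal sketch under that assumption: tessdata_config is a hypothetical helper name and C:\tessdata_best is a made-up folder used only for illustration.

```python
def tessdata_config(tessdata_dir):
    """Build the pytesseract `config` string that selects a tessdata directory."""
    # The path is wrapped in double quotes so that spaces in it survive.
    return f'--tessdata-dir "{tessdata_dir}"'


# Hypothetical folder holding the unzipped tessdata_best files.
BEST_DIR = r"C:\tessdata_best"
config = tessdata_config(BEST_DIR)
print(config)

# With pytesseract and an opened image `im`, the call would then be:
#   pytesseract.image_to_string(im, lang='chi_sim', config=config)
```

This makes it easy to switch between the default data and tessdata_best by changing one path, without touching the installation directory.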

Then run the code.

The result was no better.

A dozen curse words omitted here…

Calm down. It is because I am not capable enough, because I do not know how to train the model, because I should not expect to just pick it up and use it.

A few more minutes of swearing omitted…

Model training

You can search for information online; I have included an article on this in the references.

Search keywords for model training: tesseract-ocr training method

I am not going to spend more time on this. This is what happens when you do not research the options against your requirements first: you grab the first tool you see, rush in, and fail.

Change plan

Remember: when picking up something new, do the research first and only then dive in.

After a simple GitHub search and some further investigation, I found the following:

Tesseract OCR

  • Advantages: Supports supplementary training
  • Disadvantages: Chinese recognition is terrible! Terrible! (angry roar)

EasyOCR

  • Advantages: Recognition quality is decent, better than the average open-source model
  • Disadvantages: Recognition is very slow and training is not supported

Paddle OCR

  • Advantages: Supports supplementary training, good recognition results, fast execution, thorough documentation, and plenty of material available
  • Disadvantages: Occasionally some content is missed

CnOCR

  • Advantages: Supports training your own model, fast execution, and good recognition results
  • Disadvantages: Training is more troublesome than with PaddleOCR, and the project is rarely updated or maintained

Existing code

Although the attempt failed, I am still sharing the code for anyone who needs it.

Text only (official code)

import cv2
import pytesseract
from PIL import Image

# Raw string so the backslash in the Windows path is not treated as an escape
im = r'imgs\csdn_homepage.png'

img_cv = cv2.imread(im)
# By default OpenCV stores images in BGR format and since pytesseract assumes RGB format,
# we need to convert from BGR to RGB format/mode:
img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)
print(pytesseract.image_to_string(img_rgb, lang='chi_sim'))
# OR
# PIL expects the size as (width, height), while img_cv.shape is (height, width, channels),
# so the first two entries are reversed here.
img_rgb = Image.frombytes('RGB', img_cv.shape[1::-1], img_cv, 'raw', 'BGR', 0, 0)
print(pytesseract.image_to_string(img_rgb, lang='chi_sim'))

(The output is very poor, and a lot of text is lost)

Recognize text and return corresponding coordinates

# -*- coding: utf-8 -*-
'''
@Time : 2023/8/18 13:01
@Email : [email protected]
@WeChat official account : 梦无矶的测试开发之路
@File : python文字识别.py
'''
__author__ = "梦无矶小仔"

import cv2
import pytesseract

# Location of the language data.
# The next line is important; the raw string keeps the backslashes in the Windows path intact.
tessdata_dir_config = r'--tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"'

# 1. Load and preprocess the image
image = cv2.imread(r'imgs\csdn_homepage.png')  # replace with your image path; the file name must not contain Chinese characters
# Depending on how complex the image is, extra preprocessing such as thresholding,
# denoising or edge detection can be applied here to improve accuracy.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # convert the image to grayscale

# 2. Run text recognition and extract coordinates (use lang='eng' for English)
results = pytesseract.image_to_data(gray, lang='chi_sim', config=tessdata_dir_config, output_type=pytesseract.Output.DICT)

text_coords = []

for i, text in enumerate(results['text']):
    if text.strip():
        x = results['left'][i]
        y = results['top'][i]
        width = results['width'][i]
        height = results['height'][i]
        text_coords.append({'text': text, 'x': x, 'y': y, 'width': width, 'height': height})

# Print the results
for coord in text_coords:
    print(coord['text'], '-> coordinates: [', coord['x'], ",", coord['y'], "],  ", "width/height: [", coord['width'], coord['height'], "]")
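The dictionary returned by image_to_data also carries a conf entry and block_num, par_num and line_num entries for every item, so the word-level boxes above can be filtered by confidence and merged into whole lines. The sketch below works on a plain dictionary in that layout, so it can be tried without running OCR; group_words_into_lines, min_conf and sep are hypothetical names, and the threshold value is only an example.

```python
from collections import defaultdict


def group_words_into_lines(results, min_conf=0.0, sep=" "):
    """Group image_to_data words (Output.DICT layout) into text lines.

    Words with a confidence below min_conf are dropped. Entries that are
    not words (blocks, paragraphs, lines) have empty text and are skipped.
    """
    lines = defaultdict(list)
    for i, text in enumerate(results["text"]):
        if not text.strip():
            continue
        # conf may arrive as int, float or string depending on the version.
        if float(results["conf"][i]) < min_conf:
            continue
        key = (
            results["page_num"][i],
            results["block_num"][i],
            results["par_num"][i],
            results["line_num"][i],
        )
        lines[key].append(i)

    grouped = []
    for key in sorted(lines):
        idx = lines[key]
        left = min(results["left"][i] for i in idx)
        top = min(results["top"][i] for i in idx)
        right = max(results["left"][i] + results["width"][i] for i in idx)
        bottom = max(results["top"][i] + results["height"][i] for i in idx)
        grouped.append({
            "text": sep.join(results["text"][i] for i in idx),
            "x": left,
            "y": top,
            "width": right - left,
            "height": bottom - top,
        })
    return grouped


# Usage with the `results` dictionary from the script above:
#   for line in group_words_into_lines(results, min_conf=60, sep=""):
#       print(line)
```

For Chinese text, passing sep="" avoids inserting spaces between the recognized pieces; for English the default single space is usually what you want.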

Output style: each recognized piece of text is printed on its own line, followed by its coordinates and its width and height.

Related references

# Official documentation
https://tesseract-ocr.github.io/tessdoc/
# Mentions recognizing artistic fonts
https://www.jianshu.com/p/3326c7216696
# A simple installation tutorial
https://zhuanlan.zhihu.com/p/186225362
# A more detailed installation tutorial and basic pytesseract usage
https://zhuanlan.zhihu.com/p/341306710
# Installing pytesseract on Mac
https://blog.csdn.net/wodedipang_/article/details/84585914
# Model training
https://www.cnblogs.com/cnlian/p/5765871.html
# OCR survey report
https://blog.csdn.net/weixin_41021342/article/details/127203654

The next post will cover PaddleOCR. Wish me success!

Origin blog.csdn.net/qq_46158060/article/details/132690058