How to limit the characters Tesseract recognizes

This is a usage memo for Tesseract. It mainly covers how to restrict the set of characters to be recognized.

URL:

The tesseract project URL is: http://code.google.com/p/tesseract-ocr/

 

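Command-line usage:

tesseract xxx.jpg result -psm 7 digits

Explanation:

tesseract: the command name

xxx.jpg: the input image file; jpg and png are both accepted

result: the output base name; Tesseract appends .txt, so the recognized text is written to result.txt

-psm 7 digits: the parameters. -psm 7 selects the page segmentation mode (a single line of text), and digits is the name of a config file in tessdata/configs/. In Tesseract 4.0 and later the option is spelled --psm.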
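
The same command can also be assembled from a script, which makes the role of each argument explicit. The following is a minimal sketch, not an official interface: the helper names build_tesseract_command and run_tesseract are hypothetical, and it assumes the tesseract executable is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=7,
                            configs=("digits",), psm_flag="-psm"):
    """Return the argument list for a tesseract call (hypothetical helper).

    psm_flag defaults to "-psm" as used in this memo; Tesseract 4.0 and
    later expect "--psm" instead.
    """
    return ["tesseract", image_path, output_base, psm_flag, str(psm), *configs]


def run_tesseract(image_path, output_base, **kwargs):
    """Run tesseract and return the path of the text file it writes."""
    command = build_tesseract_command(image_path, output_base, **kwargs)
    subprocess.run(command, check=True)  # raises if tesseract fails
    return output_base + ".txt"          # tesseract appends .txt itself


if __name__ == "__main__":
    # Only prints the command; call run_tesseract(...) to execute it.
    print(build_tesseract_command("xxx.jpg", "result"))
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.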

 

Limit the text to be recognized

For example, when recognizing an ID card number, the number normally consists only of the digits 0 to 9 and a capital X.

Restricting recognition to those characters improves accuracy.

For example, when recognizing part of an ID card:

Before the restriction is added, the result is: 1.3250

After restricting recognition to digits and X, the result is: 43250
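
To make the effect concrete, the two results above can be checked against the allowed character set. This is only an illustrative sketch; the name unexpected_chars is hypothetical and the check runs on the recognized string, not inside Tesseract.

```python
ID_CARD_CHARS = "0123456789X"  # digits plus a capital X, as described above


def unexpected_chars(text, allowed=ID_CARD_CHARS):
    """Return the characters in text that fall outside the allowed set."""
    return [ch for ch in text.strip() if ch not in allowed]


print(unexpected_chars("1.3250"))  # the unrestricted result contains "."
print(unexpected_chars("43250"))   # the restricted result has no stray characters
```

A check like this can also serve as a sanity test on any recognized string before it is used further.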

 

Specific method:

Open the Tesseract installation directory and go to

tessdata/configs/

Make a copy of the digits file and name the copy sfz. This adds a configuration with the rules for recognizing ID card numbers.

Open the sfz file in a text editor.

The tessedit_char_whitelist entry is followed by the characters that may be recognized.

For example:

tessedit_char_whitelist 0123456789X

Save and exit.

This is the whitelist: put every character or symbol you want recognized in it.
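
The config file can also be created from a script. The sketch below is one possible way to do it, under stated assumptions: the helper write_whitelist_config is hypothetical, and instead of copying digits it writes a file containing only the whitelist line, which assumes no other setting from digits is needed. The demo writes into a temporary directory; in real use the target would be the tessdata/configs/ directory of your installation.

```python
import tempfile
from pathlib import Path


def write_whitelist_config(configs_dir, name, chars):
    """Create a config file holding a single tessedit_char_whitelist line."""
    path = Path(configs_dir) / name
    path.write_text(f"tessedit_char_whitelist {chars}\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Temporary directory used as a stand-in for tessdata/configs/.
    with tempfile.TemporaryDirectory() as configs_dir:
        config_path = write_whitelist_config(configs_dir, "sfz", "0123456789X")
        print(config_path.read_text(encoding="utf-8"), end="")
```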


When running recognition, add the sfz configuration to the command, for example:

tesseract xxx.jpg result -psm 7 sfz

Python code:

import pytesseract
from PIL import Image

# Load the image and recognize it as a single text line using the sfz config
image = Image.open("../pic/c.png")
card_no = pytesseract.image_to_string(image, config='-psm 7 sfz')
print(card_no)
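
As an alternative to a config file, the whitelist can be passed inline with Tesseract's -c VAR=VALUE option. The following is a hedged sketch rather than the method used in this memo: the helpers whitelist_config and read_id_number are hypothetical, and whether the whitelist is honored depends on the installed Tesseract version and engine.

```python
def whitelist_config(chars, psm=7, psm_flag="-psm"):
    """Build a pytesseract config string with an inline character whitelist.

    Use psm_flag="--psm" with Tesseract 4.0 and later.
    """
    return f"{psm_flag} {psm} -c tessedit_char_whitelist={chars}"


def read_id_number(image_path, chars="0123456789X"):
    """Recognize one text line restricted to chars (needs pytesseract, Pillow)."""
    # Imported lazily so whitelist_config works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, config=whitelist_config(chars))
    return text.strip()


print(whitelist_config("0123456789X"))
```

This avoids touching the installation directory, at the cost of repeating the whitelist in every call.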

 

Language settings

image_to_string also accepts a lang parameter that sets the recognition language:

code = pytesseract.image_to_string(image,lang="chi_sim",config="-psm 6")

Combining languages

You can also combine language packs. For example, if the text you want to recognize contains both Chinese and English, you can set it like this:

code = pytesseract.image_to_string(image,lang="chi_sim+eng",config="-psm 6")
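
When the set of languages is decided at runtime, the lang string can be built from a list. A minimal sketch; lang_spec is a hypothetical helper, and the only fact it relies on is the "+" separator shown above.

```python
def lang_spec(*languages):
    """Join language codes with "+", as in lang="chi_sim+eng"."""
    if not languages:
        raise ValueError("at least one language code is required")
    return "+".join(languages)


print(lang_spec("chi_sim", "eng"))
```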

View local language packs

You can list the locally installed language packs with tesseract --list-langs.
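
The same list can be read from Python by running the command and parsing its output. This is a sketch under assumptions: the helper names are hypothetical, the output is assumed to start with one header line followed by one language code per line, and the exact header text may differ between versions.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by tesseract --list-langs."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (2):"; the rest are language codes.
    return lines[1:]


def installed_languages():
    """Run tesseract --list-langs (requires tesseract on the PATH)."""
    result = subprocess.run(["tesseract", "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Fall back to stderr in case a version prints the list there.
    return parse_list_langs(result.stdout or result.stderr)


sample = "List of available languages (2):\nchi_sim\neng\n"
print(parse_list_langs(sample))
```

Checking a requested language against this list before calling image_to_string gives a clearer error than a failed recognition run.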

Description of -psm

For a description of the -psm option used in config, run tesseract --help-psm.

I found descriptions of items 0-10 on the Internet (the other items were not found...), as follows:
0: Orientation and script detection (OSD) only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD or OCR (Optical Character Recognition).
3: Fully automatic page segmentation, but no OSD (default).
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single text line.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
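
The modes listed above can be kept in a small lookup table so that a script validates the mode before building the config string. A minimal sketch: PSM_MODES and psm_config are hypothetical names, the table covers only modes 0-10 from this memo, and your Tesseract version may define others.

```python
# Page segmentation modes 0-10 as listed in this memo.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
}


def psm_config(mode, psm_flag="-psm"):
    """Return a config string such as "-psm 7" after validating the mode."""
    if mode not in PSM_MODES:
        raise ValueError(f"unknown page segmentation mode: {mode}")
    return f"{psm_flag} {mode}"


print(psm_config(7), "=", PSM_MODES[7])
```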
 

Part of this article is based on: http://blog.csdn.net/github_33304260/article/details/79155154?from=singlemessage

 
