This is a tesseract usage memo. It mainly covers how to restrict the set of characters to be recognized.
URL:
The tesseract project URL is: http://code.google.com/p/tesseract-ocr/
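Command-line usage:
tesseract xxx.jpg result -psm 7 digits
Explanation:
tesseract    the command name
xxx.jpg      the input image; jpg and png both work
result       the output base name; tesseract appends .txt, so the recognized text is written to result.txt
-psm 7       the page segmentation mode (7 = treat the image as a single line of text)
digits       the name of a config file under tessdata/configs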
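The same call can be scripted. The following is only a sketch, assuming the tesseract executable is on PATH and that the stock digits config exists; the helper names build_command and run_tesseract are made up for this memo.

```python
import subprocess


def build_command(image_path, output_base, psm=7, config_name="digits"):
    """Return the argument list for a tesseract 3.x style call."""
    # tesseract appends ".txt" to output_base by itself.
    return ["tesseract", image_path, output_base, "-psm", str(psm), config_name]


def run_tesseract(image_path, output_base, psm=7, config_name="digits"):
    """Run tesseract, then return the text read back from <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, psm, config_name), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()


# Only builds the command; nothing is executed here.
print(build_command("xxx.jpg", "result"))
```

Calling run_tesseract("xxx.jpg", "result") would then perform the recognition and return the contents of result.txt.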
Limiting the text to be recognized
Take ID card numbers as an example: an ID card number normally consists only of the digits 0 to 9 and the capital letter X.
Once recognition is restricted to those characters, accuracy improves.
For example, when recognizing part of an ID card number:
Without the restriction, the result is: 1.3250
With recognition limited to digits and X, the result is: 43250
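A quick check on the recognized string shows why the first result is suspicious. This is a small sketch, not part of tesseract; the function name fits_whitelist is hypothetical.

```python
ALLOWED = set("0123456789X")  # digits plus capital X, as in an ID card number


def fits_whitelist(text, allowed=ALLOWED):
    """True if text is non-empty and every character is in the allowed set."""
    return bool(text) and all(ch in allowed for ch in text)


print(fits_whitelist("1.3250"))  # False: "." is not an allowed character
print(fits_whitelist("43250"))   # True
```

A check like this can flag results that contain characters an ID card number can never have, even when no whitelist was applied during recognition.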
Specific method:
Open the tesseract installation directory and go to
tessdata/configs/
Make a copy of the file digits and name it sfz; this adds a configuration for the ID card rules
Open the file sfz in a text editor
Put the characters to be recognized after tessedit_char_whitelist
For example:
tessedit_char_whitelist 0123456789X
Save and exit
This is the whitelist: it holds every letter or symbol you want recognized
At recognition time, add the sfz configuration to the command, for example:
tesseract xxx.jpg result -psm 7 sfz
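The config file can also be produced by a script. The sketch below writes the file from scratch instead of copying digits, which is a simplification of the steps above; the function name write_whitelist_config and the directory in the usage comment are assumptions.

```python
from pathlib import Path


def write_whitelist_config(configs_dir, name, characters):
    """Create <configs_dir>/<name> holding a single tessedit_char_whitelist line."""
    path = Path(configs_dir) / name
    path.write_text("tessedit_char_whitelist " + characters + "\n", encoding="ascii")
    return path


# Example call (the directory is an assumption; use your own tessdata/configs path):
# write_whitelist_config("/usr/share/tesseract-ocr/tessdata/configs", "sfz", "0123456789X")
```

Writing into the installation directory may need elevated permissions, so the function takes the directory as a parameter rather than guessing it.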
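Python code:
import pytesseract
from PIL import Image
# Open the image to recognize
image = Image.open("../pic/c.png")
# -psm 7: single line of text; sfz: the whitelist config created above
card_no = pytesseract.image_to_string(image, config='-psm 7 sfz')
print(card_no)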
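As an alternative to a config file, tesseract accepts variables on the command line through -c NAME=VALUE, so the whitelist can be passed inline. This is a sketch under the assumption that your tesseract build supports -c and honors the whitelist; the helper name whitelist_config is hypothetical.

```python
def whitelist_config(characters, psm=7):
    """Build a pytesseract config string with an inline character whitelist."""
    return "-psm {} -c tessedit_char_whitelist={}".format(psm, characters)


config = whitelist_config("0123456789X")
print(config)  # -psm 7 -c tessedit_char_whitelist=0123456789X

# Usage with pytesseract (not run here):
# card_no = pytesseract.image_to_string(image, config=config)
```

This avoids touching the installation directory, at the cost of repeating the character set in every call.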
Language settings
image_to_string also takes a lang parameter that sets the recognition language:
code = pytesseract.image_to_string(image, lang="chi_sim", config="-psm 6")
Combining languages
Language packs can also be combined. For example, if the text to recognize contains both Chinese and English, set it like this:
code = pytesseract.image_to_string(image, lang="chi_sim+eng", config="-psm 6")
Viewing local language packs
The installed language packs can be listed with tesseract --list-langs.
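Before passing a combined lang value, it can help to confirm that every pack is installed. The sketch below parses text in the shape printed by tesseract --list-langs; the sample string is made up, and both function names are hypothetical.

```python
def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as "List of available languages (2):".
    return [line for line in lines if not line.lower().startswith("list of")]


def missing_langs(lang_spec, installed):
    """Return the parts of a spec like "chi_sim+eng" that are not installed."""
    return [code for code in lang_spec.split("+") if code not in installed]


sample = "List of available languages (2):\nchi_sim\neng\n"  # made-up sample output
installed = parse_list_langs(sample)
print(installed)                                # ['chi_sim', 'eng']
print(missing_langs("chi_sim+eng", installed))  # []
```

To feed it real data, capture the command's output with subprocess; depending on the tesseract version the list may arrive on stdout or stderr, so reading both is the safer assumption.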
Description of -psm
The -psm item in config selects the page segmentation mode. The full list is printed by tesseract --help-psm (tesseract 4 and later spell the option --psm).
Descriptions of modes 0 to 10, found online, are listed below (the remaining modes were not covered):
0: Orientation and script detection (OSD) only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD and no OCR (Optical Character Recognition).
3: Fully automatic page segmentation, but no OSD (default).
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single line of text.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
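The mode list can be kept next to the code that builds the config string, so an invalid mode is caught before tesseract runs. This is a sketch only; PSM_MODES and psm_config are names invented here, and the modern switch reflects the --psm spelling used by tesseract 4 and later.

```python
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD and no OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    5: "Assume a single uniform block of vertically aligned text",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single line of text",
    8: "Treat the image as a single word",
    9: "Treat the image as a single word in a circle",
    10: "Treat the image as a single character",
}


def psm_config(mode, modern=False):
    """Return "-psm N" (3.x spelling) or "--psm N" (4.x and later spelling)."""
    if mode not in PSM_MODES:
        raise ValueError("unknown page segmentation mode: {}".format(mode))
    flag = "--psm" if modern else "-psm"
    return "{} {}".format(flag, mode)


print(psm_config(7))               # -psm 7
print(psm_config(6, modern=True))  # --psm 6
```

The returned string can be passed as the config argument of pytesseract.image_to_string, optionally followed by a config name such as sfz.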
Parts of this article draw on: http://blog.csdn.net/github_33304260/article/details/79155154?from=singlemessage