Getting Started with the Tesseract-OCR Engine

OCR (Optical Character Recognition): the process of analyzing an image file and recognizing and extracting the text it contains.

Tesseract: An open source OCR engine. The original Tesseract engine was developed at HP Labs and later released as open source. Google then improved it, fixing bugs and optimizing it, and re-released it. The current version at the time of writing is 3.01.

The project address is: http://code.google.com/p/tesseract-ocr


Recognizing a CAPTCHA with the Tesseract-OCR engine from the Windows command line:

1. Download and install the Tesseract-OCR engine (only version 3.0 and later supports Chinese recognition)

tesseract-ocr-setup-3.01-1.exe

After downloading, run the installer. By default, it adds the installation directory to the system environment variables, so you can run tesseract from any directory at the command prompt. After the installation is complete, the directory is as follows:

Note:

The tessdata directory stores the language data files and the config files that can be passed as parameters on the command line. The installer includes the English language data by default.

If you want to recognize Chinese, go to http://code.google.com/p/tesseract-ocr/downloads/list and download the language data file for the language you need.

The Simplified Chinese language data file can be downloaded from: http://tesseract-ocr.googlecode.com/files/chi_sim.traineddata.gz  After downloading, decompress it and move the resulting file into the tessdata directory.
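
The decompress-and-move step can also be scripted. The following is a minimal Python sketch, not part of the Tesseract distribution: the helper name unpack_traineddata is hypothetical, and it assumes the archive has already been downloaded and that you know where your tessdata directory is.

```python
import gzip
import shutil
from pathlib import Path


def unpack_traineddata(gz_path, tessdata_dir):
    """Decompress a .traineddata.gz archive into the tessdata directory.

    Hypothetical helper for illustration; both paths are supplied by the caller.
    """
    gz_path = Path(gz_path)
    tessdata_dir = Path(tessdata_dir)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    # Dropping the .gz suffix gives the target name:
    # chi_sim.traineddata.gz -> chi_sim.traineddata
    target = tessdata_dir / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Call it with the path of the downloaded chi_sim.traineddata.gz and the tessdata directory under your own installation directory; the function returns the path of the file it wrote.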


2. Use the Tesseract-OCR engine to recognize the CAPTCHA

Open a command prompt and enter tesseract:


If Tesseract prints its usage message, the installation is working.
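
The same check can be done from a script. This is a rough Python sketch under the assumption that the installer put tesseract on the PATH as described above; the function names find_tesseract and usage_text are made up for this example.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def usage_text(executable):
    """Run the executable with no arguments and return whatever it prints."""
    done = subprocess.run([executable], capture_output=True, text=True)
    # The usage message may go to stdout or stderr, so collect both.
    return done.stdout + done.stderr


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; check the environment variables")
    else:
        print(usage_text(path))
```

If find_tesseract returns None, the environment variable was probably not set by the installer, and the installation directory needs to be added to PATH by hand.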

I prepared a CAPTCHA image, code.jpg, and put it in the root directory of the D: drive:


The result is:



appendix:

Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before anyconfigfile.


tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

In other words: tesseract, followed by the image name, the output file name, -l with the language, -psm with the page segmentation mode, and then any config files.

For example:

tesseract code.jpg result  -l chi_sim -psm 7 nobatch

-l chi_sim selects the Simplified Chinese language data (you need to download the Chinese language data file, decompress it, and place it in the tessdata directory; language data files have the extension .traineddata, and the Simplified Chinese file is named chi_sim.traineddata).

-psm 7 tells tesseract that the code.jpg image contains a single line of text. Setting this parameter can reduce the recognition error rate. The default is 3.

The configfile parameter is the name of a file in the tessdata\configs or tessdata\tessconfigs directory.
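
To show how the arguments fit together, here is a small Python sketch that assembles the command line in the order the usage text requires and then runs it. The function names build_tesseract_command and recognize are hypothetical. The sketch uses the -psm spelling shown in this article's usage text (other Tesseract releases may spell the flag differently) and assumes tesseract writes its text to the output name with .txt appended.

```python
import subprocess


def build_tesseract_command(image, outputbase, lang=None, psm=None, configs=()):
    """Assemble the argument list in the order the usage text requires."""
    cmd = ["tesseract", image, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        # The usage text lists pagesegmode values 0 through 10.
        if not 0 <= psm <= 10:
            raise ValueError("pagesegmode must be between 0 and 10")
        cmd += ["-psm", str(psm)]
    # -l and -psm must come before any configfile, so configs go last.
    cmd += list(configs)
    return cmd


def recognize(image, outputbase, **options):
    """Run tesseract and return the text read from <outputbase>.txt.

    Assumes tesseract is on PATH and writes UTF-8 text to <outputbase>.txt.
    """
    subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)
    with open(outputbase + ".txt", encoding="utf-8") as f:
        return f.read()
```

For the example above, build_tesseract_command("code.jpg", "result", lang="chi_sim", psm=7, configs=["nobatch"]) produces the same arguments as the command typed by hand, and recognize takes the same parameters but also runs the engine and reads back the recognized text.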

