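I've recently been writing a crawler for *车之家 and ran into custom glyphs that are generated dynamically and distorted, so my old approach of directly comparing the unchanging part of each glyph no longer works. After thinking it over for a long while, and since I don't know much about manipulating font glyphs, I decided to simply recognize them with OCR.
The problem: when I used pytesseract with the Chinese language option but without a psm parameter, nothing was recognized. After some searching, I found
that --psm 8 should be added, so the whole image is handled as a single Chinese character (strictly speaking, mode 8 means "single word"; mode 10 is the "single character" mode, as the list below shows).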
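As a rough sketch of the fix described above, the snippet below passes the psm value to pytesseract through its config string. The helper names psm_config and recognize_glyph are made up for illustration, and the language code chi_sim (Simplified Chinese) is an assumption, since the exact language option used is not stated; it also assumes the matching traineddata file is installed.

```python
def psm_config(psm):
    """Build the Tesseract config string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return f"--psm {psm}"


def recognize_glyph(image, psm=8, lang="chi_sim"):
    """Run OCR on one cropped glyph image (a PIL image) and return the text.

    lang="chi_sim" is an assumption; use whichever Chinese model is installed.
    """
    # Imported here so psm_config() can be used without the OCR stack installed.
    import pytesseract

    text = pytesseract.image_to_string(image, lang=lang, config=psm_config(psm))
    return text.strip()
```

With Pillow installed, something like recognize_glyph(Image.open("glyph.png")) would be the typical call, where glyph.png stands in for a cropped image of a single character. If mode 8 still returns nothing, passing psm=10 is worth a try.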
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Here is a sample usage of image_to_string with multiple parameters.
import pytesseract
# image is a PIL image loaded beforehand; psm 10 = single character, whitelist = digits only
target = pytesseract.image_to_string(image, lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
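Since the right mode was only found by trial and error, it can help to run the same image through several modes and compare the results. The following is a minimal sketch of that idea; build_config and compare_modes are hypothetical helpers, not part of pytesseract, and the ocr parameter exists only so the function can be exercised without Tesseract installed.

```python
def build_config(psm, oem=None, whitelist=None):
    """Assemble a config string such as '--psm 10 --oem 3 -c tessedit_char_whitelist=...'."""
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def compare_modes(image, modes=(7, 8, 10), lang="eng", ocr=None):
    """Return {psm: recognized text} for each page segmentation mode in modes."""
    if ocr is None:
        # Default to the real OCR call; imported lazily so it stays optional.
        import pytesseract
        ocr = pytesseract.image_to_string
    return {
        psm: ocr(image, lang=lang, config=build_config(psm)).strip()
        for psm in modes
    }
```

Calling compare_modes(image, modes=(8, 10), lang="chi_sim") would then show side by side whether the single-word or the single-character mode reads a given glyph better (chi_sim is again an assumed language code).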