Recognizing text in graphic verification codes with pytesseract

1. Purpose

At work we sometimes need to recognize text in an image, for example to read a graphic verification code or to extract an article from a picture.

If the workload is small or the task is a one-off, doing it by hand is enough. Some scenarios, however, require the recognition to be automated.

For those cases we need a programmatic method. OCR (optical character recognition) is the usual choice, but how do we actually use it?

2. Implementation method

This article uses Python's pytesseract library to recognize text in images.

pytesseract is a wrapper: it calls the Tesseract OCR engine to recognize the text in the image, and the script then prints the recognized text.
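
To make the wrapper relationship concrete, here is a minimal sketch of calling the tesseract executable directly from Python. It is an illustration only and not pytesseract's actual implementation. The function names and the file name 0.png are hypothetical, and run_tesseract assumes that tesseract is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build a command line that prints the recognized text to standard output."""
    # Using 'stdout' as the output base makes tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text (hypothetical helper)."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout.decode("utf-8")


# Show the command that would be run for a hypothetical file name.
print(" ".join(build_tesseract_command("0.png", lang="chi_sim")))
```

pytesseract saves you from handling the subprocess and its output yourself, which is why the rest of this article uses it.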

3. Environment setup

  • pytesseract library

    • Installation: pip install pytesseract [the script imports this library, so it must be installed first]
  • tesseract

    • Installation: https://blog.csdn.net/Alexa_/article/details/121192132

    • Language packs: remember to select the language packs you need during installation. English and Simplified Chinese are the most common: eng (the default) and chi_sim. You can check which .traineddata files are present in the tessdata folder under the installation path, and download any missing one separately: https://developer.aliyun.com/article/832266

    • Set the path pytesseract uses to call Tesseract: [after installation, pytesseract reports an error on first use because the path of the Tesseract executable is not specified] Edit
      the pytesseract.py file in the folder where the pytesseract library is installed and set its tesseract_cmd variable to the path of the Tesseract executable (if the path is not set yet, clicking the file name in the error traceback jumps straight to the file)
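
As an alternative to editing the installed pytesseract.py, the path can also be set at runtime through pytesseract.pytesseract.tesseract_cmd. The sketch below looks for the executable and lists the language packs in a tessdata folder. The helper names and the Windows path in CANDIDATE_PATHS are assumptions for illustration; adjust the path to your own installation.

```python
import os
import shutil

# Assumed install location on Windows; change it to match your machine.
CANDIDATE_PATHS = [r"C:\Program Files\Tesseract-OCR\tesseract.exe"]


def find_tesseract(candidates=CANDIDATE_PATHS):
    """Return the path of a tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def installed_languages(tessdata_dir):
    """List the codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


def configure_pytesseract(candidates=CANDIDATE_PATHS):
    """Point pytesseract at the executable found by find_tesseract."""
    import pytesseract  # imported here so the helpers above work without it

    exe = find_tesseract(candidates)
    if exe is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = exe
    return exe
```

Calling configure_pytesseract() once at the start of a script has the same effect as the manual edit, and the setting survives a reinstall of the library. If installed_languages does not list chi_sim, the Chinese language pack still needs to be downloaded.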

4. Code implementation

import pytesseract
from PIL import Image  # Pillow provides Image.open, which is used below

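def images_to_string(num):
    # Open a local image (an image fetched over the network would also work).
    # '验证码' ("verification code") is the folder that holds the input images.
    name = '验证码\\' + str(num) + '.png'
    img1 = Image.open(name)

    # Get the image width and height, used below to locate and extract
    # the text from a complex background
    w, h = img1.size
    print('Original image size: %sx%s' % (w, h))

    # PNG pixels are not always stored as plain RGB: each pixel may also carry
    # an alpha (transparency) value, so convert to RGB first
    img1rgb = img1.convert('RGB')

    # Load the pixel data for reading and writing
    src_strlist = img1rgb.load()

    # Sample the main color of the text so it can be separated from the background.
    # To find a suitable coordinate, open the image in a paint tool and point at a character.
    # (This fixed coordinate assumes the image is at least 120x27 pixels.)
    data = src_strlist[119, 26]
    print(data)

    # Loop over every pixel and keep only those that match the text color
    for x in range(0, w):
        for y in range(0, h):
            # Treat the pixel as text when its red and green channels are both below 30
            co = src_strlist[x, y]
            if co[0] < 30 and co[1] < 30:
                src_strlist[x, y] = (0, 0, 0)
            else:
                src_strlist[x, y] = (255, 255, 255)

    # Save the processed image for inspection ('处理' means "processed")
    nume01 = '验证码\\处理\\' + str(num) + '.png'
    img1rgb.save(nume01)

    # Run OCR directly on the in-memory PIL image. lang selects the language pack:
    # the default is English (eng), and chi_sim is Simplified Chinese
    text = pytesseract.image_to_string(img1rgb, lang='chi_sim')
    # Print the result
    print(text)
    # Append the recognized text to a text file
    with open('验证码\\11.txt', 'ab') as fier:
        fier.write('\n'.encode())
        fier.write(('Image number ' + str(num)).encode())
        fier.write('\n'.encode())
        fier.write(text.encode())

# Call the function to extract the text from 15 local images (0.png to 14.png) into the TXT file
for num in range(15):
    images_to_string(num)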
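
The color filter inside the pixel loop can be isolated as a small pure function, which makes the threshold easier to test and tune. The sketch below is a variation and not the original rule: it also checks the blue channel, because the original condition tests only red and green and therefore counts a saturated blue pixel such as (0, 0, 255) as text. The names is_text_pixel and binarize are hypothetical, and the limit of 30 is taken from the code above.

```python
def is_text_pixel(rgb, limit=30):
    """Return True when all three color channels are below limit (near-black)."""
    r, g, b = rgb[:3]
    return r < limit and g < limit and b < limit


def binarize(rows, limit=30):
    """Map each RGB pixel to black (text) or white (background)."""
    black, white = (0, 0, 0), (255, 255, 255)
    return [
        [black if is_text_pixel(px, limit) else white for px in row]
        for row in rows
    ]


# A tiny 2x2 example: one dark pixel, one blue, one gray, one white.
sample = [
    [(10, 12, 8), (0, 0, 255)],
    [(128, 128, 128), (255, 255, 255)],
]
result = binarize(sample)
print(result[0][0], result[0][1])
```

In the function above, the same check can be applied by calling is_text_pixel on each pixel value read inside the loop. Whether the stricter rule helps depends on the colors used in your images.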

5. Expansion

Remaining problem: recognition accuracy is high for regular, clearly printed text, but errors are common for distorted or irregular text and digits.
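
One thing that may help with short, irregular codes is passing Tesseract options through the config parameter of pytesseract.image_to_string. The sketch below only builds such an option string: --psm 7 asks Tesseract to treat the image as a single line of text, and tessedit_char_whitelist restricts the characters it may output. Whether this improves the results on a given verification code has not been tested here, and depending on the Tesseract version and engine mode the whitelist may be ignored. The helper name build_config is hypothetical.

```python
def build_config(psm=7, whitelist=None):
    """Build a Tesseract option string for pytesseract's config parameter."""
    parts = ["--psm", str(psm)]
    if whitelist:
        # Restrict recognition to the given characters, e.g. digits only.
        parts += ["-c", "tessedit_char_whitelist=" + whitelist]
    return " ".join(parts)


# Options for a digits-only code on a single line.
config = build_config(psm=7, whitelist="0123456789")
print(config)
```

The resulting string would be passed as pytesseract.image_to_string(img, lang='eng', config=config), where img is a PIL image such as the processed one saved in section 4.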


Origin blog.csdn.net/qq_32828053/article/details/124616673