Python crawler CAPTCHA recognition modules: tesserocr and pytesseract

tesserocr runs into various compatibility problems on Windows and with PyCharm virtual environments, so on Windows it is easier to install the pytesseract module instead. If you really want tesserocr, install it from a .whl file or with conda.

pip install pytesseract

If pytesseract cannot find the tesseract executable at run time (this usually happens in a virtual environment), either add the directory that contains tesseract.exe from Tesseract-OCR to the Windows PATH environment variable, or edit the pytesseract.py file and set the "tesseract_cmd" variable to the full path of tesseract.exe.
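
The path can also be set from your own script rather than by editing the installed file, since pytesseract reads `pytesseract.pytesseract.tesseract_cmd` at call time. The following is a minimal sketch, not the only way to do it: the helper names `find_tesseract` and `configure_pytesseract` and the fallback install path are assumptions for illustration.

```python
import os
import shutil

# Assumed default install location on Windows; adjust to your own setup.
FALLBACK_PATHS = [r"C:\Program Files\Tesseract-OCR\tesseract.exe"]

def find_tesseract(fallback_paths=FALLBACK_PATHS):
    """Return the tesseract executable path, or None if it cannot be found."""
    # Look on PATH first; this covers the "add it to PATH" option.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try the (hypothetical) fallback locations.
    for path in fallback_paths:
        if os.path.isfile(path):
            return path
    return None

def configure_pytesseract():
    """Point pytesseract at the executable located by find_tesseract()."""
    import pytesseract  # imported here so the lookup above works without it
    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Call `configure_pytesseract()` once at the start of the script, before the first `image_to_string` call.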

Test the recognition function:

import pytesseract
from PIL import Image

image = Image.open('tesseracttest.png')  # image file name
text = pytesseract.image_to_string(image)
print(text)
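
Verification codes are usually one short line of letters and digits, so it may help to pass Tesseract a page segmentation mode and to clean up the returned string. The following is a minimal sketch, assuming the code contains only ASCII letters and digits; the function names and the choice of `--psm 7` (treat the image as a single text line) are illustrative assumptions.

```python
import re

def normalize_captcha_text(raw):
    """Drop whitespace and any character that is not an ASCII letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)

def read_captcha(path, lang="eng"):
    """OCR one image as a single line of text and return the cleaned string."""
    # Imported here so the cleanup helper can be used without OCR installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        raw = pytesseract.image_to_string(image, lang=lang, config="--psm 7")
    return normalize_captcha_text(raw)
```

For example, `read_captcha('tesseracttest.png')` would return the recognized characters with line breaks and stray punctuation removed.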

On Ubuntu (Linux), the installation commands are as follows:

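# Install tesseract
sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev

# Install the language packs
# (the tessdata directory can differ between Tesseract versions; adjust the target path if needed)
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata

# Install pytesseract
pip3 install pytesseract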
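
After installation it can be worth checking that the tesseract binary is visible and which language packs it can see. The following is a small sketch using only the standard library; `tesseract_report` is a hypothetical helper name, and it relies on the `--version` and `--list-langs` options of the tesseract command.

```python
import shutil
import subprocess

def tesseract_report():
    """Return version and language listing as text, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    parts = []
    for flag in ("--version", "--list-langs"):
        # Merge stderr into stdout because some versions print there instead.
        done = subprocess.run(
            [exe, flag],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        parts.append(done.stdout.strip())
    return "\n".join(parts)

report = tesseract_report()
print(report if report is not None else "tesseract not found on PATH")
```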

Binarize the picture, save the result as another picture, and run OCR on it

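from PIL import Image
import subprocess

def cleanFile(filePath, newFilePath):
    image = Image.open(filePath)

    # Apply a threshold filter (values below 143 become black, everything else white)
    image = image.point(lambda x: 0 if x < 143 else 255)
    # Save the filtered image under the new file name
    image.save(newFilePath)

    # Call the system tesseract command to run OCR on the filtered image
    subprocess.call(["tesseract", newFilePath, "output"])

    # Open the output file and read the result
    with open("output.txt", 'r', encoding="utf-8") as f:
        print(f.read())

if __name__ == "__main__":
    # Read tesseracttest.jpg, save the thresholded copy as 123.jpg, then print the recognized text
    cleanFile("tesseracttest.jpg", "123.jpg")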
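
The same threshold-then-OCR flow can also be kept in memory with pytesseract instead of calling the tesseract command and reading output.txt. The following is a minimal sketch, assuming pytesseract and Pillow are installed; `threshold_table` and `ocr_with_threshold` are hypothetical helper names, and converting to grayscale first is an added choice so that the threshold applies to one brightness channel rather than to each color band separately.

```python
def threshold_table(threshold=143):
    """Build a 256-entry lookup table: below the threshold -> 0, otherwise -> 255."""
    return [0 if value < threshold else 255 for value in range(256)]

def ocr_with_threshold(path, threshold=143):
    """Binarize the image at `path` in memory and return the recognized text."""
    # Imported here so the lookup table can be inspected without OCR installed.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        gray = image.convert("L")                       # single 8-bit channel
        binary = gray.point(threshold_table(threshold)) # apply the lookup table
        return pytesseract.image_to_string(binary)
```

Calling `ocr_with_threshold("tesseracttest.jpg")` avoids the intermediate 123.jpg and output.txt files; the threshold of 143 is taken from the code above and will likely need tuning per image set.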

This post covers only simple recognition; more will be added in a later update.


Origin blog.csdn.net/weixin_43407092/article/details/88555394