tesseract_ocr + pytesseract image recognition

I. Windows installation and configuration

For installation and configuration on other systems, see GitHub: https://github.com/tesseract-ocr/tesseract/wiki
To download tesseract-ocr, see: https://github.com/tesseract-ocr/tesseract/wiki/Downloads
To download chi_sim.traineddata, see: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

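1. pip install pytesseract
2. pip install pillow
3. Install tesseract-ocr
4. Locate pytesseract.py inside the pytesseract module and set tesseract_cmd to the full path of tesseract.exe: tesseract_cmd = r'F:\tesseract_ocr\tesseract-Win64\tesseract.exe'
5. Add an environment variable (name: TESSDATA_PREFIX, value: F:\tesseract_ocr\tesseract-Win64, i.e. the installation directory)
6. To recognize Chinese, download the training data chi_sim.traineddata and copy it into the F:\tesseract_ocr\tesseract-Win64\tessdata directory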
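
Step 4 edits a file inside the installed package, and that file is replaced whenever pytesseract is reinstalled or upgraded. A possible alternative, sketched below, is to set the same values at runtime from your own code. It assumes the module attribute pytesseract.pytesseract.tesseract_cmd described in the pytesseract README; configure_tesseract is a hypothetical helper name, and the install path is the one used in the steps above.

```python
import ntpath  # Windows path rules, so the sketch behaves the same on any host OS
import os


def configure_tesseract(install_dir):
    """Point pytesseract and Tesseract at install_dir without editing library files."""
    exe_path = ntpath.join(install_dir, "tesseract.exe")
    # Same value as the TESSDATA_PREFIX environment variable in step 5.
    os.environ["TESSDATA_PREFIX"] = install_dir
    try:
        import pytesseract
    except ImportError:
        # pytesseract is not installed; only the environment variable is set.
        return exe_path
    # Module-level attribute that pytesseract reads when it launches Tesseract.
    pytesseract.pytesseract.tesseract_cmd = exe_path
    return exe_path


exe = configure_tesseract(r"F:\tesseract_ocr\tesseract-Win64")
print(exe)
```

Calling the helper once at startup (for example at the top of the views module) would cover steps 4 and 5 for that process only; it does not change the system-wide environment.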

PS:
To set the environment variable temporarily in cmd for testing: set TESSDATA_PREFIX=F:\tesseract_ocr\tesseract-Win64
To run from the command line (the result is saved as a .txt file): tesseract.exe E:\python\project\mysite\media\tesseract.png C:\Users\konglingxi\Desktop\test -l chi_sim+equ+eng
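
The same command-line test can be scripted. The sketch below builds the argument list shown above and, if run_tesseract is called, launches it with TESSDATA_PREFIX set only for the child process. The helper names build_tesseract_command and run_tesseract are hypothetical; the paths and language codes are the ones from the command above.

```python
import os
import subprocess


def build_tesseract_command(image_path, output_base, langs, exe="tesseract"):
    """Build the argument list; tesseract itself appends ".txt" to output_base."""
    return [exe, image_path, output_base, "-l", "+".join(langs)]


def run_tesseract(image_path, output_base, langs, tessdata_prefix, exe="tesseract"):
    """Run tesseract with TESSDATA_PREFIX set only for the child process."""
    env = dict(os.environ, TESSDATA_PREFIX=tessdata_prefix)
    command = build_tesseract_command(image_path, output_base, langs, exe)
    return subprocess.call(command, env=env)


command = build_tesseract_command(
    r"E:\python\project\mysite\media\tesseract.png",
    r"C:\Users\konglingxi\Desktop\test",
    ["chi_sim", "equ", "eng"],
    exe="tesseract.exe",
)
print(command)
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.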

II. Example

.py file

#!/usr/bin/python
# coding:utf-8
from __future__ import unicode_literals
import os

from django.conf import settings
from django.shortcuts import render
import pytesseract
from PIL import Image as pillow_image


__author__ = "klx"


# Create your views here.

def binaryzation(threshold, image_address):
    """
    Binarize an image.
    :param threshold: gray levels below this value become black, the rest white
    :param image_address: path of the image file
    :return: the binarized image (mode '1')
    """
    image = pillow_image.open(image_address)  # open the image
    image = image.convert('L')  # convert to grayscale
    table = []
    for x in range(256):  # build the binarization lookup table
        if x < threshold:
            table.append(0)
        else:
            table.append(1)
    image = image.point(table, '1')
    return image


def main():
    """
    Test
    :return: the recognized text
    """
    # specify the data directory
    tessdata_dir_config = '--tessdata-dir "F:\\tesseract_ocr\\tesseract-Win64"'
    image_url = os.path.join(settings.MEDIA_ROOT, "tesseract.png")
    # image_url = os.path.join(settings.MEDIA_ROOT, "tesseract.jpg")
    image = binaryzation(200, image_url)
    # Show the binarized image: a poor threshold can turn it into a blank white image that cannot be recognized
    image.show()
    result = pytesseract.image_to_string(image, config=tessdata_dir_config, lang="chi_sim+eng")  # convert the image to a string
    return result


def test(request):
    res = main()
    # render() builds the template context from the request
    return render(request, "ocr_app/test.html", {"data": res})
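
The effect of the threshold passed to binaryzation can be examined without Pillow. The sketch below rebuilds the same 256-entry lookup table and applies it to a handful of made-up gray levels, to illustrate why a badly chosen threshold leaves an all-white image. The names build_threshold_table and white_fraction and the sample values are hypothetical.

```python
def build_threshold_table(threshold):
    """Map each 8-bit gray level to 0 (black) below the threshold, else 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def white_fraction(gray_levels, threshold):
    """Fraction of pixels that end up white after thresholding."""
    table = build_threshold_table(threshold)
    mapped = [table[level] for level in gray_levels]
    return sum(mapped) / float(len(mapped))


# Hypothetical gray levels: dark text (30, 60) on a light background (210-250).
sample = [30, 60, 210, 220, 250, 240, 230, 245]
print(white_fraction(sample, 200))  # the two dark pixels stay black
print(white_fraction(sample, 20))   # every pixel maps to white: the text is lost
```

With real images, comparing the white fraction across a few thresholds is one rough way to pick a value before calling image.show().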

.html template

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>tesseract_ocr</title>
</head>
<body>
{{ data }}
</body>
</html>

Reposted from www.cnblogs.com/konglingxi/p/10213739.html