I. Installation and configuration on Windows
For other operating systems, see the installation notes on GitHub: https://github.com/tesseract-ocr/tesseract/wiki
To download tesseract-ocr, see: https://github.com/tesseract-ocr/tesseract/wiki/Downloads
To download chi_sim.traineddata, see: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
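1. pip install pytesseract
2. pip install pillow
3. Install tesseract-ocr.
4. Find pytesseract.py in the pytesseract module and change the line to tesseract_cmd = r'F:\tesseract_ocr\tesseract-Win64\tesseract.exe'. Alternatively, leave the installed package untouched and assign the same path to pytesseract.pytesseract.tesseract_cmd in your own code.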
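5. Add an environment variable (name: TESSDATA_PREFIX, value: F:\tesseract_ocr\tesseract-Win64, i.e. the installation directory). This matches Tesseract 3.x; Tesseract 4.0 and later generally expect the variable to point at the tessdata folder itself.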
6. To recognize Chinese, download the trained data file chi_sim.traineddata and copy it into the F:\tesseract_ocr\tesseract-Win64\tessdata directory.
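The layout produced by steps 3 to 6 can be checked from Python before running any OCR. The following is a minimal sketch, not part of the original setup: `missing_tesseract_files` is a hypothetical helper name, and it assumes the installation directory contains `tesseract.exe` and a `tessdata` subfolder as described in the steps above.

```python
from pathlib import Path


def missing_tesseract_files(install_dir, languages=("chi_sim",)):
    """Return the expected files that are absent under install_dir.

    Hypothetical helper. It assumes the layout from the steps above:
    <install_dir>/tesseract.exe and <install_dir>/tessdata/<lang>.traineddata
    """
    root = Path(install_dir)
    expected = [root / "tesseract.exe"]
    expected += [root / "tessdata" / (lang + ".traineddata") for lang in languages]
    # Keep only the paths that do not exist as regular files.
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # The directory below is the example path used in this article.
    for path in missing_tesseract_files(r"F:\tesseract_ocr\tesseract-Win64"):
        print("missing:", path)
```

An empty list means both the executable and the Chinese language data are where the steps put them.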
P.S.:
To set the environment variable temporarily in cmd for testing: set TESSDATA_PREFIX=F:\tesseract_ocr\tesseract-Win64
To run from the command line (the result is saved as a .txt file): tesseract.exe E:\python\project\mysite\media\tesseract.png C:\Users\konglingxi\Desktop\test -l chi_sim+equ+eng
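The same command can also be launched from Python. This is a sketch under assumptions: `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and `tesseract.exe` is assumed to be on `PATH` (otherwise pass its full path as `exe`). Tesseract appends `.txt` to the output base name on its own, which is why the command above passes `...\Desktop\test` without an extension.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages, exe="tesseract.exe"):
    """Build the argument list: <exe> <image> <output base> -l lang1+lang2."""
    return [exe, image_path, output_base, "-l", "+".join(languages)]


def run_tesseract(image_path, output_base, languages, exe="tesseract.exe"):
    """Run tesseract; raises CalledProcessError if it exits with an error."""
    command = build_tesseract_command(image_path, output_base, languages, exe)
    subprocess.run(command, check=True)
    # Tesseract writes the recognized text to <output_base>.txt
    return output_base + ".txt"
```

For example, `run_tesseract(r"E:\python\project\mysite\media\tesseract.png", r"C:\Users\konglingxi\Desktop\test", ["chi_sim", "equ", "eng"])` would issue the command shown above.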
II. Example
.py file
#!/usr/bin/python
# coding:utf-8
from __future__ import unicode_literals

import os

import pytesseract
from django.conf import settings
from django.shortcuts import render
from PIL import Image as pillow_image

__author__ = "klx"


# Create your views here.
def binaryzation(threshold, image_address):
    """
    Binarize an image.
    :param threshold: gray levels below this value map to 0, all others to 1
    :param image_address: path of the image file
    :return: the binarized image (mode '1')
    """
    image = pillow_image.open(image_address)  # open the image
    image = image.convert('L')  # convert to grayscale
    table = []
    for x in range(256):  # build the binarization lookup table
        if x < threshold:
            table.append(0)
        else:
            table.append(1)
    image = image.point(table, '1')
    return image


def main():
    """
    Test run.
    :return: the recognized text
    """
    # Specify the data directory
    tessdata_dir_config = '--tessdata-dir "F:\\tesseract_ocr\\tesseract-Win64"'
    image_url = os.path.join(settings.MEDIA_ROOT, "tesseract.png")
    # image_url = os.path.join(settings.MEDIA_ROOT, "tesseract.jpg")
    image = binaryzation(200, image_url)
    # Preview the binarized image: a poor threshold can turn it entirely
    # white, which makes recognition impossible
    image.show()
    # Convert the image to a string
    result = pytesseract.image_to_string(image, config=tessdata_dir_config, lang="chi_sim+eng")
    return result


def test(request):
    res = main()
    # render() replaces render_to_response(..., context_instance=...),
    # which recent Django versions no longer support
    return render(request, "ocr_app/test.html", {"data": res})
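To see what the lookup table in `binaryzation` does without opening an image, the same thresholding can be sketched on a plain list of gray levels. This is an illustrative sketch using only the standard library; `threshold_table` and `white_ratio` are hypothetical names, not Pillow functions, and the sample values are made up. A white ratio of 1.0 corresponds to the all-white result that the `image.show()` preview is meant to catch.

```python
def threshold_table(threshold):
    """Map each gray level 0..255 to 0 below the threshold, else 1."""
    return [0 if level < threshold else 1 for level in range(256)]


def white_ratio(gray_levels, threshold):
    """Fraction of pixels that map to 1 (white) after thresholding."""
    table = threshold_table(threshold)
    bits = [table[level] for level in gray_levels]
    return sum(bits) / float(len(bits))


# Hypothetical sample: dark text pixels (30, 60) on a light background.
sample = [30, 220, 250, 60, 220, 250, 250, 220]
print(white_ratio(sample, 200))  # 6 of the 8 pixels are >= 200
print(white_ratio(sample, 20))   # every pixel is >= 20, so all turn white
```

If the ratio comes out at or near 1.0 for a real image, raising the threshold is one thing to try before passing the image to pytesseract.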
.html template
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>tesseract_ocr</title>
</head>
<body>
{{ data }}
</body>
</html>