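Introduction to Tesseract OCR
The Tesseract OCR engine was first developed at HP Labs starting in 1985, and by 1995 it had become one of the three most accurate recognition engines in the OCR industry. However, HP soon decided to drop its OCR business, and Tesseract was shelved.
Years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005, Tesseract was taken over by an information technology research institute in Nevada, USA, which asked Google to improve it, fix its bugs, and optimize it.
After the most important bugs had been fixed, Google considered Tesseract OCR stable enough to be released again as open-source software.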
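1. Installing and testing on Windows
- Install Microsoft Visual C++ 2015, choosing the 32-bit or 64-bit download to match your operating system.
- Set the environment variable TESSDATA_PREFIX to your Tesseract path, for example D:\abc\def\Tesseract-OCR
Without VC 2015 installed, you will get errors saying that gs and other components cannot be found.
Without the environment variable, you will get the following error at run time:
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory
If the VC 2015 installer fails with
0x80240017
- Unspecified error, resolve that installer error first and then retry.
- Test from the command line
:: cd to the relevant directory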
tesseract sourcefile.jpg savename -l chi_sim
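The TESSDATA_PREFIX error quoted above can be caught before tesseract is invoked at all. The following is a minimal sketch, assuming the layout that error message describes (the variable points at the parent directory of "tessdata") and the usual `<language>.traineddata` file naming. The class and method names (`TessdataCheck`, `findProblem`) are hypothetical and not part of Tesseract.

```java
import java.io.File;

/** Checks that a TESSDATA_PREFIX value points at the parent of a "tessdata" directory. */
final class TessdataCheck {

    /**
     * Returns null when the layout looks usable, otherwise a short description
     * of the first problem found.
     *
     * @param prefix   the value of TESSDATA_PREFIX (may be null when unset)
     * @param language a language code such as "chi_sim" or "eng"
     */
    static String findProblem(String prefix, String language) {
        if (prefix == null || prefix.trim().isEmpty()) {
            return "TESSDATA_PREFIX is not set";
        }
        File tessdata = new File(prefix, "tessdata");
        if (!tessdata.isDirectory()) {
            return "no tessdata directory under " + prefix;
        }
        // Language data is stored as <language>.traineddata inside tessdata
        File model = new File(tessdata, language + ".traineddata");
        if (!model.isFile()) {
            return "missing language file " + model.getName();
        }
        return null;
    }
}
```

A typical call would be `TessdataCheck.findProblem(System.getenv("TESSDATA_PREFIX"), "chi_sim")`; a non-null result explains why the command-line test is likely to fail.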
2. Installing and testing on macOS
Install tesseract with Homebrew; brew installs the dependencies automatically.
brew install tesseract
- Test from the command line
tesseract sourcefile.jpg savename -l chi_sim
3. Java environment
3.1 Development environment
- JDK 1.8
- IDEA 2017
- JUnit 4.12
- log4j2
- Maven 3.x
3.2 pom.xml
Skip this step if your project is not a Maven project.
<properties>
<java.version>1.8</java.version>
<log4j2.version>2.1</log4j2.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<!-- Logging packages -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-web</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-jcl</artifactId>
<version>${log4j2.version}</version>
</dependency>
<dependency>
<groupId>org.apache.directory.studio</groupId>
<artifactId>org.apache.commons.lang</artifactId>
<version>2.6</version>
</dependency>
</dependencies>
3.3 log4j2.xml
For the details of log4j2 configuration, see: Upgrading from Log4j1 to Log4j2 in Practice
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn" monitorInterval="600">
<Properties>
<property name="BASE_LOG_PATTERN">%5p %d{yyyy-MM-dd HH:mm:ss.SSS} [%t] (%class{36}:%L)%M - %m%n</property>
<property name="LOG_DIR_HOME">logs</property>
<property name="BASE_LOG_FILENAME">nuna-ocr</property>
</Properties>
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<ThresholdFilter level="trace" onMatch="ACCEPT" onMismatch="DENY"/>
<PatternLayout pattern="${BASE_LOG_PATTERN}" />
</Console>
<RollingRandomAccessFile name="stdout_appender"
immediateFlush="true" fileName="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-stdout.log"
filePattern="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-stdout-%d{yyyy-MM-dd}_%i.log.gz">
<PatternLayout>
<pattern>${BASE_LOG_PATTERN}</pattern>
</PatternLayout>
<Policies>
<TimeBasedTriggeringPolicy modulate="true" interval="1"/>
<SizeBasedTriggeringPolicy size="5120 KB"/>
</Policies>
<DefaultRolloverStrategy max="3"/>
</RollingRandomAccessFile>
<RollingRandomAccessFile name="error_appender"
immediateFlush="true" fileName="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-error.log"
filePattern="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-error-%d{yyyy-MM-dd}_%i.log.gz">
<PatternLayout>
<pattern>${BASE_LOG_PATTERN}</pattern>
</PatternLayout>
<Policies>
<TimeBasedTriggeringPolicy modulate="true" interval="1"/>
<SizeBasedTriggeringPolicy size="5120 KB"/>
</Policies>
<Filters>
<ThresholdFilter level="warn" onMatch="ACCEPT" onMismatch="DENY"/>
</Filters>
<DefaultRolloverStrategy max="3"/>
</RollingRandomAccessFile>
</Appenders>
<Loggers>
<logger name="com.liu.app.ocr" level="debug" additivity="false" >
<appender-ref ref="Console" />
<appender-ref ref="stdout_appender" />
<appender-ref ref="error_appender" />
</logger>
<root level="info" includeLocation="true">
<appender-ref ref="Console" />
<appender-ref ref="stdout_appender" />
<appender-ref ref="error_appender"/>
</root>
</Loggers>
</Configuration>
4. Java implementation
import java.io.*;
import java.util.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.StringUtils;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
public class OcrProcess {
private static Logger logger = LogManager.getLogger(OcrProcess.class);
private String tessPath = "";
private String ocr_psm_num = "3";
private String ocr_language = "eng";
private String ocr_result_path = "";
private String ocr_result_file_path = "";
private final String OS = System.getProperty("os.name");
private final String ocr_lang_option = "-l";
// Tesseract 3.x spells this flag "-psm"; Tesseract 4 and later expect "--psm"
public final String ocr_psm_option = "-psm";
private long procID = 0;
/**
* Line separator for text
*/
private final String EOL = System.getProperty("line.separator");
/**
* Path separator of the current system
*/
private final String FIS = System.getProperties().getProperty("file.separator");
/**
* Constructor
* @param tessPath path of the local tesseract installation
*/
public OcrProcess(String tessPath) {
this.tessPath = tessPath;
if(logger.isDebugEnabled()){
logger.debug("OcrProcess Created .");
logger.debug("OcrProcess Current OS >>> {} ",this.OS);
logger.debug("OcrProcess User tessPath >>> {} ",this.tessPath);
}else if(logger.isInfoEnabled()){
logger.info("OcrProcess Created .");
}
}
/**
* Set the page segmentation mode (0-10)
* 0 = Orientation and script detection (OSD) only.
* 1 = Automatic page segmentation with OSD.
* 2 = Automatic page segmentation, but no OSD, or OCR.
* 3 = Fully automatic page segmentation, but no OSD. (Default)
* 4 = Assume a single column of text of variable sizes.
* 5 = Assume a single uniform block of vertically aligned text.
* 6 = Assume a single uniform block of text.
* 7 = Treat the image as a single text line. // for horizontal single-line content; can improve recognition of single-line text
* 8 = Treat the image as a single word.
* 9 = Treat the image as a single word in a circle.
* 10 = Treat the image as a single character.
* @param psm see the Tesseract documentation
*/
public void setPageSegMode(Integer psm){
if(logger.isDebugEnabled()){
logger.debug("OcrProcess User set OCR Process PageSegMode >>> {}",psm);
}else if(logger.isInfoEnabled()){
logger.info("OcrProcess User set OCR Process PageSegMode >>> {}",psm);
}
if(psm == null){
throw new IllegalArgumentException("param psm is null; the current value is kept");
}
if(psm > 10 || psm < 0){
throw new IllegalArgumentException("param psm must be between 0 and 10; the current value is kept");
}
this.ocr_psm_num = String.valueOf(psm);
}
/**
* Set the output directory
* @param savePath an existing directory for the OCR result files
*/
public void setSaveDir(String savePath) throws FileNotFoundException {
if(logger.isDebugEnabled()){
logger.debug("OcrProcess User set OCR Process Result TXT SavePath >>> {}",savePath);
}else if(logger.isInfoEnabled()){
logger.info("OcrProcess User set OCR Process Result TXT SavePath >>> {}",savePath);
}
File saveDir = new File(savePath);
if(!saveDir.exists()){
throw new FileNotFoundException("the savePath is not found!");
}
this.ocr_result_path = savePath;
}
/**
*
* @author alexliu
* @date 2017-11-22 14:39:42
* @Description: run OCR on a file
* @param file the file to recognize
* @param language language, e.g. chi_sim or eng
* @return path of the recognized text file
* @throws NoSupportFileTypeException custom exception
*/
public String doOCR(File file , String language) throws NoSupportFileTypeException {
// Create an id for this OCR run, to make logs and records easier to trace
//this.procID = OcrTools.createProcessId();
this.procID = new Random().nextInt(1000);
if(logger.isDebugEnabled()){
logger.debug("OcrProcess Begin. The file >>> [{}], the procID is [{}] .",file.getName(),this.procID);
}else if(logger.isInfoEnabled()){
logger.info("OcrProcess Begin. The file >>> [{}], the procID is [{}] . See More info PLZ set logger Debug or Trace.",file.getName(),this.procID);
}
String textSavePath = "";
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : FilePath >>> {}",this.procID,file.getAbsolutePath());
logger.debug("[{}] OCR process info : FileName >>> {}",this.procID,file.getName());
logger.debug("[{}] OCR process info : FileSize >>> {} KB",this.procID,(file.length() / 1024f));
logger.debug("[{}] OCR process info : OCR Language >>> {}",this.procID,language);
}
this.ocr_language = language;
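// Get the file type. Integer rather than int: the stub getFileType below returns null, and unboxing null would throw a NullPointerException
Integer fileType = this.getFileType(file.getName());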
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : FileType >>> {}",this.procID,fileType);
}
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : The FileType is Image .",this.procID);
}
textSavePath = callTesseractCommand(file);
return textSavePath;
}
/**
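* Get the output directory
* If the caller has called `setSaveDir`, OCR output goes to that directory
* If `setSaveDir` has not been called, OCR output goes to the same directory as the source file
* @param sourceFile the source file
* @return the output directory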
*/
private String getOutPutDir(File sourceFile){
if(StringUtils.isEmpty(this.ocr_result_path)){
return sourceFile.getAbsolutePath().substring(0,sourceFile.getAbsolutePath().lastIndexOf(this.FIS));
}
return this.ocr_result_path;
}
/**
* Get the file type
* @param fileName
* @return
* @throws NoSupportFileTypeException custom exception
*/
private Integer getFileType(String fileName) throws NoSupportFileTypeException {
// Decide from the file extension whether this is a file type that OCR can process
//Integer type = OcrTools.getFileType(fileName);
Integer type = null;
if(type == null){
// NoSupportFileTypeException is a custom exception; define its content as you see fit
//throw new NoSupportFileTypeException("["+fileName+"] , unrecognized file type.");
}
return type;
}
/**
* Create the ProcessBuilder
* @param sourceFile
* @return
*/
private ProcessBuilder createProcessBuilder(File sourceFile){
ProcessBuilder pb = new ProcessBuilder();
if(this.OS.startsWith("Mac OS") || this.OS.startsWith("Linux")){
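// Set the working directory; on Linux and Mac OS it is the tesseract directory,
// because tesseract is not installed system-wide and has not been added to the environment variables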
pb.directory(new File(this.tessPath));
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : Set ProcessBuilder working dir >>> {}",this.procID,this.tessPath);
}
}else if(this.OS.startsWith("Windows")){
// Set the working directory; on Windows it is the directory of the file to recognize
pb.directory(sourceFile.getParentFile());
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : Set ProcessBuilder working dir >>> {}",this.procID,sourceFile.getParentFile());
}
}
// Merge the error stream into the standard output stream
pb.redirectErrorStream(true);
return pb;
}
/**
* Build the command line
* @param sourceFile
* @return
*/
private List<String> createCommand(File sourceFile){
List<String> cmd = new ArrayList<String>();
// OCR output directory
String result_out_dir = getOutPutDir(sourceFile);
// Output base name passed to the OCR command, without extension
String ocr_result_filename = sourceFile.getName().substring(0 , sourceFile.getName().lastIndexOf("."));
// OCR output file path, with extension
String result_out_filePath = result_out_dir + this.FIS + ocr_result_filename + ".txt";
this.ocr_result_file_path = result_out_filePath;
if(logger.isDebugEnabled()) {
logger.debug("[{}] OCR process info : Result SaveDir >>> {}",this.procID,result_out_dir);
logger.debug("[{}] OCR process info : Result SaveFilePath >>> {}",this.procID,result_out_filePath);
}else if(logger.isInfoEnabled()){
logger.info("[{}] OCR process info : Result SaveFilePath >>> {}",this.procID,result_out_filePath);
}
if(this.OS.startsWith("Mac OS") || this.OS.startsWith("Linux")){
cmd.add("tesseract");
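// On Linux or Mac the working directory is the tesseract directory, so the source file name must include its path.
// Note: the bare name "tesseract" is looked up on PATH, not in the working directory, so the binary must be on PATH (as it is after brew install)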
cmd.add(sourceFile.getAbsolutePath());
}else if(this.OS.startsWith("Windows")){
cmd.add(this.tessPath + this.FIS + "tesseract");
// On Windows the working directory is the file's directory, so the source file is given by name only, without a path
cmd.add(sourceFile.getName());
}
cmd.add(result_out_dir + this.FIS + ocr_result_filename);
cmd.add(this.ocr_psm_option);
cmd.add(this.ocr_psm_num);
cmd.add(this.ocr_lang_option);
cmd.add(this.ocr_language);
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : The Command >>> {}",this.procID,cmd.toString());
}else if(logger.isInfoEnabled()){
logger.info("[{}] OCR process info : The Command >>> {}",this.procID,cmd.toString());
}
return cmd;
}
/**
* Remove the space characters from the recognized text
* @param txtFilePath
* @throws FileNotFoundException
*/
private void processSpace(String txtFilePath) throws FileNotFoundException {
File txtFile = new File(txtFilePath);
if(!txtFile.exists()){
throw new FileNotFoundException("OCR process result file does not exist!");
}
StringBuilder sb = new StringBuilder();
// Read the file; try-with-resources closes the reader even when reading fails
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(txtFile),"UTF-8"))) {
String str;
while ((str = br.readLine()) != null) {
sb.append(str).append(this.EOL);
}
} catch (IOException e){
logger.error("[{}] OCR process space read file failed .",this.procID,e);
// Do not overwrite the result file when it could not be read
return;
}
// Remove the spaces and write the text back to the same file.
// This suits Chinese text; for space-separated languages such as eng it would join the words together
try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(txtFile),"UTF-8"))) {
bw.write(sb.toString().replace(" ", ""));
} catch (IOException e){
logger.error("[{}] OCR process space write file failed .",this.procID,e);
}
}
/**
* Log the output of a failed command
* @param process
*/
private void printCommandError(Process process){
// The error stream was merged into standard output by redirectErrorStream(true), so this one stream carries both
try (BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
String line;
// Read line by line until the end of the stream
while ((line = br.readLine()) != null) {
logger.warn("[{}] OCR process warning : {} ",this.procID,line);
}
} catch (IOException e){
logger.warn("[{}] OCR process print command error failed!",this.procID,e);
}
}
/**
*
* @author alexliu
* @date 2017-11-22 14:47:25
* @Description: run OCR through the command line
* @param imageFile
* @return path of the recognized text file
*/
private String callTesseractCommand(File imageFile) {
String txt_path = "";
List<String> command = this.createCommand(imageFile);
ProcessBuilder pb = this.createProcessBuilder(imageFile);
// Add the command line
pb.command(command);
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : Add command to ProcessBuilder",this.procID);
}
Process process = null;
try {
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : Execute OCR Command.",this.procID);
}
process = pb.start();
int w = process.waitFor();
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : Command execute result >>> {}",this.procID,w);
}else if(logger.isInfoEnabled()){
logger.info("[{}] OCR process info : Command execute result >>> {}",this.procID,w);
}
if (w == 0) {
txt_path = this.ocr_result_file_path;
// Remove the spaces
this.processSpace(txt_path);
} else {
// Log the command output
this.printCommandError(process);
String msg = "[%s] execute command failed ! Result %d , Reason : %s !";
switch (w) {
case 1:
msg = String.format(msg,this.procID,w,"cannot access the file; the file name may contain spaces or other special characters");
break;
case 29:
msg = String.format(msg,this.procID,w,"cannot recognize the image or its selected region");
break;
case 31:
msg = String.format(msg,this.procID,w,"unsupported image format");
break;
default:
msg = String.format(msg,this.procID,w,"unknown error");
}
throw new RuntimeException(msg);
}
} catch (IOException e) {
logger.error("[{}] OCR process info : Command execute [pb.start()] failed !",this.procID,e);
} catch (InterruptedException e) {
// Restore the interrupt flag so that callers can still see the interruption
Thread.currentThread().interrupt();
logger.error("[{}] OCR process info : Command execute [process.waitFor()] failed !",this.procID,e);
}
if(logger.isDebugEnabled()){
logger.debug("[{}] OCR process info : OCR Process End.",this.procID);
}else if(logger.isInfoEnabled()){
logger.info("[{}] OCR process info : OCR Process End.",this.procID);
}
return txt_path;
}
}
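One caveat about the class above: callTesseractCommand waits for the process to exit and only reads its output afterwards. If a child process prints more than the operating system's pipe buffer holds, it can block on the write while the Java side blocks in waitFor(). Tesseract usually prints little, so this may never show up in practice. The following is a minimal sketch of the more defensive order (read to the end of the stream first, then collect the exit code). `CommandRunner` and `Result` are hypothetical names and not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

final class CommandRunner {

    /** Exit code plus everything the command printed (stdout and stderr merged). */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // one stream to drain instead of two
        Process process = pb.start();
        StringBuilder output = new StringBuilder();
        // Read until end of stream first, so the child never blocks on a full pipe
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append(System.lineSeparator());
            }
        }
        // The stream has ended, so waitFor() normally returns promptly with the exit code
        return new Result(process.waitFor(), output.toString());
    }
}
```

With this shape, the captured output is available for logging whether the exit code is zero or not, so a separate pass over the stream after a failure is no longer needed.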
4.1 Testing
import org.apache.logging.log4j.core.config.ConfigurationSource;
import org.apache.logging.log4j.core.config.Configurator;
import org.junit.Before;
import org.junit.Test;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
public class TestFileOCR {
private static String log4jxml = "/your/log/config/path/log4j2.xml";
@Before
public void before() throws FileNotFoundException {
File config=new File(log4jxml);
ConfigurationSource source = new ConfigurationSource(new FileInputStream(config),config);
Configurator.initialize(null, source);
}
@Test
public void test_JPG_Linux(){
String tessPath = "/your/tesseract/path/Tesseract-OCR";
String img = "/your/test/file/path/test.jpg";
OcrProcess ocr = new OcrProcess(tessPath);
// Test saving to another directory
// try {
// ocr.setSaveDir("/others/save/path");
// } catch (FileNotFoundException e) {
// e.printStackTrace();
// }
File file = new File(img);
try {
String path = ocr.doOCR(file, "chi_sim");
System.out.println(path);
}catch (Exception e){
e.printStackTrace();
}
}
@Test
public void test_JPG_Windows(){
String tessPath = "D:\\your\\tesseract\\path\\Tesseract-OCR";
String img = "C:\\your\\test\\file\\path\\test.jpg";
OcrProcess ocr = new OcrProcess(tessPath);
// Test saving to another directory
try {
ocr.setSaveDir("C:\\others\\save\\path");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
File file = new File(img);
try {
String path = ocr.doOCR(file, "chi_sim");
System.out.println(path);
}catch (Exception e){
e.printStackTrace();
}
}
}
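The listings above refer to NoSupportFileTypeException and, in commented-out lines, to OcrTools.getFileType, but neither is shown, so the code does not compile as posted. The following is a minimal sketch of what they might look like, assuming all classes live in the same package. The type code `TYPE_IMAGE` and the extension list are hypothetical and should be adjusted to the formats your Tesseract build can read.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Checked exception for files the OCR step cannot handle. */
class NoSupportFileTypeException extends Exception {
    NoSupportFileTypeException(String message) {
        super(message);
    }
}

final class OcrTools {

    /** Hypothetical type code for image files. */
    static final int TYPE_IMAGE = 1;

    // Assumed list of image extensions, compared in lower case
    private static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "tif", "tiff", "bmp"));

    /** Returns the type code for a supported file name, or throws for anything else. */
    static Integer getFileType(String fileName) throws NoSupportFileTypeException {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (IMAGE_EXTENSIONS.contains(ext)) {
            return TYPE_IMAGE;
        }
        throw new NoSupportFileTypeException("[" + fileName + "] , unrecognized file type.");
    }
}
```

With these in place, the commented-out call in getFileType can be enabled, and file names without a usable extension are rejected before tesseract is started.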
5. Third-party JAR
If you find the command-line approach inconvenient, you can use Tess4j. Add it through Maven as follows.
<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>3.4.3</version>
</dependency>
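Tess4j adds richer image processing on top of tesseract: it can enlarge an image before recognition, and it has some training features. In my tests, however, tess4j used somewhat more memory than calling the command line directly.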
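6. Real-world usage
OCR is a particularly memory-hungry operation. I recommend deploying it as a standalone component that your services call through an API.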
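As a rough illustration of the standalone-component idea, the sketch below wraps an OCR function behind a small HTTP endpoint using the HTTP server that ships with the JDK. It is not the setup used in my application. The class name `OcrHttpService`, the `/ocr?file=<path>` endpoint, and the injected `Function<File, String>` are all assumptions; in practice the function would call something like the OcrProcess class above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

final class OcrHttpService {

    /**
     * Starts an HTTP server on the given port (0 picks any free port).
     * GET /ocr?file=<path> applies the ocr function to that file and returns its result.
     */
    static HttpServer start(int port, Function<File, String> ocr) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/ocr", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // e.g. file=/data/in/test.jpg
            int status;
            String text;
            if (query == null || !query.startsWith("file=")) {
                status = 400;
                text = "expected /ocr?file=<path>";
            } else {
                try {
                    text = ocr.apply(new File(query.substring("file=".length())));
                    status = 200;
                } catch (RuntimeException e) {
                    status = 500;
                    text = "OCR failed: " + e.getMessage();
                }
            }
            byte[] body = text.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Because the path is taken directly from the request, a real deployment would restrict it to a known input directory and add authentication. This sketch only shows how the OCR work can live in its own process, so that its memory use stays separate from the calling service.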
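Below are some figures from my own application, where OCR is deployed standalone. They are for reference only:
- Memory consumption
  - With 5 threads processing at the same time, memory usage stays between 8 and 12 GB.
  - With a single thread, it stays between 3 and 4 GB.
- Recognition throughput
  - 100 A4 pages, 300 KB each, single thread: 20 to 24 minutes in total, 12 to 16 seconds per page.
  - 10,000 A4 pages, 300 KB each, 5 threads: 7 to 10 hours in total. Failures become more likely along the way, and many retries are needed.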
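The multi-threaded runs with retries described above can be sketched with a fixed thread pool and a bounded retry loop. This is an illustration under assumptions, not the code behind those figures. `BatchOcr`, `runAll`, and `withRetry` are hypothetical names, and the task is passed in as a plain function that throws a RuntimeException on failure (as callTesseractCommand does for non-zero exit codes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BatchOcr {

    /** Applies the task to the input, trying up to maxAttempts times on RuntimeException. */
    static <I, O> O withRetry(Function<I, O> task, I input, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.apply(input);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last != null ? last : new IllegalArgumentException("maxAttempts must be at least 1");
    }

    /** Runs the task over all inputs on a fixed pool; results come back in input order. */
    static <I, O> List<O> runAll(List<I> inputs, Function<I, O> task, int threads, int maxAttempts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> job = () -> withRetry(task, input, maxAttempts);
                futures.add(pool.submit(job));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get()); // throws ExecutionException if all attempts failed
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as `BatchOcr.runAll(files, f -> new OcrProcess(tessPath).doOcrUnchecked(f), 5, 3)` would mirror the 5-thread setup, where `doOcrUnchecked` stands for a hypothetical wrapper that turns the checked exception of doOCR into a RuntimeException. A separate OcrProcess per task avoids sharing its mutable fields between threads.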