Integrating Tika with Tesseract-OCR for optical character recognition [Java source code and real test data included]

 

 OCR (optical character recognition) is an important branch of image processing. Recognizing Chinese is challenging, especially cursive and handwritten text, and it is an important and popular research direction (unfortunately, domestic research centers have not produced many basic training sets with a high recognition rate).

Stanford University has a project devoted exclusively to recognizing Chinese characters. Research institutes in developed countries in Europe and America seem to have a stronger research culture in this area.

 To improve the recognition rate, the training set is the key!

 

After testing, the conclusions are as follows:

  • For the Song typeface (SimSun) on a white background, not slanted, at 300 dpi or higher, the recognition rate is 100%
  • For English letters and digits, the recognition rate is above 90%
  • The recognition rate for special characters is low
  • When the resolution is too low, the recognition rate drops sharply
  • With varied background colors, the recognition rate is very low
  • With cursive or similar fonts, the recognition rate drops significantly
  • Movie subtitles and screenshots have a low recognition rate
  • In scanned documents, text that is too light or too small is not fully recognized
  • Improving the recognition rate requires building your own training set, which is a huge amount of manual work (at least 6753 Simplified Chinese characters, and at least 10000 once some Traditional characters are mixed in; every font has to be redone separately, because recognition is essentially a geometric computation on glyph shapes; domestic research institutes and open-source projects do not seem to have done much of this, to be confirmed)
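The post does not say how the percentages above were measured. One common way to put a number on a recognition rate is character-level accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below is only an illustration under that assumption. The class and method names (`OcrAccuracy`, `editDistance`, `characterAccuracy`) are made up for this example, and whitespace is stripped first because Tesseract tends to insert spaces between Chinese characters.

```java
final class OcrAccuracy {

    /** Levenshtein edit distance between two strings, counted in UTF-16 code units. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Character accuracy in [0, 1]: 1 - editDistance / referenceLength,
     * computed after removing whitespace from both strings.
     */
    static double characterAccuracy(String reference, String ocrOutput) {
        String ref = reference.replaceAll("\\s+", "");
        String out = ocrOutput.replaceAll("\\s+", "");
        if (ref.isEmpty()) {
            return out.isEmpty() ? 1.0 : 0.0;
        }
        double accuracy = 1.0 - (double) editDistance(ref, out) / ref.length();
        return Math.max(0.0, accuracy);
    }
}
```

For instance, comparing the OCR line `Tesseractwes orginally developed` from Figure 2 against a guessed reference `Tesseract was originally developed` gives an edit distance of 2 once whitespace is removed (one wrong letter, one missing letter).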

 

  • Java source code: combining Tika with Tesseract-OCR

(1) Source code (supports recognizing multiple images)

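    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.ocr.TesseractOCRParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.junit.Test;
    import org.xml.sax.SAXException;

    // The class name is illustrative; JUnit 4 is assumed.
    public class TikaTesseractOcrTest {

        @Test
        public void testCode() throws IOException, SAXException, TikaException, InterruptedException {
            List<String> fileNames = new ArrayList<>();
            fileNames.add("chi_eng.png");
            fileNames.add("chi_eng01.png");
            fileNames.add("chi_old.png");
            fileNames.add("chi-scan-75dpi.jpg");
            fileNames.add("chi-scan-100dpi.jpg");
            fileNames.add("chi-scan-300dpi.jpg");
            fileNames.add("chi-smartphone.jpg");
            fileNames.add("chi-subtitle-v1.jpg");
            fileNames.add("english00.png");
            fileNames.add("pdf_shaomiao.png");
            fileNames.add("test.tiff");
            fileNames.add("weather.png");

            // Please credit the source when reposting:
            // https://www.cnblogs.com/NaughtyCat/p/how-to-install-tesseract-ocr-on-windows-and-centos.html
            TesseractOCRParser parser = new TesseractOCRParser();
            TesseractOCRConfig config = new TesseractOCRConfig();
            // Use the Simplified Chinese training data
            config.setLanguage("chi_sim");
            // Tesseract installation path
            config.setTesseractPath("C:/Program Files/Tesseract-OCR");
            // Training data (tessdata) path
            config.setTessdataPath("C:/Program Files/Tesseract-OCR/tessdata");

            ParseContext context = new ParseContext();
            context.set(TesseractOCRConfig.class, config);
            context.set(TesseractOCRParser.class, parser);

            fileNames.forEach(filename -> {
                File file = new File("E:/tika/testData" + File.separator + filename);
                if (!file.exists()) {
                    return; // skip test images that are not present
                }
                // A fresh handler per image keeps the results separate
                BodyContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                try (InputStream stream = new FileInputStream(file)) {
                    parser.parse(stream, handler, metadata, context);
                    // Print the recognized text for this image
                    System.out.println(filename + ":\n" + handler.toString());
                } catch (Exception e) {
                    // Report the failure instead of silently swallowing it
                    System.err.println("OCR failed for " + filename + ": " + e.getMessage());
                }
            });
        }
    }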
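Tika's `TesseractOCRParser` does not do the recognition itself; it launches the `tesseract` executable found under the configured path. When results look wrong, it can help to run the same image through the command line directly, outside Tika. A minimal sketch of assembling that command follows. The helper class `TesseractCommand` is hypothetical; `-l`, `--psm` and `--tessdata-dir` are Tesseract command-line options (Tesseract 3.x spells the page segmentation option `-psm`), and passing `stdout` as the output base asks Tesseract to print the text instead of writing a file.

```java
import java.util.ArrayList;
import java.util.List;

final class TesseractCommand {

    /**
     * Builds: <executable> <image> <outputBase> [--tessdata-dir <dir>] -l <language> [--psm <mode>]
     * tessdataDir and pageSegMode may be null, in which case the option is left out.
     */
    static List<String> build(String executable, String image, String outputBase,
                              String tessdataDir, String language, Integer pageSegMode) {
        List<String> cmd = new ArrayList<>();
        cmd.add(executable);
        cmd.add(image);
        cmd.add(outputBase);
        if (tessdataDir != null) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        cmd.add("-l");
        cmd.add(language);
        if (pageSegMode != null) {
            cmd.add("--psm");
            cmd.add(String.valueOf(pageSegMode));
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Page segmentation mode 6 treats the image as a single uniform block of text.
        List<String> cmd = build("tesseract", "chi_eng.png", "stdout", null, "chi_sim", 6);
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires Tesseract on the PATH), something like:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the command as a list of arguments rather than one string avoids quoting problems with paths such as `C:/Program Files/Tesseract-OCR`.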
  • Test data (images): description and download

For detailed descriptions and test results, see: https://ocr.space/blog/2015/03/best-ocr-software-for-chinese.html

For the related test images, see: https://github.com/A9T9/OCR-Benchmark

 

  • How to build your own training data set

 

See the official site: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-3.00%E2%80%933.02

 

(2) Original images and recognition results

Figure 1

 

Conversion results are as follows:

 

【Conclusion】

At 300 dpi, the recognition rate is 100%

 


 

Figure 2

Conversion results are as follows:

Brief history

Tesseractwes orginally developed at HewlettPackard Laboratones Bristol and
atHewettPackard Co Greeley Colorado beween 1985 and 1994 wthsome
more changes made in 1996 to portto Windows and some C++zing in1998
In2005 Tesseract was open sourced by HP Since 2006 itis developed by Goosgle

Thelatest (LSTM based]j stableversionis4.10, released on July 7.2019.Latest source codes avaable from
master branch on GlHub.Openissues can be foundin ssue racker and Planning iki

 

Thelatest35 version 5 3.05.02 released onjune 19,2018.Latestsource code for3.055 avaable from
305 branch on GlHHub.There sno development forthisversion,butitcan be used forspecial cases .
see Regression offeatures from 30x

 

See Release Notes and Change Log formore detas ofthe releases-
Installing Tesseract

You can ettherInstall Tesseractvia prepulltbinary package or pulld iLfrom sourcey
Supported Complersare:

* GCC48 and above
* ang34and above
* MSVC 2015.2017.2019

Othercompllersmightwork butare notofially supportedl
Running Tesseract
Basiccommand line usage:

tesseract inagenane outputbase [-1 ]ang】 [--osn ocrenginenode] [--psn pagesegnode
[configfiles...]

Formore information aboutthe various command line options use esseract --henp or man tesseract .

Examples can befoundin thewiki
For developers

Developers can use Tbtessaract Cor

【Conclusion】
English text, special symbols and the like may fail to be recognized. Recognition rate: above 80%


 

Figure 3

 Conversion results are as follows:

E g 气

 

Even as Tvanja praised 8e parties Envoyed i 功 i5 7el gzamt7 comgpi 地 08
Qchieveze1 Q 7W7Der- Ofsocial media lsers appeared crilical of er as-
Sesszet 0f 加 e Trip adiistration「5 role 加 功 i5 endeavou7
IBM 表 示 不 服 ,Google 不 care。 下 而 让 我 们 逐 字 逐 句 来 看 他 们 的 论 文
吧 , 对 于 争 论 的 事 情 , 自 己 下 功 夫 搞 清 楚 。

 

松 贵 莹 坊 办 少
忠 : https:/ww.cnblogs-com/NaughtyCatpytranslate-of-google-
Quantum-supremacy-article-published-on-nature.html

Quantum supremacy using
a programmable

 

superconducting
processor

基 于 可 编 程 的 超 导 处 理 器 实 现 的 量 子 霸

 

动 关 盘 源 ,https://doorg/10.1038/s41586-019-1666-5
煌 收 船 2019 乐 7 历 20 历
旋 准 8 船 2019 乐 9 历 20 厂
坊 终 发 疗 2019 知 10 月 23 厅

Abstract
引 言

 

量 子 计 算 机 吹 牛 遢 说 , 对 于 特 定 的 计 算 任 务 , 基 于 量 子 处 理 器 的 计 算
机 , 其 速 度 相 较 于 经 典 处 理 器 呈 指 数 级 增 长 。 根 本 的 挑 战 在 于 构 建 一

 【Conclusion】
Song typeface, bold, black: recognition rate 100%; slanted, green, etc.: recognition rate 70%

Figure 4 (scanned document)

 

 

 

 Conversion results are as follows:

节 P a
为客户服务是华为存在的睢一理由” 从 公 司 层 面
看 , 为客户创造价值的主业务流只有一个!

Ipo - nisgniedProductDevelopment

B croeis PaFA 4 辜蒙扁)




Unc - LomdTocash
芸 a npe waa8 2 菅墨

E Ig - ssueToResoliton 林
P L a 颤〉

 

n i t t

 

 

 

6 P: 01

IP0 主 业 务 流 包 括 : MW 流 程 、0R 流 程 、IPD 流 程

 

 

D
4 一


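【Conclusion】
Scanned PDF: only relatively large, bold characters are recognized; lighter-colored text is not recognized
Recognition rate: about 10%


 Figure 5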

 

 

Conversion results are as follows:

大 行 佳 孔 当 自 弼 不 。

巧 者 劳 而 春 者 忱 , 无 能 者 无 所 必 , 作 食 而 邀
游 , 陆 若 不 系 之 舟 。

Chacgyuisdt.

124565.


12256 dogdogunnn

【Conclusion】
Mixed Chinese characters, English and digits
Recognition rate: 60% to 70%


 

Figure 6 (screenshot of a weather web page)

 

Conversion results are as follows:

L f

全 国 > 囚 川 > 尿 膳 > 坂 区
今 夺 伟 8-15 天

 

llc/4rc

 

 

 

 

 

208 238 028 058
人 [ [ 92
s
c E E
无 RR 无 RR 无 RR 无 RR

< < < <

【Conclusion】
Background colors (blue, gray, black, orange); font colors (black, white). Recognition rate: under 10%
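Mixed background and font colors are the hardest case in these tests. A common thing to try before handing such an image to Tesseract is reducing it to pure black and white. The sketch below does this with a fixed luminance threshold using only `java.awt.image`. The class name `OcrPreprocess` and the threshold value are assumptions for illustration, and this kind of preprocessing was not part of the tests in this post, so whether it helps on these images is untested. Light text on a dark background would additionally need to be inverted.

```java
import java.awt.image.BufferedImage;

final class OcrPreprocess {

    /**
     * Returns a black-and-white copy of src. Pixels whose luminance is at or above
     * the threshold (0 to 255) become white, all others black.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights (0.299, 0.587, 0.114)
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

A possible use would be reading the screenshot with `ImageIO.read(new File("weather.png"))`, calling `OcrPreprocess.binarize(image, 128)`, and writing the result back with `ImageIO.write(...)` before running the Tika test on it.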


 

Figure 7

Conversion results are as follows:

机 器 人 餐 厅

cra arenzanmu nnanmes
seeu xraguagpt. ssepumes
人 吊 pahs ztpznaapsus anea
an sro an sessuassnet
e ssoangm crmazees aas
iusiaanorg.mmouz rpeae
snreenatesezur eeae t
+ngszensenapenecieme
矿 svapgzanohat


【Conclusion】
At 75 dpi, the recognition rate is about 5%
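A 75 dpi scan simply has too few pixels per character. Upscaling cannot bring back lost detail, but enlarging a low-resolution image before OCR is a frequently suggested workaround, since the 300 dpi scan in these tests did so much better. The following is a minimal sketch using bicubic interpolation from the standard library; the class name `OcrUpscale` is hypothetical, the factor 4.0 (75 dpi to a nominal 300 dpi) is an assumption, and the effect on the images in this post was not measured.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class OcrUpscale {

    /** Returns a copy of src enlarged by the given factor using bicubic interpolation. */
    static BufferedImage scale(BufferedImage src, double factor) {
        int width = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            // Draw the source stretched over the whole target canvas
            g.drawImage(src, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

For the 75 dpi scan, `OcrUpscale.scale(image, 4.0)` would produce an image with the same pixel dimensions as a 300 dpi scan of the same page, though not the same sharpness.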

 

Please credit the source when reposting: https://www.cnblogs.com/NaughtyCat/p/tika-support-Tesseract-OCR-with-source-code-and-test-data

 


 


 
