java image text recognition--SpringBoot actual e-commerce project mall4j

Summary

Java + OCR to extract image text. In the e-commerce platform, the platform may need to review some business licenses and identity documents of the merchant, and we may need the number in the certificate to extract. When there are a small number of merchants, manual review can be performed to extract the number in the certificate; when the number of merchants increases, the platform review is very labor-intensive. At this time, we naturally think about whether we can extract the text in the picture. Let's take a look at OCR image text extraction.

SpringBoot actual combat e-commerce project mall4j address: [ https://gitee.com/gz-yami/mall4j ]

Java+Tesseract_OCR image text extraction first entry

You can call the third release interface to extract image text, for example: Baidu API

Here we do a picture text extraction locally, using Java+Tesseract

Environment preparation win10, jdk1.8, idea, maven

Download tesseract3.02, the Chinese thesaurus must first have a Java environment JDK

1. Download and install tesseract3.02

2. Download the Chinese thesaurus chi_sim.traineddata

Baidu network disk: Link: https://pan.baidu.com/s/1VYx6zbKAacYkiUY5UvogEQ Extraction code: hsr7

3. Put chi_sim.traineddata in the tessdata directory

4. Open the digits file in the \tessdata\configs directory and change tessedit_char_whitelist 0123456789-. to tessedit_char_whitelist 0123456789x.

5. Configure environment variables (installation directory of tesseract3.02)

6. Test whether the installation is complete

WIN+R opens cmd command

enter tesseract

Appears to indicate that the installation is complete

Usage:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before anyconfigfile.

Single options:

-v --version: version info

--list-langs: list available languages for tesseract engine

Extract text

tesseract C:/pic/1.png C:/pic/1 -l chi_sim

C drive picture, generate 1.txt (extracted text content)

-l : not 1 is the lowercase letter l corresponding to the English letter L

chi_sim Chinese Thesaurus

import dependencies in java project

 <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>3.2.1</version>
            <exclusions>
                <exclusion>
                    <groupId>com.sun.jna</groupId>
                    <artifactId>jna</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>4.1.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/javax.media/jai_imageio -->
        <dependency>
            <groupId>javax.media</groupId>
            <artifactId>jai_imageio</artifactId>
            <version>1.1.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.swinglabs/swingx -->
        <dependency>
            <groupId>org.swinglabs</groupId>
            <artifactId>swingx</artifactId>
            <version>1.6.1</version>
        </dependency>

Code ImageIOHelper.class

package com.example.excel.excelexportodb.Utils;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.Locale;

import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

import com.github.jaiimageio.plugins.tiff.TIFFImageWriteParam;
//import com.sun.media.imageio.plugins.tiff.TIFFImageWriteParam;


/**
 * 类说明 :创建临时图片文件防止损坏初始文件
 */
public class ImageIOHelper {

	//设置语言
     private Locale locale=Locale.CHINESE;

	//自定义语言构造的方法
    public ImageIOHelper(Locale locale){
        this.locale=locale;
    }
    //默认构造器Locale.CHINESE

    public ImageIOHelper(){
    }


    /**
     * 创建临时图片文件防止损坏初始文件
     * @param imageFile
     * @param imageFormat like png,jps .etc
     * @return TempFile of Image
     */
    public File createImage(File imageFile, String imageFormat) throws IOException {

        //读取图片文件
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(imageFormat);
        ImageReader reader = readers.next();
        //获取文件流
        ImageInputStream iis = ImageIO.createImageInputStream(imageFile);
        reader.setInput(iis);
        IIOMetadata streamMetadata = reader.getStreamMetadata();

        //设置writeParam
        TIFFImageWriteParam tiffWriteParam = new TIFFImageWriteParam(Locale.CHINESE);
        tiffWriteParam.setCompressionMode(ImageWriteParam.MODE_DISABLED);
        //设置可否压缩

        //获得tiffWriter和设置output
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        ImageWriter writer = writers.next();


        BufferedImage bi = reader.read(0);
        IIOImage image = new IIOImage(bi,null,reader.getImageMetadata(0));
        File tempFile = tempImageFile(imageFile);
        ImageOutputStream ios = ImageIO.createImageOutputStream(tempFile);

        writer.setOutput(ios);
        writer.write(streamMetadata, image, tiffWriteParam);

        ios.close();
        iis.close();
        writer.dispose();
        reader.dispose();

        return tempFile;
    }

    /**
     * 给tempfile添加后缀
     * @param imageFile
     * @throws IOException
     */
    private File tempImageFile(File imageFile) throws IOException {
        String path = imageFile.getPath();
        StringBuffer strB = new StringBuffer(path);
        strB.insert(path.lastIndexOf('.'),"_text_recognize_temp");
        String s=strB.toString().replaceFirst("(?<=//.)(//w+)$", "tif");
        //设置文件隐藏
        Runtime.getRuntime().exec("attrib "+"\""+s+"\""+" +H");
        return new File(strB.toString());
    }
}

OCRUtil.class

package com.example.excel.excelexportodb.Utils;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.jdesktop.swingx.util.OS;

/**
 * 类说明:OCR工具类
 */
public class OCRUtil {
    private final String LANG_OPTION = "-l";
    //英文字母小写l,并非阿拉伯数字1

    private final String EOL = System.getProperty("line.separator");

    /**
     * ocr的安装路径
     */
    private String tessPath =  "D://tess4j//tess4j-src//tesseract-ocr-set-3.0//Tesseract-OCR";



    public OCRUtil(String tessPath,String transFileName){
        this.tessPath=tessPath;
    }

    //OCRUtil的构造方法,默认路径是"C://Program Files (x86)//Tesseract-OCR"

    public OCRUtil(){     }

    public String getTessPath() {
        return tessPath;
    }
    public void setTessPath(String tessPath) {
        this.tessPath = tessPath;
    }
    public String getLANG_OPTION() {
        return LANG_OPTION;
    }
    public String getEOL() {
        return EOL;
    }


    /**
     * @return 识别后的文字
     */
    public String recognizeText(File imageFile,String imageFormat)throws Exception{
        File tempImage = new ImageIOHelper().createImage(imageFile,imageFormat);
        return ocrImages(tempImage, imageFile);
    }

    /**
     *  可以自定义语言
     */
    public String recognizeText(File imageFile,String imageFormat,Locale locale)throws Exception{
        File tempImage = new ImageIOHelper(locale).createImage(imageFile,imageFormat);
        return ocrImages(tempImage, imageFile);
    }

    /**
     * @param
     * @param
     * @return 识别后的内容
     * @throws IOException
     * @throws InterruptedException
     */
    private String ocrImages(File tempImage,File imageFile) throws IOException, InterruptedException{

        //设置输出文件的保存的文件目录,以及文件名
        File outputFile = new File(imageFile.getParentFile(),"test");
        StringBuffer strB = new StringBuffer();


        //设置命令行内容
        List<String> cmd = new ArrayList<String>();
        if(OS.isWindowsXP()){
            cmd.add(tessPath+"//tesseract");
        }else if(OS.isLinux()){
            cmd.add("tesseract");
        }else{
            cmd.add(tessPath+"//tesseract");
        }
        cmd.add("");
        cmd.add(outputFile.getName());
        cmd.add(LANG_OPTION);
        //中文包
        cmd.add("chi_sim");
        //常用数学公式包
        cmd.add("equ");
        //英语包
        cmd.add("eng");


        //创建操作系统进程
        ProcessBuilder pb = new ProcessBuilder();
        //设置此进程生成器的工作目录
        pb.directory(imageFile.getParentFile());
        cmd.set(1, tempImage.getName());
        //设置要执行的cmd命令
        pb.command(cmd);
        //设置后续子进程生成的错误输出都将与标准输出合并
        pb.redirectErrorStream(true);

        long startTime = System.currentTimeMillis();
        System.out.println("开始时间:" + startTime);
        //开始执行,并返回进程实例
        Process process = pb.start();
        //最终执行命令为:tesseract 1.png test -l chi_sim+equ+eng
        // 输入输出流优化
//        printMessage(process.getInputStream());
//        printMessage(process.getErrorStream());
        int w = process.waitFor();
        //删除临时正在工作文件
        tempImage.delete();
        if(w==0){
            // 0代表正常退出
            BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(outputFile.getAbsolutePath()+".txt"),"UTF-8"));
            String str;
            while((str = in.readLine())!=null){
                strB.append(str).append(EOL);
            }
            in.close();

            long endTime = System.currentTimeMillis();
            System.out.println("结束时间:" + endTime);
            System.out.println("耗时:" + (endTime - startTime) + "毫秒");

        }else{
            String msg;
            switch(w){
                case 1:
                    msg = "Errors accessing files.There may be spaces in your image's filename.";
                    break;
                case 29:
                    msg = "Cannot recongnize the image or its selected region.";
                    break;
                case 31:
                    msg = "Unsupported image format.";
                    break;
                default:
                    msg = "Errors occurred.";
            }
            tempImage.delete();
            throw new RuntimeException(msg);
        }
        // 删除提取到文字的临时文件
        new File(outputFile.getAbsolutePath()+".txt").delete();
        return strB.toString().replaceAll("\\s*", "");
    }

//    public static void main(String[] args) throws IOException, InterruptedException {
//    String cmd = "cmd /c dir c:\\windows";
//    final Process process = Runtime.getRuntime().exec(cmd);
//    printMessage(process.getInputStream());
//    printMessage(process.getErrorStream());
//    int value = process.waitFor();
//    System.out.println(value);
//    }
    private static void printMessage(final InputStream input) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                Reader reader = new InputStreamReader(input);
                BufferedReader bf = new BufferedReader(reader);
                String line = null;
                try {
                    while ((line = bf.readLine()) != null) {
                        System.out.println(line);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}

main function

 public static void main(String[] args) throws TesseractException {
        String path = "D://1.png";
        System.out.println("ORC Test Begin......");
        try {
            String valCode = new OCRUtil().recognizeText(new File(path), "png");
            System.out.println(valCode);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("ORC Test End......");


    }

Identify pictures

ORC Test Begin......

Test 123

ORC Test End......

The final recognition effect is average, and some complex text recognition appears garbled. At this time, we need to train the thesaurus.

The more training, the more accurate the recognition

SpringBoot actual combat e-commerce project mall4j address: [ https://gitee.com/gz-yami/mall4j ]

{{o.name}}
{{m.name}}

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=324284142&siteId=291194637