Simple use of the Tesseract-OCR command line with WPF

Tesseract is an open source OCR engine that can recognize image files in various formats and convert them into text. It was originally developed by HP and later maintained by Google.
Download address: https://digi.bib.uni-mannheim.de/tesseract/
Among them, the version with dev in the file name is the development version, and the one without dev is the stable version.
Additional language packs can be added during installation. Expand the last option in the component selection screen and tick the languages you need; for Simplified Chinese, choose Chinese (Simplified).
Add Chinese recognition library:

https://github.com/tesseract-ocr/tessdata/find/master

Download chi_sim.traineddata from this URL, and place it in the Tesseract-OCR\tessdata folder after downloading.

Set the environment variable:
After the installation is complete, add the path where tesseract.exe is located to the PATH environment variable under Windows.
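
To confirm that the PATH change took effect, a small check like the following can help. It is only a sketch: the class and method names (TesseractLocator, FindOnPath) are made up for this example, and it simply looks for tesseract.exe in each directory listed in PATH.

```csharp
using System;
using System.IO;

static class TesseractLocator
{
    // Returns the full path of the first fileName found in the given PATH-style
    // string, or null if none of the listed directories contains it.
    // (Hypothetical helper, not part of Tesseract.)
    public static string FindOnPath(string fileName, string pathVariable)
    {
        if (string.IsNullOrEmpty(pathVariable)) return null;

        string[] dirs = pathVariable.Split(
            new[] { Path.PathSeparator }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string dir in dirs)
        {
            string candidate;
            try
            {
                // PATH entries are sometimes wrapped in quotes; strip them.
                candidate = Path.Combine(dir.Trim().Trim('"'), fileName);
            }
            catch (ArgumentException)
            {
                continue; // skip malformed PATH entries
            }
            if (File.Exists(candidate)) return candidate;
        }
        return null;
    }

    static void Main()
    {
        string found = FindOnPath("tesseract.exe", Environment.GetEnvironmentVariable("PATH"));
        Console.WriteLine(found ?? "tesseract.exe was not found on PATH");
    }
}
```

Note that a process only sees the PATH it was started with, so reopen cmd or restart Visual Studio after editing the variable.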

There is a second environment variable, TESSDATA_PREFIX. I did not add it on my computer and the program still runs normally, so the following is for reference only:


When testing with the tesseract command line, the following error may be reported:

Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

The error means that the TESSDATA_PREFIX environment variable is missing, so no language can be loaded and tesseract cannot be initialized.

The solution is simple: add an environment variable named TESSDATA_PREFIX whose value is the path of the tessdata directory. (Older 3.x releases, as the error message above says, expect the parent directory of tessdata instead.)
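
A quick way to see whether TESSDATA_PREFIX points somewhere useful is to look for the language file yourself. The sketch below is an assumption-laden helper, not part of Tesseract: the names TessdataCheck and FindLanguageFile are invented here, and because the expected value differs between releases it checks both the directory itself and a tessdata subfolder.

```csharp
using System;
using System.IO;

static class TessdataCheck
{
    // Returns the path of <language>.traineddata if it exists either directly
    // in prefix or in prefix\tessdata; otherwise returns null.
    // (Hypothetical helper written for this article.)
    public static string FindLanguageFile(string prefix, string language)
    {
        if (string.IsNullOrEmpty(prefix)) return null;

        string fileName = language + ".traineddata";
        string[] candidates =
        {
            Path.Combine(prefix, fileName),
            Path.Combine(prefix, "tessdata", fileName)
        };

        foreach (string candidate in candidates)
        {
            if (File.Exists(candidate)) return candidate;
        }
        return null;
    }

    static void Main()
    {
        string prefix = Environment.GetEnvironmentVariable("TESSDATA_PREFIX");
        if (string.IsNullOrEmpty(prefix))
        {
            Console.WriteLine("TESSDATA_PREFIX is not set");
            return;
        }

        string found = FindLanguageFile(prefix, "eng");
        Console.WriteLine(found ?? "eng.traineddata was not found under " + prefix);
    }
}
```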


Use tesseract to identify images on the command line:
To use the tesseract command in cmd, the directory containing tesseract.exe must be in the PATH environment variable. The command then takes the form: tesseract <image path> <output file path without extension>.
Example:

tesseract a.png a

Then it will recognize the picture in a.png and write the text into a.txt.

To recognize Chinese, add the -l (language) parameter:

tesseract a.png a -l eng

The default is eng; for Chinese, change it to chi_sim.
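
When the command is assembled in code, the language parameter can be appended in the same way. The following is a minimal sketch; TesseractCommand and BuildArguments are names invented for this example. It also quotes the paths and joins several languages with "+", which tesseract accepts for mixed text (for example chi_sim+eng).

```csharp
using System;

static class TesseractCommand
{
    // Builds the argument string: "<image>" "<outputBase>" -l <lang1>+<lang2>...
    // The -l part is omitted when no language is given, so tesseract falls
    // back to its default (eng).
    public static string BuildArguments(string imagePath, string outputBase, params string[] languages)
    {
        // Quote both paths so that folders containing spaces still work.
        string args = "\"" + imagePath + "\" \"" + outputBase + "\"";

        if (languages != null && languages.Length > 0)
        {
            args += " -l " + string.Join("+", languages);
        }
        return args;
    }

    static void Main()
    {
        // prints: tesseract "a.png" "a" -l chi_sim
        Console.WriteLine("tesseract " + BuildArguments("a.png", "a", "chi_sim"));

        // prints: tesseract "a.png" "a" -l chi_sim+eng
        Console.WriteLine("tesseract " + BuildArguments("a.png", "a", "chi_sim", "eng"));
    }
}
```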

A quick way to open cmd in the current folder: hold down the Shift key and right-click inside the folder, then choose "Open command window here". The command prompt opens directly in that folder.


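#region Image to text (OCR)

// Requires: using System; using System.Diagnostics; using System.IO;

private static string CmdPath = @"C:\Windows\System32\cmd.exe";

/// <summary>
/// Runs a command through cmd.exe.
/// To run several commands, join them with the batch operators:
/// <![CDATA[
/// &:  run both commands, one after the other
/// |:  use the output of the first command as the input of the second
/// &&: run the second command only if the first one succeeds
/// ||: run the second command only if the first one fails]]>
/// For other operators, search online.
/// </summary>
/// <param name="cmd">Command line to execute.</param>
/// <param name="output">Receives everything the command wrote to standard error (tesseract prints its messages there).</param>
public void RunCmd(string cmd, out string output)
{
    // Always run "exit" whether or not the command succeeds;
    // otherwise cmd.exe keeps running and ReadToEnd() appears to hang.
    cmd = cmd.Trim().TrimEnd('&') + "&exit";
    Console.WriteLine(cmd);
    using (Process p = new Process())
    {
        p.StartInfo.FileName = CmdPath;
        p.StartInfo.UseShellExecute = false;        // do not start through the OS shell (required for redirection)
        p.StartInfo.RedirectStandardInput = true;   // accept input from the calling program
        p.StartInfo.RedirectStandardOutput = true;  // let the calling program read standard output
        p.StartInfo.RedirectStandardError = true;   // redirect standard error
        p.StartInfo.CreateNoWindow = true;          // do not show a console window

        p.Start();

        // Write the command to the cmd process
        p.StandardInput.AutoFlush = true;
        p.StandardInput.WriteLine(cmd);
        p.StandardInput.Close();

        // Drain standard output in the background so a full pipe cannot block cmd.exe,
        // then collect the messages written to standard error.
        var stdout = p.StandardOutput.ReadToEndAsync();
        output = p.StandardError.ReadToEnd();
        stdout.Wait();

        p.WaitForExit(); // wait for the process to finish
    }
}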
public string ImageToText(string imgPath)
{
    try
    {
        // Alternative: use the Tesseract .NET wrapper instead of the command line.
        //using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
        //{
        //    using (var img = Pix.LoadFromFile(imgPath))
        //    {
        //        using (var page = engine.Process(img))
        //        {
        //            return page.GetText();
        //        }
        //    }
        //}

        // Output folder: <base directory>Images\<yyyy-MM-dd>\
        string saveDir = string.Format(@"{0}Images\{1}\", this.GetBaseDirectory(), DateTime.Now.ToString("yyyy-MM-dd"));

        if (!System.IO.Directory.Exists(saveDir))
        {
            System.IO.Directory.CreateDirectory(saveDir);
        }

        // Output base name without extension; tesseract appends ".txt" itself.
        string txtPath = saveDir + string.Format(@"{0}_{1}", ViewModel.TestInfo.Barcode, DateTime.Now.ToString("yyyy-MM-dd_HH-mm-ss"));

        // Quote both paths so that folders containing spaces still work.
        string cmd = "tesseract \"" + imgPath + "\" \"" + txtPath + "\"";

        // RunCmd returns only after the command has finished, so no extra delay is needed.
        string errorOutput;
        RunCmd(cmd, out errorOutput);

        string txtFile = txtPath + ".txt";

        if (File.Exists(txtFile))
        {
            return File.ReadAllText(txtFile);
        }
        else
        {
            return null;
        }
    }
    catch (Exception ex)
    {
        System.Windows.MessageBox.Show("OCR failed: " + ex.Message);
        return null;
    }
}
#endregion
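
The code above goes through cmd.exe, which is why it needs the "&exit" trick. As an alternative sketch (not the method used in this article), tesseract.exe can also be started directly with ProcessStartInfo. The names DirectTesseract and Run are invented for this example, and it assumes tesseract.exe is on PATH; if it is not, Process.Start throws an exception.

```csharp
using System;
using System.Diagnostics;

static class DirectTesseract
{
    // Starts an executable without going through cmd.exe and returns its exit
    // code. Everything it wrote to standard error is returned through errors.
    // (Hypothetical helper written for this article.)
    public static int Run(string fileName, string arguments, out string errors)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,       // required for stream redirection
            RedirectStandardError = true,  // tesseract prints its messages to stderr
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Only stderr is redirected, so reading it to the end cannot
            // deadlock against an unread stdout pipe.
            errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    static void Main()
    {
        string errors;
        // Same call as "tesseract a.png a -l eng" on the command line.
        int exitCode = Run("tesseract.exe", "\"a.png\" \"a\" -l eng", out errors);
        Console.WriteLine("exit code: " + exitCode);
        Console.WriteLine(errors);
    }
}
```

With this approach the exit code is available, so the caller can tell a failed run apart from a run that produced an empty text file.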

Origin blog.csdn.net/BeanGo/article/details/129194369