OCR-Form-Tools Project Demo (Part 2): Product Review

This is a blog post for a software engineering course.

[TODO form PLACEHOLDER]

In a previous post I briefly introduced OCR-Form-Tools and its local deployment; in this post I evaluate the software further.

First, let's walk through the complete workflow of the software once to get an intuitive feel for its features.

The tool stores its data in Azure Storage. For testing, the teacher provided a repository containing five training form PDF files, plus a local PDF in the same format to serve as test data.

Create a project

After launching the tool, we see the initial screen.

The overall interface follows the flat design language Microsoft has favored over the past decade, and the dark theme immediately brings to mind its flagship products VS and VS Code. Click "New Project" to try creating a new form-recognition project:

The form's validation is thorough, and its placeholders and type hints are very clear.

Note that you need to add a new Connection to associate the project with an Azure storage service. The interface places an "Add Connection" button conveniently nearby; you can also click the small plug icon on the left side of the page to open the connection management screen and add one there.

After creating the connection, return to the project form, fill in the remaining information, and create the new form-recognition project.

Entering the editor, we see a preview of the PDF page to be annotated.

Add tags

To train a recognition model, we need to annotate the forms: mark out the information we are interested in (such as name, address, and e-mail) so the model can treat each item as a distinct feature. To distinguish these pieces of information, we assign each one a different tag.

First, add a Name tag and set its type to string.

Click the name field "John Singer" in the PDF, press the shortcut key "1" shown in the prompt box, and the name is selected: a red box appears around it and the Name tag lights up on the right. Labeling succeeded.

Next add Email, Zipcode, ExpDate, and Amount tags, assigning them the string, integer, date, and number types respectively to exercise the different tag types; then finish annotating all of the above tags on all five PDFs.

Annotated files are marked with a small icon in the file list.

The PDF viewer supports wheel zoom and drag to pan, and thanks to pre-run OCR, text can be selected with a single click. With the prompted numeric shortcut keys for applying tags and the Delete key for removing a selection, annotation can be done quickly with just mouse and keyboard. I finished labeling five tags across five documents in about ten minutes; very efficient.
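The four tag types used above (string, integer, date, number) determine how the recognized text is parsed into typed values in the prediction output (the `valueString`/`valueInteger` fields seen later). The tool's actual parsing logic is internal; this is just a hypothetical sketch of what such normalization might look like:

```python
from datetime import date

def normalize(raw: str, tag_type: str):
    """Parse raw OCR text according to a (hypothetical) tag type."""
    if tag_type == "string":
        return raw.strip()
    if tag_type == "integer":
        return int(raw.strip())      # "05001" -> 5001
    if tag_type == "number":
        return float(raw.strip())    # "45.00" -> 45.0
    if tag_type == "date":
        # e.g. "10 / 21": month/year; spacing varies between OCR results
        month, year = (p.strip() for p in raw.split("/"))
        return date(2000 + int(year), int(month), 1)
    raise ValueError(f"unknown tag type: {tag_type}")
```

This mirrors why the Zipcode tag's `valueInteger` comes back as `5001` even though the raw text is `"05001"`.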

Model training

After annotation is complete, click the train button on the left to enter the training page.

Click Train on the right to train a new model; when training finishes, the model information and the estimated prediction accuracy of each tag are returned.

Model test

After training, click the predict icon on the left side of the page and try using the freshly trained model to predict a new PDF. Browse and select the file; once the preview appears on the left, click Predict to start prediction.

When prediction completes, the confidence scores are returned.

You can see that each tag's region is correctly selected. Since this PDF did not appear in the training set, the model training was clearly successful. Note that the prediction result can also be downloaded in JSON format (the full text is long; an excerpt is shown here):

```json
"fields": {
  "Email": {
    "type": "string",
    "valueString": "[email protected]",
    "text": "[email protected]",
    "page": 1,
    "boundingBox": [2.045, 6.0200000000000005, 3.345, 6.0200000000000005, 3.345, 6.15, 2.045, 6.15],
    "confidence": 0.99,
    "elements": ["#/analyzeResult/readResults/0/lines/25/words/0"],
    "fieldName": "Email",
    "displayOrder": 1
  },
  "Zipcode": {
    "type": "integer",
    "valueInteger": 5001,
    "text": "05001",
    "page": 1,
    "boundingBox": [7.2250000000000005, 6.55, 7.58, 6.55, 7.58, 6.655, 7.2250000000000005, 6.655],
    "confidence": 0.999,
    "elements": ["#/analyzeResult/readResults/0/lines/33/words/0"],
    "fieldName": "Zipcode",
    "displayOrder": 2
  },
  "Amount": {
    "type": "number",
    "text": "45.00",
    "page": 1,
    "boundingBox": [6.54, 7.84, 6.875, 7.84, 6.875, 7.95, 6.54, 7.95],
    "confidence": 1,
    "elements": ["#/analyzeResult/readResults/0/lines/42/words/0"],
    "fieldName": "Amount",
    "displayOrder": 4
  },
  "ExpDate": {
    "type": "date",
    "text": "10 / 21",
    "page": 1,
    "boundingBox": [4.49, 7.88, 4.92, 7.88, 4.92, 8.01, 4.49, 8.01],
    "confidence": 1,
    "elements": ["#/analyzeResult/readResults/0/lines/38/words/0", "#/analyzeResult/readResults/0/lines/39/words/0", "#/analyzeResult/readResults/0/lines/40/words/0"],
    "fieldName": "ExpDate",
    "displayOrder": 3
  },
  "Name": {
    "type": "string",
    "valueString": "Jaime Gonzales",
    "text": "Jaime Gonzales",
    "page": 1,
    "boundingBox": [2.365, 5.74, 3.35, 5.74, 3.35, 5.845, 2.365, 5.845],
    "confidence": 0.97,
    "elements": ["#/analyzeResult/readResults/0/lines/15/words/0", "#/analyzeResult/readResults/0/lines/15/words/1"],
    "fieldName": "Name",
    "displayOrder": 0
  }
},
"errors": []
```
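Because the download is plain JSON, post-processing it is easy. For instance, a few lines of Python suffice to flatten the `fields` object into tag → (text, confidence) pairs (the helper name and the trimmed sample below are mine, not part of the tool):

```python
import json

def summarize_fields(result: dict) -> dict:
    """Flatten a prediction result's `fields` into {tag: (text, confidence)}."""
    return {
        name: (field["text"], field["confidence"])
        for name, field in result["fields"].items()
    }

# A trimmed sample shaped like the excerpt above
sample = json.loads("""
{"fields": {
   "Zipcode": {"type": "integer", "text": "05001", "confidence": 0.999},
   "Amount":  {"type": "number",  "text": "45.00", "confidence": 1}
}}
""")
```

Calling `summarize_fields(sample)` then yields a small dictionary ready for reporting, with the bounding boxes and OCR element references stripped away.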

This walkthrough chains together the project's main functions: first upload the PDF training set to Azure Blob Storage, create a connection and a project with it, annotate the files with the tool, then train the model to obtain a recognizer for that form format. From then on, feeding a new form of that format into the trained model lets you export the form data in a structured format.

Personal Experience

Overall I really like this tool; I think it can significantly improve the current situation where form processing requires a lot of manpower. Specifically, I see the following advantages:

  1. Pre-run OCR enables rapid field selection, and the shortcut-key scheme built on top of it is very user-friendly, making annotation highly efficient
  2. One-stop model training: labeled data can be fed straight into training and the model used immediately afterwards, saving a lot of tedious API calls and hiding many machine-learning details of the training and inference workflow, so even users without a relevant technical background can get started easily
  3. Built as a React SPA and delivered as a web application, it needs no installation or deployment steps and works out of the box
  4. The model backend only needs a base URL, which means any model exposing the same backend API can be plugged in easily, giving the tool strong extensibility
  5. A cool-looking interface

Although the whole experience was very smooth, I still think there are some small problems:

  1. The annotation interface's hints are too vague: it is hard for new users to realize that the numbered icons are the shortcut keys for applying the corresponding tags, and nothing suggests using the Delete key to remove a selected field
  2. Tag types are provided, but a tag's type cannot be seen without opening its settings menu, so reviewing the type settings is cumbersome and omissions become likely when there are many tags. Models generally choose different preprocessing and encoding for different feature types, so an inaccurate tag type may lead to wrong or suboptimal feature encodings and hurt model accuracy. (I suggest also showing each tag's type at the bottom of the annotation interface and in the training-results table)
  3. The tool currently only supports Azure Storage, which is somewhat unfriendly to users who already have their own storage solution
  4. Prediction does not support bulk upload or batch inference
  5. The downloadable JSON contains large amounts of raw data the user is not interested in (e.g. detected bounding-box positions), and no export of results in formats such as Excel is provided, making it difficult for non-professionals to integrate the tool directly into their workflow

Testing and Bug Report

Since the coursework requires finding bugs in the software, I black-box tested it under different runtime environments and browsers and found the following problems:

First, when Docker runs inside the Docker Toolbox virtual environment, connecting to a remote repository fails. The error message is hard for a user to understand, so this is presumably an unhandled exception the developers did not anticipate:

Since the official image is built as a release version, it does not provide enough debugging information, and reproducing the Docker Toolbox network environment on Windows is fairly complex, so I made no further attempt to locate the root cause and only filed an error report.

Another problem concerns annotation. When labeling decimal data in the test files, a single click can cause the data to be marked twice:

As the figure shows, clicking the amount field in the test files CCAuth-1.pdf and CCAuth-2.pdf causes the decimals to be erroneously selected twice. The likely cause is that PDF processing splits the decimal into two elements while OCR recognizes it as a single word block, and the resulting blocks coincide, hence the double selection. To address this, one might add a check: when the ranges of two selected word blocks coincide or one contains the other, merge or discard the duplicate.
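The fix suggested above, deduplicating word blocks whose bounding boxes coincide or overlap, could be sketched as follows. The coordinates follow the 8-value `boundingBox` format from the prediction JSON; this is an illustrative sketch, not the project's actual code:

```python
def to_rect(bbox):
    """Convert an 8-value boundingBox [x1,y1,...,x4,y4] to (xmin, ymin, xmax, ymax)."""
    xs, ys = bbox[0::2], bbox[1::2]
    return min(xs), min(ys), max(xs), max(ys)

def overlaps(a, b):
    """True if two word-block rectangles intersect (including containment)."""
    ax0, ay0, ax1, ay1 = to_rect(a)
    bx0, by0, bx1, by1 = to_rect(b)
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def dedupe(blocks):
    """Keep only blocks that do not overlap an already-kept block."""
    kept = []
    for b in blocks:
        if not any(overlaps(b, k) for k in kept):
            kept.append(b)
    return kept
```

With this check in place, the second coinciding block produced for a decimal would simply be dropped instead of being selected a second time.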

It should be noted that neither is a serious problem: the former is a sporadic error in an extreme runtime environment, entirely acceptable given the software's target users and usage scenarios, while the latter is a minor, low-probability annotation glitch that does not significantly degrade the experience.

In fact, it must be admitted that the quality of this software is very high. I did extensive black-box testing on Chrome, Firefox, Edge, and other mainstream browsers and found no obvious functional or display errors.

Requirements Understanding and Functional Analysis

After the complete walkthrough, I now have a general understanding of the project's features. My understanding is that this is a form annotation tool designed around a backend form-recognition algorithm: it provides very efficient, easy-to-use annotation of fixed-format form files, with which you can quickly build a training set; at the same time it also simplifies the follow-up steps, so a recognition model can be trained immediately on the given training set.

I think the pain point this tool resolves is:

  1. Form data is hard to label. Normally, the data a learning algorithm needs includes: the location of the target field in the document, the data type of the target field, and the ground-truth value of the target field. Since most document formats (pdf, docx, etc.) are organized as XML or XML-like documents, and a large number of forms exist purely as images, the position of a field in a document (typically its corner coordinates) is hard to specify intuitively. Labeling a feature therefore often requires measuring the coordinates of the text element with various graphical tools and then manually entering its true value and type: very troublesome work with high labor costs. And, as the preceding analysis shows, this tool greatly simplifies that labeling process.
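The three pieces of information a labeled sample needs, as listed above, can be pictured as a simple record. This structure is illustrative only (it is not the tool's internal format); the coordinates reuse the Name field's box from the earlier JSON excerpt:

```python
from dataclasses import dataclass

@dataclass
class LabeledField:
    tag: str             # e.g. "Name"
    field_type: str      # "string", "integer", "date", or "number"
    value: str           # ground-truth text of the field
    bounding_box: tuple  # corner coordinates of the field in the document

sample = LabeledField("Name", "string", "Jaime Gonzales",
                      (2.365, 5.74, 3.35, 5.845))
```

The tool's contribution is precisely that the `bounding_box` part, the hardest to produce by hand, is filled in automatically from a click on the pre-OCR'd text.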

My understanding is that the project's current, tentative user group is:

  1. Users of Microsoft's OCR-Form services. As the README describes, this tool leads a series of form-processing tools and is intended to (and indeed does) significantly optimize the OCR-Form experience. With it, you can quickly annotate data, train models, and validate them

And I think the pain point this tool has the potential to solve is:

  1. Non-technical staff find it hard to use machine-learning models to process form data. Consider HR, finance, and similar departments: large numbers of paper resumes and reports need to be digitized every day to facilitate statistics, a process that is tedious, simple, and repetitive; form-recognition models are exactly the weapon that can free up this productivity. However, these reports change format frequently, and the corresponding recognition model then needs to be retrained; staff in HR, finance, and similar departments usually lack the expertise to train models and call APIs, so this vision is hard to realize. By condensing and simplifying the whole workflow and hiding the algorithms, API calls, and other technical details, this tool makes the new technology accessible to these people too

So I think the project's potential future users are:

  1. The non-technical practitioners mentioned above. Many companies have heavy reporting workloads; this project could go on to develop (or be derived into) a more practical front-end tool that gives their business a powerful weapon and solves pain points that genuinely exist

To cater to these potential users, I think the tool still needs to add:

  1. The batch inference and Excel download capabilities mentioned above. I think an ideal workflow built on this tool would be: the user uploads and annotates reports of some format and trains a model, then uploads a large batch of unprocessed report data, and after batch inference can download an Excel summary table with the unneeded information already filtered out: each row corresponds to one report file, and each column corresponds to one tag (or basic information such as the report file name)
  2. Further polishing of the interface, with complete embedded tips and help, to lower the barrier to entry further
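The ideal batch workflow just described (one summary row per report file, one column per tag) would be straightforward to build on top of the per-file JSON results. A hypothetical sketch, producing CSV with the standard library (a real implementation would target Excel, but the table-building logic is the same):

```python
import csv
import io

def summary_table(results):
    """Build a CSV summary: one row per file, one column per tag.

    `results` maps file name -> the `fields` object from that file's
    downloaded prediction JSON."""
    tags = sorted({t for fields in results.values() for t in fields})
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["file"] + tags)
    for name, fields in sorted(results.items()):
        writer.writerow([name] + [fields.get(t, {}).get("text", "") for t in tags])
    return buf.getvalue()
```

Given two prediction results, this yields exactly the kind of table a finance or HR user could open directly, with bounding boxes and OCR internals already dropped.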


Origin www.cnblogs.com/MisTariano/p/12571423.html