Handbook of Document Image Processing and Recognition

Editors: David Doermann (University of Maryland)
Karl Tombre (University of Lorraine)

 

Foreword

In the beginning, there was only OCR. After some false starts, OCR became a competitive commercial enterprise in the 1950’s. A decade later there were more than 50 manufacturers in the US alone. With the advent of microprocessors and inexpensive optical scanners, the price of OCR dropped from tens and hundreds of thousands of dollars to that of a bottle of wine. Software displaced the racks of electronics. By 1985 anybody could program and test their ideas on a PC, and then write a paper about it (and perhaps even patent it).

We know, however, very little about current commercial methods or in-house experimental results. Competitive industries have scarce motivation to publish (and their patents may only be part of their legal arsenal). The dearth of industrial authors in our publications is painfully obvious. Herbert Schantz’s book, The History of OCR, was an exception: he traced the growth of REI, which was one of the major success stories of the 1960’s and 1970’s. He also told the story, widely mirrored in sundry wikis and treatises on OCR, of the previous fifty years’ attempts to mechanize reading. Among other manufacturers of the period, IBM may have stood alone in publishing detailed (though often delayed) information about its products.

Of the 4000-8000 articles published since 1900 on character recognition (my estimate), at most a few hundred really bear on OCR (construed as machinery - now software - that converts visible language to a searchable digital format). The rest treat character recognition as a prototypical classification problem. It is, of course, researchers’ universal familiarity with at least some script that turned character recognition into the pre-eminent vehicle for demonstrating and illustrating new ideas in pattern recognition. Even though some of us cannot tell an azalea from a begonia, a sharp sign from a clef, a loop from a tented arch, an erythrocyte from a leukocyte, or an alluvium from an anticline, all of us know how to read.

Until about 30 years ago, OCR meant recognizing mono-spaced OCR fonts and typewritten scripts one character at a time – eventually at the rate of several thousand characters per second. Word recognition followed for reading difficult-to-segment typeset matter. The value of language models more elaborate than letter n-gram frequencies and lexicons without word frequencies gradually became clear. Because more than half of the world population is polyglot, OCR too became multilingual (as Henry Baird predicted that it must). This triggered a movement to post all the cultural relics of the past on the Web. Much of the material awaiting conversion, ancient and modern, stretches the limits of human readability. Like humans, OCR must take full advantage of syntax, style, context, and semantics.
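The progression from letter n-gram statistics to richer language models is easy to illustrate. Below is a minimal sketch of our own (not from the Handbook): a letter-bigram model, trained on a toy corpus, used to rank two competing OCR hypotheses for the same word image; the corpus, the scores, and the penalty floor are all illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram_model(corpus: str) -> dict[str, float]:
    """Estimate log-probabilities of letter bigrams from a training corpus."""
    text = corpus.lower()
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(bigrams.values())
    return {bg: math.log(n / total) for bg, n in bigrams.items()}

def score(word: str, model: dict[str, float], floor: float = -15.0) -> float:
    """Sum bigram log-probabilities; unseen bigrams incur the penalty floor."""
    w = word.lower()
    return sum(model.get(w[i:i + 2], floor) for i in range(len(w) - 1))

# Illustrative use: rank two OCR hypotheses for the same word image.
model = train_bigram_model("the quick brown fox jumps over the lazy dog " * 100)
candidates = ["the", "tbe"]  # 'h'/'b' is a classic OCR confusion pair
print(max(candidates, key=lambda w: score(w, model)))  # -> the
```

A real system would go further in exactly the direction the foreword describes: word lexicons with frequencies, then syntax, style, context, and semantics.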

Although many academic researchers are aware that OCR is much more than classification, they have yet to develop a viable, broad-range, end-to-end OCR system (but they may be getting close). A complete OCR system, with language and script recognition, colored print capability, column and line layout analysis, accurate character/word, numeric, symbol and punctuation recognition, language models, document-wide consistency, tuneability and adaptability, graphics subsystems, effectively embedded interactive error correction, and multiple output formats, is far more than the sum of its parts. Furthermore, specialized systems - for postal address reading, check reading, litigation, and bureaucratic forms processing - also require high throughput and different error-reject trade-offs. Real OCR simply isn’t an appropriate PhD dissertation project.
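One way to see why the whole exceeds the sum of its parts is how tightly the stages are coupled: layout analysis depends on the detected script, and the recognizer's language model depends on both. The sketch below is purely our illustration; the stage names, types, and stubs are hypothetical, not the Handbook's design.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class PageState:
    """Shared state that every pipeline stage reads and enriches."""
    image: bytes                            # raw page image (placeholder)
    script: str | None = None               # e.g., "Latin", "Arabic"
    zones: list[tuple[int, int, int, int]] = field(default_factory=list)
    text: str = ""

def detect_script(s: PageState) -> PageState:
    s.script = "Latin"                      # stub: a real detector inspects glyphs
    return s

def segment_layout(s: PageState) -> PageState:
    # Stub: column/line analysis; reading order depends on the script above.
    s.zones = [(0, 0, 100, 100)]
    return s

def recognize_text(s: PageState) -> PageState:
    # Stub: classifier and language model are chosen per detected script,
    # and consistency checks span the whole document, not single characters.
    s.text = "..."
    return s

def ocr_pipeline(s: PageState) -> PageState:
    for stage in (detect_script, segment_layout, recognize_text):
        s = stage(s)
    return s
```

Even this toy version shows the coupling: no stage can be designed, tuned, or evaluated in isolation, which is part of why a real end-to-end system is not a dissertation-sized project.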

I never know whether to call hand print recognition and handwriting recognition “OCR,” but abhor “intelligent” as a qualifier for the latest wrinkle. No matter: they are here to stay until tracing glyphs with a stylus goes the way of the quill. Both human and machine legibility of manuscripts depend significantly on the motivation of the writer: a hand-printed income tax return requesting a refund is likely to be more legible than one reporting an underpayment. Immediate feedback, the main advantage of on-line recognition, is a powerful form of motivation. Humans still learn better than machines.

Document Image Analysis (DIA) is a superset of OCR, but many of its other popular subfields require OCR. Almost all line drawings contain text. An E-sized telephone company drawing, for instance, has about 3000 words and numbers (including revision notices). Music scores contain numerals and instructions like pianissimo. A map without place names and elevations would have limited use. Mathematical expressions abound in digits and alphabetic fragments like log, limit, tan or argmin. Good lettering used to be a prime job qualification for the draftsmen who drew the legacy drawings that we are now converting to CAD. Unfortunately, commercial OCR systems, tuned to paragraph-length segments of text, do poorly on the alphanumeric fragments typical of such applications. When Open Source OCR matures, it will provide a fine opportunity for customization to specialized applications that have not yet attracted heavy-weight developers. In the meantime, the conversion of documents containing a mix of text and line art has given rise to distinct sub-disciplines with their own conference sessions and workshops that target graphics techniques like vectorization and complex symbol configurations.

Another subfield of DIA investigates what to do with automatically or manually transcribed books, technical journals, magazines and newspapers. Although Information Retrieval (IR) is not generally considered part of DIA or vice-versa, the overlap between them includes “logical” document segmentation, extraction of tables of content, linking figures and illustrations to textual references, and word spotting. A recurring topic is assessing the effect of OCR errors on downstream applications. One factor that keeps the two disciplines apart is that IR experiments (e.g., TREC) typically involve orders of magnitude more documents than DIA experiments because the number of characters in any collection is far smaller than the number of pixels.

Computer vision used to be easily distinguished from the image processing aspects of DIA by its emphasis on illumination and camera position. The border is blurring because even cellphone cameras now offer sufficient spatial resolution for document image capture at several hundred dpi as well as for legible text in large scene images. The correction of the contrast and geometric distortions in the resulting images goes well beyond what is required for scanned documents.

This collection suggests that we are still far from a unified theory of DIA or even OCR. The Handbook is all the more useful because we have no choice except to rely on heuristics or algorithms based on questionable assumptions. The most useful methods available to us were all invented rather than derived from first principles. When the time is ripe, many alternative methods are invented to fill the same need. They all remain entrenched candidates for “best practice”. This Handbook presents them fairly, but generally avoids picking winners and losers.

“Noise” appears to be the principal obstacle to better results. This is all the more irritating because many types of noise (e.g. skew, bleed-through, underscore) barely slow down human readers. We have not yet succeeded in characterizing and quantifying signal and noise to the extent that communications science has. Although OCR and DIA are prime examples of information transfer, information-theoretic concepts are seldom invoked. Are we moving in the right direction by accumulating empirical midstream comparisons – often on synthetic data – from contests organized by individual research groups in conjunction with our conferences?

"Noise" seems to be the main obstacle to achieve better results. This is more annoying, because many types of noise (such as skew, bleeding, underline) hardly slows the reader down human readers. We have not been as successful to describe and quantify the signal and noise like communication science. Although OCR and DIA is a prime example of information transmission, but few references to the concepts of information theory. Are we moving in the right direction, through the accumulation of experience in midstream comparison - usually integrated data - drawn from each team race with us together in meetings organized?

Be that as it may, as one is getting increasingly forgetful, it is reassuring to have most of the elusive information about one’s favorite topics at arm’s reach in a fat tome like this one. Much as on-line resources have improved over the past decade, I like to turn down the corner of the page and scribble a note in the margin. Younger folks, who prefer search-directed saccades to an old-fashioned linear presentation, may want the on-line version.

David Doermann and Karl Tombre were exceptionally well qualified to plan, select, solicit, and edit this compendium. Their contributions to DIA cover a broad swath and, as far as I know, they have never let the song of the sirens divert them from the muddy and winding channels of DIA. Their technical contributions are well referenced by the chapter authors and their voice is heard at the beginning of each section.

Dave is the co-founding editor of IJDAR, which became our flagship journal when PAMI veered towards computer vision and machine learning. Along with the venerable PR and the high-speed, high-volume PRL, IJDAR has served us well with a mixture of special issues, surveys, experimental reports, and new theories. Even earlier, with the encouragement of Azriel Rosenfeld, Dave organized and directed the Language and Media Processing Laboratory, which has become a major resource of DIA data sets, code, bibliographies, and expertise.

Karl, another IJDAR co-founder, put Nancy on the map as one of the premier global centers of DIA research and development. Beginning with a sustained drive to automate the conversion of legacy drawings to CAD formats (drawings for a bridge or a sewer line may have a lifetime of over a hundred years, and the plans for the still-flying Boeing 747 were drawn by hand), Karl brought together and expanded the horizons of University and INRIA researchers to form a critical mass of DIA.

Dave and Karl have also done more than their share to bring our research community together, find common terminology and data, create benchmarks, and advance the state of the art. These big patient men have long been a familiar sight at our conferences, always ready to resolve a conundrum, provide a missing piece of information, fill in for an absentee session chair or speaker, or introduce folks who should know each other.

The DIA community has every reason to be grateful to the editors and authors of this timely and comprehensive collection. Enjoy, and work hard to make a contribution to the next edition!

 

Part A Introduction, Background, Fundamentals
1 A Brief History of Documents and Writing Systems
2 Document Creation, Image Acquisition and Document Quality
3 The Evolution of Document Image Analysis
4 Imaging Techniques in Document Analysis Processes

Part B Page Analysis
5 Page Segmentation Techniques in Document Analysis
6 Analysis of the Logical Layout of Documents
7 Page Similarity and Classification

Part C Text Recognition
8 Text Segmentation for Document Recognition
9 Language, Script, and Font Recognition
10 Machine-Printed Character Recognition
11 Handprinted Character and Word Recognition
12 Continuous Handwritten Script Recognition
13 Middle Eastern Character Recognition
14 Asian Character Recognition

Volume 2

Part D Processing of Non-textual Information
15 Graphics Recognition Techniques
16 An Overview of Symbol Recognition
17 Analysis and Interpretation of Graphical Documents
18 Logo and Trademark Recognition
19 Recognition of Tables and Forms
20 Processing Mathematical Notation

Part E Applications
21 Document Analysis in Postal Applications and Check Processing
22 Analysis and Recognition of Music Scores
23 Analysis of Documents Born Digital
24 Image Based Retrieval and Keyword Spotting in Documents
25 Text Localization and Recognition in Images and Video

Part F Analysis of Online Data
26 Online Handwriting Recognition
27 Online Signature Verification
28 Sketching Interfaces

Part G Evaluation and Benchmarking
29 Datasets and Annotations for Document Analysis and Recognition
30 Tools and Metrics for Document Analysis Systems Evaluation

Index





