Common Table Detection and Recognition Methods: Table Structure Recognition (Part 1)

Chapter 3 Common Table Detection and Recognition Methods

3.2 Table Structure Recognition Methods

Table structure recognition is the task that follows table region detection. Its goal is to recognize the layout structure and hierarchy of a table and to convert the table's visual information into a structural description from which the table can be reconstructed. This structural description includes the exact positions of cells, the relationships between cells, and the row and column positions of cells.

In current research, table structure information is mainly expressed in two forms of description: 1) a list of cells (including each cell's position, its row and column information, and its content); 2) HTML or LaTeX code (containing the cells' position information, and sometimes their content as well).
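The two description forms can be illustrated with a minimal example (the coordinates and contents below are made up for illustration): the same 2x2 table is given first as a cell list, then converted to HTML.

```python
# Form 1: a list of cells, each with its bounding box, row/column
# indices, and text content (illustrative values only).
cells = [
    {"bbox": [10, 10, 100, 40], "row": 0, "col": 0, "text": "Name"},
    {"bbox": [110, 10, 200, 40], "row": 0, "col": 1, "text": "Score"},
    {"bbox": [10, 50, 100, 80], "row": 1, "col": 0, "text": "Alice"},
    {"bbox": [110, 50, 200, 80], "row": 1, "col": 1, "text": "95"},
]

# Form 2: the same structure expressed as HTML, generated from the list.
def cells_to_html(cells):
    n_rows = max(c["row"] for c in cells) + 1
    rows = [[] for _ in range(n_rows)]
    for c in sorted(cells, key=lambda c: (c["row"], c["col"])):
        rows[c["row"]].append("<td>%s</td>" % c["text"])
    body = "".join("<tr>%s</tr>" % "".join(r) for r in rows)
    return "<table>%s</table>" % body

html = cells_to_html(cells)
print(html)
# <table><tr><td>Name</td><td>Score</td></tr><tr><td>Alice</td><td>95</td></tr></table>
```

Note that the list form carries physical coordinates explicitly, while the HTML form encodes only the logical grid; many datasets pair the two.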

Similar to the table region detection task, early table structure recognition methods usually relied on heuristic algorithms or machine learning methods designed around the characteristics of a given dataset.

Exploiting the regularity of the two-dimensional layout of cells in a table, Itonori (1993) used connected component analysis to extract the text blocks, and then expanded and aligned each text block to form cells, thereby obtaining the cells' physical coordinates and row and column positions.
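The alignment idea behind this family of heuristics can be sketched as follows. This is a toy reconstruction of the general approach, not Itonori's exact algorithm: text-block centers are clustered into rows and columns within a tolerance, and each block is assigned the nearest (row, column) index.

```python
def cluster_1d(values, tol):
    """Group sorted 1-D coordinates whose successive gaps are within `tol`."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]  # one center per group

def assign_row_col(boxes, tol=5):
    """boxes: (x0, y0, x1, y1) tuples. Returns a (row, col) index per box."""
    ys = [(b[1] + b[3]) / 2 for b in boxes]  # vertical centers -> rows
    xs = [(b[0] + b[2]) / 2 for b in boxes]  # horizontal centers -> columns
    row_centers = cluster_1d(ys, tol)
    col_centers = cluster_1d(xs, tol)
    def nearest(v, centers):
        return min(range(len(centers)), key=lambda i: abs(centers[i] - v))
    return [(nearest(y, row_centers), nearest(x, col_centers))
            for y, x in zip(ys, xs)]

boxes = [(0, 0, 40, 10), (60, 1, 100, 11),   # first row
         (1, 30, 41, 40), (61, 31, 99, 41)]  # second row
print(assign_row_col(boxes))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Real heuristic systems add many refinements (spanning cells, ragged layouts), but the core is this kind of coordinate clustering.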

Rahgozar et al. (1994) recognized table structure in terms of rows and columns. They first identified the text blocks in the image, then clustered them into rows and columns according to the text blocks' positions and the blank areas between cells, and finally obtained the position of each cell and the structure of the table from the intersections of the rows and columns.

Hirayama et al. (1995) started from the table lines, obtained the table's rows and columns through geometric analysis such as tests for parallelism and perpendicularity, and used a dynamic programming matching method to identify the logical relationships between content blocks and restore the table structure.

Zuyev (1997) used visual features for table recognition, segmenting cells by means of row and column lines and white space. This algorithm was applied in the FineReader OCR product.

Kieninger et al. (1998) proposed T-Recs (Table REcognition System), which takes the bounding boxes of word regions as input and, through heuristic methods such as clustering and column decomposition, outputs the information corresponding to each text box and restores the table structure. On this basis, they later proposed the T-Recs++ system (Kieninger et al., 2001), which further improved the recognition results.

Amano et al. (2001) innovatively introduced the semantic information of the text: they first decomposed the document into a set of boxes and semi-automatically classified them into four types: blank, insertion, indication, and explanation. The relationships between indication boxes and their associated entries were then analyzed according to the semantic and geometric knowledge defined in a document structure grammar.

Wang et al. (2004) defined the table structure as a tree and proposed a table structure understanding algorithm designed around an optimization method. The algorithm learns the geometric distributions in a training set to optimize its parameters and thereby obtains the table structure. Ishitani et al. (2005) also defined the table structure as a tree, using a DOM (Document Object Model) tree to represent the table and extracting cell features from the input table image. The cells are then sorted, irregular tables are identified, and the cells are adjusted to form a regular arrangement.

Hassan (2007) and Shigarov (2016), among others, used PDF documents as the carrier for table recognition and recovered the table's visual information from the PDF itself. The latter also proposed a framework of configurable heuristics.

In China, research on table structure recognition started later, so there are relatively few traditional heuristic and machine learning methods.

Early on, Liu et al. (1995) proposed a table frame-line template method that uses the table's frame lines to form a frame template, which can reflect the table structure topologically or geometrically; a corresponding item traversal algorithm was then proposed to locate and label the items in the table. Later, Li et al. (2012) used an OCR engine to extract the text content and text positions from the table, used keywords to locate the table header, and then combined the header information with the table's projection profile to obtain the column and row separators and thus the table structure.
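The projection-profile step underlying this kind of separator detection can be sketched briefly. This is a generic sketch of the technique, not the exact algorithm of Li et al.: ink pixels of a binarized table image are summed along one axis, and runs of (near-)blank positions are taken as separator bands.

```python
import numpy as np

def projection_separators(binary, axis, max_ink=0):
    """Return index ranges along the scanned axis whose ink count <= max_ink.

    binary: 2-D array with 1 for ink, 0 for background.
    axis=0 sums over rows, scanning columns (vertical separators);
    axis=1 sums over columns, scanning rows (horizontal separators).
    """
    profile = binary.sum(axis=axis)
    blank = profile <= max_ink
    runs, start = [], None
    for i, b in enumerate(blank):
        if b and start is None:
            start = i                 # a blank run begins
        elif not b and start is not None:
            runs.append((start, i))   # a blank run ends
            start = None
    if start is not None:
        runs.append((start, len(blank)))
    return runs

# Two text columns separated by a blank vertical band at x = 3..4.
img = np.array([[1, 1, 1, 0, 0, 1, 1],
                [1, 1, 1, 0, 0, 1, 1]])
print(projection_separators(img, axis=0))  # [(3, 5)]
```

In practice the threshold `max_ink` is relaxed to tolerate noise, and the detected bands are intersected with header keywords as described above.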

In general, traditional table structure recognition methods can be summarized into four types: segmentation by rows and columns followed by post-processing; text detection, expansion, and post-processing; text-block classification and post-processing; and fusions of several of these methods.

With the rise of neural networks, researchers began to apply them to document layout analysis tasks. Later, as more complex architectures were developed, more work was devoted to recognizing table columns and overall table structure.

A Zucker proposed CluSTi, an efficient clustering method for identifying table structures in scanned invoice images. CluSTi makes three contributions. First, it uses a clustering method to remove the heavy noise in table images. Second, it uses state-of-the-art text recognition to extract all text boxes. Finally, it organizes the text boxes into the correct rows and columns using horizontal and vertical clustering with optimal parameters. Split, Embed and Merge (SEM), proposed by Z Zhang, is an accurate table structure recognizer. M Namysl proposed a general, modular approach to table extraction.

E Koci proposed a new approach that identifies tables in spreadsheets and builds layout regions after determining the layout role of each cell. They use a graph model to represent the spatial interrelations between these regions. On this basis, they proposed Remove and Conquer (RAC), a table identification algorithm based on a carefully selected set of criteria.

SA Siddiqui exploited the potential of deformable convolutional networks and presented a unique approach for analyzing table patterns in document images. P Riba proposed a graph-based technique for identifying table structures in document images. The method uses position, context, and content type rather than raw content (recognized text), so it is a purely structure-aware technique that does not depend on the language or on the quality of text recognition. E Koci used genetic-algorithm-based graph partitioning to identify the parts of a graph that correspond to tables in spreadsheets.

SA Siddiqui formulated the structure recognition problem as a semantic segmentation problem, employing a fully convolutional network to segment the rows and columns. Assuming consistency of the table structure, the method introduces a prediction tiling step that reduces the complexity of table structure recognition. The authors imported pre-trained models from ImageNet and used FCN encoder and decoder structures. Given an image, the model produces features of the same size as the original input.
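The tiling intuition can be shown in a few lines. This is an assumption-level reconstruction of the idea rather than the paper's code: since every pixel of a row separator shares its label along the row, a 2-D row-segmentation probability map can be averaged along each row and broadcast back, collapsing the prediction to an effectively 1-D one and suppressing pixel-level noise.

```python
import numpy as np

def tile_row_predictions(prob_map):
    """prob_map: (H, W) per-pixel separator probabilities.
    Returns a map where each row is replaced by its mean value."""
    row_means = prob_map.mean(axis=1, keepdims=True)  # (H, 1)
    return np.broadcast_to(row_means, prob_map.shape)

noisy = np.array([[0.9, 0.7, 0.8],   # likely a separator row
                  [0.1, 0.3, 0.2]])  # likely a content row
tiled = tile_row_predictions(noisy)
# Each row collapses to its mean, giving a consistent row-wise decision.
```

The same averaging applied along columns yields the column predictions; thresholding the 1-D profiles then gives the separators.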

SA Khan presented a robust deep-learning-based solution for extracting rows and columns from detected tables in document images. Table images are preprocessed and then fed into a bidirectional recurrent neural network with gated recurrent units (GRUs) and a fully connected layer with softmax activation. SF Rashid presented a new learning-based approach to recognizing table content in images of diverse documents. SR Qasim proposed a graph-network-based table recognition architecture as an alternative to typical neural networks. S Raja proposed a table structure recognition method that combines cell detection and interaction modules to localize cells and predict their row and column relationships to other detected cells; in addition, a structure-constrained loss function for cell detection was added as an extra differentiable component. Y Deng examined the existing end-to-end table recognition problem and also highlighted the need for a larger dataset in this area.

Another study, by Y Zou, advocated an image-based table structure recognition technique using fully convolutional networks. The presented work segments a table's rows, columns, and cells. The estimated boundaries of all table components are refined with connected component analysis. Based on the positions of the row and column separators, each cell is then assigned its row and column numbers. In addition, cell boundaries are optimized with dedicated algorithms.
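The final assignment step described above is simple enough to sketch directly (with hypothetical coordinates): once the separator positions are known, a cell's row and column numbers follow from where its center falls between separators.

```python
from bisect import bisect_right

def grid_index(cell_bbox, row_seps, col_seps):
    """cell_bbox: (x0, y0, x1, y1); row_seps / col_seps: sorted separator
    coordinates (y for rows, x for columns). Returns (row, col)."""
    cx = (cell_bbox[0] + cell_bbox[2]) / 2
    cy = (cell_bbox[1] + cell_bbox[3]) / 2
    # Count how many separators lie before the center on each axis.
    return bisect_right(row_seps, cy), bisect_right(col_seps, cx)

row_seps = [50, 100]  # y-positions of horizontal separators
col_seps = [80]       # x-positions of vertical separators
print(grid_index((10, 60, 70, 90), row_seps, col_seps))  # (1, 0)
```

Spanning cells, whose boxes cross one or more separators, need the extra merging logic that several of the methods below address explicitly.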

To identify the rows and columns of a table, KA Hashmi [118] proposed a guided technique for table structure recognition. According to this study, the localization of rows and columns can be improved with an anchor optimization method. In the proposed work, Mask R-CNN with optimized anchors is used to detect row and column boundaries.

Another effort to segment table structures is the ReS2TIM paper by W Xue, which reconstructs syntactic structure from tables. The main goal of this model is to regress the coordinates of each cell. The technique first builds a network that can identify the neighbors of each cell in the table. The study presents a distance-based weighting system that helps the network overcome the class imbalance problem during training.
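To make the imbalance problem concrete, here is a generic inverse-frequency class-weighting sketch; the paper's actual distance-based weighting is not reproduced here. In pairwise cell-relationship training, "neighbor" pairs are far rarer than "not-neighbor" pairs, so rare classes are up-weighted in the loss.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency, normalized so that
    the rarest class receives weight 1.0."""
    counts = Counter(labels)
    min_count = min(counts.values())
    return {c: min_count / n for c, n in counts.items()}

# Hypothetical training labels: neighbor pairs are the minority class.
labels = ["not_neighbor"] * 8 + ["neighbor"] * 2
print(class_weights(labels))  # {'not_neighbor': 0.25, 'neighbor': 1.0}
```

A distance-based scheme refines this further by also scaling each sample's weight with the spatial distance between the two cells.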

C Tensmeyer proposed SPLERGE (Split and Merge), another method that uses dilated convolutions. The strategy relies on two separate deep learning models: the first establishes the grid-like layout of the table, and the second determines whether cells span multiple rows or columns.
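The two-stage split-and-merge pipeline can be sketched with assumed interfaces (this stands in for the two neural models, it is not the paper's code): stage 1 yields a grid of basic cells from predicted separators, and stage 2 merges neighboring grid cells that are flagged as one spanning cell.

```python
def split_stage(n_rows, n_cols):
    """Stage 1 output: a grid of basic cells as (row, col) pairs."""
    return [(r, c) for r in range(n_rows) for c in range(n_cols)]

def merge_stage(cells, merges):
    """Stage 2: `merges` holds pairs of grid cells predicted to belong to
    one spanning cell. Returns groups of grid cells (one group per final
    cell), computed with union-find."""
    parent = {c: c for c in cells}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c
    for a, b in merges:
        parent[find(a)] = find(b)
    groups = {}
    for c in cells:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

cells = split_stage(2, 2)
# Suppose the merge model predicts the top two cells form one header cell.
final = merge_stage(cells, [((0, 0), (0, 1))])
print(final)  # [[(0, 0), (0, 1)], [(1, 0)], [(1, 1)]]
```

In the real system both stages are learned end to end from image features; only the grid-then-merge control flow is shown here.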

Nassar provided a new recognition model for table structures, enhancing the latest encoder-dual-decoder in the PubTabNet end-to-end deep learning model in two important ways. First, the authors provide a novel decoder for table cell object detection. This lets them easily access the content of table cells in programmatic PDFs without training any proprietary OCR decoder. According to the authors, this architectural improvement makes table content extraction more precise and enables working with non-English tables. Second, a transformer-based decoder replaces the LSTM decoder.

S Raja proposed a new object-detection-based deep model tailored for fast optimization that captures the natural alignment of cells within a table. Even with precise cell detection, dense table recognition can remain problematic, because cells spanning multiple rows or columns make it difficult to capture long-range row and column relationships. The authors therefore also sought to improve structure recognition through a novel rectilinear graph-based formulation. They emphasize the semantic relevance of empty cells in tables and suggest modifying a popular evaluation criterion to take these cells into account. To enable new perspectives on the problem, a moderately large, human-annotated evaluation dataset is also provided.

X Shen proposed two modules called Row Aggregation (RA) and Column Aggregation (CA). First, the authors apply feature slicing and tiling to make coarse row and column predictions with high fault tolerance. Second, they compute channel attention maps to further refine the row and column information. Using RA and CA, the authors build a semantic segmentation network called the Row and Column Aggregation Network (RCANet) to perform row and column segmentation.

C Ma presented a new method for recognizing table structure and detecting table boundaries in a variety of document images. The authors use CornerNet as a new region proposal network to generate higher-quality table candidates for Faster R-CNN, which greatly improves Faster R-CNN's localization accuracy for table detection while using only a minimal ResNet-18 backbone. In addition, the authors propose a novel split-and-merge approach to recognize the table structure: a new spatial CNN separator-line prediction module divides each detected table into a grid of cells, and a Grid CNN cell merging module then recovers the spanning cells. Because the spatial CNN module effectively propagates contextual information across the entire table image, their structure recognizer can accurately handle tables with large areas of white space as well as geometrically deformed (even curved) tables. B Xiao hypothesized that a complex table structure can be represented by a graph in which vertices and edges represent individual cells and the connections between them. The authors then designed a conditional attention network, framing the table structure recognition problem as a cell association classification problem (CATT-Net).

Jain proposed training a deep network to recognize the spatial relationships between pairs of characters in table images in order to decipher the table structure. The authors provide an end-to-end pipeline called TSR-DSAW (TSR via Deep Spatial Association of Words), which generates a digital representation of a table image in a structured format such as HTML. The technique first uses a text detection network, such as CRAFT, to identify every character in the input table image. Next, character pairs are created using dynamic programming. Each pair is underlined in its own copy of the image and fed to a DenseNet-121 classifier trained to recognize spatial relations such as same row, same column, same cell, or no relation. Finally, the authors apply post-processing to the classifier output to generate the HTML table structure.

H Li formulated the problem as a cell relation extraction challenge and presented T2, a cutting-edge two-stage method that successfully extracts table structures from digitally preserved documents. T2 introduces a general concept called basic connectivity, which accurately represents the direct relationships between cells. To handle complex table structures, it also builds an alignment graph and employs a message-passing network.
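A simplified illustration of such an alignment graph (with assumed data structures, not T2's actual representation): cells become vertices, and edges connect cells that are directly adjacent in the grid, which is the kind of local relation a message-passing network then propagates.

```python
def alignment_graph(cells):
    """cells: dict mapping cell id -> (row, col).
    Returns edges linking each cell to its right and down grid neighbors."""
    pos = {v: k for k, v in cells.items()}  # (row, col) -> cell id
    edges = set()
    for cid, (r, c) in cells.items():
        for nb in ((r, c + 1), (r + 1, c)):  # right and down neighbors
            if nb in pos:
                edges.add((cid, pos[nb]))
    return sorted(edges)

# A three-cell L-shaped fragment of a table grid.
cells = {"A": (0, 0), "B": (0, 1), "C": (1, 0)}
print(alignment_graph(cells))  # [('A', 'B'), ('A', 'C')]
```

Message passing over such a graph lets each cell aggregate information from aligned neighbors, which is what allows indirect, long-range relations to be inferred from the direct ones.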

In real-world applications, table structure recognition must not only perform table detection and structure recognition together, but also recognize the text of each cell and extract information, making the whole process more complicated than in the research settings above.

references:

Gao L C, Li Y B, Du L, Zhang X P, Zhu Z Y, Lu N, Jin L W, Huang Y S, Tang Z. 2022. A survey on table recognition technology. Journal of Image and Graphics, 27(6): 1898-1917.

M Kasem, A Abdallah, A Berendeyev, E Elkady, M Abdalla, M Mahmoud, M Hamada, D Nurseitov, I Taj-Eddin. Deep learning for table detection and structure recognition: A survey. arXiv:2211.08469v1 [cs.CV], 15 Nov 2022.

S A Siddiqui, M I Malik, S Agne, A Dengel and S Ahmed. DeCNT: Deep Deformable CNN for Table Detection. IEEE Access, vol. 6, pp. 74151-74161. [DOI: 10.1109/ACCESS.2018.2880211]

T Shehzadi, K A Hashmi, D Stricker, M Liwicki and M Z Afzal. Towards End-to-End Semi-Supervised Table Detection with Deformable Transformer. arXiv:2305.02769v2 [cs.CV], 7 May 2023.


Origin: blog.csdn.net/INTSIG/article/details/130841544