Commonly used table detection and recognition methods - table region detection methods (Part 2)

(Continued from the previous part.)

Training

The semi-supervised network is trained in two steps: a) the student module is trained independently on the labeled data, while the teacher module generates pseudo-labels; b) the two modules are then trained jointly to obtain the final prediction.
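The two-step scheme can be sketched as follows. This is a minimal, framework-agnostic sketch, not the paper's actual code: `Model`, `step_fn`, and the data placeholders are illustrative, and only the control flow and the exponential-moving-average (EMA) update by which the teacher tracks the student are meant to carry over.

```python
import copy

class Model:
    """Stand-in for the deformable-DETR detector; holds a flat weight list."""
    def __init__(self, weights):
        self.weights = list(weights)

def ema_update(teacher, student, decay=0.999):
    """teacher <- decay * teacher + (1 - decay) * student, per weight."""
    teacher.weights = [decay * t + (1 - decay) * s
                       for t, s in zip(teacher.weights, student.weights)]

def train(student, labeled, unlabeled, burn_in_steps, joint_steps, step_fn):
    # Step (a): the student alone is trained on the labeled data.
    for _ in range(burn_in_steps):
        step_fn(student, labeled, pseudo=None)
    # The teacher starts as a copy of the burned-in student.
    teacher = copy.deepcopy(student)
    # Step (b): the teacher pseudo-labels unlabeled data; the student trains
    # on labeled + pseudo-labeled data; the teacher then tracks the student.
    for _ in range(joint_steps):
        step_fn(student, labeled + unlabeled, pseudo=teacher)
        ema_update(teacher, student)
    return student, teacher
```

In a real implementation `step_fn` would compute the detection loss and run an optimizer step, and the teacher's predictions would be filtered by a confidence threshold before being used as pseudo-labels.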

Pseudo-label framework

Experiments

Datasets:

TableBank is the second-largest dataset for table recognition in the field of document analysis. It contains roughly 417,000 annotations collected by crawling the arXiv database. The dataset provides tables from three splits of document images: LaTeX images (253,817), Word images (163,417), and the combination of both (417,234). It also includes data for table structure recognition; the experiments in the paper use only the table detection data.

PubLayNet is a large public dataset with 335,703 images in the training set, 11,245 images in the validation set, and 11,405 images in the test set. It includes annotations such as polygonal segmentations and bounding boxes for figures, lists, titles, tables, and text in images from research papers and articles. The dataset is evaluated using the COCO analysis technique. In the experiments, the authors use only the 86,460 images that contain tables, comprising 102,514 table annotations.

DocBank is a large dataset of more than 500,000 annotated document pages designed for training and evaluation on tasks such as text classification, entity recognition, and relation extraction. It includes annotations for title, author name, affiliation, abstract, body text, etc.

ICDAR-19: the Competition on Table Detection and Recognition (cTDaR) was organized at ICDAR 2019. For the table detection task (Track A), two new datasets (modern and historical) were introduced in the competition. For direct comparison with previous state-of-the-art methods, the experiments report results on the modern dataset at IoU thresholds ranging from 0.5 to 0.9.
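For reference, the IoU used in these threshold sweeps measures the overlap between a predicted box and a ground-truth box; a detection counts as correct when its IoU meets the threshold. A minimal implementation for axis-aligned boxes:

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Raising the IoU threshold from 0.5 to 0.9 therefore demands progressively tighter localization before a predicted table region is counted as a true positive.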

Experiment setup details:

The experiments use deformable DETR with a ResNet-50 backbone pre-trained on ImageNet as the detection framework to evaluate the effectiveness of the semi-supervised approach. Training is performed on the TableBank splits, the PubLayNet table class, DocBank, and ICDAR-19. The experiments use 10%, 30%, and 50% of the data as labeled, with the rest treated as unlabeled. The pseudo-labeling confidence threshold is set to 0.7. Training runs for 150 epochs in all experiments, with the learning rate reduced by a factor of 0.1 at epoch 120. Strong augmentations include horizontal flipping, resizing, patch removal, cropping, grayscale conversion, and Gaussian blur; weak augmentation consists of horizontal flipping only. The number N of object queries input to the deformable DETR decoder is set to 30, as this gives the best results. Unless otherwise stated, the experiments use the mAP (AP50:95) metric to evaluate results.
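The weak/strong augmentation split can be illustrated with a toy sketch. The "image" here is just a list of pixel rows and the heavier operations are identity placeholders; a real pipeline would use torchvision transforms (RandomHorizontalFlip, RandomResizedCrop, RandomGrayscale, GaussianBlur) instead.

```python
import random

def hflip(img):
    """Horizontal flip: reverse each pixel row."""
    return [row[::-1] for row in img]

def identity_stub(img):
    """Placeholder for heavier ops (resize, patch drop, grayscale, blur)."""
    return img

def weak_augment(img, rng):
    # Weak augmentation: horizontal flip only, with probability 0.5.
    return hflip(img) if rng.random() < 0.5 else img

def strong_augment(img, rng):
    # Strong augmentation: apply a random subset of heavier operations.
    ops = [hflip, identity_stub, identity_stub, identity_stub]
    for op in rng.sample(ops, k=2):
        img = op(img)
    return img
```

The asymmetry is deliberate: the teacher sees weakly augmented images so its pseudo-labels are reliable, while the student learns from strongly augmented views of the same images.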

Experimental results discussion:

TableBank:

The experiments report results on all splits of the TableBank dataset with different proportions of labeled data. The transformer-based semi-supervised method is also compared with previous deep-learning-based supervised and semi-supervised methods. In addition, results at all IoU thresholds are given for the TableBank-both split with 10% labeled data. Table 1 reports the results of the semi-supervised method on the TableBank-latex, TableBank-word, and TableBank-both splits with 10%, 30%, and 50% labeled data, respectively. It shows that with 10% labeled data, the TableBank-both split achieves the highest AP50 of 95.8%, TableBank-latex reaches 93.5%, and TableBank-word 92.5%.

A qualitative analysis of the semi-supervised table detection is shown in Figure 5. Part (b) of Figure 5 contains a matrix whose row-and-column structure resembles a table. The network detects the matrix as a table, producing a false positive; this incorrect detection shows that the network cannot always identify table regions correctly. Table 2 presents the results of the semi-supervised approach at different IoU thresholds for all TableBank splits with 10% labeled data. A visual comparison of precision, recall, and F1-score for the semi-supervised network with a ResNet-50 backbone on the 10% labeled TableBank dataset is shown in Figure 6.

Comparison with previous supervised and semi-supervised methods

Table 3 compares deep-learning-based supervised and semi-supervised networks with a ResNet-50 backbone. Supervised deformable DETR trained on 10%, 30%, and 50% labeled data of the TableBank-both split is also compared with the semi-supervised approach that uses the deformable transformer. The results show that the attention-based semi-supervised method achieves promising results without requiring a proposal generation process or post-processing steps such as non-maximum suppression (NMS).

PubLayNet:

The experiments discuss results for different percentages of labeled data on the PubLayNet table class dataset. The transformer-based semi-supervised method is also compared with previous deep-learning-based supervised and semi-supervised methods. Furthermore, results are given for all IoU thresholds on the PubLayNet dataset with 10% labeled data. Table 4 reports the results of the semi-supervised approach with the deformable transformer on PubLayNet table class data for different percentages of labeled data. Here, 10%, 30%, and 50% labeled data yield AP50 values of 98.5%, 98.8%, and 98.8%, respectively.

In addition, the semi-supervised network trained on 10% of the labeled PubLayNet dataset is evaluated at different IoU thresholds. Table 5 presents the results of the semi-supervised method at different IoU thresholds for the PubLayNet table class with 10% labeled data. A visual comparison of precision, recall, and F1-score for the semi-supervised network using a deformable transformer with a ResNet-50 backbone at different IoU thresholds on the 10% labeled PubLayNet table class dataset is shown in Figure 6(b). Here, blue represents precision, red represents recall, and green represents the F1-score at the different IoU thresholds.
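For reference, the F1-score plotted alongside precision and recall in Figure 6(b) is their harmonic mean:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a high F1 at a strict IoU threshold indicates that neither precision nor recall has collapsed.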

 

Comparison with previous supervised and semi-supervised methods

Table 6 compares deep-learning-based supervised and semi-supervised networks on the PubLayNet table class using a ResNet-50 backbone. Supervised deformable DETR trained on 10%, 30%, and 50% of the PubLayNet table class labeled data is also compared with the semi-supervised approach that uses the deformable transformer. It shows that the semi-supervised method provides competitive results without proposal generation or post-processing steps such as non-maximum suppression (NMS).
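For context, the NMS post-processing step that the DETR-style method avoids can be sketched as a greedy filter over scored boxes:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes.
    Returns indices of kept boxes, highest score first."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)          # highest-scoring remaining box survives
        keep.append(i)
        # Discard every remaining box that overlaps it too much.
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```

DETR-style detectors instead use one-to-one bipartite matching between queries and objects during training, so duplicate suppression like this is unnecessary at inference time.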

DocBank:

The experiments report results for different percentages of labeled data on the DocBank dataset. In Table 7, the transformer-based semi-supervised method is compared with previous CNN-based semi-supervised methods.

In addition, Table 8 compares the semi-supervised method at different proportions of labeled data with previous table detection and document analysis methods on different datasets. Although the authors' semi-supervised approach cannot be directly compared with previous supervised document analysis methods, it can be observed that even with 50% labeled data it obtains results similar to previous supervised methods.

 ICDAR-19:

The experiments also evaluate the table detection method on the ICDAR-19 Modern Track A dataset. The authors summarize the quantitative results of the method at different percentages of labeled data and compare them with previous supervised table detection methods in Table 9. Results are evaluated at the higher IoU thresholds of 0.8 and 0.9. For a direct comparison with previous table detection methods, the authors also evaluate the method on 100% of the labeled data, where it achieves 92.6% precision and 91.3% recall.

Ablation experiments:

Pseudo-label confidence threshold

The confidence threshold plays an important role in balancing the accuracy and the number of generated pseudo-labels. As the threshold increases, fewer samples pass the filter, but they are of higher quality. Conversely, a smaller threshold lets more samples through, but with a higher probability of false positives. The effects of thresholds from 0.5 to 0.9 are shown in Table 10. Based on these results, the optimal threshold is determined to be 0.7.
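The filtering step itself is simple. A minimal sketch, assuming each teacher prediction is a dict carrying a `score` field (a hypothetical representation, not the paper's data structure):

```python
def filter_pseudo_labels(predictions, threshold=0.7):
    """Keep only teacher predictions whose confidence meets the threshold."""
    return [p for p in predictions if p["score"] >= threshold]
```

With the ablation's optimum of 0.7, a prediction at 0.72 becomes a pseudo-label while one at 0.65 is discarded, trading some recall on the unlabeled set for cleaner training targets.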

The impact of the number of learnable queries

In this analysis, the authors investigate the effect of varying the number of queries input to the deformable DETR decoder. Figure 7 compares the prediction results for different numbers of object queries. Optimal performance is achieved when the number of queries N is set to 30; deviating from this value degrades performance. Table 11 shows and analyzes the results for different numbers of object queries. Choosing a small value of N may cause the model to miss some objects, negatively affecting its performance. On the other hand, choosing a large value of N may cause the model to underperform due to overfitting, as it incorrectly classifies some regions as objects. Moreover, in the teacher-student framework, the training complexity of the self-attention mechanism depends on the number of object queries, so minimizing the number of queries also reduces training complexity.
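The complexity point can be made concrete with a rough cost model: self-attention among the N object queries scales quadratically in N, so doubling the query count roughly quadruples this term. The function below is a hypothetical multiply-add count (the `hidden_dim` default of 256 is an assumption matching common DETR configurations), not a measured figure from the paper:

```python
def query_self_attention_cost(num_queries, hidden_dim=256):
    """Rough multiply-add count for one decoder self-attention layer:
    N x N attention scores plus the N x N weighted sum, each ~N*N*d."""
    return 2 * num_queries * num_queries * hidden_dim
```

Under this model, N = 30 costs 16 times less in query self-attention than the N = 120 sometimes used for crowded natural-image scenes, which suits documents that contain only a few tables per page.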

 

Conclusion

This paper presents a semi-supervised method for table detection in document images that uses the deformable transformer. The method alleviates the need for large-scale annotated data and simplifies the pipeline by integrating pseudo-label generation into a streamlined mechanism. Generating pseudo-labels during training creates a dynamic process known as the "flywheel effect," in which one model continually improves on the pseudo-boxes produced by the other as training progresses. In this framework, pseudo class labels and pseudo bounding boxes are refined by two distinct modules, the student and the teacher. These modules update each other through exponential moving average (EMA) updates to provide accurate classification and bounding-box predictions. The results show that the method outperforms supervised models when applied to 10%, 30%, and 50% of the TableBank and PubLayNet training data. Furthermore, the model is compared with current CNN-based semi-supervised baselines when trained on 10% of the labeled PubLayNet data. In the future, the authors aim to investigate the effect of the proportion of labeled data on final performance and to develop models that operate efficiently with a minimal amount of labeled data. They also intend to apply the transformer-based semi-supervised learning mechanism to table structure recognition tasks.

References:

Gao L C, Li Y B, Du L, Zhang X P, Zhu Z Y, Lu N, Jin L W, Huang Y S and Tang Z. 2022. A survey on table recognition technology. Journal of Image and Graphics, 27(6): 1898-1917.

Kasem M, Abdallah A, Berendeyev A, Elkady E, Abdalla M, Mahmoud M, Hamada M, Nurseitov D and Taj-Eddin I. 2022. Deep learning for table detection and structure recognition: a survey. arXiv:2211.08469.

Siddiqui S A, Malik M I, Agne S, Dengel A and Ahmed S. 2018. DeCNT: deep deformable CNN for table detection. IEEE Access, 6: 74151-74161. DOI: 10.1109/ACCESS.2018.2880211.

Shehzadi T, Hashmi K A, Stricker D, Liwicki M and Afzal M Z. 2023. Towards end-to-end semi-supervised table detection with deformable transformer. arXiv:2305.02769.

Origin: blog.csdn.net/INTSIG/article/details/130762474