[Paper Reading Notes 61] CluSTi: Clustering Method for Table Structure Recognition

Zucker, A., Belkada, Y., Vu, H. et al. CluSTi: Clustering Method for Table Structure Recognition in Scanned Images. Mobile Netw Appl 26, 1765–1776 (2021). https://doi.org/10.1007/s11036-021-01759-9

Sorbonne University, Paris, France

Keywords

  • Table structure recognition
  • Object recognition
  • Clustering method

1. Summary

First, a clustering algorithm (DBSCAN) removes heavy noise from the table image.
Second, a state-of-the-art text detection model extracts all text boxes (reference paper: CRAFT, Character Region Awareness for Text Detection).
Third, CluSTi groups the text boxes into their correct rows and columns using horizontal and vertical clustering (both DBSCAN) with optimized parameters.

2. Specific content

2.1 Identification process:

image-20211124160444007

2.2 Method process:

image-20211124155918463

1. Noise Removal

Noise Removal: Normal characters are generally tightly clustered together, while noise generally appears as outliers, so DBSCAN clustering is used to remove the isolated points.

image-20211124161203706
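
The noise-removal idea can be sketched in a few lines. This is only an illustration under assumptions that are not from the paper: the input is a list of dark-pixel coordinates, and the names `remove_noise`, `eps` and `min_pts` as well as the parameter values are hypothetical. The function keeps exactly the points that DBSCAN would place in a cluster (core points and their border points) and drops the ones it would label as noise.

```python
from math import hypot


def remove_noise(points, eps, min_pts):
    """Keep the points DBSCAN would assign to a cluster (core or border)."""
    def neighbors(i):
        xi, yi = points[i]
        # A point counts as its own neighbor, as in the usual DBSCAN definition.
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    neigh = [neighbors(i) for i in range(len(points))]
    is_core = [len(n) >= min_pts for n in neigh]
    # A point is noise when neither it nor any of its neighbors is a core point.
    return [p for i, p in enumerate(points)
            if any(is_core[j] for j in neigh[i])]


# Hypothetical data: a dense 4x3 patch of "character" pixels plus two specks.
text_pixels = [(x, y) for x in range(10, 14) for y in range(10, 13)]
specks = [(40, 40), (70, 5)]
cleaned = remove_noise(text_pixels + specks, eps=1.5, min_pts=4)
```

The brute-force neighbor search is quadratic in the number of points, so a real image would need a spatial index or a library implementation.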

2. Text Detection

Text Detection: a deep neural model, CRAFT (Character Region Awareness for Text Detection), detects the text boxes.

image-20211124160844451

3. Row Detection

Row Detection: horizontal clustering using DBSCAN with optimized parameters.

image-20211124161144676

First, calculate the centroid coordinates (x_c, y_c) of each detected text box. Then, normalize them along the x-axis. Finally, cluster the normalized centroids (x_n, y_n) using DBSCAN with optimized parameters.

The output is the number of rows and the row that each text box belongs to.
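
A minimal sketch of this row-detection step follows. Several things here are assumptions rather than details from the paper: boxes are given as `(x_min, y_min, x_max, y_max)`, the x-axis normalization is treated as simply dropping the x coordinate so that only y_c matters, and DBSCAN is run with a minimum cluster size of 1. Under that last assumption, one-dimensional DBSCAN reduces to splitting the sorted values wherever the gap exceeds `eps`, which is what the hypothetical helper `cluster_1d` does.

```python
def centroid(box):
    """Center (x_c, y_c) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def cluster_1d(values, eps):
    """Label values so that sorted neighbors at most eps apart share a label.

    This matches DBSCAN on one dimension when min_samples is 1.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > eps:
            current += 1  # gap too large: start a new cluster
        labels[cur] = current
    return labels


def detect_rows(boxes, eps):
    """Return (number of rows, row label of each box), rows numbered top-down."""
    ys = [centroid(b)[1] for b in boxes]
    labels = cluster_1d(ys, eps)
    return len(set(labels)), labels


# Hypothetical text boxes: two in the first row, two in the second, one in the third.
boxes = [(10, 10, 60, 30), (80, 12, 140, 32),
         (10, 50, 60, 70), (80, 48, 140, 68),
         (10, 90, 60, 110)]
n_rows, row_labels = detect_rows(boxes, eps=10)
```

Applying the same helper to the x centroids would give a rough column grouping, although the paper tunes the parameters separately for each direction.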

Fine-tuning: refine the result of the horizontal clustering.

image-20211124161710308

For multi-line text in one cell, use the Probing algorithm (see [33]).

image-20211124161853178

Reference: [33] Scholkmann F, Boss J, Wolf M (2012) An efficient algorithm for automatic peak detection in noisy periodic and quasi-periodic signals. Algorithms 5(4):588–603

4. Column Detection

Column Detection: vertical clustering using DBSCAN with optimized parameters.

image-20211124162318956

5. Cell Reconstruction:

Cells can be reconstructed by determining their actual width, height and coordinates.

image-20211124162410586
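
One way to picture the reconstruction step is sketched below. It is an assumption-laden illustration, not the paper's procedure: each row is taken to span the vertical extent of its text boxes, each column the horizontal extent of its boxes, and a cell is the intersection of one row band with one column band. The function name `reconstruct_cells`, the box format `(x_min, y_min, x_max, y_max)` and the output dictionary layout are all hypothetical.

```python
def reconstruct_cells(boxes, row_labels, col_labels):
    """Map (row, col) to the cell's x, y, width and height."""
    row_span, col_span = {}, {}
    for (x0, y0, x1, y1), r, c in zip(boxes, row_labels, col_labels):
        # Grow the vertical band of row r and the horizontal band of column c.
        top, bottom = row_span.get(r, (y0, y1))
        row_span[r] = (min(top, y0), max(bottom, y1))
        left, right = col_span.get(c, (x0, x1))
        col_span[c] = (min(left, x0), max(right, x1))

    cells = {}
    for r, (top, bottom) in row_span.items():
        for c, (left, right) in col_span.items():
            cells[(r, c)] = {"x": left, "y": top,
                             "width": right - left, "height": bottom - top}
    return cells


# Hypothetical 2x2 table: boxes with row and column labels already assigned.
boxes = [(10, 10, 60, 30), (80, 12, 140, 32),
         (12, 50, 55, 70), (85, 48, 130, 68)]
cells = reconstruct_cells(boxes, row_labels=[0, 0, 1, 1], col_labels=[0, 1, 0, 1])
```

Because every row band is crossed with every column band, this sketch also produces cells that contain no text box, which is one simple way to represent empty cells.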

3. Evaluation

3.1 Dataset:

397 table images from table-detection-dataset: https://github.com/sgrpanchal31/tabledetection-dataset

ICDAR 2013;

ICDAR 2019;

3.2 Experimental results:

image-20211124162713598

3.3 Comparison results with DeepDeSRT and TableNet

image-20211124172720628

[23] Paliwal SS, Vishwanath D, Rahul R, Sharma M, Vig L (2019) TableNet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In: 2019 International conference on document analysis and recognition (ICDAR). IEEE, pp 128–133

[34] Schreiber S, Agne S, Wolf I, Dengel A, Ahmed S (2017) DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR International conference on document analysis and recognition (ICDAR), vol 1. IEEE, pp 1162–1167

4. Summary

The method is simple and the approach is clear, with little theoretical discussion. I do not know whether the code is open source.

5. Related work

Review the DBSCAN density clustering algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method that handles noise.

It is a very typical density clustering algorithm. K-Means and BIRCH are generally suitable only for clustering convex sample sets, whereas DBSCAN can be applied to both convex and non-convex sample sets.
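
To make the non-convex point concrete, here is a small pure-Python sketch of the standard DBSCAN procedure, written for illustration rather than taken from the paper or from any library. The data set is hypothetical: a ring of points surrounding a compact blob, plus one far-away outlier. The ring is a non-convex cluster that a centroid-based method would struggle to separate from the blob inside it, while density-based expansion follows the ring around.

```python
from math import cos, hypot, pi, sin

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical data: 40 points on a ring of radius 5, a blob at the center, one outlier.
ring = [(5 * cos(2 * pi * k / 40), 5 * sin(2 * pi * k / 40)) for k in range(40)]
blob = [(0.0, 0.0), (0.5, 0.5), (-0.5, 0.5), (0.5, -0.5), (-0.5, -0.5)]
outlier = [(10.0, 10.0)]
labels = dbscan(ring + blob + outlier, eps=1.1, min_pts=3)
```

With these parameters the ring forms one cluster, the blob a second one, and the outlier is marked as noise.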

Density clustering ideas:

by hahppyprince 2021-11-24

Origin blog.csdn.net/ld326/article/details/121521291