[Java] Use Tabula technology to extract data from tables in PDF files

One day the project team asked for the data in a set of PDF files to be extracted and stored for later use. The third-party system that produces the files does not provide a data interface, so parsing the PDFs was the only solution.

As far as I know, there are currently three ways to extract data from PDF files:

First, if the budget allows, you can have an AI service analyze the PDF content and output the data in the format you need;

Second, you can use OCR to recognize and extract the content;

Third, you can use a parsing library, which is the approach presented here. PDFBox is very popular in the open source community and its text extraction is very good, but it offers little support for tables. For table data extraction I chose Tabula.

What is Tabula?

Tabula is an open source tool for extracting tabular data from PDF documents. Its main technologies include:

  1. PDF parsing: Tabula uses the Apache PDFBox Java library to parse the content and layout of PDF documents. It can locate the coordinates of text blocks and images on each page;
  2. Table recognition: Tabula identifies the structure of tables by analyzing the layout of lines and text blocks on the page. It looks for vertical and horizontal lines as column and row separators;
  3. Cell extraction: After determining the structure of the table, Tabula analyzes the text block corresponding to each cell and extracts the text content of the cell;
  4. Data tidying: Tabula tries to tidy up the extracted table data automatically, for example by merging cells vertically and horizontally or handling tables that span pages. It also performs some text cleaning;
  5. Export format: Tabula supports exporting the extracted data to CSV, TSV and JSON. Users can import the result into other tools such as Excel for subsequent analysis;
  6. Optimization algorithm: Tabula uses optimized algorithms and heuristic rules during table analysis and data extraction to improve accuracy. It also provides an interactive interface in which users can correct the results.
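
To make item 2 more concrete, here is a toy sketch of the ruling-line idea in plain Java: given the x-coordinates of the vertical lines and the y-coordinates of the horizontal lines, each pair of adjacent lines bounds one cell. This is a deliberate simplification for illustration and not Tabula's actual algorithm; the class name `RulingGridSketch` and its method are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only: real ruling detection has to cope with broken,
// overlapping and slightly skewed lines, which this sketch ignores.
final class RulingGridSketch {

    /**
     * Builds cell rectangles from ruling coordinates (sorted internally).
     * Each cell is returned as {left, top, right, bottom}.
     */
    static List<double[]> cellsFromRulings(double[] verticalXs, double[] horizontalYs) {
        double[] xs = verticalXs.clone();
        double[] ys = horizontalYs.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);

        List<double[]> cells = new ArrayList<>();
        // Adjacent horizontal lines bound a row, adjacent vertical lines bound a column.
        for (int r = 0; r + 1 < ys.length; r++) {
            for (int c = 0; c + 1 < xs.length; c++) {
                cells.add(new double[] {xs[c], ys[r], xs[c + 1], ys[r + 1]});
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Three vertical and three horizontal lines enclose a 2 x 2 grid of cells.
        List<double[]> cells = cellsFromRulings(new double[] {0, 50, 120}, new double[] {0, 20, 40});
        for (double[] cell : cells) {
            System.out.println(Arrays.toString(cell));
        }
    }
}
```

Once the cell rectangles are known, assigning a text block to a cell is a matter of checking which rectangle contains the block's coordinates, which is what item 3 describes.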

How to use Tabula?

The first step is to add the dependency to the pom file, as shown below:

<dependency>
  <groupId>technology.tabula</groupId>
  <artifactId>tabula</artifactId>
  <version>1.0.5</version>
</dependency>

Then create a PDF utility class (PdfUtil):

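import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
// JSONArray, LOGGER and MARKER come from the project's own JSON and logging setup (elided here).

public class PdfUtil {

    ...

    private static final SpreadsheetExtractionAlgorithm SPREADSHEEET_EXTRACTION_ALGORITHM = new SpreadsheetExtractionAlgorithm();
    private static final ThreadLocal<List<String>> THREAD_LOCAL = new ThreadLocal<>();

    ...

    /**
     * Parses the tables in a PDF (private helper).
     * Almost every use of the tabula-java SDK walks a PDF this way, so the
     * traversal is factored out here, down to the cell level.
     *
     * @param pdfPath     path of the PDF file
     * @param customStart index of the first row to process in each table
     * @param callback    custom handling for the cells of each row
     * @return the JSON array filled in by the callback
     */
    private static JSONArray parsePdfTable(String pdfPath, int customStart, PdfCellCustomProcess callback) {
        JSONArray reJsonArr = new JSONArray(); // holds the parsed rows

        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            PageIterator pi = new ObjectExtractor(document).extract(); // page iterator

            // Walk through all pages
            while (pi.hasNext()) {
                Page page = pi.next(); // current page
                List<Table> tableList = SPREADSHEEET_EXTRACTION_ALGORITHM.extract(page); // all tables on the page

                // Walk through all tables
                for (Table table : tableList) {
                    List<List<RectangularTextContainer>> rowList = table.getRows(); // all rows of the table

                    // Walk through the rows and hand the cells of each row to the callback
                    for (int rowIndex = customStart; rowIndex < rowList.size(); rowIndex++) {
                        List<RectangularTextContainer> cellList = rowList.get(rowIndex); // cells of the row
                        callback.handler(cellList, rowIndex, reJsonArr);
                    }
                }
            }
        } catch (IOException e) {
            LOGGER.error(MARKER,
                    "function[PdfUtil.parsePdfTable] Exception [{} - {}] stackTrace[{}]",
                    e.getCause(), e.getMessage(), e.getStackTrace());
        } finally {
            // Always clear the thread-local header list so it cannot leak into the next call
            THREAD_LOCAL.remove();
        }
        return reJsonArr; // the parsed JSON array
    }

    ...

}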

Here PDF table parsing is first implemented along the lines of the official sample code. The general idea is:

  1. Create an empty JSONArray object reJsonArr to store the parsed table data;
  2. Load the PDF file at the given path with PDDocument.load inside a try-with-resources statement, so that the PDDocument object document is closed automatically;
  3. Use ObjectExtractor to obtain the page iterator pi from the document;
  4. Traverse the pages with a while loop, using pi.hasNext to check whether another page remains;
  5. Use pi.next to obtain the current page object page;
  6. Use SPREADSHEEET_EXTRACTION_ALGORITHM to parse all tables on the page and store the results in tableList;
  7. Traverse tableList with a for loop and, for each table:
    a. Use table.getRows to obtain the rows of the table and store them in rowList;
    b. Traverse rowList with a for loop, starting at index customStart, and for each row:
      i. Get the cells of the row with rowList.get and store them in cellList;
      ii. Pass cellList, rowIndex and reJsonArr to the handler method of the callback;
  8. Catch any IOException with try-catch and log the error;
  9. Remove the data held in THREAD_LOCAL in the finally block;
  10. Return the parsed JSONArray object reJsonArr.
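
The same control flow can be tried out without a PDF at hand. The sketch below replaces tabula-java's page and table types with nested lists of strings (an assumption made purely so that the example runs on the JDK alone) and keeps the shape of the loop: pages, then tables, then rows from customStart onward, each row handed to a callback. `TraversalSketch` and `RowHandler` are hypothetical names, not part of the article's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TraversalSketch {

    /** Hypothetical stand-in for a per-row callback. */
    @FunctionalInterface
    interface RowHandler {
        void handle(List<String> cells, int rowIndex, List<String> out);
    }

    /** pages -> tables -> rows -> cells, all modelled as plain lists. */
    static List<String> walk(List<List<List<List<String>>>> pages, int customStart, RowHandler handler) {
        List<String> out = new ArrayList<>();
        for (List<List<List<String>>> page : pages) {
            for (List<List<String>> table : page) {
                // Rows before customStart (for example a title row) are skipped.
                for (int rowIndex = customStart; rowIndex < table.size(); rowIndex++) {
                    handler.handle(table.get(rowIndex), rowIndex, out);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> table = Arrays.asList(
                Arrays.asList("Report title"),
                Arrays.asList("id", "name"),
                Arrays.asList("1", "Alice"));
        List<List<List<List<String>>>> pages = Arrays.asList(Arrays.asList(table));

        // The callback decides what to do with a row; the traversal does not care.
        List<String> lines = walk(pages, 1,
                (cells, rowIndex, out) -> out.add(rowIndex + ": " + String.join(",", cells)));
        System.out.println(lines); // prints [1: id,name, 2: 1,Alice]
    }
}
```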

The callback.handler callback is there mainly to decouple the per-cell processing from the PDF parsing code. The callback interface is defined as follows:

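@FunctionalInterface
public interface PdfCellCustomProcess {

    /**
     * Custom callback invoked once for each table row.
     *
     * @param cellList  all cells of the current row
     * @param rowIndex  index of the current row
     * @param reJsonArr array that collects the results
     */
    void handler(List<RectangularTextContainer> cellList, int rowIndex, JSONArray reJsonArr);
}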

Here cellList receives all cells of the current row, rowIndex receives the index of the current row, and reJsonArr is the array into which the results are collected. The specific implementation code is as follows:

public class PdfUtil {

    ...

    /**
     * Parses simple tables in a PDF and returns them as a JSON array.
     *
     * @param pdfPath     path of the PDF file
     * @param customStart index of the header row; the data rows follow it
     * @return one JSON object per data row, keyed by the header texts
     */
    public static JSONArray parsePdfSimpleTable(String pdfPath, int customStart) {
        return parsePdfTable(pdfPath, customStart, (cellList, rowIndex, reArr) -> {
            JSONObject jsonObj = new JSONObject();
            // A header row starts a fresh header list, so each table is keyed by its own
            // header; data rows reuse the list that the header row saved.
            List<String> headList = (rowIndex == customStart || THREAD_LOCAL.get() == null)
                    ? new ArrayList<>()
                    : THREAD_LOCAL.get();

            // Walk through the cells and read the text of each one
            for (int colIndex = 0; colIndex < cellList.size(); colIndex++) {
                String text = cellList.get(colIndex).getText().replace("\r", " ");
                if (rowIndex == customStart) {
                    headList.add(text);
                } else {
                    jsonObj.put(headList.get(colIndex), text);
                }
            }

            if (rowIndex == customStart) {
                THREAD_LOCAL.set(headList);
            }

            if (!jsonObj.isEmpty()) {
                reArr.add(jsonObj);
            }
        });
    }

    ...

}

The main part of the code is a lambda expression that is passed as a parameter to the parsePdfTable method and implements the PdfCellCustomProcess interface. The body of the lambda first creates a JSONObject and then iterates through the list of cells to obtain the text content of each cell.

If the current row index equals the custom starting row index, the row is the header row and each cell text is added to headList; otherwise the text is put into jsonObj as a value, keyed by the header of the same column. After the loop, the header row stores headList in the THREAD_LOCAL thread-local variable so that later rows can look it up, and a non-empty jsonObj is added to reArr. The lambda itself returns nothing; parsePdfTable returns the filled array as the result.
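
Stripped of the tabula-java types and the ThreadLocal, the header-then-data logic comes down to the following. This is a simplified sketch under assumptions rather than the code above: rows are plain lists of strings, `HeaderMappingSketch` is a made-up name, and a LinkedHashMap is used so that the columns keep their original order (whether the JSONObject in the article preserves key order depends on the JSON library, which the article does not name).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderMappingSketch {

    /** Treats rows.get(headerRow) as column names and maps every later row onto them. */
    static List<Map<String, String>> toRecords(List<List<String>> rows, int headerRow) {
        List<Map<String, String>> records = new ArrayList<>();
        if (headerRow < 0 || headerRow >= rows.size()) {
            return records; // no header row, nothing to map
        }
        List<String> header = rows.get(headerRow);
        for (int r = headerRow + 1; r < rows.size(); r++) {
            List<String> cells = rows.get(r);
            Map<String, String> record = new LinkedHashMap<>();
            // Guard against ragged rows: only pair up columns that exist on both sides.
            for (int c = 0; c < Math.min(header.size(), cells.size()); c++) {
                // As in the article, carriage returns inside a data cell become spaces.
                record.put(header.get(c), cells.get(c).replace("\r", " "));
            }
            if (!record.isEmpty()) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("1", "Alice\rSmith"),
                Arrays.asList("2", "Bob"));
        System.out.println(toRecords(rows, 0));
        // prints [{id=1, name=Alice Smith}, {id=2, name=Bob}]
    }
}
```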

Finally, you only need to call the method from main to get the parsed JSONArray. Printing a JSONArray directly is not very readable, though, so I wrote a method that converts its contents into a Markdown table, as shown below:

private static String outputMdFormatForVerify(JSONArray jsonArr) {
    StringBuilder mdStrBld = new StringBuilder();
    StringBuilder headerStrBld = new StringBuilder("|");
    StringBuilder segmentStrBld = new StringBuilder("|");
    for (int row = 0; row < jsonArr.size(); row++) {
        StringBuilder bodyStrBld = new StringBuilder("|");
        JSONObject rowObj = (JSONObject) jsonArr.get(row);
        if (row == 0) {
            // The keys of the first record become the header line and the separator line
            rowObj.forEach((k, v) -> {
                headerStrBld.append(" ").append(k).append(" |");
                segmentStrBld.append(" ").append("---").append(" |");
            });
            headerStrBld.append("\n");
            segmentStrBld.append("\n");
            mdStrBld.append(headerStrBld).append(segmentStrBld);
        }
        // Every record, including the first, contributes one body line
        rowObj.forEach((k, v) -> bodyStrBld.append(v).append("|"));
        bodyStrBld.append("\n");
        mdStrBld.append(bodyStrBld);
    }
    return mdStrBld.toString();
}

This should be easier to understand, so I won’t go into details here.
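
For reference, here is what such a conversion produces on plain collections. The sketch is a stand-in built on assumptions rather than the method above: it takes a list of maps instead of a JSONArray, looks values up by column name so that every row follows the header order, and `MarkdownTableSketch` is a made-up name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MarkdownTableSketch {

    /** Renders records as a Markdown table; the keys of the first record become the header. */
    static String toMarkdown(List<Map<String, String>> records) {
        if (records.isEmpty()) {
            return "";
        }
        List<String> columns = new ArrayList<>(records.get(0).keySet());
        StringBuilder md = new StringBuilder("|");
        for (String column : columns) {
            md.append(" ").append(column).append(" |");
        }
        md.append("\n|");
        for (int i = 0; i < columns.size(); i++) {
            md.append(" --- |");
        }
        md.append("\n");
        for (Map<String, String> record : records) {
            md.append("|");
            for (String column : columns) {
                // A missing value is rendered as an empty cell instead of shifting the row.
                md.append(" ").append(record.getOrDefault(column, "")).append(" |");
            }
            md.append("\n");
        }
        return md.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("name", "Alice");
        List<Map<String, String>> records = new ArrayList<>();
        records.add(row);
        System.out.print(toMarkdown(records));
        // | id | name |
        // | --- | --- |
        // | 1 | Alice |
    }
}
```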

The above code handles ordinary PDF tables without trouble, but it does not cope well with merged cells. Merged cells come in three forms: horizontal merges, vertical merges and a mix of both. It is not that the tabula-java SDK cannot handle them, it is just more work. tabula-java exposes the height and width of each cell, so the idea is to first do a full traversal that records the position of every cell in a two-dimensional array, then construct a virtual table from the heights and widths, and finally backfill the data according to that array. This is also one of the reasons for using a callback to separate the cell operations: it prepares the ground for merged-cell parsing later.

To be honest, all of the above is only the idea. I haven't written the code for merged-cell parsing yet (that one is on me). I will share it with you once it is finished.
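
Until that code exists, the idea can at least be sketched independently of tabula-java. The following is one possible reading of the approach and not the author's implementation: `MergedCellSketch` and its `Cell` class are hypothetical, each cell is assumed to carry the coordinates of its top-left corner plus its width and height, and coordinates are compared exactly, whereas real PDF coordinates would need a small tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class MergedCellSketch {

    /** Hypothetical cell model: top-left corner, size and text. */
    static final class Cell {
        final double x, y, width, height;
        final String text;

        Cell(double x, double y, double width, double height, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
            this.text = text;
        }
    }

    /** Expands merged cells into a full grid by copying their text into every slot they cover. */
    static String[][] expand(List<Cell> cells) {
        // Pass 1: every distinct left or top edge starts a column or row of the virtual table.
        TreeSet<Double> lefts = new TreeSet<>();
        TreeSet<Double> tops = new TreeSet<>();
        for (Cell cell : cells) {
            lefts.add(cell.x);
            tops.add(cell.y);
        }
        List<Double> colStarts = new ArrayList<>(lefts);
        List<Double> rowStarts = new ArrayList<>(tops);

        // Pass 2: backfill. A cell fills each slot whose start lies inside its width and height.
        String[][] grid = new String[rowStarts.size()][colStarts.size()];
        for (Cell cell : cells) {
            for (int r = 0; r < rowStarts.size(); r++) {
                double top = rowStarts.get(r);
                if (top < cell.y || top >= cell.y + cell.height) {
                    continue;
                }
                for (int c = 0; c < colStarts.size(); c++) {
                    double left = colStarts.get(c);
                    if (left >= cell.x && left < cell.x + cell.width) {
                        grid[r][c] = cell.text;
                    }
                }
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell(0, 0, 100, 20, "Sales"),   // merged horizontally over two columns
                new Cell(0, 20, 50, 40, "Region"),  // merged vertically over two rows
                new Cell(50, 20, 50, 20, "Q1"),
                new Cell(50, 40, 50, 20, "Q2"));
        System.out.println(Arrays.deepToString(expand(cells)));
        // prints [[Sales, Sales], [Region, Q1], [Region, Q2]]
    }
}
```

Because horizontal, vertical and mixed merges all reduce to "one rectangle covering several slots", a single backfill pass covers the three merging modes in this simplified model.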

Origin blog.csdn.net/kida_yuan/article/details/132857890