Parsing large XML files in Java

Basic introduction

How to parse XML files in Java, and the advantages and disadvantages of each approach, is still a classic interview question; if you have never run into it, or no longer remember it, that only shows how long ago it was. Common configuration file formats in daily work include XML, YAML, properties, INI, and JSON. For simple configuration, XML may no longer be the first choice, but the format will not be abandoned: it still has many suitable application scenarios and advantages that the other formats cannot match.

For parsing XML, the commonly used in-memory APIs are DOM, JDOM, dom4j, SAX, and the JAXB component added in JDK 6. Beyond the differences between their implementations, parsing large XML files mainly comes down to how well you can control each API. In recent years I have used JAXB for XML parsing, and it really is the most convenient and easiest to use; but last year I encountered an XML file of more than 50 MB. Once a file gets that large, the first thing that becomes clear is that reading the whole file at once and then parsing it as XML is not an option, which means none of the components above can be used directly. On a colleague's recommendation I used Guava's Files utility class, calling Files.asCharSource(file, charset).readLines(callback) to read the file in segments: the lines of each segment are combined into a batch of XML fragments to parse, and all fragments are processed in a loop, avoiding the memory pressure that reading the whole file at once would put on the server. The key point, therefore, is using Guava's Files class to read the file content line by line through a buffered stream. FileUtils in Apache Commons IO, a similar utility class, also provides a readLines implementation for reading a file by line; its underlying implementation uses IOUtils, which ultimately reads the file in one pass with a BufferedReader and returns the content as a List<String>. That method is unsuitable for very large files, where it will cause an out-of-memory error, and is recommended only for smaller files. The Guava Files class used in this article also has a readLines method (marked deprecated in newer versions); the recommended replacement is asCharSource, whose underlying implementation supports a LineProcessor (line handler).
In practice we can produce and consume at the same time, or consume batch by batch as each batch is produced; in the simplest form the reader takes one line, waits for it to be consumed, and then reads the next, with the read data held in a queue.
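The queue-based produce/consume idea above can be sketched with a bounded BlockingQueue. This is a minimal illustration only; the class and method names are my own and are not part of the article's project:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {

    private static final String POISON = "__EOF__"; // sentinel marking end of input

    /** Produce n lines into a bounded queue and consume them; returns the consumed count. */
    static int run(int n) throws InterruptedException {
        // Bounded queue: the producer blocks when the consumer falls behind,
        // so memory use stays flat regardless of file size.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    queue.put("<tr>row " + i + "</tr>"); // blocks while the queue is full
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int consumed = 0;
        while (true) {
            String line = queue.take(); // blocks until the producer supplies a line
            if (POISON.equals(line)) {
                break;
            }
            consumed++;
        }
        producer.join();
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("consumed " + run(1000) + " lines"); // prints "consumed 1000 lines"
    }
}
```

The bounded capacity is the point of the design: it caps how far the reader can run ahead of the consumer, which is exactly the back-pressure a large-file reader needs.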

This example has two parts: 1. use code to generate a 150 MB XML file whose content is regular, written in segments delimited by keywords at the start and end of certain lines; 2. read the file back in segments according to the start tag and end tag of those special lines, and print the content read to the console (printing to the console stands in for the final processing of the data). Generating the file in part 1 is fairly simple; an example of the file structure is shown below:

(Example file format; the actual file used in the demonstration has 1 header line and 2,000,002 rows of data)
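The article does not list the part-1 generator code. Below is a minimal sketch of what such a generator might look like, assuming the multi-line `<tr>…</tr>` row layout described above; the class name, element names, and row count are illustrative placeholders, not the article's actual generator:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SampleXmlGenerator {

    /** Write rowCount <tr>...</tr> segments to the given file; returns the file size in bytes. */
    static long generate(Path file, int rowCount) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
            writer.newLine();
            writer.write("<table>");
            writer.newLine();
            for (int i = 1; i <= rowCount; i++) {
                // each row spans several lines, delimited by <tr> ... </tr>
                writer.write("<tr>");
                writer.newLine();
                writer.write("  <id>" + i + "</id>");
                writer.newLine();
                writer.write("  <name>row-" + i + "</name>");
                writer.newLine();
                writer.write("</tr>");
                writer.newLine();
            }
            writer.write("</table>");
            writer.newLine();
        }
        return Files.size(file);
    }

    public static void main(String[] args) throws IOException {
        // the article's real file uses roughly 2 million rows (~150 MB); 10_000 keeps this demo fast
        Path file = Paths.get("data.xml");
        System.out.println("wrote " + generate(file, 10_000) + " bytes to " + file);
    }
}
```

Writing through a BufferedWriter keeps the generator itself memory-safe, for the same reason the reader later streams line by line.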

Reference code

Define the data consumer

package cn.chendd.xml;

import java.util.List;

/**
 * Batch processing of file content read line by line
 *
 * @author chendd
 * @date 2023/2/18 8:51
 */
@FunctionalInterface
public interface RowBatchListProcessor {

    /**
     * Process one batch of data
     * @param rows text rows
     */
    void execute(List<String> rows);

}

Batch reading by line

package cn.chendd.xml;

import com.google.common.io.LineProcessor;

import java.util.List;

/**
 * Line-parsing implementation
 *
 * @author chendd
 * @date 2023/2/18 8:41
 */
public class RowLineProcessor implements LineProcessor<String> {

    /**
     * Number of assembled segments per processing batch
     */
    private static final int BATCH_SIZE = 200;

    private final List<String> rows;
    private final String beginMarker;
    private final String endMarker;
    private final RowBatchListProcessor processor;

    /**
     * Constructor
     * @param rows buffer that collects assembled segments
     * @param beginMarker begin-of-segment line marker
     * @param endMarker end-of-segment line marker
     * @param processor batch handler
     */
    public RowLineProcessor(List<String> rows, String beginMarker, String endMarker, RowBatchListProcessor processor) {
        this.rows = rows;
        this.beginMarker = beginMarker;
        this.endMarker = endMarker;
        this.processor = processor;
    }

    /**
     * Content accumulated for the current segment
     */
    private StringBuilder textBuilder = new StringBuilder();
    /**
     * Whether the reader is currently inside a segment
     */
    private boolean begin = false;

    @Override
    public boolean processLine(String line) {
        if (line.endsWith(beginMarker)) {
            // a segment-opening line: start accumulating
            begin = true;
        }
        if (line.endsWith(endMarker)) {
            // a segment-closing line: finish this segment and queue it
            begin = false;
            textBuilder.append(line);
            rows.add(textBuilder.toString());
            textBuilder.setLength(0);
        } else if (begin) {
            textBuilder.append(line);
        }
        // hand a full batch to the consumer and reuse the list
        if (rows.size() > 0 && rows.size() % BATCH_SIZE == 0) {
            processor.execute(rows);
            rows.clear();
        }
        return true;
    }

    @Override
    public String getResult() {
        // flush the final, possibly partial batch
        if (rows.isEmpty()) {
            return null;
        }
        this.processor.execute(rows);
        return null;
    }

}

Call example

package cn.chendd.xml;

import com.google.common.collect.Lists;
import com.google.common.io.Files;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.io.File;
import java.io.IOException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Verification of reading the file line by line
 *
 * @author chendd
 * @date 2023/2/18 9:59
 */
@RunWith(JUnit4.class)
public class RowLineReaderTest {

    @Test
    public void reader() throws IOException {
        AtomicInteger atomicInteger = new AtomicInteger();
        //batch data handling implementation
        RowBatchListProcessor execute = rows -> {
            System.out.println(String.format("Batch %d processed, %d rows!", atomicInteger.addAndGet(1), rows.size()));
        };
        RowLineProcessor processor = new RowLineProcessor(Lists.newArrayList(), "<tr>", "</tr>", execute);
        Files.asCharSource(this.getFile(), Charset.defaultCharset()).readLines(processor);
    }

    private File getFile() throws IOException {
        String fileFolder = URLDecoder.decode(getClass().getResource("").getFile(), StandardCharsets.UTF_8.name());
        return new File(fileFolder , "data.xml");
    }


}

Sample output

Other notes

(1) Read the file line by line through a buffered IO stream, using the special markers on certain lines as the start and end marks, so that the content is read in segments; each matched segment can then be mapped with JAXB and parsed;
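As a sketch of that JAXB step, one assembled `<tr>…</tr>` segment could be unmarshalled as follows. The Tr class and its fields are illustrative assumptions, not the article's actual model, and javax.xml.bind is built into JDK 6-8 but needs an explicit jaxb-runtime dependency on newer JDKs:

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.StringReader;

public class TrSegmentParser {

    @XmlRootElement(name = "tr")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Tr {
        @XmlElement private long id;
        @XmlElement private String name;
        public long getId() { return id; }
        public String getName() { return name; }
    }

    /** Unmarshal one assembled <tr>...</tr> fragment into a Tr object. */
    public static Tr parse(String segment) throws JAXBException {
        // JAXBContext creation is expensive; in real code build it once and reuse it,
        // since the context itself is thread-safe.
        JAXBContext context = JAXBContext.newInstance(Tr.class);
        return (Tr) context.createUnmarshaller().unmarshal(new StringReader(segment));
    }

    public static void main(String[] args) throws JAXBException {
        Tr tr = parse("<tr><id>42</id><name>demo</name></tr>");
        System.out.println(tr.getId() + " " + tr.getName());
    }
}
```

Each String in the batch handed to RowBatchListProcessor is already a complete fragment like this, so the unmarshal call can be dropped directly into the consumer lambda.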

(2) The code in this example parses a regular file that has special start-marker and end-marker lines; it cannot parse files of arbitrary format, but it can be applied to any file with a similarly regular structure;

(3) This example provides both the XML-generation and the XML-parsing code; the data set has 2 million rows and the file is larger than 150 MB;

(4) The source code project can be downloaded from: https://gitee.com/88911006/chendd-examples/tree/master/xml

(5) For more information, see: https://www.chendd.cn/blog/article/1626771133537509378.html

Origin blog.csdn.net/haiyangyiba/article/details/129098684