On Excel parsing: how to handle Excel files with millions of rows? | JD Cloud technical team

1. Introduction

Excel spreadsheets are widely used in back-office management systems, mostly for batch configuration and data export, so Excel processing is unavoidable in daily development.

So how do we properly handle Excel files containing huge amounts of data without running into memory overflow? This article compares and analyzes the mainstream Excel parsing techniques and offers solutions.

If this is your first encounter with Excel parsing, start with the basic concepts in Chapter 2; if you already have some understanding of POI, skip ahead to Chapter 3 for the core content of this article.

2. Basics: POI

When it comes to reading and writing Excel, you cannot avoid the big name in this field: POI.

Apache POI is a free, open-source, cross-platform Java API from the Apache Software Foundation. It lets us use Java to read, write, and modify Microsoft Office documents, including Word, Excel, PowerPoint, and Visio files.

(1) "Bad" spreadsheets

In POI, each document type has a corresponding format name. For example, the format for 97-2003 Excel files (.xls) is HSSF, short for Horrible SpreadSheet Format. Although Apache humorously and modestly calls its own formats "horrible", the API is in fact comprehensive and powerful.

Here are some of the "horrible" POI document formats, covering Excel, Word, and more:

| Office document | Corresponding POI format |
| --- | --- |
| Excel (.xls) | HSSF (Horrible SpreadSheet Format) |
| Word (.doc) | HWPF (Horrible Word Processor Format) |
| Visio (.vsd) | HDGF (Horrible DiaGram Format) |
| PowerPoint (.ppt) | HSLF (Horrible Slide Layout Format) |

(2) Introduction to OOXML

With Office 2007, Microsoft introduced an XML-based technical specification: Office Open XML, or OOXML for short. Unlike the binary storage of older versions, documents under the new specification are written as XML and compressed in ZIP format, which greatly improves standardization and compression, reduces file size, and preserves backward compatibility. Put simply, OOXML defines how to represent an Office document as a set of XML files.

Xlsx files are essentially XML

Let's look at the structure of an xlsx file under the OOXML standard. If we right-click an xlsx file, we find it can be opened with a ZIP decompression tool (or rename the extension to .zip and decompress), which means xlsx files are ZIP archives. After decompression, you will see the following directory layout:

Open the "/xl" directory, which contains the main structural information of the Excel file:

Here, workbook.xml stores the structure of the whole workbook, including which sheets it contains, while the structure of each individual sheet is stored in the /worksheets folder. styles.xml stores cell formatting, and the /theme folder stores predefined fonts, colors, and other data. To reduce file size, all string data in the sheets is stored centrally in sharedStrings.xml. It is easy to see that the main data of an xlsx file is written as XML.
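Since an xlsx file is just a ZIP archive of XML parts, we can verify this directory layout programmatically. Below is a minimal sketch using only the JDK's java.util.zip; the file name skus.xlsx is a placeholder for any xlsx file you have at hand:

```java
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListXlsxEntries {
    public static void main(String[] args) throws Exception {
        // any .xlsx file works here; the path is a placeholder
        try (ZipFile zip = new ZipFile("skus.xlsx")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                // for a real workbook this typically includes [Content_Types].xml,
                // xl/workbook.xml, xl/worksheets/sheet1.xml, xl/sharedStrings.xml, ...
                System.out.println(entries.nextElement().getName());
            }
        }
    }
}
```

Running this against any workbook prints the same /xl structure described above, without any Excel library involved.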

XSSF format

To support the new standard, POI also provides a set of APIs compatible with OOXML, called poi-ooxml. For example, the POI format corresponding to Excel 2007 files (.xlsx) is XSSF (XML SpreadSheet Format).

The following is part of the OOXML document format:

| Office document | Corresponding POI format |
| --- | --- |
| Excel (.xlsx) | XSSF (XML SpreadSheet Format) |
| Word (.docx) | XWPF (XML Word Processor Format) |
| Visio (.vsdx) | XDGF (XML DiaGram Format) |
| PowerPoint (.pptx) | XSLF (XML Slide Layout Format) |

(3) UserModel

POI provides two models for parsing Excel: UserModel (the user model) and EventModel (the event model). Both can process Excel files, but they differ in parsing approach, efficiency, and memory usage. The simplest and most practical is UserModel.

UserModel & DOM parsing

The user model defines the following interfaces:

  1. Workbook: a workbook, corresponding to one Excel document. Depending on the version, there are HSSFWorkbook, XSSFWorkbook, and other implementations.

  2. Sheet: a worksheet; an Excel document contains one or more sheets, with HSSFSheet, XSSFSheet, and so on.

  3. Row: a row; a sheet consists of rows, with HSSFRow, XSSFRow, and so on.

  4. Cell: a cell; a row consists of cells, with HSSFCell, XSSFCell, and so on.

User model representation

As you can see, the user model fits the mental model of an Excel user and is easy to understand: it is just like opening an Excel spreadsheet. It also provides a rich API that supports the same operations you would perform in Excel, such as creating a sheet, creating a row, getting the number of rows in a sheet, getting the number of columns in a row, and reading and writing cell values.

Why can UserModel support such rich operations? Because UserModel parses all of the XML nodes in the file into a DOM tree and loads the entire tree into memory, so any XML node can be randomly accessed.

UserModel data conversion

Now that we know the user model, we can use its API directly for all kinds of Excel operations. A more convenient approach, though, is to use the user model to convert an Excel file into the Java data structure we want, for easier processing.

It is natural to think of a relational database here, because the two are essentially alike. By analogy with a database table, our idea is as follows:

  1. Treat a Sheet as two parts, the header and the data, holding the table's structure and its contents respectively.

  2. For the header (the first row), check whether the header matches the attributes defined on the entity class.

  3. For the data (the remaining rows), traverse each Row from top to bottom, convert each row into an object with each column as one of its attributes, and collect an object list containing all the data in the Excel file.

From there we can process the data however we need. If we want to write the processed data back to Excel, the same logic applies in reverse.

Use UserModel

Let's see how to read an Excel file with UserModel. This example uses POI 4.0.0; first add the poi and poi-ooxml dependencies:

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>4.0.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.0.0</version>
    </dependency>

We want to read a simple Sku information table, the content is as follows:

How do we convert the UserModel information into a data list?

We can define the mapping from header columns to data using reflection plus annotations, which converts the UserModel into data objects. The basic idea: ① Define a custom annotation holding a column index, marking which Excel column each attribute of the entity class corresponds to. ② In the entity class, annotate each attribute according to the table structure. ③ Via reflection, obtain each attribute's column index and read the attribute's value from the corresponding column.

The following is a simple implementation. First, prepare the custom annotation ExcelCol, which contains the column index and header:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Target({ElementType.FIELD})
@Retention(RetentionPolicy.RUNTIME)
public @interface ExcelCol {

    /**
     * column index
     */
    int index() default 0;

    /**
     * header name of this column
     */
    String header() default "";
}

Next, define the Sku class according to the Sku fields and add the annotations, with column indexes 0, 1, and 2 and the header names specified:

import lombok.Data;
import org.shy.xlsx.annotation.ExcelCol;

@Data
public class Sku {

    @ExcelCol(index = 0, header = "sku")
    private Long id;

    @ExcelCol(index = 1, header = "名称")
    private String name;

    @ExcelCol(index = 2, header = "价格")
    private Double price;
}

Then use reflection to collect each annotated Field of the entity and store it in a Map keyed by column index. Starting from the second row of the sheet (the first row is the header), traverse each row; for each attribute, fetch the Cell at its column index and assign its value to the data object. Cells are handled differently depending on their value type, such as text or number. To keep the logic simple, only the types that actually appear in this table are handled below; other cases are similar. The complete code is as follows:

import com.alibaba.fastjson.JSON;
import org.apache.commons.lang3.StringUtils;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.shy.domain.pojo.Sku;
import org.shy.xlsx.annotation.ExcelCol;

import java.io.FileInputStream;
import java.lang.reflect.Field;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MyUserModel {

    public static void main(String[] args) throws Exception {
        List<Sku> skus = parseSkus("D:\\sunhaoyu8\\Documents\\Files\\skus.xlsx");
        System.out.println(JSON.toJSONString(skus));
    }

    public static List<Sku> parseSkus(String filePath) throws Exception {
        FileInputStream in = new FileInputStream(filePath);
        Workbook wk = new XSSFWorkbook(in);
        Sheet sheet = wk.getSheetAt(0);
        // the converted data list
        List<Sku> skus = new ArrayList<>();

        // collect Sku annotation info (column index -> field)
        Map<Integer, Field> fieldMap = new HashMap<>(16);
        for (Field field : Sku.class.getDeclaredFields()) {
            ExcelCol col = field.getAnnotation(ExcelCol.class);
            if (col == null) {
                continue;
            }
            field.setAccessible(true);
            fieldMap.put(col.index(), field);
        }

        for (int rowNum = 1; rowNum <= sheet.getLastRowNum(); rowNum++) {
            Row r = sheet.getRow(rowNum);
            Sku sku = new Sku();
            for (int cellNum = 0; cellNum < fieldMap.size(); cellNum++) {
                Cell c = r.getCell(cellNum);
                if (c != null) {
                    setFieldValue(fieldMap.get(cellNum), getCellValue(c), sku);
                }
            }
            skus.add(sku);
        }
        return skus;
    }

    public static void setFieldValue(Field field, String value, Sku sku) throws Exception {
        if (field == null) {
            return;
        }
        // get the type of this field
        String type = field.getType().toString();
        if (StringUtils.isBlank(value)) {
            field.set(sku, null);
        } else if (type.endsWith("String")) {
            field.set(sku, value);
        } else if (type.endsWith("long") || type.endsWith("Long")) {
            field.set(sku, Long.parseLong(value));
        } else if (type.endsWith("double") || type.endsWith("Double")) {
            field.set(sku, Double.parseDouble(value));
        } else {
            field.set(sku, value);
        }
    }

    public static String getCellValue(Cell cell) {
        DecimalFormat df = new DecimalFormat("#.##");
        if (cell == null) {
            return "";
        }
        switch (cell.getCellType()) {
            case NUMERIC:
                return df.format(cell.getNumericCellValue());
            case STRING:
                return cell.getStringCellValue().trim();
            case BLANK:
                return null;
            default:
                return "";
        }
    }
}

Finally, print the converted data list. The output looks like this:

[{"id":345000,"name":"电脑A","price":5999.0},{"id":345001,"name":"手机C","price":4599.0}]

Tip: if your program throws a NoClassDefFoundError, add the ooxml-schemas dependency:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.4</version>
</dependency>

See the table below for version selection. For example, POI 4.0.0 corresponds to ooxml-schemas version 1.4:

Limitations of UserModel

The above logic works for most Excel files, but its biggest drawback is high memory overhead, because all of the data is loaded into memory. In practice, with the three-column sheet above, an OOM occurs at around 70,000 rows, while an xls file can hold up to 65,535 rows and an xlsx file up to 1,048,576 rows. Reading tens of thousands, let alone millions, of rows into memory carries an extremely high risk of memory overflow.

So how do we get past the traditional UserModel's inability to handle large Excel files? Developers have come up with many clever solutions; see the next chapter.

3. Advanced: Exploring memory optimization

Next comes the core of this article, which answers the question posed at the start: how do we optimize memory usage during Excel parsing so we can process files with millions of rows?

(1) EventModel

As mentioned earlier, besides UserModel, POI provides another model for parsing Excel: the EventModel. Unlike the user model's DOM parsing, the event model parses Excel using SAX.

EventModel & SAX parsing

SAX stands for Simple API for XML and is an event-driven XML parsing approach. Unlike DOM, which reads the whole XML document at once, SAX processes the XML while reading it. Simply put, the SAX parser scans the document line by line, and whenever it encounters a tag it triggers the corresponding event handler. All we have to do is extend the DefaultHandler class and override a series of event-handling methods to process the Excel file accordingly.

Here is a simple SAX parsing example. This is the XML to be parsed: a sku table containing two sku nodes, each with an id attribute and two child nodes.

<?xml version="1.0" encoding="UTF-8"?>
<skus>
    <sku id="345000">
        <name>电脑A</name>
        <price>5999.0</price>
   </sku>
    <sku id="345001">
        <name>手机C</name>
        <price>4599.0</price>
   </sku>
</skus>

Create a Java entity class matching the XML structure:

import lombok.Data;

@Data
public class Sku {
    private Long id;
    private String name;
    private Double price;
}

Define a custom event handler class, SkuHandler:

import com.alibaba.fastjson.JSON;
import org.shy.domain.pojo.Sku;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SkuHandler extends DefaultHandler {
    /**
     * the sku currently being processed
     */
    private Sku sku;
    /**
     * name of the node currently being processed
     */
    private String tagName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if ("sku".equals(qName)) {
            sku = new Sku();
            sku.setId(Long.valueOf((attributes.getValue("id"))));
        }
        tagName = qName;
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if ("sku".equals(qName)) {
            System.out.println(JSON.toJSONString(sku));
            // handle business logic
            // ...
        }
        tagName = null;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if ("name".equals(tagName)) {
            sku.setName(new String(ch, start, length));
        }
        if ("price".equals(tagName)) {
            sku.setPrice(Double.valueOf(new String(ch, start, length)));
        }
    }
}

SkuHandler overrides three event-handling methods:

startElement(): called whenever a new XML element is scanned, receiving the tag name qName and the attribute list attributes;

characters(): called whenever character data outside XML tags is scanned, receiving the character array, start offset, and length;

endElement(): called whenever the end tag of an XML element is scanned, receiving the tag name qName.

We use a variable, tagName, to track the node currently being scanned, updating it whenever the scanned node changes.

A single Sku instance holds the Sku currently in memory. Whenever a complete Sku has been read, we print it and run the corresponding business logic. This way a Sku is read and processed one at a time, while parsing proceeds. Since every row has the same structure, only one Sku ever needs to be held in memory, which avoids reading everything into memory at once.

To invoke the SAX parser, create a parser instance with SAXParserFactory and parse the input stream. The main method is as follows:

import org.shy.xlsx.sax.handler.SkuHandler;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.InputStream;

public class MySax {
    public static void main(String[] args) throws Exception {
        parseSku();
    }

    public static void parseSku() throws Exception {
        SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
        SAXParser saxParser = saxParserFactory.newSAXParser();
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("skus.xml");
        saxParser.parse(inputStream, new SkuHandler());
    }
}

The output is as follows:

{"id":345000,"name":"电脑A","price":5999.0}
{"id":345001,"name":"手机C","price":4599.0}

The above demonstrates the basic principle of SAX parsing. The EventModel API is more complex, but its SAX parsing is likewise implemented by overriding event handlers. Interested readers can refer to the sample code on the POI website: https://poi.apache.org/components/spreadsheet/how-to.html

Limitations of EventModel

Although POI's official EventModel API uses SAX to solve the problems of DOM parsing, it has several limitations:

① It is a low-level API with little abstraction; it is relatively complex and costly to learn and use.

② HSSF and XSSF files are handled differently, so code must branch on the file type.

③ It does not completely eliminate memory overflow; there is still room to reduce memory overhead.

④ It only supports reading Excel, not writing.

For these reasons, the author does not recommend POI's native EventModel. For better alternatives, read on.

(2) SXSSF

Introduction to SXSSF

SXSSF, short for Streaming XML SpreadSheet Format, is a low-memory streaming Excel API introduced in POI 3.8-beta3 to solve the memory problem when writing Excel. It is an extension of XSSF: when you need to write a large amount of data, you simply replace XSSF with SXSSF. SXSSF works on a sliding window: it keeps a fixed number of rows in memory and flushes the rest to disk. The benefit is lower memory usage; the cost is losing random access. SXSSF is compatible with most of the XSSF API, making it a natural fit for developers who already know UserModel.

Memory optimization inevitably brings some limitations:

① Only a limited number of rows are accessible at any point in time, because the rest are no longer in memory.

② XSSF APIs that require random access are not supported, such as deleting or moving rows, cloning sheets, and formula evaluation.

③ Reading Excel is not supported.

④ Because it extends XSSF, it cannot write xls files.
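The article does not show SXSSF in action, so here is a minimal write sketch under the same POI 4.0.0 poi-ooxml dependency introduced earlier; the output path and row count are arbitrary placeholders:

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import java.io.FileOutputStream;

public class SxssfWriteDemo {
    public static void main(String[] args) throws Exception {
        // sliding window: keep at most 100 rows in memory,
        // older rows are flushed to a temp file on disk
        try (SXSSFWorkbook wb = new SXSSFWorkbook(100)) {
            Sheet sheet = wb.createSheet("skus");
            for (int rowNum = 0; rowNum < 100_000; rowNum++) {
                Row row = sheet.createRow(rowNum);
                Cell cell = row.createCell(0);
                cell.setCellValue("sku-" + rowNum);
            }
            try (FileOutputStream out = new FileOutputStream("skus-out.xlsx")) {
                wb.write(out);
            }
            // delete the temporary files SXSSF created on disk
            wb.dispose();
        }
    }
}
```

Apart from the constructor and the final dispose() call, the code is identical to XSSF code, which is exactly the migration path SXSSF is designed for.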

Comparison of UserModel, EventModel, and SXSSF

This covers all of POI's Excel APIs. The table below, from the POI website, compares their capabilities:

As the comparison shows, UserModel, based on DOM parsing, is the most fully featured and supports random access; its one drawback is poor CPU and memory efficiency.

EventModel is POI's streaming read solution. Based on SAX parsing, it supports forward access only and little of the rest of the API.

SXSSF is POI's streaming write solution. It likewise supports only forward access, along with part of the XSSF API.

(3) EasyExcel

Introduction to EasyExcel

To address the problems of POI's native SAX parsing, Alibaba built EasyExcel on top of POI. Here is the introduction quoted from the EasyExcel website:

The well-known Java frameworks for parsing and generating Excel are Apache POI and jxl, but both suffer from a serious problem: heavy memory consumption. POI does have a SAX-mode API that mitigates memory overflow to some extent, but it still has defects; for example, the decompression of 07-format Excel files, and the storage of the decompressed data, happen in memory, so memory usage remains very high. easyexcel rewrites POI's parsing of the 07 format: a 3 MB Excel file that takes roughly 100 MB of memory with POI's SAX parsing can be brought down to a few MB with easyexcel, and no Excel file, however large, will cause a memory overflow. For the 03 format, it builds a model-conversion layer on top of POI's SAX mode to make it simpler and more convenient to use.

As the introduction says, EasyExcel also uses SAX parsing, but it rewrites the SAX parsing of xlsx files to reduce memory overhead, and wraps the handling of xls files in a higher-level layer to reduce the cost of use. API-wise, it defines Excel entity classes with annotations, which is easy to use; it reads Excel through an event listener, a greatly simplified API compared with the native EventModel; and when writing data, it reduces memory overhead by writing in repeated batches.

EasyExcel's biggest advantage is ease of use: you can pick it up in ten minutes. Because it wraps POI's API at a high level, it suits developers who would rather not learn POI's low-level APIs. All in all, EasyExcel is worth a look.

Use EasyExcel

Add the easyexcel dependency:

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.2.3</version>
</dependency>

First, define the Excel entity class with annotations:

import com.alibaba.excel.annotation.ExcelProperty;
import lombok.Data;

@Data
public class Sku {
    @ExcelProperty(index = 0)
    private Long id;

    @ExcelProperty(index = 1)
    private String name;

    @ExcelProperty(index = 2)
    private Double price;
}

Next, override the invoke and doAfterAllAnalysed methods of AnalysisEventListener. They are called when a single row has been parsed and when the whole file has been parsed, respectively. Each time a row is parsed, we print the result; the code is as follows:

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;
import com.alibaba.fastjson.JSON;
import org.shy.domain.pojo.easyexcel.Sku;

public class MyEasyExcel {
    public static void main(String[] args) {
        parseSku();
    }

    public static void parseSku() {
        // file path to read
        String fileName = "D:\\sunhaoyu8\\Documents\\Files\\excel.xlsx";
        // read the excel
        EasyExcel.read(fileName, Sku.class, new AnalysisEventListener<Sku>() {
            @Override
            public void invoke(Sku sku, AnalysisContext analysisContext) {
                System.out.println("Row " + analysisContext.getCurrentRowNum() + ": " + JSON.toJSONString(sku));
            }

            @Override
            public void doAfterAllAnalysed(AnalysisContext analysisContext) {
                System.out.println("All rows parsed");
            }
        }).sheet().doRead();
    }
}

Let's test it by parsing a 100,000-row Excel file, one that triggers an OOM when read with UserModel:

The result:
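EasyExcel also covers the write side mentioned above. Here is a minimal write sketch; the entity mirrors the annotated Sku class from before (written out without Lombok to stay self-contained), and the output path is a placeholder:

```java
import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;

import java.util.ArrayList;
import java.util.List;

public class MyEasyExcelWrite {

    // same shape as the Sku entity above, with explicit getters/setters
    public static class SkuRow {
        @ExcelProperty(index = 0)
        private Long id;
        @ExcelProperty(index = 1)
        private String name;
        @ExcelProperty(index = 2)
        private Double price;

        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Double getPrice() { return price; }
        public void setPrice(Double price) { this.price = price; }
    }

    public static void main(String[] args) {
        List<SkuRow> rows = new ArrayList<>();
        SkuRow row = new SkuRow();
        row.setId(345000L);
        row.setName("电脑A");
        row.setPrice(5999.0);
        rows.add(row);
        // write the list to one sheet; EasyExcel flushes rows in batches internally
        EasyExcel.write("skus-out.xlsx", SkuRow.class).sheet("sheet1").doWrite(rows);
    }
}
```

The read and write APIs are symmetric, which is part of why EasyExcel is so quick to pick up.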

(4) Xlsx-streamer

Introduction to Xlsx-streamer

Xlsx-streamer is a streaming Excel reading tool, also built on top of POI. EasyExcel solves Excel reading well, but it parses with SAX, which requires implementing a listener and processing events as they are fired. Is there another way? Xlsx-streamer provides the answer.

A description translated from the official documentation:

If you've used Apache POI to read Excel files in the past, you may have noticed that it's not very memory efficient. Reading the entire workbook can cause severe memory usage spikes which can wreak havoc on the server. There are many reasons why Apache has to read the entire workbook, but most of them have to do with the library allowing you to use random addresses for reading and writing. If (and only if) you just want to read the contents of an Excel file in a fast and memory-efficient manner, you probably don't need this feature. Unfortunately, the only thing in the POI library for reading streaming workbooks requires your code to use a SAX-like parser. All the friendly classes like Row and Cell are missing from the API. This library acts as a wrapper around the streaming API while preserving the syntax of the standard POI API. Read on to find out if it's right for you. NOTE: This library only supports reading XLSX files.

As the introduction says, Xlsx-streamer's greatest convenience is compatibility with POI's familiar UserModel. It provides streaming implementations of all the UserModel interfaces, such as StreamingSheet and StreamingRow. For developers familiar with UserModel there is almost no learning curve: they can access Excel through the UserModel API directly.

Xlsx-streamer works on the same principle as SXSSF: a sliding window. It limits how much data is read into memory at once, keeps the data currently being parsed in a memory buffer, and writes to a temporary file to avoid large memory usage. The buffer's contents keep changing as parsing proceeds, and the temporary file is deleted when the stream is closed. Thanks to the buffer, the whole stream is never fully read into memory, which prevents memory overflow.

Like SXSSF, because only some rows are kept in memory, random access is sacrificed and the sheet can only be traversed in order; this limitation is unavoidable. In other words, calling StreamingSheet.getRow(int rownum), which would fetch a specific row of the sheet, throws an exception saying the operation is not supported.

Xlsx-streamer's biggest advantage is UserModel compatibility, especially for developers who know UserModel but do not want to wrestle with the cumbersome EventModel. Like SXSSF, it addresses the memory problem by implementing the UserModel interfaces, and it fills the reading gap that SXSSF leaves: you could call it the "read-side counterpart" of SXSSF.

Use Xlsx-streamer

Add the pom dependency:

    <dependency>
        <groupId>com.monitorjbl</groupId>
        <artifactId>xlsx-streamer</artifactId>
        <version>2.1.0</version>
    </dependency>

Here is a demo using xlsx-streamer:

import com.monitorjbl.xlsx.StreamingReader;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

import java.io.FileInputStream;

public class MyXlsxStreamer {
    public static void main(String[] args) throws Exception {
        parseSku();
    }

    public static void parseSku() throws Exception {
        FileInputStream in = new FileInputStream("D:\\sunhaoyu8\\Documents\\Files\\excel.xlsx");
        Workbook wk = StreamingReader.builder()
                // number of rows to cache in memory (default 10)
                .rowCacheSize(100)
                // buffer size in bytes when reading the resource (default 1024)
                .bufferSize(4096)
                // open the resource (required); accepts an InputStream or a File
                .open(in);
        Sheet sheet = wk.getSheetAt(0);

        for (Row r : sheet) {
            System.out.print("Row " + r.getRowNum() + ": ");
            for (Cell c : r) {
                if (c != null) {
                    System.out.print(c.getStringCellValue() + " ");
                }
            }
            System.out.println();
        }
    }
}

As the code shows, using Xlsx-streamer boils down to configuring and opening a streaming reader with StreamingReader. We can configure a fixed sliding-window size through two settings, the maximum number of rows cached in memory and the maximum number of bytes cached in memory, which together bound the window. From then on, we can traverse the sheet with the ordinary UserModel API.

Testing with a 100,000-row Excel file, the result is:

StAX parsing

The parsing method underlying Xlsx-streamer is called StAX. Introduced in the JSR 173 specification in March 2004 and added to the JDK in version 6.0, StAX stands for Streaming API for XML: streaming XML parsing. More precisely, it is streaming "pull" parsing, so named because it is the counterpart of streaming "push" parsing, i.e. SAX.

As mentioned before, SAX parsing is an event-driven model: whenever a tag is parsed, the corresponding handler is triggered and the event is "pushed" to the responder. In this push model the parser is active and the responder passive; we cannot choose which events to respond to, so this style of parsing is relatively inflexible.

StAX solves this with a "pull" model: the responder drives the process, iterating over the event stream and pulling events from the parser one at a time to handle them. During parsing, StAX can use the peek() method to "peek" at the next event and decide whether it needs processing at all, without consuming it from the stream. This improves both flexibility and efficiency.

Let's parse the same XML again with StAX:

<?xml version="1.0" encoding="UTF-8"?>
<skus>
    <sku id="345000">
        <name>电脑A</name>
        <price>5999.0</price>
   </sku>
    <sku id="345001">
        <name>手机C</name>
        <price>4599.0</price>
   </sku>
</skus>

This time no listener is needed; all the processing logic lives in one method:

import com.alibaba.fastjson.JSON;
import org.apache.commons.lang3.StringUtils;
import org.shy.domain.pojo.Sku;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import java.io.InputStream;
import java.util.Iterator;


public class MyStax {

    /**
     * The SKU currently being processed
     */
    private static Sku sku;
    /**
     * The tag name currently being processed
     */
    private static String tagName;

    public static void main(String[] args) throws Exception {
        parseSku();
    }
    
    public static void parseSku() throws Exception {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        InputStream inputStream = ClassLoader.getSystemResourceAsStream("skus.xml");
        XMLEventReader xmlEventReader = inputFactory.createXMLEventReader(inputStream);
        while (xmlEventReader.hasNext()) {
            XMLEvent event = xmlEventReader.nextEvent();
            // Start element
            if (event.isStartElement()) {
                StartElement startElement = event.asStartElement();
                String name = startElement.getName().toString();
                if ("sku".equals(name)) {
                    sku = new Sku();
                    Iterator iterator = startElement.getAttributes();
                    while (iterator.hasNext()) {
                        Attribute attribute = (Attribute) iterator.next();
                        if ("id".equals(attribute.getName().toString())) {
                            sku.setId(Long.valueOf(attribute.getValue()));
                        }
                    }
                }
                tagName = name;
            }
            // Character data
            if (event.isCharacters()) {
                String data = event.asCharacters().getData().trim();
                if (StringUtils.isNotEmpty(data)) {
                    if ("name".equals(tagName)) {
                        sku.setName(data);
                    }
                    if ("price".equals(tagName)) {
                        sku.setPrice(Double.valueOf(data));
                    }
                }
            }
            // End element
            if (event.isEndElement()) {
                String name = event.asEndElement().getName().toString();
                if ("sku".equals(name)) {
                    System.out.println(JSON.toJSONString(sku));
                    // Handle business logic
                    // ...
                }
            }
        }
    }
}

The code above is logically equivalent to the earlier SAX version: XMLEventReader acts as an iterator that reads events from the stream; we loop over it and branch on the event type. Interested readers can try it out themselves and explore more details of StAX parsing.
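Beyond the event-iterator API (XMLEventReader) used above, StAX also provides a lower-level cursor API, XMLStreamReader, which walks the stream without allocating event objects and is therefore even lighter-weight. A minimal sketch, again with the same SKU-shaped XML; the CursorDemo class and sumPrices method are illustrative names, not part of any library:

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

public class CursorDemo {

    // Sum all <price> values using the cursor-style XMLStreamReader.
    public static double sumPrices(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        double sum = 0;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "price".equals(reader.getLocalName())) {
                // getElementText() reads the element's text content and
                // advances the cursor to the matching end element.
                sum += Double.parseDouble(reader.getElementText().trim());
            }
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<skus>"
                + "<sku id=\"345000\"><name>Computer A</name><price>5999.0</price></sku>"
                + "<sku id=\"345001\"><name>Phone C</name><price>4599.0</price></sku>"
                + "</skus>";
        System.out.println(sumPrices(xml)); // prints 10598.0
    }
}
```

The cursor API trades convenience (no peek(), no typed event objects) for lower overhead, which is one reason streaming Excel readers favor this style internally.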

4. Conclusion

EventModel, SXSSF, EasyExcel, and Xlsx-streamer each offer their own solution to UserModel's memory usage problem. The table below compares all the Excel APIs mentioned in this article:

                          UserModel  EventModel  SXSSF  EasyExcel  Xlsx-streamer
Memory usage              high       lower       low    low        low
Full-table random access  yes        no          no     no         no
Read Excel                yes        yes         no     yes        yes
Read method               DOM        SAX         --     SAX        StAX
Write Excel               yes        yes         yes    yes        no

It is recommended that you choose the appropriate API according to your usage scenario:

  1. To process small batches of Excel files, POI UserModel and EasyExcel are recommended;

  2. To read large batches of Excel files, EasyExcel and Xlsx-streamer are recommended;

  3. To write large batches of Excel files, SXSSF and EasyExcel are recommended.

The APIs above should cover most Excel development needs. Of course, Excel APIs are not limited to these; there are many similar libraries, and you are welcome to explore and innovate further.

Links:

POI official website: https://poi.apache.org/

EasyExcel official website: https://easyexcel.opensource.alibaba.com

Xlsx-streamer Github: https://github.com/monitorjbl/excel-streaming-reader

Author: JD Insurance Sun Haoyu

Source: JD Cloud Developer Community
