Apache POI: Reading Large Excel Files with the Event Model (1): A First Look at the Xlsx Format

Preface

I was recently asked to read large Xlsx files (Excel documents over 10 MB) in Java and import them into a database. A Baidu search led me to POI.

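At first I did not know POI well, so I quickly built a first version on POI's usermodel. However, the usermodel needs far too much memory: as soon as a file went over 5 MB, I started getting GC overhead limit exceeded.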

(And reading a 6 MB xlsx file can somehow take up 3 GB of memory??)

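So I decided to switch to the eventmodel. However, because the Excel files contain formulas, the amount of code to write is very large, and examples of POI's event model are hard to find. I searched Baidu and Google for a whole day without finding a suitable one, so in the end I decided to start from the bottom and work through this topic bit by bit. Along the way I am writing these notes: taking knowledge in and explaining it back out, in the spirit of the Feynman technique.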

I did consider other tools on the market, such as:

  • easyexcel
  • hutool (its POI wrapper)
  • excel-streaming-reader
  • gridexcel
  • jxl

However, none of these tools offers all of the following at once:

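  • Low memory consumption (at the very least, no GC overhead limit exceeded and no OOM)
  • Automatic handling of complex formulas (the most painful case is date-like strings such as 2020/12/20 or ABC/BCS-DS/BUC being read as formulas; in general, the existing libraries do not support formulas well enough)
  • Reading very large files (incremental data of 20 MB or more, full data close to 500 MB)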

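easyexcel might be able to do it, but I was not willing to rely on a black box. Its level of abstraction is also too high, it is not flexible enough, and it has no documentation, so the learning cost is high and it still cannot meet the project's requirements.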

The Xlsx Format of Excel in Detail

Part of the following content comes from the link below, with some modifications.
https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml

A look at XLSX (Office Open XML, SpreadsheetML)

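Office Open XML, OOXML for short, is the standard behind the xlsx format. It is specified in ECMA-376.
The link below is the download page for the ECMA-376 documents.
http://www.ecma-international.org/publications/standards/Ecma-376.htm
Wikipedia also has a detailed introduction to OOXML: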
https://en.wikipedia.org/wiki/Office_Open_XML

The Office Open XML-based spreadsheet format using .xlsx as a file extension has been the default format produced for new documents by versions of Microsoft Excel since Excel 2007.

The format was designed to be equivalent to the binary .xls format produced by earlier versions of Microsoft Excel (see MS-XLS).

For convenience, this format description uses XLSX to identify the corresponding format.

The primary content of a XLSX file is marked up in SpreadsheetML, which is specified in parts 1 and 4 of ISO/IEC 29500, Information technology – Document description and processing languages – Office Open XML File Formats (OOXML).

This description focuses on the specification in ISO/IEC 29500:2012 and represents the format variant known as “Transitional.”

Although editions of ISO 29500 were published in 2008, 2011, 2012, and 2016, the specification has had very few changes other than clarifications and corrections to match actual usage in documents since SpreadsheetML was first standardized in ECMA-376, Part 1 in 2006.

This description can be read as applying to all SpreadsheetML versions published by ECMA International and by ISO/IEC through 2016.

See Notes below for more detail on the chronological versions and differences.

The XLSX format uses the SpreadsheetML markup language and schema to represent a spreadsheet “document.”

Conceptually, using the terminology of the Spreadsheet ML specification in ISO/IEC 29500-1, the document comprises one or more worksheets in a workbook.

A worksheet typically consists of a rectangular grid of cells.

Each cell can contain a value or a formula, which will be used to calculate a value, with a cached value usually stored pending the next recalculation.
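
To make the formula-plus-cached-value idea concrete, here is a minimal sketch that streams through a hand-written SpreadsheetML fragment with the JDK's StAX reader. The fragment, the class name CellStreamSketch, and the Cell holder are illustrative assumptions, not taken from a real workbook or from POI. In the markup, the f element carries the formula text and the v element carries the value stored alongside it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of POI.
class CellStreamSketch {

    /** One cell as it appears in the sheet XML. */
    static final class Cell {
        final String ref;          // cell reference, e.g. "C1"
        final String formula;      // text of <f>, or null if the cell has no formula
        final String cachedValue;  // text of <v>, or null if no value is stored

        Cell(String ref, String formula, String cachedValue) {
            this.ref = ref;
            this.formula = formula;
            this.cachedValue = cachedValue;
        }
    }

    // Hand-written sample (an assumption): C1 holds a formula and its cached result.
    static final String SAMPLE_SHEET =
          "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">"
        + "<sheetData><row r=\"1\">"
        + "<c r=\"A1\"><v>1</v></c>"
        + "<c r=\"B1\"><v>2</v></c>"
        + "<c r=\"C1\"><f>SUM(A1:B1)</f><v>3</v></c>"
        + "</row></sheetData></worksheet>";

    /** Reads every <c> element in document order without building a DOM tree. */
    static List<Cell> readCells(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<Cell> cells = new ArrayList<>();
        String ref = null;
        String formula = null;
        String value = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("c")) {
                    // A new cell starts: remember its reference and reset the state.
                    ref = reader.getAttributeValue(null, "r");
                    formula = null;
                    value = null;
                } else if (name.equals("f")) {
                    formula = reader.getElementText();
                } else if (name.equals("v")) {
                    value = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("c")) {
                cells.add(new Cell(ref, formula, value));
            }
        }
        reader.close();
        return cells;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (Cell cell : readCells(SAMPLE_SHEET)) {
            System.out.println(cell.ref + " formula=" + cell.formula
                    + " cached=" + cell.cachedValue);
        }
    }
}
```

Because the reader only keeps the current cell in memory, the same loop shape scales to large sheets, which is the main appeal of an event-style parse over loading the whole workbook.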

A single spreadsheet document may serve several purposes: as a container for data values; as program code (based on the formulas in cells) to perform analyses on those values; and as one or more formatted reports (including charts) of the analyses.

Beyond basics, spreadsheet applications have introduced support for more advanced features over time.

These include mechanisms to extract data dynamically from external sources, to support collaborative work, and to perform an increasing number of functions that would have required a database application in the past, such as sorting and filtering of entries in a table to display a temporary subset.

The markup specification must support both basic and more advanced functionalities in a structure that supports the robust performance expected by users.

A look at its structure:

An XLSX file is packaged using the Open Packaging Conventions (OPC/OOXML_2012, itself based on ZIP_6_2_0).

The package can be explored by opening it with ZIP software, typically after changing the file extension to .zip.

The top level of a minimal package will typically have three folders (_rels, docProps, and xl) and one file part ([Content_Types].xml).
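
The same exploration can be done from Java with java.util.zip. The sketch below is an illustration under stated assumptions: the class name XlsxPackageLister is made up, and the in-memory sample package contains placeholder bytes under typical part names rather than real workbook content. Pass the path of a real .xlsx file as the first argument to inspect an actual package.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only.
class XlsxPackageLister {

    /** Returns the names of all entries in a ZIP stream, in stored order. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Reduces entry names to their first path segment (top-level folder or file). */
    static SortedSet<String> topLevelNames(List<String> entryNames) {
        SortedSet<String> top = new TreeSet<>();
        for (String name : entryNames) {
            int slash = name.indexOf('/');
            top.add(slash < 0 ? name : name.substring(0, slash));
        }
        return top;
    }

    /** Builds a dummy ZIP whose entry names mimic a minimal XLSX package (assumed layout). */
    static byte[] buildSamplePackage() throws IOException {
        String[] parts = {
            "[Content_Types].xml", "_rels/.rels",
            "docProps/core.xml", "docProps/app.xml",
            "xl/workbook.xml", "xl/worksheets/sheet1.xml"
        };
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String part : parts) {
                zip.putNextEntry(new ZipEntry(part));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, inspect a real .xlsx; otherwise use the in-memory sample.
        InputStream in = args.length > 0
                ? Files.newInputStream(Paths.get(args[0]))
                : new ByteArrayInputStream(buildSamplePackage());
        List<String> names = listEntries(in);
        System.out.println("Top level: " + topLevelNames(names));
        names.forEach(System.out::println);
    }
}
```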

The xl folder holds the primary content of the document including the file part workbook.xml and a worksheets folder containing a file for each worksheet, as well as other files and folders that support functionality (such as controlling calculation order) and presentation (such as formatting styles for cells) for the spreadsheet.

Any embedded graphics are also stored in the xl folder as additional parts.

The other folders and parts at the top level of the package support efficient navigation and manipulation of the package:

_rels

_rels is a Relationships folder, containing a single file .rels (which may be hidden from file listings, depending on operating system and settings).

It lists and links to the key parts in the package, using URIs to identify the type of relationship of each key part to the package.

In particular it specifies a relationship to the primary officeDocument (typically named /xl/workbook.xml ) and typically to parts within docProps as core and extended properties.
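
As a rough illustration of how a reader can locate the main document part, the sketch below parses a hand-written .rels sample with the JDK's DOM parser and returns the target of the officeDocument relationship. The sample XML and the class name RootRelsSketch are assumptions. The relationship type URIs are the ones used by the Transitional variant; Strict packages use different URIs, so treat the constant as specific to this case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class RootRelsSketch {

    // Namespace of the Relationships markup in an OPC package.
    static final String RELS_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";

    // Relationship type of the main document part (Transitional variant).
    static final String OFFICE_DOCUMENT_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Hand-written sample of a typical _rels/.rels file (an assumption).
    static final String SAMPLE_RELS =
          "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<Relationships xmlns=\"" + RELS_NS + "\">"
        + "<Relationship Id=\"rId1\" Type=\"" + OFFICE_DOCUMENT_TYPE + "\""
        + " Target=\"xl/workbook.xml\"/>"
        + "<Relationship Id=\"rId2\" Type=\"http://schemas.openxmlformats.org/package/2006/"
        + "relationships/metadata/core-properties\" Target=\"docProps/core.xml\"/>"
        + "<Relationship Id=\"rId3\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/"
        + "relationships/extended-properties\" Target=\"docProps/app.xml\"/>"
        + "</Relationships>";

    /** Returns the Target of the officeDocument relationship, or null if none is listed. */
    static String findMainDocument(String relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));
        NodeList rels = doc.getElementsByTagNameNS(RELS_NS, "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            if (OFFICE_DOCUMENT_TYPE.equals(rel.getAttribute("Type"))) {
                return rel.getAttribute("Target");
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Main document part: " + findMainDocument(SAMPLE_RELS));
    }
}
```

Looking the part up by relationship type rather than hard-coding xl/workbook.xml matters because the text above only says the main part is typically, not always, named that way.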

docProps

docProps is a folder that contains properties for the document as a whole, typically including a set of core properties, a set of extended or application-specific properties, and a thumbnail preview for the document.

[Content_Types].xml

[Content_Types].xml is a file part, a mandatory part in any OPC package, that lists the content types (using MIME Internet Media Types as defined in RFC 6838) for parts within the package.
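
A small sketch of how the content type of a part can be resolved from this file: an Override entry matching the part name wins, otherwise the Default entry for the file extension applies. The sample XML and the class name ContentTypesSketch are assumptions written by hand for illustration, not output from Excel.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only.
class ContentTypesSketch {

    // Namespace of the [Content_Types].xml markup in an OPC package.
    static final String CT_NS =
            "http://schemas.openxmlformats.org/package/2006/content-types";

    // Hand-written sample of a small [Content_Types].xml (an assumption).
    static final String SAMPLE_TYPES =
          "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<Types xmlns=\"" + CT_NS + "\">"
        + "<Default Extension=\"rels\""
        + " ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>"
        + "<Default Extension=\"xml\" ContentType=\"application/xml\"/>"
        + "<Override PartName=\"/xl/workbook.xml\" ContentType=\"application/"
        + "vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml\"/>"
        + "<Override PartName=\"/xl/worksheets/sheet1.xml\" ContentType=\"application/"
        + "vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml\"/>"
        + "</Types>";

    /** Resolves a part's content type: Override by part name first, then Default by extension. */
    static String contentTypeOf(String typesXml, String partName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(typesXml.getBytes(StandardCharsets.UTF_8)));

        // 1. An explicit Override for this exact part name takes precedence.
        NodeList overrides = doc.getElementsByTagNameNS(CT_NS, "Override");
        for (int i = 0; i < overrides.getLength(); i++) {
            Element override = (Element) overrides.item(i);
            if (override.getAttribute("PartName").equalsIgnoreCase(partName)) {
                return override.getAttribute("ContentType");
            }
        }

        // 2. Otherwise fall back to the Default registered for the file extension.
        int dot = partName.lastIndexOf('.');
        String extension = dot < 0 ? "" : partName.substring(dot + 1);
        NodeList defaults = doc.getElementsByTagNameNS(CT_NS, "Default");
        for (int i = 0; i < defaults.getLength(); i++) {
            Element def = (Element) defaults.item(i);
            if (def.getAttribute("Extension").equalsIgnoreCase(extension)) {
                return def.getAttribute("ContentType");
            }
        }
        return null; // no content type declared for this part
    }

    public static void main(String[] args) throws Exception {
        for (String part : new String[] {"/xl/workbook.xml", "/_rels/.rels", "/docProps/core.xml"}) {
            System.out.println(part + " -> " + contentTypeOf(SAMPLE_TYPES, part));
        }
    }
}
```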

The standards documents that specify this format run to over six thousand pages.

Useful introductions to the XLSX format can be found at:

Anatomy of a SpreadsheetML File by Daniel Dick of Reuters.

Structure of a SpreadsheetML document from Open XML SDK documentation. Includes diagram showing typical spreadsheet document parts.

Next: Apache POI: Reading Large Excel Files with the Event Model (2): Xlsx Format Contents in Detail

Reposted from blog.csdn.net/weixin_42072754/article/details/110621488