Reading large Excel files with the Apache POI event model (3): more on the .xlsx format

Part of this article's content comes from ECMA-376:
http://www.ecma-international.org/publications/standards/Ecma-376.htm

In the previous section I covered most of the structure of OOXML. In practice, however, I ran into the attribute t="shared" on a formula element, which the previous article did not mention. Stack Overflow confirmed that it marks a shared formula, but offered no further detail.

Searches on Baidu, Google, and Stack Overflow turned up very little that explains OOXML, so I finally went back to the source, checked the wiki, and found that OOXML is standardized as ECMA-376.

The .xlsx format appeared in 2007 and complies with this standard, so the standard must date from 2007 or earlier. A search showed that the first edition of ECMA-376 was released in 2006, so I downloaded it to look at the content.

It turns out that the fields of a cell in the .xlsx XML are defined under SpreadsheetML. The following is the original ECMA text:

8.4 SpreadsheetML
This subclause introduces the overall form of a SpreadsheetML package, and identifies some of its main element types. (See Part 3 for a more detailed introduction.) A SpreadsheetML package has a relationship of type officeDocument, which specifies the location of the main part in the package. For a SpreadsheetML document, that part contains the workbook definition.

It says that the more detailed introduction is in Part 3, so I went to Part 3, and it is indeed there.

After searching through it, I have copied below the parts I needed. All of them come from ECMA-376 1st edition, Part 3:

3.2.9.2.1 Shared Formulas

<row r="7" spans="4:8">
  <c r="H7" s="1">
    <f t="shared" ref="H7:H11" ce="1" si="0">SUM(E7:G7)</f>
    <v>1.0246225028914113</v>
  </c>
</row>
<row r="8" spans="4:8">
  <c r="H8" s="1">
    <f t="shared" ce="1" si="0">SUM(E8:G8)</f>
    <v>0.9063376048733931</v>
  </c>
</row>

Just as strings in cells can be extremely pervasive and redundant in a sheet (and therefore must be optimized), formulas are also extremely pervasive in a sheet, and often can be optimized.

Consider the table in the above example, where column H contains a formula that sums the numbers in columns E through G, for each row.

The only difference between the formulas in H6:H12 is that the reference increases by 1 row from one row to the next.

Therefore, an optimization is created where only the formula in H6 needs to be written out, with some additional information indicating how far to propagate the formula once loaded.

This enables the loading application to load and parse only the first of the shared formulas, and then more quickly apply the necessary transforms to produce the additional related formulas in subsequent cells.

Note that while formulas can be shared, it is desirable to enable easy access to the contents of a cell.

Therefore, it is allowed that all formulas may be written out, but only the primary formula in a shared formula need be loaded and parsed.
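To make the "necessary transforms" a little more concrete, here is a minimal Java sketch of how a reader might derive the formula for a dependent cell from the master formula. It is not the POI implementation and not the algorithm from the standard: the class name SharedFormulaSketch and the method shiftRows are hypothetical, and the sketch assumes the shared range only runs down a column, so only relative row numbers are shifted. It ignores column shifts, sheet-qualified references, and cell-like text inside string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache POI.
class SharedFormulaSketch {

    // Matches A1-style references such as E7, $E7, E$7 or $E$7.
    // The lookarounds avoid matching inside longer names or functions like LOG10(.
    private static final Pattern CELL_REF = Pattern.compile(
            "(?<![A-Za-z0-9_.$])(\\$?[A-Z]{1,3})(\\$?)(\\d+)(?![A-Za-z0-9_(])");

    // Returns the master formula with every relative row number moved by rowDelta.
    // Rows marked absolute with '$' are left unchanged.
    static String shiftRows(String masterFormula, int rowDelta) {
        Matcher m = CELL_REF.matcher(masterFormula);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(masterFormula, last, m.start());
            String column = m.group(1);
            boolean absoluteRow = !m.group(2).isEmpty();
            int row = Integer.parseInt(m.group(3));
            out.append(column);
            if (absoluteRow) {
                out.append('$').append(row);
            } else {
                out.append(row + rowDelta);
            }
            last = m.end();
        }
        out.append(masterFormula, last, masterFormula.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Master formula is in H7; H8 is one row further down.
        System.out.println(shiftRows("SUM(E7:G7)", 1));
    }
}
```

In an event-model handler, one possible use is to remember the master formula text and its cell per si value, then call shiftRows with the row distance whenever a later f element carries the same si but no text.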

3.2.9 Cell

<c r="B3">
  <f>B2+1</f>
  <v>2</v>
</c>

The cell itself is expressed by the c collection.

Each cell indicates it’s location in the grid using A1-style reference notation.

A cell can also indicate a style identifier (attribute s) and a data type (attribute t).

The cell types include string, number, and Boolean.

In order to optimize load/save operations, default data values are not written out.
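Since each cell only carries its position as an A1-style reference in the r attribute, a reader usually has to turn that string into row and column numbers. The following Java sketch shows one way to do it; the class CellRefSketch and its methods are hypothetical names for illustration, and it assumes a plain reference such as B3 or AA10 with uppercase column letters and no $ signs or sheet name.

```java
// Hypothetical helper, not part of Apache POI.
class CellRefSketch {

    // Zero-based column index of an A1-style reference: A -> 0, Z -> 25, AA -> 26.
    static int columnIndex(String ref) {
        int col = 0;
        for (int i = 0; i < ref.length(); i++) {
            char ch = ref.charAt(i);
            if (ch < 'A' || ch > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (ch - 'A' + 1);
        }
        return col - 1;
    }

    // Zero-based row index of an A1-style reference: B3 -> 2.
    static int rowIndex(String ref) {
        int i = 0;
        while (i < ref.length() && !Character.isDigit(ref.charAt(i))) {
            i++;
        }
        return Integer.parseInt(ref.substring(i)) - 1;
    }

    public static void main(String[] args) {
        System.out.println(columnIndex("B3") + "," + rowIndex("B3"));
    }
}
```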

3.2.9.1 Cell Values

Cells contain values, whether the values were directly typed in (e.g., cell A2 in our example has the value External Link:) or are the result of a calculation (e.g., cell B3 in our example has the formula B2+1).

String values in a cell are not stored in the cell table unless they are the result of a calculation.

Therefore, instead of seeing External Link: as the content of the cell’s v node, instead you see a zero-based index into the shared string table where that string is stored uniquely.

This is done to optimize load/save performance and to reduce duplication of information.

To determine whether the 0 in v is a number or an index to a string, the cell’s data type must be examined.

When the data type indicates string, then it is an index and not a numeric value.
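The rule above can be sketched in Java as follows. This is only an illustration under assumptions: the class CellValueSketch and the method resolve are hypothetical, the shared string table is assumed to have already been loaded into a List, and the string data type is assumed to be written as t="s" on the c element, which is how it appears in the files I looked at.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
class CellValueSketch {

    // typeAttr is the value of the cell's t attribute (may be null when omitted),
    // rawValue is the text content of the cell's v node.
    static String resolve(String typeAttr, String rawValue, List<String> sharedStrings) {
        if ("s".equals(typeAttr)) {
            // The value is a zero-based index into the shared string table.
            return sharedStrings.get(Integer.parseInt(rawValue));
        }
        // Otherwise the v node is taken as the value itself (for example a number).
        return rawValue;
    }

    public static void main(String[] args) {
        List<String> sharedStrings = Arrays.asList("External Link:");
        System.out.println(resolve("s", "0", sharedStrings));
        System.out.println(resolve(null, "0", sharedStrings));
    }
}
```

The same 0 in v therefore yields the text External Link: when the type says string, and the plain value 0 otherwise.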

In Part 4 of ECMA-376 I finally found the definition of the formula. Part 4 also contains the definitions of the common types.

That is all I will add for now; I will come back and add more if it becomes necessary.

Next: Reading large Excel files with the Apache POI event model (4): sample code

Origin blog.csdn.net/weixin_42072754/article/details/110849603