ESRI Shapefile format parsing

Overview

A shapefile is a vector data format that stores the location and associated attributes of geometric features. The format cannot store topological information about geographic data. Shapefiles were introduced with ArcView GIS version 2 in the early 1990s and can now be read by many free and commercial programs.

The shapefile is a relatively primitive vector storage method: the main file stores only the coordinates of each geometry and cannot hold that geometry's attribute data in the same file. A shapefile must therefore be accompanied by a two-dimensional table that stores the attributes of each geometry. The geometries in a shapefile can represent complex geographic features and support accurate computation on them. However, a single shapefile can contain only one kind of geometry; for example, it cannot hold polyline and polygon data at the same time.

A shapefile consists of multiple files, three of which are required:

.shp: the shape format; stores the geometry of each feature.
.shx: the shape index format; records the position of each geometry in the .shp file, which speeds up seeking forward or backward to a geometry.
.dbf: the attribute format; stores the attributes of each geometry as a dBase III+ data table.

Files that belong to the same dataset must share the same filename prefix. For example, storing the geometry and attributes of a lake requires three files: lake.shp, lake.shx and lake.dbf. The .shp file is the shapefile proper, but on its own it is incomplete; the other two files must accompany it to form a complete dataset.
In addition to these three required files, there are eight optional files, the most useful of which is the .prj file:

.prj: the projection file; stores the geographic coordinate system and projection information in well-known text (WKT) format.
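
The shared-prefix rule can be checked with a few lines of code. The following is a minimal sketch, not a definitive implementation; the names `shapefile_parts` and `missing_required` are hypothetical helpers introduced here for illustration, and the check is case-sensitive on file systems that distinguish `.SHP` from `.shp`.

```python
from pathlib import Path

# The three required members of a shapefile dataset.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def shapefile_parts(path):
    """Map each suffix to the sibling path that shares the same prefix."""
    base = Path(path)
    return {s: base.with_suffix(s) for s in REQUIRED_SUFFIXES + (".prj",)}


def missing_required(path):
    """Return the required suffixes whose files do not exist on disk."""
    parts = shapefile_parts(path)
    return [s for s in REQUIRED_SUFFIXES if not parts[s].exists()]
```

For example, `shapefile_parts("data/lake.shp")[".dbf"]` is `data/lake.dbf`, and an empty list from `missing_required` means all three required files are present.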

Records appear in the same order in the .shp, .shx and .dbf files: the first record in .shp corresponds to the first record in .shx and in .dbf, and so on. In addition, the .shp and .shx files mix byte orders: some fields are big-endian and others are little-endian. Programs that read these files must therefore take care to apply the correct byte order to each field.
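
To see why the byte order matters, here is a small sketch using Python's standard `struct` module. The four bytes are the file code that opens every .shp and .shx file; reading them with the wrong byte order gives a very different number.

```python
import struct

# First four bytes of any .shp or .shx file: the file code 0x0000270a.
raw = bytes([0x00, 0x00, 0x27, 0x0A])

file_code_big = struct.unpack(">i", raw)[0]     # big-endian: the correct reading
file_code_little = struct.unpack("<i", raw)[0]  # little-endian: wrong for this field

print(file_code_big)     # 9994
print(file_code_little)  # a large unrelated number
```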

Shapefiles record coordinates in X, Y order. For geographic coordinates, X is longitude and Y is latitude, which is the reverse of the familiar "latitude, longitude" order, so pay attention to which value comes first.

.shp file format

The main file (.shp) contains the geometry data. It consists of a fixed-length file header followed by one or more variable-length records, each made up of a record header and the record content. The detailed storage format is given in the Esri Shapefile Technical Description.[1] Note that AutoCAD's shape font source format also uses the .shp suffix; do not confuse the two.

The main file header is 100 bytes long and contains 17 fields: nine 4-byte integers (32-bit signed, int32) followed by eight 8-byte double-precision floating-point numbers.

| Bytes | Type | Byte order | Use |
|---|---|---|---|
| 0–3 | int32 | Big endian | File code, always 9994 (hexadecimal 0x0000270a) |
| 4–23 | int32 | Big endian | Five unused 32-bit integers |
| 24–27 | int32 | Big endian | Length of the file, including the file header, expressed in 16-bit words |
| 28–31 | int32 | Little endian | Version number |
| 32–35 | int32 | Little endian | Shape type (see below) |
| 36–67 | double | Little endian | Minimum bounding rectangle (MBR) of all shapes in the shapefile, as four doubles: minimum X, minimum Y, maximum X, maximum Y |
| 68–83 | double | Little endian | Range of Z values, as two doubles: minimum Z, maximum Z |
| 84–99 | double | Little endian | Range of M values, as two doubles: minimum M, maximum M |
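
The header layout above maps directly onto three `struct.unpack` calls. This is a minimal sketch under the assumption that the whole header is already in memory as `bytes`; `ShpHeader` and `parse_shp_header` are hypothetical names chosen for this example.

```python
import struct
from typing import NamedTuple, Tuple


class ShpHeader(NamedTuple):
    file_code: int
    file_length_bytes: int  # converted from 16-bit words to bytes
    version: int
    shape_type: int
    bbox: Tuple[float, float, float, float]  # xmin, ymin, xmax, ymax
    z_range: Tuple[float, float]
    m_range: Tuple[float, float]


def parse_shp_header(buf: bytes) -> ShpHeader:
    """Parse the 100-byte header shared by .shp and .shx files."""
    if len(buf) < 100:
        raise ValueError("a shapefile header needs at least 100 bytes")
    # Bytes 0-27: seven big-endian int32 (file code, five unused, length).
    file_code, *_unused, length_words = struct.unpack(">7i", buf[:28])
    if file_code != 9994:
        raise ValueError(f"unexpected file code {file_code}")
    # Bytes 28-35: two little-endian int32 (version, shape type).
    version, shape_type = struct.unpack("<2i", buf[28:36])
    # Bytes 36-99: eight little-endian doubles (MBR, Z range, M range).
    xmin, ymin, xmax, ymax, zmin, zmax, mmin, mmax = struct.unpack(
        "<8d", buf[36:100]
    )
    return ShpHeader(
        file_code,
        length_words * 2,
        version,
        shape_type,
        (xmin, ymin, xmax, ymax),
        (zmin, zmax),
        (mmin, mmax),
    )
```

A typical call would be `parse_shp_header(open("lake.shp", "rb").read(100))`, where `lake.shp` is the example file name used earlier.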

The header is followed by any number of variable-length records, each starting with an 8-byte record header:

| Bytes | Type | Byte order | Use |
|---|---|---|---|
| 0–3 | int32 | Big endian | Record number (starting from 1) |
| 4–7 | int32 | Big endian | Length of the record content, expressed in 16-bit words (the 8-byte record header is not counted) |

The record header is followed by the record content:

| Bytes | Type | Byte order | Use |
|---|---|---|---|
| 0–3 | int32 | Little endian | Shape type (see below) |
| 4– | - | - | Shape content |
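
Walking the records is then a loop that reads each 8-byte record header, converts the length from 16-bit words to bytes, and slices out the content. The sketch below assumes the whole .shp file is held in memory; `iter_shp_records` is a hypothetical name for this example.

```python
import struct
from typing import Iterator, Tuple


def iter_shp_records(data: bytes) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (record_number, shape_type, content) for each record in a .shp file."""
    pos = 100  # records start right after the 100-byte file header
    while pos + 8 <= len(data):
        # Record header: two big-endian int32 (record number, content length).
        rec_no, length_words = struct.unpack(">2i", data[pos:pos + 8])
        start = pos + 8
        end = start + length_words * 2  # length is given in 16-bit words
        content = data[start:end]
        if len(content) < 4:
            raise ValueError(f"record {rec_no} is truncated or has a bad length")
        # The content itself begins with a little-endian shape type.
        shape_type = struct.unpack("<i", content[:4])[0]
        yield rec_no, shape_type, content
        pos = end
```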

The content of a variable-length record is determined by its shape type. The shapefile format supports the following shape types:

| Value | Shape type | Fields |
|---|---|---|
| 0 | Null shape | None |
| 1 | Point | X, Y |
| 3 | Polyline | MBR, number of parts, number of points, parts, points |
| 5 | Polygon | MBR, number of parts, number of points, parts, points |
| 8 | MultiPoint | MBR, number of points, points |
| 11 | PointZ (point with Z and M coordinates) | X, Y, Z, M |
| 13 | PolylineZ (polyline with Z or M coordinates) | Required: MBR, number of parts, number of points, parts, points, Z range, Z array. Optional: M range, M array |
| 15 | PolygonZ (polygon with Z or M coordinates) | Required: MBR, number of parts, number of points, parts, points, Z range, Z array. Optional: M range, M array |
| 18 | MultiPointZ (multipoint with Z or M coordinates) | Required: MBR, number of points, points, Z range, Z array. Optional: M range, M array |
| 21 | PointM (point with M coordinate) | X, Y, M |
| 23 | PolylineM (polyline with M coordinates) | Required: MBR, number of parts, number of points, parts, points. Optional: M range, M array |
| 25 | PolygonM (polygon with M coordinates) | Required: MBR, number of parts, number of points, parts, points. Optional: M range, M array |
| 28 | MultiPointM (multipoint with M coordinates) | Required: MBR, number of points, points. Optional: M range, M array |
| 31 | MultiPatch | Required: MBR, number of parts, number of points, parts, part types, points, Z range, Z array. Optional: M range, M array |

Here MBR is the minimum bounding rectangle of the shape.
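
As an illustration of how the field lists translate into bytes, the sketch below decodes the two most common layouts: Point (type 1) and Polyline/Polygon (types 3 and 5). It assumes all values in the record content are little-endian, with int32 counts and double coordinates, and that the "parts" array holds the index of the first point of each part. `parse_point` and `parse_poly` are hypothetical names.

```python
import struct


def parse_point(content: bytes):
    """Decode a Point record content: shape type, then X and Y as doubles."""
    shape_type, x, y = struct.unpack("<i2d", content[:20])
    if shape_type != 1:
        raise ValueError(f"not a Point record: type {shape_type}")
    return x, y


def parse_poly(content: bytes):
    """Decode a Polyline (3) or Polygon (5) record into (mbr, parts)."""
    shape_type = struct.unpack("<i", content[:4])[0]
    if shape_type not in (3, 5):
        raise ValueError(f"not a Polyline/Polygon record: type {shape_type}")
    mbr = struct.unpack("<4d", content[4:36])
    num_parts, num_points = struct.unpack("<2i", content[36:44])
    parts_end = 44 + 4 * num_parts
    # Each entry is the index of the first point of a part.
    part_starts = struct.unpack(f"<{num_parts}i", content[44:parts_end])
    coords = struct.unpack(
        f"<{2 * num_points}d", content[parts_end:parts_end + 16 * num_points]
    )
    points = list(zip(coords[0::2], coords[1::2]))
    # Split the flat point list at the part boundaries.
    bounds = list(part_starts) + [num_points]
    parts = [points[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
    return mbr, parts
```

The Z and M variants append their ranges and arrays after the point array, so the same slicing approach extends to them.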

In practice, shapefiles usually contain points, polylines, or polygons. Shapes with a Z coordinate are three-dimensional; shapes with an M coordinate carry a user-defined measure at each point. 3D shapefiles are rare. In addition, M measures are now usually kept in more capable and robust databases, so shapefiles are generally used only for geometric data.

.shx file format

The index file (.shx) has the same 100-byte header as the .shp file, followed by any number of fixed-length 8-byte records, each with two fields:

| Bytes | Type | Byte order | Use |
|---|---|---|---|
| 0–3 | int32 | Big endian | Record offset from the start of the .shp file, expressed in 16-bit words |
| 4–7 | int32 | Big endian | Record length, expressed in 16-bit words |

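Because every entry in this index has a fixed length, a program only needs to step forward or backward through the index and read the recorded offset and length. It can then traverse the whole shapefile quickly in either direction and find the exact position of any geometry in the .shp file.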
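
A minimal sketch of this random access follows. It assumes both files are held in memory, that the offset in the index points at the record header in the .shp file, and that the length counts only the record content. `read_shx_index` and `record_content` are hypothetical names.

```python
import struct
from typing import List, Tuple


def read_shx_index(shx: bytes) -> List[Tuple[int, int]]:
    """Return (offset_bytes, length_bytes) for every entry in a .shx file."""
    count = (len(shx) - 100) // 8  # fixed-length 8-byte entries after the header
    entries = []
    for i in range(count):
        offset_words, length_words = struct.unpack_from(">2i", shx, 100 + 8 * i)
        entries.append((offset_words * 2, length_words * 2))  # words to bytes
    return entries


def record_content(shp: bytes, index: List[Tuple[int, int]], i: int) -> bytes:
    """Fetch the content of the i-th record (0-based) without scanning the file."""
    offset, length = index[i]
    start = offset + 8  # skip the 8-byte record header in the .shp file
    return shp[start:start + length]
```

Because every index entry is 8 bytes, the i-th entry sits at byte `100 + 8 * i`, so looking up any record costs one read in each file.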

.dbf file format

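The attribute file (.dbf) records attribute information. It is a standard DBF file and, like the other files, consists of a file header and the record data. The file can be opened in Excel.
The file header has a variable length. It gives a general description of the DBF file and, most importantly, describes each field in detail: its name, data type, length and so on. In general, a shapefile has as many field descriptors as it has attributes.
The structure of the file header is shown in the following table: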

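| Bytes | Size | Description |
|---|---|---|
| 0 | 1 byte | Version information. |
| 1-3 | 3 bytes | Date of the last update, in YYMMDD format. |
| 4-7 | One 32-bit integer | Number of records in the file. |
| 8-9 | One 16-bit integer | Number of bytes in the file header. |
| 10-11 | One 16-bit integer | Length of one record in bytes. |
| 12-13 | 2 bytes | Reserved for future descriptive information; filled with 0. |
| 14 | 1 byte | Incomplete-operation flag. |
| 15 | 1 byte | dBASE IV encryption flag. |
| 16-27 | 12 bytes | Reserved for multi-user processing. |
| 28 | 1 byte | MDX flag. If an MDX index file is used when a DBF table is created, this byte is set in the table header. The next time the table is opened, the data engine checks this flag and, if it is true, tries to open the corresponding MDX file. |
| 29 | 1 byte | Language driver ID. |
| 30-31 | 2 bytes | Reserved; filled with 0. |
| 32-X | n*32 bytes | Array of field descriptors, where n is the number of fields. Its structure is explained in the next table. |
| X+1 | 1 byte | Field descriptor terminator (0x0D). |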
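
The fixed part of the DBF header can be decoded in one `struct.unpack` call. This sketch makes several assumptions that the table does not state: multi-byte integers are little-endian, the year byte counts from 1900, and the header holds exactly 32 bytes, n field descriptors of 32 bytes each, and one terminator byte. `parse_dbf_header` is a hypothetical name.

```python
import struct


def parse_dbf_header(buf: bytes) -> dict:
    """Decode bytes 0-11 of a DBF header and derive the number of fields."""
    if len(buf) < 32:
        raise ValueError("a DBF header needs at least 32 bytes")
    # 4 single bytes, one uint32, two uint16; little-endian is assumed here.
    version, yy, mm, dd, num_records, header_size, record_size = struct.unpack(
        "<4BIHH", buf[:12]
    )
    return {
        "version": version,
        "last_update": (1900 + yy, mm, dd),  # assumes the year counts from 1900
        "num_records": num_records,
        "header_size": header_size,
        "record_size": record_size,
        # 32 fixed bytes + n * 32 descriptor bytes + 1 terminator byte.
        "num_fields": (header_size - 33) // 32,
    }
```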

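Each field descriptor describes one attribute. Its structure is shown in the following table:

| Bytes | Size | Description |
|---|---|---|
| 0-10 | 11 bytes | Field name, in ASCII. For example: "LAYER", "AREA", "GM_TYPE", "NAME". |
| 11 | 1 byte | Field data type, as an ASCII character (B, C, D, G, L, M or N; see the data type table below). |
| 12-15 | 4 bytes | Reserved for future descriptive information; filled with 0. |
| 16 | 1 byte | Field length, binary. |
| 17 | 1 byte | Field precision (decimal count), binary. |
| 18-19 | 2 bytes | Reserved for future descriptive information; filled with 0. |
| 20 | 1 byte | Work area ID. |
| 21-30 | 10 bytes | Reserved for future descriptive information; filled with 0. |
| 31 | 1 byte | MDX flag. True if an MDX index file exists for this field, otherwise empty. |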
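
The descriptor array can be read by stepping through the header 32 bytes at a time until the terminator byte. This is a sketch only: it assumes the terminator is 0x0D, that field names are padded with NUL bytes, and that the buffer starts at the beginning of the .dbf file. `parse_field_descriptors` is a hypothetical name.

```python
from typing import List, Tuple


def parse_field_descriptors(buf: bytes) -> List[Tuple[str, str, int, int]]:
    """Return (name, type, length, decimals) for each 32-byte field descriptor."""
    fields = []
    pos = 32  # descriptors start right after the 32 fixed header bytes
    # Stop at the terminator byte (assumed 0x0D) or when the data runs out.
    while pos + 32 <= len(buf) and buf[pos] != 0x0D:
        d = buf[pos:pos + 32]
        name = d[:11].split(b"\x00", 1)[0].decode("ascii").strip()
        field_type = chr(d[11])         # one ASCII letter such as C, N or D
        length, decimals = d[16], d[17]  # single binary bytes
        fields.append((name, field_type, length, decimals))
        pos += 32
    return fields
```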

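The field data types are shown in the following table:

| Code | Data type | Allowed input |
|---|---|---|
| B | Binary | Any characters. |
| C | Character | Any characters. |
| D | Date | Digits and one character used to separate year, month and day; stored internally in YYYYMMDD format. |
| G | General or OLE | Any characters. |
| L | Logical | ? Y y N n T t F f |
| N | Numeric | - . 0 1 2 3 4 5 6 7 8 9 |
| M | Memo | Any characters. |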

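After all the field descriptors comes the record data. The records are written in the same order as the geometric entities in the .shp file.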
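
Reading the records back is a matter of slicing fixed-width text. The sketch below relies on details that are not spelled out above and should be treated as assumptions: each record begins with a 1-byte deletion flag (`*` for deleted, space otherwise), values are stored as padded text, and the text encoding is Latin-1. The `fields` argument is a list of `(name, type, length, decimals)` tuples, and `iter_dbf_records` is a hypothetical name.

```python
from typing import Dict, Iterator, List, Tuple


def iter_dbf_records(
    buf: bytes,
    header_size: int,
    record_size: int,
    num_records: int,
    fields: List[Tuple[str, str, int, int]],
    encoding: str = "latin-1",  # assumed; real files may use another code page
) -> Iterator[Tuple[bool, Dict[str, object]]]:
    """Yield (deleted, row) for each record, in the same order as the .shp file."""
    for i in range(num_records):
        start = header_size + i * record_size
        rec = buf[start:start + record_size]
        if len(rec) < record_size:
            break  # truncated file
        deleted = rec[:1] == b"*"  # assumed deletion flag in the first byte
        pos = 1
        row: Dict[str, object] = {}
        for name, field_type, length, _decimals in fields:
            text = rec[pos:pos + length].decode(encoding).strip()
            pos += length
            if field_type == "N" and text:
                # Numeric values are stored as text such as "12.50" or "42".
                row[name] = float(text) if "." in text else int(text)
            else:
                row[name] = text
        yield deleted, row
```

Deleted records are yielded with their flag rather than skipped, so that the i-th row still lines up with the i-th geometry in the .shp file.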

References

Wikipedia
ESRI documentation
CSDN blog

Origin blog.csdn.net/qq_42679415/article/details/129899766