How to efficiently process geospatial data with a modern data stack

Background knowledge

What is geographic information data

Geographic information data describes the planet we know well: the earth. The earth's surface is uneven and only approximately an ellipsoid. Measured against sea level, the highest and lowest known points are almost 20,000 meters apart.

  • Mount Everest rises 8,848.86 meters above sea level (People's Daily: December 8, 2020)
  • The Mariana Trench is 10,909 meters deep relative to sea level (People's Daily: November 30, 2020)

Even sea level itself changes: the moon's gravity raises tides, and climate change is raising the sea level over the long term. It is therefore quite difficult to find an exact mathematical model of the earth.

So people simplify. Take the line through the north and south poles as a fixed axis and rotate an ellipse around it. The resulting solid of revolution is the ideal earth ellipsoid, and it can be represented by a mathematical model.

Based on this consensus, the International Union of Geodesy and Geophysics recommended the following earth ellipsoid parameters in 1975: a major radius of 6,378,140 meters, a minor radius of 6,356,755 meters, and a flattening of 1:298.257. Later models apply small corrections to these values.

With these precise measurements, every point on the earth can be expressed mathematically, and modern positioning technology can then locate any position on the earth precisely.
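As a quick check, the flattening can be recomputed from the two radii quoted above. This is a minimal sketch, and the function and variable names are illustrative. Because the minor radius is rounded to the nearest meter, the result agrees with 1:298.257 only to about two decimal places.

```python
# Ellipsoid radii quoted in the text (IUGG 1975), in meters.
MAJOR_RADIUS_M = 6378140.0
MINOR_RADIUS_M = 6356755.0


def flattening(major: float, minor: float) -> float:
    """Return the flattening f = (a - b) / a of an ellipsoid."""
    return (major - minor) / major


f = flattening(MAJOR_RADIUS_M, MINOR_RADIUS_M)
inverse_flattening = 1.0 / f

# Flattening is usually written as the ratio 1:(1/f).
print(f"flattening = 1:{inverse_flattening:.2f}")
```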

Positioning Technology

To find our precise position on the earth we use navigation software, and navigation software in turn relies on a global satellite positioning system. At present, there are four major navigation systems in the world.

  • BeiDou (China)

  • GPS (United States)

  • Galileo (European Union)

  • GLONASS (Russia)

The most important part of each positioning system is its set of artificial satellites, which orbit the earth along fixed, predictable paths. Some of them are in circular geosynchronous orbits.

Because each satellite is at a different distance from us, signals that the satellites send at the same moment reach our device at slightly different times. Each delay, multiplied by the propagation speed of the signal, gives the distance to one satellite.

We know the exact position of every satellite in advance, and now we also have these distances. With at least 3 signals we can compute our position by trilateration (often loosely called triangulation). In practice a receiver uses a fourth signal to correct its own clock error. This is the core principle behind all satellite positioning technologies.
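The distance-based positioning described above (strictly speaking, trilateration) can be sketched in two dimensions. This is a simplified illustration, not how a real receiver works: real systems solve in three dimensions and also estimate the receiver clock error. The function names and the sample coordinates are assumptions made for the example.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def delay_to_distance(delay_s: float) -> float:
    """Convert a signal travel time in seconds into a distance in meters."""
    return SPEED_OF_LIGHT_M_S * delay_s


def trilaterate_2d(anchors, distances):
    """Locate a point in the plane from three known anchors and distances.

    Subtracting the first circle equation from the other two removes the
    squared unknowns and leaves a 2x2 linear system, solved by Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


# Made-up anchor positions and a made-up true position, for illustration only.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_position = (3.0, 4.0)
distances = [math.dist(true_position, anchor) for anchor in anchors]

position = trilaterate_2d(anchors, distances)
print(position)
```

With exact distances the sketch recovers a point very close to the true position; with noisy measurements a real system would use more signals and a least-squares fit.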

Spatial data

Two main spatial database standards exist today, and they are largely compatible with each other.

  • Simple Feature Access SQL (SFA SQL): a standard developed by the Open Geospatial Consortium (OGC).

  • SQL Multimedia Part 3: Spatial (SQL/MM): a standard developed by the International Organization for Standardization (ISO).

Spatial data describes a specific geometry. The common part of these two standards defines the following 6 frequently used basic types, in 3 groups:

  • Point, MultiPoint
  • LineString, MultiLineString
  • Polygon, MultiPolygon

To make this data easier to store and use, the OGC has defined two concrete formats in the OpenGIS specification:

  • Well-Known Text (WKT) format
  • Well-Known Binary (WKB) format

WKT (a text markup) and WKB (its binary counterpart) describe only the geometric data of points, lines, and polygons, and geometric calculations on them generally do not need a coordinate system. However, when the data needs to be displayed on a map, the original spatial data must be projected onto a geodetic coordinate system (this process is called projection) to obtain the specific geographic coordinates of the geometry.
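To make the two formats concrete, here is a minimal sketch that writes a single point as WKT and as WKB and reads the WKB back. It handles 2D points only, and the helper names are illustrative. Note that neither output contains any coordinate system information.

```python
import struct

WKB_LITTLE_ENDIAN = 1  # byte-order marker at the start of a WKB value
WKB_POINT = 1          # WKB geometry type code for Point


def point_to_wkt(x: float, y: float) -> str:
    """Render a 2D point as Well-Known Text."""
    return f"POINT ({x} {y})"


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian Well-Known Binary (21 bytes)."""
    return struct.pack("<BIdd", WKB_LITTLE_ENDIAN, WKB_POINT, x, y)


def wkb_to_point(data: bytes):
    """Decode a 2D WKB point, honoring its byte-order marker."""
    endian = "<" if data[0] == WKB_LITTLE_ENDIAN else ">"
    geom_type, x, y = struct.unpack_from(endian + "Idd", data, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"not a WKB point, type code {geom_type}")
    return x, y


wkt = point_to_wkt(116.4, 39.9)
wkb = point_to_wkb(116.4, 39.9)
print(wkt)
print(wkb.hex())
print(wkb_to_point(wkb))
```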

Spatial Reference Identification Number (SRID)

To place a geometry in a coordinate system, an SRID must be used. An SRID uniquely identifies how a piece of geometric spatial data maps into a specific coordinate system.

When the SRID is 0 or no SRID is used, the geometry instance is not placed in any coordinate system, and we cannot locate its position. By analogy, the length, width, and height of a box tell us its shape, but not where it is.

Different SRID values represent different ways of mapping a geometry into a coordinate system. The spatial data of the geometry combined with an SRID pins down the position of the geometry in that coordinate system.

The following figure shows the difference between a geometry with and without an SRID. The SRIDs defined by the European Petroleum Survey Group (EPSG) describe coordinate systems built from the geographic information of the earth, so a geometry's spatial data plus an EPSG SRID can be converted into real geographic coordinates.

There are several recognized SRID standards, of which EPSG is one. Databases differ in how well they support each SRID standard.

Certain databases and spatial types (such as PostGIS geometry and geography in PostgreSQL or geography in Microsoft SQL Server) use a predefined subset of EPSG codes, and only spatial references with these SRIDs can be used.

The work of compiling the SRIDs has now been handed over to the International Association of Oil and Gas Producers (OGP). For more EPSG information, visit https://epsg.io/

Common SRID Standards and Geographic Coordinate Systems

Four coordinate systems are commonly used in China:

  • WGS84: the coordinate system used by the US GPS system.
  • GCJ02: a geographic coordinate system defined by China's State Bureau of Surveying and Mapping.
  • BD09: the coordinate system used by Baidu Maps, which is derived from GCJ02.
  • CGCS2000: the coordinate system used by China's BeiDou system.

Geodetic Coordinate System and Mapping

The basic steps of map drawing

Drawing a map and building a geodetic coordinate system involves the following main steps:

  1. First, a datum point is selected, and all terrain data is drawn relative to it. The datum point is itself a point on the earth ellipsoid.

  2. The earth ellipsoid acts as a canvas on which surveyors outline streets and terrain, which produces the final map data that we use.

  3. The offset between a positioned point and the datum point of the map is then used to convert a position into map coordinates. This process is coordinate system conversion. The actual process is more complicated and may involve several rounds of offset correction.

Store geographic information

Mainstream relational databases now generally support geographic information, and the most commonly used type is the geometry type. In Oracle, the corresponding type is sdo_geometry.

There are also more specific geometry types, such as Point, Polygon, MultiPoint, and MultiPolygon. To save space, this article covers only the geometry type.

Interested readers can use the table below as a starting point for further study. This article will not go into more detail.

Database     Geometry types
MySQL        POINT, LINESTRING, POLYGON, MULTIPOINT, MULTILINESTRING, MULTIPOLYGON, GEOMETRY, GEOMETRYCOLLECTION
PostgreSQL   POINT, LINE, LSEG, BOX, PATH, POLYGON, CIRCLE, GEOMETRY
Oracle       SDO_GEOMETRY, SDO_TOPO_GEOMETRY, SDO_GEORASTER
SQL Server   GEOMETRY, GEOGRAPHY

Databases differ in how they store geographic information because their storage and query engines differ. The main reason is that plain WKB is not enough in practice: WKB stores only the spatial data of a geometry and carries no information about the geodetic coordinate system.

For example, MySQL prepends 4 bytes to the WKB data to store the corresponding SRID, while PostgreSQL uses the extended EWKB format, which embeds the SRID in the binary value itself.

Therefore, if you want to read geographic information data from a database in binary form, you need to know the exact layout that database uses.
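The two layouts mentioned above can be read with a few lines of byte handling. This is a minimal sketch under stated assumptions: the MySQL value is taken to be a little-endian 4-byte SRID followed by WKB, and the EWKB reader only inspects the header, using the SRID flag bit 0x20000000 in the type field. The function names and the sample bytes are made up for the example.

```python
import struct

EWKB_SRID_FLAG = 0x20000000  # set in the EWKB type field when an SRID follows
EWKB_TYPE_MASK = 0x1FFFFFFF  # strips the Z, M and SRID flag bits


def split_mysql_geometry(blob: bytes):
    """Split a MySQL internal geometry value into (srid, plain WKB)."""
    (srid,) = struct.unpack_from("<I", blob, 0)
    return srid, blob[4:]


def read_ewkb_header(blob: bytes):
    """Return (srid or None, geometry type code, offset of the coordinates)."""
    endian = "<" if blob[0] == 1 else ">"
    (raw_type,) = struct.unpack_from(endian + "I", blob, 1)
    offset = 5
    srid = None
    if raw_type & EWKB_SRID_FLAG:
        (srid,) = struct.unpack_from(endian + "I", blob, offset)
        offset += 4
    return srid, raw_type & EWKB_TYPE_MASK, offset


# Hand-built sample values for a point, for illustration only.
plain_wkb = struct.pack("<BIdd", 1, 1, 116.4, 39.9)
mysql_blob = struct.pack("<I", 4326) + plain_wkb
ewkb_blob = struct.pack("<BIIdd", 1, 1 | EWKB_SRID_FLAG, 4490, 116.4, 39.9)

print(split_mysql_geometry(mysql_blob)[0])
print(read_ewkb_header(ewkb_blob))
```

Passing a MySQL value straight to a plain WKB parser would misread the SRID bytes as the byte-order marker and type code, which is why the layout has to be known first.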

Problems in applying geographic information data

We will look at the practical problems of applying geographic information data through a specific case. Our case is as follows:

How do you know how much land on earth has been used over a period of time?

In order to achieve this goal, we will have to face the following challenges:

Large amount of data

First, land use changes over time, such as:

  • At some times the land may be cultivated land and at other times forest land.

  • A piece of land may be cut into parcels over time, or merged into a larger parcel.

Therefore, the map data obtained each year reflects only the situation in that year, while the plot data keeps changing.

As a result, finding out how plots change over a time span usually means overlaying map data from several points in time. If one year of map data is 1 GB in size, 10 GB of historical map data needs to be processed to calculate the change over 10 years.

Large amount of calculation

Map data also comes with a lot of other structured data, such as residential areas, house numbers, restaurant names, land access, and traffic roads. Business queries therefore start by querying and filtering on these business dimensions.

Anyone who has written business logic knows that complex business queries often join several tables. On top of that, we need GIS functions to compute intersections and other operations on geometries. This leads to the following two problems:

  • Large volumes of geometry and label data cause join performance problems on large tables.

  • GIS functions add a heavy computational load.

There is no one-size-fits-all database

The mainstream databases that are friendly to geographic information storage are PostgreSQL, Oracle, and SQL Server. The PostgreSQL ecosystem is also friendly to geographic information data processing, for example:

  • PostgreSQL works well with GeoServer, MapServer, ArcGIS Server, and several other map service middleware products.
  • PostGIS is better supported on PostgreSQL than on Greenplum.

These traditional databases cannot solve every problem. In particular, table joins run into serious performance problems once GIS tables reach tens of millions of rows.

Fortunately, powerful new OLAP databases have emerged in recent years. Combining geographic information processing with these analysis engines can significantly improve its efficiency.

A modern data stack for efficient processing of geospatial data

The following modern data stack comes from a real case of a CloudCanal user. The user's original solution queried and processed geographic information data in PostgreSQL alone.

With CloudCanal moving the data between systems, the stack below brings in the analytical power of ClickHouse and the full-text indexing of ElasticSearch, which significantly improves the efficiency of geographic information data processing.

Data stack architecture diagram

The architecture diagram above shows how geographic information data flows through the stack and is processed:

  1. PostgreSQL is friendly to geographic information storage and processing, so business applications first write all generated geographic information data into PostgreSQL.

  2. CloudCanal migrates and synchronizes the geographic information data between heterogeneous data sources and automatically converts it for the target data source.

  3. ClickHouse's analytical capabilities are used to process the geospatial data efficiently:

    1. Aggregations and joins over massive geographic information data run in ClickHouse.

    2. The application first uses ClickHouse to filter the data down to a small valid set, then runs a second pass of geometric computations on that small set directly with the JTS library to produce the final result.

  4. With ElasticSearch's full-text indexing, applications can run full-text search directly on the geographic information data stored in ElasticSearch.
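The "filter first, then compute exact geometry on what is left" idea from step 3 can be sketched as follows. This is only an illustration in plain Python: the cheap bounding-box test stands in for the coarse filtering done in the analysis engine, and the ray-casting test stands in for the exact geometric function that the application would run with a geometry library. The parcel data and all names are made up.

```python
def point_in_bbox(point, bbox):
    """Cheap test: is the point inside (min_x, min_y, max_x, max_y)?"""
    x, y = point
    min_x, min_y, max_x, max_y = bbox
    return min_x <= x <= max_x and min_y <= y <= max_y


def point_in_polygon(point, ring):
    """Exact test by ray casting: count edge crossings to the right."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical parcels: an id, a precomputed bounding box, and the outline.
parcels = [
    {"id": "A", "bbox": (0, 0, 4, 4), "ring": [(0, 0), (4, 0), (0, 4)]},
    {"id": "B", "bbox": (2, 2, 6, 6), "ring": [(2, 2), (6, 2), (6, 6), (2, 6)]},
    {"id": "C", "bbox": (10, 10, 12, 12),
     "ring": [(10, 10), (12, 10), (12, 12), (10, 12)]},
]
query_point = (3, 3)

# Pass 1: coarse filter on simple numeric columns (cheap, runs over all rows).
candidates = [p for p in parcels if point_in_bbox(query_point, p["bbox"])]

# Pass 2: exact geometry on the small candidate set only.
matches = [p["id"] for p in candidates if point_in_polygon(query_point, p["ring"])]
print([p["id"] for p in candidates], matches)
```

Parcel A survives the coarse filter because the point lies in its bounding box, but the exact test rejects it. The expensive test never runs on parcel C at all.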

Processing geographic information data with CloudCanal's modern data stack has the following benefits:

  • It copes with complex business query needs, because each workload can use the new database that suits it best.

  • The application runs geometric functions only on the small data set already filtered by the analysis engine, which significantly improves efficiency.

CloudCanal is friendly and compatible with geographic information data

Table structure migration

When PostgreSQL is the primary database and ClickHouse is the analytical database, the first problem is creating the tables in ClickHouse. Without CloudCanal, creating the tables is painful. With CloudCanal, the process becomes quite convenient.

  • PostgreSQL has no statement like MySQL's show create table that returns the original table creation statement for reference, so each table has to be created one by one.

  • ClickHouse's column types are not the same as PostgreSQL's column types, so you need to understand how they map and convert.

Even when synchronizing data from PostgreSQL to PostgreSQL, there are some issues to consider:

  • PostgreSQL table structure migration with SRID

CloudCanal solves these problems. It automatically recognizes the column types of the table and maps them to suitable column types in the target, which saves a lot of time learning about the new database.

Like PostgreSQL, CloudCanal has relatively complete support for GIS features, and it accurately handles PostgreSQL table structures with SRIDs, such as the table below.

CREATE TABLE "city"
(
    "ogc_fid"    int4 NOT NULL,
    "mssm"       varchar(16),
    "bz"         varchar(16),
    "provincen"  varchar(50),
    "provincec"  varchar(50),
    "cityn"      varchar(50),
    "cityc"      varchar(50),
    "shape_leng" float4,
    "shape_area" float4,
    "geom"       geometry(geometry,4490) -- column with an SRID
);
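To illustrate what mapping column types between the two databases involves, here is a small sketch that turns a few columns of the table above into a ClickHouse table definition. The mapping table is an assumption made for this example and is not CloudCanal's actual mapping. Storing the geometry as a String that holds WKT is likewise an assumption, and so is the choice of MergeTree ordered by ogc_fid.

```python
import re

# Illustrative mapping only; NOT CloudCanal's actual type mapping.
PG_TO_CLICKHOUSE = {
    "int4": "Int32",
    "float4": "Float32",
    "varchar": "String",
    "geometry": "String",  # assumption: geometry is carried as WKT text
}


def map_column_type(pg_type: str, nullable: bool = True) -> str:
    """Map a PostgreSQL column type to a ClickHouse type name."""
    # Keep only the base name, e.g. "varchar(16)" -> "varchar".
    match = re.match(r"[a-z0-9_]+", pg_type.strip().lower())
    if match is None or match.group(0) not in PG_TO_CLICKHOUSE:
        raise ValueError(f"no mapping for PostgreSQL type: {pg_type!r}")
    ch_type = PG_TO_CLICKHOUSE[match.group(0)]
    return f"Nullable({ch_type})" if nullable else ch_type


# (column name, PostgreSQL type, nullable) for a subset of the city table.
city_columns = [
    ("ogc_fid", "int4", False),
    ("cityn", "varchar(50)", True),
    ("shape_area", "float4", True),
    ("geom", "geometry(geometry,4490)", True),
]

column_sql = ",\n".join(
    f"    {name} {map_column_type(pg_type, nullable)}"
    for name, pg_type, nullable in city_columns
)
ddl = f"CREATE TABLE city\n(\n{column_sql}\n)\nENGINE = MergeTree\nORDER BY ogc_fid"
print(ddl)
```

Note that the SRID in geometry(geometry,4490) has no place in a plain String column, so it would have to travel with the data or in a separate column.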

Data migration

CloudCanal can migrate all geographic information data in the user's source database to a heterogeneous target data source, and it supports resuming from a breakpoint.

CloudCanal has done a lot of work on the migration of geographic information data. When the source database is PostgreSQL, the full data synchronization recognizes the SRID information on the table, converts the EWKB data that PostgreSQL uses into standard WKT, and delivers the WKT together with the SRID as the final data.

  • When the target is ClickHouse, we get the complete geographic information data and the corresponding coordinate system SRID, which can be processed further in the program.

  • When the target is PostgreSQL, the geographic information and coordinate system are likewise fully synchronized to the target.

Custom processing

While geographic information data is migrated or synchronized from the source database to the target database, CloudCanal's custom code feature allows very flexible processing.

Users can implement their own custom code to do additional processing on each record during synchronization. For example:

  • GIS applications often compute the bounding rectangle of a geometry (the smallest axis-aligned rectangle that encloses it) and store this rectangle in a new field.

  • Find the center point of a geometry.

  • Trim the data in advance, and write the cleaned and trimmed data into the new target database.
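The first two kinds of per-record processing listed above can be sketched as follows. This is plain Python for illustration and does not use CloudCanal's custom code interface. The record layout and function names are hypothetical, the geometry is assumed to be a simple polygon given as a list of vertices, and "center point" is interpreted here as the area centroid.

```python
def envelope(ring):
    """Bounding rectangle of a polygon as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def centroid(ring):
    """Area centroid of a simple polygon, using the shoelace formula."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if area2 == 0:
        raise ValueError("polygon has zero area")
    return cx / (3 * area2), cy / (3 * area2)


def add_derived_fields(record):
    """Return a copy of the record with bounding box and center added."""
    enriched = dict(record)
    enriched["bbox"] = envelope(record["ring"])
    enriched["center"] = centroid(record["ring"])
    return enriched


# Hypothetical record with a 4 x 2 rectangle as its geometry.
record = {"id": 1, "ring": [(0, 0), (4, 0), (4, 2), (0, 2)]}
enriched = add_derived_fields(record)
print(enriched["bbox"], enriched["center"])
```

Precomputing fields like these during synchronization means the target database can filter on simple numeric columns without calling geometry functions at query time.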

Long-term real-time geographic information data synchronization

CloudCanal not only supports the migration of historical data but also supports real-time data synchronization between heterogeneous data sources. Real-time geographic information data synchronization can enhance the competitiveness of enterprises in certain business areas.

For real-time synchronization, users naturally care most about long-term stability.

To keep real-time data synchronization stable while the business is running, CloudCanal uses several mechanisms:

  • Fully distributed: all core components support distributed deployment to avoid a single point of failure.

  • Automatic disaster recovery: if the machine running a real-time synchronization task crashes, CloudCanal automatically moves the task to another available machine and continues the synchronization.

  • Resuming from a breakpoint: CloudCanal has purpose-built position management for each type of source database. When a task restarts after a failure, it continues from the position where it was last interrupted, which avoids data loss.

Summary

At the beginning of this article, we briefly introduced the background knowledge related to geographic information data. In the second half of the article, we discussed how to use modern data stacks to efficiently process geographic information data.

If you found this article helpful, please share it. We also hope more people will follow and use CloudCanal. You can download and try it from the ClouGence official website, https://www.clougence.com.

References

  • Beidou Satellite Navigation System Public Service Performance Specification Version 1.0
  • Jointly announce the elevation of Mount Everest
  • "Striver" all-sea deep manned submersible successfully completed the 10,000-meter deep dive test
  • Beidou satellite navigation system public service performance specification
  • High-precision map (1) - geographic coordinate system
  • The Home of Location Technology Innovation and Collaboration
  • Correct posture for storing GIS data in MYSQL 8.0

Origin my.oschina.net/u/5170379/blog/5585656