SuperMap GIS Big Data Analysis and Tuning Action Guide


1. Purpose of big data analysis tuning

    When the volume of project data reaches the tens of millions of records, and datasets of that scale must be analyzed against one another, many customers choose the big data module of the SuperMap GIS foundational software (hereinafter, GIS software). Compared with other products, the big data module's environment configuration is relatively complex: components such as Spark must be set up from the Linux command line, which can raise the barrier for project implementation.
    Drawing on years of experience with big data analysis projects, this article summarizes a set of tuning reference methods.

2. Tuning approach

    There are three main directions for big data analysis tuning: Spark cluster tuning, data tuning, and GIS software tuning, as shown in the following figure:
[Figure: overview of the three tuning directions]

2.1 Spark environment tuning

    Spark parameter optimization is mainly concentrated in the spark-defaults.conf file. This article introduces the tuning parameters from the following four aspects:

  • Memory tuning: modify in the spark-defaults.conf file

  • Parallelism tuning: modify in the spark-defaults.conf file

  • Disk IO tuning: modify in the spark-defaults.conf file

  • Network IO tuning: modify in the spark-defaults.conf file

    The specific parameters and recommended values are shown in the following table. After modifying a parameter, restart Spark to ensure the change takes effect:

| Tuning direction | Configuration item | Description | Default | Recommended value |
| --- | --- | --- | --- | --- |
| Memory tuning | spark.driver.memory | Memory size of the Driver process | 1 GB | 2 GB or 4 GB |
| Memory tuning | spark.executor.memory | Memory size of each Executor process | 1 GB | 2 GB or 4 GB |
| Memory tuning | spark.memory.fraction | Fraction of heap memory used by Spark's memory manager | 0.6 | 0.5 or 0.7 |
| Memory tuning | spark.memory.storageFraction | Fraction of Spark-managed memory reserved for storage | 0.5 | 0.4 or 0.6 |
| Memory tuning | spark.executor.memoryOverhead | Extra memory reserved per Executor beyond the memory used to execute tasks | max(384 MB, 10% of spark.executor.memory) | Can be set lower, e.g. 256 MB |
| Parallelism tuning | spark.default.parallelism | Default parallelism of RDD operations | Number of CPU cores in the cluster | 2-4 times the number of CPU cores |
| Parallelism tuning | spark.sql.shuffle.partitions | Parallelism of shuffle operations | 200 | 1000 or higher |
| Parallelism tuning | spark.cores.max | Maximum number of CPU cores each application may use | All CPU cores in the cluster | Close to or equal to the total available cores for CPU-heavy applications; otherwise about 70%-80% of the cores |
| Disk IO tuning | spark.local.dir | Local temporary directory for scratch data | /tmp | A local path with enough disk space for intermediate results and other cached data |
| Disk IO tuning | spark.shuffle.file.buffer | Buffer size for shuffle file writes | 32k | 64k or 128k |
| Disk IO tuning | spark.reducer.maxSizeInFlight | Maximum size of map output each reduce task fetches at once | 48m | 128m or 256m |
| Network IO tuning | spark.rpc.message.maxSize | Maximum size of an RPC message | 128 MB | 512 MB or higher for applications handling large-scale data |
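As a sketch only, the recommendations above could be written into spark-defaults.conf as follows (all values are illustrative and should be adapted to your cluster's cores and memory; note that spark.rpc.message.maxSize takes a plain number of MiB):

```properties
# spark-defaults.conf -- illustrative values following the table above
spark.driver.memory              4g
spark.executor.memory            4g
spark.memory.fraction            0.6
spark.memory.storageFraction     0.5
spark.default.parallelism        64      # roughly 2-4x the cluster's CPU cores
spark.sql.shuffle.partitions     1000
spark.local.dir                  /data/spark-tmp
spark.shuffle.file.buffer        64k
spark.reducer.maxSizeInFlight    128m
spark.rpc.message.maxSize        512
```

Restart Spark after editing the file so the new values take effect.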

Description of official configuration items (version 3.3.0)
http://spark.incubator.apache.org/docs/3.3.0/configuration.html

2.2 Data tuning

2.2.1 How to choose a suitable storage database

    Which databases suit big data storage and analysis varies with the business scenario. The following briefly describes the types, characteristics, and trade-offs of databases suitable for big data storage and analysis with SuperMap GIS products:

| Database type | Introduction | Advantages and disadvantages |
| --- | --- | --- |
| Hadoop HDFS | Hadoop HDFS (Hadoop Distributed File System) is a distributed file system. Its core design splits large files into blocks stored on different nodes, achieving highly reliable, high-throughput data access and supporting storage and processing on large clusters | Advantages: highly scalable, supports PB-scale data storage. Disadvantages: not suited to structured data; relatively low read/write performance |
| PostGIS | PostGIS is an open-source spatial extension to PostgreSQL, suited to storing and processing spatial data. It supports complex spatial query and analysis with high reliability and performance | Advantages: well suited to spatial data, with many functions and operators for spatial operations, queries, and analysis. Disadvantages: not suited to storing non-spatial data |
| Others | Dameng, HighGo, Yugong, Kingbase, Oracle | Advantages: a wide variety of databases is supported, covering diverse database connection requirements. Disadvantages: most of these cannot be read via JDBC, and distributed read/write performance trails the databases that do support JDBC |

    In summary, different databases can be chosen according to the specific application scenario and requirements. Recommended databases per scenario:

| Application scenario | Recommended database |
| --- | --- |
| Tens of millions of records; geographically partitioned feature datasets stored and written to a DSF directory | Hadoop HDFS |
| Spatial data needs to be stored and processed | PostGIS |
| Data needs to be stored in a relational database | Dameng, HighGo, Yugong, Kingbase, Oracle, etc. |

2.2.2 Database Tuning

HDFS

HDFS tuning mainly involves modifying the Hadoop/etc/hadoop/hdfs-site.xml file:

  • Increase the data block size: You can improve read performance by increasing the HDFS data block size. In the hdfs-site.xml configuration file, you can set the dfs.blocksize parameter, which is 128MB by default.
  • Increase the number of copies: Read performance and data reliability can be improved by increasing the number of copies of HDFS data blocks. In the hdfs-site.xml configuration file, you can set the dfs.replication parameter, which is 3 by default.
  • Enable compression: You can reduce disk space usage and network transmission overhead by enabling data compression. In the hdfs-site.xml configuration file, you can set the io.compression.codecs parameter to select the desired compression algorithm, such as org.apache.hadoop.io.compress.GzipCodec.
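A minimal hdfs-site.xml fragment applying the three settings above (values are illustrative; dfs.blocksize is given in bytes here, 268435456 = 256 MB):

```xml
<!-- hdfs-site.xml: illustrative values for the settings discussed above -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
</configuration>
```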

PostGIS

PostGIS tuning mainly involves modifying PostgreSQL/data/postgresql.conf:

  • Increase the cache: read performance can be improved by increasing the cache size via the shared_buffers parameter in postgresql.conf. The default is 128MB; a common recommendation is 1/4 to 1/3 of system memory.
  • Tune the query planner: query performance can be improved by adjusting planner-related parameters in postgresql.conf, such as effective_cache_size (often recommended at about half of system memory) and work_mem (recommended between 1/16 and 1/8 of shared_buffers).
  • Enable field indexing: You can improve query performance by enabling indexing. Indexes can be created through tools such as the CREATE INDEX command or pgAdmin, such as:
CREATE INDEX idx_name ON table_name (column_name) WHERE column_name > 0;
  • Spatial Index: You can improve query performance by tuning the PostGIS index. Indexes can be created through tools such as the CREATE INDEX command or pgAdmin, such as:
CREATE INDEX mytable_gist_idx ON mytable USING gist (geom); 
  • Use spatial aggregates: functions such as ST_Union, ST_Collect, and ST_ConvexHull merge multiple geometries into a single object. Spatial aggregation reduces the amount of data a query handles, improving query performance.
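As a sketch of the spatial-aggregation idea (the table and column names are hypothetical):

```sql
-- Merge all parcel geometries in each region into one geometry, so later
-- analysis reads one row per region instead of many individual parcels.
SELECT region_id, ST_Union(geom) AS region_geom
FROM parcels
GROUP BY region_id;
```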

2.2.3 Data optimization

    Beyond database configuration tuning, for a given dataset and data volume this article mainly recommends two practices:

  • Reduce the amount of data involved in computation and analysis, while preserving accuracy

          ○ Common approach: thin the data, or simplify it using database functions and similar operations

  • When the data volume cannot be reduced, create indexes on the data

          ○ Common approach: see section 2.2.2 Database tuning and section 2.3 SuperMap GIS product tuning
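For the thinning/simplification approach, a minimal PostGIS sketch (the table name and tolerance are hypothetical; the tolerance is in the units of the data's coordinate system):

```sql
-- Simplify boundary geometries with a 10-unit tolerance while preserving
-- topology, reducing the data volume fed into subsequent analysis.
SELECT id, ST_SimplifyPreserveTopology(geom, 10.0) AS geom
FROM boundaries;
```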

2.3 SuperMap GIS product tuning

2.3.1 SuperMap GIS engine tuning

    In the GIS software, different databases are accessed through different connection engines; the engines support different databases and have different strengths, as follows:

JDBC

  • Databases connected through this engine: PostGIS, Oracle Spatial
  • Engine introduction: JDBC (Java Database Connectivity) is the Java API for connecting to and operating on relational databases. It defines a standard interface through which Java programs can connect to different relational databases and execute SQL queries and updates: JDBC establishes a connection between a Java application and a database and provides a set of interfaces and methods for executing SQL statements, processing result sets, managing transactions, and so on
  • Advantages: in cluster mode it enables distributed reads and writes, greatly improving performance
  • Disadvantages: the JDBC mode currently supports reading spatial data only from PostGIS and Oracle Spatial
  • Tuning: see section 2.2.2 Database tuning

DSF

  • Databases connected through this engine: Hadoop HDFS
  • Engine introduction: the Distributed Spatial File engine (hereinafter DSF) manages vector, raster, and imagery data, combining high-performance distributed storage and access for very large spatial data with spatial query, statistics, distributed analysis, spatial visualization, and data management capabilities
  • Advantages: DSF is well suited to high-performance distributed computation over full spatial datasets; it strengthens distributed storage and management of massive classic spatial data and significantly improves computation performance on large data volumes
  • Disadvantages: the data must be given a geographic partition index in advance, one extra step compared with analyzing vector data read directly
  • Tuning: build a suitable index
              Grid index: use when the data is fairly evenly distributed. Choose the row and column counts so that the number of objects in each grid cell stays within bounds, e.g. at most about 100,000 polygon objects or 500,000 point objects per cell; for instance, 5 million polygons at 100,000 per cell needs at least 50 cells, such as an 8 x 8 grid
              Quadtree index: use when the data is unevenly distributed and clearly clustered. For the maximum number of objects per partition in a geographically partitioned feature dataset, the usual rule is to store the dataset as DSF with each DSF file no larger than the Hadoop block size (128 MB by default, commonly configured to 256 MB). For example, about 500,000 objects for a point feature dataset, or about 50,000 objects for land-use parcel polygon data

SDX

  • Databases connected through this engine: Dameng, HighGo, Yugong, Kingbase, Oracle, PostGIS, etc.
  • Engine introduction: SDX is SuperMap's spatial engine technology. It provides a universal access mechanism (or pattern) for data stored in different engines; the engine types include database engines, file engines, and web engines
  • Advantages: SDX supports the current mainstream commercial relational database platforms and provides R-tree, quadtree, dynamic (multi-level grid), and map-sheet (three-level) indexes, drawing on the strengths of each to improve data access and query efficiency
  • Disadvantages: because SDX reads and writes load the whole database into memory, distributed read/write efficiency is low
  • Tuning: use the desktop client to create suitable spatial and field indexes on the data in advance

2.3.2 How to choose among JDBC, DSF, and SDX

| Usage scenario | Recommended engine |
| --- | --- |
| Data above the tens-of-millions scale | DSF |
| Routine use | JDBC |
| Data stored in databases that support neither JDBC nor DSF | SDX |

3. Possible problems and solutions

Problem 1: During Spark execution, the error java.lang.UnsatisfiedLinkError: supermap.data.EnvironmentNative.jniGetBasePath()Ljava/lang/String; is reported


Solution: this error occurs because the Java component cannot be loaded. Check that the Java component is installed, that its system configuration is correct after installation, and that no dependencies are missing. Once the component is configured correctly, resubmit the task.
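One way to check for missing native dependencies on Linux is with ldd (the install directory and the SUPERMAP_HOME variable are assumptions; substitute your actual path):

```shell
# Print the Java version, then scan the SuperMap native libraries
# (directory is illustrative) for unresolved shared-library references.
java -version 2>&1 | head -n 1
LIB_DIR="${SUPERMAP_HOME:-/opt/iobjects-java}/bin"
if [ -d "$LIB_DIR" ]; then
  ldd "$LIB_DIR"/*.so 2>/dev/null | grep "not found" \
    || echo "no missing dependencies detected"
else
  echo "native library directory not found: $LIB_DIR"
fi
```

Any library reported as "not found" must be installed (or added to LD_LIBRARY_PATH) before resubmitting the task.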

4. A more convenient way to analyze big data

    SuperMap GIS products offer several ways to run big data analysis. We recommend the GPA (geoprocessing automation) module in SuperMap GIS (available in both SuperMap iDesktopX and SuperMap iServer):

  • GPA provides 100+ operators and supports extension with custom tools. Through visual modeling, developers can quickly build their own business workflows without writing complex code, automating spatial data processing and analysis;
  • Spark tuning parameters can be modified directly in the UI, with no need to repeatedly edit the Spark configuration files;
  • Existing business models can be exported and imported, so they are easy to share and reuse, greatly lowering the barrier to big data analysis.

[Figure: SuperMap iServer processing automation service]
    If you have any questions during use, please call the technical support hotline: 400-8900-866.

Origin: blog.csdn.net/supermapsupport/article/details/131696348