GP performance management

1. The performance of the GP database is determined by the slowest segment in the set of segment servers.

2. The GP database does not currently support triggers.

3. Greenplum Database runs well on traditional UNIX file systems, such as BSD UFS/FFS, which many operating systems support. On Linux, XFS is recommended; on Solaris, ZFS is recommended.

4. GP database overview:
(1) Flexible scalability: expand capacity and performance online; (2) quickly ingests large amounts of data from different sources; (3) provides high-performance parallel processing for in-database analytics; (4) supports high user concurrency; (5) extremely high performance: optimized for fast query execution and very fast data loading; (6) reduced total cost of ownership: consolidates data marts to cut costs; (7) highly available: self-healing and fully redundant; (8) advanced backup and disaster recovery: leverages industry-leading Data Domain backup and recovery; (9) rapid deployment: purpose-built data warehouse appliances.

5. Hardware considerations:
(1) All segment servers should have the same hardware configuration; recommended: dual-core CPUs, 32 GB of memory, a high-speed disk array, and more than four Gigabit Ethernet ports.
(2) The Master server should have ample CPU and memory resources.
(3) Baseline performance: 3.2 GB/s (aggregate disk read/write throughput across the system).

6. The following contrib modules are commonly used:
dblink: similar to Oracle's dblink but less capable; it can only connect to PostgreSQL or PostgreSQL-based databases
oid2name: a standalone executable command that gets the OID of a database object, or looks up object information by OID
pg_buffercache: queries the contents of the shared_buffers cache in real time
pg_freespacemap: displays free space map (FSM) contents
pgrowlocks: displays row-lock information
pgstattuple: counts the "dead rows" and free space in a specified table
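Once installed, pg_buffercache exposes a view of the same name. A minimal sketch, assuming the module is installed in the current database:
=# SELECT count(*) AS cached_pages FROM pg_buffercache; -- pages currently held in shared_buffers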

7. Set Master parameters
Master parameters can only be set on the master node of the GP.
If the same parameter is set at multiple levels, the most specific setting takes precedence: session overrides role, role overrides database, and database overrides system.
  
[system-level parameters]
The steps are as follows:
(1) Edit the $MASTER_DATA_DIRECTORY/postgresql.conf file
(2) Find the parameter you want to modify, remove the comment (delete the leading #), and set the desired value
(3) Save and close the file
(4) For session parameters, there is no need to restart the service; reload the configuration as follows:
   $ gpstop -u
(5) For parameter changes that require a restart, execute:
   $ gpstop -r
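To verify a setting afterwards, check it from a session with SHOW, or use the gpconfig utility (shipped with GP) to display a parameter's value on the master and segments:
=# SHOW search_path;
$ gpconfig -s max_connections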
  
[database-level parameter]
When a session parameter is set at the database level, every session that connects to that database uses that setting.
Database-level settings override system-level settings.
To set database-level parameters, use the ALTER DATABASE command
For example:
=# ALTER DATABASE mydatabase SET search_path TO myschema;
  
[role-level parameter]
When a session parameter is set at the role level, every session initialized by that role uses that setting.
Role-level settings override database-level settings.
Such as:
=# ALTER ROLE bob SET search_path TO bobschema;
  
[session-level parameters]
Any session parameter can be set in an active session with the SET command.
Session-level settings override role-level settings.
Such as:
=# SET work_mem TO '200MB';
=# RESET work_mem;

8. Start the master in maintenance mode
Maintenance mode means that only the master is started.
Usage => without touching segment user data, connect to the master in utility mode only, and edit settings in the system catalog.

Steps => (1) Enter maintenance mode: $ gpstart -m; (2) Connect to the master and do catalog maintenance, for example: $ PGOPTIONS='-c gp_session_role=utility' psql template1; (3) After completing the administrative tasks, stop the utility-mode master before returning to production mode
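A minimal end-to-end sketch of such a session (the catalog query is just an illustration):
$ gpstart -m                                          -- start the master only
$ PGOPTIONS='-c gp_session_role=utility' psql template1
=# SELECT datname FROM pg_database;                   -- example catalog lookup
=# \q
$ gpstop -m                                           -- stop the maintenance-mode master
$ gpstart                                             -- restart in production mode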
9. [Update data] Restrictions on UPDATE in GP:
1. The distribution key cannot be updated
2. When mirrors are enabled, STABLE or VOLATILE functions cannot be used in an UPDATE statement
3. RETURNING is not supported
UPDATE SQL example:
UPDATE products SET price = 10 WHERE price = 5; -- update one or more rows

10. [Delete data] Restrictions on DELETE in GP:
1. When mirrors are enabled, STABLE or VOLATILE functions cannot be used in a DELETE statement
2. RETURNING is not supported
3. TRUNCATE does not scan the table
DELETE SQL examples:
DELETE FROM products WHERE price = 10; -- delete rows matching the WHERE condition
DELETE FROM products;                  -- delete all rows in the table
TRUNCATE mytable;                      -- empty the table

11. [Transaction]
BEGIN or START TRANSACTION -- open a transaction
END or COMMIT -- end (commit) a transaction
ROLLBACK -- roll back a transaction
SAVEPOINT -- mark a point within a transaction for partial commit or rollback
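A short sketch showing a savepoint used for partial rollback within one transaction:
=# BEGIN;
=# UPDATE products SET price = 10 WHERE price = 5;
=# SAVEPOINT before_cleanup;
=# DELETE FROM products WHERE price = 10;
=# ROLLBACK TO SAVEPOINT before_cleanup; -- undo only the DELETE
=# COMMIT;                               -- the UPDATE is kept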

12. [Index]
In an OLTP environment, indexes are used heavily in pursuit of the fastest response time, since a query usually hits a single row or a small data set.
GP, however, is generally used for OLAP, where queries are mostly full table scans, so indexes should be used as little as possible.
GP recommends measuring the cost of your queries without any indexes first. (Note: for a table with a primary key, the system automatically creates a primary-key index.)

12.1 Considerations for index building:
(1) Query load: BI workloads generally access large data sets, so indexes are of little use. For OLAP-type databases, sequential reads of large batches of data perform better than indexed random scans.
(2) Compressed tables: indexes can improve read performance on compressed append-only tables
(3) Do not build indexes on frequently updated columns
(4) Create B-tree indexes selectively: index selectivity = number of distinct values in the column / total number of rows. For example: if a table has 1000 rows and a column has 800 distinct values, the index selectivity is 0.8. A unique index has selectivity 1.0, the best case (see the query sketch after this list)
(5) Use Bitmap indexes on low-selectivity columns: GP adds Bitmap indexes, which are not available in PostgreSQL
(6) Index columns that are frequently used in joins: building an index on columns often used for joins (for example, foreign keys) can improve join performance
(7) Index columns that are frequently used in WHERE clauses
(8) Avoid creating duplicate indexes on the same columns
(9) Drop indexes when bulk-loading data: when importing a large amount of data, drop the indexes first and rebuild them after the load completes. This is faster.
(10) Consider a clustered index: clustering an index means the data is physically sorted on disk. Since related rows sit physically closer together, reads are more sequential.
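A quick way to estimate selectivity before deciding on a B-tree index (a sketch; table and column names are illustrative):
=# SELECT count(DISTINCT gender)::float / count(*) AS selectivity FROM employee; -- near 0 => consider bitmap; near 1 => B-tree candidate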
  
12.2 Index types:
(1) B-tree; (2) GiST -- for GIS; (3) Bitmap (this index type does not exist in PostgreSQL)
Note: Hash and GIN indexes are disabled in GP.
  
12.3 Create index
CREATE INDEX title_idx ON films (title); -- create a B-tree index (the default)
CREATE INDEX gender_bmp_idx ON employee USING bitmap (gender); -- create a bitmap index
  
12.4 Check index usage
Although GP does not maintain and optimize indexes automatically, it is still important to check whether the workload actually uses the indexes.
Check for the following hints in EXPLAIN output:
(1) Index Scan
(2) Bitmap Heap Scan
(3) Bitmap Index Scan
(4) BitmapAnd or BitmapOr
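For example, to see whether the bitmap index above is used, run EXPLAIN and look for the scan node names listed above in the plan:
=# EXPLAIN SELECT count(*) FROM employee WHERE gender = 'F';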
  
12.5 Manage indexes
In some cases, a low-performing index may need to be rebuilt (REINDEX). UPDATE and DELETE do not update bitmap indexes, so after updating or deleting rows in a table with a bitmap index, the index needs to be rebuilt.
REINDEX TABLE my_table; -- rebuild all indexes on my_table
REINDEX INDEX my_index; -- rebuild my_index
DROP INDEX title_idx; -- drop index title_idx

13. GP supports partitioned tables, which are mainly used to store large tables, such as fact tables.
Purpose: (1) slice big data for easier querying; (2) ease database maintenance; (3) when partitions are created, each partition gets a CHECK constraint limiting its data range; the CHECK constraints are also used to locate partitions when executing queries.
Supported partition types: (1) range partitioning; (2) list partitioning; (3) composite partitioning. A range-partitioning sketch follows below.
(1) The difference between partitioning and distribution
distribution -- physically splits table data, so queries can execute in parallel
partitioning -- logically splits large-table data, to improve query performance and ease data warehouse maintenance
(2) Viewing table partitions
pg_partitions -- view partition creation information
pg_partition_templates -- view subpartitions created with subpartition templates
pg_partition_columns -- view partition columns
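A minimal sketch of a range-partitioned fact table with one partition per month (table, column names, and date ranges are illustrative):
=# CREATE TABLE sales (id int, sale_date date, amount numeric(10,2))
   DISTRIBUTED BY (id)
   PARTITION BY RANGE (sale_date)
   ( START (date '2023-01-01') INCLUSIVE
     END   (date '2024-01-01') EXCLUSIVE
     EVERY (INTERVAL '1 month') );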

14. Select Column data type
(1) Character types: CHAR, VARCHAR, and TEXT show no performance difference in GP (CHAR has performance advantages in some other database systems); in most cases TEXT or VARCHAR can be used instead of CHAR.
(2) Numeric types: prefer the smallest data type that fits; use INT or SMALLINT instead of BIGINT where possible.
(3) For cross-table joins, make sure the join columns have the same data type; otherwise GP performs data type conversions, at a performance cost.
(4) GP also provides some collection data types.

15. Constraints: compatible with PostgreSQL, including CHECK, NOT NULL, UNIQUE, and PRIMARY KEY; FOREIGN KEY is not supported.
  
16. Choosing a distribution strategy: DISTRIBUTED BY (hash distribution) and DISTRIBUTED RANDOMLY (round-robin random distribution)
Considerations, in order of importance (see the sketch after this list):
(1) Even data distribution: for the best performance, all segments should hold equal amounts of data. If the distribution is skewed, the segments holding more data carry a much heavier load during queries.
(2) Local vs. distributed operations: for join, sort, or aggregation operations, segment-local processing (within a segment) is faster than distributed processing (across segments).
(3) Even query processing: each segment should receive an equal share of the query workload.
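A sketch of both strategies, plus a quick per-segment row-count check for skew (gp_segment_id is a system column present on every GP table; table names are illustrative):
=# CREATE TABLE orders (order_id int, customer_id int) DISTRIBUTED BY (order_id);
=# CREATE TABLE staging_raw (payload text) DISTRIBUTED RANDOMLY;
=# SELECT gp_segment_id, count(*) FROM orders GROUP BY 1 ORDER BY 1; -- counts should be roughly equal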
  
17. Table storage method
(1) Heap or Append-Only storage: GP uses heap tables by default. Heap tables are best for small tables, such as dimension tables (updated frequently after initialization). Append-Only tables cannot be updated or deleted; they are generally used for batch data loading, and single-row inserts are not recommended.
CREATE TABLE bar (a int, b text)
        WITH (appendonly=true)
        DISTRIBUTED BY (a);
(2) Row- or Column-Oriented storage: row storage, column storage, or a mix of both
a. Data needs to be updated
row storage => choose if table data must be updated after loading
column storage => only available for append-only tables
b. Frequent inserts
row storage => choose if data is inserted frequently
column storage => not optimized for write operations (column values of the same row must be written to different locations on disk)
c. Queries touching many columns
row storage => choose when the SELECT list or WHERE clause touches all or most columns
column storage => choose when queries aggregate a single column or filter on a single column, e.g.:
SELECT AVG(salary) ... WHERE salary > 10000
SELECT salary, dept ... WHERE state = 'CA'
d. Number of columns in the table
row storage => choose when many columns are requested at once, or rows are relatively small
column storage => choose for wide tables where queries read only a few columns
e. Compression
row storage => not available
column storage => available
For example (note: column-oriented storage requires an append-only table):
CREATE TABLE bar (a int, b text)
        WITH (appendonly=true, orientation=column)
        DISTRIBUTED BY (a);
(3) Using compression (applicable to Append-Only tables only): you can use the database's built-in compression (zlib or QuickLZ). If the tables sit on a compressed file system, append-only tables should not also use table compression.
When choosing the compression type and level for append-only tables, consider: a. CPU usage; b. compression ratio / disk size; c. compression speed; d. decompression speed / scan rate.
Although compression reduces data size, the time and CPU cost of compressing and decompressing the data must be weighed.
Compression performance depends on hardware, query tuning settings, and other factors.
    QuickLZ - lower compression ratio, lower CPU cost, compresses data in blocks
    zlib - higher compression ratio, slower
CREATE TABLE foo (a int, b text)
        WITH (appendonly=true, compresstype=zlib,
        compresslevel=5);
(Note: QuickLZ supports only compression level 1; zlib levels can be set from 1 to 9.)

18. A Schema is the logical organization of objects and data within a Database.
Within the same Database, objects in different schemas can share a name. For example, schema A can have a table called tab1 and schema B can also have a table called tab1, but two tables named tab1 in the same schema would raise an error.
SELECT * FROM myschema.mytable;
Note: if a schema name is specified in the SQL, the specified schema is queried; otherwise the schemas listed in the search_path configuration parameter are searched.
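A short sketch of creating a schema and steering name resolution via search_path:
=# CREATE SCHEMA myschema;
=# SET search_path TO myschema, public; -- unqualified names now resolve to myschema first
=# SELECT * FROM mytable;               -- finds myschema.mytable without qualification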
(1) User-level schema
public => the default schema, created when GP is installed
(2) System-level schemas
pg_catalog => contains the system catalog tables, built-in data types, functions, and operators.
information_schema => contains a collection of standard views that present information drawn from the system catalog tables.
pg_toast => stores large objects (used internally by GP)
pg_bitmapindex => stores bitmap index objects (used internally by GP)
pg_aoseg => stores append-only table objects (used internally by GP)
gp_toolkit => an administrative schema for querying system log files and other system metrics; contains external tables, views, and functions.

19. Database
GP can hold one or more databases, but a client program can access only one database at a time (cross-database queries are not possible). If no template is specified when creating a database, template1 is used by default (template1 is generated when GP is initialized).
Besides this template there are two others: template0 and postgres (used internally by the system; do not delete or modify them). template0 can create a complete, clean database containing only the standard objects predefined by GP.
(1) Three ways to create a database
=> CREATE DATABASE new_dbname; (client connection; must be a superuser or have database-creation privilege)
$ createdb -h masterhost -p 5432 mydatabase (command-line creation)
=> CREATE DATABASE new_dbname TEMPLATE old_dbname; (clone a database)
(2) Two ways to drop a database
(Note: dropping a database cannot be rolled back; use with caution!)
=> \c template1
=> DROP DATABASE mydatabase; (client connection: connect to template1 first, then drop)
$ dropdb -h masterhost -p 5432 mydatabase (command line)
(3) Management
=> SELECT datname FROM pg_database; (list all databases)
=> ALTER DATABASE mydatabase SET search_path TO myschema, public, pg_catalog; (modify database parameters)

20. Tablespace & Filespace
Tablespaces allow each machine to use multiple file systems, and let you decide how best to use physical storage.
Tablespaces are used for various reasons: a. to choose different storage types based on how frequently data is used; b. to control the I/O performance of certain database objects.
  
A filespace is the collection of file system locations across all components. One filespace can be used by one or more tablespaces. The following two tablespaces use the pg_system filespace (specified when GP is initialized):
pg_global --(stores system metadata)
pg_default --(default tablespace of the template0 and template1 databases)
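A sketch of putting a new tablespace on faster disks (the gpfilespace utility ships with GP; names and paths are illustrative):
$ gpfilespace -o gpfilespace_config          -- interactively generate a filespace configuration file
$ gpfilespace --config gpfilespace_config    -- create the filespace, e.g. fastdisk
=# CREATE TABLESPACE fastspace FILESPACE fastdisk;
=# CREATE TABLE hot_table (a int) TABLESPACE fastspace DISTRIBUTED BY (a);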
21. [Limit concurrency] postgresql.conf file parameters
(1) max_connections -- to change the maximum number of connections, both the master and the segments must be modified; the segment value should be 5-10 times the master value.
(2) A related parameter, max_prepared_transactions -- the maximum number of prepared transactions.
On the master, max_prepared_transactions must be set greater than or equal to max_connections; on the segments it should be set the same as on the master.
For example:
in $MASTER_DATA_DIRECTORY/postgresql.conf (and likewise on the standby master):
max_connections = 100
max_prepared_transactions = 100
in $SEGMENT_DATA_DIRECTORY/postgresql.conf:
max_connections = 500
max_prepared_transactions = 100
The modification steps are as follows:
a. Stop the database: $ gpstop
b. Modify the master parameters
c. Modify the parameters on each segment
d. Restart: $ gpstart
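The same change can be made without hand-editing each file by using the gpconfig utility (a sketch; -c names the parameter, -v sets the segment value, -m sets the master value):
$ gpconfig -c max_connections -v 500 -m 100
$ gpconfig -c max_prepared_transactions -v 100 -m 100
$ gpstop -r   -- restart for the change to take effect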
  
22. [Encrypted client/server connections] GP supports SSL connections.
The requirements are as follows:
1. OpenSSL must be installed on both the client and the master server
2. Set ssl=on in the master's postgresql.conf. With SSL mode on, the server looks for two files in the master's data directory: server.key (server private key) and server.crt (server certificate)
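Once the master is configured, a client can request an encrypted session via libpq's sslmode parameter (a sketch; host and database names are illustrative):
$ psql "sslmode=require host=masterhost dbname=mydatabase"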

23. GP manages data access control through roles. Roles cover two concepts: Users and Groups.
A role can be a database user, a group, or both. A role can own database objects (for example, tables) and can grant access to those objects to other roles. A role can also be a member of another role, and a child role inherits the permissions of its parent role.
  
(1) Each GP database system is effectively a collection of database roles (users and groups). By default, the current operating-system user name is used when logging in. Roles are defined at the system level, meaning they can access every database in the system.
  
(2) [gpadmin user]: the database is generally initialized as gpadmin (both an operating-system user and a database superuser), which holds the highest database privileges. Because gpadmin's privileges are so broad, restrict its use to database maintenance work only (such as database upgrades or expansion).

(3) The difference between Role and User: ROLE + LOGIN permission = USER
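A sketch of the equivalence in item (3):
=# CREATE ROLE jsmith WITH LOGIN; -- a role with LOGIN privilege is a user
=# CREATE USER jdoe;              -- shorthand: CREATE USER implies LOGIN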

(4) [Create Groups]
A Role can also serve as a Group; use GRANT and REVOKE to add and remove member roles.
For example:
=# CREATE ROLE admin CREATEROLE CREATEDB;
=# GRANT admin TO john, sally;
=# REVOKE admin FROM bob;
  
Permissions can also be granted on individual objects:
=# GRANT ALL ON TABLE mytable TO admin;
=# GRANT ALL ON SCHEMA myschema TO admin;
=# GRANT ALL ON DATABASE mydb TO admin;

24. [Query Plan]: query plans are read bottom-up.
Motion -- arises when a task spans multiple nodes and data must move between nodes.
Slice -- GP slices query plans for optimal performance when executing queries. A query plan is divided along its motions; each slice is bounded by a motion.
Redistribute Motion -- moves data between segments. This is very expensive; good early table design should avoid it.
Gather Motion -- segments send data up to the Master. Not every query plan has a gather motion. For example, CREATE TABLE AS SELECT ... has no gather motion; the results are stored directly in the new table without passing through the master.
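To see the motions and slices, run EXPLAIN and read the plan from the bottom up; the top line of a typical SELECT plan is a Gather Motion node such as "Gather Motion 4:1 (slice1; segments: 4)" (the exact shape depends on the cluster; the query below reuses the illustrative orders table from item 16):
=# EXPLAIN SELECT customer_id, count(*) FROM orders GROUP BY customer_id;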

25. [Parallel query]
Query Dispatcher (QD) -- query handling on the Master node; responsible for creating and dispatching query plans, and for collecting and returning the final results
Query Executor (QE) -- query processing on the Segment nodes; responsible for the data computation and for communicating intermediate results between QEs. Each slice of a query plan is assigned at least one worker process
gangs -- the related processes working on the same slice of a query plan. When a slice's work completes, its gang passes the data up to the next-level gang. Communication between them is handled internally.

26. In GP load tests, data loading rates reach 2 TB per hour.

27. [Pricing]
1. Licensed by capacity, starting at 1 TB; the more you buy, the lower the unit price
2. Or buy the all-in-one Data Computing Appliance (DCA): EMC hardware and GP sold together

28. About interconnect redundancy
To ensure high network availability, deploy two Gigabit switches; an offline test environment needs only one.
The role of the Interconnect network layer: it handles inter-process communication among the segment nodes, using standard Gigabit switches.
Data transmission uses the UDP protocol by default. Because UDP itself does not verify packets, GP performs its own additional packet checksums and verification, so reliability is essentially equivalent to TCP, while performance and scalability are better than TCP.
When using TCP, GP is limited to 1000 segments; UDP has no such limit.
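The protocol is controlled by the gp_interconnect_type server parameter (a sketch; run from the master and restart afterwards):
$ gpconfig -c gp_interconnect_type -v tcp   -- switch the interconnect from the default udp to tcp
$ gpstop -r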

29. Greenplum Database is developed from PostgreSQL, based on MPP (massively parallel processing) and a shared-nothing architecture (Oracle RAC, by contrast, is shared-everything).
It is mainly used in data warehouses for large-scale data and complex queries.
Compared with existing data warehouse solutions (Oracle, IBM, Microsoft, Sybase, and Teradata) its advantages are: 1. faster; 2. larger supported data volumes and better scalability.
Limitations: 1. high LAN bandwidth requirements, generally Gigabit switches; 2. online expansion is not supported: expansion requires adding at least 2 more machines, and unless capacity is expanded by a factor of 2, all data must be redistributed evenly.
