Greenplum Best Practices

Data Model

Greenplum Database is a shared-nothing MPP analytical database. This model differs significantly from a highly normalized, transactional SMP database. Greenplum Database performs best with a denormalized schema design suited to MPP analytical processing, for example a star or snowflake schema with large fact tables and smaller dimension tables.

Use the same data types for columns used to join tables.
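
For illustration, a minimal star-schema sketch (table and column names are hypothetical); the join key customer_id uses the same data type in both tables:

    CREATE TABLE dim_customer (
        customer_id bigint NOT NULL,
        customer_name text
    ) DISTRIBUTED BY (customer_id);

    CREATE TABLE fact_sales (
        sale_id     bigint,
        customer_id bigint,        -- same data type as dim_customer.customer_id
        sale_date   date,
        amount      numeric(12,2)
    ) DISTRIBUTED BY (customer_id);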

 

Heap Storage vs. Append-Optimized Storage

Use heap storage for tables and partitions that will receive iterative batch and singleton UPDATE, DELETE, and INSERT operations.

Use heap storage for tables and partitions that will receive concurrent UPDATE, DELETE, and INSERT operations.

Use append-optimized storage for tables and partitions that are updated infrequently after the initial load, with subsequent inserts performed only in large batch operations.

Never perform singleton INSERT, UPDATE, or DELETE operations on append-optimized tables.

Never perform concurrent batch UPDATE or DELETE operations on append-optimized tables. Concurrent batch INSERT operations are acceptable.
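
A minimal sketch of the two storage models (table names are hypothetical); heap is the default, and appendonly=true requests append-optimized storage:

    -- Heap table for data that receives frequent or concurrent UPDATE/DELETE/INSERT
    CREATE TABLE ods_orders (order_id bigint, status text)
        DISTRIBUTED BY (order_id);

    -- Append-optimized table for data loaded in large batches and rarely changed afterwards
    CREATE TABLE hist_orders (order_id bigint, status text)
        WITH (appendonly=true)
        DISTRIBUTED BY (order_id);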

 

Row vs. Column Orientation

Use row-oriented storage for workloads with iterative transactions that require frequent updates and inserts.

Use row-oriented storage when selects against the table are wide.

Use row-oriented storage for general-purpose or mixed workloads.

Use column-oriented storage where selects are very narrow (a few columns) and aggregations are computed over a small number of columns.

Use column-oriented storage for tables that have single columns that are regularly updated without modifying the other columns in the row.
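
A minimal sketch of the two orientations (hypothetical names); column orientation requires append-optimized storage:

    -- Row-oriented, append-optimized: suited to wide selects and mixed workloads
    CREATE TABLE events_row (event_id bigint, payload text, created_at timestamp)
        WITH (appendonly=true, orientation=row)
        DISTRIBUTED BY (event_id);

    -- Column-oriented: suited to narrow selects and aggregations over a few columns
    CREATE TABLE events_col (event_id bigint, payload text, created_at timestamp)
        WITH (appendonly=true, orientation=column)
        DISTRIBUTED BY (event_id);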

 

Compression

Use compression on large append-optimized and partitioned tables to improve I/O across the system.

Set the column compression settings at the level where the data resides.

Balance higher levels of compression against the time and CPU cycles needed to compress and decompress the data.
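
As a sketch of table-level and column-level compression settings (names and levels are illustrative, not tuned recommendations):

    CREATE TABLE sales_ao (
        sale_id bigint,
        note    text ENCODING (compresstype=zlib, compresslevel=5),  -- column-level override
        amount  numeric
    )
    WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=1)
    DISTRIBUTED BY (sale_id);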

 

Distribution

Explicitly define a distribution column or random distribution for all tables. Do not use the default.

Use a single column that will distribute data evenly across all segments.

Do not distribute on columns that will be used in the WHERE clause of a query.

Do not distribute on dates or timestamps.

Never distribute and partition a table on the same column.

For large tables that are frequently joined together, distribute on the same join column to achieve local joins and significantly improve performance.

Validate that data is evenly distributed after the initial load and after incremental loads.

Ensure that there is no data skew!
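
A simple skew check after a load, counting rows per segment for a hypothetical table (the counts should be roughly equal); the gp_toolkit schema also provides skew views:

    SELECT gp_segment_id, count(*)
    FROM fact_sales
    GROUP BY gp_segment_id
    ORDER BY gp_segment_id;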

 

Memory Management

Set vm.overcommit_memory to 2.

Do not configure the OS to use huge pages.

Use gp_vmem_protect_limit to set the maximum memory that the instance can allocate for all work being done in each segment database.

To calculate the value for gp_vmem_protect_limit, first calculate:

gp_vmem - the total memory available to Greenplum Database:

gp_vmem = ((SWAP + RAM) - (7.5GB + 0.05 * RAM)) / 1.7

where SWAP is the host's swap space in GB and RAM is the host's RAM in GB.

max_acting_primary_segments - the maximum number of primary segments that could be running on a host when mirror segments are activated due to a segment or host failure.

gp_vmem_protect_limit = gp_vmem / max_acting_primary_segments

Convert the result to MB to set the value of the configuration parameter.
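
As a worked sketch with hypothetical numbers, for a host with 256GB RAM, 64GB swap, and at most 8 acting primary segments:

    gp_vmem = ((64 + 256) - (7.5 + 0.05 * 256)) / 1.7 = 299.7 / 1.7, roughly 176 GB
    gp_vmem_protect_limit = 176 / 8 = 22 GB, so set the parameter to about 22528 MB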

In a scenario where a large number of workfiles is generated, account for the workfiles by calculating gp_vmem with this formula instead:

gp_vmem = ((SWAP + RAM) - (7.5GB + 0.05 * RAM - (300KB * total_#_workfiles))) / 1.7

Never set gp_vmem_protect_limit too high or larger than the physical RAM on the system.

Use the calculated gp_vmem value to compute the setting for the vm.overcommit_ratio operating system parameter:

vm.overcommit_ratio = (RAM - 0.026 * gp_vmem) / RAM

Use statement_mem to allocate memory used per segment database for a query.

Use resource queues to set both the number of active queries (ACTIVE_STATEMENTS) and the amount of memory (MEMORY_LIMIT) that queries in the queue can utilize.

Associate all users with a resource queue. Do not use the default queue.

Set PRIORITY to match the real needs of the queue for the workload and time of day.

Ensure that resource queue memory allocations do not exceed the setting for gp_vmem_protect_limit.

Dynamically update resource queue settings to match daily operations flow.

 

Partitioning

Partition large tables only. Do not partition small tables.

Use partitioning only if partition elimination (partition pruning) can be achieved based on the query criteria.

Choose range partitioning over list partitioning.

Partition the table based on the query predicate.

Never distribute and partition a table on the same column.

Do not use default partitions.

Do not use multi-level partitioning; create fewer partitions with more data in each partition.

Validate that queries selectively scan partitioned tables (partitions are being eliminated) by examining the query EXPLAIN plan.

Do not create too many partitions with column-oriented storage, because of the total number of physical files on every segment: physical files = number of segments x number of columns x number of partitions.
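
A minimal range-partitioning sketch (hypothetical table; the distribution column differs from the partition column), followed by an EXPLAIN to confirm that only the matching partitions are scanned:

    CREATE TABLE sales (sale_id bigint, sale_date date, amount numeric)
    WITH (appendonly=true)
    DISTRIBUTED BY (sale_id)
    PARTITION BY RANGE (sale_date)
    ( START (date '2023-01-01') INCLUSIVE
      END (date '2024-01-01') EXCLUSIVE
      EVERY (INTERVAL '1 month') );

    EXPLAIN SELECT sum(amount) FROM sales
    WHERE sale_date >= '2023-03-01' AND sale_date < '2023-04-01';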

 

Indexes

In general, indexes are not needed in Greenplum Database.

Create an index on a single column of a columnar table for drill-through purposes on high-cardinality tables that require queries with high selectivity.

Do not index columns that are frequently updated.

Always drop indexes before loading data into a table. After the load, re-create the indexes for the table.

Create selective B-tree indexes.

Do not create bitmap indexes on columns that are updated.

Do not use bitmap indexes for unique columns, or for very high or very low cardinality data.

Do not use bitmap indexes for transactional workloads.

In general, do not index partitioned tables. If indexes are needed, the index columns must be different from the partition columns.
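
A short sketch of the two index types on hypothetical tables; B-tree suits selective drill-through lookups, while bitmap suits read-mostly columns of moderate cardinality:

    CREATE INDEX cust_login_idx ON customer USING btree (login_id);
    CREATE INDEX sales_region_idx ON sales_hist USING bitmap (region_code);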

 

Resource Queues

Use resource queues to manage the workload on the cluster.

Associate all roles with a user-defined resource queue.

Use the ACTIVE_STATEMENTS parameter to limit the number of active queries that members of a particular queue can run concurrently.

Use the MEMORY_LIMIT parameter to control the total amount of memory that queries running through the queue can utilize.

Do not set all queues to MEDIUM priority, as that does nothing to actually manage the workload.

Alter resource queues dynamically to match the workload and time of day.
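
A minimal sketch (queue name, role name, and limits are illustrative, not recommendations):

    CREATE RESOURCE QUEUE adhoc_queue
        WITH (ACTIVE_STATEMENTS=10, MEMORY_LIMIT='4GB', PRIORITY=LOW);

    ALTER ROLE analyst RESOURCE QUEUE adhoc_queue;

    -- Adjust the queue dynamically, e.g. for an overnight batch window
    ALTER RESOURCE QUEUE adhoc_queue WITH (PRIORITY=HIGH);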

 

Monitoring and Maintenance

Implement the "Recommended Monitoring and Maintenance Tasks" in the Greenplum Database Administrator Guide.

Run gpcheckperf at install time and periodically thereafter, saving the output to compare system performance over time.

Use all the tools at your disposal to understand how your system performs under different loads.

Examine any unusual events to determine the cause.

Monitor query activity on the system by running explain plans periodically to ensure the queries are running optimally.

Review plans to determine whether indexes are being used and whether partition elimination is occurring as expected.

Know the location and content of the system log files and monitor them on a regular basis, not just when problems arise.

 

ANALYZE

Do not run ANALYZE on the entire database. Selectively run ANALYZE at the table level when needed.

Always run ANALYZE after loading.

Always run ANALYZE after INSERT, UPDATE, and DELETE operations that significantly change the underlying data.

Always run ANALYZE after CREATE INDEX operations.

If ANALYZE on a very large table takes too long, run ANALYZE only on the columns used in join conditions, WHERE clauses, SORT clauses, GROUP BY clauses, or HAVING clauses.
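
For example, on a hypothetical large table, statistics can be collected for the whole table or only for the columns used in joins and filters:

    ANALYZE sales;
    ANALYZE sales (customer_id, sale_date);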

 

VACUUM

Run VACUUM after large UPDATE and DELETE operations.

Do not run VACUUM FULL. Instead, run a CREATE TABLE ... AS operation, then rename and drop the original table, as sketched below.
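
A minimal sketch of that pattern on a hypothetical table (indexes, privileges, and other dependent objects must be re-created on the new table):

    CREATE TABLE sales_compact AS SELECT * FROM sales DISTRIBUTED BY (sale_id);
    ALTER TABLE sales RENAME TO sales_old;
    ALTER TABLE sales_compact RENAME TO sales;
    DROP TABLE sales_old;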

Frequently run VACUUM on the system catalogs to avoid catalog bloat and the need to run VACUUM FULL on catalog tables.

Never kill VACUUM on catalog tables.

Do not run VACUUM ANALYZE.

 

Loading

Use gpfdist to load or unload data in Greenplum Database.

Maximize parallelism as the number of segments increases.

Spread the data evenly across as many ETL nodes as possible.

Split very large data files into equal parts and spread the data across as many file systems as possible.

Run two gpfdist instances per file system.

Run gpfdist on as many network interfaces as possible.

Use gp_external_max_segs to control the number of segments that each gpfdist instance serves.

Always keep gp_external_max_segs and the number of gpfdist processes an even factor.

Always drop indexes before loading into an existing table and re-create the indexes after loading.

Always run ANALYZE on the table after loading it.

Disable automatic statistics collection during loading by setting gp_autostats_mode to NONE.

Run VACUUM after load errors to recover space.
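
A minimal load sketch through gpfdist external tables (hosts, ports, paths, and table names are hypothetical):

    CREATE EXTERNAL TABLE ext_sales (sale_id bigint, sale_date date, amount numeric)
    LOCATION ('gpfdist://etl1:8081/sales_*.dat', 'gpfdist://etl2:8081/sales_*.dat')
    FORMAT 'TEXT' (DELIMITER '|');

    SET gp_autostats_mode = 'NONE';       -- avoid automatic statistics collection during the load
    INSERT INTO sales SELECT * FROM ext_sales;
    ANALYZE sales;                        -- collect statistics once the load is complete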

 

gptransfer

For the fastest transfer rates, use gptransfer to transfer data to a destination cluster that is the same size as or larger than the source.

Avoid using the --full or --schema-only options. Instead, copy the schemas to the destination database using a different method, and then transfer the table data.

Drop indexes before transferring tables and re-create them when the transfer is complete.

Transfer smaller tables to the destination database using the SQL COPY command.

Transfer larger tables in batches using gptransfer.

Test running gptransfer before performing a production migration. Experiment with the --batch-size and --sub-batch-size options to obtain maximum parallelism. Determine the proper batching of tables for iterative gptransfer runs.

Use only fully qualified table names. Periods (.), whitespace, quotes ('), and double quotes (") in table names may cause problems.

If you use the --validation option to validate the data after transfer, be sure to also use the -x option to place an exclusive lock on the source table.

Be sure that any roles, functions, and resource queues are created in the destination database. These objects are not transferred when you use the gptransfer -t option.

Copy the postgresql.conf and pg_hba.conf configuration files from the source cluster to the destination cluster.

Install any needed extensions in the destination database with gppkg.

 

Security

Secure the gpadmin user ID and only allow access to it by essential system administrators.

Administrators should only log in to Greenplum as gpadmin when performing certain system maintenance tasks (such as upgrade or expansion).

Limit users who have the SUPERUSER role attribute. Roles that are superusers bypass all access privilege checks and resource queues in Greenplum Database. Only system administrators should be given superuser rights. See "Altering Role Attributes" in the Greenplum Database Administrator Guide.

Database users should never log on as gpadmin, and ETL or production workloads should never run as gpadmin.

Assign a distinct role to each user who logs in.

For applications or web services, consider creating a distinct role for each application or service.

Use groups to manage access privileges.
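
A small sketch of group-based privilege management (role, schema, and table names are hypothetical):

    CREATE ROLE reporting_ro NOLOGIN;             -- group role that holds the privileges
    GRANT SELECT ON analytics.sales TO reporting_ro;

    CREATE ROLE alice LOGIN;                      -- one distinct login role per user
    GRANT reporting_ro TO alice;                  -- membership conveys the group's privileges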

Protect the root password.

Enforce a strong password policy for operating system passwords.

Ensure that critical operating system files are protected.

 

Encryption

Encrypting and decrypting data has a performance cost; only encrypt data that requires encryption.

Do performance testing before implementing any encryption solution in a production system.

Server certificates in a production Greenplum Database system should be signed by a certificate authority (CA) so that clients can authenticate the server. The CA may be local if all clients are local to the organization.

Client connections to Greenplum Database should use SSL encryption whenever the connection goes over an insecure link.

A symmetric encryption scheme, where the same key is used both to encrypt and to decrypt, has better performance than an asymmetric scheme and should be used when the key can be shared safely.

Use functions from the pgcrypto package to encrypt data on disk. The data is encrypted and decrypted in the database process, so it is important to secure the client connection with SSL to avoid transmitting unencrypted data.
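
A sketch of symmetric column encryption with pgcrypto, assuming the package is installed and using hypothetical names (key management is left to the application):

    -- Encrypt on insert
    INSERT INTO customers_secure (cust_id, ssn_enc)
    VALUES (42, pgp_sym_encrypt('123-45-6789', 'a-passphrase'));

    -- Decrypt on read
    SELECT pgp_sym_decrypt(ssn_enc, 'a-passphrase') FROM customers_secure;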

Use the gpfdists protocol to secure ETL data as it is loaded into or unloaded from the database.

 

High Availability

Use a hardware RAID storage solution with 8 to 24 disks.

Use RAID 1, 5, or 6 so that the disk array can tolerate a failed disk.

Configure hot spares in the disk array so that a rebuild begins automatically when a disk failure is detected.

Protect against failure of the entire disk array, and against degradation during rebuilds, by mirroring the RAID volume.

Monitor disk utilization regularly and add additional space when needed.

Monitor segment skew to ensure that data is distributed evenly and storage is consumed evenly across all segments.

Set up a standby master to take over if the primary master fails.

Plan how to switch clients to the new master instance when a failure occurs, for example, by updating the master address in DNS.

Set up monitoring to send notifications in a system monitoring application, or by email, when the primary master fails.

Set up mirrors for all segments.

Locate primary segments and their mirrors on different hosts to protect against host failure.

Set up monitoring to send notifications in a system monitoring application, or by email, when a primary segment fails.

Recover failed segments promptly, using the gprecoverseg utility, to restore redundancy and return the system to optimal balance.

Configure Greenplum Database to send SNMP notifications to your network monitor.

Set up email notification in the $MASTER_DATA_DIRECTORY/postgresql.conf configuration file so that the Greenplum system can email administrators when it detects a critical issue.

Consider a dual-cluster configuration to provide an additional level of redundancy and additional query processing throughput.

Back up Greenplum databases regularly unless the data can easily be restored from sources.

Use incremental backups if heap tables are relatively small and only a few append-optimized or column-oriented partitions are modified between backups.

If backups are saved to local cluster storage, move the files to a safe location off the cluster when the backup is complete.

If backups are saved to NFS mount points, use a scale-out NFS solution such as Dell EMC Isilon to prevent IO bottlenecks.

Consider using Greenplum integration to stream backups to the Dell EMC Data Domain or Veritas NetBackup enterprise backup platforms.





