Greenplum data warehouse (2): Greenplum data warehouse and management statements

Basic Concepts of Greenplum

https://gp-docs-cn.github.io/docs/admin_guide/intro/arch_overview.html

Greenplum Database is a massively parallel processing (MPP) database whose architecture is specifically designed for managing large-scale analytical data warehouses and business intelligence workloads. MPP (also known as shared nothing architecture) refers to a system with multiple processors cooperating to perform an operation, and each processor has its own memory, operating system, and disk. Greenplum uses this high-performance system architecture to distribute the load of a multi-terabyte data warehouse and can use all the resources of the system to process a query in parallel.

Greenplum Database is based on PostgreSQL 8.3.23: it is essentially a set of disk-oriented PostgreSQL database instances acting together as one tightly integrated database management system (DBMS). Its SQL support, features, configuration options, and end-user functionality are in most cases very similar to PostgreSQL's, and database users who interact with Greenplum Database will feel that they are using a regular PostgreSQL DBMS.

Greenplum Database can use the append-optimized (AO) storage format for bulk loading and reading of data, which provides performance advantages over HEAP tables for those workloads. Append-optimized storage supports checksums for data protection, compression, and row or column orientation; both row-oriented and column-oriented append-optimized tables can be compressed.
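
For example, the storage format is chosen in the WITH clause when a table is created. The sketch below uses the standard appendonly/orientation/compresstype options; the table and column names themselves are made up for illustration:

-- Hypothetical row-oriented append-optimized table with zlib compression.
CREATE TABLE sales_ao (
    sale_id   integer,
    cust_id   integer,
    amount    numeric(10,2),
    sale_date date
)
WITH (appendonly=true, orientation=row, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (sale_id);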

The main differences between Greenplum Database and PostgreSQL are:

  • To support the parallel structure of Greenplum Database, the internals of PostgreSQL have been modified or supplemented. For example, the system catalog, optimizer, query executor, and transaction manager components have been modified or enhanced so that queries can execute simultaneously across all of the parallel PostgreSQL database instances. Greenplum's Interconnect (the network layer) enables communication between the different PostgreSQL instances and allows the system to behave as one logical database.
  • Greenplum Database can optionally use column-oriented storage: the data is still logically organized as a table, but the rows and columns are physically stored in a column-oriented format rather than as rows. Column-oriented storage can only be used with append-optimized tables and is compressible; it can give better performance when queries return only the columns of interest. All compression algorithms can be used with row- or column-oriented tables, but run-length encoding (RLE) compression can only be used with column-oriented tables. Greenplum Database provides compression on all append-optimized tables that use column-oriented storage (see the sketch just after this list).
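
As a sketch of column-oriented storage (the table and columns are again made up for illustration), a column-oriented append-optimized table with RLE compression might be declared like this:

-- Hypothetical column-oriented append-optimized table; RLE_TYPE compression
-- is only valid together with orientation=column.
CREATE TABLE sales_fact_col (
    cust_id integer,
    region  text,
    amount  numeric(10,2)
)
WITH (appendonly=true, orientation=column, compresstype=RLE_TYPE)
DISTRIBUTED BY (cust_id);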

Greenplum Database queries use a Volcano-style query engine model: the execution engine takes an execution plan, generates a tree of physical operators from it, processes the tables through those physical operators, and returns the results as the query response.

Greenplum stores and processes large amounts of data by distributing the data and the processing load across multiple servers or hosts. Greenplum Database is an array of databases based on PostgreSQL 8.3 that work together to present a single database image. The Master is the entry point to the Greenplum Database system: clients connect to this database instance and submit SQL statements. The Master coordinates the work of the other database instances in the system, called Segments, which store and process the data.
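
As a minimal sketch of how rows end up on different Segments (the table definitions here are hypothetical), a table's distribution policy is declared when the table is created, either with a hash distribution key or randomly:

-- Rows are hash-distributed across segments by cust_id.
CREATE TABLE customer (cust_id integer, customer text)
    DISTRIBUTED BY (cust_id);

-- Rows are distributed round-robin when no natural key exists.
CREATE TABLE web_clicks (click_time timestamp, url text)
    DISTRIBUTED RANDOMLY;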

 Greenplum Master

The Greenplum Database Master is the entry point to the entire Greenplum Database system. It accepts connections and SQL queries and distributes the work to the Segment instances. Users interacting with Greenplum Database (through the Master) will feel as if they are interacting with a typical PostgreSQL database. They can connect to the database with a client such as psql, or through application programming interfaces (APIs) such as JDBC, ODBC, or libpq (the PostgreSQL C API).

The Master is where the global system catalog resides. The global system catalog is the set of system tables that contain metadata about the Greenplum Database system itself. The Master holds no user data; data resides only on the Segments. The Master authenticates client connections, processes incoming SQL commands, distributes the workload among the Segments, coordinates the results returned by each Segment, and presents the final results to the client program.

Greenplum Database uses write-ahead logging (WAL) for master/standby mirroring. With WAL-based logging, all modifications are written to the log before being applied, ensuring data integrity for any operations in progress.

 Greenplum Segments

The Segment instances of Greenplum Database are independent PostgreSQL databases, each of which stores a portion of the data and performs the majority of query processing. When a user connects to the database through the Greenplum Master and issues a query, processes are created on each segment database to handle that query.

User-defined tables and their indexes are distributed across the segments, and each segment contains a different portion of the data. The database server processes that serve segment data run under the corresponding segment instances. Users interact with the segments of a Greenplum Database system through the Master. A segment host typically runs 2 to 8 Greenplum segments, depending on its CPU cores, RAM, storage, network interfaces, and workload. Segment hosts are expected to be configured identically. The key to obtaining the best performance from Greenplum Database is to distribute data and workload evenly across a large number of equally capable segments, so that all segments begin working on a task at the same time and finish at the same time.
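
One quick way to check how evenly a table's rows are spread (assuming a customer table distributed by cust_id, as sketched earlier) is to group by the gp_segment_id system column:

-- gp_segment_id reports which segment stores each row; roughly equal counts
-- per segment indicate a well-distributed table.
SELECT gp_segment_id, count(*) AS row_count
FROM customer
GROUP BY gp_segment_id
ORDER BY gp_segment_id;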

 Greenplum query distribution

The Master receives, parses, and optimizes the query. The resulting query plan is either parallel or targeted. The Master dispatches parallel query plans to all segments and targeted query plans to a single segment. Each segment is responsible for executing local database operations on its own set of data. Most database operations (such as table scans, joins, aggregations, and sorts) execute across all segments in parallel, and each operation performed on a segment database is independent of the data stored in the other segment databases.

Some queries access data on only a single segment, such as single-row INSERT, UPDATE, DELETE, or SELECT operations, or queries that filter on the table's distribution key columns. For such queries, the query plan is not dispatched to all segments; it is targeted at the segment that contains the affected or relevant rows.
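
For example, assuming the customer table is distributed by cust_id, the following statements filter on the distribution key and can therefore be dispatched to a single segment (the literal value is made up for illustration):

-- Both statements touch only the segment that holds cust_id = 12345.
SELECT * FROM customer WHERE cust_id = 12345;

UPDATE customer SET customer = 'ACME Corp' WHERE cust_id = 12345;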

In addition to the usual database operations (such as table scans and joins), Greenplum Database also has "motion" operations. A motion moves tuples between segments; not every query requires one (targeted query plans do not need to move data across the Interconnect). To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that a segment can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion. Consider the following simple query involving a join between two tables:

SELECT customer, amount
FROM sales JOIN customer USING (cust_id)
WHERE dateCol = '04-30-2016';

The redistribute motion in this example is necessary: because the customer table is distributed across the segments by cust_id while the sales table is distributed by sale_id, the sales tuples must be redistributed by cust_id before the join can be performed. The plan is sliced on either side of the redistribute motion, forming slice 1 and slice 2. This query plan also contains another motion operation, called a gather motion. The gather motion is where the segments send their results back to the Master, which presents the final results to the client. Because a plan is sliced wherever there is a motion, this plan also has an implicit slice at the very top (slice 3). Not all query plans involve a gather motion; for example, a CREATE TABLE x AS SELECT... statement has none, because the tuples are sent to the newly created table rather than to the Master.
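
The slices and motions of a plan can be inspected with EXPLAIN. The statement below simply prefixes the example query; the exact plan shape (which motion nodes appear and how many slices there are) depends on the table definitions, statistics, and the optimizer in use, so no particular output is shown here:

-- Look for Redistribute Motion and Gather Motion nodes and the slice labels
-- in the resulting plan.
EXPLAIN
SELECT customer, amount
FROM sales JOIN customer USING (cust_id)
WHERE dateCol = '04-30-2016';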

Greenplum creates a number of database processes to handle the work of a query. On the Master, the query worker process is called the query dispatcher (QD). The QD is responsible for creating and dispatching the query plan, and it also accumulates and presents the final results. On the segments, the query worker processes are called query executors (QE). A QE is responsible for completing its portion of the work and communicating its intermediate results to the other worker processes. At least one worker process is assigned to each slice of the query plan, and each worker process works independently on the portion of the plan assigned to it. During query execution, each segment therefore has several processes working on the query in parallel.

Related processes that work on the same slice of the query plan but on different segments are called gangs. As portions of the work are completed, tuples flow from one gang of processes to the next in the query plan. This inter-process communication between segments is handled by the Interconnect component of Greenplum Database.

Greenplum Database cluster management and status viewing

Greenplum Database provides standard command-line utilities for common monitoring and administration tasks. Greenplum's command-line utilities are located in the $GPHOME/bin directory and are executed on the Master host. Greenplum provides utilities for the following administration tasks:

  • Install Greenplum Database on an array
  • Initialize a Greenplum Database system
  • Start and stop Greenplum Database
  • Add or remove a host
  • Expand the array and redistribute the table on the new segment
  • Recover a failed segment instance
  • Manage failover and recovery of failed Master instances
  • Backup and restore a database (parallel)
  • Load data in parallel
  • Transfer data between Greenplum Database systems
  • Report system status

1. View the status of the segment node

select * from gp_segment_configuration;

pgdb=# select * from gp_segment_configuration;
 dbid | content | role | preferred_role | mode | status | port |  hostname   |   address   |           datadir           
------+---------+------+----------------+------+--------+------+-------------+-------------+-----------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | greenplum-1 | greenplum-1 | /data/gpdata/master/gpseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | greenplum-2 | greenplum-2 | /data/gpdata/pdata/gpseg0
    3 |       1 | p    | p              | n    | u      | 6000 | greenplum-3 | greenplum-3 | /data/gpdata/pdata/gpseg1
    4 |       0 | m    | m              | n    | d      | 7000 | greenplum-3 | greenplum-3 | /data/gpdata/mdata/gpseg0
    5 |       1 | m    | m              | n    | d      | 7000 | greenplum-2 | greenplum-2 | /data/gpdata/mdata/gpseg1
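
When checking cluster health, it is often enough to list only the segments marked down (status = 'd'). The query below is a small helper built on the same catalog table, using the columns visible in the output above:

-- List down segment instances; an empty result means all instances are up.
SELECT dbid, content, role, preferred_role, hostname, port, datadir
FROM gp_segment_configuration
WHERE status = 'd'
ORDER BY content;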

2. View historical information such as segment node failures

select * from gp_configuration_history order by 1 desc;

3. The gpstate utility shows status information for the database system and checks the synchronization status of segment instances

gpstate -m

This command outputs the synchronization status of each mirror segment instance. A status of "synchronizing" means the instance has not yet caught up and is still copying data; rerun the command every few minutes. If an instance stays in the synchronizing state for a long time, report it to the DBA for further monitoring.
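
A catalog-based alternative (assuming Greenplum 6 mode values, where 's' means synchronized) is to list the mirror segments that are down or not in sync:

-- Mirror segments that need attention; mode value semantics are an assumption
-- based on Greenplum 6 (s = synced, n = not in sync).
SELECT content, hostname, port, role, mode, status
FROM gp_segment_configuration
WHERE role = 'm' AND (status <> 'u' OR mode <> 's');
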
4. Check free disk space on the segment nodes

SELECT * FROM gp_toolkit.gp_disk_free;

5. Check the standby synchronization status

gpstate -f

This command outputs the synchronization status of the standby Master; a status of "synchronizing" means the standby has not yet caught up. To view more detailed information about the Greenplum Database array configuration, use gpstate with the -s option:

gpstate -s

6. Greenplum provides the gpssh command, which can run a command on the cluster's hosts in batch, for example:

gpssh -f ~/gpconfigs/hostfile_segonly -e "df -h |grep /data"
[gpadmin@greenplum-1 gpconfigs]$ gpssh -f ~/gpconfigs/hostfile_segonly -e "df -h |grep /data"
[greenplum-3] df -h |grep /data
[greenplum-3] /dev/vdb        197G  724M  187G   1% /data
[greenplum-2] df -h |grep /data
[greenplum-2] /dev/vdb        197G  724M  187G   1% /data

7. After completing all configuration, initialize the database: -c specifies the cluster initialization parameter file and -h specifies the segment host list

gpinitsystem -c /home/gpadmin/gpconfigs/gpinitsystem_config -h /home/gpadmin/gpconfigs/hostfile_segonly

8. Start the cluster

gpstart

9. Restart the cluster

gpstop -r

10. Check data disk usage

Full data disks on the Master or segment hosts will prevent normal database activity from continuing. If a disk grows too full, it can cause the database server to shut down. You can use the gp_disk_free external table in the gp_toolkit administrative schema to check the remaining free space (in kilobytes) on the segment host file systems.

SELECT * FROM gp_toolkit.gp_disk_free ORDER BY dfsegment;

11. View the storage size of each database

To see the total size of a database (in bytes), use the gp_size_of_database view in the gp_toolkit administrative schema. For example:

SELECT * FROM gp_toolkit.gp_size_of_database ORDER BY sodddatname;
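
A variant of the same query (assuming sodddatsize is the view's size-in-bytes column, as documented for gp_toolkit) formats the byte counts with pg_size_pretty and lists the largest databases first:

-- Human-readable database sizes, largest first.
SELECT sodddatname AS database_name,
       pg_size_pretty(sodddatsize::bigint) AS database_size
FROM gp_toolkit.gp_size_of_database
ORDER BY sodddatsize DESC;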

12. View the disk size occupied by the table

SELECT relname AS name, sotdsize AS size, sotdtoastsize AS toast,
       sotdadditionalsize AS other
FROM gp_toolkit.gp_size_of_table_disk AS sotd, pg_class
WHERE sotd.sotdoid = pg_class.oid
ORDER BY relname;
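
A variant of the query above adds pg_size_pretty for readable sizes and sorts so the largest tables come first; the view and column names are the same ones used above:

-- Human-readable table, TOAST, and auxiliary sizes, largest tables first.
SELECT relname AS name,
       pg_size_pretty(sotdsize::bigint) AS size,
       pg_size_pretty(sotdtoastsize::bigint) AS toast,
       pg_size_pretty(sotdadditionalsize::bigint) AS other
FROM gp_toolkit.gp_size_of_table_disk AS sotd
JOIN pg_class ON sotd.sotdoid = pg_class.oid
ORDER BY sotdsize DESC;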

Origin blog.csdn.net/yezonggang/article/details/107563691