Greenplum Database Architecture Analysis

Greenplum Database is an advanced open-source distributed database, mainly used for large-scale data analysis tasks such as data warehousing, business intelligence (OLAP), and data mining. Since it was officially open sourced in October 2015, it has received extensive attention from practitioners in China and abroad. This article introduces the Greenplum Database technical architecture that the community cares about.

1. Introduction to Greenplum Database

Big data is a buzzword, and every industry is talking about it. When big data comes up, many people immediately think of Hadoop. In fact, Hadoop is only one of several processing options for big data; SQL, NoSQL, NewSQL, Hadoop, and other systems each address parts of the big-data problem at different levels or for different applications. As a distributed, massively parallel processing database, Greenplum Database is in most cases well suited to serve as the storage, computation, and analysis engine for big data.

Greenplum Database is also referred to as GPDB. It has a rich set of features:

First, complete standards support: GPDB fully supports the ANSI SQL 2008 standard and the SQL OLAP 2003 extensions; for application programming interfaces, it supports ODBC and JDBC. Solid standards support makes system development, maintenance, and management very convenient. By contrast, current NoSQL, NewSQL, and Hadoop systems do not support SQL completely, so different systems must be developed and managed separately and portability suffers.

Second, it supports distributed transactions and ACID semantics, ensuring strong data consistency.

Third, as a distributed database, it scales out nearly linearly. In the production environments of users in China and abroad, there are many GPDB clusters with hundreds of physical nodes.

Fourth, GPDB is an enterprise-grade database product, with thousands of clusters running in the production environments of customers around the world. These clusters serve the critical businesses of many large financial, government, logistics, retail, and other organizations.

Fifth, GPDB is the result of more than a decade of R&D investment by Greenplum (now Pivotal). GPDB is based on PostgreSQL 8.2, which had about 800,000 lines of source code, and GPDB now has 1.3 million lines of source code. Compared to PostgreSQL 8.2, about 500,000 lines of source code have been added.

Sixth, Greenplum has many partners, and GPDB has a complete ecosystem. It can be integrated with many enterprise-grade products, such as SAS, Cognos, Informatica, and Tableau; it can also be integrated with many open-source tools, such as Pentaho and Talend.

2. Greenplum Architecture

2.1 Platform Architecture

Figure (1) is an overview of the Greenplum Database platform. The platform is divided into four levels, which we will look at from bottom to top.

MPP Core Architecture

GPDB is built on a large-scale shared-nothing parallel processing architecture, which is introduced in more detail later (Section 2.2).

1. An advanced parallel query optimizer is one of the keys to GPDB's outstanding performance. GPDB has two optimizers: one is based on the PostgreSQL planner; the other is the newly developed ORCA optimizer. ORCA is a brand-new project that Greenplum started five years ago; after several years of development and testing, it has recently become the default optimizer of the GPDB Enterprise Edition (an EXPLAIN sketch follows this list).

2. The GPDB storage engine supports polymorphic storage: the data of a single table can use different storage formats depending on how it is accessed. The storage format is transparent to the user; when running a query you do not need to know how the data being accessed is stored, and the optimizer automatically selects the best query plan (see the storage DDL sketch after this list).

3. In a distributed database, some operations (such as cross-node joins) require data to be exchanged between nodes. GPDB's parallel dataflow engine selects the most suitable data-movement operator based on characteristics such as the data's distribution and its volume. Currently GPDB supports two such operators: Redistribution and Broadcast. Redistribution rehashes rows and sends each row to the data node that owns its hash value, which suits large data volumes; Broadcast sends a full copy of the data to every data node, which suits small data volumes such as dimension tables (both appear as Motion nodes in the EXPLAIN sketch after this list).

4. The software switch (Greenplum's interconnect) is an important component of GPDB: it provides a reliable UDP-based data communication mechanism among the data nodes and between the data nodes and the master node, and it is the core of efficient data flow.

5. The Scatter/Gather streaming engine is designed specifically for parallel data loading and export. Scatter means that data is streamed in parallel to the data nodes through the parallel loading servers; Gather means that the data is then placed, in parallel and on demand, according to the table's distribution strategy inside GPDB (see the external-table loading sketch after this list).
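To make the polymorphic storage point (item 2) concrete, here is a minimal SQL sketch. The tables and columns are made up for illustration; the WITH options are the standard Greenplum append-optimized storage options, and the exact option names can vary slightly between GPDB versions.

```sql
-- A row-oriented heap table, suited to frequent updates and point lookups
-- (hypothetical table for illustration only).
CREATE TABLE orders_recent (
    order_id    bigint,
    customer_id int,
    amount      numeric(12,2),
    created_at  timestamp
) DISTRIBUTED BY (order_id);

-- A column-oriented, append-optimized, compressed table, suited to large
-- historical data that is scanned but not updated in place.
CREATE TABLE orders_history (
    order_id    bigint,
    customer_id int,
    amount      numeric(12,2),
    created_at  timestamp
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (order_id);

-- Queries are written the same way regardless of storage format;
-- the storage choice is transparent to the user.
SELECT customer_id, sum(amount) FROM orders_history GROUP BY customer_id;
```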
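Items 1 and 3 can be observed together in a query plan. The sketch below enables ORCA through the optimizer session parameter and asks for the plan of a join whose inputs are distributed on different keys, so data movement is required; depending on table sizes and statistics, the plan should contain a Broadcast Motion or a Redistribute Motion node. Table names are hypothetical.

```sql
-- Use the ORCA optimizer for this session (the PostgreSQL-planner-based
-- optimizer is used when optimizer = off).
SET optimizer = on;

-- Two hypothetical tables with different distribution keys, so the join
-- below requires data movement between segments.
CREATE TABLE fact_orders (
    order_id    bigint,
    customer_id int,
    amount      numeric(12,2)
) DISTRIBUTED BY (order_id);

CREATE TABLE dim_customers (
    customer_id int,
    region      text
) DISTRIBUTED BY (customer_id);

-- Depending on table sizes and statistics, the plan will contain either
--   "Broadcast Motion"    (the small input is copied to every segment), or
--   "Redistribute Motion" (rows are rehashed across segments on the join key).
EXPLAIN
SELECT d.region, sum(f.amount)
FROM   fact_orders   f
JOIN   dim_customers d ON f.customer_id = d.customer_id
GROUP BY d.region;
```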
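For item 5, parallel loading is typically driven through readable external tables served by gpfdist file servers, so that segments pull data in parallel instead of funnelling it through the master. A rough sketch, with placeholder hosts, port, and file paths:

```sql
-- Target table for the load (hypothetical).
CREATE TABLE orders_stage (
    order_id    bigint,
    customer_id int,
    amount      numeric(12,2),
    created_at  timestamp
) DISTRIBUTED BY (order_id);

-- Readable external table: every segment pulls rows from the gpfdist
-- servers in parallel (hosts, port, and file paths are placeholders;
-- gpfdist must already be running on the ETL hosts).
CREATE EXTERNAL TABLE ext_orders (
    order_id    bigint,
    customer_id int,
    amount      numeric(12,2),
    created_at  timestamp
)
LOCATION ('gpfdist://etl-host1:8081/orders*.csv',
          'gpfdist://etl-host2:8081/orders*.csv')
FORMAT 'CSV' (HEADER);

-- The load itself is a plain INSERT ... SELECT, executed in parallel.
INSERT INTO orders_stage SELECT * FROM ext_orders;
```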

Service Layer

GPDB supports multi-level fault tolerance and high availability:

1. High availability of the master node: to avoid a single point of failure at the master node, a replica of the master (called the Standby Master) can be set up; the two are kept in sync through streaming replication. When the master node fails, the standby becomes the master, processes user requests, and coordinates query execution. Failures are detected by heartbeats between the two nodes.

2. High availability of data nodes (segments): each data node can be paired with a mirror, and the two are kept in sync through replication at the file-operation level (the filerep technology). Using RAID5 disks on data nodes is recommended to further improve data availability. The fault-detection process (ftsprobe) periodically sends heartbeats to each data node; when a node fails, GPDB automatically fails over to its mirror.

3. Network high availability: to avoid a single point of failure in the network, each host is configured with multiple network interfaces and multiple switches are used, so that a single network failure does not make the whole service unavailable.

4. Online expansion: when the data volume grows and the existing cluster can no longer meet demand, the GPDB cluster can be expanded dynamically. The business keeps running during expansion, with no downtime.

5. Task management refers to the management of resources and of how they are used (a resource-queue sketch follows this list).
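As one concrete and simplified example of item 5, classic GPDB resource management uses resource queues; the queue name, limits, and role below are purely illustrative.

```sql
-- A resource queue that admits at most 3 concurrent statements and caps the
-- total planner cost of the active statements (values are illustrative).
CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=3, MAX_COST=100000.0);

-- Roles attached to the queue have their queries admitted or queued
-- according to the limits above.
CREATE ROLE adhoc_analyst LOGIN;
ALTER ROLE adhoc_analyst RESOURCE QUEUE adhoc_queue;
```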

Product Features

Data loading will be introduced later.

1. Data federation is quite interesting. The term "data lake" has become very popular recently. The idea of a data lake is that you no longer need to reshape data up front just to produce specific business reports; instead, you keep the raw data and run analysis directly on it. GPDB can implement a data lake (we call it data federation): it can access and process data across the data center, whether that data sits in Hadoop, on a file system, or in other databases, all through a single SQL interface with ACID guarantees (see the external-table sketch after this list).

2. GPDB supports both row-oriented and column-oriented storage, with special optimizations for data that is stored and processed without in-place updates.

3. Supports multiple compression methods, including QuickLZ, Zlib, RLE, etc.

4. Supports multi-level partitioned tables; partitions can be defined in several modes, including range and list (a partition DDL sketch follows this list).

5. Supports several index types, including B-tree, bitmap, and GiST (a bitmap-index example is included in the partition sketch after this list).

6. The GPDB authentication mechanism supports a variety of methods, including LDAP and Kerberos. Combined with access control lists (ACLs), flexible role-based security can be implemented (a short example follows this list).

7. Extended language support: GPDB allows user-defined functions (UDFs, similar to Oracle's stored procedures) to be written in a variety of popular languages, including Python, R, Java, Perl, and C/C++ (a PL/Python sketch follows this list).

8. Geographic information processing: By integrating PostGIS, GPDB supports the storage and analysis of geographic information.

9. Built-in data mining algorithm library: through the MADlib algorithm library (now an Apache incubator project), dozens of common data analysis and mining algorithms are available inside the GPDB database, including logistic regression, decision trees, and random forests. All algorithms are invoked through SQL without writing any algorithm code (a MADlib sketch follows this list).

10. Text retrieval: with the GPText extension, GPDB supports efficient, flexible, and rich full-text search. Combined with MADlib, parallel text analysis and mining is possible.
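A hedged sketch of data federation (item 1): an external table that points at files in HDFS. The gphdfs protocol shown is what older GPDB releases use (newer releases use PXF); the host, port, and path are placeholders.

```sql
-- Query data that lives in HDFS as if it were a local table
-- (protocol, host, port, and path depend on the GPDB release and cluster).
CREATE EXTERNAL TABLE hdfs_clicks (
    user_id bigint,
    url     text,
    ts      timestamp
)
LOCATION ('gphdfs://namenode:8020/data/clicks/part-*')
FORMAT 'TEXT' (DELIMITER E'\t');

-- The external data can be filtered, aggregated, or joined with ordinary
-- Greenplum tables in a single SQL statement.
SELECT date_trunc('day', ts) AS day, count(*) AS clicks
FROM   hdfs_clicks
GROUP BY 1
ORDER BY 1;
```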
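A sketch of a range-partitioned table (item 4) together with a bitmap index (item 5). Names and date ranges are illustrative; this shows single-level monthly range partitioning, and deeper levels can be nested with SUBPARTITION clauses.

```sql
-- A range-partitioned fact table: one partition per month of 2016
-- (table, columns, and dates are illustrative).
CREATE TABLE sales (
    sale_id   bigint,
    store_id  int,
    channel   text,
    sale_date date,
    amount    numeric(12,2)
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2016-01-01') INCLUSIVE
    END   (date '2017-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- A bitmap index suits low-cardinality columns such as the sales channel.
CREATE INDEX sales_channel_bmp ON sales USING bitmap (channel);
```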
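A minimal sketch of role-based access control (item 6); all table and role names are illustrative.

```sql
-- A hypothetical table to protect.
CREATE TABLE finance_report (
    report_id int,
    total     numeric(14,2)
) DISTRIBUTED BY (report_id);

-- A read-only group role, and a login user that inherits its privileges.
CREATE ROLE reporting NOLOGIN;
GRANT SELECT ON finance_report TO reporting;
CREATE ROLE alice LOGIN PASSWORD 'change_me' IN ROLE reporting;
```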
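A small PL/Python UDF sketch for item 7; the function itself is hypothetical, and how the plpythonu language is installed may differ between GPDB versions.

```sql
-- Register PL/Python if it is not already installed (requires appropriate
-- privileges; some releases use CREATE EXTENSION instead).
CREATE LANGUAGE plpythonu;

-- A trivial UDF written in Python: keep only the digits of a phone number.
CREATE FUNCTION normalize_phone(raw text) RETURNS text AS $$
    import re
    return re.sub(r'\D', '', raw or '')
$$ LANGUAGE plpythonu;

SELECT normalize_phone('+86 (10) 1234-5678');   -- returns '861012345678'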
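A hedged sketch of item 9, following the logistic-regression example from the MADlib documentation; it assumes MADlib is installed in the madlib schema and that training rows have been loaded into the (illustrative) patients table.

```sql
-- Training data; columns follow the MADlib documentation example.
CREATE TABLE patients (
    id            int,
    second_attack int,          -- dependent variable (0/1)
    treatment     int,
    trait_anxiety int
) DISTRIBUTED BY (id);

-- ... load training rows into patients ...

-- Train a logistic regression model; the model is written to patients_logregr.
SELECT madlib.logregr_train(
    'patients',                            -- source table
    'patients_logregr',                    -- output (model) table
    'second_attack',                       -- dependent variable
    'ARRAY[1, treatment, trait_anxiety]'   -- independent variables
);

-- Inspect the fitted coefficients.
SELECT coef FROM patients_logregr;
```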

Client Access and Tools

All functions of the GPDB database can be accessed through the psql command line tool, and application programming interfaces such as ODBC, JDBC, OLEDB, and libpq are also provided.

Management tools for a database or a data cluster are very important. GPDB provides a graphical management tool, GPCC (Greenplum Command Center), to help monitor cluster status and resource usage.

Greenplum Workload Manager is a newly released product for rule-based resource management. It supports custom rules that trigger certain actions when a running SQL statement satisfies the conditions described by a rule. For example, you can define a rule that automatically cancels queries consuming more than 50% of CPU resources.

2.2 Massively Parallel Processing (MPP) Shared-Nothing Architecture

MPP is the most prominent feature of Greenplum Database. The term MPP is used so often these days that it is worth looking at what it actually means. In figure (2) below, there are two master nodes: one is the active master node and the other is the standby master node. Through the interconnect, that is, the high-speed network, the master node is connected to the data nodes. Each data node has its own CPU, its own memory, and its own disks; the only thing they share is the network, which is why this is called a shared-nothing architecture. The advantage of this architecture is that the cluster is a distributed environment: data can be spread across many nodes and processed in parallel, and the system can be expanded linearly. The sketch below shows how the rows of a table are spread across the segments.
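A small sketch of what shared-nothing distribution looks like from SQL, assuming a hypothetical events table: rows are hashed on the distribution key across segments, and the gp_segment_id system column shows where each row physically lives.

```sql
-- A hypothetical table distributed by hash of event_id.
CREATE TABLE events (
    event_id bigint,
    payload  text
) DISTRIBUTED BY (event_id);

INSERT INTO events
SELECT g, 'payload ' || g
FROM   generate_series(1, 100000) AS g;

-- gp_segment_id reports which segment a row is physically stored on;
-- each segment should hold roughly the same share of the rows.
SELECT gp_segment_id, count(*)
FROM   events
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```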
