【Introduction to Greenplum】

Pivotal Greenplum is a commercial, fully featured data warehouse powered by the open source Greenplum Database. It provides powerful and rapid analytics on petabyte-scale data volumes. Uniquely geared toward big data analytics, Greenplum is powered by the world’s most advanced cost-based query optimizer, which delivers high analytical query performance on large data volumes.

Greenplum was originally a California-based company that provided solutions and consulting services for new Enterprise Data Warehouse (EDW), Enterprise Data Cloud (EDC) and Business Intelligence (BI) deployments to large enterprise users around the world. Greenplum is currently owned by Pivotal.

 

Greenplum DB claims to be the world's first open source massively parallel data warehouse. It was originally based on PostgreSQL and has since added numerous database innovations. Greenplum provides powerful and fast analytics on PB-scale data volumes, is geared especially toward big data analytics, and supports very high-performance analytical queries on large data sets.

 

 

Greenplum Database is also referred to as GPDB. It has rich features:

First, complete standards support: GPDB fully supports the ANSI SQL 2008 standard and the SQL OLAP 2003 extensions; for application programming interfaces, it supports ODBC and JDBC. Comprehensive standards support makes system development, maintenance and management very convenient. By contrast, current NoSQL, NewSQL and Hadoop systems support SQL only incompletely, so different systems have to be developed and managed separately, and portability is poor.

Second, it supports distributed transactions and ACID, which ensures strong data consistency.

Third, as a distributed database, it has good linear scalability. Among production deployments at home and abroad, there are many GPDB clusters with hundreds of physical nodes.

Fourth, GPDB is an enterprise-level database product, and there are thousands of clusters running in the production environments of different customers around the world. These clusters provide services for the key businesses of many large financial, government, logistics, retail and other companies around the world.

Fifth, GPDB is the result of more than a decade of R&D investment by Greenplum (now Pivotal). GPDB is based on PostgreSQL 8.2, which had about 800,000 lines of source code, and GPDB now has 1.3 million lines of source code. Compared to PostgreSQL 8.2, about 500,000 lines of source code have been added.

Sixth, Greenplum has many partners, and GPDB has a complete ecosystem. It can be integrated with many enterprise-level products, such as SAS, Cognos, Informatica, and Tableau, as well as with many open source tools, such as Pentaho and Talend.

 

Greenplum main features:

Massively Parallel Processing Architecture

High-performance loading, using MPP technology to provide petabyte-level data loading performance

Query optimization for big data workloads

Polymorphic data storage and execution

Advanced machine learning capabilities based on Apache MADlib

 

 

 

 

The database consists of Master Servers and Segment Servers connected through the Interconnect.

The master host is responsible for: establishing and managing connections with clients; parsing SQL and building an execution plan; distributing the execution plan to the segments and collecting their execution results. The master does not store business data; it stores only the data dictionary.

The segment host is responsible for: storing and accessing business data; executing users' SQL queries.
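The division of work between master and segments can be illustrated with a small Python sketch. This is a toy model, not Greenplum code: the `Master` and `Segment` classes, the modulo distribution rule, and the sample data are all assumptions made up for illustration. It shows only the idea that the master holds no business rows, hands the same work to every segment, and combines the partial results.

```python
class Segment:
    """Holds one shard of the business data and runs its part of the plan."""

    def __init__(self):
        self.rows = []  # list of (key, amount) tuples stored on this segment

    def execute(self, predicate):
        # Each segment scans only its own rows and returns a partial aggregate.
        matching = [amount for (_key, amount) in self.rows if predicate(amount)]
        return sum(matching), len(matching)


class Master:
    """Stores no business data; it routes rows and coordinates the segments."""

    def __init__(self, num_segments):
        self.segments = [Segment() for _ in range(num_segments)]

    def insert(self, key, amount):
        # Hypothetical distribution rule: integer key modulo segment count.
        self.segments[key % len(self.segments)].rows.append((key, amount))

    def query_sum_and_count(self, predicate):
        # Send the same "plan" to every segment, then merge the partial results.
        partials = [seg.execute(predicate) for seg in self.segments]
        total = sum(p[0] for p in partials)
        count = sum(p[1] for p in partials)
        return total, count


master = Master(num_segments=4)
for key in range(1, 11):
    master.insert(key, amount=key * 10)

# Roughly: SELECT sum(amount), count(*) FROM t WHERE amount > 50
result = master.query_sum_and_count(lambda amount: amount > 50)
print(result)  # (400, 5)
```

In a real cluster the segments run in parallel on separate hosts and the plan is far richer than a single predicate; the sequential loop here is only a stand-in for that dispatch step.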

Greenplum uses the MPP architecture.

 

Greenplum's architecture is MPP (massively parallel processing). In an MPP system, each node is itself an SMP machine that runs its own operating system, database instance, and so on. In other words, the CPUs within one node cannot access the memory of another node. Nodes exchange information over the node interconnect network, a process generally called data redistribution. This is clearly different from the traditional SMP architecture. An MPP system is usually less efficient than SMP because it has to transmit information between processing units, but this is not absolute: because the nodes of an MPP system share no resources, the system as a whole has more resources available than an SMP system. Once the workload to be processed reaches a certain scale, MPP becomes more efficient than SMP. The outcome is determined by the ratio of communication time to computing time. If communication takes a relatively large share of the time, the MPP system has no advantage; if it takes a relatively small share, the MPP system can make full use of its resources and achieve high efficiency.
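The trade-off between communication time and computing time can be made concrete with a toy cost model in Python. Everything here is an assumption chosen for illustration, not a measurement of Greenplum: work is counted in abstract units, each core processes one unit per second, and data redistribution is modeled as a fixed overhead in seconds.

```python
def smp_time(work, cores):
    """Time for a single shared-memory machine: no network communication."""
    return work / cores


def mpp_time(work, nodes, cores_per_node, comm_seconds):
    """Time for a shared-nothing cluster: parallel compute plus redistribution."""
    compute = work / (nodes * cores_per_node)
    return compute + comm_seconds


# Assumed hardware: one 8-core SMP box versus four 8-core MPP nodes,
# with 20 seconds of interconnect overhead per query.
for work in (100, 10_000):
    smp = smp_time(work, cores=8)
    mpp = mpp_time(work, nodes=4, cores_per_node=8, comm_seconds=20.0)
    winner = "MPP" if mpp < smp else "SMP"
    print(f"work={work}: SMP {smp:.1f}s, MPP {mpp:.1f}s, faster: {winner}")
```

With these assumed numbers the small job is dominated by communication and runs faster on the SMP machine, while the large job is dominated by computation and runs faster on the MPP cluster. In practice the communication cost also grows with the amount of data moved, so the break-even point depends on the query and the data distribution.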
