Basic principles of MPP

Foreword

I have been busy with work recently and have not updated my blog for a while, so a backlog of topics has built up that I need to sort out and write down. Rest assured that these are all scattered, general knowledge points and do not involve any work content.

1. Introduction to MPP

MPP (Massively Parallel Processing) distributes tasks across multiple servers and nodes so that they run in parallel. Each node has its own independent disk storage and memory. Business data is partitioned across the nodes; after each node finishes computing on its own portion, the partial results are aggregated into the final result (similar to Hadoop). The data nodes are connected to each other through a dedicated network or a general-purpose commercial network, coordinate their computation, and provide database services as a single whole.

1. MPP is essentially a database cluster architecture. Unlike a traditional single-node database, it supports multi-node distributed storage and computing. (It is worth mentioning that not all database cluster solutions are based on distributed storage.)
2. MPP is mostly used for data warehouses, supporting query and analysis scenarios (personal understanding).
3. Building a multi-node cluster for distributed storage and computing on top of a relational database (such as PostgreSQL) essentially gives you an MPP database system.
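
The compute-locally-then-aggregate flow described in the introduction can be illustrated with a small sketch. This is only a toy model, not how any particular MPP product is implemented: the `Node` class, the node names, and the sample values are all made up for illustration. Each node returns a partial `(sum, count)` for its own rows, and a coordinator merges the partials into a global average.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical MPP node that holds its own private rows."""
    name: str
    rows: list  # values stored locally on this node (illustrative data)

    def partial_avg(self):
        # Each node computes over its local data only and returns (sum, count).
        return sum(self.rows), len(self.rows)


def cluster_avg(nodes):
    # The coordinator merges the small partial results; it never reads raw rows.
    partials = [node.partial_avg() for node in nodes]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


nodes = [
    Node("node1", [10, 20, 30]),
    Node("node2", [40, 50]),
    Node("node3", [60]),
]
result = cluster_avg(nodes)
print(result)  # 35.0
```

Note that the nodes return `(sum, count)` rather than their local averages, because averaging the per-node averages would give the wrong answer when nodes hold different numbers of rows.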

2. Database non-shared cluster and database shared cluster

MPP is a database non-shared cluster, so here is a brief comparison of non-shared and shared database clusters to make it easier to understand.

1. Database non-shared cluster

In a database non-shared cluster, each node has an independent disk storage system and memory system, and business data is partitioned across the nodes according to the database model and application characteristics. The data nodes are connected to each other through a dedicated network or a general-purpose commercial network, compute collaboratively, and provide database services as a whole. Non-shared database clusters offer strong scalability, high availability, high performance, and good cost performance.

2. Database shared cluster

A Data Shared Cluster (DSC) is a parallel cluster in which DM instances on different servers access the same database at the same time. Nodes communicate over a private network. All control files, online logs, and data files are stored on shared devices that every node in the cluster can access simultaneously.
The main advantages of DSC are high availability and load balancing: the failure of one machine does not affect the application's access to the database.

A traditional single-node database is not a cluster, while dual-machine hot standby and Oracle RAC are both based on shared storage.
[Figure: Oracle RAC cluster architecture diagram]

3. MPP architecture

1. MPP Architecture Features

(1) Tasks are executed in parallel

Based on the MPP architecture, the database supports parallel execution of large-scale tasks.

(2) Private resources

Based on the MPP architecture, each node has an independent disk storage system and memory system.

(3) Data distributed storage (localization)

With the MPP architecture, data is distributed across the nodes, and each node has an independent disk storage system on which it stores its portion of the data locally.
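
To make local storage concrete, here is a minimal sketch of routing rows to node-local stores. The `StorageNode` and `Cluster` classes, the hash-by-key routing rule, and the sample rows are assumptions made for illustration and are not taken from any specific MPP database. A dictionary stands in for each node's private disk.

```python
import zlib


class StorageNode:
    """Hypothetical node with its own private, local row store."""

    def __init__(self, name):
        self.name = name
        self.local_rows = {}  # stands in for this node's local disk


class Cluster:
    def __init__(self, node_names):
        self.nodes = [StorageNode(name) for name in node_names]

    def owner(self, key):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(str(key).encode()) % len(self.nodes)
        return self.nodes[index]

    def insert(self, key, row):
        # The row is written only to the node that owns its key.
        self.owner(key).local_rows[key] = row

    def lookup(self, key):
        # A point lookup touches only the owning node.
        return self.owner(key).local_rows.get(key)


cluster = Cluster(["node1", "node2", "node3"])
for user_id in range(1, 11):
    cluster.insert(user_id, {"user_id": user_id, "name": f"user{user_id}"})

print(cluster.lookup(7))
print({node.name: len(node.local_rows) for node in cluster.nodes})
```

Every row lives on exactly one node, so the per-node counts printed at the end always add up to the number of rows inserted.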

(4) Distributed computing

With the MPP architecture, each node has an independent memory system and computes on its own locally stored data; once the computation on a node completes, the data is stored on that node's physical disk.
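
As a rough illustration of distributed computing, the sketch below runs a two-phase grouped sum: each node first aggregates its own rows, then a coordinator merges the partial results. The node names, the `(region, amount)` row shape, and the function names are hypothetical, and real MPP engines plan this kind of query in far more sophisticated ways.

```python
from collections import Counter

# Hypothetical per-node data: (region, amount) rows stored locally on each node.
node_data = {
    "node1": [("east", 10), ("west", 5), ("east", 7)],
    "node2": [("west", 3), ("north", 8)],
    "node3": [("east", 1), ("north", 2)],
}


def local_aggregate(rows):
    # Phase 1: runs on each node against its own rows only.
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def merge(partials):
    # Phase 2: the coordinator adds up the small partial results.
    final = Counter()
    for partial in partials:
        final.update(partial)  # Counter.update adds counts together
    return dict(final)


totals = merge(local_aggregate(rows) for rows in node_data.values())
print(totals)
```

Only the per-region partial sums travel between nodes, which is much less data than shipping every row to one place.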

(5) Horizontal expansion

Based on the MPP architecture, each node is independent, so it is easy to expand horizontally.
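
The effect of horizontal expansion on per-node load can be sketched as follows. This assumes rows are spread across nodes by a hash of their key, which is one common approach but not the only one; the function name and the key range are illustrative. It deliberately ignores the cost of moving existing data when nodes are added.

```python
import hashlib
from collections import Counter


def rows_per_node(keys, node_count):
    """Count how many keys land on each node under simple hash placement."""
    load = Counter()
    for key in keys:
        digest = hashlib.sha256(str(key).encode()).digest()
        load[int.from_bytes(digest[:4], "big") % node_count] += 1
    return [load[i] for i in range(node_count)]


keys = range(10_000)  # illustrative row keys
for node_count in (2, 4, 8):
    load = rows_per_node(keys, node_count)
    print(node_count, "nodes -> busiest node holds", max(load), "rows")
```

As the node count grows, each node is responsible for a smaller share of the rows, which is why adding nodes increases the total capacity of the cluster.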

(6) Shared Nothing architecture

Shared Nothing (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and the system has no single point of contention. More specifically, no nodes share memory or disk storage. SN is usually contrasted with systems that keep a large amount of centrally stored state, whether in a database, an application server, or any other similar single point of contention.
SN has significant advantages over a centrally controlled architecture: it avoids single points of failure, can recover on its own, and can be upgraded without disrupting the running system.

2. MPP deployment architecture

An MPP deployment consists of multiple SMP (Symmetric Multi-Processing) servers connected by a node interconnection network, working together on the same task. From the user's point of view, it is a single server system. Its basic feature is that each SMP server (called a node) accesses only its own local resources (memory, storage, and so on). This is a completely Shared Nothing structure, so it scales best of all, and in theory its expansion is unlimited.

4. MPP database

1. Introduction to MPPDB

MPPDB is a distributed, parallel, structured database cluster with a Shared Nothing architecture. It offers high performance, high availability, and high scalability, provides a cost-effective general-purpose computing platform for managing ultra-large-scale data, and is widely used to support data warehouse systems, BI systems, and decision support systems.

2. MPPDB structure

MPPDB adopts a fully parallel, flat, distributed MPP + Shared Nothing architecture. Every node in this architecture is independent and self-sufficient, all nodes are peers, and the system as a whole has no single-point bottleneck, which gives it very strong scalability.

3. MPPDB features

(1) Low hardware cost

It runs entirely on x86 PC servers and does not need expensive Unix servers or disk arrays.

(2) Cluster Architecture and Deployment

A fully parallel, distributed MPP + Shared Nothing architecture, deployed without a master node, in a flat structure where all nodes are peers.

(3) Massive data distributed compression storage

It can handle structured data above the PB level and stores data using hash distribution and random distribution strategies. It also uses advanced compression algorithms to reduce the space needed to store data, shrinking the space used by a factor of 1 to 20 and improving I/O performance accordingly.
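
The difference between hash distribution and random distribution can be shown with a short sketch. Everything here is an assumption for illustration: the four-node cluster, the `customer_id` key, the seeded random generator, and the function names are not taken from any product. Hash distribution places all rows with the same key on one node, while random distribution spreads rows without regard to their key.

```python
import hashlib
import random

NODE_COUNT = 4  # assumed cluster size for illustration


def hash_node(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NODE_COUNT


def distribute_by_hash(rows, key_field):
    nodes = [[] for _ in range(NODE_COUNT)]
    for row in rows:
        nodes[hash_node(row[key_field])].append(row)
    return nodes


def distribute_randomly(rows, seed=0):
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    nodes = [[] for _ in range(NODE_COUNT)]
    for row in rows:
        nodes[rng.randrange(NODE_COUNT)].append(row)
    return nodes


def nodes_holding(nodes, customer_id):
    # Which nodes store at least one row for this customer?
    return {i for i, part in enumerate(nodes)
            if any(r["customer_id"] == customer_id for r in part)}


rows = [{"customer_id": i % 5, "order_id": i} for i in range(20)]
by_hash = distribute_by_hash(rows, "customer_id")
by_random = distribute_randomly(rows)

print("hash:   customer 3 is on nodes", nodes_holding(by_hash, 3))
print("random: customer 3 is on nodes", nodes_holding(by_random, 3))
```

With hash distribution, a query that filters or joins on the distribution key can be answered by the node that owns that key. Random distribution gives up that locality but does not depend on the key values being evenly spread.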

(4) Data loading efficiency

With policy-based data loading, the overall loading speed of the cluster can reach 2 TB/h.
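
For a sense of scale, a back-of-the-envelope estimate of loading time is sketched below. It assumes the 2 TB/h figure is sustained for the whole load, which is optimistic since that figure is stated as a peak; the function name and the dataset sizes are illustrative.

```python
def load_hours(dataset_tb, cluster_tb_per_hour=2.0):
    """Estimate hours needed to load a dataset at a constant cluster-wide rate."""
    return dataset_tb / cluster_tb_per_hour


print(load_hours(10))    # 5.0 hours for 10 TB
print(load_hours(1024))  # 512.0 hours for 1 PB (1024 TB)
```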

(5) High scalability and high reliability

Supports expansion and reduction of cluster nodes, and supports full and incremental backup/restore.

(6) High availability and easy maintenance

Data is protected by redundant copies, faults are detected and handled automatically, and metadata and business data are synchronized automatically. Graphical tools are provided to simplify database administration.

(7) High concurrency

Reads and writes do not block each other, so data can be loaded and queried at the same time. A single node can support more than 300 concurrent users.

(8) Row-column hybrid storage

Provides a hybrid row-column storage solution, which improves query response time in the query scenarios where a purely column-oriented database performs poorly.
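
The two layouts that a hybrid store combines can be sketched in a few lines. This is a conceptual illustration only, with made-up sample data; it does not reflect the on-disk format of any real database. The row layout keeps each record together, while the column layout keeps all values of one column together.

```python
# Row layout: each record is stored as one unit.
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 5},
    {"id": 3, "region": "east", "amount": 7},
]


def to_columns(rows):
    # Pivot row-oriented records into one list per column.
    return {name: [row[name] for row in rows] for name in rows[0]}


# Column layout: each column is stored as one unit.
columns = to_columns(rows)

# Fetching one complete record is a single access in the row layout.
record = rows[1]

# Aggregating one column reads only that column in the column layout.
total_amount = sum(columns["amount"])

print(record)
print(total_amount)  # 22
```

Fetching a whole record favors the row layout, and scanning one column across many records favors the column layout, which is the trade-off a hybrid design tries to balance.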

(9) Standardization

Supports the SQL-92 standard, as well as interface specifications such as the C API, ODBC, JDBC, and ADO.NET.

4. Common MPPDB

Foreign MPPDB products:

Greenplum (EMC)

Greenplum is a relational database for data warehouse applications. It is built on the popular PostgreSQL and has a sound architecture, with clear advantages in data storage, high concurrency, high availability, linear scalability, response speed, ease of use, and cost performance. For big data, Greenplum performs very well at terabyte-scale data volumes, and its single-machine performance is several times faster than Hadoop's. In terms of functionality and syntax, it is much easier to use than Hive, the SQL engine on Hadoop, and easier for ordinary users to pick up. Greenplum comes with a complete set of tools and the overall system is mature, so it does not require as much time and effort to adapt as Hive does, which makes it well suited as a solution for large data warehouses. Greenplum also integrates easily with Hadoop: data can be offloaded directly to Hadoop, MapReduce tasks can be written directly on the database, and configuration is simple.

Aster Data (Teradata), Netezza (IBM), Vertica (HP)

Domestic MPPDB products:

DM (Dameng), TiDB (PingCAP), openGauss & GaussDB (Huawei), SequoiaDB (SequoiaDB), OceanBase & PolarDB (Alibaba), TDSQL (Tencent), GBase 8a MPP Cluster (GBASE)

Origin blog.csdn.net/qq_37432174/article/details/131736274