What are MPP architecture and distributed architecture? This 2,000-word article explains both clearly!

In the previous article, we explained in detail why the data warehouse came into being, its basic characteristics, the difference between a data warehouse and a database, and how a data warehouse is built. Now let's look at the MPP architecture and the distributed architecture.

1. MPP architecture
MPP (Massively Parallel Processing) architecture is a distributed data processing technology that can improve data processing performance by distributing workloads to multiple nodes.

Unlike the traditional shared architecture, MPP adopts a shared-nothing architecture (Share Nothing). It forms a cluster out of stand-alone database nodes, each with its own independent disk and memory system. The nodes are connected through a dedicated network or a general-purpose commercial network and compute collaboratively, so that the cluster as a whole provides the data processing service.
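
To make the shared-nothing idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the 4-node cluster size, the `node_for` hash rule, and the sample sales rows are all invented for the example, and threads stand in for real database nodes. Each row lives on exactly one node, every node sums only its own rows, and a coordinator merges the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str) -> int:
    """Pick the node that owns a row by hashing its distribution key."""
    return crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Place each (customer, amount) row on exactly one node's local storage."""
    storage = defaultdict(list)
    for customer, amount in rows:
        storage[node_for(customer)].append((customer, amount))
    return storage


def local_sum(node_rows):
    """Work done by one node, using only the rows it stores itself."""
    totals = defaultdict(int)
    for customer, amount in node_rows:
        totals[customer] += amount
    return dict(totals)


def parallel_sum(storage):
    """Coordinator: run the local step on every node, then merge the results."""
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partials = list(pool.map(local_sum, storage.values()))
    merged = {}
    for partial in partials:
        # Keys never overlap: all rows of one customer hash to the same node.
        merged.update(partial)
    return merged


rows = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]
result = parallel_sum(distribute(rows))
print(result)
```

Because the rows are distributed by the same key that the query groups on, no node needs data from another node, and the merge step is a plain union of the partial results.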

In terms of design (in CAP terms), the MPP architecture gives priority to consistency, followed by availability, and makes a best effort at partition tolerance.

The MPP architecture is often used in scenarios such as data warehouses, data marts, and big data analysis. Its distributed design can effectively cope with the continuous growth of data scale and increase in complexity, but it also faces some challenges.
Advantages of the MPP architecture
The advantages of the MPP architecture are mainly reflected in the following aspects:

  1. High performance: The MPP architecture spreads data across multiple nodes. Each node has independent processing capability and can handle multiple tasks at the same time, which greatly improves data processing performance. Its job scheduling and data balancing mechanisms also make full use of computing resources and optimize data distribution and transmission, which reduces system latency and increases throughput.

  2. Horizontal expansion: As data size and complexity keep growing, traditional stand-alone databases gradually fail to meet business needs. The MPP architecture scales horizontally by adding hardware resources such as computing nodes, CPUs, and storage, and gives enterprise applications a better experience.

  3. Fine management: The MPP architecture has to manage data scattered across multiple nodes, so it needs fine-grained management and scheduling mechanisms to control data flow and task execution, and to optimize data backup, recovery, compression, and cleaning.

  4. High availability: The MPP architecture adopts a distributed design and has high availability. When a node fails, the system can automatically switch to other nodes to ensure service continuity.
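
The failover described in point 4 can be sketched as follows. This is a toy model, not how any particular MPP product does it: it assumes each partition keeps a mirror copy on the next node, which the article does not spell out, and `placement` and `node_to_read` are hypothetical helper names.

```python
NUM_NODES = 3  # hypothetical cluster size


def placement(partition: int):
    """Return (primary, mirror) node ids; the mirror is always a different node."""
    primary = partition % NUM_NODES
    mirror = (primary + 1) % NUM_NODES
    return primary, mirror


def node_to_read(partition: int, failed: set) -> int:
    """Choose a live node that holds the partition, preferring the primary."""
    for node in placement(partition):
        if node not in failed:
            return node
    raise RuntimeError(f"partition {partition} has no live copy")


healthy = [node_to_read(p, failed=set()) for p in range(6)]
degraded = [node_to_read(p, failed={1}) for p in range(6)]
print(healthy)   # [0, 1, 2, 0, 1, 2]
print(degraded)  # [0, 2, 2, 0, 2, 2]
```

The service keeps running after node 1 fails, but node 2 now serves four of the six partitions, so availability is preserved at the cost of an uneven load.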

Disadvantages of the MPP architecture
The MPP architecture is suitable for medium-sized enterprise data processing scenarios and has become one of the important architectures for enterprise data processing. But the MPP architecture also has some disadvantages and problems:

  1. Storage opacity: The MPP architecture requires data sharding to be modeled and designed in advance, and the data is split across nodes by an algorithm according to fixed rules, so the storage location is opaque to users. When a query runs, the query task has to be executed on all data nodes, which increases query latency. Failure handling and data recovery for the queried data nodes also have to be considered.

  2. Single-node bottleneck: In parallel computing, tasks are distributed to all nodes, so the slowest node becomes the weak point of the entire system. Fault tolerance is poor, and one slow node can slow down the response of the whole system. In addition, an MPP cluster has many nodes and a large amount of data, so the cost of a node failure is high.

  3. Distributed transactions: Because data is stored across different nodes in the MPP architecture, remote calls add latency during transaction processing, and some transactions have to be processed across multiple nodes. Transaction processing then becomes very complicated and limits system scalability.
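
The single-node bottleneck in point 2 comes down to simple arithmetic: a parallel step ends only when its slowest node ends. The timings below are made-up numbers used purely for illustration.

```python
def query_time(node_seconds):
    """A parallel step finishes only when its slowest node finishes."""
    return max(node_seconds)


balanced = [2.0, 2.0, 2.0, 2.0]
one_slow = [2.0, 2.0, 2.0, 9.0]  # same cluster, one degraded node

print(query_time(balanced))           # 2.0
print(query_time(one_slow))           # 9.0
print(sum(one_slow) / len(one_slow))  # 3.75
```

The average node time rises only from 2.0 to 3.75 seconds, yet the query itself takes 9.0 seconds, because the other three nodes sit idle waiting for the slow one.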

In short, although the MPP architecture offers high performance and horizontal expansion, it also suffers from storage opacity, single-node bottlenecks, and complex distributed transactions. It should be selected and designed according to specific business needs.
2. Distributed architecture
Distributed architecture is a computing architecture that distributes computing tasks to multiple computing nodes concurrently, and is mainly used to deal with large-scale data and complex computing problems. This architecture is also often referred to as big data architecture or distributed batch processing architecture, and includes multiple specific implementations, such as Hadoop, Spark, etc.

Specifically, in a distributed architecture, each node has its own computing power and storage resources, enabling site autonomy (running local applications independently). Data is shared globally and transparently in the cluster, and all nodes are connected through LAN or WAN, but the communication overhead between nodes is relatively high, so it is necessary to minimize data movement during operation.
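
One common way to keep data movement low is to aggregate locally on each node before anything is sent over the network, in the style of a MapReduce combiner. The sketch below is a plain-Python illustration of that idea, not Hadoop or Spark code; `node_data` and the function names are invented for the example.

```python
from collections import Counter

# Hypothetical input: each inner list is the data already stored on one node.
node_data = [
    ["a", "b", "a", "a"],
    ["b", "b", "c"],
    ["a", "c", "c", "c"],
]


def shuffle_without_combiner(data):
    """Every record crosses the network as a (word, 1) pair."""
    return [(word, 1) for node in data for word in node]


def shuffle_with_combiner(data):
    """Each node counts locally first and sends one pair per distinct word."""
    return [pair for node in data for pair in Counter(node).items()]


def reduce_counts(pairs):
    """Final reduce step: add up the counts received for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


raw = shuffle_without_combiner(node_data)
combined = shuffle_with_combiner(node_data)
print(len(raw), len(combined))  # 11 6
print(reduce_counts(raw) == reduce_counts(combined))  # True
```

Both paths give the same word counts, but the combined path ships 6 pairs instead of 11. On real data with many repeated keys the saving is far larger.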

In terms of design (in CAP terms), distributed systems usually give priority to partition tolerance, followed by availability, and make a best effort at consistency.

In short, the distributed architecture is suitable for large-scale data processing and complex computing scenarios. It offers high scalability and fault tolerance, but it also has to solve the problems of distributed data consistency, task scheduling, and communication overhead. It should be selected and designed according to business needs.
Advantages of distributed architecture
Compared with the MPP architecture, the distributed architecture has the following advantages:

  1. High throughput: The distributed architecture can distribute computing tasks to multiple nodes for parallel execution, so it has high processing speed and throughput. As the cluster size increases, the processing speed will be accelerated, and it can be expanded to exabytes of data volume.

  2. Public storage: Distributed architectures usually use public storage systems such as HDFS (Hadoop Distributed File System) to manage data. This storage method is suitable for storing large-scale heterogeneous data, and has good scalability and fault tolerance.

  3. Flexibility: The distributed architecture has good flexibility, and can easily increase or decrease cluster nodes to adapt to different business needs. It also supports multiple programming languages and open source frameworks, such as Hadoop, Spark, Flink, etc.
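
The public storage idea in point 2 can be pictured with a toy model: a file is cut into blocks, and each block is copied to more than one node. This is a rough sketch only. The tiny block size, the replication factor of 2, the node names, and the round-robin placement are all assumptions for illustration and do not reproduce the real placement policy of HDFS.

```python
BLOCK_SIZE = 4   # bytes per block here; real systems use far larger blocks
REPLICATION = 2  # hypothetical number of copies per block
NODES = ["node-a", "node-b", "node-c"]


def split_blocks(data: bytes):
    """Cut the file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def place_blocks(blocks):
    """Map block index -> nodes holding a copy (simple round-robin)."""
    return {
        i: [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
        for i in range(len(blocks))
    }


def read_file(blocks, block_map, failed):
    """Reassemble the file, reading each block from any surviving copy."""
    out = b""
    for i, block in enumerate(blocks):
        if all(node in failed for node in block_map[i]):
            raise IOError(f"block {i} lost")
        out += block
    return out


data = b"hello distributed"
blocks = split_blocks(data)
block_map = place_blocks(blocks)
print(len(blocks))  # 5
print(read_file(blocks, block_map, failed={"node-b"}) == data)  # True
```

With two copies of every block on different nodes, the file stays readable after any single node fails. Adding nodes to `NODES` spreads the blocks more widely without changing the reading logic.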

3. MPP architecture + distributed architecture
The MPP architecture and the distributed architecture each have their own advantages and applicable scenarios, and in some scenarios they can be used together. For example, the MPP architecture provides high-performance parallel computing, while the distributed architecture provides high scalability and fault tolerance. Combining the two can form a more complete big data processing architecture.

Specifically, a public storage system (such as HDFS) in a distributed architecture can be used to manage data, and the data can be distributed and stored on multiple nodes to achieve data partition fault tolerance and scalability. Then the MPP architecture is adopted in the upper-layer architecture, and the parallel computing capability of MPP is used to optimize tasks such as data query and calculation, thereby reducing operation delay and improving processing efficiency.
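
The layering described above might look roughly like this in miniature. It is a sketch under assumptions: `shared_storage` is an in-memory stand-in for a shared file system such as HDFS, threads stand in for MPP query workers, and the region sales rows are invented sample data.

```python
from concurrent.futures import ThreadPoolExecutor

# Storage layer (stand-in for a shared file system): block id -> rows.
shared_storage = {
    0: [("north", 120), ("south", 80)],
    1: [("north", 40), ("east", 60)],
    2: [("south", 30), ("east", 10)],
    3: [("north", 5)],
}

NUM_WORKERS = 2  # hypothetical number of query workers


def assign_blocks(block_ids, workers):
    """Spread block ids across workers round-robin."""
    plan = [[] for _ in range(workers)]
    for i, block_id in enumerate(sorted(block_ids)):
        plan[i % workers].append(block_id)
    return plan


def scan(block_ids):
    """One worker: read its blocks from shared storage and pre-aggregate."""
    totals = {}
    for block_id in block_ids:
        for region, sales in shared_storage[block_id]:
            totals[region] = totals.get(region, 0) + sales
    return totals


def run_query():
    """Coordinator: fan out the scan, then add up partial sums per region."""
    plan = assign_blocks(shared_storage.keys(), NUM_WORKERS)
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        partials = list(pool.map(scan, plan))
    merged = {}
    for partial in partials:
        for region, sales in partial.items():
            merged[region] = merged.get(region, 0) + sales
    return merged


totals = run_query()
print(totals)
```

Unlike a pure shared-nothing layout, the same region can appear in blocks scanned by different workers here, so the coordinator has to add the partial sums together rather than simply union them. How blocks are assigned to workers is exactly the kind of segmentation and scheduling choice that affects parallelism.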

Of course, issues such as data segmentation and scheduling strategies also need to be considered in the actual implementation, such as how to segment data to improve the parallelism of MPP calculations, and how to improve the efficiency of data query.


Origin blog.csdn.net/yuanziok/article/details/132401445