Greenplum Disadvantages

My first contact with Greenplum was when Sun launched their data warehouse appliance, the DWA. I didn't know much about this database product built on top of PgSQL; at the time, the focus was still on the DWA's hardware itself. Undeniably, though, the DWA did have its strong points.

Later we found that an ordinary PC with SAS disks delivers very good throughput, in no way inferior to some expensive storage devices. So we tried building an environment out of PCs plus Greenplum, and the results completely exceeded our expectations: the throughput far outstripped our high-end storage. Since then we have stopped being superstitious about expensive hosts and storage arrays and started trying new things, such as stacking cheap storage with PC+SAS/SATA, building a data warehouse computing environment with Greenplum, running a Hadoop cluster for search, building OLTP databases on PC+SSD, and replacing minicomputers with Intel Nehalem servers.

Yesterday I attended a Greenplum technical sharing session held by the data warehouse team, which presented a large number of performance comparisons, especially against our current Oracle RAC setup. The results speak for themselves: for data warehouse workloads, especially large-volume data processing, the performance gap is huge. And that is exactly where the problem arises. Many people came away feeling this product is so amazing that it can solve every data warehouse problem, as if it were a gift from God. In the end, many were asking: Oracle is so bad that even with such good hardware its performance is still poor, so why do we use it at all? Alas, Greenplum is good, but it is not "magic"; let's not be blinded by these "magical" numbers.

As for Greenplum itself, my understanding is honestly only partial, so explaining its internals in depth would be beyond me. Here I will simply analyze why Greenplum is fast. What "magical" technique does it use?

Two main factors determine a data warehouse's processing capacity: first, throughput, i.e. I/O capacity; second, parallel computing capacity.

We all know that Oracle RAC is a shared-everything architecture, while Greenplum is shared-nothing. A Greenplum cluster consists of many segment hosts (data nodes) plus master hosts (control nodes), and each segment host runs several PgSQL databases (segments).

<iframe id="iframe_0.18789027263233343" style="border: medium; border-image: none; width: 578px; height: 408px;" src="data:text/html;charset=utf8,%3Cimg%20id=%22img%22%20src=%22http://www.hellodb.net/wp-content/uploads/2009/07/greenplum.jpg?_=2644290%22%20style=%22border:none;max-width:578px%22%3E%3Cscript%3Ewindow.onload%20=%20function%20()%20%7Bvar%20img%20=%20document.getElementById('img');%20window.parent.postMessage(%7BiframeId:'iframe_0.18789027263233343',width:img.width,height:img.height%7D,%20'http://www.cnblogs.com');%7D%3C/script%3E" frameborder="0" scrolling="no"></iframe>

When data enters the database, the first step is data distribution, i.e. spreading each table's rows as evenly as possible across the segments. We specify a distribution column (DT column) for each table, and rows are distributed by hashing that column. The purpose is to make full use of the I/O capacity of every node, and the I/O capacity of today's PCs is considerable. A purpose-built data node like the DWA's Sun Fire X4500 Server packs 48 SATA disks into a single box and claims to "scan 1 terabyte of data in 60 seconds". In fact, there is no need to buy a DWA: domestic manufacturers all offer that kind of disk-dense PC, which is cheap and good enough; we have been using them all along.

<iframe id="iframe_0.840501765512368" style="border: medium; border-image: none; width: 578px; height: 385px;" src="data:text/html;charset=utf8,%3Cimg%20id=%22img%22%20src=%22http://www.hellodb.net/wp-content/uploads/2009/07/greenplum2.jpg?_=2644290%22%20style=%22border:none;max-width:578px%22%3E%3Cscript%3Ewindow.onload%20=%20function%20()%20%7Bvar%20img%20=%20document.getElementById('img');%20window.parent.postMessage(%7BiframeId:'iframe_0.840501765512368',width:img.width,height:img.height%7D,%20'http://www.cnblogs.com');%7D%3C/script%3E" frameborder="0" scrolling="no"></iframe>

When people first see the Greenplum architecture, their first question is usually: what does the master actually do, and will it become the system's bottleneck? This is an important characteristic of Greenplum: the master handles only a very small amount of control work and the interaction with clients, and performs no computation at all. If all data had to pass through a central node, the system could never scale linearly, because that node would inevitably become the bottleneck. Greenplum does not have this problem: data exchange between nodes does not go through the master but happens directly between the nodes themselves.

Now, if we want to query a table, we just hand the work to every node at once, and I/O is no longer a problem. Next we need to solve parallel computation, and its core problem is joining multiple tables. Because tables are distributed by their DT column, every node can tell from the DT column which node holds a given row. If two tables are joined on their DT column, rows with the same key live on the same node, so each node only needs to compute its own share and the results are then combined. If the join uses a non-DT column, the nodes do not know how that data is distributed, so a data redistribution (redistribute) takes place first.

Let's look at the following example, in which all three tables use the id column as the DT column. The first join is on id; but because a non-DT column is also involved in the join, Greenplum performs a redistribute, re-spreading the data by hash so that each node knows which node holds the rows it needs and can complete the join. We can see that the group by triggers a redistribute later as well, because the group-by key is also a non-DT column, and the hash aggregate likewise requires the nodes to exchange data and know its distribution. Does the redistribute cost much efficiency? Since it only moves the rows actually needed, and is done entirely in the nodes' memory, it is certainly slower than a DT-column join, but still very efficient.

<iframe id="iframe_0.17964831515706808" style="border: medium; border-image: none; width: 578px; height: 344px;" src="data:text/html;charset=utf8,%3Cimg%20id=%22img%22%20src=%22http://www.hellodb.net/wp-content/uploads/2009/07/greenplum3.jpg?_=2644290%22%20style=%22border:none;max-width:578px%22%3E%3Cscript%3Ewindow.onload%20=%20function%20()%20%7Bvar%20img%20=%20document.getElementById('img');%20window.parent.postMessage(%7BiframeId:'iframe_0.17964831515706808',width:img.width,height:img.height%7D,%20'http://www.cnblogs.com');%7D%3C/script%3E" frameborder="0" scrolling="no"></iframe>

Greenplum truly squeezes parallelism out of everything, even starting multiple PgSQL databases on one host at the same time so that the hardware's multi-core CPUs are fully exploited. Someone asked me: can Greenplum then also process multiple tasks in parallel? The answer is no. Precisely because Greenplum already uses up the machine's I/O and processing power for a single task, there is nothing left over to run many tasks at the same time.

Another interesting feature of Greenplum is data loading. There is no central data-distribution node of the kind one might imagine; instead, all nodes read the input data simultaneously, each keeps the rows that hash to itself, and sends the other nodes' rows directly to them over the network, so loading is very fast.
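
A minimal sketch of this parallel loading path, using Greenplum's gpfdist file server and a readable external table (the host, port, and file names are hypothetical):

```sql
-- First, start the parallel file server on the ETL host (shell command):
--   gpfdist -d /data/staging -p 8081 &

-- All segments then pull from gpfdist in parallel; each keeps the rows
-- that hash to itself and forwards the rest directly to the owning segment.
CREATE EXTERNAL TABLE sales_ext (LIKE sales)
LOCATION ('gpfdist://etl-host:8081/sales.dat')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO sales SELECT * FROM sales_ext;
```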

Greenplum HA Architecture

<iframe id="iframe_0.6840042604336357" style="border: medium; border-image: none; width: 578px; height: 282px;" src="data:text/html;charset=utf8,%3Cimg%20id=%22img%22%20src=%22http://www.hellodb.net/wp-content/uploads/2009/07/greenplum4.jpg?_=2644290%22%20style=%22border:none;max-width:578px%22%3E%3Cscript%3Ewindow.onload%20=%20function%20()%20%7Bvar%20img%20=%20document.getElementById('img');%20window.parent.postMessage(%7BiframeId:'iframe_0.6840042604336357',width:img.width,height:img.height%7D,%20'http://www.cnblogs.com');%7D%3C/script%3E" frameborder="0" scrolling="no"></iframe>

Looked at this way, Greenplum is not so magical after all. In fact, Oracle RAC is also a very good solution for data warehouses, and Oracle has counterparts for all of these techniques. We can make the following thought experiment: for a fixed SQL statement, I could use Oracle RAC to do what Greenplum does. Based on the SQL, we could hash+range partition the table (Greenplum is in fact also hash+range partitioning: hash to distribute the data across the databases, then range to partition the tables within each database) and then use RAC's parallel processing capability. Oracle even has a similar feature, the partition-wise join, only without a data redistribute operation. Oracle's biggest problem is its shared-everything architecture, which limits I/O capacity (the throughput of our high-end storage is only 1.4 GB/s) and scalability alike. The Oracle database machine introduced earlier is the solution Oracle provides specifically for data warehouses.
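
A hedged sketch of what such an Oracle layout could look like (hypothetical table; classic range-hash composite partitioning puts range first rather than hash, but it is the hash subpartitions on the join key that enable a full partition-wise join between two tables equipartitioned on id):

```sql
-- Oracle: range partitions by date, hash subpartitions on the join key.
-- Two tables hash-subpartitioned identically on id can be joined
-- partition-wise, pairing up subpartitions and joining them in parallel.
CREATE TABLE sales (
    id        NUMBER,
    sale_date DATE,
    amount    NUMBER(12,2)
)
PARTITION BY RANGE (sale_date)
SUBPARTITION BY HASH (id) SUBPARTITIONS 16
(
    PARTITION p2009h1 VALUES LESS THAN (DATE '2009-07-01'),
    PARTITION p2009h2 VALUES LESS THAN (DATE '2010-01-01')
);
```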

In the end, there is no magical technology. Greenplum only looks magical because our scenario happens to play to its strengths; we could just as well design a scenario that makes Greenplum look bad. So don't blindly believe vendors' numbers, and don't believe in a technology that solves all problems; such a thing simply does not exist.

"Don't be obsessed with brother, brother is just a legend."
