Is MPP DB the answer for big data real-time analysis systems?

In the big data field, real-time analysis (online query) is the most common scenario. I previously wrote "Real-time Analysis System (HIVE/HBASE/IMPALA) Analysis" to discuss the solutions commonly used in the industry. Internet companies mostly use HIVE or HBASE. For example, Tencent built TDW by deeply customizing and reworking HIVE, while Xiaomi and other companies chose HBASE. For an introduction to HIVE, HBASE, and IMPALA, see my previous articles.

The hardest problem in today's real-time analysis systems is the multi-dimensional complex query, and there is currently no good solution. Over the past two days I have been discussing MPP DB (distributed databases, with Greenplum as the most typical representative). MPP DB does outperform HIVE/HBASE/IMPALA on multi-dimensional complex queries, so many people believe it is the future solution for this scenario. However, MPP DB also has two fatal shortcomings that must be considered when choosing a technology:

1. Scalability:

MPP DB claims to scale to more than 1,000 nodes. In practice, as far as I can tell from public information, no deployment exceeds 100 nodes. For example, the largest Greenplum cluster that Alipay uses for financial data analysis has a little over 60 machines. In addition, when we talked with the Greenplum company, we learned that the largest deployment, used for data storage at Guangdong Mobile, has fewer than 100 machines. This is simply not in the same order of magnitude as Hadoop clusters of four to five thousand nodes.

Why is MPP DB not scalable?

There are many reasons, including product maturity and application breadth, but the most fundamental problem is the architecture itself. When it comes to architecture, we must first talk about the CAP principle:

Consistency: data updates are consistent, and all data changes are synchronous.
Availability: good response performance.
Partition tolerance: reliability.

Theorem: any distributed system can satisfy only two of these at the same time, never all three.

Advice: architects should not waste energy trying to design a perfect distributed system that satisfies all three, but should make trade-offs.




MPP DB is still an extension of the traditional DB. A DB naturally pursues consistency, which inevitably leads to poor partition tolerance. When the cluster grows too large and holds too much business data, metadata management in MPP DB becomes a complete disaster: the metadata is huge, recovery is difficult once an error occurs, and the database is easily corrupted.

Therefore, for MPP DB to improve qualitatively in scalability, it needs architectural breakthroughs in metadata and data storage, and it must relax its consistency requirements. Otherwise, it is hard to believe that an MPP DB can be scaled out easily.

2. Concurrency support:

A query system is built for people to use, so the more concurrency it supports, the better. The core principle of MPP DB is that a large query is parsed into sub-queries, which are distributed to the underlying nodes for execution, and the results are finally merged. This brute-force scan approach puts the capacity of the entire system behind a single query. A single query is therefore relatively fast, but it also uses so much of the system that the overall concurrency cannot be high. In practice, an MPP DB supports roughly 50 to 100 concurrent queries.
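The split, distribute, and merge flow described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Greenplum's actual executor: the `SEGMENTS` data, the function names, and the use of threads in place of separate machines are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data layout: each "segment" stands in for one node and holds
# a slice of the table as (region, amount) rows.
SEGMENTS = [
    [("north", 10), ("south", 5)],
    [("north", 7), ("east", 3)],
    [("south", 2), ("east", 8)],
]

def scan_segment(rows):
    """Sub-query: scan every row of one segment and sum amounts per region."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial

def merge_partials(partials):
    """Final step: combine the partial sums returned by all segments."""
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total

def run_query(segments):
    """Send the same sub-query to every segment in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(scan_segment, segments))
    return merge_partials(partials)

if __name__ == "__main__":
    print(run_query(SEGMENTS))
```

Note that `run_query` keeps every segment busy for the duration of one query. That is what makes a single query fast, and it is also why many simultaneous queries end up competing for the same disks and CPUs.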

At present, HBASE and IMPALA also handle complex queries with full scans of the disks. In this scenario, the more hard disks and the faster their rotation speed, the better. Why, then, does HBASE claim to support thousands of concurrent queries? That is only possible in specific scenarios, where the query carries a user identifier, that is, a row key. In complex query scenarios, any system will grind to a halt.
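The difference between a row-key query and a complex query can be illustrated with a small sketch. It does not use the HBASE API; it only assumes, as in HBASE, that rows are stored sorted by row key. The table contents and function names are hypothetical.

```python
import bisect

# Hypothetical table kept sorted by row key, as (row_key, value) pairs.
# Zero-padded keys make lexicographic order match numeric order.
ROWS = sorted((f"user{i:04d}", i % 7) for i in range(1000))
KEYS = [key for key, _ in ROWS]

def get_by_row_key(row_key):
    """Point lookup: binary search over the sorted keys, O(log n) work."""
    i = bisect.bisect_left(KEYS, row_key)
    if i < len(KEYS) and KEYS[i] == row_key:
        return ROWS[i][1]
    return None

def scan_by_value(target):
    """Filter on a non-key column: every row must be read, O(n) work."""
    return [key for key, value in ROWS if value == target]

if __name__ == "__main__":
    print(get_by_row_key("user0045"))
    print(len(scan_by_value(3)))
```

A lookup by row key touches only a handful of entries, so many of them can run at once. A filter on any other column has to read the whole table, which is the situation a complex query puts every system in.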

Therefore, the application scenario for MPP DB is quite clear: it suits small clusters (up to 100 nodes) and low concurrency (about 50 concurrent queries). I do not know whether MPP DB is the trend of the future, but at least for now it is very difficult to use MPP DB for real-time analysis of big data.

