Distributed SQL Computing Method Based on Distributed Database Storage and Hadoop Distributed Computing

Read the full text http://click.aliyun.com/m/23098/
Contents

3. Background and design ideas

4. Architecture

No proxy node

With proxy node

Module description

Difference between the two architectures

5. Application architecture

6. Basic Concept Description

7. Addition, deletion and modification operations

8. Query operations

Stage tree

Stage

Query steps

9. Examples

Balance strategy

Query

9.1 Sorting

9.2 Grouping and aggregation

9.3 Joining

9.4 Subqueries

10. Differences from and advantages over existing systems

11. Application scenarios




3. Background and design ideas


Complex SQL operations such as global sorting, grouping, joins, and subqueries (especially when they operate on skewed, unevenly distributed fields) are difficult to implement on a distributed database. Drawing on practical experience with both distributed databases and Hadoop, and after comparing the strengths and weaknesses of the two, I designed a system that integrates them so that the advantages of each make up for the shortcomings of the other. Concretely, data storage follows the idea of database-level horizontal partitioning, and SQL computation follows the idea of MapReduce.
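To make the storage side of this idea concrete, here is a minimal sketch of hash-based routing of a row to one of N sub-databases. The function and key names (`route_shard`, `user_42`) are illustrative assumptions, not part of the original system.

```python
# Hypothetical sketch: pick a sub-database for a row by hashing its
# partition key. Names here are illustrative, not from the original system.
import hashlib

def route_shard(key: str, num_shards: int) -> int:
    """Map a partition key deterministically to a sub-database index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# As the text notes, a 100-million-row table might use 10 sub-databases
# while a 1-billion-row table uses 50; the router only needs the count.
shard = route_shard("user_42", 10)
print(f"INSERT goes to sub-database db_{shard}")
```

Because the routing is deterministic, every node that knows the shard count can independently find the sub-database holding a given key.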



Horizontal partitioning here means splitting only at the database level, not at the table level: a table is spread across multiple sub-databases rather than into sub-tables. Tables of different sizes can use different numbers of sub-databases; for example, a table with 100 million rows might be split across 10 sub-databases, while a table with 1 billion rows might use 50.

Computation follows the MapReduce idea: a requirement is translated into one or more SQL statements with dependencies among them; each SQL statement is decomposed into one or more MapReduce tasks; and each MapReduce task consists of a map SQL (mapsql), a shuffle, and a reduce SQL (reducesql). The process is similar to Hive, with the difference that even the map and reduce operations inside a task are implemented as SQL executed by the databases, not as map and reduce functions in Hadoop.
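The mapsql/shuffle/reducesql pipeline described above can be sketched end to end. This is a simulation under stated assumptions: in-memory SQLite stands in for the sub-databases, and the `orders` table with `region`/`amount` columns is invented for illustration.

```python
# Minimal sketch of the mapsql -> shuffle -> reducesql pipeline,
# simulating sub-databases with in-memory SQLite. Table and column
# names are illustrative assumptions, not from the original system.
import sqlite3

# 1. Three "sub-databases", each holding a horizontal slice of an orders table.
shards = [sqlite3.connect(":memory:") for _ in range(3)]
rows = [("east", 10), ("west", 20), ("east", 5),
        ("north", 7), ("west", 1), ("east", 2)]
for i, conn in enumerate(shards):
    conn.execute("CREATE TABLE orders (region TEXT, amount INT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows[i::3])

# 2. mapsql: each shard pre-aggregates its local data.
mapsql = "SELECT region, SUM(amount) FROM orders GROUP BY region"
partials = [row for conn in shards for row in conn.execute(mapsql)]

# 3. shuffle: partial results move to an intermediate table
#    (the text stores such MapReduce results in database tables).
reducer = sqlite3.connect(":memory:")
reducer.execute("CREATE TABLE partial (region TEXT, amount INT)")
reducer.executemany("INSERT INTO partial VALUES (?, ?)", partials)

# 4. reducesql: merge the partial aggregates into the global result.
reducesql = ("SELECT region, SUM(amount) FROM partial "
             "GROUP BY region ORDER BY region")
result = dict(reducer.execute(reducesql))
print(result)  # {'east': 17, 'north': 7, 'west': 21}
```

Note that both the map and the reduce steps are plain SQL run by a database, which is exactly the difference from Hadoop's map and reduce functions that the text emphasizes.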



This is the basic idea of MapReduce. In the Hadoop ecosystem, first-generation MapReduce writes intermediate results to disk, while the second generation keeps them in memory or on disk depending on memory usage. By analogy, this system stores the result of each MapReduce task in a database table, and the database's own caching mechanism naturally decides whether that data resides in memory or on disk. Note also that MapReduce is not the only computing model in the Hadoop ecosystem; the MapReduce-style computation described here could equally be replaced by a Spark-like iterative RDD computation. This document explains the system in terms of MapReduce.



4. Architecture
According to the above ideas, the architecture of the system is as follows:

No proxy node
