Read the full text
http://click.aliyun.com/m/23098/
1. Catalog
2. Catalog
3. Background and design ideas
4. Architecture
No proxy node
With proxy node
Module description
Difference between the two architectures
5. Application architecture
6. Basic Concept Description
7. Addition, deletion and modification operations
8. Query operations
Stage tree
Stage
Query steps
9. Examples
Balance strategy
Query
9.1 Sorting
9.2 Grouping and aggregation
9.3 Joining
9.4 Subqueries
10. Differences from and advantages over existing systems
11. Application scenarios
3. Background and design ideas
Complex SQL, such as global sorting, grouping, joins, and subqueries, is difficult to implement on a distributed database, especially when these operations involve non-balance fields (fields other than the one the data is distributed by). To address this, and drawing on practical experience with both distributed databases and Hadoop, I compared the strengths and weaknesses of the two, added some refinement and thinking of my own, and designed a system that combines them so that each covers the other's shortcomings. Specifically, the system uses horizontal partitioning of the database for data storage and the MapReduce model for SQL computation.
Horizontal partitioning here means splitting data across databases only, not across tables. Tables of different orders of magnitude can be split into different numbers of sub-databases: for example, a table with a data volume of 100 million is split into 10 sub-databases, and one with a data volume of 1 billion into 50. As for computing with the MapReduce model: a requirement is converted into one or more SQL statements with dependencies between them; each statement is decomposed into one or more MapReduce tasks; and each MapReduce task consists of a mapsql, a shuffle, and a reducesql. The process is similar to Hive, except that the map and reduce operations inside each MapReduce task are themselves implemented in SQL rather than as Hadoop map and reduce operations.
This is the basic idea of MapReduce. In the Hadoop ecosystem, however, first-generation MapReduce stores results on disk, while second-generation MapReduce stores them in memory or on disk depending on memory usage. By analogy, when a database is used for storage, the results of a MapReduce task are stored in a table, and the database's caching mechanism naturally decides whether the data stays in memory or goes to disk according to the memory situation. In addition, MapReduce is not the only computing model in the Hadoop ecosystem; the MapReduce computation described here could be replaced by a Spark-like iterative computation over RDDs. This article explains the system in terms of MapReduce.
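The mapsql, shuffle, and reducesql steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's actual implementation: in-memory SQLite databases stand in for the sub-databases and the reduce nodes, and the `orders` table, its columns, the `map_out` intermediate table, the byte-sum hash, and the names `make_shard` and `run_query` are all hypothetical. The data is distributed by `user_id`, while the query groups by `city`, a field the data is not distributed by.

```python
import sqlite3

# Hypothetical sample data: (user_id, city, amount).
ROWS = [
    (1, "Hangzhou", 10),
    (2, "Hangzhou", 20),
    (3, "Beijing", 30),
    (4, "Beijing", 40),
    (5, "Shanghai", 50),
]
NUM_SHARDS = 2  # assumed number of sub-databases


def make_shard(rows):
    """Create one in-memory sub-database holding a slice of `orders`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, city TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


# Horizontal partitioning: rows are routed to a sub-database by user_id.
shards = [
    make_shard([r for r in ROWS if r[0] % NUM_SHARDS == i])
    for i in range(NUM_SHARDS)
]

# mapsql runs on every sub-database and produces partial aggregates.
MAP_SQL = (
    "SELECT city, SUM(amount), COUNT(*) FROM orders GROUP BY city"
)
# reducesql runs on each reduce node over the shuffled partial results.
REDUCE_SQL = (
    "SELECT city, SUM(part_sum), SUM(part_cnt) FROM map_out GROUP BY city"
)


def run_query(shards, num_reducers=2):
    # Each reduce node keeps intermediate results in a table, so the
    # database itself decides what stays in memory and what goes to disk.
    reducers = []
    for _ in range(num_reducers):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE map_out (city TEXT, part_sum INTEGER, part_cnt INTEGER)"
        )
        reducers.append(conn)

    # Map + shuffle: send each partial row to the reducer chosen by a
    # deterministic hash of the grouping key (an assumed, simplistic hash).
    for shard in shards:
        for city, part_sum, part_cnt in shard.execute(MAP_SQL):
            target = sum(city.encode("utf-8")) % num_reducers
            reducers[target].execute(
                "INSERT INTO map_out VALUES (?, ?, ?)", (city, part_sum, part_cnt)
            )

    # Reduce: every key lives on exactly one reducer, so the final
    # answer is the concatenation of the per-reducer results.
    result = []
    for conn in reducers:
        result.extend(conn.execute(REDUCE_SQL).fetchall())
    return sorted(result)


if __name__ == "__main__":
    for row in run_query(shards):
        print(row)
```

Because the shuffle routes all partial rows for the same `city` to one reduce node, the reducesql can finish the aggregation locally, regardless of how many sub-databases or reduce nodes are used.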
4. Architecture
According to the above ideas, the architecture of the system is as follows:
No proxy node
Distributed SQL Computing Method Based on Distributed Database Storage and Hadoop Distributed Computing