These two problems are not unique to Mycat; they are encountered by every distributed system. Any solution is a compromise: trade time for space, or space for time.
1. Multi-dimension aggregation problem
For example, suppose we have a log table with columns such as department, user, module, and access time. Now there is a requirement: in real time, report which department, which user, and which system module had the most accesses at a given moment.
select
  department, user, access_time, module, count(*) as cn
from log_table
group by department, user, access_time, module
order by cn desc
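To see why this query is expensive in a sharded setup, consider how a coordinator would have to execute it: each shard runs the GROUP BY locally, and the coordinator then merges every partial group from every shard in memory before it can sort. The sketch below illustrates this merge step in Python; the shard data and group keys are made-up illustration values, not Mycat internals.

```python
from collections import Counter

# Hypothetical partial results: each shard has already run the GROUP BY
# locally and returns (department, user, access_time, module) -> count.
shard_results = [
    Counter({("sales", "alice", "10:00", "billing"): 3,
             ("ops", "bob", "10:00", "auth"): 6}),
    Counter({("sales", "alice", "10:00", "billing"): 2,
             ("ops", "carol", "10:05", "auth"): 4}),
]

def merge_group_by(shards):
    """Coordinator-side merge: sum the partial counts per group key,
    then sort by count descending (the ORDER BY cn DESC step).
    Every distinct group key from every shard must be held in memory
    at once, which is what makes this blow up on massive data."""
    total = Counter()
    for partial in shards:
        total.update(partial)          # sums counts for matching keys
    return sorted(total.items(), key=lambda kv: kv[1], reverse=True)

merged = merge_group_by(shard_results)
# bob's group has the highest total count (6); alice's spans two shards (3 + 2).
```

With four grouping columns, the number of distinct keys can approach the number of rows, so the coordinator ends up buffering nearly the whole table.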
When the data volume is massive, Mycat simply grinds to a halt on this query.
2. Deep paging problem
Deep paging in a clustered system
To understand why deep paging is problematic, assume we are searching an index with 5 primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results and selects the top 10.
Now suppose we request page 1000 - results 10001 to 10010. Everything works the same way, except that each shard must now produce its top 10010 results. The coordinating node then sorts all 50050 of them and discards 50040!
As you can see, in a distributed system the cost of sorting results grows steeply as paging goes deeper. This is why no web search engine will return more than about 1000 results for any query.
Why does requesting page 1000 - results 10001 to 10010 - require each shard to return 10010 results?
Because the global rank of any given row cannot be known in advance: a row that ranks, say, 10005th overall might rank 1st on its own shard. The coordinating node therefore has to collect the top 10010 rows from every shard before it can perform the final sort.
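The scatter-gather mechanics described above can be sketched as follows. This is a minimal illustration, not Mycat's actual implementation: each "shard" is a Python list already sorted in descending order, and `fetch_page` plays the role of the coordinating node.

```python
import heapq

def fetch_page(shards, offset, size):
    """Scatter-gather paging sketch. Each shard must return its own top
    (offset + size) rows, because the coordinator cannot know where any
    row ranks globally. The coordinator merges all of them, skips the
    first `offset`, keeps `size`, and throws the rest away."""
    top_n = offset + size
    per_shard = [shard[:top_n] for shard in shards]    # work done on each shard
    merged = heapq.merge(*per_shard, reverse=True)     # coordinator-side merge
    return list(merged)[offset:offset + size]          # keep size, discard the rest

# 5 shards, each pre-sorted descending; together they cover 0..1000.
shards = [list(range(1000 - s, -1, -5)) for s in range(5)]

page = fetch_page(shards, offset=10, size=10)
# For offset=10000, size=10 (the "page 1000" case), every shard would
# ship 10010 rows and the coordinator would discard 50040 of 50050.
```

Note how the work per shard is proportional to `offset + size`, not to `size`: the deeper the page, the more every shard must compute and transfer, even though only 10 rows survive.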