Table partitioning scheme for large tables, and points to note when a query across tables (of the same type) involves grouping

Recently I have been improving a table that holds a relatively large amount of data. When a table is large and data is added to it regularly or in real time, its capacity has to be considered, because a single table cannot grow indefinitely. Splitting it into sub-tables therefore becomes urgent.

Solution:

1) Based on how frequently rows are inserted, roughly estimate how fast the table grows. Plan a maximum amount of data for each table and decide whether to split by year, month, or day. The sub-tables share the same base name and use a time string as the suffix (for example: table name_yyyy, table name_yyyyMM, table name_yyyyMMdd);

2) If rows are inserted into the finest-grained table at intervals shorter than a day (for example, every minute) and the time field is stored in "yyyyMMdd" format (i.e. "day"), then statistics grouped by day over a period of time (from which day to which day) can be queried directly from the finest-grained table.

If the tables are split by day, a query may span several tables. For the approach to cross-table queries, see my previous article;

3) To query statistics grouped by "month" over a period of time (from which month to which month), querying the finest-grained table directly is no longer recommended: because the data is split into sub-tables, too many tables would have to be combined with UNION ALL, and the query would stall, which is unrealistic in terms of efficiency.

Solution: at the beginning of each month, a scheduled task summarizes the previous month's data. The time field of the summary is in "yyyyMM" format (i.e. "month"). If the finest-grained table is stored by month, the task reads the previous month's table, condenses it into one record that represents the whole month, and stores that record in a dedicated table such as "table name_month". Monthly statistics are then read directly from this month table;

4) To query statistics grouped by "year" over a period of time (from which year to which year), proceed as above: summarize the month-level data in the "table name_month" table and store the result in a "table name_year" table. Each record in this year table represents the overall situation of one "year", and its time field is in "yyyy" format (i.e. "year"). Yearly statistics are then read directly from the "table name_year" table.

5) When the scheduled task summarizes data into the month and year tables, it must also create the table for the next month or the next year in advance.
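The steps above can be sketched as a small scheduled task. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than the author's database, the base name itm and the columns app, src, cnt are borrowed from the example tables later in this article, the helper names (partition_name, run_monthly_task, and so on) are hypothetical, and it keeps one summary row per src rather than a single row per month.

```python
import sqlite3
from datetime import date

def partition_name(base, day, granularity):
    """Build a suffixed table name, e.g. itm_20240115 for a daily table."""
    fmt = {"year": "%Y", "month": "%Y%m", "day": "%Y%m%d"}[granularity]
    return f"{base}_{day.strftime(fmt)}"

def next_month(day):
    """First day of the month after `day`."""
    return date(day.year + day.month // 12, day.month % 12 + 1, 1)

def previous_month(day):
    """First day of the month before `day`."""
    return date(day.year - (day.month == 1), (day.month - 2) % 12 + 1, 1)

def create_partition(conn, name):
    # Assumed column layout: time field, source, counter.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {name} (app TEXT, src TEXT, cnt INTEGER)"
    )

def run_monthly_task(conn, base, today):
    """Summarize last month's table into <base>_month, then pre-create next month's table."""
    prev = previous_month(today)
    detail = partition_name(base, prev, "month")
    create_partition(conn, f"{base}_month")
    conn.execute(
        f"INSERT INTO {base}_month (app, src, cnt) "
        f"SELECT ?, src, SUM(cnt) FROM {detail} GROUP BY src",
        (prev.strftime("%Y%m"),),  # time field in yyyyMM format
    )
    create_partition(conn, partition_name(base, next_month(today), "month"))

# Usage with made-up rows in a January table, task run on 1 February.
conn = sqlite3.connect(":memory:")
create_partition(conn, "itm_202401")
conn.executemany(
    "INSERT INTO itm_202401 VALUES (?, ?, ?)",
    [("20240105", "nbk1", 3), ("20240120", "nbk1", 4), ("20240120", "nbk5", 2)],
)
run_monthly_task(conn, "itm", date(2024, 2, 1))
month_rows = conn.execute(
    "SELECT app, src, cnt FROM itm_month ORDER BY src"
).fetchall()
print(month_rows)
```

A real task would also need to guard against running twice for the same month (for example by deleting that month's rows first); the sketch leaves this out.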

Next, the points to note when grouping across tables of the same type. Suppose there are two tables, itm_test and itm_test2, whose time field is app. Given the background above, a value of the app field never appears in both tables.

The data of these two tables are as follows:

itm_test table:

itm_test2 table:

(Again, assume app is the time field. Its values are never repeated across the two tables.)

Case 1: Group by a single field, this field is not a time field app.

Method 1 (GROUP BY each table first, then UNION ALL; the query conditions are applied to each table):

(SELECT src,SUM(cnt) FROM itm_test WHERE app>'app2' GROUP BY src ORDER BY src)
UNION ALL (SELECT src,SUM(cnt) FROM itm_test2 WHERE app<'app7' GROUP BY src ORDER BY src)

Query result:

Method 2 (UNION ALL the two tables first, then GROUP BY; the query conditions are applied to each table):

SELECT src, SUM(cnt) FROM ((SELECT * FROM itm_test WHERE app>'app2') UNION ALL (SELECT * FROM itm_test2 WHERE app<'app7')) tmp GROUP BY src ORDER BY src

Query result:

Method 3 (UNION ALL the two tables first, then GROUP BY; the query conditions are applied to the temporary table after the UNION ALL):

SELECT src, SUM(cnt) FROM ((SELECT * FROM itm_test) UNION ALL (SELECT * FROM itm_test2)) tmp
  WHERE app > 'app2' AND app < 'app7' GROUP BY src ORDER BY src

Query result:

To sum up: method 1 is wrong. The intent is to total cnt by src over a period of time, so each src should appear only once. Method 1 returns 2 records for nbk5 (two records with the same src), because each table is grouped separately before the UNION ALL and the partial sums are never merged.
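The difference can be reproduced with a minimal sketch. It assumes Python's built-in sqlite3 module instead of MySQL, and the rows are made up for illustration (they are not the author's data). SQLite does not accept parenthesized SELECTs inside a UNION, so the queries are written without them, and the per-table ORDER BY is replaced by sorting in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("itm_test", "itm_test2"):
    conn.execute(f"CREATE TABLE {table} (app TEXT, src TEXT, cnt INTEGER)")

# Hypothetical sample rows; app values never repeat across the two tables.
conn.executemany("INSERT INTO itm_test VALUES (?, ?, ?)", [
    ("app1", "nbk1", 1), ("app2", "nbk2", 2),
    ("app3", "nbk5", 3), ("app4", "nbk1", 4),
])
conn.executemany("INSERT INTO itm_test2 VALUES (?, ?, ?)", [
    ("app5", "nbk5", 5), ("app6", "nbk2", 6),
    ("app7", "nbk5", 7), ("app8", "nbk1", 8),
])

# Method 1: group each table, then UNION ALL (partial sums are not merged).
method1 = sorted(conn.execute(
    "SELECT src, SUM(cnt) FROM itm_test WHERE app > 'app2' GROUP BY src "
    "UNION ALL "
    "SELECT src, SUM(cnt) FROM itm_test2 WHERE app < 'app7' GROUP BY src"
).fetchall())

# Method 2: UNION ALL the filtered rows, then group once.
method2 = conn.execute(
    "SELECT src, SUM(cnt) FROM ("
    "  SELECT * FROM itm_test WHERE app > 'app2' "
    "  UNION ALL "
    "  SELECT * FROM itm_test2 WHERE app < 'app7'"
    ") tmp GROUP BY src ORDER BY src"
).fetchall()

# Method 3: UNION ALL everything, then filter and group on the derived table.
method3 = conn.execute(
    "SELECT src, SUM(cnt) FROM ("
    "  SELECT * FROM itm_test UNION ALL SELECT * FROM itm_test2"
    ") tmp WHERE app > 'app2' AND app < 'app7' GROUP BY src ORDER BY src"
).fetchall()

print(method1)  # nbk5 appears twice, once per table
print(method2)  # one row per src
print(method3)  # same as method 2
```

With this sample data, method 1 yields separate nbk5 rows of 3 and 5, while methods 2 and 3 yield a single nbk5 row of 8.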

Case 2: Group by multiple fields, one of which is the time field app.

Method 1 (GROUP BY each table first, then UNION ALL; the query conditions are applied to each table):

(SELECT app,src,SUM(cnt) FROM itm_test WHERE app>'app2' GROUP BY app,src ORDER BY app,src )
UNION ALL (SELECT app,src,SUM(cnt) FROM itm_test2 WHERE app<'app7' GROUP BY app,src ORDER BY app,src )

Method 2 (UNION ALL the two tables first, then GROUP BY; the query conditions are applied to each table):

SELECT app,src, SUM(cnt) FROM ((SELECT * FROM itm_test WHERE app>'app2') UNION ALL (SELECT * FROM itm_test2 WHERE app<'app7')) tmp GROUP BY app,src ORDER BY app,src

Method 3 (UNION ALL the two tables first, then GROUP BY; the query conditions are applied to the temporary table after the UNION ALL):

SELECT app, src, SUM(cnt) FROM ((SELECT * FROM itm_test) UNION ALL (SELECT * FROM itm_test2)) tmp
  WHERE app > 'app2' AND app < 'app7' GROUP BY app, src ORDER BY app, src

All three methods return the same query result:

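To sum up: when the statistics are grouped by multiple fields and one of them is app (the time field), the three methods are equivalent. Because an app value never appears in both tables, no (app, src) group can span the two tables, so grouping before or after the UNION ALL gives the same rows.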
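A minimal sketch of this equivalence, again assuming Python's built-in sqlite3 module instead of MySQL and made-up rows (not the author's data). The queries omit the parentheses around each SELECT because SQLite does not accept them inside a UNION, and method 1 is sorted in Python in place of the per-table ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("itm_test", "itm_test2"):
    conn.execute(f"CREATE TABLE {table} (app TEXT, src TEXT, cnt INTEGER)")

# Hypothetical rows; each app value lives in exactly one table.
conn.executemany("INSERT INTO itm_test VALUES (?, ?, ?)", [
    ("app1", "nbk1", 1), ("app3", "nbk5", 3),
    ("app3", "nbk5", 1), ("app4", "nbk1", 4),
])
conn.executemany("INSERT INTO itm_test2 VALUES (?, ?, ?)", [
    ("app5", "nbk5", 5), ("app6", "nbk2", 6),
    ("app6", "nbk2", 1), ("app8", "nbk1", 8),
])

# Method 1: group each table by (app, src), then UNION ALL.
by_table = sorted(conn.execute(
    "SELECT app, src, SUM(cnt) FROM itm_test WHERE app > 'app2' GROUP BY app, src "
    "UNION ALL "
    "SELECT app, src, SUM(cnt) FROM itm_test2 WHERE app < 'app7' GROUP BY app, src"
).fetchall())

# Method 2: UNION ALL the filtered rows, then group by (app, src).
union_then_group = conn.execute(
    "SELECT app, src, SUM(cnt) FROM ("
    "  SELECT * FROM itm_test WHERE app > 'app2' "
    "  UNION ALL "
    "  SELECT * FROM itm_test2 WHERE app < 'app7'"
    ") tmp GROUP BY app, src ORDER BY app, src"
).fetchall()

# Method 3: UNION ALL everything, then filter and group on the derived table.
filter_after_union = conn.execute(
    "SELECT app, src, SUM(cnt) FROM ("
    "  SELECT * FROM itm_test UNION ALL SELECT * FROM itm_test2"
    ") tmp WHERE app > 'app2' AND app < 'app7' "
    "GROUP BY app, src ORDER BY app, src"
).fetchall()

print(by_table)
print(by_table == union_then_group == filter_after_union)
```

The three results match here only because the time field is part of the grouping key and its values are disjoint between the tables.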

Therefore, when a query across tables of the same type involves grouped statistics, check whether the time field is among the GROUP BY fields: if it is, the three methods are equivalent; if not, method 1 is wrong. The query conditions give the same result whether they are applied to each table or to the temporary table after the UNION ALL (this holds here because the two tables cover consecutive, non-overlapping time ranges and the queried range spans both). The temporary table must be given a table alias (tmp in the queries above), otherwise MySQL reports the error:

Every derived table must have its own alias
