The mathematical basis of GROUP BY and PARTITION BY (2)

Preface

When extracting data with SQL, a common operation is to group rows by some criterion. Grouping is not unique to SQL: we often need it when organizing or analyzing data in daily life as well.
SQL has two grouping constructs, GROUP BY and PARTITION BY. Both divide a table into groups according to the specified columns. The only difference is that GROUP BY then aggregates each group into a single row.
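As a minimal sketch of that difference, the following Python script runs both forms against an in-memory SQLite database. The sample rows are made up for illustration, and the window-function query assumes the bundled SQLite is version 3.25 or later.

```python
import sqlite3

# Hypothetical sample data; the names, teams and ages are made up for illustration.
rows = [
    ("Alice", "A", 28),
    ("Bob", "A", 35),
    ("Carol", "B", 41),
    ("Dave", "B", 22),
    ("Eve", "B", 30),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Members (member TEXT, team TEXT, age INTEGER)")
conn.executemany("INSERT INTO Members VALUES (?, ?, ?)", rows)

# GROUP BY: each team collapses into one output row.
grouped = conn.execute(
    "SELECT team, SUM(age) FROM Members GROUP BY team ORDER BY team"
).fetchall()

# PARTITION BY: every input row is kept and annotated with its team's total.
# Assumption: SQLite >= 3.25, which is when window functions were added.
windowed = conn.execute(
    "SELECT member, team, SUM(age) OVER (PARTITION BY team) "
    "FROM Members ORDER BY team, member"
).fetchall()

print(grouped)   # two rows, one per team
print(windowed)  # five rows, one per member
```

Both queries split the table into the same two groups; only the shape of the output differs.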

Set theory and GROUP BY

Consider the following table.
[Figure: the source table, with columns member, team and age]
Suppose we use GROUP BY or PARTITION BY on this table to obtain information per team. Whichever is used, the original table (named Members in the query below) is divided into the following subsets, which are then aggregated with a function such as SUM or ranked with a function such as RANK.

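-- Rank the members of each team from oldest to youngest.
SELECT member, team, age,
       RANK()       OVER (PARTITION BY team ORDER BY age DESC) AS rn,        -- ties share a rank, gaps follow
       DENSE_RANK() OVER (PARTITION BY team ORDER BY age DESC) AS dense_rn,  -- ties share a rank, no gaps
       ROW_NUMBER() OVER (PARTITION BY team ORDER BY age DESC) AS row_num    -- always consecutive
  FROM Members
 ORDER BY team, rn;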
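To make the three numbering schemes in this query concrete, here is a small pure-Python sketch that mimics them. The rows are hypothetical, and where SQL leaves the order of tied rows unspecified for ROW_NUMBER, this sketch simply keeps the input order.

```python
from itertools import groupby

# Hypothetical (member, team, age) rows, made up for illustration.
rows = [
    ("Alice", "A", 35),
    ("Bob", "A", 35),
    ("Carol", "A", 28),
    ("Dave", "B", 41),
    ("Eve", "B", 30),
]

def rank_within_team(rows):
    """Return (member, team, age, rank, dense_rank, row_number) tuples."""
    out = []
    # Sort by team, then by age descending; Python's sort is stable for ties.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    for team, group in groupby(ordered, key=lambda r: r[1]):
        rank = dense = 0
        prev_age = None
        for row_num, (member, _, age) in enumerate(group, start=1):
            if age != prev_age:
                rank = row_num  # RANK jumps to the row position, leaving gaps
                dense += 1      # DENSE_RANK advances by exactly one
                prev_age = age
            out.append((member, team, age, rank, dense, row_num))
    return out

result = rank_within_team(rows)
for r in result:
    print(r)
```

In team A the two 35-year-olds tie, so Carol gets rank 3 but dense rank 2, while the row numbers run 1, 2, 3 regardless.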

[Figure: result of the query above]
The subsets produced by the division are shown in the figure below.
[Figure: the table divided into one subset per team]

They have the following three properties.
1. Every subset is non-empty.
2. The union of all subsets is equal to the set before the division.
3. Any two distinct subsets are disjoint (their intersection is empty).

Because the subsets are formed from values of the "team" column that actually occur in the table, none of them can be empty. Moreover, putting all the subsets back together obviously gives the original set. In other words, no member is left without a team after the division.
Nor does any member belong to two subsets (that is, to several teams at once). After the division, each member belongs to exactly one subset. So we can think of GROUP BY and PARTITION BY as functions that divide the team members into groups.
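The three properties can be checked mechanically. The following is a minimal sketch in Python; the member list and the helper name `partition_by` are made up for illustration, and the rows are assumed to be distinct because Python sets drop duplicates.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (member, team) pairs, made up for illustration.
members = [("Alice", "A"), ("Bob", "A"), ("Carol", "B"), ("Dave", "B"), ("Eve", "C")]

def partition_by(items, key):
    """Split items into subsets keyed by key(item), as GROUP BY / PARTITION BY do."""
    classes = defaultdict(set)
    for item in items:
        classes[key(item)].add(item)
    return dict(classes)

classes = partition_by(members, key=lambda m: m[1])
subsets = list(classes.values())

# Property 1: every subset is non-empty.
all_non_empty = all(len(s) > 0 for s in subsets)
# Property 2: the union of the subsets is the original set.
union_is_original = set().union(*subsets) == set(members)
# Property 3: any two distinct subsets are disjoint.
pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(subsets, 2))

print(all_non_empty, union_is_original, pairwise_disjoint)
```

A subset is only created when some row carries its key, which is why property 1 holds by construction.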
In mathematics, each subset obtained this way is called a "class", the family of subsets satisfying the three properties above is called a "partition" of the original set, and the operation of dividing the original set into classes is called "classification". These are terms from fields such as group theory. A class in this sense means the same thing as "class" in everyday use (a class in a school, for example), so the idea is easy to grasp.
The name of the PARTITION BY clause in SQL comes from this concept of a class (i.e. a partition). The GROUP BY clause could have been given the same name, but because it also performs aggregation after the classification, a different name is used to avoid ambiguity. In general, a set can be classified in many different ways. The same holds in SQL: if you change the columns given to GROUP BY or PARTITION BY, the resulting groups change accordingly.
GROUP BY is used very frequently in SQL, which tells us how many kinds of classes there are around us: the classes in a school, for example, or the birthplaces of students. A class with no students is meaningless, and nobody is born in two provinces at once (there may be people whose birthplace is unknown, but they should be placed in a class of their own, NULL).
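The NULL case can be illustrated with a short sketch that runs a GROUP BY through Python's `sqlite3` module. The Students table and its rows are hypothetical. In standard SQL, GROUP BY puts all NULL keys into a single group, so the students with an unknown birthplace form one class of their own.

```python
import sqlite3

# Hypothetical students; None is stored as SQL NULL (birthplace unknown).
students = [
    ("Ann", "Hunan"),
    ("Ben", "Hunan"),
    ("Cy", "Sichuan"),
    ("Di", None),
    ("Ed", None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, birthplace TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)", students)

# One output row per birthplace; the NULLs are collected into a single group.
counts = conn.execute(
    "SELECT birthplace, COUNT(*) FROM Students GROUP BY birthplace"
).fetchall()

by_place = dict(counts)
print(by_place)
```

The class sizes add up to the number of students, so the NULL class keeps the partition complete: no row is left unassigned.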

What can be done with modulo in SQL?

Taking residue classes (grouping numbers by their remainder after division by n) divides the natural numbers into classes of equal size, which is very convenient when you need to sample a fixed proportion of a large data set. For example, the following query reduces the data to about one-fifth of the original (when the table has no consecutively numbered column, use the ROW_NUMBER function to number the rows).
The original table holds about 90,000 rows; the following SQL selects one-fifth of them and counts the result.

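-- Count the rows whose sequence number falls in residue class 0 (mod 5).
select count(1)
from (
    select *,
           row_number() over (order by glass_id) as seq1  -- consecutive numbering 1, 2, 3, ...
    from sor.wpp_adefect_f_n
) t
where mod(seq1, 5) = 0;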

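The query above divides the data into five classes of equal size and keeps one of them. Strictly speaking this is systematic selection rather than random selection, so it meets the sampling requirement of "dividing the data equally at random" only when the ordering column (glass_id here) is unrelated to the values being studied.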
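The equal-size claim is easy to verify with a minimal Python sketch. It assumes a table of exactly 90,000 rows (the text only says "about") and uses consecutive integers in place of the ROW_NUMBER output.

```python
from collections import Counter

N_ROWS = 90_000  # assumption: the text says the table has about this many rows
MODULUS = 5

# seq plays the role of ROW_NUMBER(): consecutive integers starting at 1.
seqs = range(1, N_ROWS + 1)

# Size of each residue class 0..4.
class_sizes = Counter(seq % MODULUS for seq in seqs)

# The rows kept by "where mod(seq1, 5) = 0".
sample = [seq for seq in seqs if seq % MODULUS == 0]

print(sorted(class_sizes.items()))
print(len(sample), sample[:3])
```

When the row count is not a multiple of the modulus, the class sizes differ by at most one, so the sample is still within one row of one-fifth of the table.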

Origin blog.csdn.net/MyySophia/article/details/114938964