JD.com interview question: to deduplicate 1 million (100W) rows, should you use DISTINCT or GROUP BY, and why?

Background Note:

MySQL tuning is routine work for most developers, which makes it a core interview topic. In the reader community (50+) of Nien, a 40-year-old architect, related interview questions come up very frequently.

Recently, a community member interviewed at JD.com and ran into a question about deduplication and optimization:

To deduplicate 1 million (100W) records, should you use DISTINCT or GROUP BY, and why?

Deduplicating data efficiently is both a focus and a difficulty of MySQL tuning. The community has also seen close variants of this question:

Form 1: Which is more efficient, DISTINCT or GROUP BY?

Form 2: Why does GROUP BY lead to slow SQL in MySQL?

Form 3: ..., there are many more variants, all of which will be included in "Nien's Java Interview Collection".

Here, Nien works through the optimization of data deduplication systematically, so that you can show solid technical depth and leave a strong impression on the interviewer.

This question and its reference answer are also included in V73 of "Nien's Java Interview Collection", for the reference of later readers and to help everyone improve their "3-high" architecture, design, and development skills.

For the PDF files of the latest "Nien Architecture Notes", "Nien High Concurrency Trilogy" and "Nien's Java Interview Collection", please go to the official account [Technical Freedom Circle] at the end of the article.

Next, let's take a look at the functions, usage and underlying principles of distinct and group by.

First: the function and usage of DISTINCT

DISTINCT is a keyword used to remove duplicate rows from the results returned by a SELECT statement. When using the SELECT statement to query data, if the result set contains duplicate rows, you can use the SELECT DISTINCT statement to remove these duplicate rows. For example, the following statement returns a result set that contains duplicate rows:

SELECT name FROM users;

You can use the following statement to remove duplicate rows:

SELECT DISTINCT name FROM users;

This results in a result set that does not contain duplicate rows. Note that DISTINCT adds work to the query, because MySQL has to deduplicate the result set, typically by reading an index in order or by building an internal temporary table. Use DISTINCT deliberately, and back it with a suitable index where possible to keep the query fast.

DISTINCT usage

SELECT DISTINCT columns FROM table_name WHERE where_conditions;

For example:

mysql> select distinct age from student;
+------+
| age  |
+------+
|   10 |
|   12 |
|   11 |
| NULL |
+------+
4 rows in set (0.01 sec)

The DISTINCT keyword returns only distinct values. It is placed before the first column in the select list and applies to all columns in that list.

Note: The DISTINCT clause treats all NULL values as the same value.

If a column has NULL values and you use the DISTINCT clause on that column, MySQL keeps one NULL value and removes the other NULL values.

distinct multi-column deduplication

With multiple columns, DISTINCT deduplicates on the combination of the listed columns.

That is, two rows count as duplicates only if all the listed columns have the same values.

SELECT DISTINCT column1,column2 FROM table_name WHERE where_conditions;

mysql> select distinct sex,age from student;
+--------+------+
| sex    | age  |
+--------+------+
| male   |   10 |
| female |   12 |
| male   |   11 |
| male   | NULL |
| female |   11 |
+--------+------+
5 rows in set (0.02 sec)
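The source does not show the definition of the student table. To try the DISTINCT queries above, a minimal sketch could look like the following. The column types, the name and city columns, and the sample rows are all assumptions, chosen only so that the queries return the same distinct values as the output above.

```sql
-- Hypothetical schema: the article does not give the real table definition.
-- Run this in a scratch database.
DROP TABLE IF EXISTS student;
CREATE TABLE student (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL,
  sex  VARCHAR(8)  NOT NULL,
  age  INT         NULL,      -- nullable so that NULL handling can be observed
  city VARCHAR(32) NOT NULL
);

-- Assumed sample rows containing duplicate ages and two NULL ages.
INSERT INTO student (name, sex, age, city) VALUES
  ('Tom',  'male',   10,   'Beijing'),
  ('Lucy', 'female', 12,   'Shanghai'),
  ('Jack', 'male',   11,   'Beijing'),
  ('Bob',  'male',   NULL, 'Shenzhen'),
  ('Lily', 'female', 11,   'Shanghai'),
  ('Tim',  'male',   10,   'Shenzhen'),
  ('Sam',  'male',   NULL, 'Beijing');

-- Single column: 4 distinct values (10, 11, 12 and a single NULL).
SELECT DISTINCT age FROM student;

-- Multiple columns: 5 distinct (sex, age) combinations.
SELECT DISTINCT sex, age FROM student;
```

Without an ORDER BY clause the order of the returned rows is not guaranteed, so it may differ from the output shown above.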

Second: look at the usage scenarios of GROUP BY

The main usage scenario of GROUP BY is group aggregation.

Specifically, the GROUP BY clause is usually used to group query results by one or more columns, and then aggregate calculations for each group.

For example, given a table that stores each person's name, age, and city, you can use the GROUP BY clause to group people by city and calculate statistics such as the average age or population size for each city.

The GROUP BY clause is often used with aggregate functions such as COUNT, SUM, AVG, MAX, and MIN to calculate aggregate values for each group. For example, you can use the GROUP BY clause and the COUNT function to count the number of people in each city.
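As a small illustration of group aggregation, the sketch below groups people by city and computes the population and average age per city. The person table, its columns, and the sample rows are hypothetical and not taken from the source.

```sql
-- Hypothetical table and data, for illustration only.
DROP TABLE IF EXISTS person;
CREATE TABLE person (
  name VARCHAR(32) NOT NULL,
  age  INT         NOT NULL,
  city VARCHAR(32) NOT NULL
);

INSERT INTO person (name, age, city) VALUES
  ('Ann', 30, 'Beijing'),
  ('Ben', 40, 'Beijing'),
  ('Cid', 25, 'Shanghai');

-- One output row per city:
-- Beijing has 2 people with an average age of 35,
-- Shanghai has 1 person with an average age of 25.
SELECT city,
       COUNT(*) AS population,
       AVG(age) AS avg_age
FROM person
GROUP BY city;
```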

However, in addition to group aggregation, GROUP BY can also be used for data deduplication, and in some specific scenarios it performs better than DISTINCT.

Note: before MySQL 8.0, GROUP BY sorted the result set implicitly, so it could trigger a filesort. If the query also contains an ORDER BY clause that no index can satisfy, a temporary table plus a filesort is likely, which easily leads to slow SQL.

How to optimize it? Please continue to follow "Nien's Java Interview Collection", which will cover this later in conjunction with other interview questions.

GROUP BY underlying principle

The GROUP BY statement groups the result set by one or more columns. It is usually used together with aggregate functions such as COUNT, SUM, and AVG.

Suppose there is a need: count the number of users in each city.

The corresponding SQL statement is as follows:

select city ,count(*) as num from user group by city;

[Figure: partial execution result]

First use Explain to check the execution plan

explain select city ,count(*) as num from user group by city;

In the Extra field, we can see the following information:

  • Using temporary means that an internal temporary table is created during execution.
    Note that this temporary table may live in memory or on disk. If it is relatively small, it stays in memory. An in-memory temporary table is faster than a disk-based one.
  • Using filesort means that no index could supply the required order, so MySQL performs an extra sorting pass.
    "Using filesort" is a phrase in MySQL's EXPLAIN output that indicates the query needs an explicit sort of the result set. Despite the name, the sort runs in the in-memory sort buffer and spills to temporary files on disk only when the data does not fit.
    This can happen when the query includes an ORDER BY clause or a GROUP BY clause, and the database cannot use an index to satisfy the sort order.
    Sorting large result sets can be expensive in terms of disk I/O and memory usage, so it is best to avoid "Using filesort" as much as possible.
    Some strategies to avoid it include adding an appropriate index, limiting the size of the result set, or rewriting the query so that no sort is required.
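The indexing strategy mentioned above can be tried with a sketch like the following. The scratch database name dedup_demo, the table definition, and the index name idx_city are assumptions; the source only shows that the user table has a city column. The actual plan depends on the MySQL version, the data volume, and the table statistics.

```sql
-- Use a scratch database so that no real table named user is touched.
CREATE DATABASE IF NOT EXISTS dedup_demo;
USE dedup_demo;

-- Hypothetical definition of the user table.
DROP TABLE IF EXISTS user;
CREATE TABLE user (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL,
  city VARCHAR(32) NOT NULL
);

-- Without an index on city, Extra typically reports "Using temporary"
-- (and, before MySQL 8.0, also "Using filesort").
EXPLAIN SELECT city, COUNT(*) AS num FROM user GROUP BY city;

-- Add an index so that rows can be read already ordered by city.
ALTER TABLE user ADD INDEX idx_city (city);

-- With the index in place, Extra is expected to show "Using index"
-- instead of "Using temporary; Using filesort".
EXPLAIN SELECT city, COUNT(*) AS num FROM user GROUP BY city;
```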

So why does the GROUP BY statement use both a temporary table and a filesort?

First look at the entire execution process:

  1. A temporary in-memory table is created first. It has two fields, city and num, and city is the primary key.
  2. The user table is scanned row by row. For each row, take the value of the city field, call it c:
     • If the temporary table has no row with primary key c, insert a new record (c, 1);
     • If such a row exists, update it to (c, num + 1);
  3. After the scan, sort by city, and finally return the result set to the client.
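The steps above can be imitated by hand in SQL. This is only a sketch of the idea, not how MySQL implements its internal temporary table; the scratch database dedup_demo, the table tmp_city_count, and the sample rows are hypothetical.

```sql
CREATE DATABASE IF NOT EXISTS dedup_demo;
USE dedup_demo;

-- Hypothetical source table with a few sample rows.
DROP TABLE IF EXISTS user;
CREATE TABLE user (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  city VARCHAR(32) NOT NULL
);
INSERT INTO user (city) VALUES ('Beijing'), ('Shanghai'), ('Beijing');

-- Step 1: an in-memory temporary table with (city, num), city as primary key.
DROP TEMPORARY TABLE IF EXISTS tmp_city_count;
CREATE TEMPORARY TABLE tmp_city_count (
  city VARCHAR(32) NOT NULL PRIMARY KEY,
  num  INT         NOT NULL
) ENGINE = MEMORY;

-- Step 2: scan user; insert (c, 1) for a new city,
-- otherwise increment num on the existing row.
INSERT INTO tmp_city_count (city, num)
SELECT city, 1 FROM user
ON DUPLICATE KEY UPDATE num = num + 1;

-- Step 3: sort by city and return the result.
-- Returns ('Beijing', 2) and ('Shanghai', 1).
SELECT city, num FROM tmp_city_count ORDER BY city;
```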

[Figure: execution diagram of this process]

The second stage then sorts the in-memory temporary table. The sorting steps are essentially the same as the process for ORDER BY:

  1. Data is copied from the in-memory temporary table to the sort buffer,
  2. The data is sorted in the sort buffer, using full-field sorting or rowid sorting depending on the actual situation,
  3. The sorted result is written back to the in-memory temporary table,
  4. The result set is returned from the in-memory temporary table.

[Figure: flow diagram of the sorting stage]

The use of group by in deduplication scenarios

It is divided into two deduplication scenarios for introduction:

  • Single Column Deduplication
  • Multi-column deduplication

The difference between single-column deduplication and multi-column deduplication lies in the basis for deduplication.

Single-column deduplication refers to deduplication on one column, that is, only one copy of each duplicate value in that column is kept. For example, if you have a list of names with duplicates, single-column deduplication removes the repeated names and keeps one of each.

Multi-column deduplication refers to deduplication for multi-column data, that is, only one row is reserved for duplicate rows in multi-column data. For example, if you have a list of names, ages, and cities, you can use multi-column deduplication to remove duplicate rows with the same name, age, and city, and keep only one row.

Single Column Deduplication

GROUP BY is most often used for single-column deduplication.

For example, suppose there is a table containing student names and cities, you can use the following SQL statement to deduplicate a single column to get the number of students in different cities:

SELECT city, COUNT(DISTINCT name)  FROM student  GROUP BY city;

In the above statement, the rows are grouped by city with the GROUP BY clause, and COUNT(DISTINCT name) counts the number of distinct names in each city. This returns a result set where each row contains a city and the number of distinct names in that city, so names are deduplicated within each city.

For single-column deduplication, the use of group by is similar to distinct:

Single-column deduplication syntax:

SELECT columns FROM table_name WHERE where_conditions GROUP BY columns;

Example:

mysql> select age from student group by age;
+------+
| age  |
+------+
|   10 |
|   12 |
|   11 |
| NULL |
+------+
4 rows in set (0.02 sec)

Note: Like the DISTINCT clause, GROUP BY treats all NULL values as the same value.

If a column has NULL values and a GROUP BY clause is used on that column, MySQL keeps one NULL value and removes the other NULL values.

Multi-column deduplication

The GROUP BY clause can also be used to deduplicate multiple columns.

For example, suppose there is a table containing student names, cities and ages, you can use the following SQL statement to deduplicate multiple columns to obtain the number of students in different cities and different ages:

SELECT city, age, COUNT(DISTINCT name) 
FROM student
GROUP BY city, age;

In the above statement, the GROUP BY clause groups the rows by city and age, and COUNT(DISTINCT name) counts the number of distinct names in each (city, age) combination. This returns a result set in which each row contains a city, an age, and the number of distinct names for that combination, which deduplicates on multiple columns.

Multi-column deduplication syntax:

SELECT column1, column2 FROM table_name WHERE where_conditions GROUP BY column1, column2;

Multi-column deduplication example:

mysql> select sex,age from student group by sex,age;
+--------+------+
| sex    | age  |
+--------+------+
| male   |   10 |
| female |   12 |
| male   |   11 |
| male   | NULL |
| female |   11 |
+--------+------+
5 rows in set (0.03 sec)

The difference between group by multi-column de-duplication and single-column de-duplication

The GROUP BY clause groups the result set by the specified columns and performs aggregation on each group. The columns specified in the GROUP BY clause become the key used to group the rows. Because each group produces exactly one output row, grouping by a set of columns also deduplicates on those columns.

Single-column deduplication refers to using the GROUP BY clause to group the result set by a single column and perform an aggregation operation on each group. For example, the following query will return a list of distinct city names:

SELECT city FROM mytable GROUP BY city;

Multi-column deduplication refers to using the GROUP BY clause to group the result set by multiple columns and perform aggregation operations on each group. For example, the following query returns a list of distinct city and state combinations:

SELECT city, state FROM mytable GROUP BY city, state;

GROUP BY first groups the rows (and, before MySQL 8.0, sorts them), then returns one row for each group. Therefore, the only difference is that single-column deduplication groups by one column, while multi-column deduplication groups by multiple columns.

Distinct and group by deduplication principle analysis

In most cases, DISTINCT can be regarded as a special GROUP BY. Both are implemented as grouping operations, and both can be executed through a loose index scan or a tight index scan.

Loose index scan and tight index scan are the two ways MySQL can use an index to process a grouping operation.

Loose Index Scan relies on the fact that the keys of a B-tree index are ordered. When the grouping columns form a leftmost prefix of an index, MySQL can jump from one group to the next and read only a fraction of the keys instead of every key. When this method is used, EXPLAIN shows Using index for group-by in the Extra column.

Tight Index Scan is either a full index scan or a range index scan, depending on the query conditions. MySQL reads every key that satisfies the conditions in index order and then groups them. It reads more keys than a loose index scan, but it still avoids creating a temporary table.

MySQL chooses between the two automatically. A loose index scan is possible only under restrictions: for example, the query must read a single table, the GROUP BY columns must form a leftmost prefix of the index, and the aggregate functions that may appear are limited (for example MIN() and MAX() on an indexed column). When these conditions are not met, MySQL falls back to a tight index scan if an index still applies.

In general, a loose index scan is the faster of the two when it applies, because it skips most of the index. When neither scan can be used, MySQL groups through a temporary table, and that is when you need to pay attention to query efficiency.

Both DISTINCT and GROUP BY can use indexes in this way.

For example, running EXPLAIN on the following two SQL statements shows Using index for group-by in the Extra column for both, which means that both use a loose index scan.

Therefore, in general, for DISTINCT and GROUP BY statements with the same semantics, we can use the same index optimization method to optimize them.

mysql> explain select int1_index from test_distinct_groupby group by int1_index;
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | range | index_1       | index_1 | 5       | NULL |  955 |   100.00 | Using index for group-by |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
1 row in set (0.05 sec)

mysql> explain select distinct int1_index from test_distinct_groupby;
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
| id | select_type | table                 | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | range | index_1       | index_1 | 5       | NULL |  955 |   100.00 | Using index for group-by |
+----+-------------+-----------------------+------------+-------+---------------+---------+---------+------+------+----------+--------------------------+
1 row in set (0.05 sec)

However, before MySQL 8.0, GROUP BY implicitly sorts by the grouping columns by default.

As you can see, the following SQL statement uses a temporary table and also performs a filesort.

mysql> explain select int6_bigger_random from test_distinct_groupby GROUP BY int6_bigger_random;
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
| id | select_type | table                 | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra                           |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
|  1 | SIMPLE      | test_distinct_groupby | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 97402 |   100.00 | Using temporary; Using filesort |
+----+-------------+-----------------------+------------+------+---------------+------+---------+------+-------+----------+---------------------------------+
1 row in set (0.04 sec)
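The source does not include the definition of test_distinct_groupby. To reproduce plans like the ones above, a sketch under assumptions could look as follows: a nullable INT column with an index named index_1 (consistent with key_len = 5 in the plans above) and a second column without any index. The id column is hypothetical, and the table must first be loaded with a realistic amount of data; on an empty or very small table the optimizer may pick a different plan.

```sql
-- Hypothetical definition; only the column and index names come from the plans above.
DROP TABLE IF EXISTS test_distinct_groupby;
CREATE TABLE test_distinct_groupby (
  id                 INT AUTO_INCREMENT PRIMARY KEY,
  int1_index         INT NULL,   -- nullable INT: 4 bytes + 1 NULL flag = key_len 5
  int6_bigger_random INT NULL,   -- deliberately left without an index
  INDEX index_1 (int1_index)
);

-- Indexed column: GROUP BY and DISTINCT are expected to get the same plan.
EXPLAIN SELECT int1_index FROM test_distinct_groupby GROUP BY int1_index;
EXPLAIN SELECT DISTINCT int1_index FROM test_distinct_groupby;

-- Unindexed column: compare the Extra column of the two plans.
-- Before MySQL 8.0, GROUP BY typically adds "Using filesort" to "Using temporary",
-- while DISTINCT typically shows only "Using temporary".
EXPLAIN SELECT int6_bigger_random FROM test_distinct_groupby GROUP BY int6_bigger_random;
EXPLAIN SELECT DISTINCT int6_bigger_random FROM test_distinct_groupby;
```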

Implicit ordering of Group by

For implicit sorting, we can refer to the official MySQL explanation:

GROUP BY defaults to implicit sorting (meaning that sorting will be performed even if the GROUP BY column does not have an ASC or DESC indicator). However, GROUP BY for explicit or implicit ordering is deprecated. To generate a given sort order, provide an ORDER BY clause.

Reference link:

MySQL :: MySQL 5.7 Reference Manual :: 8.2.1.14 ORDER BY Optimization

Therefore, before MySQL 8.0, GROUP BY sorts the results by the grouping columns (the columns listed after GROUP BY) by default. When an index can supply the order, GROUP BY needs no extra sort. When it cannot, the MySQL optimizer has to implement GROUP BY with a temporary table followed by a sort. Worse, when the temporary result set exceeds the configured size limit for in-memory temporary tables, MySQL moves the temporary table to disk, and the statement becomes very slow. This is why MySQL chose to deprecate implicit sorting.

For these reasons, MySQL changed this behavior in 8.0:

Previously (MySQL 5.7 and earlier), GROUP BY sorted implicitly under certain conditions.

In MySQL 8.0 this behavior has been removed, so it is no longer necessary to add ORDER BY NULL to suppress the implicit sort. However, query results may come back in a different order than in earlier MySQL versions.

If you want results in a given order, specify the columns to sort by with ORDER BY.

MySQL :: MySQL 8.0 Reference Manual :: 8.2.1.16 ORDER BY Optimization
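The two ways of controlling the order can be written as follows. This is a minimal sketch; the scratch database dedup_demo and the definition of the user table are assumptions.

```sql
CREATE DATABASE IF NOT EXISTS dedup_demo;
USE dedup_demo;

-- Hypothetical table with a few sample rows.
DROP TABLE IF EXISTS user;
CREATE TABLE user (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  city VARCHAR(32) NOT NULL
);
INSERT INTO user (city) VALUES ('Shanghai'), ('Beijing'), ('Beijing');

-- MySQL 5.7 and earlier: ORDER BY NULL suppresses the implicit sort of GROUP BY.
-- In MySQL 8.0 it is still accepted but no longer needed.
SELECT city, COUNT(*) AS num
FROM user
GROUP BY city
ORDER BY NULL;

-- Any version: ask for the order explicitly when it matters.
-- Returns ('Beijing', 2) and then ('Shanghai', 1).
SELECT city, COUNT(*) AS num
FROM user
GROUP BY city
ORDER BY city;
```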

Therefore, our conclusion also came out:

  • Same semantics, with an index: both GROUP BY and DISTINCT can use the index, and their efficiency is the same. GROUP BY and DISTINCT are almost equivalent here; DISTINCT can be regarded as a special GROUP BY.
  • Same semantics, without an index: before MySQL 8.0, DISTINCT is more efficient than GROUP BY. Both perform a grouping operation, but GROUP BY also sorts implicitly, which triggers a filesort and slows the SQL down. MySQL 8.0 removed the implicit sort, so from 8.0 on the two are almost equally efficient in this case as well.

In short, starting from MySQL 8.0, the execution efficiency of GROUP BY and DISTINCT is almost the same whether or not there is an index.

However, when deduplicating data at the 1-million-row (100W) level, it is recommended to use GROUP BY first.

So, why should GROUP BY be preferred?

  1. The semantics of GROUP BY are clearer than those of DISTINCT.
  2. Since the DISTINCT keyword applies to all selected columns, GROUP BY is more flexible when handling composite business logic.
  3. GROUP BY can perform more complex processing on each group, such as filtering with HAVING or computing values with aggregate functions.

Therefore, whether it is deduplication at the 1-million-row (100W) level or an ordinary deduplication scenario, it is recommended to use GROUP BY first.
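The extra flexibility of GROUP BY can be seen in a sketch like the one below, which deduplicates on (name, city), reports how many copies each combination has, and then uses HAVING to list only the combinations that are actually duplicated. The student_dup table and its rows are hypothetical.

```sql
-- Hypothetical table containing one duplicated (name, city) pair.
DROP TABLE IF EXISTS student_dup;
CREATE TABLE student_dup (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL,
  city VARCHAR(32) NOT NULL
);
INSERT INTO student_dup (name, city) VALUES
  ('Tom',  'Beijing'),
  ('Tom',  'Beijing'),
  ('Lucy', 'Shanghai');

-- Deduplicate and, in the same pass, keep a representative id and a copy count.
-- DISTINCT alone cannot return these per-group values.
SELECT MIN(id) AS keep_id, name, city, COUNT(*) AS copies
FROM student_dup
GROUP BY name, city;

-- Filter groups with HAVING: only combinations that occur more than once.
-- Returns a single row: ('Tom', 'Beijing', 2).
SELECT name, city, COUNT(*) AS copies
FROM student_dup
GROUP BY name, city
HAVING COUNT(*) > 1;
```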

Final words

This article is included in the "Nien's Java Interview Collection" V73 PDF. Please go to the official account [Technical Freedom Circle] at the end of the article to get it.

If you thoroughly understand "Nien's Java Interview Collection", getting offers from big companies becomes much easier.

With answers like this, the interviewer will be impressed, and the offer will follow.

If you have any questions while studying, you can reach out to Nien, the 40-year-old architect.

Origin blog.csdn.net/crazymakercircle/article/details/131030436