Several methods of removing duplicate data in SQL, I will tell you all at once

198268598ef7e4ed727be680e98ddbe0.png

When using SQL to extract and analyze data, we often encounter scenarios where data is duplicated, and we need to analyze the data after deduplication.

Taking the sales report of an e-commerce company as an example, we use the distinct or group by statement as a common deduplication method. Today, we introduce a new method that uses window functions to deduplicate data.

798ec981fc6d173fe1a8126894419261.jpeg

[Field Explanation]

Visitor id: Customers who enter the store to browse baby

Browsing time: the date when the visitor enters the store browsing page

Browsing time: the length of time visitors enter the store to browse the page

Now you need to know each visitor in the store and the corresponding browsing date (every visitor browsing multiple times on the same day counts as one record)

【Problem solving ideas】

Method 1: distinct

SQL is written as follows:

select distinct 访客id ,浏览时间 
     from 淘宝日销售数据表;

search result:

dab47d2366d2372c7c2c0535bca5a709.png

When using the distinct statement to deduplicate multiple fields, you need to pay special attention to two points:

1) The distinct syntax stipulates that for single-field and multi-field deduplication, it must be placed before the first query field.

2) If multiple fields in the table are deduplicated, the deduplication process is to deduplicate multiple fields as a whole. For example, in the above example, we deduplicate the visitor id and browsing time as a whole, instead of visitor id alone After deduplication, the name will be deduplicated separately, so the same visitor ID will correspond to different browsing times.

Method 2: group by

SQL is written as follows:

select 访客id ,浏览时间
     from 淘宝日销售数据表
group by 访客id ,浏览时间;

search result:

3312a59357df9cad3dece2a20e4e6682.png

group by groups the visitor id and browsing time. After grouping and summarizing, the number of rows in the table is changed. There is only one category in a row. After using group by, the visitor id and browsing time will be kept as a category, and the duplicates will not be displayed.

Method 3: Window Functions

When using the window function to deduplicate, it is slightly more complicated than distinct and group by. The window function does not reduce the number of rows in the original table, but sorts the fields after grouping. Detailed explanation of window functions (please click - easy-to-understand learning: SQL window functions )

The basic syntax of a window function is as follows:

<窗口函数> over (partition by <用于分组的列名>
                order by <用于排序的列名>)

According to the requirements of the topic, each visitor and the corresponding browsing date are obtained. We group the visitor id, browsing time, and sort the browsing time (seconds).

SQL is written as follows:

select 访客id ,浏览时间 ,row_number()over(partition by 访客id ,浏览时间
order by 浏览时长(秒)) as 排名
     from 淘宝日销售数据表;

search result:

81f1c47097b17b28d2fa988c56567402.png

The window function query is grouped by each customer and browsing date. If there are several browsing on the same day, they will be sorted according to the number of likes, and the filter ranking is 1, and each visitor and the corresponding browsing date can be obtained.

SQL is written as follows:

select 访客id ,浏览时间 ,row_number()over(partition by 访客id ,浏览时间
order by 浏览时长(秒)) as 排名
     from 淘宝日销售数据表;

search result:

adb9b845b8bd0f5f50c70048cd894f00.png

Have you got the three operations for removing duplicates? Welcome to add your deduplication method in the comment area~

e3994ae973abac8c1446efcd02d19e2c.jpeg

 ⬇️Click "Read the original text"

 Sign up for free Data analysis training camp

Guess you like

Origin blog.csdn.net/zhongyangzhong/article/details/130143644