MySQL: cleaning up duplicate data and keeping the latest row

Background

Originally, data was submitted through a form.
Later, an Excel batch import feature was added, but that interface did not check for duplicates and update the existing rows, so a large amount of duplicate data built up in the production environment.
To guarantee that the database holds no duplicate data, you can add a unique index to enforce it.
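As a minimal sketch of that unique index, assuming the table test_table_02 and the four identifying columns used later in this article. The index name uk_product_record and the sample values are made up for illustration. Note that the ALTER TABLE fails with a duplicate-entry error while duplicates still exist, so it has to run after the cleanup.

```sql
-- Hypothetical index name; the columns are the four fields that identify a record.
ALTER TABLE test_table_02
  ADD UNIQUE INDEX uk_product_record (type, name, customer, `year_month`);

-- With the index in place, an import can overwrite instead of duplicating.
-- The values below are placeholders, not data from the original system.
INSERT INTO test_table_02 (id, type, name, customer, `year_month`, created_time)
VALUES (UUID(), 'A', 'Widget', 'ACME', '2023-01', NOW())
ON DUPLICATE KEY UPDATE created_time = VALUES(created_time);
```

The ON DUPLICATE KEY UPDATE clause lists whichever columns the import should refresh; created_time is used here only because it is the one non-key column the article names.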

Reference: MySQL duplicate data check, keeping one row (auto-increment id)

That approach only suits tables with auto-increment ids. My table's id is a UUID, so I need another way to do what min(id) does there.

-- Multi-table (join) delete
delete t_user
from 
t_user,
(
	SELECT
	   min(id) id,
	   username
	  FROM
	   t_user
	  GROUP BY
	   username
	  HAVING
	   count(*) > 1
) t2
where t_user.username = t2.username
-- Delete the duplicates whose id is greater than the group's minimum id, i.e. keep the row with the smallest id
and t_user.id > t2.id

Adaptation: the primary key is a UUID

Uniqueness: four fields identify a record: type (product type), name (product name), customer (customer), and year_month.
created_time (creation time) is used for sorting, to tell old rows from new ones.

Simple version

Each run deletes one row from every group of duplicates, so it has to be executed several times. This suits small amounts of data.

-- Each run picks the earliest id of every duplicate group
-- Run it several times
DELETE from test_table_02 where id in (
	SELECT SUBSTRING_INDEX(ids, ',', 1) id from (
		select GROUP_CONCAT(id ORDER BY created_time) ids, type, name, customer, `year_month`, count(1) c 
		from test_table_02 
		GROUP BY type, name, customer, `year_month`
		having c > 1
	) aaaa
)

GROUP_CONCAT(id ORDER BY created_time) concatenates the ids of each group, sorted by creation time, so the first id is the earliest. With descending order (ORDER BY created_time DESC), the first id is the newest.
SUBSTRING_INDEX(ids, ',', 1) extracts the first id.
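Since the simple version has to be repeated, it helps to know when to stop. A small check query, sketched here with the same table and grouping columns as above, lists the groups that still contain duplicates. When it returns no rows, the cleanup is done.

```sql
-- Groups that still have more than one row; an empty result means no duplicates remain.
SELECT type, name, customer, `year_month`, COUNT(*) AS c
FROM test_table_02
GROUP BY type, name, customer, `year_month`
HAVING c > 1;
```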

Extension: two ways to extract the first id

Method 1: Use the SUBSTRING_INDEX() function

set @ids = '12,3,4,5,6';

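-- SUBSTRING_INDEX() returns the substring [before] the given number of occurrences of the delimiter
-- SUBSTRING_INDEX(str, delim, count)
-- Starting from the beginning, take everything up to the count-th delimiter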
select SUBSTRING_INDEX(@ids, ',', 1); -- 12
select SUBSTRING_INDEX(@ids, ',', 2); -- 12,3

Method 2: Combine the INSTR() and LEFT() functions

set @ids = '12,3,4,5,6';

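-- INSTR() returns the position of the [first] occurrence of a substring in a string. It is case-insensitive unless an argument is a binary string.
-- Unlike Java, positions start at 1, so the result equals the length of the string up to and including that character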
select INSTR(@ids,','); -- 3

-- LEFT() returns the given number of characters from the left side of a string.
-- LEFT(string, length)
-- Starting from the beginning, take the given length
select left(@ids, INSTR(@ids,',')-1); -- 12

Enhanced version: delete all at once

Note that GROUP_CONCAT has a length limit (group_concat_max_len). A longer result is truncated, so some duplicates may not be deleted.
- Solve the default length limit of group_concat
- 28_mysql's group_concat function sets the maximum length
```
-- Show the current value of group_concat_max_len (default 1024)
show variables like 'group_concat_max_len';
-- Raise it temporarily, for the current connection only
SET SESSION group_concat_max_len=102400;
```
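If raising group_concat_max_len is not an option, one alternative that avoids GROUP_CONCAT entirely is a self-join delete. This is a different technique from the one used in this article, shown only as a sketch. It assumes the four grouping columns and created_time are never NULL, because `=` does not match NULL values the way GROUP BY groups them.

```sql
-- Delete every row for which a newer row with the same four key fields exists.
-- Ties on created_time are broken by id, so exactly one row per group survives.
DELETE a
FROM test_table_02 a
INNER JOIN test_table_02 b
   ON a.type = b.type
  AND a.name = b.name
  AND a.customer = b.customer
  AND a.`year_month` = b.`year_month`
  AND (a.created_time < b.created_time
       OR (a.created_time = b.created_time AND a.id < b.id));
```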

select a.id from test_table_02 a
INNER JOIN (
	select ids from
		(
			-- Keep the first id: take the 2nd through the last id
			select SUBSTR(ids,INSTR(ids,',')+1) ids from (
			-- Keep the last id: take the 1st through the second-to-last id
			-- select left(ids, (CHAR_LENGTH(ids)-INSTR(reverse(ids),',') ) ) ids from (
					select GROUP_CONCAT(id ORDER BY created_time) ids, type, name, customer, `year_month`, count(1) c 
					from test_table_02 
					GROUP BY type, name, customer, `year_month`
					having c > 1
				) aaaa
		) bbb
	) b on FIND_IN_SET(a.id, b.ids)

SUBSTR(ids, INSTR(ids, ',') + 1) keeps the first id: it extracts the second through the last id.
left(ids, (CHAR_LENGTH(ids) - INSTR(reverse(ids), ','))) keeps the last id: it extracts the first through the second-to-last id.
- reverse(ids) reverses the string
- INSTR(reverse(ids), ',') gives the length of the last id plus its comma
- CHAR_LENGTH(ids) gives the length in characters
- CHAR_LENGTH(ids) - INSTR(reverse(ids), ',') is the total length minus the length of the last id and its comma, i.e. the length of the first through the second-to-last id
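The query above only selects the ids to remove. A hedged sketch of the corresponding one-pass delete that keeps the latest row could look like the following. It sorts each group newest-first and drops the first id from the list, so only the older rows are matched. The alias names are arbitrary, and the session setting assumes 102400 bytes is enough for the largest group.

```sql
-- Avoid truncated id lists for large groups (current connection only).
SET SESSION group_concat_max_len = 102400;

DELETE a
FROM test_table_02 a
INNER JOIN (
    -- Drop the first (newest) id from each list; what remains are the rows to delete.
    SELECT SUBSTR(ids, INSTR(ids, ',') + 1) AS ids
    FROM (
        SELECT GROUP_CONCAT(id ORDER BY created_time DESC) AS ids, COUNT(*) AS c
        FROM test_table_02
        GROUP BY type, name, customer, `year_month`
        HAVING c > 1
    ) grouped
) b ON FIND_IN_SET(a.id, b.ids);
```

FIND_IN_SET cannot use an index on id, so this join scans the table once per duplicate group. Running the SELECT form first to review the matched ids is a reasonable precaution.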

Extension

When computing lengths, note the difference between length() and char_length()

The difference and usage of length() and char_length() in MySQL
char_length(str) counts characters
- A Chinese character, a digit, and a letter each count as one character
length(str) counts bytes
- utf8 encoding: three bytes for a Chinese character, one byte for a digit or letter.
- gbk encoding: two bytes for a Chinese character, one byte for a digit or letter.

If the string contains no multibyte characters such as Chinese, LENGTH() and CHAR_LENGTH() return the same result

-- 6: LENGTH() counts bytes (utf8, 3 bytes per Chinese character)
select LENGTH('你好');
-- 2: CHAR_LENGTH() counts characters
select CHAR_LENGTH('你好');

Keep the first one: use the SUBSTR() function

set @ids = '12,3,4,5,6';

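-- SUBSTR() returns a substring of the [given length], [starting at the given position].
-- It takes 2 or 3 arguments; without the third argument (len), the substring runs to the end of the string
-- SUBSTR(str, pos)
-- SUBSTR(str, pos, len)
-- Take everything from the 2nd id to the end, i.e. keep the first id out of the list
select SUBSTR(@ids, INSTR(@ids,',') + 1); -- 3,4,5,6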

Keep the last one: use the LEFT() function

set @ids = '12,3,4,5,6';

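-- Take everything from the start through the second-to-last id, i.e. keep the last id out of the list
-- REVERSE() returns the reversed string
-- CHAR_LENGTH() returns the length of the string in characters; LENGTH() returns its length in bytes
-- In UTF-8, one Chinese character takes 3 bytes
select left(@ids, (CHAR_LENGTH(@ids)-INSTR(reverse(@ids),','))); -- 12,3,4,5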
