A quick trick for cleaning up duplicate data in MySQL [repost]

I recently helped a customer migrate from the Dameng database to MySQL, so let me talk briefly about handling duplicate data.

The data stored in a database falls into three types:

1. Data that has been filtered in the strict sense. The filtering happens at the data source: the application filters before writing, or the database side filters through check constraints on table fields, triggers, stored procedure calls, and so on.

2. Raw data kept without any processing. For example, abnormal data produced by buggy application code, retained because the database side applies no filtering rules. This yields a stream of junk data, which of course includes the duplicate data I want to talk about today.

3. Finally, duplicate data that can arise while SQL statements execute. For example, joining two tables can produce a series of NULLs; a sketch of this follows.
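A minimal sketch of that last case, using two hypothetical tables t1 and t2:

    -- rows of t1 with no match in t2 come back NULL-padded
    select a.id, b.id as matched_id
    from t1 a left join t2 b on a.id = b.id;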

Today's topic covers only the duplicates already present in the raw data, not the duplicates generated during SQL execution. Let's walk through a few classic scenarios.

Scenario one: rows that are complete duplicates. This is actually the simplest deduplication case.

Table d1 has no primary key:

    mysql-(ytt/3305)->show create table d1\G
    *************************** 1. row ***************************
    Table: d1
    Create Table: CREATE TABLE `d1` (
    `r1` int(11) DEFAULT NULL,
    `r2` int(11) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
    1 row in set (0.00 sec)

 

The table holds four million rows in total:

    mysql-(ytt/3305)->select count(*) from d1 limit 2;   
    +----------+
    | count(*) |
    +----------+
    |  4000000 |
    +----------+
    1 row in set (0.18 sec)

 

You can see that fully three-quarters of the rows are duplicates.

For example, the row (1,1) appears four times:

    mysql-(ytt/3305)->select * from d1 order by r1,r2 limit 5;
    +------+------+
    | r1   | r2   |
    +------+------+
    |    1 |    1 |
    |    1 |    1 |
    |    1 |    1 |
    |    1 |    1 |
    |    2 |    2 |
    +------+------+
    5 rows in set (1.65 sec)

 

Deduplicating this is simple. Do it either at the database level, or outside the database by exporting the data, filtering it, and loading it back in. At the database level it is nothing more than cloning a new table, filtering the distinct rows into it, renaming it, and dropping the original table. The steps are not complicated; for example:

    mysql-(ytt/3305)->create table d2 like d1;
    Query OK, 0 rows affected (0.01 sec)

 

The time-consuming step is inserting the deduplicated rows into the new table:

    mysql-(ytt/3305)->insert into d2 select distinct r1,r2 from d1;
    Query OK, 1000000 rows affected (19.40 sec)
    Records: 1000000  Duplicates: 0  Warnings: 0

    mysql-(ytt/3305)->alter table d1 rename to d1_bak;
    Query OK, 0 rows affected (0.00 sec)

    mysql-(ytt/3305)->alter table d2 rename to d1;
    Query OK, 0 rows affected (0.00 sec)

    mysql-(ytt/3305)->drop table d1_bak;
    Query OK, 0 rows affected (0.00 sec)
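As an aside, the clone-and-insert steps could also be collapsed into a single statement; a minimal sketch of that variant (mine, not the article's):

    -- build the deduplicated copy in one statement; d1 has no
    -- indexes or constraints, so nothing is lost by not cloning them
    create table d2 as select distinct r1, r2 from d1;

Adding a composite unique key on (r1, r2) afterwards would stop the duplicates from coming back.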

 

The whole database-level pass took about 20 seconds. Now look at deduplicating at the OS level. First, export the data:

    mysql-(ytt/3305)->select * from d1 into outfile '/var/lib/mysql-files/d1.txt';
    Query OK, 4000000 rows affected (1.84 sec)

 

Deduplicate at the OS level with the native tools sort and uniq:

    root@ytt-pc:/var/lib/mysql-files# time cat d1.txt | sort -g | uniq > d1_uniq.txt

    real    0m7.345s
    user    0m7.528s
    sys     0m0.272s
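Incidentally, the cat is unnecessary, and since row order doesn't matter for reloading, whole-line sort -u does the same job in one command; a sketch:

    # -u keeps only one copy of each identical line
    time sort -u d1.txt > d1_uniq.txt

Avoid combining -u with a numeric key such as -g, though: -u compares only the sort key, so distinct lines whose leading numbers compare equal would be merged.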

 

Now load it back into the original table. Empty the table and swap in the deduplicated file:

    mysql-(ytt/3305)->truncate table d1;
    Query OK, 0 rows affected (0.05 sec)

    root@ytt-pc:/var/lib/mysql-files# mv d1_uniq.txt d1.txt

 

Import the processed data straight into the database. Note that mysqlimport derives the target table from the file name, so d1.txt loads into table d1:

    root@ytt-pc:/home/ytt/scripts# time mysqlimport -uytt -pytt -P3305 -h 127.0.0.1 --use-threads=2 -vvv ytt /var/lib/mysql-files/d1.txt
    mysqlimport: [Warning] Using a password on the command line interface can be insecure.
    Connecting to 127.0.0.1
    Selecting database ytt
    Loading data from SERVER file: /var/lib/mysql-files/d1.txt into d1
    ytt.d1: Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0
    Disconnecting from 127.0.0.1

     
    real    0m3.272s
    user    0m0.012s
    sys     0m0.008s

 

Check that the deduplicated records look right:

    mysql-(ytt/3305)->select * from d1 where 1 order by r1,r2 limit 2;
    +------+------+
    | r1   | r2   |
    +------+------+
    |    1 |    1 |
    |    2 |    2 |
    +------+------+
    2 rows in set (0.40 sec)

 

The OS-level route, covering data export, deduplication, and import, is somewhat more efficient: roughly half the time of the database-layer approach.

Scenario two is in fact similar to the first, except that the table has a primary key while the other field values are duplicated.

For example, table d4 is exactly the same as before apart from the added primary key. Its rows look like this:

    mysql-(ytt/3305)->select * from d4 order by r1,r2 limit 5;
    +---------+------+------+
    | id      | r1   | r2   |
    +---------+------+------+
    |       1 |    1 |    1 |
    | 3000001 |    1 |    1 |
    | 2000001 |    1 |    1 |
    | 1000001 |    1 |    1 |
    |       2 |    2 |    2 |
    +---------+------+------+
    5 rows in set (1.08 sec)
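The article doesn't show d4's definition; presumably it is something along these lines (my reconstruction from the output, not the original DDL):

    CREATE TABLE `d4` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `r1` int(11) DEFAULT NULL,
      `r2` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;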

 

How to deduplicate here generally has to be discussed with the business side. Suppose the rule is to keep the duplicate with the largest primary key, which in the example above is the row with id 3000001. A single SQL statement gets it done:

    mysql-(ytt/3305)->delete a from d4 a left join (select max(id) id from d4 group by r1, r2) b using(id) where b.id is null;    
    Query OK, 3000000 rows affected (23.29 sec)
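For reference, on MySQL 8.0 the same keep-the-largest-id rule can also be written with a window function; a sketch of my own, not from the article:

    -- rank the rows in each (r1, r2) group with the largest id first,
    -- then delete everything that isn't ranked first
    delete from d4 where id in (
        select id from (
            select id,
                   row_number() over (partition by r1, r2 order by id desc) rn
            from d4
        ) t
        where t.rn > 1
    );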

 

That removed 3 million duplicate rows, leaving the remaining quarter, one million normal rows:

    mysql-(ytt/3305)->select count(*) from d4;
    +----------+
    | count(*) |
    +----------+
    |  1000000 |
    +----------+
    1 row in set (0.06 sec)

 

Check the effect: the row with the largest id is kept, the others are gone.

    mysql-(ytt/3305)->select * from d4 order by r1,r2 limit 5;
    +---------+------+------+
    | id      | r1   | r2   |
    +---------+------+------+
    | 3000001 |    1 |    1 |
    | 3000002 |    2 |    2 |
    | 3000003 |    3 |    3 |
    | 3000004 |    4 |    4 |
    | 3000005 |    5 |    5 |
    +---------+------+------+
    5 rows in set (0.25 sec)

 

Scenario three is unlike the first two: the duplication shows up inside field values as extra characters, such as spaces and stray line breaks. Let's look at a few examples. 1. Removing leading and trailing whitespace from field values is the simplest case: MySQL has a built-in function for this, and one basic SQL statement does it.

    Table y11 holds about 5 million rows of sample data:
    mysql-(ytt/3305)->select count(*) from y11;
    +----------+
    | count(*) |
    +----------+
    |  5242880 |
    +----------+
    1 row in set (0.30 sec)
    
    Use the trim function:
    mysql-(ytt/3305)->update y11 set r1 = trim(r1), r2 = trim(r2);
    Query OK, 5242880 rows affected (2 min 1.56 sec)
    Rows matched: 5242880 Changed: 5242880 Warnings: 0     

    mysql-(ytt/3305)->select * from y11 limit 5;
    +----+------------------------+------------------------+
    | id | r1                     | r2                     |
    +----+------------------------+------------------------+
    |  1 | sql server             | sql server             |
    |  2 | sql server             | sql server             |
    |  3 | sql server             | sql server             |
    |  6 | db2 mysql oracle mysql | db2 mysql oracle mysql |
    |  7 | db2 mysql oracle mysql | db2 mysql oracle mysql |
    +----+------------------------+------------------------+
    5 rows in set (0.00 sec)
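One caveat about trim: by default it strips only spaces, not tabs or newlines. Those need the explicit remstr form, one character sequence per call; a hypothetical example:

    -- strip trailing newline characters instead of spaces
    update y11 set r1 = trim(trailing '\n' from r1);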

 

2. Removing whitespace characters (spaces, line breaks, tabs, and so on) from between words, in scenarios where whitespace appears before, after, and in the middle of words.

Still table y11. As the output shows, the assorted line breaks and spaces keep the rows from displaying properly:

    mysql-(ytt/3305)->select * from y11 limit 5;
    +----+-----------------------------------------------------+------------------------------------------------------+
    | id | r1                                                  | r2                                                   |
    +----+-----------------------------------------------------+------------------------------------------------------+
    | 1 | sql server | sql server |
    | 2 | sql server | sql server |
    server | sql server |
    | mysql | db2 mysql oracle
    | 7 | db2 mysql oracle mysql | db2 mysql oracle mysql
    +----+-----------------------------------------------------+------------------------------------------------------+
    5 rows in set (0.00 sec)

 

The first method that comes to mind is to export the data to a text file, run it through the usual tools on Linux, and load it back in. For example:

    mysql-(ytt/3305)->select * from y11 into outfile '/var/lib/mysql-files/y11.txt' fields terminated by ',' enclosed by '"';
    Query OK, 5242880 rows affected (3.54 sec)

     
    mysql-(ytt/3305)->truncate y11;
    Query OK, 0 rows affected (0.23 sec)

 

Then process it with sed, replacing every run of whitespace with a single space:

    root@ytt-pc:/var/lib/mysql-files# time sed -i 's/\s\+/ /g' y11.txt

    real    0m27.476s
    user    0m20.105s
    sys     0m7.233s

 

Load it back into table y11:

    mysql-(ytt/3305)->load data infile '/var/lib/mysql-files/y11.txt' into table y11 fields terminated by ',' enclosed by '"';
    Query OK, 5242880 rows affected (30.25 sec)
    Records: 5242880 Deleted: 0 Skipped: 0 Warnings: 0

 

That achieves the goal, but the process is cumbersome. Let's see whether it can be solved at the MySQL layer instead.

It can: MySQL's regular-expression replace function regexp_replace (available since MySQL 8.0) substitutes a single space for the extra characters directly, again in one simple SQL statement:

    mysql-(ytt/3305)->update y11 set r1 = regexp_replace(r1,'[[:space:]]+',' '), r2 = regexp_replace(r2,'[[:space:]]+',' ');
    Query OK, 4194304 rows affected (1 min 32.05 sec)
    Rows matched: 5242880 Changed: 4194304 Warnings: 0

 

It just takes a little longer to run, which is not a big deal:

    mysql-(ytt/3305)->select * from y11 limit 5;
    +----+------------------------+-------------------------+
    | id | r1                     | r2                      |
    +----+------------------------+-------------------------+
    |  1 | sql server             | sql server              |
    |  2 | sql server             | sql server              |
    |  3 | sql server             | sql server              |
    |  6 | db2 mysql oracle mysql | db2 mysql oracle mysql  |
    |  7 | db2 mysql oracle mysql | db2 mysql oracle mysql  |
    +----+------------------------+-------------------------+
    5 rows in set (0.00 sec)
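If the leading and trailing whitespace should disappear entirely rather than collapse to a single space, the two functions combine naturally; a sketch along the same lines:

    -- collapse internal whitespace runs to one space, then strip the ends
    update y11
       set r1 = trim(regexp_replace(r1, '[[:space:]]+', ' ')),
           r2 = trim(regexp_replace(r2, '[[:space:]]+', ' '));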

 

Deduplication scenarios inevitably come up in day-to-day data processing; I hope this piece helps.

Reposted from:

https://mp.weixin.qq.com/s/RsFC5IwzdLH38t1znAyoXQ


Origin www.cnblogs.com/paul8339/p/12295597.html