Oracle: Efficiently Deleting Duplicate Data from a 100-Million-Row Table While Keeping One Copy

I. Server Information

1. Memory

[oracle@xmldb ~]$ free -g
             total       used       free     shared    buffers     cached
Mem:           125         92         33          0          0         59
-/+ buffers/cache:         32         93
Swap:           80          0         80

2. CPU

[oracle@xmldb ~]$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores	: 8
[oracle@yundingora ~]$ cat /proc/cpuinfo| grep "processor"| wc -l
32

3. I/O

Server I/O

[oracle@xmldb ~]$ dd if=/home/oracle/linuxx64_12201_database.zip of=/home/oracle/linuxx64_12201_database.zip.dd
6745501+1 records in
6745501+1 records out
3453696911 bytes (3.5 GB) copied, 25.2508 s, 137 MB/s

Database I/O

The data files sit on a separate disk array; those figures will be added later.

II. Database Information

1. Database memory parameters

SQL> show parameter ga;

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
allow_group_access_to_sga	     boolean	 FALSE
lock_sga			     boolean	 FALSE
pga_aggregate_target		     big integer 0
sga_max_size			     big integer 1088M
sga_target			     big integer 0
unified_audit_sga_queue_size	     integer	 1048576
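
Both sga_target and pga_aggregate_target are 0 while sga_max_size is only 1088M, which usually means memory is being sized some other way (for example Automatic Memory Management via memory_target). A quick check, assuming access to the v$ views, might look like this:

-- Show the memory_target settings and how memory is actually distributed.
show parameter memory_target

select component, round(current_size/1024/1024) as current_mb
  from v$memory_dynamic_components
 where current_size > 0
 order by current_size desc;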

III. Table Information

1. Basic table information

select a.table_name,a.partitioned,a.degree,b.num_cols,a.num_rows,round(a.blocks*8/1024,2) as size_m,a.logging,a.last_analyzed
  from all_tables a,(select table_name, count(*) as num_cols from user_tab_columns group by table_name) b 
where a.table_name='TB_DELETE_TEST' 
  and a.table_name=b.table_name; 

TABLE_NAME     PAR   DEGREE	NUM_COLS  NUM_ROWS    SIZE_M   LOG   LAST_ANAL
-------------- ----- ---------- ---------- ----------- -------- ----- -----
TB_DELETE_TEST NO	 1	11        107946703   11011.56 YES   21-APR-20

Elapsed: 00:00:00.79

-- The table has no indexes.
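
This can be confirmed from the data dictionary; a minimal check, assuming the table belongs to the current schema:

-- Returns no rows if TB_DELETE_TEST has no indexes.
select index_name, index_type, uniqueness
  from user_indexes
 where table_name = 'TB_DELETE_TEST';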


2. Duplicate record counts

2.1 With few duplicate records

FI_QRY@orcl>select count(*) as distinct_2_cols_cnts from (select /*+parallel(30)*/ distinct acc,med_no,med_op_date from TB_DELETE_TEST);

DISTINCT_2_COLS_CNTS
--------------------
	   107946694

Elapsed: 00:00:29.56

select 107946703-107946694 as repeat_cnts from dual;

REPEAT_CNTS
-----------
   9

Elapsed: 00:00:00.00
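
Besides subtracting the two counts, the duplicated keys themselves can be listed with a GROUP BY ... HAVING query; a sketch on the same key columns:

-- List duplicated (acc, med_no, med_op_date) keys and how many copies each has.
select /*+ parallel(30) */ acc, med_no, med_op_date, count(*) as cnt
  from TB_DELETE_TEST
 group by acc, med_no, med_op_date
having count(*) > 1
 order by cnt desc;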

2.2 With many duplicate records

FI_QRY@orcl>select count(*) as distinct_2_cols_cnts from (select /*+parallel(30)*/ distinct ACC,PAPER_NO from TB_DELETE_TEST);

DISTINCT_2_COLS_CNTS
--------------------
	    94681760

Elapsed: 00:00:30.88

-- Number of duplicate rows

FI_QRY@orcl>select 107946703-94681760 as repeat_cnts from dual;

REPEAT_CNTS
-----------
   13264943

Elapsed: 00:00:00.00
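
The same total can also be computed in a single pass rather than by subtracting two counts; a sketch on the same key columns:

-- For each duplicated (acc, paper_no) key, every copy beyond the first is a surplus row.
select /*+ parallel(30) */ sum(cnt - 1) as repeat_cnts
  from (select acc, paper_no, count(*) as cnt
          from TB_DELETE_TEST
         group by acc, paper_no
        having count(*) > 1);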

IV. Efficient Deduplication

1. With few duplicate records, a DML statement is enough

(note the hints used throughout)

FI_QRY@orcl>delete /*+RULE parallel(8)*/ from TB_DELETE_TEST a
  where exists (select /*+parallel(8)*/ 1
                  from (select /*+parallel(30)*/ rowid rid, row_number() over (partition by acc, med_no, med_op_date order by rowid) rn from TB_DELETE_TEST) b
                 where b.rn <> 1 and a.rowid = b.rid);

9 rows deleted.

Elapsed: 00:03:28.89
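
A classic alternative form of the keep-one DELETE is the min(rowid) pattern (the form discussed in the AskTom thread linked at the end). A sketch on the same key columns, offered as an alternative rather than a replacement for the statement above:

-- Keep the row with the smallest ROWID per key and delete the rest.
-- Note: the DELETE itself only runs in parallel after
-- ALTER SESSION ENABLE PARALLEL DML; otherwise only the query part is parallel.
delete /*+ parallel(8) */ from TB_DELETE_TEST a
 where a.rowid not in (select min(b.rowid)
                         from TB_DELETE_TEST b
                        group by b.acc, b.med_no, b.med_op_date);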

2. With many duplicate records, a DDL-based approach is recommended

(note the hints used throughout)

FI_QRY@orcl>create /*+parallel(30)*/ table TB_DELETE_TEST_NEW as select /*+parallel(30)*/ DISTINCT * from TB_DELETE_TEST;

Table created.

Elapsed: 00:01:26.46
FI_QRY@orcl>rename TB_DELETE_TEST to TB_DELETE_TEST_OLD;

Table renamed.

Elapsed: 00:00:00.99
FI_QRY@orcl>rename TB_DELETE_TEST_NEW to TB_DELETE_TEST;

Table renamed.

Elapsed: 00:00:00.02
FI_QRY@orcl>drop table TB_DELETE_TEST_OLD purge;

Table dropped.

Elapsed: 00:00:00.56
FI_QRY@orcl>
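
Two notes on the CTAS swap: objects that CREATE TABLE AS SELECT does not carry over (indexes, constraints, triggers, grants, comments) must be recreated on the new table, and a hint placed directly after CREATE is not a recognized hint position, so in the statement above it is the hint inside the SELECT that takes effect. A more explicit variant, assuming NOLOGGING is acceptable for this load (less redo, but the new data is not recoverable from archived logs until the next backup):

-- Run the CTAS load in parallel and without redo, then reset the defaults.
create table TB_DELETE_TEST_NEW parallel 30 nologging
  as select /*+ parallel(30) */ distinct * from TB_DELETE_TEST;

alter table TB_DELETE_TEST_NEW noparallel;
alter table TB_DELETE_TEST_NEW logging;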

By contrast, using the DELETE approach from 1 above on this data takes considerably longer:

FI_QRY@orcl>delete /*+RULE parallel(8)*/ from TB_DELETE_TEST1 a
  where exists (select /*+parallel(8)*/ 1
                  from (select /*+parallel(30)*/ rowid rid, row_number() over (partition by acc, paper_no order by rowid) rn from TB_DELETE_TEST1) b
                 where b.rn <> 1 and a.rowid = b.rid);

13264943 rows deleted.

Elapsed: 03:01:20.54

Other options: for example, partition the table and have a program delete the duplicates partition by partition (see the sketch below).
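
One possible shape for that per-partition approach, purely as a sketch and only valid when duplicates cannot span partitions (for example, when the table is partitioned on one of the deduplication key columns), is a PL/SQL loop over the dictionary:

-- Deduplicate one partition at a time to keep per-statement undo/temp usage small.
-- Assumes TB_DELETE_TEST has already been partitioned.
begin
  for p in (select partition_name
              from user_tab_partitions
             where table_name = 'TB_DELETE_TEST'
             order by partition_position)
  loop
    execute immediate
      'delete from TB_DELETE_TEST partition (' || p.partition_name || ') a ' ||
      'where a.rowid not in (select min(b.rowid) ' ||
      'from TB_DELETE_TEST partition (' || p.partition_name || ') b ' ||
      'group by b.acc, b.paper_no)';
    commit;
  end loop;
end;
/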

In summary: when there are few duplicates, the DELETE approach above removes them quickly while keeping one copy of each; when there are many duplicates, the DDL (CTAS) approach is recommended.

In addition, for tables below roughly ten million rows with duplicates in the millions, the DELETE approach above can still finish deduplication within a few minutes; see:

https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:15258974323143

Reposted from blog.csdn.net/lanxuxml/article/details/105650694