Efficiently deleting duplicate data from a billion-row Oracle table while keeping one copy

One, server information

1. Memory

[oracle@xmldb ~]$ free -g
             total       used       free     shared    buffers     cached
Mem:           125         92         33          0          0         59
-/+ buffers/cache:         32         93
Swap:           80          0         80

2.CPU

[oracle@xmldb ~]$ cat /proc/cpuinfo| grep "cpu cores"| uniq
cpu cores	: 8
[oracle@yundingora ~]$ cat /proc/cpuinfo| grep "processor"| wc -l
32

3.IO

Server IO

[oracle@xmldb ~]$ dd if=/home/oracle/linuxx64_12201_database.zip of=/home/oracle/linuxx64_12201_database.zip.dd
6745501+1 records in
6745501+1 records out
3453696911 bytes (3.5 GB) copied, 25.2508 s, 137 MB/s

Database IO

-- The data files reside on a separate disk array; figures to be added later.

Two, database information

1. Database memory information

SQL> show parameter ga;

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
allow_group_access_to_sga	     boolean	 FALSE
lock_sga			     boolean	 FALSE
pga_aggregate_target		     big integer 0
sga_max_size			     big integer 1088M
sga_target			     big integer 0
unified_audit_sga_queue_size	     integer	 1048576

Three, table information

1. Table statistics

select a.table_name,a.partitioned,a.degree,b.num_cols,a.num_rows,round(a.blocks*8/1024,2) as size_m,a.logging,a.last_analyzed
  from all_tables a,(select table_name, count(*) as num_cols from user_tab_columns group by table_name) b 
where a.table_name='TB_DELETE_TEST' 
  and a.table_name=b.table_name; 

TABLE_NAME     PAR   DEGREE	NUM_COLS  NUM_ROWS    SIZE_M   LOG   LAST_ANAL
-------------- ----- ---------- ---------- ----------- -------- ----- -----
TB_DELETE_TEST NO	 1	11        107946703   11011.56 YES   21-APR-20

Elapsed: 00:00:00.79

-- The table has no index.


2.1 With a small number of duplicate records

FI_QRY@orcl>select count(*) as distinct_3_cols_cnts from (select /*+parallel(30)*/ distinct acc,med_no,med_op_date from TB_DELETE_TEST);

DISTINCT_3_COLS_CNTS
--------------------
	   107946694

Elapsed: 00:00:29.56

select 107946703-107946694 as repeat_cnts from dual;

REPEAT_CNTS
-----------
   9

Elapsed: 00:00:00.00
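
As a cross-check (not part of the original post), the duplicate keys themselves can be listed with a GROUP BY ... HAVING query; the key columns follow the example above:

```sql
-- List each duplicate key and how many times it occurs
-- (any count > 1 means that key has duplicates to remove)
select /*+parallel(30)*/ acc, med_no, med_op_date, count(*) as cnt
  from TB_DELETE_TEST
 group by acc, med_no, med_op_date
having count(*) > 1
 order by cnt desc;
```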

2.2 With a large number of duplicate records

FI_QRY@orcl>select count(*) as distinct_2_cols_cnts from (select /*+parallel(30)*/ distinct ACC,PAPER_NO from TB_DELETE_TEST);

DISTINCT_2_COLS_CNTS
--------------------
	    94681760

Elapsed: 00:00:30.88

-- Number of duplicate records

FI_QRY@orcl>select 107946703-94681760 as repeat_cnts from dual;

REPEAT_CNTS
-----------
   13264943

Elapsed: 00:00:00.00

Four, efficient deduplication

3.1 With a small number of duplicate records, deduplicate with a DML (DELETE) statement

(Note the hints throughout.)

FI_QRY@orcl>delete /*+RULE parallel(8)*/ from TB_DELETE_TEST a
  where exists  (select /*+parallel(8)*/ 1
                    from ( select /*+parallel(30)*/ rowid rid,row_number() over (partition by acc,med_no,med_op_date order by rowid) rn from TB_DELETE_TEST) b
                     where b.rn <> 1 and a.rowid=b.rid);

9 rows deleted.

Elapsed: 00:03:28.89
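
A common alternative form of the same DELETE (a sketch, not benchmarked on this table) keeps the row with the smallest ROWID in each key group via NOT IN over an aggregate; on large tables it generally wants a hash-based plan rather than the RULE hint:

```sql
-- Keep one row (smallest ROWID) per (acc, med_no, med_op_date) group,
-- delete the rest. Same result as the ROW_NUMBER() form above.
delete /*+parallel(8)*/ from TB_DELETE_TEST
 where rowid not in (select min(rowid)
                       from TB_DELETE_TEST
                      group by acc, med_no, med_op_date);
```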

3.2 With a large number of duplicate records, a DDL (CTAS) approach is recommended

(Note the hints throughout.)

FI_QRY@orcl>create table TB_DELETE_TEST_NEW parallel 30 as select /*+parallel(30)*/ DISTINCT * from TB_DELETE_TEST;

Table created.

Elapsed: 00:01:26.46
FI_QRY@orcl>rename TB_DELETE_TEST to TB_DELETE_TEST_OLD;

Table renamed.

Elapsed: 00:00:00.99
FI_QRY@orcl>rename TB_DELETE_TEST_NEW to TB_DELETE_TEST;

Table renamed.

Elapsed: 00:00:00.02
FI_QRY@orcl>drop table TB_DELETE_TEST_OLD purge;

Table dropped.

Elapsed: 00:00:00.56
FI_QRY@orcl>
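
One caveat to pair with the CTAS-and-rename approach (the table here had no index, so the original steps could skip it): CREATE TABLE AS SELECT does not carry over indexes, constraints, triggers, or grants, so they must be re-created on the new table after the rename. A sketch, with hypothetical object and user names:

```sql
-- Re-create dependent objects after the rename (names are hypothetical)
create index idx_tb_delete_test_acc on TB_DELETE_TEST (acc, med_no) parallel 8;
alter index idx_tb_delete_test_acc noparallel;   -- reset the degree after a parallel build
grant select on TB_DELETE_TEST to some_app_user; -- re-issue grants lost by CTAS
```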

If you instead use the same DELETE approach here, it takes far longer:

FI_QRY@orcl>delete /*+RULE parallel(8)*/ from TB_DELETE_TEST1 a
  where exists  (select /*+parallel(8)*/ 1
                    from ( select /*+parallel(30)*/ rowid rid,row_number() over (partition by acc,paper_no order by rowid) rn from TB_DELETE_TEST1) b
                  where b.rn <> 1 and a.rowid=b.rid);

13264943 rows deleted.

Elapsed: 03:01:20.54

 

Another option: partition the table and have a program deduplicate it partition by partition.
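
For that partitioned variant, a minimal PL/SQL sketch (assuming the table has been partitioned; partition names are read from the dictionary and each partition is committed separately, applying the same keep-one-ROWID logic per partition):

```sql
-- Deduplicate one partition at a time (sketch only)
begin
  for p in (select partition_name
              from user_tab_partitions
             where table_name = 'TB_DELETE_TEST') loop
    execute immediate
      'delete from TB_DELETE_TEST partition (' || p.partition_name || ')
        where rowid not in (select min(rowid)
                              from TB_DELETE_TEST partition (' || p.partition_name || ')
                             group by acc, med_no, med_op_date)';
    commit;  -- keep undo usage bounded per partition
  end loop;
end;
/
```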

 

In summary, if the amount of duplicate data is small, the DELETE method above quickly removes duplicates while keeping one copy of each row. If the amount of duplicate data is large, the DDL (CTAS) approach is recommended.

In addition, if the table holds under tens of millions of rows and the duplicates number in the millions, the DELETE approach above can still finish within a few minutes. For details, see:

https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:15258974323143

Origin blog.csdn.net/lanxuxml/article/details/105650694