What is a good way to remove duplicates?

Andy :

I have a varchar column. It contains values separated by semicolons (;).

For example, it looks like

10;20;21;17;20;21;22;

It's not always 7 elements. It could contain anywhere from around 30 to 70. They designed it this way because the values are actually genome segments, and it makes sense to enter or retrieve them collectively.

I need to remove records with duplicate values in this column, so if I see another record with the same value as above, I need to remove it.

I also need to remove a record if its values are all contained in another record. For example, I need to remove

10;;21;17;20;21;22;

because it's the same as the first but it doesn't have the second value, 20. If it's more complete than the first, I will remove the first one instead.

1;2;3;4;5;6;7; and 1;2;3;4;5;6;7;8; are duplicates, and I'm keeping the second one because it's more complete. 1;2;3;4;5;6;;7 is also a duplicate. In this case, if they have 13 or more matched numbers and no mismatch, we will merge them so they become a single value, 1;2;3;4;5;6;7;7;.

I can scan each record in Java, but I'm afraid that it will be complicated and time-consuming, given that the table contains millions of records. I was wondering if it's doable in Oracle itself.

My final goal is to calculate the frequency with which those numbers occur. For instance, if the number 10 appears 5 out of 100 times, it will be 5%. The calculation will be simple. However, I can't calculate this unless I make sure there are no duplicates in the table in the first place.
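For reference, the frequency step could be sketched in Java roughly as follows. This is only a sketch under assumptions: the class name SegmentFrequency and the method percentages are hypothetical, the rows are assumed to be loaded and de-duplicated already, and "5 out of 100 times" is read as a share of all non-empty values rather than a share of records.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original question.
final class SegmentFrequency {

    // For each distinct value, returns its percentage share of all
    // non-empty values found across the given semicolon-separated rows.
    static Map<String, Double> percentages(List<String> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        int total = 0;
        for (String row : rows) {
            // split(";") drops trailing empty strings; empty slots in the
            // middle of a row come back as "" and are skipped below.
            for (String value : row.split(";")) {
                if (value.isEmpty()) {
                    continue;
                }
                counts.merge(value, 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result;
    }
}
```

If the percentage should instead be per record (the share of rows containing the value at least once), the counting loop would need to count each value once per row and divide by the number of rows.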

APC :

Note: This answer is a placeholder because the question looks in danger of closure but I think it will be worthy of an answer once all the rules are established.


It's trivial to remove the exact duplicates:

-- keep the row with the lowest rowid for each distinct string, delete the rest
delete from your_table y
where y.rowid not in ( select min(x.rowid)
                       from your_table x
                       group by x.genome_string);

The hard part is identifying duplicate strings that match exactly apart from nulls (empty positions). Merging rows makes the logic even more convoluted.
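To make the pairwise rule concrete, here is a minimal Java sketch of one possible reading of it; the rules are not fully established, so treat it as an illustration only. It assumes values are compared position by position, that an empty or missing position matches anything, and that two rows merge only if no position conflicts and at least minMatches positions agree. The names SegmentMerge, merge and minMatches are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper illustrating one reading of the merge rule.
final class SegmentMerge {

    // Returns the merged string if a and b never disagree at a position
    // where both have a value and they agree on at least minMatches
    // positions; otherwise returns Optional.empty().
    static Optional<String> merge(String a, String b, int minMatches) {
        String[] x = a.split(";");
        String[] y = b.split(";");
        int n = Math.max(x.length, y.length);
        StringBuilder merged = new StringBuilder();
        int matches = 0;
        for (int i = 0; i < n; i++) {
            // A position past the end of a row is treated as empty.
            String u = i < x.length ? x[i] : "";
            String v = i < y.length ? y[i] : "";
            if (!u.isEmpty() && !v.isEmpty()) {
                if (!u.equals(v)) {
                    return Optional.empty(); // mismatch: not duplicates
                }
                matches++;
            }
            // Take whichever side has a value at this position.
            merged.append(u.isEmpty() ? v : u).append(';');
        }
        return matches >= minMatches
                ? Optional.of(merged.toString())
                : Optional.empty();
    }
}
```

With a threshold of 6, merging 1;2;3;4;5;6;7; with 1;2;3;4;5;6;;7 under these assumptions gives 1;2;3;4;5;6;7;7;, which matches the example in the question. This only covers comparing two rows; deciding which pairs to compare across millions of rows is a separate problem.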
