34 ETL System Summary P7: Deduplication System

7. Deduplication system

Deduplication is one of the most common ETL requirements, and every system should plan for handling duplicate data from the start of its design.

  • Duplicates in existing data: remove them in SQL with a row id (which may need to be generated by a window function), GROUP BY, or DISTINCT.
  • Duplicates in incremental loads: keep the key columns in a history table and compare each incoming batch against them, or use MERGE to update the historical rows instead of inserting them again.
  • A uniqueness constraint can be used to guarantee that the data contains no duplicates.
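
The three approaches above can be sketched together. This is a minimal illustration, not a definitive implementation: it assumes Python's built-in sqlite3 module and a hypothetical customer table keyed by customer_id (neither comes from the original text). SQLite has no MERGE statement, so the incremental step emulates it with an UPDATE followed by an INSERT when no row matched. SQLite's implicit rowid is used to pick the surviving row; in other databases a ROW_NUMBER() window function could play the same role.

```python
import sqlite3


def dedupe_existing(conn):
    """Keep the earliest-loaded row per customer_id and delete the rest."""
    conn.execute(
        """
        DELETE FROM customer
        WHERE rowid NOT IN (
            SELECT MIN(rowid) FROM customer GROUP BY customer_id
        )
        """
    )


def load_increment(conn, rows):
    """Emulate MERGE: update a row if its key exists, otherwise insert it."""
    for customer_id, name in rows:
        cur = conn.execute(
            "UPDATE customer SET name = ? WHERE customer_id = ?",
            (name, customer_id),
        )
        if cur.rowcount == 0:  # key not in the table yet
            conn.execute(
                "INSERT INTO customer (customer_id, name) VALUES (?, ?)",
                (customer_id, name),
            )


def build_demo():
    """Build a hypothetical table, clean it, constrain it, load a batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(1, "Ann"), (1, "Ann"), (2, "Bob")],  # key 1 is duplicated
    )
    dedupe_existing(conn)
    # The unique index can only be created once the duplicates are gone.
    conn.execute("CREATE UNIQUE INDEX ux_customer_id ON customer (customer_id)")
    load_increment(conn, [(2, "Bobby"), (3, "Cy")])
    return conn


if __name__ == "__main__":
    conn = build_demo()
    for row in conn.execute("SELECT * FROM customer ORDER BY customer_id"):
        print(row)
```

Once the unique index exists, a plain INSERT that repeats an existing customer_id raises sqlite3.IntegrityError, so the constraint acts as a safety net if a load path skips the key comparison.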

Origin blog.csdn.net/hardyer/article/details/108663758