Alibaba Cloud Big Data Practical Record 7: How to handle duplicate data in production tables

1. Introduction

Today I discovered a table in the data warehouse that contains duplicate rows: rows whose values are identical in every column. This is not acceptable in a production environment, so we need to find a way to remove them.

2. Delete duplicate data

Note: To modify a table in the production environment, you may need to apply for the relevant table permissions in the Security Center.

2.1 Add deduplication logic to the scheduled task

Deduplication is the end goal, but to achieve it we first need to identify the root cause of the duplicate data. After some analysis, the duplicates were traced to an upstream table: its data is synchronized elsewhere through asynchronous processing, business staff then fill in some information, the collected information is written back to the database, and finally a preprocessing task reads the result.

Fixing the problem upstream would require guaranteeing that no duplicates are produced during the asynchronous processing, which is hard to control. So we handle it downstream instead, adding deduplication logic where the source table is processed: when writing the source table's data into the target table, add the distinct keyword so that the query result is deduplicated. For example:

INSERT OVERWRITE TABLE table_name
select distinct xxx from t1 ……;

Or deduplicate with group by:

INSERT OVERWRITE TABLE table_name
select xxx from t1 …… group by xxx;

In this way, duplicate data is effectively avoided during processing, which improves data accuracy and reliability and reduces the storage cost caused by redundant rows.

After modifying the task, submit and publish it. If you do not need the data urgently, just wait for the scheduled task to refresh it automatically.
In a hurry to see the data? Then, after publishing, manually rerun the task.

2.2 One-time deduplication in query window

If you need the data urgently, you can also open a query window (or create a SQL node), deduplicate the source table, and overwrite the source table with the result.

The specific operations are as follows:
1. Create a new query window:


2. Enter the code and execute it.


Reference pseudo code:

INSERT OVERWRITE TABLE table_name
select distinct xxx from table_name;

There is an essential difference between this pseudocode and the one above: here the same table is both read from and written to, whereas above the source and target were different tables, and the source query may also contain more complex logic such as JOIN or WHERE clauses.
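After the overwrite finishes, it is worth verifying that no duplicates remain. Below is a minimal check sketch, reusing the placeholder column list xxx from the pseudocode above:

-- any row returned indicates remaining duplicates
select xxx, count(*) as cnt
from table_name
group by xxx
having count(*) > 1;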

2.3 Manual processing of local duplicate data

If there are only a few duplicate rows, you can also fix them by hand: delete first, then re-insert. (This can likewise be done in the query window or in a SQL node.)

delete from table_name where id in (1, 2);
insert into table_name (id, app_name) values (1, 'xx'), (2, 'xx');

This approach requires entering the rows one by one, which is cumbersome, so it is only suitable for a small amount of data.
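As an aside, when rows share the same id but differ in other columns, distinct alone cannot choose which copy to keep. A common alternative is a row_number() window. This is only a sketch: the ordering column gmt_modified is a hypothetical name, not from the original article's table.

insert overwrite table table_name
select id, app_name
from (
    select id, app_name,
           -- hypothetical: order by a modification timestamp to keep the newest copy
           row_number() over (partition by id order by gmt_modified desc) as rn
    from table_name
) t
where rn = 1;  -- keep exactly one row per id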

2.4 Data backup issues

Before actually executing the deletion, if the source table is not an intermediate table but an original table, you may want to add an extra layer of backup, for example by creating a temporary table and copying the data into it. Reference pseudocode:

-- lifecycle is the table's life cycle, in days
create table table_name_like like table_name lifecycle 10;
-- "like" copies only the schema, so copy the data over as well
insert overwrite table table_name_like select * from table_name;

Note: lifecycle is the life cycle of the table. For a non-partitioned table, it is counted from the last time the table's data was modified; if the data has not changed after the specified number of days, the table is automatically reclaimed (similar to a drop table operation). For a partitioned table, reclamation happens per partition; unlike a non-partitioned table, the table itself is not dropped even after its last partition has been reclaimed.
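Once the backup table has been filled with the data, a quick sanity check is to compare row counts between the two tables (a sketch using the table names above):

-- the two counts should match after the data copy
select count(*) from table_name;
select count(*) from table_name_like;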

Of course, you can also check whether the project already has backup data; if a backup exists, no extra steps are needed. For details, see Backup and Recovery.

3. Summary

When dealing with duplicate data in a production environment, the goal is to remove the duplicate rows from the table, and there are several ways to do it. For example, overwriting with INSERT OVERWRITE: this essentially clears the table and then re-inserts the data, letting us deduplicate the newly inserted rows. It can be done either through a scheduled task or by running code in a query window. If only a few local rows are affected, you can also handle them manually: delete the duplicate rows, then insert the deleted data back once.


Origin blog.csdn.net/qq_45476428/article/details/132332995