For example:
name | score |
---|---|
Computer | 1600 |
Phone | 12 |
Phone | 12 |
Steps:
1. Copy table structure
CREATE TABLE <new_table> LIKE <old_table>;
2. Insert deduplicated data
INSERT OVERWRITE TABLE <new_table>
SELECT DISTINCT * FROM <old_table>;
Note: this statement sometimes fails with the following error:
FAILED: SemanticException TOK_ALLCOLREF is not supported in current context
In that case, list all the column names explicitly:
INSERT OVERWRITE TABLE <new_table>
SELECT DISTINCT name, score FROM <old_table>;
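After the insert, it can be worth confirming that no duplicates remain. The query below is a minimal sketch that assumes the example columns name and score; <new_table> is the same placeholder used in the steps above.

```sql
-- Sanity check: list any (name, score) combination that still occurs more than once.
-- An empty result means the deduplication worked.
SELECT name, score, COUNT(*) AS cnt
FROM <new_table>
GROUP BY name, score
HAVING COUNT(*) > 1;
```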
2. Partial data duplication
For example:
name | score | type |
---|---|---|
Computer | 1600 | 2 |
Phone | 12 | 1 |
Phone | 15 | 1 |
Steps:
1. Copy table structure
CREATE TABLE <new_table> LIKE <old_table>;
2. Insert deduplicated data
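INSERT OVERWRITE TABLE <new_table>
SELECT t.name, t.score, t.type
FROM (
  -- Number the rows within each name, ordered by score (ascending by default)
  SELECT
    name, score, type,
    row_number() OVER (DISTRIBUTE BY name SORT BY score) AS rn
  FROM <old_table>
) t
WHERE t.rn = 1;  -- keep only the first row per name, i.e. the lowest score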
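Before overwriting anything, the inner query can be run on its own to see which row would be kept. The sketch below uses the example table above and writes the window with PARTITION BY ... ORDER BY, which is the equivalent standard spelling of the DISTRIBUTE BY ... SORT BY clause; the commented result assumes the table contains exactly the three sample rows.

```sql
-- Preview the row numbers assigned within each name group.
SELECT
  name, score, type,
  row_number() OVER (PARTITION BY name ORDER BY score) AS rn
FROM <old_table>;

-- With the three sample rows, the result is (row order may vary):
--   Computer | 1600 | 2 | rn = 1
--   Phone    | 12   | 1 | rn = 1
--   Phone    | 15   | 1 | rn = 2   <- dropped by the rn = 1 filter
```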
3. To sum up:
INSERT OVERWRITE TABLE <new_table>
SELECT <columns>
FROM (
  SELECT <columns>, row_number() OVER (DISTRIBUTE BY <duplicated_columns> SORT BY <sort_column>) AS rn
  FROM <old_table>
) t
WHERE t.rn = 1;
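The sort column decides which of the duplicate rows survives. As a hedged variant of the pattern, the sketch below keeps the row with the highest score per name instead of the lowest, and adds type as a tie-breaker so the choice stays deterministic when two rows share the same score. Using type as the tie-breaker is an assumption made for illustration; any column that distinguishes the rows would do.

```sql
-- Variant: keep the highest score per name, breaking ties by type.
INSERT OVERWRITE TABLE <new_table>
SELECT t.name, t.score, t.type
FROM (
  SELECT
    name, score, type,
    row_number() OVER (PARTITION BY name ORDER BY score DESC, type) AS rn
  FROM <old_table>
) t
WHERE t.rn = 1;
```

With the sample data, this keeps Phone | 15 | 1 rather than Phone | 12 | 1.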
Appendix: Basic usage of the ROW_NUMBER() OVER function
Syntax: ROW_NUMBER() OVER (PARTITION BY COLUMN ORDER BY COLUMN)
Simply put, row_number() returns a sequence number, starting from 1, for each row within its group. For example, ROW_NUMBER() OVER (ORDER BY xlh DESC) first sorts the rows by the xlh column in descending order and then assigns a sequence number to each row in that order. Example:
xlh | row_num |
---|---|
1700 | 1 |
1500 | 2 |
1085 | 3 |
710 | 4 |
row_number() OVER (PARTITION BY COL1 ORDER BY COL2) groups the rows by COL1 and sorts them by COL2 within each group. The value it returns is the row's sequence number after sorting within its group (consecutive and unique within the group).