Hive data deduplication method

1. Entire rows are duplicated

For example:

name score
Computer 1600
Phone 12
Phone 12

Steps:

1. Copy table structure
CREATE TABLE <new_table> LIKE <old_table>;

2. Insert deduplicated data
insert overwrite table <new_table>
select distinct * from <old_table>;

Note: On some Hive versions this statement fails with the following error:
FAILED: SemanticException TOK_ALLCOLREF is not supported in current context

In that case, list every column name explicitly:
insert overwrite table <new_table>
select distinct name, score from <old_table>;
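
The effect of the two steps above can be checked locally without a Hive cluster. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive, and the table names old_table and new_table and the helper dedup_full_rows are illustrative assumptions, not part of the original method. SQLite has no CREATE TABLE ... LIKE, so the new table is declared by hand.

```python
import sqlite3


def dedup_full_rows(rows):
    """Copy (name, score) rows into a new table, dropping exact duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE old_table (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO old_table VALUES (?, ?)", rows)

    # Stand-in for "CREATE TABLE <new_table> LIKE <old_table>".
    conn.execute("CREATE TABLE new_table (name TEXT, score INTEGER)")

    # Same idea as the Hive statement: select distinct rows with the columns listed.
    conn.execute(
        "INSERT INTO new_table SELECT DISTINCT name, score FROM old_table"
    )

    result = conn.execute(
        "SELECT name, score FROM new_table ORDER BY name"
    ).fetchall()
    conn.close()
    return result


sample = [("Computer", 1600), ("Phone", 12), ("Phone", 12)]
print(dedup_full_rows(sample))  # [('Computer', 1600), ('Phone', 12)]
```

The two identical Phone rows collapse into one, which is what select distinct is expected to do in Hive as well.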

2. Rows are duplicated on only some columns

For example:

name score type
Computer 1600 2
Phone 12 1
Phone 15 1

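Here the two Phone rows share the same name and type but differ in score, so select distinct would keep both. The goal is to keep one row per name (below, the one with the lowest score).

Steps: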

1. Copy table structure
CREATE TABLE <new_table> LIKE <old_table>;

2. Insert deduplicated data
insert overwrite table <new_table>
select t.name, t.score, t.type
from (
  select
    name, score, type,
    row_number() over (distribute by name sort by score) as rn
  from <old_table>
) t
where t.rn = 1;
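
To see which rows survive, the same query shape can be run locally. This is a sketch under assumptions: it uses Python's sqlite3 module in place of Hive (window functions need SQLite 3.25 or later), writes the window as PARTITION BY ... ORDER BY because SQLite does not accept Hive's distribute by / sort by, and the names old_table and dedup_by_name are made up for illustration.

```python
import sqlite3


def dedup_by_name(rows):
    """Keep one (name, score, type) row per name: the one with the lowest score."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE old_table (name TEXT, score INTEGER, type INTEGER)"
    )
    conn.executemany("INSERT INTO old_table VALUES (?, ?, ?)", rows)

    # Number the rows inside each name group by ascending score,
    # then keep only the first row of every group.
    query = """
        SELECT t.name, t.score, t.type
        FROM (
            SELECT name, score, type,
                   ROW_NUMBER() OVER (PARTITION BY name ORDER BY score) AS rn
            FROM old_table
        ) t
        WHERE t.rn = 1
        ORDER BY t.name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample = [("Computer", 1600, 2), ("Phone", 12, 1), ("Phone", 15, 1)]
print(dedup_by_name(sample))  # [('Computer', 1600, 2), ('Phone', 12, 1)]
```

Sorting by score descending instead would keep the Phone row with score 15, so the sort column and direction decide which duplicate is retained.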

3. Summary

The general pattern is:

insert overwrite table <new_table>
select <columns>
from (
  select <columns>,
    row_number() over (distribute by <duplicated column> sort by <column that orders the duplicates>) as rn
  from <old_table>
) t
where t.rn = 1;

  • Appendix: Basic usage of the ROW_NUMBER() OVER function

    Syntax: ROW_NUMBER() OVER (PARTITION BY COLUMN ORDER BY COLUMN)

    Simply put, row_number() returns a sequence number, starting from 1, for each row within its group. For example, ROW_NUMBER() OVER (ORDER BY xlh DESC) first sorts the rows by the xlh column in descending order and then assigns a sequence number to each row in that order. Example:

    xlh   row_num
    1700  1
    1500  2
    1085  3
    710   4

    row_number() OVER (PARTITION BY COL1 ORDER BY COL2)
    groups the rows by COL1, sorts them by COL2 within each group, and returns each row's sequence number within its group (consecutive and unique within the group). In the Hive queries above, distribute by ... sort by ... inside over() plays the same role as PARTITION BY ... ORDER BY ...
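
The numbering rule described above can be reproduced in a few lines of plain Python. This is only an illustrative emulation, not how Hive implements the function; the helper name row_number and its parameters are hypothetical. One difference worth noting: Python's sort is stable, whereas a database may number tied rows in any order.

```python
from collections import defaultdict


def row_number(rows, order_key, partition_key=None, descending=False):
    """Return (row, rn) pairs where rn restarts at 1 in every partition."""
    partitions = defaultdict(list)
    for row in rows:
        # Without a partition key, all rows fall into one group.
        key = partition_key(row) if partition_key else None
        partitions[key].append(row)

    numbered = []
    for group in partitions.values():  # groups appear in first-seen order
        ordered = sorted(group, key=order_key, reverse=descending)
        for rn, row in enumerate(ordered, start=1):
            numbered.append((row, rn))
    return numbered


# ROW_NUMBER() OVER (ORDER BY xlh DESC)
xlh = [710, 1500, 1085, 1700]
print(row_number(xlh, order_key=lambda v: v, descending=True))
# [(1700, 1), (1500, 2), (1085, 3), (710, 4)]

# ROW_NUMBER() OVER (PARTITION BY name ORDER BY score)
scores = [("Phone", 15), ("Computer", 1600), ("Phone", 12)]
print(row_number(scores, order_key=lambda r: r[1], partition_key=lambda r: r[0]))
# [(('Phone', 12), 1), (('Phone', 15), 2), (('Computer', 1600), 1)]
```

Filtering this output to rn == 1 leaves exactly one row per partition, which is why the deduplication queries end with where t.rn = 1.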


Origin blog.csdn.net/selectgoodboy/article/details/88532005