11. ClickHouse series: cleaning 100 million rows of data in practice

1. Background and requirements

Suppose we currently have about 100 million registered-address records. For each registered address we need to extract the province, city, and district information as completely as possible, and then strip the province/city/district part out of the address itself. The data is stored as CSV, about 20 GB in total.

Effect after cleaning
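For illustration only (the sample address below is made up), a record before and after cleaning might look like this:

Before: address = 浙江省杭州市西湖区文三路100号
After:  province = 浙江省, city = 杭州市, district = 西湖区, address = 文三路100号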

2. Solution

2.1 Create the table, using the ReplacingMergeTree engine

ReplacingMergeTree is chosen because it merges old rows that share the same sorting key, which fits this scenario very well. With any other engine we would have to parse out the province/city/district and then UPDATE the columns directly, which is a heavy operation in ClickHouse: you would see a long list of pending mutations with select * from system.mutations where is_done=0;. ClickHouse records these heavy operations in the system.mutations table.
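As a minimal sketch (the column name, the value, and the WHERE condition below are assumptions for illustration), the heavy path and the mutation check would look like this:

-- UPDATE in ClickHouse is an asynchronous mutation that rewrites whole data parts,
-- which is expensive on ~100 million rows (column and condition are illustrative only)
ALTER TABLE etl.dwd_company UPDATE province = '浙江省' WHERE address LIKE '浙江省%';

-- mutations that have not finished yet
SELECT * FROM system.mutations WHERE is_done = 0;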

CREATE TABLE etl.dwd_company (
    district String 
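The DDL shown above includes only the district column; a minimal sketch of a complete definition, in which the other column names, the sorting key, and the engine clause are assumptions, could be:

CREATE TABLE etl.dwd_company
(
    company  String,   -- assumed: company name / business key
    address  String,   -- assumed: raw registered address, to be stripped of province/city/district
    province String,   -- assumed: parsed province
    city     String,   -- assumed: parsed city
    district String    -- the only column present in the snippet above
)
ENGINE = ReplacingMergeTree
ORDER BY company;      -- assumed sorting key; rows with the same key are deduplicated during merges

With such a table, the cleaning job can simply re-insert the parsed rows under the same sorting key and let background merges (or OPTIMIZE TABLE etl.dwd_company FINAL) replace the old, uncleaned rows, instead of issuing heavy UPDATE mutations.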

Source: blog.csdn.net/SJshenjian/article/details/130351849