ClickHouse storage engines

1. Classification of commonly used storage engines

1.1 ReplacingMergeTree

        This engine is based on MergeTree and adds the ability to handle duplicate data. Unlike MergeTree, it removes duplicate rows that have the same sorting key (ORDER BY) value; the PRIMARY KEY alone does not decide what counts as a duplicate.
Features:
1. Use the ORDER BY sort key as the unique key for judging duplication.
2. Data deduplication will only be triggered during the merge process.
3. Duplicate data will be deleted in units of data partitions. Duplicate data in different partitions will not be deleted.
4. Duplicates are found by relying on the data already being sorted by ORDER BY.
5. If no ver version column is set, the last inserted row among the duplicates is kept.
6. If a ver version column is set, the row with the largest ver value among the duplicates is kept.
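
The rules above can be illustrated with a small Python model. This is only a sketch of the behaviour, not ClickHouse's implementation: the helper name replacing_merge, the dict-based rows, and the month column standing in for the partition are all assumptions made for illustration.

```python
from itertools import groupby


def replacing_merge(rows, sort_key, partition_of, ver=None):
    """Toy model of one ReplacingMergeTree merge (hypothetical helper).

    rows         -- list of dicts, in insertion order
    sort_key     -- column names acting as ORDER BY
    partition_of -- function returning the partition of a row
    ver          -- optional name of the version column
    """
    def key(row):
        return (partition_of(row), tuple(row[c] for c in sort_key))

    # sorted() is stable, so duplicates stay in insertion order.
    ordered = sorted(rows, key=key)
    merged = []
    for _, group in groupby(ordered, key=key):
        group = list(group)
        if ver is None:
            # No version column: the last inserted duplicate wins.
            merged.append(group[-1])
        else:
            # Version column: the largest version wins; scanning in
            # reverse makes the last inserted row win on ties.
            merged.append(max(reversed(group), key=lambda r: r[ver]))
    return merged


rows = [
    {"id": "a", "code": "x", "month": "2023-08", "ver": 2},
    {"id": "a", "code": "x", "month": "2023-08", "ver": 1},
    {"id": "a", "code": "x", "month": "2023-09", "ver": 1},
]
by_month = lambda r: r["month"]

no_ver = replacing_merge(rows, ("id", "code"), by_month)
with_ver = replacing_merge(rows, ("id", "code"), by_month, ver="ver")
print([(r["month"], r["ver"]) for r in no_ver])
print([(r["month"], r["ver"]) for r in with_ver])
```

In both runs the 2023-09 row survives even though its key matches the others, because it sits in a different partition.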

1.1.1 Example of table creation statement

CREATE TABLE replace_table (
    id String,
    code String,
    create_time DateTime
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(create_time)
ORDER BY (id, code)
PRIMARY KEY id;

Rows are deduplicated on the ORDER BY key (id, code), not on the PRIMARY KEY, and rows in different partitions are never deduplicated against each other.

 

1.2 SummingMergeTree

        The engine inherits from MergeTree. The difference is that when ClickHouse merges the data parts of a SummingMergeTree table, it collapses all rows with the same aggregation key (the sorting key) into one row that holds the summed values of the numeric columns. If a single key value corresponds to a large number of rows, this significantly reduces storage space and speeds up queries. For columns that cannot be summed, the first value encountered is kept.
Features:
1. The ORDER BY sorting key is used as the aggregation key.
2. The summing logic is triggered only when data parts are merged.
3. Data is aggregated per partition; rows in different partitions are not summed together.
4. If summary columns are specified when defining the engine (they must not be primary key columns), SUM is applied to those columns only.
5. If none are specified, all non-primary-key numeric columns are summed.
6. SUM is applied to rows with the same aggregation key, relying on the ORDER BY sort order.
7. During summation within a partition, non-summed columns keep the value of the first row.
8. Nested structures are supported, but the column name must end with the Map suffix.
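
A minimal Python sketch of these rules, assuming rows are plain dicts and the partition is a string column. The helper summing_merge and its parameters are hypothetical and only imitate the behaviour described above; nested Map columns are not modelled.

```python
from itertools import groupby
from numbers import Number


def summing_merge(rows, sort_key, partition_of, sum_columns=None):
    """Toy model of one SummingMergeTree merge (hypothetical helper)."""
    def key(row):
        return (partition_of(row), tuple(row[c] for c in sort_key))

    def summable(col, value):
        if col in sort_key:
            return False  # key columns are never summed
        if sum_columns is not None:
            return col in sum_columns  # only the listed columns
        # Default: every numeric non-key column (bool is excluded).
        return isinstance(value, Number) and not isinstance(value, bool)

    merged = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        out = dict(group[0])  # non-summed columns keep the first row's value
        for col, value in group[0].items():
            if summable(col, value):
                out[col] = sum(r[col] for r in group)
        merged.append(out)
    return merged


rows = [
    {"id": "a", "month": "2023-08", "v1": 10, "v2": 1.5, "note": "first"},
    {"id": "a", "month": "2023-08", "v1": 20, "v2": 2.5, "note": "second"},
    {"id": "a", "month": "2023-09", "v1": 5, "v2": 1.0, "note": "third"},
]
by_month = lambda r: r["month"]

all_numeric = summing_merge(rows, ("id",), by_month)
only_v1 = summing_merge(rows, ("id",), by_month, sum_columns=("v1",))
print(all_numeric)
print(only_v1)
```

With sum_columns=("v1",), the v2 column is treated like note: it is not summed and simply keeps the first row's value.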

1.3 AggregatingMergeTree

        The engine inherits from MergeTree and changes the merge logic for data parts: within one data part, ClickHouse replaces all rows with the same primary key (more precisely, the same sorting key) with a single row that stores the states of a series of aggregate functions.
You can use AggregatingMergeTree tables for incremental data aggregation, including aggregated materialized views. The columns that hold aggregated values use the AggregateFunction type.
AggregatingMergeTree is appropriate when you want to merge rows and reduce their number according to a set of rules. You cannot write to it with a plain INSERT of raw values; data is generally written with INSERT ... SELECT. More commonly, a materialized view is created on top of a base table.
The data is pre-aggregated in advance to form a data cube.
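
The idea of storing aggregate function states instead of final values can be sketched in Python. This is an illustrative model only: the (sum, count) tuple stands in for the state of an average, and the helper names avg_state, merge_avg, finalize_avg and aggregating_merge are hypothetical, not ClickHouse functions.

```python
def avg_state(values):
    """Partial state of an average: (sum, count)."""
    values = list(values)
    return (sum(values), len(values))


def merge_avg(a, b):
    """Combine two partial states without seeing the raw rows again."""
    return (a[0] + b[0], a[1] + b[1])


def finalize_avg(state):
    """Turn a state into the final value, as done at query time."""
    return state[0] / state[1]


def aggregating_merge(parts):
    """Merge several parts (key -> state) into a single part."""
    merged = {}
    for part in parts:
        for key, state in part.items():
            merged[key] = merge_avg(merged[key], state) if key in merged else state
    return merged


# Two parts written by separate inserts.
part1 = {"a": avg_state([10, 20])}
part2 = {"a": avg_state([60]), "b": avg_state([4])}

merged = aggregating_merge([part1, part2])
print(merged)
print({k: finalize_avg(s) for k, s in merged.items()})
```

Because states can be merged in any grouping, new data only needs to contribute its own partial state, which is what makes incremental aggregation possible.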

1.3.1 First create a base table with the MergeTree engine

 

1.3.2 Create a materialized view of AggregatingMergeTree

1.4 CollapsingMergeTree

add and delete

        According to Yandex's official introduction, CollapsingMergeTree asynchronously removes (collapses) pairs of rows that are identical except for the values 1 and -1 in the special column sign. Rows without a pair are kept. This engine can significantly reduce storage volume and improve SELECT query efficiency.
The CollapsingMergeTree engine has a status column, sign. Its value is 1 for a "state" row and -1 for a "cancel" row. Only the rows marked as state are of interest; rows marked as cancel are not.
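
A simplified Python model of the collapsing step, following the merge rules as the ClickHouse documentation describes them. Grouping rows by a sorting key, the helper name collapsing_merge, and the sample columns are assumptions of this sketch; it is not the engine's actual code.

```python
from itertools import groupby


def collapsing_merge(rows, sort_key, sign="sign"):
    """Toy model of one CollapsingMergeTree merge (hypothetical helper)."""
    def key(row):
        return tuple(row[c] for c in sort_key)

    merged = []
    # sorted() is stable, so rows with the same key keep insertion order.
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        states = [r for r in group if r[sign] == 1]
        cancels = [r for r in group if r[sign] == -1]
        if len(states) > len(cancels):
            merged.append(states[-1])   # keep the last "state" row
        elif len(cancels) > len(states):
            merged.append(cancels[0])   # keep the first "cancel" row
        elif states and group[-1][sign] == 1:
            # Equal counts but the last row is a "state" row: keep both.
            merged.extend([cancels[0], states[-1]])
        # Otherwise every state row is cancelled and nothing is kept.
    return merged


rows = [
    {"user": 1, "views": 5, "sign": 1},
    {"user": 1, "views": 5, "sign": -1},  # cancels the row above
    {"user": 1, "views": 6, "sign": 1},   # new state for user 1
    {"user": 2, "views": 3, "sign": 1},
]
result = collapsing_merge(rows, ("user",))
print(result)
```

Note that in this model the order of rows matters: a cancel row that arrives before its state row does not collapse with it.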

1.5 VersionedCollapsingMergeTree

        This engine is similar to CollapsingMergeTree but adds a version column, which lets pairs of rows collapse correctly even when they are inserted out of order. It suits, for example, non-real-time statistics of online users, counting the online activity of users on each node.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2]
) ENGINE = VersionedCollapsingMergeTree(sign, version)
[PARTITION BY expr]
[ORDER BY expr] [SAMPLE BY expr]
[SETTINGS name=value, ...]
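
The effect of the version column can be sketched in Python. This is a toy model under the assumption that rows with the same sorting key and the same version but opposite sign cancel in pairs, regardless of arrival order; the helper name versioned_collapsing_merge and the sample columns are hypothetical.

```python
from itertools import groupby


def versioned_collapsing_merge(rows, sort_key, sign="sign", version="version"):
    """Toy model of one VersionedCollapsingMergeTree merge."""
    def key(row):
        # The version takes part in matching rows, next to the sorting key.
        return (tuple(row[c] for c in sort_key), row[version])

    merged = []
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        states = [r for r in group if r[sign] == 1]
        cancels = [r for r in group if r[sign] == -1]
        paired = min(len(states), len(cancels))
        # Drop the matched pairs and keep whatever is left unpaired.
        merged.extend(states[paired:] + cancels[paired:])
    return merged


rows = [
    {"user": 1, "views": 5, "sign": -1, "version": 1},  # cancel arrives first
    {"user": 1, "views": 6, "sign": 1, "version": 2},
    {"user": 1, "views": 5, "sign": 1, "version": 1},   # its state arrives last
]
result = versioned_collapsing_merge(rows, ("user",))
print(result)
```

Even though the cancel row for version 1 arrives before its state row, the pair is removed and only the version 2 state remains.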

2. Connecting ClickHouse to other storage engines

2.1 Connect to MySQL

MySQL table creation statement

2.2 Connect to Kafka

ENGINE = Kafka SETTINGS
        kafka_broker_list = 'localhost:9092',
        kafka_topic_list = 'topic1,topic2',
        kafka_group_name = 'group1',
        kafka_format = 'JSONEachRow',
        kafka_row_delimiter = '\n',
        kafka_schema = '',
        kafka_num_consumers = 2

Messages read from a Kafka engine table are consumed once and are not kept in the table afterwards, so a materialized view needs to be created to persist them into another table.
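
Why the materialized view is needed can be shown with a toy Python model. The class KafkaLikeTable and the function materialized_view_flush are hypothetical stand-ins; a real Kafka engine table tracks consumer group offsets rather than deleting anything from Kafka, which this sketch does not model.

```python
from collections import deque


class KafkaLikeTable:
    """Toy stand-in for a Kafka engine table: reading consumes messages."""

    def __init__(self, messages):
        self._pending = deque(messages)

    def read(self):
        # Every message is handed out once and is gone afterwards.
        rows = list(self._pending)
        self._pending.clear()
        return rows


def materialized_view_flush(source, target):
    """Copy whatever the source delivers into a persistent target table."""
    rows = source.read()
    target.extend(rows)
    return len(rows)


queue = KafkaLikeTable([{"id": 1}, {"id": 2}])
storage = []  # plays the role of a MergeTree table

flushed = materialized_view_flush(queue, storage)
second_read = queue.read()
print(flushed, storage, second_read)
```

The second read returns nothing, while the rows remain available in the storage list that the view filled.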

3. Data backup

After data is written to a partition, the write is recorded in a log on the ZooKeeper (zk) node, and the other replicas consume that log to obtain the data.
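
A rough Python sketch of this flow, assuming a shared list stands in for the log kept under the zk node and each replica remembers how far it has read. The Replica class and the part name are illustrative only; in a real cluster ZooKeeper holds metadata, and the data itself is fetched from the replica that has it.

```python
class Replica:
    """Toy replica that follows a shared replication log."""

    def __init__(self, name, shared_log):
        self.name = name
        self.log = shared_log   # stands in for the log under the zk node
        self.parts = {}         # part name -> rows stored locally
        self.log_pointer = 0    # index of the next log entry to process

    def insert(self, part_name, rows):
        # Write locally first, then record the write in the shared log.
        self.parts[part_name] = list(rows)
        self.log.append((part_name, self))

    def sync(self):
        # Consume the log entries this replica has not seen yet.
        while self.log_pointer < len(self.log):
            part_name, source = self.log[self.log_pointer]
            if part_name not in self.parts:
                # Fetch the data from the replica that wrote it.
                self.parts[part_name] = list(source.parts[part_name])
            self.log_pointer += 1


log = []
r1 = Replica("r1", log)
r2 = Replica("r2", log)

r1.insert("202308_1_1_0", [1, 2, 3])  # illustrative part name
r2.sync()
print(r2.parts)
```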

 

zk node information

4. Distributed table

Origin blog.csdn.net/qq_16803227/article/details/132149865