[Clickhouse] Clickhouse TTL data survival time


1. Overview

Reprinted: Clickhouse TTL

2. Environment

Operating environment:

centos 7.6
Clickhouse> select version();
 
SELECT version()
 
┌─version()─┐
│ 20.4.4.18 │
└───────────┘

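TTL stands for Time To Live, the survival time of data. In MergeTree, a TTL can be set on an individual column or on the whole table. When the time is reached, a column-level TTL clears the expired values of that column, and a table-level TTL deletes the expired rows of the table. If both column-level and table-level TTLs are set, whichever expires first takes effect.

Both column-level and table-level TTLs rely on a DateTime or Date column. The expiration time is expressed by applying an INTERVAL operation to that column: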

Example:

TTL time_column + interval 3 DAY

This means the data expires 3 days after the time stored in time_column.

Units supported by INTERVAL: SECOND, MINUTE, HOUR, DAY, WEEK, MONTH, QUARTER, YEAR.
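To make the arithmetic concrete, here is a minimal Python sketch (not ClickHouse code) of how an expression such as time_column + INTERVAL 3 DAY turns into an expiration time. The helper name ttl_expiry and the unit table are assumptions made for illustration. Only fixed-length units are modeled, because MONTH, QUARTER and YEAR follow calendar rules.

```python
from datetime import datetime, timedelta

# Fixed-length INTERVAL units only. MONTH, QUARTER and YEAR depend on the
# calendar and are deliberately left out of this sketch.
UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
}


def ttl_expiry(time_value: datetime, amount: int, unit: str) -> datetime:
    """Hypothetical helper: return time_value + INTERVAL <amount> <unit>."""
    return time_value + timedelta(seconds=amount * UNIT_SECONDS[unit.lower()])


# TTL time_column + INTERVAL 3 DAY for a sample time_column value
sample = datetime(2020, 6, 15, 12, 10, 20)
print(ttl_expiry(sample, 3, "DAY"))
```

A row becomes a candidate for TTL cleanup once the current time passes the value this expression produces for it.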

3. Applicable scenarios

Building a data warehouse requires thinking about the life cycle of data: initial writing, storage, processing, querying, archiving, and destruction.
As the data volume of a warehouse keeps multiplying, it not only consumes huge storage capacity but also becomes hard to manage. Changing the storage method or migrating storage adds cost and risk to a project. ClickHouse's TTL design deals effectively with data retention periods and data destruction, which gives businesses another option when choosing a data warehouse.

In summary:

  1. Periodically delete expired data
  2. Periodically move expired data for archiving

Suitable for:

  3. Partitioned tables
  4. Columns of materialized views

4. TTL at column level

To set a column-level TTL, declare the TTL expression in the column definition when you define the table. Primary key columns cannot have a TTL.

Clickhouse> create table t_column_ttl(id UInt64 comment 'Primary key'
,create_time Datetime
,product_desc String  TTL create_time + interval 10 second
,product_type UInt8 TTL create_time + interval 10 second) 
engine=MergeTree partition by toYYYYMM(create_time) order by id;
 
 
CREATE TABLE t_column_ttl
(
    `id` UInt64 COMMENT 'Primary key', 
    `create_time` Datetime, 
    `product_desc` String TTL create_time + toIntervalSecond(10), 
    `product_type` UInt8 TTL create_time + toIntervalSecond(10)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(create_time)
ORDER BY id
 
Ok.
 
0 rows in set. Elapsed: 0.059 sec. 
 
Clickhouse> insert into table t_column_ttl values(1,now(),'Huawei',1),(2,now()+interval 1 minute,'Apple',2);
 
INSERT INTO t_column_ttl VALUES
 
Ok.
 
2 rows in set. Elapsed: 0.018 sec. 
 
Clickhouse> select * from t_column_ttl;
 
SELECT *
FROM t_column_ttl
 
┌─id─┬─────────create_time─┬─product_desc─┬─product_type─┐
│  1 │ 2020-06-15 12:10:20 │ Huawei       │            1 │
│  2 │ 2020-06-15 12:11:20 │ Apple        │            2 │
└────┴─────────────────────┴──────────────┴──────────────┘
 
2 rows in set. Elapsed: 0.003 sec. 
 
Clickhouse> select sleep(10);
 
SELECT sleep(10)
 
↙ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.) Received exception from server (version 20.4.4):
Code: 160. DB::Exception: Received from localhost:9000. DB::Exception: The maximum sleep time is 3 seconds. Requested: 10. 
 
0 rows in set. Elapsed: 0.111 sec. 
 
Clickhouse> select sleep(3);
 
SELECT sleep(3)
 
┌─sleep(3)─┐
│        0 │
└──────────┘
 
1 rows in set. Elapsed: 3.003 sec. 
 
Clickhouse> select * from t_column_ttl;
 
SELECT *
FROM t_column_ttl
 
┌─id─┬─────────create_time─┬─product_desc─┬─product_type─┐
│  1 │ 2020-06-15 12:10:20 │ Huawei       │            1 │
│  2 │ 2020-06-15 12:11:20 │ Apple        │            2 │
└────┴─────────────────────┴──────────────┴──────────────┘
 
2 rows in set. Elapsed: 0.002 sec. 
 
Clickhouse> optimize table t_column_ttl final;
 
OPTIMIZE TABLE t_column_ttl FINAL
 
Ok.
 
0 rows in set. Elapsed: 0.004 sec. 
 
Clickhouse> select * from t_column_ttl;
 
SELECT *
FROM t_column_ttl
 
┌─id─┬─────────create_time─┬─product_desc─┬─product_type─┐
│  1 │ 2020-06-15 12:10:20 │              │            0 │
│  2 │ 2020-06-15 12:11:20 │              │            0 │
└────┴─────────────────────┴──────────────┴──────────────┘
 
2 rows in set. Elapsed: 0.003 sec. 

Running the optimize command forces TTL cleanup. Querying again shows that once the TTL condition is met, the columns that declare a TTL are reset to the default value of their data type.
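The effect seen in the last query can be modeled in a few lines of Python. This is only a sketch of the observable behavior, not of ClickHouse internals. The function apply_column_ttl, the defaults table, and the merge time used in the example are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Type defaults for the two TTL columns of t_column_ttl (String -> '', UInt8 -> 0).
TYPE_DEFAULTS = {"product_desc": "", "product_type": 0}
COLUMN_TTL = timedelta(seconds=10)  # create_time + INTERVAL 10 SECOND


def apply_column_ttl(rows, now):
    """Return copies of the rows with expired TTL columns reset to type defaults."""
    result = []
    for row in rows:
        row = dict(row)  # copy, so the input rows stay untouched
        if row["create_time"] + COLUMN_TTL <= now:
            row.update(TYPE_DEFAULTS)
        result.append(row)
    return result


rows = [
    {"id": 1, "create_time": datetime(2020, 6, 15, 12, 10, 20),
     "product_desc": "Huawei", "product_type": 1},
    {"id": 2, "create_time": datetime(2020, 6, 15, 12, 11, 20),
     "product_desc": "Apple", "product_type": 2},
]

# Hypothetical merge time, chosen to be after the TTL of both rows has passed.
merged = apply_column_ttl(rows, now=datetime(2020, 6, 15, 12, 12, 0))
for row in merged:
    print(row["id"], repr(row["product_desc"]), row["product_type"])
```

The id and create_time columns are untouched because they carry no TTL. Only the two columns that declare one fall back to their defaults.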

Add a TTL to an existing column, or modify the TTL it already has:

Clickhouse> alter table t_column_ttl MODIFY COLUMN product_desc String  TTL create_time + INTERVAL  2 DAY;

Add a new column with a TTL:

Clickhouse> alter table t_column_ttl add column product_name String comment '产品名称' ttl create_time + interval 3 month;
 

View TTL information:

Clickhouse> desc t_column_ttl;
 
DESCRIBE TABLE t_column_ttl
 
┌─name─────────┬─type─────┬─default_type─┬─default_expression─┬─comment─────┬─codec_expression─┬─ttl_expression─────────────────────┐
│ id           │ UInt64   │              │                    │ Primary key │                  │                                    │
│ create_time  │ DateTime │              │                    │             │                  │                                    │
│ product_desc │ String   │              │                    │             │                  │ create_time + toIntervalDay(2)     │
│ product_type │ UInt8    │              │                    │             │                  │ create_time + toIntervalSecond(10) │
│ product_name │ String   │              │                    │ 产品名称    │                  │ create_time + toIntervalMonth(3)   │
└──────────────┴──────────┴──────────────┴────────────────────┴─────────────┴──────────────────┴────────────────────────────────────┘
 
5 rows in set. Elapsed: 0.003 sec. 

6. Table level TTL

You can add a TTL expression to the table parameters of MergeTree to set TTL for the entire table.

Before setting it, look up the configured disks and volumes:

Clickhouse> select * from system.disks
:-] ;
 
SELECT *
FROM system.disks
 
┌─name─────────┬─path──────────────────────┬──free_space─┬─total_space─┬─keep_free_space─┐
│ default      │ /var/lib/clickhouse/      │  7788174336 │ 16095640576 │            1024 │
│ disk_archive │ /data/clickhouse_archive/ │   213757952 │ 10724835328 │               0 │
│ disk_cold    │ /data/clickhouse_cold/    │   213757952 │ 10724835328 │               0 │
│ disk_hot1    │ /opt/clickhouse_hot1/     │ 16288342016 │ 26830438400 │               0 │
│ disk_hot2    │ /opt/clickhouse_hot2/     │ 16288342016 │ 26830438400 │               0 │
└──────────────┴───────────────────────────┴─────────────┴─────────────┴─────────────────┘
 
5 rows in set. Elapsed: 0.004 sec. 
 
Clickhouse> select * from system.storage_policies;
 
SELECT *
FROM system.storage_policies
 
┌─policy_name──────┬─volume_name─┬─volume_priority─┬─disks─────────────────────┬─max_data_part_size─┬─move_factor─┐
│ JBOD_default     │ disk_group  │               1 │ ['disk_hot1','disk_hot2'] │                  0 │         0.1 │
│ default          │ default     │               1 │ ['default']               │                  0 │           0 │
│ default_hot_cold │ hot         │               1 │ ['disk_hot1','disk_hot2'] │            1048576 │         0.2 │
│ default_hot_cold │ cold        │               2 │ ['disk_cold']             │        10737418240 │         0.2 │
│ default_hot_cold │ archive     │               3 │ ['disk_archive']          │                  0 │         0.2 │
└──────────────────┴─────────────┴─────────────────┴───────────────────────────┴────────────────────┴─────────────┘
 
5 rows in set. Elapsed: 0.006 sec. 
 

The definition of the table:

create table t_table_ttl(id UInt64 comment '主键',create_time Datetime comment '创建时间',product_desc String  comment '产品描述' TTL create_time + interval 10 minute,product_type UInt8 ) 
engine=MergeTree partition by toYYYYMM(create_time) order by create_time
TTL create_time  + INTERVAL 1 MONTH ,
    create_time + INTERVAL 1 WEEK TO VOLUME 'default',
    create_time + INTERVAL 2 WEEK TO DISK 'default';
 

The whole t_table_ttl table now has a TTL. When TTL cleanup is triggered, rows whose expiration time has passed are deleted. The TO VOLUME and TO DISK clauses move expired data to the named volume or disk instead of deleting it.

Table-level TTL modification:

Clickhouse> alter table t_table_ttl modify ttl create_time + interval 2 month;
 

View information:

Clickhouse> select database,name,engine,data_paths,metadata_path,metadata_modification_time,partition_key,sorting_key from system.tables where name='t_table_ttl';
 
SELECT 
    database, 
    name, 
    engine, 
    data_paths, 
    metadata_path, 
    metadata_modification_time, 
    partition_key, 
    sorting_key
FROM system.tables
WHERE name = 't_table_ttl'
 
┌─database─┬─name────────┬─engine────┬─data_paths──────────────────────────────────────┬─metadata_path──────────────────────────────────────┬─metadata_modification_time─┬─partition_key─────────┬─sorting_key─┐
│ study    │ t_table_ttl │ MergeTree │ ['/var/lib/clickhouse/data/study/t_table_ttl/'] │ /var/lib/clickhouse/metadata/study/t_table_ttl.sql │        2020-06-15 13:01:32 │ toYYYYMM(create_time) │ create_time │
└──────────┴─────────────┴───────────┴─────────────────────────────────────────────────┴────────────────────────────────────────────────────┴────────────────────────────┴───────────────────────┴─────────────┘
 
1 rows in set. Elapsed: 0.010 sec. 
 

View the structure of the table:

Clickhouse> desc t_table_ttl;
 
DESCRIBE TABLE t_table_ttl
 
┌─name─────────┬─type─────┬─default_type─┬─default_expression─┬─comment──┬─codec_expression─┬─ttl_expression─────────────────────┐
│ id           │ UInt64   │              │                    │ 主键     │                  │                                    │
│ create_time  │ DateTime │              │                    │ 创建时间 │                  │                                    │
│ product_desc │ String   │              │                    │ 产品描述 │                  │ create_time + toIntervalMinute(10) │
│ product_type │ UInt8    │              │                    │          │                  │                                    │
└──────────────┴──────────┴──────────────┴────────────────────┴──────────┴──────────────────┴────────────────────────────────────┘
 
4 rows in set. Elapsed: 0.002 sec. 

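Note: in the version used here (20.4), a column-level or table-level TTL cannot be removed once it is set. Newer ClickHouse releases add ALTER TABLE ... REMOVE TTL for this.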

7. How TTL works

If a MergeTree table has a TTL, TTL information is tracked per partition directory: each time data is written, a ttl.txt file is generated in the newly written partition directory.

Insert data:

Clickhouse> insert into t_table_ttl(id,create_time,product_desc,product_type)values(10,now(),'Huawei',1),(20,now()+ interval 10 minute,'Apple',2);
 
 
[root@hadoop101 ~]# ls -l /var/lib/clickhouse/data/study/t_table_ttl/202006_1_1_0/
total 60
-rw-r----- 1 clickhouse clickhouse 465 Jun 15 13:14 checksums.txt
-rw-r----- 1 clickhouse clickhouse 115 Jun 15 13:14 columns.txt
-rw-r----- 1 clickhouse clickhouse   1 Jun 15 13:14 count.txt
-rw-r----- 1 clickhouse clickhouse  34 Jun 15 13:14 create_time.bin
-rw-r----- 1 clickhouse clickhouse  48 Jun 15 13:14 create_time.mrk2
-rw-r----- 1 clickhouse clickhouse  39 Jun 15 13:14 id.bin
-rw-r----- 1 clickhouse clickhouse  48 Jun 15 13:14 id.mrk2
-rw-r----- 1 clickhouse clickhouse   8 Jun 15 13:14 minmax_create_time.idx
-rw-r----- 1 clickhouse clickhouse   4 Jun 15 13:14 partition.dat
-rw-r----- 1 clickhouse clickhouse   8 Jun 15 13:14 primary.idx
-rw-r----- 1 clickhouse clickhouse  39 Jun 15 13:14 product_desc.bin
-rw-r----- 1 clickhouse clickhouse  48 Jun 15 13:14 product_desc.mrk2
-rw-r----- 1 clickhouse clickhouse  28 Jun 15 13:14 product_type.bin
-rw-r----- 1 clickhouse clickhouse  48 Jun 15 13:14 product_type.mrk2
-rw-r----- 1 clickhouse clickhouse 137 Jun 15 13:14 ttl.txt
 
The partition directory contains a ttl.txt file, with the following contents:
# cat ttl.txt 
ttl format version: 1
{"columns":[{"name":"product_desc","min":1592198679,"max":1592199279}],"table":{"min":1597468479,"max":1597469079}}

MergeTree stores the TTL information as a JSON string:

  1. columns stores the column-level TTL information
  2. table stores the table-level TTL information
  3. min and max store the timestamps obtained by applying the INTERVAL expression to the minimum and maximum values of the TTL date column in the current data partition
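As a quick way to read these values, the following Python sketch parses the ttl.txt contents shown above and prints the timestamps as local times. The function names are illustrative. The UTC+8 offset is an assumption inferred from the query output in this article, not something stored in the file.

```python
import json
from datetime import datetime, timedelta, timezone

# ttl.txt as shown above: one header line, then one line of JSON.
TTL_TXT = """ttl format version: 1
{"columns":[{"name":"product_desc","min":1592198679,"max":1592199279}],"table":{"min":1597468479,"max":1597469079}}
"""

# Assumption: the server clock is UTC+8, inferred from the article's output.
SERVER_TZ = timezone(timedelta(hours=8))


def parse_ttl_txt(text):
    """Split off the version header and decode the JSON body."""
    header, body = text.strip().split("\n", 1)
    if not header.startswith("ttl format version:"):
        raise ValueError("unexpected ttl.txt header: " + header)
    return json.loads(body)


def fmt(ts):
    """Render a Unix timestamp in the assumed server time zone."""
    return datetime.fromtimestamp(ts, SERVER_TZ).strftime("%Y-%m-%d %H:%M:%S")


info = parse_ttl_txt(TTL_TXT)
for col in info["columns"]:
    print("column", col["name"], fmt(col["min"]), fmt(col["max"]))
print("table", fmt(info["table"]["min"]), fmt(info["table"]["max"]))
```

The same conversion is done with toDateTime in the queries that follow.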
Clickhouse> select now();
 
SELECT now()
 
┌───────────────now()─┐
│ 2020-06-15 13:28:02 │
└─────────────────────┘
 
1 rows in set. Elapsed: 0.004 sec. 

Column-level values:

Clickhouse> select toDateTime('1592198679') ttl_min,toDateTime('1592199279') ttl_max,ttl_min - min(create_time) expire_min,ttl_max - max(create_time) expire_max from t_table_ttl;
 
SELECT 
    toDateTime('1592198679') AS ttl_min, 
    toDateTime('1592199279') AS ttl_max, 
    ttl_min - min(create_time) AS expire_min, 
    ttl_max - max(create_time) AS expire_max
FROM t_table_ttl
 
┌─────────────ttl_min─┬─────────────ttl_max─┬─expire_min─┬─expire_max─┐
│ 2020-06-15 13:24:39 │ 2020-06-15 13:34:39 │        600 │        600 │
└─────────────────────┴─────────────────────┴────────────┴────────────┘
 
1 rows in set. Elapsed: 0.026 sec. 
 

Table-level values:

Clickhouse> select toDateTime('1597468479') ttl_min,toDateTime('1597469079') ttl_max,ttl_min - min(create_time) expire_min,ttl_max - max(create_time) expire_max from t_table_ttl;
 
SELECT 
    toDateTime('1597468479') AS ttl_min, 
    toDateTime('1597469079') AS ttl_max, 
    ttl_min - min(create_time) AS expire_min, 
    ttl_max - max(create_time) AS expire_max
FROM t_table_ttl
 
┌─────────────ttl_min─┬─────────────ttl_max─┬─expire_min─┬─expire_max─┐
│ 2020-08-15 13:14:39 │ 2020-08-15 13:24:39 │    5270400 │    5270400 │
└─────────────────────┴─────────────────────┴────────────┴────────────┘
 
1 rows in set. Elapsed: 0.006 sec. 
 

The min and max values recorded in ttl.txt are exactly the minimum and maximum create_time in the current data partition plus the TTL interval: 600 seconds (10 minutes) for the column-level TTL, and 5270400 seconds (61 days, which is 2 months counted from June 15) for the table-level TTL. This matches the TTL expressions.
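The two offsets can be re-derived outside ClickHouse. The Python sketch below starts from the ttl.txt timestamps, recovers min(create_time), and checks that adding 2 calendar months (the interval set by the earlier alter table ... modify ttl statement) gives the recorded table-level value. The add_months helper, its day-clamping rule, and the UTC+8 offset are assumptions of this sketch.

```python
import calendar
from datetime import datetime, timedelta, timezone

# Assumption: the server clock is UTC+8, inferred from the article's output.
SERVER_TZ = timezone(timedelta(hours=8))


def add_months(dt, months):
    """Calendar-month addition, clamping the day to the target month's length."""
    index = dt.month - 1 + months
    year = dt.year + index // 12
    month = index % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)


column_ttl_min = 1592198679  # "columns" -> "min" in ttl.txt
table_ttl_min = 1597468479   # "table" -> "min" in ttl.txt

# Column TTL is create_time + 10 minutes, so subtracting 600 s gives min(create_time).
create_min = datetime.fromtimestamp(column_ttl_min - 600, SERVER_TZ)

# Table TTL is create_time + 2 months after the MODIFY TTL statement.
expected_table_min = add_months(create_min, 2)

offset_seconds = table_ttl_min - int(create_min.timestamp())
print(create_min.strftime("%Y-%m-%d %H:%M:%S"))
print(int(expected_table_min.timestamp()) == table_ttl_min)
print(offset_seconds, offset_seconds // 86400)
```

June 15 to August 15 spans 61 days, which is where the value 5270400 in the expire_min and expire_max columns comes from.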

From the way TTL information is recorded, the general processing logic can be inferred:

  1. MergeTree works per partition directory: it records the expiration time in ttl.txt and uses it as the criterion for expiry.
  2. Whenever a batch of data is written, a ttl.txt file is generated for the new partition directory from the result of the INTERVAL expression.
  3. The logic for handling TTL-expired data is triggered only when MergeTree merges partitions.
  4. When choosing partitions to clean up, a greedy algorithm is used: it tries to find the partition that expired earliest and is also the oldest.
  5. If a column in a partition is removed because its TTL expired, the new partition directory produced by the merge does not contain the data files (.bin and .mrk) of that column.
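The selection rule in step 4 can be pictured with a small Python model. This is a hedged illustration of the idea only, not ClickHouse's actual merge selector. The Part class, the function names, the tie-breaking rule, and the sample directory names and timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str     # partition directory name (hypothetical examples below)
    ttl_min: int  # earliest expiration timestamp recorded in ttl.txt
    ttl_max: int  # latest expiration timestamp recorded in ttl.txt


def pick_for_ttl_merge(parts, now):
    """Greedy choice: among directories with expired data, take the earliest one."""
    expired = [p for p in parts if p.ttl_min <= now]
    if not expired:
        return None
    # Ties on ttl_min are broken by ttl_max (an assumption of this sketch).
    return min(expired, key=lambda p: (p.ttl_min, p.ttl_max))


def fully_expired(part, now):
    """True when every row in the directory has passed its TTL."""
    return part.ttl_max <= now


parts = [
    Part("202006_1_1_0", ttl_min=100, ttl_max=200),
    Part("202006_2_2_0", ttl_min=50, ttl_max=300),
    Part("202007_3_3_0", ttl_min=500, ttl_max=600),
]

chosen = pick_for_ttl_merge(parts, now=150)
print(chosen.name, fully_expired(chosen, now=150))
```

In this model the directory with the smallest recorded min is handled first, and a directory whose max has also passed could be dropped as a whole rather than rewritten.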

Note

  1. The default TTL merge frequency is controlled by the MergeTree setting merge_with_ttl_timeout, whose default is 86400 seconds (one day).
    ClickHouse maintains a dedicated TTL task queue for this, separate from the regular MergeTree merge tasks. Setting the value too low may hurt performance.
    With the default, TTL deletion runs on only one partition every 24 hours, or when a background merge occurs. So in the worst case ClickHouse deletes one partition matching the TTL delete expression at most every 24 hours.
    This may not be ideal. If you want TTL deletes to happen sooner, change the table's merge_with_ttl_timeout setting, for example to one hour:
alter table t_table_ttl  MODIFY SETTING merge_with_ttl_timeout = 3600;

  2. Besides the periodic TTL merge, the optimize command can force a merge.
Trigger a merge of one partition:
optimize table t;
Trigger a merge of all partitions:
optimize table t final;
  3. In this version the TTL declaration cannot be removed, but TTL merge tasks can be stopped and started globally:
system stop TTL MERGES;
system start TTL MERGES;

Related parameters:

The following is a list of parameters and their current default values:

background_move_pool_size:8
background_move_processing_pool_thread_sleep_seconds:10
background_move_processing_pool_thread_sleep_seconds_random_part:1.0
background_move_processing_pool_thread_sleep_seconds_if_nothing_to_do:0.1
background_move_processing_pool_task_sleep_seconds_when_no_work_min:10
background_move_processing_pool_task_sleep_seconds_when_no_work_max:600
background_move_processing_pool_task_sleep_seconds_when_no_work_multiplier:1.1
background_move_processing_pool_task_sleep_seconds_when_no_work_random_part:1.0

Origin blog.csdn.net/qq_21383435/article/details/113531930