ClickHouse basic usage summary

View system configuration

View the system tables

select * from system.clusters;

Verify ZooKeeper
# Verify that ZooKeeper is correctly configured for the current ClickHouse cluster

SELECT * FROM system.zookeeper WHERE path = '/clickhouse';
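You can also check the per-node macros that replicated tables rely on (the macro values themselves are defined in each node's configuration):

SELECT * FROM system.macros;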

Create tables

Create a local table

The MergeTree engine by itself does not synchronize replicas. If ReplicatedMergeTree is specified, data is synchronized to the corresponding replicas; in practice, distributed table setups are usually built on Replicated* local tables.

The distributed table itself does not store data; storage is actually done by the local table t_cluster. The dist_t_cluster table only acts as a proxy.

If the other nodes pick up the table structure after the table is created on any one node, the cluster configuration is in effect.
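For reference, a minimal sketch of a replicated local table, assuming the usual {shard} and {replica} macros are defined on every node (the table name and ZooKeeper path here are illustrative):

CREATE TABLE default.t_cluster_replicated ON CLUSTER clickhouse_cluster (
    id Int16,
    name String,
    birth Date
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/t_cluster_replicated', '{replica}')
PARTITION BY toYYYYMM(birth)
ORDER BY id;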

Using ReplacingMergeTree

CREATE TABLE default.test ON CLUSTER clickhouse_cluster
(
    name String DEFAULT 'lemonNan' COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'official account',
    my_time DateTime64(3, 'UTC') COMMENT 'time'
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(my_time)
ORDER BY my_time

Using MergeTree

CREATE TABLE t_cluster ON CLUSTER clickhouse_cluster (
    id Int16,
    name String,
    birth Date
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(birth)
ORDER BY id

When should you use the ReplacingMergeTree engine, and when the MergeTree engine?
Both ReplacingMergeTree and MergeTree are commonly used ClickHouse engine types, but their applicable scenarios differ slightly.

The ReplacingMergeTree engine is suitable for scenarios where updated or duplicated rows should eventually be collapsed into one. It does not deduplicate at insert time; during background merges it keeps a single row among those sharing the same sorting key (the most recently inserted one, or the one with the largest version column if one is specified). Until a merge runs, duplicates remain visible, so deduplication is eventual rather than immediate.

The MergeTree engine is suitable for most general OLAP scenarios. It sorts and compresses data efficiently and supports partitioning by one or more columns, enabling fast queries and aggregation. MergeTree is well suited to large data volumes and highly concurrent query loads.

Therefore, if duplicate rows with the same key should eventually be replaced by the latest version, use ReplacingMergeTree; if you only need regular append-only OLAP queries, use MergeTree.

On what basis does ClickHouse decide whether a new row duplicates an existing one?

With the ReplacingMergeTree engine, ClickHouse deduplicates by the table's sorting key, i.e. the ORDER BY expression (which, unless declared separately, also serves as the primary key). Rows whose sorting-key values are identical are treated as duplicates of one another.

The engine optionally takes a version column, e.g. ReplacingMergeTree(ver): during a merge, the row with the largest version value is kept; without a version column, the last inserted row in the merging set survives.

Note that deduplication only happens within a partition and only when parts are merged, so until a merge (or a query with FINAL) runs, duplicates remain visible; the comparison of key values is exact, so pay attention to data precision and formatting.
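A minimal sketch of the version-column form (the table name is illustrative; my_time doubles as the version, so the row with the latest timestamp survives a merge):

CREATE TABLE default.test_versioned ON CLUSTER clickhouse_cluster
(
    name String,
    my_time DateTime64(3, 'UTC')
) ENGINE = ReplacingMergeTree(my_time)
PARTITION BY toYYYYMM(my_time)
ORDER BY name;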

For example, the following table specifies the name field as the primary key:

CREATE TABLE default.test88 ON CLUSTER clickhouse_cluster
(
    name String NOT NULL COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'official account',
    my_time DateTime64(3, 'UTC') COMMENT 'time',
    PRIMARY KEY (name)
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(my_time)
ORDER BY name
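To observe the replacement behavior, you can insert two rows with the same sorting-key value and force the merge that normally runs in the background (a sketch; both rows fall into the current month's partition):

INSERT INTO default.test88 VALUES ('lemonNan', 18, 'lemonCode', now64(3, 'UTC'));
INSERT INTO default.test88 VALUES ('lemonNan', 19, 'lemonCode', now64(3, 'UTC'));
OPTIMIZE TABLE default.test88 FINAL;   -- force the merge
SELECT * FROM default.test88;          -- a single row remains for name = 'lemonNan'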

Create a distributed table

CREATE TABLE default.dist_t_cluster ON CLUSTER clickhouse_cluster AS t_cluster
ENGINE = Distributed(clickhouse_cluster, default, t_cluster, rand());

Insert test data

Insert a few entries, then query the distributed table on any node to see the data.

insert into dist_t_cluster values(1, 'aaa', '2021-02-01'), (2, 'bbb', '2021-02-02');
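To verify the proxy behavior, compare the distributed table with a local one: the distributed table gathers rows from every shard, while a local table holds only the current node's share.

SELECT * FROM dist_t_cluster;   -- all rows, from every shard
SELECT * FROM t_cluster;        -- only the rows stored on this node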

The creation template of the distributed table engine:

Distributed(clusterName, databaseName, tableName[, sharding_key])

1. Cluster identifier (clusterName). Note that this is not the macro used in replicated tables, but the cluster name defined in <remote_servers>.
2. The database of the local table (databaseName).
3. The name of the local table (tableName).
4. (Optional) sharding key (sharding_key).

Together with the shard weight configured in config.xml, this key determines the routing when writing to the distributed table, i.e. which physical table each row ultimately lands on. It can be the raw value of a column (such as site_id) or the result of a function call, such as rand() in the SQL above. The key should distribute data as evenly as possible; another common choice is the hash of a high-cardinality column, such as intHash64(user_id).
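For instance, a hypothetical table with a user_id column could be sharded by its hash (default.events and user_id are illustrative names, not from this article):

CREATE TABLE default.dist_events ON CLUSTER clickhouse_cluster AS default.events
ENGINE = Distributed(clickhouse_cluster, default, events, intHash64(user_id));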

Distributed DDL

DDL operations such as creating and dropping tables are tedious in a ClickHouse cluster: you would have to log in to every node and execute the DDL statement there. How can this be simplified?

ClickHouse supports cluster mode: add ON CLUSTER <cluster_name> to a DDL statement, and a single execution runs the statement on every instance in the cluster, which is simple and convenient.

A cluster has one or more nodes. DDL statements such as CREATE, ALTER, DROP, RENAME, and TRUNCATE all support distributed execution: a DDL statement executed on any node is executed by every node in the cluster, in the same order. This saves you from running the same DDL on each node one by one.
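For example, a single ALTER with ON CLUSTER changes the table on every node (a sketch against the t_cluster table created above):

-- executed once, applied on all nodes; DROP, RENAME and TRUNCATE work the same way
ALTER TABLE default.t_cluster ON CLUSTER clickhouse_cluster ADD COLUMN city String DEFAULT '';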
 

Source: ClickHouse cluster distributed DDL usage examples and pitfalls (java programming art blog, CSDN)

Partitions

The data in a table can be partitioned by a specified field, and each partition exists as a directory on the file system. A time field is the most common partition field: a table with a large volume of data can be partitioned by hour, while a table with less data can be partitioned by day or month. Using the partition field in the WHERE condition of a query effectively filters out large amounts of non-matching data.

Partition by a field

create table partition_table_test(
    id UInt32,
    name String,
    city String
) engine = MergeTree()
partition by city
order by id;
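A query that filters on the partition column can then skip entire partition directories ('Beijing' is illustrative sample data):

SELECT count() FROM partition_table_test WHERE city = 'Beijing';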

Partition by time

CREATE TABLE default.test ON CLUSTER clickhouse_cluster
(
    name String DEFAULT 'lemonNan' COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'official account',
    my_time DateTime64(3, 'UTC') COMMENT 'time'
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(my_time)
ORDER BY my_time

# Query partition information

select database, table, partition, partition_id, name, path from system.parts where database = 'data_sync' and table = 'test';

# Delete a partition

alter table data_sync.test drop partition '202203'
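Besides dropping, a partition can also be detached (the data stays on disk in the detached directory) and re-attached later:

ALTER TABLE data_sync.test DETACH PARTITION '202203';
ALTER TABLE data_sync.test ATTACH PARTITION '202203';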

Source: ClickHouse data tables and basic partition operations (Bulut0907's blog, CSDN)

Synchronization between ClickHouse and Kafka

Synchronization flowchart

[Flowchart: Kafka topic → Kafka engine table → materialized view → data table]

Data table


# create data table

CREATE DATABASE IF NOT EXISTS data_sync;
CREATE TABLE IF NOT EXISTS data_sync.test
(
    name String DEFAULT 'lemonNan' COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'official account',
    my_time DateTime64(3, 'UTC') COMMENT 'time'
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(my_time)
ORDER BY my_time

engine table


# Create kafka engine table, address: 172.16.16.4, topic: lemonCode

CREATE TABLE IF NOT EXISTS data_sync.test_queue(
    name String,
    age int,
    gongzhonghao String, 
    my_time DateTime64(3, 'UTC')
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = '172.16.16.4:9092',
  kafka_topic_list = 'lemonCode',
  kafka_group_name = 'lemonNan',
  kafka_format = 'JSONEachRow',
  kafka_row_delimiter = '\n',
  kafka_schema = '',
  kafka_num_consumers = 1

Note: two of the settings above are particularly important.

The settings kafka_thread_per_consumer and kafka_num_consumers together determine how the Kafka data is consumed:

  • kafka_thread_per_consumer: when set to 1, each consumer gets its own thread and flushes data independently, in parallel; the default is 0, in which case rows from all consumers are squashed into shared blocks. Enable it together with multiple consumers to increase consumption speed.

  • kafka_num_consumers: the number of Kafka consumers created for the table. The default is 1. Increasing it raises consumption speed, but consumers beyond the number of topic partitions will sit idle.

It should be noted that increasing the number of consumer threads and the number of consumers may increase the consumption speed, but it will also increase the load on the ClickHouse server. Therefore, it is necessary to make a trade-off according to the actual situation when setting these parameters. At the same time, appropriate adjustments need to be made according to factors such as the number of Kafka partitions and the size of the data.
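As a sketch, the same engine table with higher parallelism might look like this (the table name is illustrative, and it assumes the lemonCode topic has at least 4 partitions):

CREATE TABLE IF NOT EXISTS data_sync.test_queue_parallel(
    name String,
    age int,
    gongzhonghao String,
    my_time DateTime64(3, 'UTC')
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = '172.16.16.4:9092',
  kafka_topic_list = 'lemonCode',
  kafka_group_name = 'lemonNan',
  kafka_format = 'JSONEachRow',
  kafka_num_consumers = 4,
  kafka_thread_per_consumer = 1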


Materialized view


# create materialized view

CREATE MATERIALIZED VIEW IF NOT EXISTS data_sync.test_mv TO data_sync.test AS 
SELECT name, age, gongzhonghao, my_time FROM data_sync.test_queue;

kafka subscription ideas

1. Kafka data can be sent to distributed tables to take advantage of cluster storage.

Create a local table:

CREATE TABLE data_sync.test ON CLUSTER clickhouse_cluster
(
    name String NOT NULL COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'official account',
    my_time DateTime64(3, 'UTC') COMMENT 'time'
) ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(my_time)
ORDER BY name

Create a distributed table:

CREATE TABLE data_sync.test_dist ON CLUSTER clickhouse_cluster AS data_sync.test
ENGINE = Distributed(clickhouse_cluster, data_sync, test, rand());

Create the engine table:

CREATE TABLE IF NOT EXISTS data_sync.test_queue(
    name String,
    age int,
    gongzhonghao String, 
    my_time DateTime64(3, 'UTC')
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = '172.16.16.4:9092',
  kafka_topic_list = 'lemonCode',
  kafka_group_name = 'lemonNan',
  kafka_format = 'JSONEachRow',
  kafka_row_delimiter = '\n',
  kafka_schema = '',
  kafka_num_consumers = 1

Materialized view:

Note that the target table here is the distributed table.

CREATE MATERIALIZED VIEW IF NOT EXISTS data_sync.test_mv TO data_sync.test_dist AS 
SELECT name, age, gongzhonghao, my_time FROM data_sync.test_queue;

 2. Engine tables and materialized views can be created on multiple nodes; because they share the same kafka_group_name, those nodes consume the topic's partitions in parallel.

data simulation


The following simulates the data flow shown in the flowchart. If you already have Kafka installed, skip the installation steps.

Install Kafka
Kafka here is a single-node installation, used only for demonstration.

# start zookeeper

docker run -d --name zookeeper -p 2181:2181  wurstmeister/zookeeper


# Start kafka; the IP address in KAFKA_ADVERTISED_LISTENERS is the host machine's IP

docker run -d --name kafka -p 9092:9092 -e KAFKA_BROKER_ID=0 -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 --link zookeeper -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.16.16.4:9092 -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092 -t wurstmeister/kafka
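
# If the broker does not auto-create topics, create lemonCode first (a sketch using the Kafka CLI bundled in the image)

kafka-topics.sh --create --bootstrap-server 172.16.16.4:9092 --topic lemonCode --partitions 1 --replication-factor 1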


Use the kafka command to send data
# Start the producer and send a message to topic lemonCode

kafka-console-producer.sh --bootstrap-server 172.16.16.4:9092 --topic lemonCode


# Send the following message

{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 18:00:00.001"}
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 18:00:00.001"}
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 18:00:00.002"}
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 23;59:59.002"}


View the ClickHouse data table

select * from data_sync.test;
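Because the table uses ReplacingMergeTree ordered by my_time, the messages sharing a timestamp are duplicates of each other; adding FINAL collapses them at query time, before the background merge runs:

SELECT * FROM data_sync.test FINAL;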

source:

https://www.cnblogs.com/wuhaonan/p/15978470.html

other commands

If you need to modify the table structure or change the target table, stop the data subscription first and start it again after the changes are complete.

Note: if the engine table exists on multiple nodes, the command must be executed on each of them.

# Stop the subscription

DETACH TABLE data_sync.test_queue;

# Start the subscription

ATTACH TABLE data_sync.test_queue;

Delete the view:

drop view data_sync.test_mv;


Origin: blog.csdn.net/csdncjh/article/details/131007974