[Basics] ClickHouse table engine integrated with Kafka



0. Preface

To make integration with Kafka easier, ClickHouse provides a dedicated table engine called the Kafka engine. It lets you create a table in ClickHouse whose data comes from one or more Kafka topics. Combining the Kafka engine with materialized views enables real-time data processing and querying: data is consumed from Kafka and stored into tables backed by other engines.

1. Integration example

To create a Kafka engine table, you need to provide the following key parameters:

  1. kafka_broker_list: Kafka broker address list, comma-separated string.
  2. kafka_topic_list: Kafka topic (or comma-separated list of topics) to subscribe to.
  3. kafka_group_name: consumer group name, used to identify the consumer group to which the ClickHouse instance belongs.
  4. kafka_format: Message format, used to specify how to parse messages in Kafka into table rows, such as JSONEachRow, etc.

Example of creating a Kafka engine table:

CREATE TABLE kafka_table
(
    column1 String,
    column2 UInt64,
    column3 Float64
) ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka1:9092,kafka2:9092',
    kafka_topic_list = 'kafka_topic_name',
    kafka_group_name = 'clickhouse_group',
    kafka_format = 'JSONEachRow';
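
With kafka_format = 'JSONEachRow', each Kafka message is expected to be one JSON object per line whose fields match the table columns. As a quick sanity check, a sample payload can be parsed locally with the format table function (available in newer ClickHouse releases); the values below are made up for illustration:

-- Hypothetical sample message; each field corresponds to a column of kafka_table.
SELECT *
FROM format(JSONEachRow, '{"column1": "abc", "column2": 1695000000, "column3": 3.14}');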

To consume data from a Kafka table and store it into a table that uses another engine (such as MergeTree), create a materialized view, for example:

CREATE MATERIALIZED VIEW mv_kafka_to_storage
ENGINE = MergeTree
-- column2 is assumed to hold a Unix timestamp, so it is converted before partitioning
PARTITION BY toYYYYMMDD(toDateTime(column2))
ORDER BY (column1, column2)
AS SELECT
    column1,
    column2,
    column3
FROM kafka_table;

Using the Kafka engine together with materialized views, you can consume, process, and query data in near real time in ClickHouse, greatly improving data-processing efficiency.
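
Because the view above defines its own MergeTree storage, it can be queried directly once messages start arriving. A minimal check against the hypothetical columns from the example might look like this:

SELECT
    column1,
    count() AS rows,
    max(column2) AS latest_timestamp
FROM mv_kafka_to_storage
GROUP BY column1
ORDER BY rows DESC
LIMIT 10;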


Official tutorial

This engine is used in conjunction with Apache Kafka.

Kafka characteristics:

  • Publish or subscribe to data streams.
  • Fault-tolerant storage mechanism.
  • Process streaming data.

Old version format:

    Kafka(kafka_broker_list, kafka_topic_list, kafka_group_name, kafka_format
          [, kafka_row_delimiter, kafka_schema, kafka_num_consumers])

New version format:

    Kafka SETTINGS
      kafka_broker_list = 'localhost:9092',
      kafka_topic_list = 'topic1,topic2',
      kafka_group_name = 'group1',
      kafka_format = 'JSONEachRow',
      kafka_row_delimiter = '\n',
      kafka_schema = '',
      kafka_num_consumers = 2

Required parameters:

  • kafka_broker_list – Comma-separated list of brokers (e.g. localhost:9092).
  • kafka_topic_list – List of topics (e.g. my_topic).
  • kafka_group_name – Kafka consumer group name (e.g. group1). If you don't want messages to be duplicated across the cluster, use the same group name everywhere.
  • kafka_format – Message format. Uses the same notation as the SQL FORMAT clause, e.g. JSONEachRow. For details, see the Formats section.

Optional parameters:

  • kafka_row_delimiter – Delimiter between message bodies (records).
  • kafka_schema – Required if the format needs a schema definition. For example, Cap'n Proto requires the path to the schema file and the name of the root object, e.g. schema.capnp:Message.
  • kafka_num_consumers – Number of consumers per table. Default: 1. Specify more consumers if the throughput of one consumer is insufficient. The total number of consumers should not exceed the number of partitions in the topic, since only one consumer can be assigned to each partition.

Example 1:

  CREATE TABLE queue (
    timestamp UInt64,
    level String,
    message String
  ) ENGINE = Kafka('localhost:9092', 'topic', 'group1', 'JSONEachRow');

  SELECT * FROM queue LIMIT 5;

  CREATE TABLE queue2 (
    timestamp UInt64,
    level String,
    message String
  ) ENGINE = Kafka SETTINGS kafka_broker_list = 'localhost:9092',
                            kafka_topic_list = 'topic',
                            kafka_group_name = 'group1',
                            kafka_format = 'JSONEachRow',
                            kafka_num_consumers = 4;

  CREATE TABLE queue3 (
    timestamp UInt64,
    level String,
    message String
  ) ENGINE = Kafka('localhost:9092', 'topic', 'group1')
              SETTINGS kafka_format = 'JSONEachRow',
                       kafka_num_consumers = 4;

Consumed messages are tracked automatically, so each message in a group is counted only once. If you want to receive the data twice, create a copy of the table with a different group name.
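
For instance, a second consumer table subscribed to the same topic but with a different group name receives its own independent copy of the stream (a sketch based on the queue table from Example 1):

  -- Same topic, different consumer group: the data is delivered once per group.
  CREATE TABLE queue_copy (
    timestamp UInt64,
    level String,
    message String
  ) ENGINE = Kafka('localhost:9092', 'topic', 'group2', 'JSONEachRow');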

Consumer groups can be configured flexibly and are synchronized across the cluster. For example, if there are 10 topics and 5 table replicas in the cluster, each replica will get 2 topics. If the number of replicas changes, topics are automatically redistributed among the replicas. For more information, visit http://kafka.apache.org/intro.

SELECT queries are not very useful for reading messages (except for debugging), because each message can be read only once. It is more practical to build real-time pipelines with materialized views. To do this:

  1. Use the engine to create a Kafka consumer table and treat it as a data stream.
  2. Create a table with the desired structure.
  3. Create a materialized view that converts the data from the engine in the background and puts it into the previously created table.

When a MATERIALIZED VIEW is attached to the engine, it starts collecting data in the background. This lets you continuously receive messages from Kafka and convert them into the required format with SELECT.

Example 2:

  CREATE TABLE queue (
    timestamp UInt64,
    level String,
    message String
  ) ENGINE = Kafka('localhost:9092', 'topic', 'group1', 'JSONEachRow');

  CREATE TABLE daily (
    day Date,
    level String,
    total UInt64
  ) ENGINE = SummingMergeTree() ORDER BY (day, level);

  CREATE MATERIALIZED VIEW consumer TO daily
    AS SELECT toDate(toDateTime(timestamp)) AS day, level, count() as total
    FROM queue GROUP BY day, level;

  SELECT level, sum(total) FROM daily GROUP BY level;

To improve performance, received messages are grouped into blocks of up to max_insert_block_size rows. If a block is not formed within stream_flush_interval_ms milliseconds, the data is flushed to the table regardless of whether the block is complete.
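
Both values are ordinary user-level settings, so one way to tune them is in a settings profile in users.xml; the numbers below are placeholders rather than recommendations:

  <profiles>
    <default>
      <!-- flush a partially filled block after 7.5 seconds -->
      <stream_flush_interval_ms>7500</stream_flush_interval_ms>
      <!-- upper bound on the number of rows in a block formed from consumed messages -->
      <max_insert_block_size>1048576</max_insert_block_size>
    </default>
  </profiles>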

To stop receiving topic data or change transformation logic, detach the materialized view:

  DETACH TABLE consumer;
  ATTACH TABLE consumer;

If you want to change the target table with ALTER, it is recommended to disable the materialized view first, to avoid discrepancies between the data in the target table and the data produced by the view.
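
A typical sequence for the daily table from Example 2 might look like this (the added column is hypothetical):

  -- Stop consumption, alter the target table, then resume.
  DETACH TABLE consumer;
  ALTER TABLE daily ADD COLUMN source String DEFAULT '';
  ATTACH TABLE consumer;

If the new column should actually be filled from Kafka, the view also has to be recreated with a SELECT that produces it.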

Configuration

Similar to GraphiteMergeTree, the Kafka engine supports extended configuration via the ClickHouse configuration file. Two configuration keys can be used: global (kafka) and topic-level (kafka_<topic_name>). The global configuration is applied first, then the topic-level configuration (if it exists).

  <!-- Global configuration options for all tables of Kafka engine type -->
  <kafka>
    <debug>cgrp</debug>
    <auto_offset_reset>smallest</auto_offset_reset>
  </kafka>

  <!-- Configuration specific for topic "logs" -->
  <kafka_logs>
    <retry_backoff_ms>250</retry_backoff_ms>
    <fetch_min_bytes>100000</fetch_min_bytes>
  </kafka_logs>

For a detailed list of configuration options, see the librdkafka configuration reference. Use underscores (_) instead of dots (.) in the ClickHouse configuration. For example, check.crcs=true becomes <check_crcs>true</check_crcs>.

Kerberos support

For Kafka secured with Kerberos, it is enough to set security_protocol to sasl_plaintext, provided a Kerberos ticket is obtained and cached by the operating system.
ClickHouse can also maintain Kerberos credentials itself using keytab files. To do so, configure the sasl_kerberos_service_name, sasl_kerberos_keytab, and sasl_kerberos_principal child elements.

Example:

  <!-- Kerberos-aware Kafka -->
  <kafka>
    <security_protocol>SASL_PLAINTEXT</security_protocol>
    <sasl_kerberos_keytab>/home/kafkauser/kafkauser.keytab</sasl_kerberos_keytab>
    <sasl_kerberos_principal>kafkauser/kafkahost@EXAMPLE.COM</sasl_kerberos_principal>
  </kafka>

Virtual columns

  • _topic – Kafka topic.
  • _key – Message key.
  • _offset – Message offset.
  • _timestamp – Message timestamp.
  • _timestamp_ms – Message timestamp in milliseconds.
  • _partition – Partition of the Kafka topic.
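
Virtual columns can be selected alongside regular columns, which is handy when checking which partition and offset a row came from (using the queue table from the earlier examples):

  -- Read a few messages together with their Kafka metadata.
  -- Debugging only: reading from the Kafka table consumes the messages.
  SELECT _topic, _partition, _offset, _timestamp, level, message
  FROM queue
  LIMIT 5;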

Reference documentation

  • ClickHouse Kafka Engine: https://clickhouse.tech/docs/en/engines/table-engines/integrations/kafka/
  • ClickHouse + Kafka — How to Build Real-Time Data Pipelines: https://medium.com/@coderunner/debugging-with-git-7afbcd3b9f1e
