Now that we have ES (Elasticsearch), why do we still use ClickHouse? A summary, from first principles, of why ClickHouse is so fast

By walking through several of ClickHouse's major characteristics, this article explains the architecture design, core technologies, and operating mechanism of the real-time processing engine behind enterprise deployments handling hundreds of billions of rows.

1 A first look at ClickHouse

1.1 What is ClickHouse

ClickHouse's full name is Click Stream, Data WareHouse: it is a columnar database management system (DBMS) for online analytical processing (OLAP). It was developed by the Russian search engine company Yandex for its own web traffic analytics product, Yandex.Metrica, and later evolved into today's ClickHouse.

1.2 Advantages and disadvantages of ClickHouse

  • Advantages of ClickHouse

ClickHouse is a ROLAP system with online real-time queries, complete DBMS functionality, columnar storage, no need for any data preprocessing, support for batch updates, fairly complete SQL and function support, high availability, no dependency on the complex Hadoop ecosystem, out-of-the-box usability, and many other features.

On a 100-million-row data set, ClickHouse's average response speed is 2.63 times that of Vertica, 17 times that of InfiniDB, 27 times that of MonetDB, 126 times that of Hive, 429 times that of MySQL, and 10 times that of Greenplum. Test results: https://clickhouse.tech/benchmark/dbms/.

Although ClickHouse has so many features and advantages, there are obviously some disadvantages:

  1. No full transaction support;
  2. Sparse indexes mean ClickHouse is not good at fine-grained or key-value style point queries;
  3. No ability to modify or delete existing data at high frequency and low latency; data can only be deleted or modified in batches;
  4. Not good at JOIN operations, and the JOIN syntax is unusual;
  5. Because of its parallel processing model, even a single query can use half of the CPU by default, so high query concurrency is not well supported.

1.3 Who is using ClickHouse

ClickHouse is very well suited to business intelligence (the BI field), and it is also widely used for advertising and web/app traffic analytics, telecommunications, finance, e-commerce, information security, online gaming, the Internet of Things, and many other fields.

ClickHouse is an open-source columnar database that has attracted a lot of attention in recent years, mainly in the data analysis (OLAP) field. The community in China is very active, and major companies have adopted it at scale:

  1. Toutiao uses ClickHouse internally for user behavior analysis. It runs thousands of ClickHouse nodes, with up to 1,200 nodes in a single cluster; total data volume is tens of PB, growing by roughly 300 TB of raw data per day.
  2. Tencent uses ClickHouse internally for game data analysis and has built a complete monitoring and operations system around it.
  3. Ctrip started an internal trial in July 2018, and about 80% of its business now runs on ClickHouse, with more than one billion rows of new data and nearly one million query requests per day.
  4. Kuaishou also uses ClickHouse internally, with roughly 10 PB of total storage and 200 TB added per day; 90% of queries finish in under 3 seconds.

3 Data engines

ClickHouse is an analytical OLAP database. It has the concepts of databases and tables, and both provide different types of engines, so ClickHouse's underlying engines fall into two categories: database engines and table engines.

3.1 Database engine

ClickHouse lets you specify a database engine when creating a database. Five types are currently supported: Ordinary, Dictionary, Memory, Lazy, and MySQL. Ordinary is the default database engine, and under it any type of table engine can be used. A small example follows the list below.

  • Ordinary engine: the default engine; if no engine is specified when creating a database, an Ordinary database is created;

  • Dictionary engine: this database automatically creates tables for all data dictionaries;

  • Memory engine: all data is kept only in memory and disappears after the service restarts; this database engine can only hold Memory-engine tables;

  • MySQL engine: this engine automatically pulls data from a remote MySQL server and creates MySQL-table-engine tables under the database;

  • Lazy engine: keeps tables in memory only for expiration_time_in_seconds after the last access; applicable only to Log-family table engines.
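
To make the database engines above concrete, here is a minimal sketch; the database names, host, and credentials are placeholders rather than anything from the original article:

CREATE DATABASE analytics;                        -- Ordinary is the default, so no engine needs to be specified
CREATE DATABASE cold_logs ENGINE = Lazy(3600);    -- tables stay in RAM only for 3600s after the last access; Log-family tables only
CREATE DATABASE mysql_mirror ENGINE = MySQL('mysql-host:3306', 'shop', 'reader', 'secret');  -- maps the tables of a remote MySQL database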

3.2 Table Engine

Compared with the database engine, the table engine plays a more central role in ClickHouse. It directly determines how data is stored and read in ClickHouse, whether concurrent reads and writes are supported, whether indexes are supported, and so on.

See the official website for details: https://clickhouse.tech/docs/zh/engines/table-engines/

ClickHouse provides roughly 28 table engines across four families (Log, MergeTree, Integration, Special), each with its own purpose. For example, the Log family is intended for analyzing small tables, the MergeTree family for large data volumes, and the Integration family mostly for integrating external data. The Log, Special, and Integration families have relatively limited application scenarios and simple, special-purpose functionality. The MergeTree family, combined orthogonally with two special table engines (Replicated and Distributed), yields a variety of MergeTree table engines with different capabilities.

3.3 MergeTree engine

The MergeTree series is the official main storage engine of ClickHouse, which supports almost all core functions. In this series, the commonly used table engines are: MergeTree, ReplacingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, SummingMergeTree, AggregatingMergeTree, etc.

About the features of MergeTree

The basic MergeTree table engine is mainly used for analyzing massive data sets and supports data partitioning, ordered storage, primary key indexes, sparse indexes, data TTL, and more. MergeTree supports all of ClickHouse's SQL syntax, but some behavior differs from MySQL; for example, the primary key in MergeTree is not used for deduplication.

To address the fact that MergeTree does not deduplicate rows with the same primary key, ClickHouse provides the ReplacingMergeTree engine. ReplacingMergeTree guarantees that data is eventually deduplicated, but it cannot guarantee that primary keys never repeat while queries are running: rows with the same primary key may be sharded to different nodes, compaction only runs within a single node, and the timing of optimize is uncertain.
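
A minimal sketch of ReplacingMergeTree behavior (the table and columns are illustrative): duplicates by sort key survive until a merge happens, which is why OPTIMIZE ... FINAL or SELECT ... FINAL is often used when exact deduplication is required.

CREATE TABLE user_profile
(
    user_id UInt64,
    name    String,
    updated DateTime
)
ENGINE = ReplacingMergeTree(updated)  -- optional version column: the row with the largest value is kept
ORDER BY user_id;

INSERT INTO user_profile VALUES (1, 'old', '2021-09-05 10:00:00');
INSERT INTO user_profile VALUES (1, 'new', '2021-09-05 11:00:00');

OPTIMIZE TABLE user_profile FINAL;    -- force the merge so that the duplicate collapses
SELECT * FROM user_profile FINAL;     -- or deduplicate at query time, at extra cost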

To handle deletion scenarios, the CollapsingMergeTree engine requires a marker column Sign in the table definition (1 when inserting, -1 when deleting). During background compaction, rows with the same primary key and opposite Sign are collapsed, i.e. deleted, removing that limitation of ReplacingMergeTree.
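
A minimal sketch of how the Sign column works (names are illustrative): a state row and its cancel row, written in order, collapse during a background merge.

CREATE TABLE user_actions
(
    user_id UInt64,
    clicks  UInt32,
    Sign    Int8
)
ENGINE = CollapsingMergeTree(Sign)
ORDER BY user_id;

INSERT INTO user_actions VALUES (1, 10, 1);   -- state row
INSERT INTO user_actions VALUES (1, 10, -1);  -- cancel row: the pair is removed during compaction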

To address the problem that CollapsingMergeTree cannot collapse rows correctly when they are written out of order, the VersionedCollapsingMergeTree table engine adds a Version column to the table definition, which records the correspondence between state rows and cancel rows when writes arrive out of order. Rows with the same primary key, the same Version, and opposite Sign are deleted during compaction.

To handle aggregation scenarios, ClickHouse supports pre-aggregation on the primary key columns through SummingMergeTree. During background compaction, multiple rows with the same primary key are summed and replaced with a single row, which greatly reduces storage usage and improves aggregate query performance. Similarly, AggregatingMergeTree pre-aggregates data using aggregate function states (for example, for averages).
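
A minimal sketch of SummingMergeTree (names are illustrative); because merges happen at unpredictable times, queries should still aggregate explicitly:

CREATE TABLE daily_clicks
(
    event_date Date,
    user_id    UInt64,
    clicks     UInt64
)
ENGINE = SummingMergeTree(clicks)     -- columns to pre-sum during merges
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

SELECT event_date, user_id, sum(clicks)
FROM daily_clicks
GROUP BY event_date, user_id;         -- merges only reduce row counts; aggregation at query time is still needed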

MergeTree table creation syntax

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
(
    name1 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    name2 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    ...
) ENGINE = MergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[SETTINGS name=value, ...]

A few key options:

  • PARTITION BY (partition key, optional): specifies how table data is partitioned. The partition key can be a single column, multiple columns, or a column expression. If no partition key is specified, all data goes into a single partition.
  • ORDER BY (sort key, required): specifies how data is sorted within a data part. Usually the same as the primary key.
  • PRIMARY KEY (primary key, optional): when declared, a first-level index is generated from the primary key fields. Usually the same as the sort key, which it defaults to when omitted.
  • SETTINGS (optional): index_granularity defaults to 8192, i.e. MergeTree generates one index entry for every 8192 rows by default.
  • SAMPLE BY (sampling expression, optional): declares how the data is sampled.

Note the important parameters in SETTINGS:

  • index_granularity: defaults to 8192 (= 1024 * 8); it is recommended not to change it.
  • index_granularity_bytes: defaults to 10 MB; it only takes effect when enable_mixed_granularity_parts is enabled.
  • enable_mixed_granularity_parts: enables adaptive index granularity, on by default.
  • merge_with_ttl_timeout: part of the data TTL feature; it controls the minimum interval between TTL merges.

It is worth noting that the MergeTree primary key index is a sparse index (one index entry per granule of index_granularity rows), as opposed to a dense index, which generates an entry for every single row.
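
Putting the options above together, a minimal example of a MergeTree table (the table and its columns are illustrative, not taken from the original article):

CREATE TABLE IF NOT EXISTS visits
(
    event_date Date,
    user_id    UInt64,
    url        String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)     -- one partition per month
ORDER BY (event_date, user_id)        -- sort key; it also serves as the primary key since PRIMARY KEY is omitted
SETTINGS index_granularity = 8192;    -- one sparse-index entry per 8192 rows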

4 How it works

Starting from the requirements of OLAP scenarios, ClickHouse built its own highly efficient columnar storage engine and implemented a rich set of features: ordered data storage, primary key indexes, sparse indexes, data sharding, data partitioning, TTL, primary-replica replication, and more. Together, these features are the foundation of ClickHouse's extremely fast analytical performance.

4.1 Data Partitioning

Physically, data partitioning means splitting all the data of a table into multiple subdirectories according to some dimension. Unlike databases that append writes to existing files, MergeTree writes each batch of inserted data into a new set of partition directories. At some later point (10 to 15 minutes after the write, or when an OPTIMIZE statement is run manually), ClickHouse merges the directories belonging to the same partition into a new directory through a background task. The old partition directories are not deleted immediately; they are removed by a background task later (after 8 minutes by default).

Naming Rules for Partition Folders

A partition directory is named PartitionID_MinBlockNum_MaxBlockNum_Level:

1) PartitionID: e.g. 20210905; the partition ID is generated according to the following rules:

  • No partition key: if no partition key is specified when the table is created, the data is not partitioned and everything is written into a single default partition named all.

  • Integer type: if the partition key is an integer that cannot be converted to the YYYYMMDD date format, the string form of that integer is used directly as the partition ID.

  • Date type: if the partition key is a date, or an integer that can be converted to the YYYYMMDD format, the value produced by the partition expression is used as the partition ID.

  • Other types: if the partition key is neither an integer nor a date (e.g. String, Float), a 128-bit hash of its value is used as the partition ID.

2) MinBlockNum: e.g. 1; the minimum block number. MergeTree starts counting from 1 and increments it with every new part;

3) MaxBlockNum: e.g. 1; the maximum block number. For newly inserted data, the minimum and maximum block numbers are equal;

4) Level: e.g. 0; this can be understood as the number of merges. Newly inserted parts all start at 0, and each merge adds 1.
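
These names can be inspected directly in the built-in system.parts table; assuming the illustrative visits table from earlier (partitioned by toYYYYMM(event_date)), a freshly inserted part would appear with a name like 202109_1_1_0:

SELECT partition_id, name, min_block_number, max_block_number, level
FROM system.parts
WHERE table = 'visits' AND active;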

Merge rules for partitioned folders

When several directories belonging to the same partition are merged, the new directory keeps the same PartitionID, takes the smallest MinBlockNum and the largest MaxBlockNum among the merged directories, and increments Level by 1.

The meaning of the files in the partition folder

(This describes the newer MergeTree on-disk format with compact parts, where a single data.bin and data.mrk3 file hold all columns; in the older wide format, each column has its own .bin and .mrk2 file.)

  • checksums.txt: the checksum file, stored in binary format. It records the sizes and hashes of the other files (primary.idx, count.txt, etc.) and is used to quickly verify their integrity and correctness.

  • columns.txt: the column information file, stored as plain text. It records the column field definitions of this data partition.

  • count.txt: the count file, stored as plain text. It records the total number of rows in the current partition directory.

  • primary.idx: the primary key (sparse) index file.

  • xxx.bin: the data files, stored compressed (LZ4 by default). Each column's data goes into its own file; for example, the date column is stored in date.bin.

  • xxx.mrk2: the column mark files. When adaptive index granularity is used, the mark file is named .mrk2; otherwise it is named .mrk.

  • There are also files for secondary (skip) indexes and partition-key related information, among others.

These will not be expanded on here; they are covered in detail later.

4.2 Column storage

Because OLAP workloads generally aggregate over a large number of rows but a small number of columns, columnar storage is essentially a must, and it has the following advantages over row storage:

  • In analytical scenarios, it is often necessary to read a large number of rows but only a few columns. With row-oriented storage, data is stored contiguously by row, all columns sit in the same block, and columns that take no part in the computation are read anyway, so read I/O is severely amplified. With column-oriented storage, only the columns involved in the computation are read, which greatly reduces I/O cost and speeds up queries.

  • Data in the same column has the same type, so the compression ratio is high and compression works remarkably well. Column storage often achieves compression ratios of ten times or more, saving a lot of storage space and reducing storage cost.

  • A higher compression ratio means smaller data, so reading it from disk takes less time. It also means that the same amount of memory can hold more data, so the system cache is more effective.

  • The compression algorithm can be chosen freely per column, so the most suitable algorithm can be picked for each column type.


4.3 Primary index

After PRIMARY KEY defines the primary key of a MergeTree table, a first-level index is generated at index_granularity intervals (8192 rows by default) and saved to the primary.idx file. The first-level index is a sparse index: its advantage is that a small number of index marks can record the locations of large ranges of data. In ClickHouse, the first-level index resides in memory. In general, first-level index entries and mark-file entries are aligned one to one, and the data between two index marks forms a data interval; in the data file, all the data in such an interval is written as one compressed data block.


Of course, the sparse index also has a drawback: it does not deduplicate. To achieve deduplication, it has to be combined with a specific table engine such as ReplacingMergeTree, CollapsingMergeTree, or VersionedCollapsingMergeTree.

4.4 Secondary Index

ClickHouse's secondary indexes are also called skip indexes, and their purpose is the same as the primary index: to narrow the search range. Secondary indexes are disabled by default, however, and their granularity is controlled by the granularity parameter. Once enabled, skp_idx_[Column].idx and skp_idx_[Column].mrk files are generated in the partition directory. The generation rule is also very simple: one secondary index entry is generated for every granularity * index_granularity rows of data.

A table can declare multiple secondary indexes, and several types are supported: minmax (min/max values), set (distinct value set), ngrambf_v1 (n-gram tokenized Bloom filter), and tokenbf_v1 (punctuation-tokenized Bloom filter). A small example follows the list below.

  • minmax: per index granule, stores the min and max values of the specified expression; in equality and range queries it helps skip blocks that cannot match, reducing IO.

  • set(max_rows): per index granule, stores the set of distinct values of the specified expression, used to quickly determine whether an equality query can hit the block, reducing IO.

  • ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): splits strings into n-grams and builds a Bloom filter, which can speed up equality, LIKE, and IN conditions.

  • tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): similar to ngrambf_v1, except that it tokenizes on punctuation rather than n-grams.

  • bloom_filter([false_positive]): builds a Bloom filter over the specified column to speed up equality, LIKE, and IN conditions.
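
A minimal sketch of declaring skip indexes on the illustrative visits table (index names and parameters are arbitrary examples):

ALTER TABLE visits ADD INDEX idx_uid user_id TYPE minmax GRANULARITY 4;
ALTER TABLE visits ADD INDEX idx_url url TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4;

-- Indexes added by ALTER only cover newly written parts until existing parts are rebuilt:
ALTER TABLE visits MATERIALIZE INDEX idx_url;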


4.5 Data Compression

column.bin files are ClickHouse's data storage files, each holding the data of one column. Because a column contains values of a single type, it can be compressed conveniently and efficiently. A compressed data block consists of header information and compressed data.

  • The header is a fixed 9 bytes: one UInt8 (1 byte) and two UInt32 (4 bytes each) integers, representing respectively the compression algorithm used, the size of the compressed data, and the size of the data before compression.

  • The size of each compressed data block is strictly kept between 64 KB and 1 MB of pre-compression data; the lower and upper bounds are set by the min_compress_block_size (default 65536 = 64 KB) and max_compress_block_size (default 1048576 = 1 MB) parameters.

Specific compression rules

  1. Single batch size < 64 KB: if a single batch of data is smaller than 64 KB, the next batch is read and accumulated until the total reaches 64 KB, and then a compressed data block is generated. If the average record is smaller than 8 bytes, several batches end up compressed into one block;

  2. Single batch size between 64 KB and 1 MB: a compressed data block is generated directly;

  3. Single batch size > 1 MB: the data is first cut at the 1 MB boundary to produce a compressed data block, and the remainder continues to follow the rules above, so one batch may produce several compressed blocks. If the average record exceeds 128 bytes, the current batch is compressed into multiple blocks.
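
The effect of columnar compression can be checked per column through the built-in system.columns table (visits is the illustrative table used above):

SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS on_disk,
    formatReadableSize(data_uncompressed_bytes) AS in_memory,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE table = 'visits' AND data_compressed_bytes > 0;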


4.6 Data labeling

Data mark files correspond one to one with .bin files; they are the mapping between the primary index and the data. That is, each column's [Column].bin file has a corresponding [Column].mrk2 mark file that records the offsets of that column's data inside the .bin file. One row of mark data is a tuple of two integer offsets: the starting offset of the compressed data block (containing this data interval) within the .bin file, and the starting offset of the data within the block after decompression. Unlike primary index data, mark data does not reside permanently in memory; instead, an LRU caching strategy is used to speed up access to it.

So the data reading process is:

  1. First, use the first-level index to find the corresponding compressed-block information in the mark file, locate that compressed block in the [Column].bin file, read it, and decompress it;
  2. From the decompressed block, load data into memory at index_granularity granularity and execute the query until the result data is found.

5 Query process

The essence of a data query can be seen as a process of continually narrowing the data range. In the ideal case, MergeTree first uses the partition index, the primary index, and the secondary indexes in turn to shrink the data scan range as much as possible, and then uses the data marks to shrink the range that must be decompressed and computed to a minimum.


  1. Identify the partition(s);
  2. Identify the column files ([Column].bin);
  3. Use the primary index (primary.idx) to locate an entry in the mark file ([Column].mrk2);
  4. Scan the mark file of the corresponding column to obtain two offsets;
  5. Use the first offset (the offset of the compressed data block within the .bin file) to locate a compressed data block;
  6. Read that block into memory and decompress it;
  7. Use the second offset (the offset of the data within the decompressed block) to find the target data in memory.

Of course, the above is the ideal case. If a query hits no index at all (partition, primary, or secondary), MergeTree cannot narrow the data range in advance. Even then, ClickHouse can still use the data marks to read multiple compressed data blocks in parallel with multiple threads to improve query speed.
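
On newer ClickHouse releases, EXPLAIN can show which of these indexes actually pruned the scan. A sketch against the illustrative visits table (availability of the indexes option depends on the server version):

EXPLAIN indexes = 1
SELECT count()
FROM visits
WHERE event_date = '2021-09-05' AND user_id = 42;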

References

Why is ClickHouse so fast?
How Clickhouse works
