ClickHouse OLAP notes (2): ClickHouse core concepts


1 Architecture design

1.1 Single node architecture

 

After installing ClickHouse (CH) on a single node and starting the CH service, you can use the interactive client that ships with CH to work with databases, tables, and the data in those tables.

CH can also connect to other data sources. Depending on the table engine, CH can keep an entire table in memory for fast querying and analysis, or store the data in a directory on the local disk. The storage location and layout of a table differ from engine to engine.

1.2 Cluster architecture

ClickHouse's cluster mode relies on ZooKeeper (ZK). ZK serves two main purposes: executing distributed DDL statements, and synchronizing state between the replicas of ReplicatedMergeTree tables.

2 Basic concepts

2.1  Columnar storage

 

Compared with row storage, column storage has many excellent characteristics in analysis scenarios.

1) As mentioned earlier, analysis scenarios often need to read a large number of rows but only a few columns. In row storage, data is stored contiguously row by row, so all columns of a row sit in the same block. Columns that are not involved in the calculation are still read during I/O, which causes severe read amplification. In column storage, only the columns that take part in the calculation need to be read, which greatly reduces the I/O cost and speeds up the query.

2) The data in the same column belong to the same type, and the compression effect is significant. Column storage often has a compression ratio of up to ten times or even higher, which saves a lot of storage space and reduces storage costs.

3) A higher compression ratio means a smaller data size, and it takes less time to read the corresponding data from the disk.

4) Free choice of compression algorithm. The data in different columns have different data types, and the applicable compression algorithms are also different. You can select the most suitable compression algorithm for different column types.

5) The high compression ratio means that the memory of the same size can store more data, and the system cache effect is better.

Official data show that by using column storage, in some analysis scenarios, an acceleration effect of 100 times or more can be obtained.
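The I/O and compression points above can be illustrated with a small sketch. This is not how ClickHouse lays out data on disk; the three-column table, the byte encodings, and the use of zlib are all assumptions chosen only to make the comparison visible.

```python
import struct
import zlib

# Hypothetical table: (user_id: int32, country: 1-byte code, amount: int32)
rows = [(i, i % 3, i * 10) for i in range(10_000)]

# Row layout: the fields of each record are stored next to each other.
record = struct.Struct("<iBi")  # 9 bytes per record, no padding
row_store = b"".join(record.pack(*r) for r in rows)

# Column layout: each column is stored contiguously on its own.
n = len(rows)
col_user = struct.pack(f"<{n}i", *(r[0] for r in rows))
col_country = bytes(r[1] for r in rows)
col_amount = struct.pack(f"<{n}i", *(r[2] for r in rows))

# Query: SUM(amount).
# Row layout: every record is read in full just to reach one field.
bytes_read_row = len(row_store)
sum_row = sum(record.unpack_from(row_store, off)[2]
              for off in range(0, len(row_store), record.size))

# Column layout: only the amount column is read.
bytes_read_col = len(col_amount)
sum_col = sum(struct.unpack(f"<{n}i", col_amount))

# A column holds values of one type, so repetitive columns compress well.
compressed_country = len(zlib.compress(col_country))

print("bytes read (row layout):   ", bytes_read_row)
print("bytes read (column layout):", bytes_read_col)
print("country column:", len(col_country), "->", compressed_country, "bytes")
```

Both layouts return the same sum, but the column layout reads 40,000 bytes instead of 90,000 for this query. Real compression ratios depend on the data and on the codec chosen for each column.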

2.2  Vectorization (algorithm)

ClickHouse not only stores data in columns, but also performs calculations column by column. Traditional OLTP databases usually compute row by row, because transaction processing is dominated by point lookups and each SQL statement does little computation, so the benefit of implementing these techniques is not obvious. In analysis scenarios, however, a single SQL statement may involve an extremely large amount of computation, and treating each row as the basic unit of processing causes serious performance loss:

1) Corresponding functions must be called for each row of data, and function call overhead takes a high proportion;

2) The storage layer stores data in columns and organizes it in columns in memory, but if the computing layer processes it row by row, it cannot make full use of CPU cache prefetching, which causes frequent CPU cache misses;

3) Row-by-row processing cannot use efficient SIMD instructions.

ClickHouse implements a vectorized execution engine. For columnar data in memory, a SIMD instruction is invoked once per batch rather than once per row, which reduces the number of function calls and cache misses and takes full advantage of the parallelism of SIMD instructions, greatly reducing calculation time. A vectorized execution engine can usually deliver a severalfold performance improvement.

(SIMD stands for Single Instruction, Multiple Data. Multiple operands are packed into one wide register, and a single instruction is applied to all of them at the same time.)
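The difference between row-at-a-time and batch-at-a-time execution can be sketched as follows. Pure Python does not emit SIMD instructions, so this only illustrates the call pattern: the tight loop inside `add_batch` stands in for the loop that a real engine would compile to SIMD code. The function names and the batch size of 1024 are assumptions made for this illustration.

```python
from array import array

calls = {"row": 0, "batch": 0}

def add_row(a, b):
    """Row-at-a-time: one function call per row."""
    calls["row"] += 1
    return a + b

def add_batch(chunk_a, chunk_b):
    """Batch-at-a-time: one call processes a whole chunk of a column."""
    calls["batch"] += 1
    return array("q", (x + y for x, y in zip(chunk_a, chunk_b)))

def run_row_engine(col_a, col_b):
    return array("q", (add_row(a, b) for a, b in zip(col_a, col_b)))

def run_vector_engine(col_a, col_b, batch_size=1024):
    out = array("q")
    for start in range(0, len(col_a), batch_size):
        end = start + batch_size
        out.extend(add_batch(col_a[start:end], col_b[start:end]))
    return out

# Two columns stored as contiguous typed arrays.
col_a = array("q", range(10_000))
col_b = array("q", range(10_000, 20_000))

row_result = run_row_engine(col_a, col_b)
vec_result = run_vector_engine(col_a, col_b)

print("function calls, row engine:   ", calls["row"])
print("function calls, vector engine:", calls["batch"])
```

Both engines produce the same result, but the row engine makes 10,000 calls while the batch engine makes 10.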

2.3  Table

Think of the tables you have learned about in other databases: a table has a name and a structure (schema).

2.4  Sharding

A ClickHouse cluster is composed of shards, and each shard is composed of replicas. This layered concept is common in popular distributed systems. In Elasticsearch, for example, an index consists of shards and replicas, and a replica can be regarded as a special kind of shard. If an index has 5 shards and a replica count of 1, the index has 10 shards in total (each shard has 1 replica).

If you apply the same idea to ClickHouse's sharding, you will probably stumble. Some of ClickHouse's designs are unusual, and clustering and sharding are among them. A few points stand out.

A ClickHouse node can hold only 1 shard, which means that to implement 1 shard with 1 replica, at least 2 server nodes must be deployed.

A shard is only a logical concept (similar to a region in HBase, which is a range of a table's data); physically, it is carried by its replicas.
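The node arithmetic described above can be sketched as follows, assuming the rule stated in this section that each node holds exactly one shard. The function names, the node naming scheme, and the CRC32-modulo routing are hypothetical and are not ClickHouse's own implementation.

```python
from zlib import crc32

def build_cluster(num_shards, copies_per_shard):
    """Map each shard index to the nodes that physically hold it.

    Assumes one shard per node, so a shard with `copies_per_shard`
    extra copies occupies 1 + copies_per_shard nodes.
    """
    cluster = {}
    node_id = 0
    for shard in range(num_shards):
        cluster[shard] = []
        for _ in range(1 + copies_per_shard):
            cluster[shard].append(f"node-{node_id}")
            node_id += 1
    return cluster

def count_nodes(cluster):
    return sum(len(nodes) for nodes in cluster.values())

def pick_shard(sharding_key, num_shards):
    """Hypothetical routing: hash of the key modulo the shard count."""
    return crc32(sharding_key.encode("utf-8")) % num_shards

# 1 shard with 1 copy needs 2 nodes.
small = build_cluster(num_shards=1, copies_per_shard=1)
print(small, count_nodes(small))

# 5 shards with 1 copy each need 10 nodes.
large = build_cluster(num_shards=5, copies_per_shard=1)
print(count_nodes(large))
```

The shard is only the dictionary key here; the data itself would live on the nodes listed under it, which mirrors the statement that a shard is logical and its replicas are physical.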

 

2.5  Partition

ClickHouse supports the PARTITION BY clause. When creating a table, you can partition the data by any legal expression: for example, toYYYYMM() partitions data by month, toMonday() partitions data by week, and a column of Enum type can use each of its values directly as a partition.

This is similar to partitioned tables in Hive.
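How a partition expression groups rows can be sketched as follows. The helpers `to_yyyymm` and `to_monday` are Python stand-ins written for this illustration; they imitate what the ClickHouse functions toYYYYMM() and toMonday() compute, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import date, timedelta

def to_yyyymm(d):
    """Imitates toYYYYMM(): 2020-08-03 becomes 202008."""
    return d.year * 100 + d.month

def to_monday(d):
    """Imitates toMonday(): round a date down to the Monday of its week."""
    return d - timedelta(days=d.weekday())

def partition_rows(rows, key_func):
    """Group rows by the value of the partition expression."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_func(row[0])].append(row)
    return dict(parts)

rows = [
    (date(2020, 8, 3), "a"),   # Monday
    (date(2020, 8, 9), "b"),   # Sunday of the same week
    (date(2020, 8, 10), "c"),  # Monday of the next week
    (date(2020, 9, 1), "d"),   # Tuesday
]

by_month = partition_rows(rows, to_yyyymm)
by_week = partition_rows(rows, to_monday)

print(sorted(by_month))  # [202008, 202009]
print(sorted(by_week))
```

Partitioning by month yields 2 partitions for these rows, while partitioning by week yields 3, so the choice of expression controls how finely the data is split.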

 

2.6  Replica

A replica is a stored copy of the data. Replicas provide high availability in cluster mode.

2.7  Engine

The engine is the table's type. Different engines give tables different characteristics, and the engine determines where and how a table's data is stored.

 

 

 


Origin blog.csdn.net/qq_37933018/article/details/108020062