Introduction to ClickHouse, an open-source columnar database

Table of contents

1. What is ClickHouse?

2. Why is ClickHouse fast?

1. I/O level

2. CPU instruction-set level

3. Single-node parallel reading level

4. Distributed level

3. Introduction to ClickHouse's table engine (ClickHouse's storage engine)

4. ClickHouse's table engine-MergeTree

5. ClickHouse clusters

5.1.1 Replication via multi-write with the Distributed table engine (1)

5.1.2 Replication via multi-write with the Distributed table engine (2) – configuration example

5.1.3 Replication via multi-write with the Distributed table engine (3) – reading data through the Distributed table

5.1.4 Replication via multi-write with the Distributed table engine (4) – writing data through the Distributed table

5.2.1 Replication via replicated tables with ZooKeeper (1)

5.2.2 Replication via replicated tables with ZooKeeper (2) – configuring the cluster topology

5.2.3 Replication via replicated tables with ZooKeeper (3) – creating the local ReplicatedMergeTree tables and the Distributed table

5.2.4 Replication via replicated tables with ZooKeeper (4) – reading data through the Distributed table

5.2.5 Replication via replicated tables with ZooKeeper (5) – writing data through the Distributed table


1. What is ClickHouse?

ClickHouse is an open-source column-oriented DBMS that can generate analytical reports in real time using SQL queries. It is aimed mainly at OLAP (analytical) workloads. ClickHouse was developed by the Russian company Yandex, open-sourced in June 2016, and is written in C++. Yandex, known as the "Baidu" of Russia, is a large company.

Applicable scenarios:

  • The vast majority of requests are read accesses.
  • Data is written in sizable batches (more than 1,000 rows at a time) rather than as single-row updates, or there are no updates at all.
  • Data is added to the database but not modified.
  • Reads touch a large number of rows but only a few columns.
  • Tables are "wide", i.e. contain many columns.
  • Queries are relatively rare (typically hundreds of queries per server per second or fewer).
  • Throughput for a single query is high (up to billions of rows per second per server).
  • For simple queries, latencies of around 50 ms are acceptable.
  • Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).
  • Transactions are not required.
  • Requirements for data consistency are low.
  • Each query involves one big table; all other tables are small.
  • Query results are significantly smaller than the source data; in other words, the data is filtered or aggregated so the result fits in a single server's RAM.

Shortcomings:

  • No full transaction support.
  • Secondary indexes are not supported.
  • No ability to modify or delete existing data with high frequency and low latency; data can only be deleted or modified in batches (see the sketch after this list).
  • The sparse index makes ClickHouse unsuitable for point queries that retrieve a single row by its key.
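As a brief, hedged illustration of the batch-only modification model: newer ClickHouse versions provide asynchronous "mutations" instead of row-level updates and deletes. The statements below use the ontime table created later in this article:

-- Batch delete: matching rows are removed by rewriting whole data parts in the background
ALTER TABLE ontime DELETE WHERE Year = 1987;

-- Batch update: also an asynchronous mutation, not a transactional single-row update
ALTER TABLE ontime UPDATE DayOfWeek = 7 WHERE DayOfWeek = 0;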

2. Why is ClickHouse fast?

1. I/O level

Compared with row-oriented storage, column-oriented storage reads only the columns involved in a query, which greatly reduces the number of I/O operations and the I/O overhead per query. The trade-off is that writing a record becomes more complicated, since its values must be split across column files.

2. CPU instruction-set level

Modern CPUs have extended instruction sets that support SIMD (Single Instruction, Multiple Data): multiple data elements can be processed in a single instruction cycle (for example, several values can be compared within one instruction), from the early MMX extensions to today's SSE and beyond. Because ClickHouse uses columnar storage, it can use the CPU cache very efficiently: values of the same column are read contiguously into cache lines and then loaded into SIMD registers, something that is hard to achieve with row-oriented storage.

3. Single-node parallel reading level

In a traditional database (e.g. MySQL), due to the nature of its typical workloads, a query request is usually processed by a single thread on the server. In ClickHouse, when a query arrives at a ClickHouse server instance, the table can be read in parallel: a single query is processed by multiple threads.

We will discuss ClickHouse's table engines later; the MergeTree family divides a table into multiple independent parts, and the data in each part is organized independently (each part has its own index). This is what makes parallel reading for a single query possible. For the same reason, ClickHouse consumes a lot of CPU and memory to process a single query, so it is not well suited to high-concurrency scenarios.
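As a small sketch of this behaviour (max_threads is the standard setting that caps the number of threads used for a single query; the ontime table is defined later in this article):

-- Let this query read the table with at most 8 threads
SELECT Year, count()
FROM ontime
GROUP BY Year
SETTINGS max_threads = 8;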

4. Distributed level 

There is not much to explain at this level: mainstream distributed database systems share the same basic ideas. Sharding provides horizontal scaling of performance, and multiple replicas per shard provide load balancing and high availability.

3. Introduction to ClickHouse's table engine (ClickHouse's storage engine)

  • Table engines for integrations
  1.  Link to other databases or storage systems
  2.  ODBC, JDBC, MySQL, MongoDB, HDFS, S3, Kafka, EmbeddedRocksDB, RabbitMQ, PostgreSQL, SQLite, Hive
  • MergeTree family --- data is organized in a tree-like structure, and as data is written the server merges parts in the background according to certain rules
  1.  Handles high load and fast data writes
  2.  Supports replicas, shards, indexes, etc.
  • Log family --- suited to write-once, read-many scenarios, for example intermediate tables used in query analysis
  1.  Lightweight engines
  2.  Suitable for quickly writing many small tables (up to roughly one million rows each) and then reading them back in one go
  • Special engines
  1.  Distributed: does not store data itself; it distributes processing across nodes, reads in parallel automatically, and uses the indexes of the remote tables. It must be combined with other engines that actually shard and store the data, and can be thought of as a kind of view.
  2.  Dictionary: exposes dictionary data as a table
  3.  Memory: data is kept in RAM; concurrent access is synchronized so reads and writes do not block each other; indexes and parallel reads are not supported. Because the data lives entirely in memory (it is lost when the server shuts down), this engine suits statistical analysis of small tables with very high performance requirements.
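As a brief sketch of how an engine is selected when a table is created (the table and column names here are made up for illustration):

-- In-memory scratch table: very fast, but the data disappears when the server restarts
CREATE TABLE tmp_user_stats (UserID UInt64, Visits UInt32) ENGINE = Memory;

-- Lightweight log-family table for a write-once / read-many intermediate result
CREATE TABLE tmp_import_buffer (EventDate Date, Message String) ENGINE = TinyLog;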

It is said that in real deployments more than 90% of tables use an engine from the MergeTree family for storage, so that is the family we focus on below.

4. ClickHouse's table engine-MergeTree

1. The MergeTree engine provides indexing by date and by primary key, and supports real-time data ingestion (data can be queried immediately after it is written, without blocking). MergeTree is the most powerful table engine in ClickHouse; do not confuse it with the Merge engine. In the old-style syntax used below, the engine accepts the following parameters: a column of type Date, an optional sampling expression, a tuple defining the table's primary key, and the index granularity.

MergeTree example without a sampling expression:

MergeTree(EventDate, (CounterID, EventDate), 8192)

MergeTree example with a sampling expression:

MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)

A table using the MergeTree engine must contain a separate date column, for example EventDate. This column must be of type Date (not DateTime).

The primary key can be a tuple of arbitrary expressions (usually a tuple of column names) or a single column. The sampling expression (optional) can be any expression, but once it is set it must appear in the primary key. The example above uses the hash intHash32(UserID) as the sampling expression, which shuffles the data within each CounterID and EventDate almost randomly. In other words, when we use the SAMPLE clause in a query, we get a nearly randomly distributed subset of users.

The sampling expression works together with the SAMPLE clause of a query: its purpose is to draw a sample from the full data set according to the specified rule. We ignore it when we describe the MergeTree storage structure later.
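A hedged example of how the sampling expression is used at query time (the table name hits and the CounterID value are assumptions, following the MergeTree definition with intHash32(UserID) shown above):

-- Compute an approximate count on roughly one tenth of the users, then scale up
SELECT count() * 10 AS approx_hits
FROM hits
SAMPLE 1/10
WHERE CounterID = 34;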

2. Let's give a specific example of creating a MergeTree table

CREATE TABLE `ontime` (  
`Year` UInt16,  
`Quarter` UInt8,  
`Month` UInt8,  
`DayofMonth` UInt8,  
`DayOfWeek` UInt8,  
`FlightDate` Date,  
`UniqueCarrier` FixedString(7),
...
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
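For reference, a hedged sketch of the same definition in the newer MergeTree syntax (the table name ontime_new is used here only to avoid clashing with the example above; monthly partitioning mirrors the old engine's behaviour described in the next section):

CREATE TABLE ontime_new (
  `Year` UInt16,
  `Quarter` UInt8,
  `Month` UInt8,
  `DayofMonth` UInt8,
  `DayOfWeek` UInt8,
  `FlightDate` Date,
  `UniqueCarrier` FixedString(7)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(FlightDate)   -- one partition per month
ORDER BY (Year, FlightDate)         -- primary key / sort order
SETTINGS index_granularity = 8192;  -- rows per index mark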

2.1 Next, let's look at how the ontime table of the MergeTree type organizes data on disk.

Under the data directory of the ontime table created above, the data is split into multiple parts, each corresponding to a subdirectory. The directories are named as follows: (Start-Date)_(End-Date)_XX_XX_0

As mentioned earlier, the MergeTree table must have a date field, and the data table is divided into multiple different Parts according to this date field. Each Part internally stores data from Start Date to End Date.

The maximum partitioning granularity of this (old-style) MergeTree is one month, which means that data from different months can never be in the same part. The data in each part is organized independently, with its own index, which is what allows a single query to read the table in parallel. The trailing numbers in the directory name are used by ClickHouse for multi-version control.
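The parts described above can also be inspected through the system.parts system table; a hedged example:

-- Show the active parts of the ontime table, with their partitions and row counts
SELECT partition, name, rows, active
FROM system.parts
WHERE table = 'ontime' AND active;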

2.2 Each part subdirectory of the ontime table stores the following files: columns.txt records the column information; each column has a .bin file and a .mrk file, where the .bin file holds the actual data; primary.idx stores the primary key values and has the same structure as the .mrk files, acting as a sparse index. The data file of every column is stored sorted by the primary key.

 

2.3 The following describes how a lookup proceeds through the sparse index: from primary.idx to a column's .mrk file and then to that column's .bin file.

The .mrk files and the primary.idx file are organized as follows: the data is sorted by the primary key and split into blocks of a fixed granularity. From each block one entry is extracted as an index mark and placed in primary.idx and in every column's .mrk file. The most critical step when querying a MergeTree table is locating the right blocks, and the method differs depending on whether the queried column is in the primary key. Queries on primary key columns perform better; queries on non-primary-key columns result in a full scan of those columns, but thanks to columnar storage this is not disastrously slow. Indexes are therefore not as critical in ClickHouse as they are in MySQL. In practice, a date filter is usually added to queries to keep non-primary-key queries performant. Once the relevant blocks are found, the required rows are located within them and the requested columns are assembled into the result.
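Two hedged example queries against the ontime table illustrate the difference: the first filters on the primary-key columns and can use the sparse index directly, while the second filters on a non-key column but adds a date condition to limit how much data is scanned:

-- Filter on the primary key (Year, FlightDate): blocks are located via primary.idx
SELECT count() FROM ontime WHERE Year = 2017 AND FlightDate = '2017-01-01';

-- Filter on a non-key column (DayOfWeek); the date condition narrows the scanned parts
SELECT count() FROM ontime
WHERE DayOfWeek = 1
  AND FlightDate >= '2017-01-01' AND FlightDate < '2017-02-01';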

2.4 Background Merge process:

When data is inserted (usually in batches), each insert creates a new part. In the background, merges run periodically: several parts, usually the smallest ones, are selected and merged into one larger sorted part. In effect, the work of keeping the data sorted is spread across inserts and background merges. Thanks to this, the table always consists of a small number of sorted parts, and each individual merge does not do too much work.

During the process of inserting data, data belonging to different months will be divided into different parts, and these parts belonging to different months will never be merged together.

Merges have a size threshold for the parts they combine, so there are no excessively long merge operations.
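Merging normally happens automatically in the background, but as a hedged illustration it can also be requested manually:

-- Ask the server to merge parts now; FINAL tries to merge everything into one part per partition
OPTIMIZE TABLE ontime FINAL;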


2.5 The MergeTree family also includes SummingMergeTree and AggregatingMergeTree, which add aggregation functionality on top of MergeTree; we will not go into detail here.
It is also worth mentioning that the family contains table engines with built-in replication: ReplicatedMergeTree, ReplicatedSummingMergeTree, and ReplicatedAggregatingMergeTree. These add multi-replica replication on top of the corresponding non-replicated engines (but replication only works in combination with ZooKeeper). We will come back to this when introducing the cluster functionality.
ClickHouse supports two replication approaches: the multi-write mode provided by the Distributed table engine, and the mode provided by the replicated table engines (ReplicatedXXXMergeTree). (Pay close attention to this point; it matters a great deal when building a cluster.)

5. ClickHouse clusters

The clustering approach of modern databases boils down to sharding plus multiple replicas.
Let's look at how a ClickHouse cluster is put together. We list two ways of building a cluster; the cluster layout is the same in both cases: 3 shards, each with 3 replicas.
The sample data table used below is the flight data (ontime table) provided on the ClickHouse official website.
Every instance in a ClickHouse cluster knows the complete cluster topology (via the configuration file), so a client can connect to any ClickHouse instance for queries and batch writes.
The two approaches differ mainly in how data is copied between replicas: one uses the Distributed table engine to write to all replicas (multi-write, with no data-consistency guarantee); the other uses the ReplicatedXXXMergeTree replicated table engines together with ZooKeeper.
Note: whichever replication mode is used, sharding requires a Distributed table defined on top of the XXXMergeTree or ReplicatedXXXMergeTree local tables.
In CAP terms, a ClickHouse cluster chooses AP: it guarantees availability and sacrifices (strong) consistency.

5.1.1 Replication via multi-write with the Distributed table engine (1)

This replication mode is jokingly called "the poor man's replication". In other words, it can be used when the project budget is tight and you cannot afford extra servers to deploy a ZooKeeper cluster.

5.1.2 Replication via multi-write with the Distributed table engine (2) – configuration example

Create the local table (MergeTree) and the Distributed table

Configure the cluster topology information (both steps are sketched below)
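The original post shows these steps as screenshots; a minimal SQL sketch of the table definitions is given below (the cluster name perftest_3shards_3replicas, the database name default, and the shortened column list are assumptions for illustration; the cluster itself is declared in the remote_servers section of the server configuration, with internal_replication set to false for this multi-write mode):

-- Local table, created on every node of the cluster
CREATE TABLE ontime_local (
  `Year` UInt16,
  `FlightDate` Date,
  `UniqueCarrier` FixedString(7)
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

-- Distributed table: stores no data itself, it routes reads and writes to ontime_local on each shard
CREATE TABLE ontime_all AS ontime_local
ENGINE = Distributed(perftest_3shards_3replicas, default, ontime_local, rand());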

5.1.3 Replication via multi-write with the Distributed table engine (3) – reading data through the Distributed table

5.1.4 Replication via multi-write with the Distributed table engine (4) – writing data through the Distributed table

As mentioned earlier, each ClickHouse instance in the cluster knows the complete cluster topology (each instance has its own Distributed engine instance), so a client can access any ClickHouse instance to read the distributed table's data.

When data is written through the all table (the Distributed table), the data flow in ClickHouse is as follows:

1. The Distributed engine of the main server selects the appropriate shard to write the data to.

(The main server is the ClickHouse instance that received the query from the client.)

2. The Distributed engine of the main server then sends the written data to every replica instance of that shard.

(This is the Distributed multi-write replication mode; the Distributed engine does not ensure that every replica write succeeds, which can lead to data inconsistency between replicas.)
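A hedged example of such a write, using the sketch tables above (any node of the cluster can receive the statement):

-- The Distributed engine picks a shard via the sharding key (rand() in the sketch above)
-- and, in this mode, forwards the rows to every replica of that shard itself.
INSERT INTO ontime_all (Year, FlightDate, UniqueCarrier)
VALUES (2017, '2017-01-01', 'AA');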

5.2.1 Replication via replicated tables with ZooKeeper (1)

This is the cluster setup recommended by ClickHouse, but it depends on an additional ZooKeeper cluster.

5.2.2 Replication via replicated tables with ZooKeeper (2) – configuring the cluster topology

 

5.2.3 Replication via replicated tables with ZooKeeper (3) – creating the local ReplicatedMergeTree tables and the Distributed table
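The original post shows this step as a screenshot; a minimal sketch of the DDL is given below (the ZooKeeper path, the replica name, the cluster name and the shortened column list are assumptions for illustration; in practice the path and replica name are usually filled in from per-server macros). This setup is an alternative to the one in 5.1.2:

-- Local replicated table, created on every node; the first argument identifies the shard
-- in ZooKeeper, the second identifies this replica within the shard.
CREATE TABLE ontime_replica (
  `Year` UInt16,
  `FlightDate` Date,
  `UniqueCarrier` FixedString(7)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/shard_01/ontime', 'replica_01', FlightDate, (Year, FlightDate), 8192);

-- Distributed table defined on top of the replicated local tables
CREATE TABLE ontime_all AS ontime_replica
ENGINE = Distributed(perftest_3shards_3replicas, default, ontime_replica, rand());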

5.2.4 Replication via replicated tables with ZooKeeper (4) – reading data through the Distributed table

The data-reading process is the same as in the previous replication mode.

5.2.5 Replication via replicated tables with ZooKeeper (5) – writing data through the Distributed table

As mentioned earlier, each ClickHouse instance in the cluster knows the complete cluster topology (each instance has its own Distributed engine instance), so a client can access any ClickHouse instance to read the distributed table's data.

When data is written through the all table (the Distributed table), the data flow in ClickHouse is as follows:

1. The Distributed engine of the main server selects the appropriate shard to write the data to.

(The main server is the ClickHouse instance that received the query from the client.)

2. The Distributed engine of the main server selects one "healthy" replica instance in that shard and sends the data to it.

("Healthy" here roughly means that the replica is alive and its current load is low.)

3. That replica then cooperates with ZooKeeper to replicate the data asynchronously to the other replicas in the background.

(Data consistency can be guaranteed to a certain extent through ZooKeeper.)
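Replication progress of the ReplicatedMergeTree tables can be observed through the system.replicas system table; a hedged example (the table name follows the sketch in 5.2.3):

-- Check replication health: leadership, ZooKeeper session state and replication queue size
SELECT database, table, is_leader, is_session_expired, queue_size, absolute_delay
FROM system.replicas
WHERE table = 'ontime_replica';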


Origin blog.csdn.net/changlina_1989/article/details/124456088