ClickHouse (part 8: unique storage structure and distributed tables)

Storage structure

In the following examples, we will introduce ClickHouse's most commonly used engine family, *MergeTree (merge tree).

Logical division

Taking a distributed table as an example: ClickHouse data is stored in multiple shards across the cluster, and when those shards live on different nodes, the data is distributed over multiple machines. Within each shard, data is divided according to the partition key specified when the table is created, and within a single partition, the data is split again into parts once it exceeds a certain size threshold.

# Table schema:
${ck_data}/metadata/path_to_table/*.sql
# Actual data directory:
${ck_data}/data/path_to_table/${partition_*}/**
# Detached data directory:
${ck_data}/data/path_to_table/detached
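
You can also inspect this layout from SQL. Below is a minimal sketch using the built-in system.parts table (the table name default.test1 is just an example):

-- List the active on-disk parts of each partition for one table.
SELECT
    partition,
    name AS part_name,
    rows,
    bytes_on_disk,
    path            -- physical location under ${ck_data}/data/...
FROM system.parts
WHERE database = 'default' AND table = 'test1' AND active
ORDER BY partition, name;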

Columnar storage

ClickHouse is a true columnar database management system: apart from the data itself, almost no additional data is stored.
Let's take a look at ClickHouse's distinctive computing advantages.

  • Multi-server distributed processing
    In ClickHouse, data can be stored on different shards. Each shard is composed of a set of fault-tolerant replicas. Queries can be processed on all shards in parallel by multiple servers.
  • Vector engine
    To use the CPU efficiently, ClickHouse processes data as vectors (chunks of columns) during computation. Relative to the cost of processing the data itself, the dispatch overhead of vectorized processing is low, which makes more efficient use of the CPU.

Sparse index

The most powerful table engine in ClickHouse is undoubtedly the MergeTree engine and the other engines in the series (*MergeTree). The basic idea of the MergeTree engine family is as follows: when you have a huge amount of data to insert into a table, you write data parts in batches efficiently and let them be merged in the background according to certain rules. Compared with continually rewriting the stored data on every insert, this strategy is much more efficient. Main advantages:

  • The stored data is sorted by the primary key.
    This allows you to create a small sparse index for fast retrieval of data.
  • Partitioning is supported, if a partition key is specified.
    For the same data set and the same result set, some operations on partitioned tables in ClickHouse are faster than ordinary operations. When the query filters on the partition key, ClickHouse automatically prunes the partitions that cannot match, which also effectively improves query performance (see the sketch after this list).
  • Data replication is supported.
    The ReplicatedMergeTree series of tables is used for this.
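
As an illustration of partition pruning, here is a hedged sketch (the table and column names are assumptions for illustration, not from the official example):

-- Hypothetical table partitioned by month.
CREATE TABLE default.visits
(
    `EventDate` Date,
    `CounterID` UInt32,
    `UserID` UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);

-- The filter on the partition key lets ClickHouse skip every partition
-- except 202005 instead of scanning the whole table.
SELECT count()
FROM default.visits
WHERE EventDate BETWEEN '2020-05-01' AND '2020-05-31';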

Consider the use case from the official documentation. With (CounterID, Date) as the primary key, the sorted index diagram looks like this:

Whole data:     [-------------------------------------------------------------------------]
CounterID:      [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date:           [1111111222222233331233211111222222333211111112122222223111112223311122333]
Marks:           |      |      |      |      |      |      |      |      |      |      |
                a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
Mark numbers:    0      1      2      3      4      5      6      7      8      9      10

If you specify the following queries:
CounterID IN ('a', 'h'): the server will read the data with mark numbers in the ranges [0, 3) and [6, 8).
CounterID IN ('a', 'h') AND Date = 3: the server will read the data with mark numbers in the ranges [1, 3) and [7, 8).
Date = 3: the server will read the data with mark numbers in the range [1, 10].
The examples above show that using an index is usually more efficient than a full table scan.
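
On reasonably recent ClickHouse versions you can check which mark ranges a query actually selects with EXPLAIN (a sketch; the feature depends on your server version, and the table name hits is hypothetical):

-- Shows primary-key index usage, including how many granules
-- (mark ranges) are selected out of the total.
EXPLAIN indexes = 1
SELECT count()
FROM hits
WHERE CounterID IN ('a', 'h');
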
Sparse indexes cause some extra data to be read. When reading a single primary-key range, up to index_granularity * 2 extra rows can be read in each data block. In most cases, with index_granularity = 8192, ClickHouse performance does not degrade.
Sparse indexes let you work with tables that have an enormous number of rows, because the index always fits in memory (RAM). ClickHouse does not require a unique primary key, so you can insert multiple rows with the same primary key. Look at the actual syntax below:

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
)
ENGINE MergeTree() 
PARTITION BY toYYYYMM(EventDate) 
ORDER BY (CounterID, EventDate, intHash32(UserID)) 
SETTINGS index_granularity=8192
  • ENGINE — the engine name and parameters.
    ENGINE = MergeTree(). The MergeTree engine has no parameters.
  • PARTITION BY — Partition key.
    To partition by month, you can use the expression toYYYYMM(date_column), where date_column is a Date column. Here the partition name format will be "YYYYMM".
  • ORDER BY — The sort key of the table.
    It can be a tuple of a set of columns or any expression. For example: ORDER BY (CounterID, EventDate).
  • PRIMARY KEY — the primary key, if you want it to differ from the sort key.
    By default, the primary key is the same as the sort key (specified by the ORDER BY clause). Therefore, in most cases there is no need to specify a PRIMARY KEY clause.
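
Since the primary key must be a prefix of the sort key, you can keep the in-memory index small while sorting rows by a longer tuple. A hedged sketch (table and column names are illustrative):

-- The index holds only (CounterID, EventDate), keeping it small,
-- while rows are fully sorted by the longer ORDER BY tuple.
CREATE TABLE default.events
(
    `CounterID` UInt32,
    `EventDate` Date,
    `UserID` UInt64,
    `Value` Int64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
PRIMARY KEY (CounterID, EventDate);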

Storage source code implementation

Most of the logic related to the storage layer lives under /src/Storages.

Table engine

The top-level abstraction of a table is IStorage; the different implementations of this interface are the different table engines, for example StorageMergeTree, StorageMemory, etc. Instances of these classes are tables.
This interface contains many common methods, such as read, write, alter, and drop.


  • read: the table's read path can return multiple IBlockInputStream objects (see readStreams below), which allows data to be processed in parallel. These block input streams read data from the table in parallel; the streams can then be wrapped in different transformations (such as expression evaluation or data filtering) that are computed independently.
    virtual Pipes read(
        const Names & /*column_names*/,
        const SelectQueryInfo & /*query_info*/,
        const Context & /*context*/,
        QueryProcessingStage::Enum /*processed_stage*/,
        size_t /*max_block_size*/,
        unsigned /*num_streams*/)
    {
        // Base implementation throws; table engines that support reading override this.
        throw Exception("Method read is not supported by storage " + getName(), ErrorCodes::NOT_IMPLEMENTED);
    }

    /** The same as read, but returns BlockInputStreams.
     */
    BlockInputStreams readStreams(
            const Names & /*column_names*/,
            const SelectQueryInfo & /*query_info*/,
            const Context & /*context*/,
            QueryProcessingStage::Enum /*processed_stage*/,
            size_t /*max_block_size*/,
            unsigned /*num_streams*/);

  • write: returns a block output stream (IBlockOutputStream) into which data blocks can be written.
virtual BlockOutputStreamPtr write(
        const ASTPtr & /*query*/,
        const Context & /*context*/)
    {
        // Base implementation throws; table engines that support writing override this.
        throw Exception("Method write is not supported by storage " + getName(), ErrorCodes::NOT_IMPLEMENTED);
    }

Data flow

  • Block streams
    are used to process data. We use streams of data blocks to read data from somewhere, perform transformations, or write data somewhere. IBlockInputStream has a read method to fetch the next block; IBlockOutputStream has a write method to send a block somewhere.
    For example, when you pull data from an AggregatingBlockInputStream, it reads all the data from its source, aggregates it, and then returns a stream of aggregated data. Another example: UnionBlockInputStream accepts many input sources plus a number of threads, and it starts multiple threads to read from those sources in parallel.
  • A data block
    is a container that represents a subset of a table in memory. It is a collection of triples: (IColumn, IDataType, column name).

Storage HA

High-availability storage is essential for production, so let's take a look at ClickHouse's distributed storage; the difference from Hive is still big. First of all, cross-shard queries are implemented with distributed tables. It is worth noting that a ClickHouse distributed table does not store data itself; it is more like a view. Reads are automatically parallelized, and during reads the indexes (if any) of the tables on the remote servers are used.

Official website configuration

Assume four nodes, example01-01-1, example01-01-2, example01-02-1 and example01-02-2, in a cluster named logs.

<remote_servers>
    <logs>
        <shard>
            <!-- Optional. Shard weight when writing data. Default: 1. -->
            <weight>1</weight>
            <!-- Optional. Whether to write data to just one of the replicas. Default: false (write data to all replicas). -->
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-01-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-01-2</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <weight>2</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-02-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-02-2</host>
                <secure>1</secure>
                <port>9440</port>
            </replica>
        </shard>
    </logs>
</remote_servers>
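
Once this configuration is loaded, you can verify the cluster topology from any node. A small sketch using the built-in system.clusters table:

-- One row per replica; shard_num and replica_num should match the XML above.
SELECT cluster, shard_num, shard_weight, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'logs';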

We create the test1 table on all four nodes of the logs cluster, partitioned by totalDate.

CREATE TABLE default.test1 on cluster logs 
(`uid` Int32, `totalDate` String ) 
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/test1', '{replica}') 
PARTITION BY totalDate ORDER BY totalDate SETTINGS index_granularity = 8192;
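
The {shard} and {replica} placeholders are substituted from each server's macros configuration, which must differ per node. You can check the substitutions and the resulting replication state with a sketch like this:

-- Macros configured on the current server.
SELECT * FROM system.macros;

-- Replication state of the new table, one row per local replica.
SELECT database, table, is_leader, replica_name, zookeeper_path
FROM system.replicas
WHERE table = 'test1';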

Then create a distributed table test1_all in the cluster


-- Create a distributed table pointing at test1
CREATE TABLE default.test1_all on cluster logs 
as test1
ENGINE = Distributed(logs, default, test1, rand())

Then you can write some test data to the distributed table and check the underlying table on a specific node to verify. Because there are similar examples in part six of this series, I won't repeat them in detail; a minimal sketch of the round trip follows.
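
A minimal sketch (the values are arbitrary):

-- Write through the distributed table; rand() sharding spreads the rows.
INSERT INTO default.test1_all VALUES (1, '2020-05-01'), (2, '2020-05-02');

-- Read back through the distributed table (merges results from all shards)...
SELECT * FROM default.test1_all;

-- ...then, on a specific node, check the local table to see which rows landed there.
SELECT * FROM default.test1;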

Highly available configuration

ClickHouse recommends using replicated tables with internal synchronization. Let's first look at the internal_replication property in the configuration above. When it is set to false, data inserted into the distributed table is inserted into both local tables; since the consistency of replicas is not checked, over time the replica data may drift apart.
Replicated tables: ClickHouse provides data replication at the table level, not the server level, so replicated and non-replicated tables can coexist on the same server. This is another big difference between ClickHouse and Hive.
Let's look at the four replication modes below.

  1. Non-replicated tables, internal_replication=false
    If nothing goes wrong during inserts, the data in the two local tables stays in sync. We call it "poor man's replication" because the replicas easily diverge in the case of network problems, and there is no easy way to determine which one is correct.
  2. Non-replicated tables, internal_replication=true
    Data is inserted into only one local table, and there is no mechanism to transfer it to the other. Local tables on different hosts therefore see different data, and queries against the distributed table may return unexpected results. This is clearly an incorrect way to configure a ClickHouse cluster.
  3. Replicated tables, internal_replication=true
    Data inserted into the distributed table goes into only one of the local tables, and the replication mechanism transfers it to the table on the other host, so the two local tables stay in sync. This is the recommended configuration (see the verification sketch after this list).
  4. Replicated tables, internal_replication=false
    Data is inserted into both local tables, but at the same time the replicated-table mechanism guarantees that duplicate data is deleted. The data is replicated from the first node that received the insert to the other nodes; when another node finds it already has the data, it discards the duplicate. The replicas stay in sync and no errors occur, but the duplicate replication stream significantly reduces write performance. This configuration should therefore be avoided in favor of configuration 3.

Origin: blog.csdn.net/yyoc97/article/details/106128124