ClickHouse study notes (1): Overview of ClickHouse architecture (Why is ClickHouse so fast?)

1. Overview of ClickHouse

1.1 Brief introduction

    ClickHouse is a database built on an MPP architecture. It does not adopt the master-slave architecture of the Hadoop ecosystem, but instead uses a multi-master peer-to-peer design. It is also a ROLAP solution. Official documentation: https://clickhouse.com/docs/en/intro

1.2. Explanation of terms

1.2.1 MPP architecture

MPP (Massively Parallel Processing) distributes tasks to multiple servers and nodes in parallel; after each node completes its computation, the partial results are aggregated to obtain the final result. A database that uses the MPP architecture is called an MPP database.
figure 1

Features of the MPP architecture :

  1. MPP supports a shared-nothing distributed architecture
  2. In an MPP, each processor handles a different part of the task.
  3. Each processor has its own set of disks
  4. Each node is only responsible for processing rows on its own disk
  5. Easily expandable by simply adding nodes
  6. Data is horizontally partitioned, with strong compression capability
  7. MPP processors communicate with each other using some form of message-passing interface
  8. In MPP, each processor uses its own operating system (OS) and memory.
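The scatter-gather flow these points describe can be sketched in miniature. The following is a toy Python model, not a real MPP system: threads stand in for separate servers, and the hypothetical `node_partial_sum` stands in for a node's local computation over its own disk partition.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" holds its own horizontal partition of the data.
partitions = [
    [3, 1, 4],   # rows on node 1's disks
    [1, 5, 9],   # rows on node 2's disks
    [2, 6, 5],   # rows on node 3's disks
]

def node_partial_sum(rows):
    """Work done locally on one node: aggregate only its own rows."""
    return sum(rows)

# Scatter: every node processes its own part in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(node_partial_sum, partitions))

# Gather: the coordinator merges the partial results.
total = sum(partials)
print(total)  # 36
```

Each node touches only its own rows; only the small partial results (one number per node) cross the "network", which is the shared-nothing property in point 1.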

    Both MPP databases and Hadoop distribute computation to nodes for independent calculation and then combine the results (distributed computing), but because they adopt different theories and technical routes, each has its own advantages, disadvantages, and applicable scope. A comparison of the two technologies with traditional database technology is as follows:
figure 2
In general, the applicable scenarios for Hadoop and MPP are:

● Hadoop has advantages in processing unstructured and semi-structured data, and is especially suitable for application requirements such as massive data batch processing.
● MPP is well suited to replacing big data processing over existing relational data organization, and it is highly efficient.

1.2.2 Vectorized execution engine

    A vectorized execution engine (向量化执行引擎) operates on columnar data in memory one batch at a time, issuing SIMD instructions per batch instead of calling once for each row. This not only reduces the number of function calls and cache misses, but also fully exploits the parallelism of SIMD instructions, greatly shortening computation time. A vectorized execution engine can usually bring a several-fold performance improvement.
    Put simply, it is an optimization that eliminates per-row program loops and achieves data-level parallelism; implementing vectorized execution requires the CPU's SIMD instructions.

1.2.3 SIMD

SIMD (Single Instruction, Multiple Data) uses a single instruction stream to operate on multiple data streams; that is, one instruction can process multiple data elements at once. A simple example is vector addition and subtraction. SIMD is not well suited to scenarios with many branch conditions.
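Python has no direct access to SIMD registers, so the following is only a hedged analogy of the two execution models: a row-at-a-time engine pays per-element call overhead, while a batched (vectorized) engine makes one call per column chunk, which a real engine then lowers to SIMD instructions.

```python
def add_row_at_a_time(a, b):
    # One "function call" per row: the model that vectorization replaces.
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out

def add_batched(a, b):
    # One call handles the whole batch; a vectorized engine would
    # compile this inner loop down to SIMD add instructions that
    # process several elements per instruction.
    return [x + y for x, y in zip(a, b)]

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(add_batched(a, b))  # [11, 22, 33, 44]
```

Both return the same result; the point is the interface: one call per batch amortizes the per-call overhead that the per-row version pays on every element.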

1.2.4 OLAP

OLTP: On-line Transaction Processing. It requires high real-time performance and strong stability to ensure that data is updated successfully and promptly. From a database point of view, OLTP is mainly about inserting, deleting, and updating data.

OLAP: On-line Analytical Processing. Business data needs to be analyzed in a centralized, unified way, so the data is generally stored in a data warehouse that provides OLAP analysis. OLAP is the main application of data warehouses and is primarily about querying data.
figure 3

ROLAP: Relational OLAP, relational online analytical processing. As the name implies, it builds directly on the relational model, and the data model usually uses a star schema or snowflake schema. ROLAP is represented by traditional relational databases, MPP distributed databases, and Hadoop-based Spark/Impala. Its strength is being able to join detail data and summary data at the same time, computing results in real time according to the user's needs and returning them. Because it relies on real-time computation, ROLAP's drawback is also obvious: once the computation volume or the concurrency reaches a certain level, performance problems inevitably appear.

    Traditional relational databases such as Teradata and Oracle are representative here. Because the traditional architecture scales poorly, hardware requirements are very high; once the computed data volume reaches the tens of millions of rows, database computation shows noticeable delay, users no longer get a timely response, and higher concurrency is out of the question.
    MPP distributed databases (Greenplum/GBase/Vertica) solve part of the scalability problem, hardware requirements are somewhat lower (though still non-trivial), and the supported data volume (GB to TB scale) improves greatly. However, when a cluster grows to hundreds or thousands of nodes, a performance bottleneck appears (adding more nodes no longer yields obvious improvement), and expansion is expensive.

    Hadoop-based Spark/Impala has very low deployment hardware requirements (ordinary servers suffice, though it relies mainly on in-memory computing to shorten response time, so memory requirements are high), and the cost of adding nodes is relatively low. However, once the computation volume or concurrency reaches a certain level, it cannot respond within seconds, and problems such as out-of-memory errors easily occur.

MOLAP: Multidimensional OLAP, multidimensional online analytical processing. Representative MOLAP systems include Cognos, SSAS, and Kylin. The design concept is to pre-compute anticipated queries and store the results (for example, with a table of 10 dimensions and 5 measures, there are 2 to the 10th power possible dimension combinations, all computed and stored in advance); when a request arrives, the corresponding result is simply looked up and returned. The strengths: when a request hits a precomputed result it returns very fast (so MOLAP is well suited to common, fixed analysis scenarios), it supports larger data volumes under the same resources, and it handles concurrency well. The drawbacks: the more dimensions a table has, the more complex the cube, the more disk space is required, and the longer it takes to build the cube.
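The "2 to the nth power of dimension combinations" idea can be made concrete. The sketch below precomputes a toy cube over 3 hypothetical dimensions (so 2^3 = 8 group-by combinations, mirroring the 2^10 for a 10-dimension table); it is an illustration of the concept, not how Kylin or SSAS actually store cubes.

```python
from itertools import combinations

dimensions = ["region", "product", "month"]  # toy cube with 3 dimensions

# Every subset of dimensions is one group-by combination: 2^3 = 8.
cube = [frozenset(c) for n in range(len(dimensions) + 1)
        for c in combinations(dimensions, n)]

rows = [
    {"region": "east", "product": "x", "month": 1, "sales": 10},
    {"region": "east", "product": "y", "month": 1, "sales": 5},
    {"region": "west", "product": "x", "month": 2, "sales": 7},
]

# Pre-compute one aggregate table per combination (the "cube build").
materialized = {}
for dims in cube:
    table = {}
    for r in rows:
        key = tuple(sorted((d, r[d]) for d in dims))
        table[key] = table.get(key, 0) + r["sales"]
    materialized[dims] = table

# A query that "hits" the cube is a dictionary lookup, not a scan.
print(len(materialized))                                           # 8
print(materialized[frozenset(["region"])][(("region", "east"),)])  # 15
```

Adding one more dimension doubles the number of combinations to build and store, which is exactly the disk-space and build-time drawback described above.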
HOLAP: Hybrid OLAP, OLAP with hybrid architecture. This idea can be understood as the integration of ROLAP and MOLAP.

1.3. Application scenarios

Applicable scenarios :

  1. The vast majority of requests are for read access
  2. Data needs to be updated in large batches (greater than 1000 rows), rather than a single row update; or no update operation at all
  3. Data is just added to the database, no modification necessary
  4. When reading data, a large number of rows are extracted from the database, but only a small number of columns are used
  5. The table is "wide", that is, it contains a large number of columns
  6. Relatively low query frequency (typically hundreds of queries per second or less per server)
  7. For simple queries, about 50 ms of latency is allowed
  8. Column values are relatively small: numbers and short strings (for example, only 60 bytes per URL)
  9. Requires high throughput (up to billions of rows per second per server) when processing a single query
  10. no transaction required
  11. Data consistency requirements are low
  12. Only one large table will be queried in each query. Except for one large table, the rest are small tables
  13. The query result is significantly smaller than the data source. That is, the data has filtering or aggregation. The returned result does not exceed the memory size of a single server

Inapplicable scenarios :

  1. No true delete/update support, and no transaction support (hoped for in future versions)
  2. Secondary indexes are not supported
  3. Limited SQL support, join implementations are different
  4. Does not support window functions
  5. Metadata management requires manual intervention to maintain

2. ClickHouse core features

2.1. Complete DBMS functions

ClickHouse has complete management functionality, so it qualifies as a full DBMS rather than merely a database. As a DBMS, it has the following basic functions:
DDL (Data Definition Language): databases, tables, and views can be created, modified, or dropped dynamically without restarting the service.
DML (Data Manipulation Language): data can be inserted, deleted, updated, and queried dynamically.
Access control: operation permissions on databases or tables can be set at user granularity to ensure data security.
Backup and restore: data backup/export and import/restore mechanisms are provided to meet production-environment requirements.
Distributed management: a cluster mode is provided, and multiple database nodes are managed by the system itself.

2.2. Column storage and data compression

    If you want to make the query faster, the simplest and most effective way is to reduce the data scanning range and the size of the data transmission. Columnar storage and data compression can help us achieve the above two points.

Columnar storage : Avoid redundant data scans when scanning specified columns.
Data compression : Scan the data for matches at a certain step size, encoding repeated sections as references, which reduces IO and storage pressure.

Example:
Before compression : abcdefghi_bcdefghi
After compression : abcdefghi_(9,8)

    The essence of compression is to scan the data for matches at a certain step size and convert repeated parts into an encoded reference. For example, (9,8) in the example above means: going back 9 bytes from the current position, there is a repeated run 8 bytes long, namely bcdefghi.
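This back-reference scheme (the core idea behind LZ77-family codecs such as LZ4) can be decoded in a few lines. A minimal sketch, assuming a token stream of string literals and `(distance, length)` tuples:

```python
def lz_decompress(tokens):
    """Decode tokens: strings are literals; (distance, length) tuples
    are back-references into the already-decoded output."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
        else:
            distance, length = tok
            start = len(out) - distance   # go back `distance` bytes
            for i in range(length):       # copy `length` bytes forward
                out.append(out[start + i])
    return "".join(out)

# The example from the text: "abcdefghi_(9,8)"
print(lz_decompress(["abcdefghi_", (9, 8)]))  # abcdefghi_bcdefghi
```

Copying byte by byte (rather than slicing) matters in real LZ decoders, because a reference's length may exceed its distance, making the copy overlap its own output.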
    Of course, real compression algorithms are more complicated than this example. ClickHouse compresses with the LZ4 algorithm by default, and the overall data compression ratio reaches about 8:1 (17 PB before compression, 2 PB after). Besides reducing IO and storage pressure, columnar storage also paves the way for vectorized execution.

2.3. Vectorized execution engine

    The simple understanding of a vectorized execution engine is that it eliminates per-row program loops and achieves data-level parallelism; implementing vectorized execution relies on the CPU's SIMD instructions. SIMD stands for Single Instruction, Multiple Data: operating on multiple data elements with a single instruction. In modern computer architecture, it is one implementation of improving performance through data-level parallelism (others include instruction-level parallelism and thread-level parallelism); its principle is to perform parallel data operations at the level of CPU registers.
    In a computer system, storage forms a hierarchy. The storage hierarchy of a typical server is shown in the following diagram:
Figure 2.3

    As the figure shows, from left to right, the farther a level is from the CPU, the slower the data access: registers are roughly 300 times faster than main memory and roughly 30 million times faster than disk. Exploiting the CPU's vectorized execution is therefore highly significant for program performance.

ClickHouse currently uses the SSE 4.2 instruction set to implement vectorized execution.

2.4. Relational model and SQL query

  • Compared with other models such as documents and key-value pairs, the ClickHouse relational model has better description capabilities and can express the relationship between entities more clearly. In the field of OLAP, a large number of data modeling workers are based on relational models (star model, snowflake model and even wide table model). Therefore, the cost of migrating systems based on traditional relational databases or data warehouses to ClickHouse is very low.
  • It uses SQL as its query language throughout (supporting GROUP BY, ORDER BY, JOIN, IN, and most standard SQL). In SQL parsing, ClickHouse is case-sensitive: `select a` and `select A` have different semantics.

2.5. Diverse table engines

    ClickHouse's table-engine design is similar to MySQL's: the storage engine sits behind an independent interface, there are many engine types, and you choose according to the business scenario. ClickHouse provides more than 20 table engines across 6 categories, including MergeTree, memory, file, interface, and others. Supporting specific scenarios with specific table engines is very flexible: simple scenarios can use simple engines directly to reduce cost, while complex scenarios also have suitable options. (ClickHouse has many table engines; a dedicated article on them will follow.)

2.6 Multithreading and distributed computing

  • Vectorized execution improves performance through data-level parallelism, while multithreading improves it through thread-level parallelism. Unlike SIMD vectorization, which is implemented in the underlying hardware, thread-level parallelism is usually controlled by a higher software layer. Multithreading (thread-level parallelism) and vectorized execution (data-level parallelism) are complementary.
  • Data is distributed to the servers in advance, and the query computation is pushed down to the server where the data resides, because moving computation is cheaper than moving data.

    In terms of data access, ClickHouse supports both partitioning (vertical scaling, exploiting multithreading) and sharding (horizontal scaling, exploiting distribution); it can be said to push multithreading and distribution to the extreme.
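Sharding can be illustrated with a toy routing function. The node names and the hashing scheme below are hypothetical; in ClickHouse, routing is actually driven by a Distributed table's sharding-key expression and the cluster configuration.

```python
import zlib

NODES = ["node1", "node2", "node3"]  # hypothetical shard hosts

def shard_for(key: str, nodes=NODES) -> str:
    """Pick a shard by hashing the sharding key, so the same key
    always lands on the same node."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

# Rows are spread horizontally across the shards by their key.
placement = {k: shard_for(k) for k in ["user-1", "user-2", "user-3"]}
```

Because the mapping is deterministic, both writes and reads for the same key hit the same shard, and adding capacity means adding nodes rather than upgrading one machine.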

2.7 Multi-master architecture

    Distributed systems such as HDFS, Spark, HBase, and Elasticsearch adopt a Master-Slave architecture, with a control node acting as the leader to coordinate the whole system. ClickHouse, in contrast, adopts a Multi-Master architecture: every node in the cluster plays the same role, and a client can access any node with the same effect. This has the following advantages:

  1. The peer-to-peer role makes the system architecture simpler, no need to distinguish between master control nodes, data nodes, and computing nodes, and all nodes in the cluster have the same function.
  2. It naturally avoids the problem of single point of failure, and is very suitable for scenarios with multiple data centers and remote locations.

2.8. Real-time query (online query)

  • Similarities: like other analytical databases, it supports massive query scenarios, columnar storage, data sharding, computation pushdown, and other features, showing that ClickHouse absorbed many strengths in its design.
  • In terms of price: other open source systems are slow, and commercial systems are expensive. ClickHouse is fast and free.

     Some sources translate this as "online query"; I think "real-time query" is more apt. Compared with other analytical systems such as Vertica, SparkSQL, Hive, and Elasticsearch, ClickHouse is both fast and free in big-data analysis scenarios.

2.9 Data sharding and distributed query

  • ClickHouse has local tables (Local Table) and distributed tables (Distributed Table). A local table is equivalent to one data shard, while a distributed table stores no data itself and is only an access proxy for the local tables.
  • A distributed table works like sharding middleware: it proxies access to multiple data shards to implement distributed queries.

3. ClickHouse architecture design

At present, publicly available information about ClickHouse is relatively scarce; for example, it is hard to find complete material on its architecture, and there is not even an overall architecture diagram.
architecture design

3.1 Column and Field

    Column and Field are ClickHouse's most basic mapping units for data. A column of data in ClickHouse memory is represented by a Column object, and a single value within a column (one row of a single column) is represented by a Field object.
Column object : divided into interface and implementation. The IColumn interface defines methods for various relational operations on data; the concrete implementations are provided by objects for each data type, such as ColumnString, ColumnArray, and ColumnTuple.
Field object : uses an aggregate design. Inside the Field object, 13 data types such as Null, UInt64, String, and Array, together with their corresponding processing logic, are aggregated.

3.2 DataType

    DataType is responsible for serializing and deserializing data, but it does not read data directly; that is done through Column or Field objects. Each DataType implementation class aggregates the Column and Field objects of the corresponding data type: for example, DataTypeString refers to the string-typed ColumnString, DataTypeArray refers to the array-typed ColumnArray, and so on.

3.3 Block and Block streams

    ClickHouse processes data in units of Block objects, using a streaming model. Although Column and Field form the basic mapping units, they lack some information needed for actual operations, such as the data type and the column name. Hence the Block object was designed; a Block can be regarded as a subset of a data table.

Block = data object (Column/Field) + DataType + column name string

Stream operations have two top-level interfaces: IBlockInputStream is responsible for reading data and relational operations, while IBlockOutputStream is responsible for outputting data to the next stage.
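The Block formula above can be modeled in a few lines. This is a toy Python model for intuition only; the real implementation is C++ built from IColumn and IDataType objects.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """Toy Block: parallel lists of columns, their types, and names."""
    columns: list   # one list of values per column (the Column objects)
    types: list     # data type name per column (the DataType objects)
    names: list     # column name per column

    def rows(self) -> int:
        # All columns in a block have the same number of rows.
        return len(self.columns[0]) if self.columns else 0

b = Block(columns=[[1, 2, 3], ["a", "b", "c"]],
          types=["UInt64", "String"],
          names=["id", "tag"])
print(b.rows())  # 3
```

A stream of such blocks is what IBlockInputStream hands to each operator, so every operator always sees values together with their types and names.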

3.4 Table

    There is no Table object as such in the underlying design; the IStorage interface is used directly to refer to a data table.
    The table engine is a distinctive feature of ClickHouse, and different table engines are implemented by different subclasses, such as IStorageSystemOneBlock (system tables), StorageMergeTree (MergeTree table engine), and StorageTinyLog (log table engine).

3.5 Parser and Interpreter

    The Parser is responsible for creating the AST (Abstract Syntax Tree) object, while the Interpreter interprets the AST and builds the query execution pipeline. Together with IStorage, they string the entire data query process together. The Parser parses a SQL statement into an AST by recursive descent; different SQL statements are parsed by different Parser implementation classes.

3.6 Functions and Aggregate Functions

    Ordinary functions, such as arithmetic operations, date conversion, URL extraction, and IP-address masking, are stateless, and their effect applies to each row of data. During execution, functions are vectorized to operate directly on a whole column at once rather than computing row by row.

    Aggregate functions are stateful; for example, the COUNT aggregate function records the state of AggregateFunctionCount as a UInt64 integer. Aggregate function state supports serialization and deserialization, so it can be transferred between distributed nodes to enable incremental computation.
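Serializable, mergeable state is what makes distributed aggregation incremental. A minimal Python analogue of the COUNT state (the class name mirrors ClickHouse's, but this sketch and its `pickle` serialization are illustrative, not the real C++ implementation):

```python
import pickle

class AggregateFunctionCount:
    """Toy COUNT state: a counter whose partial states can be
    serialized, shipped between nodes, and merged."""
    def __init__(self):
        self.count = 0           # a UInt64 in real ClickHouse

    def add(self, _row):
        self.count += 1

    def merge(self, other):
        self.count += other.count

    def serialize(self) -> bytes:
        return pickle.dumps(self.count)

    @classmethod
    def deserialize(cls, data: bytes):
        state = cls()
        state.count = pickle.loads(data)
        return state

# Each shard counts its local rows; the initiator merges the states.
shard_a, shard_b = AggregateFunctionCount(), AggregateFunctionCount()
for row in range(3): shard_a.add(row)
for row in range(5): shard_b.add(row)

total = AggregateFunctionCount.deserialize(shard_a.serialize())
total.merge(shard_b)
print(total.count)  # 8
```

Only the tiny state crosses the network, never the raw rows, which is why this pattern scales to distributed and incremental computation.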

3.7 Cluster and Replication

A cluster is composed of shards, and a shard is composed of replicas. One ClickHouse node can carry only one shard; implementing 1 shard with 1 replica therefore requires at least two service nodes. A shard is a logical concept; its physical data is carried by the replicas.
2 shards 2 copies
The example above shows 2 shards with 2 replicas each, requiring 4 physical nodes.

4. Why is ClickHouse so fast?

  • Focus on hardware : pay close attention to the hardware and estimate rough performance before implementing.
  • Algorithm first : ClickHouse has selected different algorithms for different usage scenarios, and performance is the primary consideration for algorithm selection.
  • Dare to try something new : Dare to try the latest and fastest algorithm, and find the most suitable algorithm implementation.
  • Continuous Improvement : Continuous testing and continuous improvement in all aspects.

Origin blog.csdn.net/u011047968/article/details/129138111