Learning ClickHouse

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP).

ClickHouse was originally developed to power Yandex.Metrica, the second largest web analytics platform in the world, and has served as the core component of that system for many years. So far, the system stores more than 13 trillion records in ClickHouse and processes more than 20 billion events every day. It allows custom queries and reports to be generated on the fly directly from raw data. This article briefly introduces the goals ClickHouse had in its early stages of development.

Yandex.Metrica builds reports in real time from hits and sessions, segmented by fields that the user defines. This often requires complex aggregates, such as the number of unique visitors. The data used to build a report arrives and is stored in real time.

As of April 2014, Yandex.Metrica was tracking approximately 12 billion events (user clicks and views) every day. All of these events must be stored in order to build custom reports. A single query may need to scan millions of rows within a few hundred milliseconds, or hundreds of millions of rows within a few seconds.

 

Features of ClickHouse 

True columnar database management system 

In a true columnar database management system, no extra data is stored alongside the values themselves. Among other things, this means that fixed-length numeric types must be supported, so that a length field does not have to be stored next to each value. For example, 1 billion values of type UInt8 should take up about 1 GB uncompressed; if they take more, CPU usage is strongly affected. Storing data compactly matters even when it is uncompressed, because the speed of decompression depends mainly on the volume of uncompressed data.
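The storage arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the 1-byte length prefix is an assumption chosen only to show how per-value metadata inflates a column; it is not how any particular system encodes lengths.

```python
from array import array

def fixed_width_bytes(n_values: int, value_size: int) -> int:
    """Bytes needed when every value has the same, known size."""
    return n_values * value_size

def length_prefixed_bytes(n_values: int, value_size: int, prefix_size: int) -> int:
    """Bytes needed when each value also carries a length field (assumed layout)."""
    return n_values * (value_size + prefix_size)

n = 1_000_000_000
print(fixed_width_bytes(n, 1))         # 1000000000 bytes, about 1 GB
print(length_prefixed_bytes(n, 1, 1))  # 2000000000 bytes, twice as much

# A small in-memory illustration: array("B") stores unsigned 8-bit values
# back to back, one byte each, with no per-value metadata.
column = array("B", [0, 7, 255])
print(column.itemsize, len(column.tobytes()))  # 1 3
```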

This is worth noting because some other systems can also store the values of different columns separately, yet cannot process analytical queries effectively because they are optimized for other scenarios. Examples are HBase, BigTable, Cassandra, and HyperTable. In these systems you can get a throughput of hundreds of thousands of rows per second, but not hundreds of millions of rows per second.

It should also be noted that ClickHouse is a database management system, not just a single database. It allows tables and databases to be created, data to be loaded, and queries to be run at runtime, without reconfiguring or restarting the server.

Data compression

Some columnar database management systems (for example, InfiniDB CE and MonetDB) do not use data compression. However, data compression plays a vital role in achieving good performance.

Disk storage of data 

Many columnar databases (such as SAP HANA and Google PowerDrill) can only work in memory, which requires a larger hardware budget than is actually necessary. ClickHouse is designed to work on conventional disks, which offer a lower storage cost per GB, but it also makes full use of SSDs and additional memory when they are available.

Multi-core parallel processing 

ClickHouse uses all available resources on the server to process large queries in parallel in the most natural way.

Multi-server distributed processing 

Almost none of the columnar database management systems mentioned above support distributed query processing.
In ClickHouse, data can be stored on different shards. Each shard consists of a group of replicas used for fault tolerance. Queries are processed on all shards in parallel, and this is transparent to the user.
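The idea of running one query on every shard in parallel and combining the partial results can be sketched as follows. This is a minimal toy model, not ClickHouse's implementation: the shard contents, `partial_count`, and `distributed_count` are all hypothetical, and threads stand in for separate servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows held by one shard.
shards = [
    ["ru", "us", "ru"],
    ["de", "us"],
    ["ru", "de", "de"],
]

def partial_count(rows):
    """Work done locally on one shard: a partial count per group."""
    return Counter(rows)

def distributed_count(all_shards):
    """Run the partial aggregation on every shard in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(all_shards)) as pool:
        partials = list(pool.map(partial_count, all_shards))
    total = Counter()
    for part in partials:
        total.update(part)  # adds the per-group counts together
    return dict(total)

result = distributed_count(shards)
print(sorted(result.items()))  # [('de', 3), ('ru', 3), ('us', 2)]
```

The caller sees only the merged result, which is the sense in which sharding is transparent to the user.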

SQL support

ClickHouse supports a declarative query language based on SQL, which is compatible with the SQL standard in most cases.
Supported queries include GROUP BY, ORDER BY, IN, JOIN and non-correlated subqueries.
Window functions and correlated subqueries are not supported in the version described here.

Vector engine

To use the CPU efficiently, data is not only stored by columns but also processed by vectors (parts of columns).
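Processing a column one vector at a time can be sketched as below. This only illustrates the control flow; the block size and function names are hypothetical, and pure Python does not reproduce the CPU-level gains of a real vectorized engine.

```python
from array import array

BLOCK_SIZE = 4  # hypothetical; real engines use much larger blocks

def blocks(column, size):
    """Yield consecutive slices (vectors) of a column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]

def sum_where_greater(column, threshold):
    """Aggregate one block at a time instead of one row at a time."""
    total = 0
    for block in blocks(column, BLOCK_SIZE):
        # One operation is dispatched per block; the inner loop runs over
        # values of a single type laid out contiguously.
        total += sum(v for v in block if v > threshold)
    return total

clicks = array("I", [3, 10, 0, 7, 12, 1, 5, 9, 2])
print(sum_where_greater(clicks, 4))  # 43
```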

Real-time data update 

ClickHouse supports tables with a primary key. So that queries over ranges of the primary key run quickly, the MergeTree engine keeps data sorted by that key and sorts it incrementally as new data arrives. As a result, data can be written to the table continuously and efficiently, and no locks are taken during writes.
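One way to picture incremental sorting is a table made of small sorted parts that are merged later. The class below is a hypothetical toy model under that assumption; it is not the MergeTree implementation, and `TinyMergeTree` and its methods are invented for illustration.

```python
import heapq

class TinyMergeTree:
    """Hypothetical toy model: rows are (key, value) tuples sorted by key."""

    def __init__(self):
        self.parts = []  # each part is a sorted list of rows

    def insert(self, rows):
        # A write only sorts the new batch and appends it as a new part;
        # existing parts are never modified, so readers are not blocked.
        self.parts.append(sorted(rows))

    def merge(self):
        # Background step: combine all sorted parts into one sorted part.
        self.parts = [list(heapq.merge(*self.parts))]

    def scan(self):
        # Reads merge the parts on the fly, so results are ordered
        # whether or not a background merge has happened yet.
        return list(heapq.merge(*self.parts))

table = TinyMergeTree()
table.insert([(5, "e"), (1, "a")])
table.insert([(3, "c"), (2, "b")])
print(table.scan())      # [(1, 'a'), (2, 'b'), (3, 'c'), (5, 'e')]
table.merge()
print(len(table.parts))  # 1
```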

Index

Because data is physically sorted by primary key, ClickHouse can find a specific value or a range of values within tens of milliseconds.
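A minimal sketch of how a sparse index over sorted data narrows a lookup, assuming one index mark per fixed-size group of rows. The granule size and function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # hypothetical number of rows covered by one index mark

def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every granule."""
    return sorted_keys[::granule]

def range_lookup(sorted_keys, index, lo, hi, granule=GRANULE):
    """Return keys in [lo, hi], reading only the granules that may match."""
    first = max(bisect_left(index, lo) - 1, 0)  # first granule that may hold lo
    last = bisect_right(index, hi)              # exclusive granule bound
    candidates = sorted_keys[first * granule:last * granule]
    return [k for k in candidates if lo <= k <= hi]

keys = list(range(0, 40, 2))      # 20 sorted keys: 0, 2, ..., 38
index = build_sparse_index(keys)  # [0, 8, 16, 24, 32]
print(range_lookup(keys, index, 10, 18))  # [10, 12, 14, 16, 18]
```

Only the granules between the matching marks are read, which keeps range scans cheap. A lookup for a single key still reads at least one whole granule.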

Suitable for online query 

Online query means that queries are processed and their results loaded into the user's page with very low latency, without any preprocessing of the data.

Support for approximate calculation

ClickHouse provides several ways to speed up queries when some loss of accuracy is acceptable:

  1. Aggregate functions for approximate calculation of the number of distinct values, medians, and quantiles.
  2. Approximate queries that run on a sample of the data. In this case, only a small percentage of the data is read from disk.
  3. Aggregation over a limited number of randomly selected keys instead of all keys. When the keys in the data meet certain distribution conditions, this gives a fairly accurate result while using fewer computing resources.
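The sampling approach in item 2 can be sketched as follows. This is a hedged illustration, not ClickHouse's sampling mechanism: `in_sample`, `approx_sum`, and the hash-based selection rule are assumptions made for the example, and the rows are synthetic.

```python
import hashlib

def in_sample(user_id: int, fraction_percent: int) -> bool:
    """Deterministically keep roughly fraction_percent % of the keys."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100 < fraction_percent

def approx_sum(rows, fraction_percent):
    """Sum over the sampled rows only, then scale up to the full data set."""
    sampled = sum(
        value for user_id, value in rows if in_sample(user_id, fraction_percent)
    )
    return sampled * 100 / fraction_percent

# Hypothetical events: one row per user, each with value 1.
rows = [(user_id, 1) for user_id in range(100_000)]
exact = sum(value for _, value in rows)
estimate = approx_sum(rows, 10)
print(exact, estimate)
```

This toy version still iterates over every row to decide membership. In a storage engine laid out for sampling, the rows outside the sample would not be read from disk at all, which is where the speedup comes from.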

Support for data replication and data integrity

ClickHouse uses asynchronous multi-master replication. After data is written to any available replica, the system distributes it to the remaining replicas in the background, so that all replicas end up holding the same data. In most cases ClickHouse recovers automatically after a failure; a few complicated situations require manual recovery.

For more information, see Data Replication.

Limitations

  1. There is no full transaction support.
  2. Existing data cannot be modified or deleted at a high rate with low latency. Only batch deletes and updates are available, which is enough to clean up or modify data, for example to comply with GDPR.
  3. The sparse index makes ClickHouse unsuitable for point queries that retrieve a single row by its key.

For further details, see the official documentation: https://clickhouse.tech/docs/zh/

Origin blog.csdn.net/qq_37061368/article/details/109723320