What is ClickHouse

1. Introduction

ClickHouse is a column-oriented database management system (DBMS) for Online Analytical Processing (OLAP) with an MPP architecture, open-sourced by Russia's Yandex in 2016. It can generate analytical data reports in real time using SQL queries. The name ClickHouse comes from "Click Stream, Data WareHouse".

2. Features of ClickHouse

An open-source column-oriented database management system that is simple and convenient to use, scales linearly, and offers high reliability and fault tolerance.
Fast: its benchmark scores show it running about 5 times faster than Vertica, 279 times faster than Hive, and 800 times faster than MySQL, and it can handle data at the billion-row scale.
Multi-functional: it supports statistical analysis for a wide range of scenarios, SQL-like queries, and replicated deployment across remote servers.

3. Why is ClickHouse fast?

1. Data compression. ClickHouse is a column-oriented database, and each column is compressed (LZ4 by default). Because values within a column are highly repetitive, the compression ratio is very good and scanning a single column is extremely fast; in practice ClickHouse can compress data to something on the order of 20% of its original size.

2. CPU instruction-level optimization. ClickHouse makes heavy use of vectorized instructions and is very good at exploiting the CPU's L1, L2, and L3 caches, minimizing reads from main memory or disk.

3. Different algorithms and data structures for different situations. For example, when the amount of data is small it is stored in an array; when it is medium-sized, in a hash set; and when it is large, in a HyperLogLog structure.

4. Different table engines, including the MergeTree engine, the AggregatingMergeTree engine, the Memory engine, and others. Different application scenarios can use different table engines to improve performance (a short SQL sketch follows this list).

5. High-performance internal algorithms. For example, for string search ClickHouse chooses the fastest algorithm available for the situation.
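As a rough, hedged sketch of points 1, 3, and 4 above (the table and column names here are invented for illustration): the example below creates a MergeTree table with an explicit LZ4 codec on one column, and uses the uniqCombined aggregate function, which internally moves from an array to a hash set to a HyperLogLog structure as the number of distinct values grows.

    -- hypothetical table using the MergeTree engine; LZ4 is ClickHouse's default codec
    CREATE TABLE hits_sketch
    (
        EventDate  Date,
        UserID     UInt64,
        URL        String CODEC(LZ4)
    )
    ENGINE = MergeTree
    ORDER BY (EventDate, UserID);

    -- uniqCombined switches its internal structure (array -> hash set -> HyperLogLog)
    -- as the cardinality of UserID grows
    SELECT uniqCombined(UserID) FROM hits_sketch;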

4. Advantages and disadvantages of ClickHouse

Advantages:

To use the CPU efficiently, data is not only stored by column but also processed in vectorized batches;

Data compresses well, which reduces I/O; throughput for a single query is high, reaching up to billions of rows per second per server;

The index is not a B-tree and does not have to follow the leftmost-prefix rule; it is enough that the filter condition involves an indexed column, and even when the queried data is not covered by the index, ClickHouse's parallel processing makes a full table scan very fast (see the sketch after this list);

Write speed is very high, around 50–200 MB/s; assuming roughly 100 bytes per row, that is about 500,000–2,000,000 rows per second, which makes it well suited to ingesting large volumes of data.
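As a hedged sketch of the index point above (column and table names invented for illustration): the MergeTree primary index is defined by ORDER BY and is sparse, so a query that filters only on a non-leading key column still executes as a fast, parallel scan rather than requiring a leftmost prefix.

    -- the sparse primary index follows the ORDER BY key
    CREATE TABLE events_sketch
    (
        EventDate Date,
        UserID    UInt64,
        Referer   String
    )
    ENGINE = MergeTree
    ORDER BY (EventDate, UserID);

    -- filtering on the second key column only: no leftmost-prefix requirement,
    -- the scan is still parallel and fast
    SELECT count() FROM events_sketch WHERE UserID = 90329509958;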

Disadvantages:

Transactions are not supported;

High concurrency is not supported; the official recommendation is about 100 QPS. The number of connections can be raised by changing the configuration file, but only if the server hardware is good enough;

The SQL dialect covers more than 80% of commonly used syntax, but the join syntax is rather unusual; newer versions support SQL-style joins, but their performance is poor;

Try to write in batches of at least 1,000 rows and avoid row-by-row or small-batch INSERT, UPDATE, and DELETE operations, because ClickHouse continuously merges data asynchronously in the background and frequent small writes hurt query performance; real-time single-row writes should be avoided as far as possible;

ClickHouse is fast because it processes queries in parallel: even a single query may use half of the server's CPU cores, so ClickHouse is not suited to high-concurrency workloads. By default, the number of CPU cores used by a single query is half the number of cores on the server; the core count is detected automatically, and this parameter can be changed in the configuration (a sketch follows).
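As a small sketch of the batching and parallelism notes above (reusing the hypothetical events_sketch table from the earlier sketch; max_threads is the session setting that caps how many threads a query may use): insert many rows in one statement rather than one at a time, and adjust per-query parallelism if needed.

    -- prefer one multi-row INSERT (ideally 1,000+ rows) over many single-row inserts
    INSERT INTO events_sketch (EventDate, UserID, Referer) VALUES
        ('2016-05-18', 89354350662, 'Investor Relations'),
        ('2016-05-18', 90329509958, 'Contact us'),
        ('2016-05-18', 89953706054, 'Mission');

    -- cap the number of CPU threads queries in this session may use
    SET max_threads = 4;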

5. Extension: what is a column-oriented database?

In a traditional row-based database system, data is stored in the following order:

Row   WatchID      JavaEnable   Title                GoodEvent   EventTime
#0    89354350662  1            Investor Relations   1           2016-05-18 05:19:20
#1    90329509958  0            Contact us           1           2016-05-18 08:10:20
#2    89953706054  1            Mission              1           2016-05-18 07:38:00
#N    ...          ...          ...                  ...         ...

Data belonging to the same row is always physically stored together.

Common row-oriented database systems include MySQL, Postgres, and MS SQL Server.

In a column-oriented database system, data is stored in the following order:

Row:         #0                    #1                    #2                    #N
WatchID:     89354350662           90329509958           89953706054           ...
JavaEnable:  1                     0                     1                     ...
Title:       Investor Relations    Contact us            Mission               ...
GoodEvent:   1                     1                     1                     ...
EventTime:   2016-05-18 05:19:20   2016-05-18 08:10:20   2016-05-18 07:38:00   ...

These examples only show the order in which data is arranged: values from different columns are stored separately, and data from the same column is stored together.

Common column-oriented databases include: Vertica, Paraccel (Actian Matrix, Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise, Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, and kdb+.

Different ways of storing data suit different business scenarios. A data access scenario is defined by: which queries are made, how often, and in what proportions; how much data each type of query reads in rows, columns, and bytes; the relationship between reading data and updating it; the size of the working data set and how locally it is used; whether transactions are used and how they are isolated; requirements for data replication and integrity; and the latency and throughput each type of query requires, and so on.

The higher the load on the system, the more important it is to tailor it to the usage scenario, and the more fine-grained that tailoring becomes. No single system is equally well suited to all scenarios. If a system has to cover a wide range of scenarios under high load, it is forced to make trade-offs: balance or efficiency?

6. Key properties of the OLAP scenario

  • The vast majority of requests are reads.

  • Data is updated in fairly large batches (> 1,000 rows) rather than row by row, or not updated at all.

  • Data that has been added to the database cannot be modified.

  • For reads, a fairly large number of rows is extracted from the database, but only a small subset of the columns.

  • Tables are "wide", i.e. they contain a large number of columns.

  • Queries are relatively rare (usually hundreds of queries per server per second or fewer).

  • For simple queries, latencies of around 50 ms are acceptable.

  • Column values are fairly small: numbers and short strings (for example, 60 bytes per URL).

  • High throughput is required when processing a single query (up to billions of rows per second per server).

  • Transactions are not necessary.

  • Requirements for data consistency are low.

  • Each query involves one large table; all the other tables are small.

  • Query results are significantly smaller than the source data; in other words, the data is filtered or aggregated so that the result fits in a single server's RAM.

It is easy to see that the OLAP scenario is very different from other common scenarios (such as OLTP or key-value workloads), so trying to use an OLTP or key-value database to handle analytical queries efficiently is not an ideal fit. For example, using an OLAP database for analytical requests usually works much better than using MongoDB or Redis for the same purpose.

7. Why column-oriented databases are better suited to OLAP

Column-oriented databases are better suited to the OLAP scenario (they are at least 100 times faster for most queries). The reasons are explained in detail below (pictures make this easier to grasp intuitively):

(Figure: row-oriented storage layout)

(Figure: column-oriented storage layout)

See the difference? The rest of this section explains in detail why this happens.

Input/output

  1. For analytical queries, only a small number of a table's columns usually need to be read. In a column-oriented database you can read only the data you need. For example, if 5 out of 100 columns are needed, you can expect at least a 20-fold reduction in I/O.

  2. Because data is read in batches, it is easy to compress, and data stored column by column also compresses better. This further reduces the volume of I/O.

  3. Because of the reduced I/O, more data fits into the system cache.
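As a hedged illustration of the compression point (the table name is hypothetical), ClickHouse exposes per-column compressed and uncompressed sizes in the system.columns table, so the I/O savings are easy to inspect:

    -- compare on-disk (compressed) size with uncompressed size for each column
    SELECT name, data_compressed_bytes, data_uncompressed_bytes
    FROM system.columns
    WHERE database = currentDatabase() AND table = 'hits_sketch';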

For example, the query "count the number of records for each advertising platform" requires reading the "advertising platform ID" column, which takes 1 byte per value uncompressed. If most of the traffic did not come from advertising platforms, this column can be compressed by at least a factor of ten. With a fast compression algorithm, it can be decompressed at a rate of at least a billion bytes of uncompressed data per second. In other words, this query can be processed at a speed of roughly several billion rows per second on a single server. This speed is actually achieved in practice.
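A hedged sketch of that query (the table and column names are invented; a real schema would differ): only the single needed column is read, decompressed, and aggregated.

    -- reads only the AdvertiserID column; the other columns are never touched
    SELECT AdvertiserID, count() AS records
    FROM hits
    GROUP BY AdvertiserID
    ORDER BY records DESC;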

8. Performance comparison chart

9. ClickHouse installation

https://blog.csdn.net/2301_76154806/article/details/128781183

After installation, connect with DBeaver to verify that the server is reachable.
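Once connected (with DBeaver or any other client), a minimal sanity check is to run a trivial query such as:

    -- confirms the connection works and shows the server version
    SELECT version();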
