Talking about ClickHouse

Notes migrated from 2021

Background knowledge

What are OLAP and OLTP?

OLAP (OnLine Analytical Processing): online analytical processing system;
OLTP (OnLine Transaction Processing): online transaction processing system.

OLAP is mainly used to read and analyze data in support of operational decision-making.
After data has been written in batches, analysts mine and analyze it from various angles in order to discover commercial value and business trends; OLAP is mainly used for offline analysis and has no strict timeliness requirements.
The main open-source products used for OLAP include HDFS, Hive, and Impala; the target application scenario is write once, read many times.

OLTP covers transactional insert, delete, update, and query operations.
For example, in an e-commerce system: purchasing goods, reducing inventory, placing and paying for shopping-cart orders, and so on.
OLTP is the main workload of traditional relational databases, which are therefore also called transaction-oriented processing systems; it is generally used in online business systems and requires real-time performance.

OLAP: read + one-time write;
OLTP: real-time read + real-time write

Think of it this way: OLTP systems are, in their various forms, the data sources of OLAP.
In the early stage we accumulate data through OLTP. Once enough data has accumulated, we run statistical analysis over what happened in the past to support the company's decision-making; that is OLAP.

To sum up: OLAP is an extension of OLTP that lets the data deliver greater value.

In addition, traditional databases are mainly oriented toward OLTP, while data warehouses are oriented toward OLAP and focus on decision analysis.
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data that supports the decision-making analysis of an enterprise or organization.


Row Database vs Column Database

What are row and column formats?

Row-based database: stores data in a row-oriented storage layout;
Column-based database: stores data in a column-oriented storage layout.

Suppose we have a small table of data with several rows and columns.
Row-based storage concatenates each row into a string of characters and stores the data row by row.
Columnar storage concatenates each column into a string and stores the data column by column.
Columnar storage is therefore better suited to queries, because each column effectively acts as its own index.
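
To make the layout difference concrete, here is a minimal Python sketch (the records and field names are made up purely for illustration) that stores the same data row by row and column by column, and shows why an aggregate over a single column touches less data in the columnar layout:

# Hypothetical records used only to illustrate the two layouts.
records = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Carol", "age": 35},
]

# Row-oriented layout: each row is kept together, as a row database stores it.
row_store = [(r["id"], r["name"], r["age"]) for r in records]

# Column-oriented layout: each column is kept together, as a columnar database stores it.
column_store = {
    "id": [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age": [r["age"] for r in records],
}

# Averaging one column: the row layout walks every full row and picks out one field,
# while the column layout scans only the "age" column and never touches the others.
avg_age_rows = sum(row[2] for row in row_store) / len(row_store)
avg_age_cols = sum(column_store["age"]) / len(column_store["age"])
print(avg_age_rows, avg_age_cols)  # both print 30.0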

Advantages and disadvantages of row-based and column-based storage

Columnar database:

  • Each column is stored separately, so the data itself serves as an index and queries are fast;
  • The values within a column share one data type, which allows more efficient compression;
  • Only the columns required by the SELECT are read into memory, which greatly reduces system IO.

The advantage of the column format is the disadvantage of the row format.

Row-based databases are good at random read operations, while column-based databases are better at querying data in large batches.
Columnar databases were born for data analysis in big-data environments: they suit large volumes of data rather than small batches, and query workloads rather than inserts, deletes, and updates.
Columnar databases also have advantages in parallel query processing and compression. Their main application scenario is workloads where 99% of the operations are queries, which makes them very suitable for OLAP.
Representative columnar databases include HANA, Sybase IQ, ParAccel, Sand/DNA Analytics, and Vertica.

ClickHouse

What is ClickHouse?

Yandex, Russia's largest search engine, open-sourced a data analysis database named ClickHouse on June 15, 2016. It is open source and free, and people in the community jokingly call it the "Katyusha database." The benchmark scores of this columnar database are higher than those of many popular commercial databases such as Vertica, and its popularity is also higher than many of them.

The architecture of ClickHouse

Tencent's ClickHouse usage architecture:
Toutiao's ClickHouse cluster architecture:
In both cases, many data sources surround the ClickHouse cluster, including offline data, real-time message-middleware data, and business data. A data ETL service runs on the periphery and regularly migrates data into ClickHouse local storage, and several analysis systems sit on top of the ClickHouse cluster.

Advantages and disadvantages of clickhouse

ClickHouse has three major characteristics: fast benchmark results, many features, and many restrictions.
Why is it said to be fast?
With 100 million rows of data:
ClickHouse is about 5 times faster than Vertica, 279 times faster than Hive, and 801 times faster than MySQL.
With one billion rows of data:
ClickHouse is still about 5 times faster than Vertica, while MySQL and Hive can no longer complete the task.

ClickHouse is fast because it adopts a parallel processing mechanism. By default, even a single query will use half of the server's CPU cores to execute, which is also why ClickHouse cannot support high concurrency. (The number of cores used can, of course, be configured; see the sketch below.)
In addition, its compression ratio is high: ClickHouse data compression can currently reach roughly 30% to 40%.
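
As a rough illustration of the parallelism point (a sketch only, assuming the third-party clickhouse-driver Python package and a ClickHouse server on localhost), the per-query thread count can be capped with the max_threads setting:

from clickhouse_driver import Client  # assumed dependency: pip install clickhouse-driver

client = Client(host="localhost")

# By default ClickHouse parallelizes even a single query across many CPU cores;
# max_threads limits how many threads this one query may use.
result = client.execute(
    "SELECT sum(number) FROM numbers(100000000)",
    settings={"max_threads": 4},
)
print(result)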

Why is it said to have many restrictions?
It is jokingly said that ClickHouse was born to serve a picky few:

  • Currently only Ubuntu is supported officially; on Windows an Ubuntu virtual environment is required;
  • High concurrency is not supported; the official recommendation is around 100 QPS (queries per second);
  • No design or architecture documents are provided; the design is rather opaque, and only the C++ source code is open;
  • It ignores the Hadoop ecosystem, uses locally attached storage, goes its own way, and implements distributed deployment by itself;
  • Transactions are not supported and there are no isolation levels;
  • Writes are slow;
  • It is hard to get started with, and the Russian style is a bit rough around the edges (part of where the "Katyusha database" nickname comes from);
  • Join performance is poor, far worse than single-table performance, so wide tables are recommended instead of joins, as sketched below (ClickHouse supports tables roughly 10,000 columns wide).
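
A minimal sketch of the wide-table idea mentioned in the last point, using hypothetical table and column names; it only contrasts the shape of a join query with the equivalent query against a denormalized wide table:

# Hypothetical queries; orders, users, and orders_wide are made-up tables.

# Join version: ClickHouse must build the right-hand side in memory, which is
# where its join performance tends to suffer on large tables.
join_query = """
SELECT u.city, count() AS order_cnt
FROM orders AS o
INNER JOIN users AS u ON o.user_id = u.id
GROUP BY u.city
"""

# Wide-table version: the city column was denormalized into orders_wide during
# ETL, so the same answer comes from a single-table scan.
wide_query = """
SELECT city, count() AS order_cnt
FROM orders_wide
GROUP BY city
"""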

Features of clickhouse

computing layer

Approximate computation: ClickHouse provides a set of well-written aggregate functions that can deliver fast approximate results, and it supports sampling the data. Complex data type support: ClickHouse also provides complex data types such as array, JSON, tuple, and set, which are extremely effective in practice.
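
A hedged sketch of these features (again assuming the clickhouse-driver package; the hits table, its user_id and latency_ms columns, and its SAMPLE BY setup are hypothetical):

from clickhouse_driver import Client

client = Client(host="localhost")

# Approximate aggregate functions: uniq() estimates the number of distinct values,
# and quantile(0.95)(...) returns an approximate 95th percentile.
approx = client.execute(
    "SELECT uniq(user_id), quantile(0.95)(latency_ms) FROM hits"
)

# Sampling: if the table was created with a SAMPLE BY expression,
# SAMPLE 0.1 reads only about 10% of the rows.
sampled = client.execute(
    "SELECT count() * 10 AS estimated_rows FROM hits SAMPLE 0.1"
)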



storage layer

At the storage layer, ClickHouse implements ordered data storage, primary key indexes, sparse indexes, data partitioning and sharding, primary-replica replication, and limited support for delete and update.

Ordered data storage means that when a table is created, the data can be sorted by certain columns. After sorting, data of the same kind is stored contiguously on disk, so the data fetched by a range query lives in one or a few contiguous regions, which greatly reduces disk IO time.
Data partitioning and sharding means that ClickHouse supports both a single-node mode and a distributed cluster mode. In the distributed mode the data is split into multiple shards distributed across different nodes, with a rich set of sharding strategies: random sharding (writes are distributed to a random node in the cluster), constant sharding (writes go to one fixed node), column-value sharding (writes are hash-sharded by the value of a chosen column), and custom-expression sharding (writes are hash-sharded according to a user-defined rule).
Primary-replica replication: a primary can be configured with multiple replicas, and under the default configuration every replica is active and can serve queries.
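
A hedged sketch of how these storage features appear in table definitions (the table, column, and cluster names are placeholders; my_cluster must already be defined in the server configuration):

from clickhouse_driver import Client

client = Client(host="localhost")

# Ordered storage and partitioning: ORDER BY defines the on-disk sort order
# (which also backs the sparse primary index), PARTITION BY splits data by month.
client.execute("""
CREATE TABLE IF NOT EXISTS events_local (
    event_date Date,
    user_id    UInt64,
    action     String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
""")

# Sharding: a Distributed table spreads reads and writes over the shards of the
# named cluster; rand() as the sharding key corresponds to random sharding.
client.execute("""
CREATE TABLE IF NOT EXISTS events_all AS events_local
ENGINE = Distributed(my_cluster, currentDatabase(), events_local, rand())
""")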

vectorization engine

ClickHouse implements a vectorized execution engine, which makes full use of the CPU's performance.

For columnar data in memory, a SIMD (Single Instruction, Multiple Data) instruction is invoked once per batch instead of once per row. This not only reduces the number of function calls and cache misses, but also exploits the parallelism of SIMD instructions, greatly reducing computation time. A vectorized execution engine can usually bring a severalfold performance improvement.
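
This is not ClickHouse's internal code, but the idea can be imitated in Python with NumPy: processing a whole column in one call lets the work run in compiled (and often SIMD-accelerated) code instead of a per-row interpreted loop:

import numpy as np

prices = np.random.rand(1_000_000)

# Row-at-a-time style: one interpreted operation per element.
def total_scalar(values):
    total = 0.0
    for v in values:
        total += v * 1.1
    return total

# Vectorized style: one call over the whole column; the multiply and sum run over
# contiguous memory in compiled code, analogous to batched SIMD execution.
def total_vectorized(values):
    return float(np.sum(values * 1.1))

print(total_scalar(prices))
print(total_vectorized(prices))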

ClickHouse provides a MySQL database engine that can map tables in MySQL directly into ClickHouse tables (so MySQL sometimes serves as the underlying store). In addition, ClickHouse has its own native table engines and can execute ordinary SQL.
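
A hedged sketch of the MySQL table engine (host, credentials, and table names are placeholders to be replaced with real ones):

from clickhouse_driver import Client

client = Client(host="localhost")

# Maps an existing MySQL table into ClickHouse; queries against mysql_users are
# forwarded to the MySQL server underneath.
client.execute("""
CREATE TABLE IF NOT EXISTS mysql_users (
    id   UInt64,
    name String
) ENGINE = MySQL('mysql-host:3306', 'shop_db', 'users', 'reader', 'secret')
""")

rows = client.execute("SELECT count() FROM mysql_users")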

Use cases of clickhouse

Toutiao uses ClickHouse to analyze user behavior. It runs thousands of ClickHouse nodes internally, with up to 1,200 nodes in a single cluster, tens of PB of total data, and roughly 300 TB of new raw data per day.

Tencent internally uses ClickHouse for game data analysis, and has established a complete monitoring and maintenance system for it.

Ctrip started an internal trial in July 2018, and currently 80% of its business runs on ClickHouse, with more than one billion rows of incremental data and nearly one million query requests per day.

Kuaishou also uses ClickHouse internally, with roughly 10 PB of total storage and about 200 TB added per day; 90% of queries finish in under 3 seconds.

Abroad, Yandex runs hundreds of nodes for user click-behavior analysis, and top companies such as CloudFlare and Spotify also use it. More than a dozen projects inside Yandex use ClickHouse, including data analysis, email, advertising data analysis, user behavior analysis, and more.

In 2012, CERN used ClickHouse to store the large volumes of experimental data produced by its particle collider; the data stored each year is at the PB level, and it supports statistical analysis and querying.


Advantages of ClickHouse compared with other OLAP systems

Generally speaking:
Compared with commercial OLAP databases, ClickHouse is open source and free.
Compared with cloud solutions, ClickHouse can be deployed on your own machines, with no cloud fees.
Compared with Hadoop-ecosystem DBMSs, ClickHouse does not depend on the Hadoop ecosystem, implements a real-time, high-performance system on its own, and also supports distributed multi-datacenter deployment.
Compared with other open-source OLAP databases, ClickHouse is more mature, more stable, and covers more application scenarios.
Compared with non-relational databases, ClickHouse supports querying the raw data directly and supports an SQL-like language, which provides the convenience of a traditional relational database.

The difference between ClickHouse and Kylin

  • Kylin is suitable for high concurrency and fixed mode query scenarios
  • ClickHouse is suitable for low concurrency, flexible and ad hoc query scenarios

Fixed-mode queries:
The customer's needs are computed in advance and saved as results; when the customer issues the request, the corresponding result is looked up and returned.
The benefit is that matching queries return very quickly, a larger volume of data can be supported with the same resources, and more concurrency can be handled; the drawback is that the more dimensions and the more complex the table, the more disk space is required, and building the cube takes a certain amount of time.

Flexible ad-hoc queries:
The data is computed in real time according to the user's request and the result is returned, so the user has much more flexibility and can combine dimensions at will for real-time computation.
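
A toy Python sketch of the trade-off between the two query styles (the fact rows are made up; it only illustrates precomputed-cube lookups versus on-the-fly aggregation):

from collections import defaultdict

# Made-up fact rows: (region, product, sales).
facts = [
    ("north", "a", 10), ("north", "b", 5),
    ("south", "a", 7), ("south", "b", 3),
]

# Fixed-mode style (Kylin-like): precompute the per-region totals once,
# then answer matching queries with a plain lookup.
cube_by_region = defaultdict(int)
for region, _, sales in facts:
    cube_by_region[region] += sales

def fixed_query(region):
    return cube_by_region[region]  # instant, but only for precomputed shapes

# Ad-hoc style (ClickHouse-like): aggregate over the raw rows at query time,
# so any combination of dimensions can be asked for.
def adhoc_query(predicate):
    return sum(s for r, p, s in facts if predicate(r, p))

print(fixed_query("north"))                # 15
print(adhoc_query(lambda r, p: p == "a"))  # 17, a shape the cube never precomputed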

In addition, in principle, Kylin is built on the Hadoop platform and performs MOLAP (multi-dimensional online analytical processing) over a precomputed cube model, that is, it trades space for time; ClickHouse is ROLAP (relational online analytical processing) and pursues high-performance OLAP analysis by pushing CPU performance to the extreme.

As for what they have in common, both Kylin and ClickHouse can return OLAP (online analytical processing) query results over PB-scale data through SQL at the sub-second level (with most queries returning within 5 s).

The Evolution of Big Data Technology

At the beginning, everyone found native MapReduce too hard to write, and there was an urgent need for a higher-level language layer to describe the data processing flow (much like assembly: you can do everything with it, but it is far too cumbersome, which is why higher-level languages such as C++ and Java appeared).

So Pig and Hive were born.

Pig describes MapReduce jobs in a script-like manner, while Hive uses a SQL-like language called HQL.
Pig and Hive translate their scripts and HQL into MapReduce programs and hand them to the compute engine, which frees developers from writing cumbersome MapReduce programs and lowers the barrier to entry.
With Hive, people found that SQL really is simple, so Hive gradually grew into the core component of the big data warehouse.

But people found that Hive ran too slowly on MapReduce, so interactive SQL engines such as Impala, Presto, and Drill were born, sacrificing some generality and stability for processing speed. (MapReduce was not completely replaced; these engines simply optimized around it.)
However, these interactive engines still did not fully meet expectations, so a new generation of general-purpose compute engines, Tez and Spark, was adopted to run SQL in place of MapReduce, and thus Hive on Tez / Spark and SparkSQL were born.

The above is basically the framework of a mature data warehouse.
The bottom layer is HDFS; MapReduce/Tez/Spark run on top of it for computation; Hive and Pig run at the high-level language layer; alternatively, Impala, Drill, and Presto run directly on HDFS. This covers most low-to-medium-speed data processing needs.

references

The original manuscript has been lost, so the references for this section could not be recovered.

Origin blog.csdn.net/wlh2220133699/article/details/131519819