The Road to Architecture Exploration, First Stop: ClickHouse | JD Cloud Technical Team

I. Introduction

Architecture is one of the most familiar terms in software development. It runs through our daily work, from entire projects down to the smallest functional components. Achieving high performance, high scalability, and high availability requires sound architectural ideas. I am therefore writing a series of articles on architecture that analyzes classic, well-regarded open source projects, in order to learn from their architectural ideas, build up experience and judgment in architectural design, and gain a deeper understanding when the same problems come up in later daily work.

This chapter takes the real-time OLAP engine ClickHouse (ck for short) as an example and introduces its target scenarios, architectural design, and implementation details, to understand in depth how it became the performance leader among OLAP engines.

II. Introduction to ClickHouse

ClickHouse is a columnar database management system for online analytical processing (OLAP), open sourced in 2016 by Russia's Yandex (the website with the largest number of Internet users in Russia). It is written in C++ and is mainly used for online analytical processing queries, generating analytical data reports in real time through SQL queries.

Its main scenario is ad hoc queries: it quickly supports queries over any metric and any dimension, and can return results within seconds even at large data volumes.

III. ClickHouse architecture principles

ClickHouse is known for its excellent performance. In the relevant performance comparison report, ck's single-table SQL query performance is 2.3 times that of Presto, 3 times that of Impala, 7 times that of Greenplum, and 48 times that of Hive. Clearly ck is very good at single-table queries, so how does it achieve such efficient queries?

1. Introduction

Before introducing how ck executes queries, let's take the familiar MySQL as an example, look at how a simple query statement is executed, and then consider, from the perspective of a ck architect, how ck should be optimized. When MySQL queries data, it first reads data pages (the InnoDB storage unit) from disk into memory, and then returns the query results from memory. In our understanding, therefore, a SQL query (excluding steps such as lexical and syntax parsing and optimization) can be summarized in the following two points:

  1. Read data from disk to memory
  2. Match the data in memory and return the results

In modern computers, the time spent by the CPU in calculations is much less than the time spent on disk IO. Therefore, most modern OLAP engines also choose to improve query performance by reducing disk IO. For example:

Way to reduce disk IO | Principle | Example | Limitations
Distributed | Read data in parallel to reduce the amount of data read by a single node | Hive (textfile) | Data skew, network overhead, wasted resources
Column storage | Store each column separately and read it on demand | HBase | Suitable for columns used by a single business

2. Architecture

From the analysis above, we can conclude that the bottleneck of OLAP queries lies in disk IO, so ck's optimizations also draw on these measures, using an MPP (massively parallel processing) architecture plus columnar storage. Many other database products have a similar architectural design, so why is ck's performance so outstanding? Next, we analyze ck's core features in detail to better understand the ingenious ideas of its architects.

2.1 Column storage

Row storage: Put the data of the same row into the same data block, and store the data blocks contiguously one after another.

Column storage: Put the same column of data into the same data block, and different columns can be stored separately.

As mentioned above, analytical queries often need only a few fields of a table. A column store reads only the columns the query uses, while a row store reads all column data of every record it reads. A column store is therefore much more efficient in IO than a row store, so it performs better.
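To make the IO difference concrete, here is a minimal sketch that serializes the same small table in both layouts and counts the bytes a `sum(revenue)` query has to read. The table, its column names, and the helper functions are hypothetical and only serve the illustration; this is not ClickHouse's on-disk format.

```python
import struct

# Hypothetical table: (user_id, age, revenue) for three rows.
rows = [(1, 30, 9.5), (2, 41, 3.0), (3, 25, 7.25)]
ROW_FORMAT = "<qqd"  # two 8-byte integers and one 8-byte float, no padding


def row_store(rows):
    """Row layout: all fields of one row are adjacent."""
    return b"".join(struct.pack(ROW_FORMAT, *r) for r in rows)


def column_store(rows):
    """Column layout: each column is its own contiguous buffer."""
    user_ids, ages, revenues = zip(*rows)
    n = len(rows)
    return {
        "user_id": struct.pack(f"<{n}q", *user_ids),
        "age": struct.pack(f"<{n}q", *ages),
        "revenue": struct.pack(f"<{n}d", *revenues),
    }


def sum_revenue_row_store(data):
    """The row layout forces reading every field of every row."""
    total = sum(rec[2] for rec in struct.iter_unpack(ROW_FORMAT, data))
    return total, len(data)


def sum_revenue_column_store(columns):
    """The column layout reads only the revenue buffer."""
    buf = columns["revenue"]
    total = sum(v for (v,) in struct.iter_unpack("<d", buf))
    return total, len(buf)


row_result = sum_revenue_row_store(row_store(rows))
col_result = sum_revenue_column_store(column_store(rows))
```

Both calls return the same sum, but the second element of each result (bytes read) differs by the ratio of total columns to queried columns, which is three to one in this toy table.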

2.2 Block

The smallest unit that ClickHouse processes is a block. A block is a collection of rows, with a default maximum of 8192 rows. Because each column is stored separately, each data file is more regular than in row-based storage. By applying the LZ4 compression algorithm to blocks, the overall compression ratio reaches roughly 8:1. ClickHouse thus achieves batch processing through a good compression ratio and the block structure. Compared with processing one row at a time over massive data, the number of IO operations is greatly reduced, which is how the storage engine is optimized.
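The block idea can be sketched as follows: split a column into blocks of at most 8192 values and compress each block as a unit. Note the assumptions: `zlib` from the Python standard library stands in for LZ4 (which is not in the standard library), the column content is made up, and the compression ratio here says nothing about the 8:1 figure quoted above.

```python
import struct
import zlib

BLOCK_ROWS = 8192  # block size quoted in the text


def split_into_blocks(column, block_rows=BLOCK_ROWS):
    """Yield consecutive slices of at most block_rows values."""
    for start in range(0, len(column), block_rows):
        yield column[start:start + block_rows]


def compress_block(values):
    """Serialize one block as 4-byte integers and compress it as a unit.

    zlib is a stand-in for LZ4 here.
    """
    raw = struct.pack(f"<{len(values)}i", *values)
    return raw, zlib.compress(raw)


def decompress_block(payload):
    """Restore the list of integers from a compressed block."""
    raw = zlib.decompress(payload)
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


# Hypothetical low-cardinality column, e.g. one status code per row.
column = [i % 16 for i in range(100_000)]
blocks = list(split_into_blocks(column))
raw, packed = compress_block(blocks[0])
```

With 100,000 rows the column is handled in 13 block-sized reads instead of 100,000 row-sized ones, and because a single column holds values of one type, each block tends to compress well.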

2.3 LSM

The idea of LSM: Incremental modifications to the data are kept in memory and, once a specified limit is reached, written to disk in batches. Writes are therefore fast, but reads cost more: a read has to merge recently modified data in memory with historical data on disk, so it first checks whether the data is in memory and accesses the disk files only on a miss.

The principle of LSM: Split one large tree into N small trees. Data is first written to memory; as the small trees grow, they are flushed from memory to disk. The trees on disk are merged periodically into one large tree to optimize read performance.

ClickHouse uses LSM to pre-sort data, which reduces the amount of data read from disk. The principle is to sort out-of-order data through LSM, write it to disk, and periodically merge overlapping disk files. The ClickHouse write path can be summarized in the following steps:

  1. For every batch of data written, the log is recorded first to ensure high availability.
  2. After the log is recorded, the data is sorted in memory and the sorted result is written to disk, with the merge count recorded as Level=0.
  3. Files with Level=0 or 1 on disk are merged periodically; the old files are marked for deletion and physically deleted later.
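The three steps above can be sketched with a toy in-memory model. `TinyLSM`, its `flush_threshold`, and the list that plays the role of disk files are all hypothetical names for illustration; real storage engines persist the log and the parts to disk and merge in the background.

```python
import heapq


class TinyLSM:
    """Toy LSM-style store: log writes, flush sorted parts, merge parts."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.log = []     # step 1: append-only log of incoming writes
        self.buffer = []  # unsorted rows waiting in memory
        self.parts = []   # stand-in for disk files: (level, sorted rows)

    def write(self, key, value):
        self.log.append((key, value))  # record the log first
        self.buffer.append((key, value))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Step 2: sort buffered rows and store them as a Level=0 part."""
        if self.buffer:
            self.parts.append((0, sorted(self.buffer)))
            self.buffer = []

    def merge(self):
        """Step 3: merge all sorted parts into one part at the next level."""
        if len(self.parts) < 2:
            return
        level = max(lvl for lvl, _ in self.parts) + 1
        merged = list(heapq.merge(*(rows for _, rows in self.parts)))
        # A real engine would mark the old parts and delete them later.
        self.parts = [(level, merged)]


store = TinyLSM(flush_threshold=4)
for k in [5, 1, 4, 2, 8, 3, 7, 6]:
    store.write(k, f"v{k}")
parts_before_merge = [[k for k, _ in rows] for _, rows in store.parts]
store.merge()
```

Before the merge there are two sorted parts whose key ranges overlap, so a range read would have to consult both; after the merge a single sorted part remains, which is the read-side benefit the merge step buys.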

2.4 Index

ClickHouse uses a primary index (a sparse index) plus secondary indexes (skip indexes) to locate and query data. The primary index records the first row of each block, so a query on the index field only needs to determine which blocks to read, which avoids traversing all the data. As mentioned above, a block has 8192 rows, so 100 million rows need only about 12,000 index entries (100,000,000 / 8192 ≈ 12,207). The primary index therefore takes little storage and can stay resident in memory, which speeds up queries. A secondary index is built from aggregated information about the data, and the content of that information depends on the index type. A skip index serves the same purpose as the primary index: it reduces the range of data scanned during a query. Both work by exclusion, that is, they rule out as many index granules as possible that definitely cannot satisfy the conditions.

It can also be seen that, because the ck storage engine stores data as ordered sets, the index structure does not need the sorting property of a B+ tree for positioning. In practice, therefore, queries do not have to satisfy the leftmost-prefix matching principle; it is enough for the filter conditions to include an index column.
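A sparse primary index of this kind can be sketched with a binary search over the first key of each block. The function names and the sample key column below are hypothetical; the sketch only assumes that the data is sorted by the key and cut into blocks of 8192 rows, as described above.

```python
import bisect

GRANULE_ROWS = 8192  # rows per block, as quoted in the text


def build_sparse_index(sorted_keys, granule_rows=GRANULE_ROWS):
    """Keep only the first key of every block of granule_rows rows."""
    return sorted_keys[::granule_rows]


def blocks_for_range(index, lo, hi):
    """Return the block numbers that may contain keys in [lo, hi]."""
    # The last block whose first key is below lo may still contain lo.
    # Using bisect_left keeps the choice conservative for duplicate keys.
    first = max(bisect.bisect_left(index, lo) - 1, 0)
    # Blocks whose first key is above hi cannot contain a match.
    last = bisect.bisect_right(index, hi) - 1
    return list(range(first, last + 1))


# Hypothetical sorted key column: 100,000 even numbers.
keys = list(range(0, 200_000, 2))
index = build_sparse_index(keys)
candidates = blocks_for_range(index, 40_000, 50_000)

# Index entries needed for 100 million rows (ceiling division).
entries_for_100m = -(-100_000_000 // GRANULE_ROWS)
```

Here 13 index entries cover 100,000 rows, and the range query touches only 2 of the 13 blocks. For 100 million rows the same scheme needs roughly 12,000 entries, small enough to keep in memory.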
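The exclusion approach of the secondary index described in this section (called a data skipping index in ClickHouse) can be illustrated with the min/max variant: keep the minimum and maximum of a column per granule and skip every granule whose range cannot contain the value. The column, the granule size of 4, and the function names are made up for readability.

```python
def build_minmax_index(values, granule_rows):
    """Store (min, max) of a column for every granule of rows."""
    index = []
    for start in range(0, len(values), granule_rows):
        granule = values[start:start + granule_rows]
        index.append((min(granule), max(granule)))
    return index


def granules_to_scan(minmax_index, target):
    """Exclude granules whose [min, max] range cannot contain target."""
    return [g for g, (lo, hi) in enumerate(minmax_index) if lo <= target <= hi]


# Hypothetical non-key column (say, latency in ms), 4 rows per granule.
latency = [12, 15, 11, 14, 90, 95, 99, 91, 13, 10, 16, 12, 40, 45, 42, 48]
minmax = build_minmax_index(latency, granule_rows=4)
hits_14 = granules_to_scan(minmax, 14)
hits_60 = granules_to_scan(minmax, 60)
```

A lookup for 14 still has to scan two granules, because the index can only say "possibly here", never "definitely here". A lookup for 60 scans nothing at all, which is where the exclusion method pays off.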

2.5 Vectorized execution

Vectorization, also called vectorized operation or array programming, comes down to one thing: turning the computations of many for-loop iterations into a single computation. Vectorized execution relies on the CPU's SIMD instructions. SIMD stands for Single Instruction, Multiple Data: a single instruction operates on multiple pieces of data. In modern computer systems it is one way to improve performance through data parallelism (others include instruction-level parallelism and thread-level parallelism). It works by performing parallel operations on data at the CPU register level.

In the architecture of computer systems, the storage system is a hierarchical structure. The storage hierarchy of a typical server computer is shown in Figure 1. Practical experience tells us that the closer the storage medium is to the CPU, the faster the data is accessed.
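Python cannot issue SIMD instructions directly, so the sketch below imitates the idea with a related technique, SWAR (SIMD within a register): four 8-bit values are packed into one 32-bit word, and a single integer addition updates all four lanes at once. This is an illustration of "one operation, multiple data" only, not how ClickHouse implements vectorized execution, and all names are hypothetical.

```python
LOW7 = 0x7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH = 0x80808080  # top bit of every 8-bit lane


def pack_lanes(values):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_lanes(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def swar_add(a, b):
    """Add four 8-bit lanes with one addition, wrapping inside each lane."""
    # Adding only the low 7 bits keeps every carry inside its own lane.
    partial = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then fixed up with XOR (carry-less add).
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


a = pack_lanes([1, 2, 3, 250])
b = pack_lanes([10, 20, 30, 10])
lane_sums = unpack_lanes(swar_add(a, b))
```

The last lane wraps around (250 + 10 = 260, which is 4 modulo 256) without disturbing its neighbors. Real SIMD registers do the same thing in hardware with wider registers and dedicated instructions.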

From left to right, the further away from the CPU, the slower the data access speed. The speed of accessing data from registers is 300 times that of accessing data from memory, and 30 million times that of accessing data from disk. Therefore, using the characteristics of CPU vectorized execution is of great significance to improve the performance of the program. ClickHouse currently utilizes the SSE4.2 instruction set to implement vectorized execution.

IV. Summary of ClickHouse

1. The trade-offs of ClickHouse

ClickHouse adopts many excellent designs in pursuit of ultimate performance, such as the column storage, batch processing, and pre-sorting described above. But every architecture has two sides, and these designs also bring shortcomings in other respects.

  • High-frequency real-time writes: because ck writes each batch of data directly into small files, high-frequency writes generate a large number of small files that must be merged, which affects query performance. The official ck guidance therefore recommends large, low-frequency batch writes to improve write performance. In practice, it is recommended to introduce a data cache layer between the business and the database to batch the writes.
  • Query concurrency: ClickHouse uses parallel processing, so a single query uses half of the CPU cores by default, with the core count detected automatically at installation. This gives fast queries at the cost of limited concurrency. If too many queries accumulate and reach the max_concurrent_queries threshold, a "too many simultaneous queries" exception is reported; this is ck's rate-limiting protection. In daily use, therefore, troubleshooting slow SQL and controlling concurrent requests are the keys to keeping ck highly available.
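The cache layer recommended for writes can be as simple as a buffer that collects rows and hands them over in large batches. The sketch below is one possible shape, under stated assumptions: `BatchBuffer` and `max_rows` are hypothetical names, `sink` stands for whatever function performs the actual bulk insert, and a production version would also flush on a timer so that rows do not wait indefinitely.

```python
class BatchBuffer:
    """Collect rows and pass them to a sink in large batches."""

    def __init__(self, sink, max_rows=10_000):
        self.sink = sink          # callable that performs one bulk insert
        self.max_rows = max_rows  # flush once this many rows are buffered
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        """Send the buffered rows as one batch and start a new buffer."""
        if self.rows:
            self.sink(self.rows)
            self.rows = []


# Usage with a stand-in sink that just records each batch.
batches = []
buffer = BatchBuffer(sink=batches.append, max_rows=3)
for i in range(7):
    buffer.add({"id": i})
buffer.flush()  # push out the remainder, e.g. on shutdown
batch_sizes = [len(b) for b in batches]
```

Seven single-row writes from the business side reach the database as three inserts, so the number of small files created is governed by `max_rows` rather than by the write frequency of the callers.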

Once we understand these principles, we understand ClickHouse more deeply and can explain the problems we run into in production. We can then use it sensibly from the perspective of its architects, avoiding its weaknesses and making full use of its strengths.
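For the concurrency side, one way to control concurrent requests is a small gate on the client that caps the number of in-flight queries below the server's max_concurrent_queries threshold. The sketch is a hypothetical helper, not part of any ClickHouse client library; `QueryGate`, `max_in_flight`, and `query_fn` are illustrative names.

```python
import threading


class QueryGate:
    """Client-side cap on in-flight queries."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, query_fn):
        # Fail fast instead of queueing when all slots are taken.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("too many simultaneous queries (client-side gate)")
        try:
            return query_fn()
        finally:
            self._slots.release()


gate = QueryGate(max_in_flight=1)
first = gate.run(lambda: 42)  # a slot is free, so the query runs

try:
    # A second query issued while the first is still running is rejected.
    gate.run(lambda: gate.run(lambda: 1))
    rejected = False
except RuntimeError:
    rejected = True
```

Rejecting early on the client keeps slow queries from piling up on the server. Whether to fail fast, as here, or to wait for a free slot is a design choice that depends on the caller.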

2. Problems encountered by clickhouse in actual production

2.1 Impact of high load on zookeeper

At present, the ReplicatedMergeTree engine in the open source version of ClickHouse relies heavily on ZooKeeper for functions such as leader election among replicas, data synchronization, and fault recovery. ZooKeeper performs poorly under high load and may even cause replica write failures and data synchronization failures. Looking at how ClickHouse uses ZooKeeper, with the replication process as an example, ck's frequent log distribution and data exchange through ZooKeeper is one of the causes of the bottleneck.

General solution:

JD Retail: Self-developed zookeeper alternative based on Raft distributed consensus algorithm.

2.2 Resource management and control issues

ClickHouse's resource management and control capabilities are not yet mature, which may cause execution failures in scenarios with high insert and select concurrency and hurt the user experience. This is because the community version of ClickHouse currently only provides per-user maximum memory control, and kills the running query when the threshold is exceeded.

Analysys performance comparison: https://zhuanlan.zhihu.com/p/54907288

Official website performance comparison: https://clickhouse.com/

Author: Jingdong Technology Li Danfeng

Source: JD Cloud Developer Community. Please indicate the source when reprinting.

Origin my.oschina.net/u/4090830/blog/10149306