With data analysis engines flourishing, why bet heavily on ClickHouse?


In recent years, competition among OLAP products has become increasingly fierce. Alongside the more mature previous-generation data analysis products such as Impala and Greenplum, there is now a new generation of analysis engines, including ClickHouse, Kylin, Druid, Doris, and StarRocks, each with its own strengths in different scenarios. To pick the product that suits their company, users need a comprehensive understanding of every candidate, and that understanding has to keep pace with the latest versions.

ByteDance's Douyin, Toutiao, and other products have grown rapidly, and the data that needs to be analyzed and processed has grown exponentially with them, which places extremely high demands on real-time analysis. When choosing an OLAP engine, ByteDance tried Kylin, Druid, Spark, and others, and researched many other products as well. After repeated trials, and weighing performance, stability, and reusability, ByteDance chose ClickHouse as the main analysis engine behind its extensive business growth analysis work. At present, ByteDance runs more than 18,000 ClickHouse nodes in total, manages more than 700 PB of data, and its largest single cluster has more than 2,400 nodes. This makes ByteDance one of the largest ClickHouse users in China and even worldwide.

ByteDance's OLAP Evolution

At first, the biggest requirement was "fast", so the ByteDance team tried Kylin, whose advantage is millisecond-level query latency. However, Kylin also requires pre-aggregation, needs the data model to be defined in advance, and cannot support interactive analysis. As data volume grew, queries returned results more and more slowly. The team then turned to Spark, but Spark brought its own problems: queries were not fast enough, resource consumption was high, stability was not good enough, and it could not support data covering longer time spans.

After careful consideration, ByteDance decided to evaluate OLAP analysis engines from the following perspectives:

The first is the most basic requirement of any OLAP engine: high availability and strong performance. However much reuse and however many roles are layered onto an OLAP system, its core job is to store enough data, stay stable, and return results quickly. In other words, the engine first has to work well: it must meet the performance needs of interactive analysis over massive data and respond within seconds.

The second is reuse. Beyond that baseline, the team wanted a single technology stack to solve most or even all of its problems. This requires the engine to be customizable, so that developers can build applications for different scenarios on top of the same stack.

The third is ease of use: users should be able to work with the product largely on their own.

In the end, after researching and testing the open source engines on the market at the time, the team chose ClickHouse as its OLAP query engine and began iterating on it.

 

Introduction to ClickHouse

ClickHouse is a columnar database management system (DBMS) for online analytical processing (OLAP). It was open sourced in 2016 and is known for its strong performance. Its key characteristics are columnar storage, a vectorized execution engine, a high compression ratio, and multi-core parallel computing.
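To make the columnar storage and compression points concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample rows and column names are hypothetical, and run-length encoding stands in for real compression codecs. It does not reflect ClickHouse's actual on-disk format.

```python
from itertools import groupby

# Hypothetical sample data, stored row by row as a row-oriented database would.
rows = [
    {"user_id": 1, "country": "CN", "clicks": 3},
    {"user_id": 2, "country": "CN", "clicks": 5},
    {"user_id": 3, "country": "US", "clicks": 2},
    {"user_id": 4, "country": "US", "clicks": 7},
]

# Columnar layout: one contiguous list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column touches only that column's values.
total_clicks = sum(columns["clicks"])


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values of the same column are similar to each other, so they compress well.
encoded_country = run_length_encode(columns["country"])
```

Summing `clicks` never reads `user_id` or `country`, and the four country values shrink to two pairs. Those two effects are the intuition behind the speed and compression claims for column stores.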

1. Strong performance

ClickHouse is often described as the fastest OLAP engine. The original article compared performance on the same server over data on the order of 100 million rows; that comparison chart is not reproduced here.

2. Feature-rich

ClickHouse supports a wide range of statistical analysis scenarios:

  • Supports SQL-like queries;

  • Supports many built-in functions (such as IP conversion and URL parsing, as well as approximate computation such as HyperLogLog);

  • Supports arrays (Array) and nested data structures (Nested Data Structure);

  • Supports geo-replicated database deployment.

3. Fast data import speed

ClickHouse uses a massively parallel processing framework and offers very high-throughput real-time writes, on the order of 50-200 MB per second.

ClickHouse uses a structure similar to an LSM tree: after data is written, it is compacted periodically in the background. All data is written sequentially by appending, and a data segment cannot be changed once it is written. During background compaction, multiple segments are merge-sorted and written back to disk sequentially. Sequential writes make full use of the disk's throughput.
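The write path described above can be sketched in a few lines of Python. This is a toy model, not ClickHouse's storage code: the `PartStore` class and its method names are hypothetical, and everything lives in memory instead of on disk. It shows only the two ideas from the text, immutable sorted segments on insert and a merge sort during compaction.

```python
import heapq


class PartStore:
    """Toy LSM-like store: every insert appends one immutable sorted part."""

    def __init__(self):
        self.parts = []  # each element is a tuple, i.e. an immutable sorted part

    def insert(self, rows):
        # Sort the batch once and append it; existing parts are never modified.
        self.parts.append(tuple(sorted(rows)))

    def merge(self):
        # Background compaction: k-way merge of the sorted parts into one part.
        merged = tuple(heapq.merge(*self.parts))
        self.parts = [merged]

    def scan(self):
        # Until parts are merged, a read has to look at every part.
        return sorted(row for part in self.parts for row in part)


store = PartStore()
store.insert([(3, "c"), (1, "a")])
store.insert([(2, "b")])
parts_before_merge = len(store.parts)
store.merge()
```

Each insert is a single sequential append, and the merge reads every part once in order, which is why this pattern suits disk throughput better than in-place updates.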

4. Good prospects for development

Since it was open sourced in 2016, ClickHouse has grown rapidly on the strength of performance several times that of other leading interactive analytics databases. At present, ClickHouse has 24.2K stars and 1000+ contributors on GitHub.

 

Disadvantages of ClickHouse

No data engine is perfect. Through heavy use, ByteDance also found some shortcomings in ClickHouse:

1. Weak join query capability

ClickHouse's strength is single-table query performance. In scenarios that require flexible queries, its weak multi-table join capability is exposed, and such scenarios are hard to serve.

2. Dependence on ZooKeeper

In ClickHouse, ZooKeeper is mainly used to synchronize data for replicated tables (the ReplicatedMergeTree engine) and for operations on distributed tables (Distributed). However, improper use of ZooKeeper can easily destabilize a ClickHouse cluster.

3. No upsert support

ClickHouse only supports deleting or modifying data in batches, and ReplacingMergeTree relies on background merges to remove duplicates asynchronously.
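Why asynchronous deduplication falls short of a true upsert can be shown with a small Python model. It is a simplified sketch, not ClickHouse code: the function names are hypothetical, rows are assumed to be `(key, version, value)` tuples, and a merge is assumed to keep the row with the highest version for each key.

```python
def insert(parts, rows):
    # Every insert creates a new part; older rows with the same key are untouched.
    parts.append(list(rows))


def merge(parts):
    # A merge collapses all parts and keeps the highest version per key.
    latest = {}
    for part in parts:
        for key, version, value in part:
            if key not in latest or version >= latest[key][0]:
                latest[key] = (version, value)
    merged = [(key, version, value) for key, (version, value) in sorted(latest.items())]
    return [merged]


def select_all(parts):
    # A plain read returns whatever rows currently exist in all parts.
    return [row for part in parts for row in part]


parts = []
insert(parts, [("u1", 1, "old")])
insert(parts, [("u1", 2, "new")])
rows_before_merge = select_all(parts)   # both versions are still visible
parts = merge(parts)
rows_after_merge = select_all(parts)    # only the newest version remains
```

Between the second insert and the merge, a reader sees two rows for `u1`. Since the merge timing is not under the writer's control, queries have to tolerate or filter duplicates themselves.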

4. Complex operation and maintenance

Scaling a ClickHouse cluster out or in is inconvenient because it requires creating new tables and importing the data again. ClickHouse clusters cannot automatically detect topology changes, nor can they rebalance data automatically. When a cluster holds a large amount of data and has many replicated and distributed tables, balancing data at the table level or across clusters incurs high operation and maintenance costs.
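A rough way to see why scaling is expensive is to count how many rows would have to move when a shard is added. The Python sketch below assumes rows are placed by hashing a key modulo the shard count. That placement rule, the `shard_for` and `moved_fraction` helpers, and the sample keys are assumptions for illustration, not a description of any particular ByteDance cluster.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Assumed placement rule: stable hash of the key modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


# Hypothetical keys; growing the cluster from 4 to 5 shards.
keys = [f"user-{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_count=4, new_count=5)
```

Under this rule a key stays put only if its hash gives the same remainder modulo 4 and modulo 5, which holds for 4 of every 20 hash values, so about 80% of the data would need to be rewritten to rebalance. Without automatic rebalancing, that work falls to the operators.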



Origin my.oschina.net/u/5588928/blog/5553446