Data storage comparison between Elasticsearch and ClickHouse | JD Cloud technical team

1 Background

The Jingxingda Technology Department uses a JDQ + Flink + Elasticsearch architecture to build real-time data reports for the community group-buying scenario. As the business grew, Elasticsearch began to show weaknesses: it is poorly suited to large-scale data queries, high-frequency paginated exports have caused downtime, and storage costs are high.

Elasticsearch query statements are also costly to maintain, and aggregation results can be inaccurate. ClickHouse is a columnar database that is naturally suited to OLAP scenarios. Its SQL-like syntax lowers development and learning costs, its fast compression algorithms cut storage costs, and its vectorized execution engine greatly reduces computation time. We therefore ran this comparison to evaluate switching from Elasticsearch to ClickHouse.

2 OLAP

OLAP stands for On-Line Analytical Processing, and ClickHouse is a typical OLAP database management system (DBMS). OLAP workloads mainly run complex analysis and summary operations over data. For example, every day our business system computes summary statistics over all of that day's transportation group orders and calculates the delivery rate for each province; this is OLAP processing. Its counterpart is OLTP, On-Line Transaction Processing. In OLTP scenarios the number of concurrent user operations is large, the system must respond to data operations in real time, and transactions must be supported. MySQL, Oracle, SQL Server, and similar systems are OLTP databases.
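The per-province delivery rate mentioned above is a typical OLAP aggregation. As a rough illustration only, here is a minimal Python sketch of that calculation. The field names `province` and `delivered` and the sample rows are hypothetical and do not come from the production schema.

```python
from collections import defaultdict


def delivery_rate_by_province(orders):
    """Return {province: delivered orders / total orders}."""
    totals = defaultdict(int)
    delivered = defaultdict(int)
    for order in orders:
        totals[order["province"]] += 1
        if order["delivered"]:
            delivered[order["province"]] += 1
    return {province: delivered[province] / totals[province] for province in totals}


# Hypothetical sample rows, for illustration only.
orders = [
    {"province": "Hebei", "delivered": True},
    {"province": "Hebei", "delivered": False},
    {"province": "Shandong", "delivered": True},
]
rates = delivery_rate_by_province(orders)
print(rates)
```

The query scans every order of the day but returns only one small row per province, which is the access pattern the rest of this article is concerned with.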

2.1 Features of OLAP scenarios

  • Wide tables, that is, each table contains a large number of columns
  • For reads, a large number of rows are fetched from the database, but only a small subset of the columns.
  • Relatively few queries (typically hundreds of queries per second or less per server)
  • The query results are significantly smaller than the source data. In other words, the data is filtered or aggregated so the results fit in a single server's RAM
  • Mostly read requests
  • Data is updated in sizable batches (>1000 rows) instead of a single row; or not updated at all.
  • For simple queries, about 50 milliseconds of latency is allowed
  • The data in the columns is relatively small: numbers and short strings (e.g. 60 bytes per URL)
  • Requires high throughput (billions of rows per second per server) when processing a single query
  • Transactions are not necessary
  • Low requirements for data consistency

3 Features

3.1 Elasticsearch

  • Search: Built on an inverted index, every field can be indexed and searched, giving second-level, near-real-time responses over massive data. Elasticsearch is an open-source search engine based on Lucene that provides full-text search, highlighting, search suggestions, and similar capabilities. Examples of this kind of search include Baidu search, Taobao product search, and log search.
  • Data analysis: Elasticsearch provides many data analysis APIs and rich aggregation capabilities that support analysis and processing over massive data. Examples include order volume statistics, or crawling one product's data from different e-commerce sites and analyzing it with Elasticsearch (historical prices, purchasing power on each platform, and so on).

3.2 ClickHouse

  • Columnar storage
  • Compression: data is compressed with the LZ4 and ZSTD algorithms; the high compression ratio reduces data size, disk I/O, and CPU usage.
  • Index: data is sorted by primary key, so ClickHouse can find specific values or a range of values in a large data set within tens of milliseconds.
  • Multi-core parallel processing: ClickHouse uses all available resources on the server to complete a single query.
  • SQL support: a declarative query language based on SQL that in many cases is identical to the ANSI SQL standard. It supports GROUP BY, ORDER BY, FROM, JOIN, IN, non-correlated subqueries, and more.
  • Vector engine: to use the CPU efficiently, data is not only stored by column but also processed in vectors (chunks of a column).
  • Real-time data updates: data is always stored incrementally and in sorted order in MergeTree tables. Data can be written to a table continuously and efficiently without locking, at a write rate of 50-200 MB/s.
  • Suitable for online query: fast response and extremely low latency
  • Rich aggregate calculation functions
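The index bullet above relies on data being sorted by primary key. A simplified way to picture this is a sparse index that stores one key per block of rows (a "granule") and uses binary search to decide which blocks a range query has to read. The sketch below is only a conceptual model under that assumption, not ClickHouse's actual implementation. The function names and the granule size of 10 are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_keys, granule_size):
    """Keep only the first key of each granule (one mark per granule)."""
    return sorted_keys[::granule_size]


def granules_for_range(marks, low, high):
    """Return indexes of granules that may contain keys in [low, high]."""
    if not marks or low > high:
        return []
    # Step back one granule: with duplicate keys, values equal to `low`
    # can start in the granule before the first mark that is >= low.
    first = max(bisect_left(marks, low) - 1, 0)
    # The last candidate is the last granule whose first key is <= high.
    last = bisect_right(marks, high) - 1
    return list(range(first, last + 1))


keys = list(range(100))                 # already sorted by primary key
marks = build_sparse_index(keys, 10)    # [0, 10, 20, ..., 90]
granules = granules_for_range(marks, 25, 47)
print(granules)
```

Only the selected granules need to be read from disk; the rest of the table is skipped without being scanned.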

4 Our business scenarios

  1. Large, wide tables: queries read a large number of rows but only a few columns to compute aggregated metrics, and the result set is relatively small. The tables are all wide tables produced by Flink, with many columns. When querying or analyzing data, a few columns are typically chosen as dimension columns and a few others as metric columns, and aggregations run over the whole table or over a range of it. This scans a large number of rows but uses only a few columns.
  2. A large number of paginated list queries and exports.
  3. Flink appends data in large batches and never updates it.
  4. Some metric calculations require a full table scan for aggregation.
  5. Full-text search is rarely used.
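Scenario 1 is where column-oriented storage pays off. The following Python sketch models a wide table stored column by column and aggregates one metric column by one dimension column. The column names and values are hypothetical and only stand in for a real wide table.

```python
from collections import defaultdict

# Hypothetical wide table stored column by column: one list per column.
columns = {
    "province": ["Hebei", "Hebei", "Shandong", "Shandong"],
    "sku_qty": [3, 5, 2, 4],
    "warehouse": ["W1", "W2", "W1", "W3"],
    "remark": ["", "fragile", "", ""],
    # A real wide table would have many more columns.
}


def sum_by_dimension(table, dimension, metric):
    """Group by one dimension column and sum one metric column."""
    totals = defaultdict(int)
    # Only two columns are read, however many columns the table has.
    for key, value in zip(table[dimension], table[metric]):
        totals[key] += value
    return dict(totals)


totals = sum_by_dimension(columns, "province", "sku_qty")
print(totals)
```

In a row-oriented layout the same query would have to load every column of every row, even though only two of them contribute to the result.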

Conclusion: Data reports and data analysis are typical OLAP scenarios. For our business, the columnar database ClickHouse has more advantages than Elasticsearch. Elasticsearch is stronger at full-text search, but we have few full-text search scenarios.

5 Cost

  • Learning cost: ClickHouse's SQL syntax is simpler than Elasticsearch's DSL. Almost all back-end developers have MySQL experience, so the learning cost is lower.
  • Development, testing, and maintenance costs: ClickHouse uses SQL, so the development model resembles MySQL's and unit tests are easier to write. Elasticsearch queries are assembled through the Java API, which is complex and hard to read and maintain.
  • O&M cost: Unknown for us. Public reports indicate that ClickHouse costs less to operate than Elasticsearch in log scenarios.
  • Server cost:
    • ClickHouse's data compression ratio is higher than Elasticsearch's. For the same business data, Elasticsearch occupies 3-10 times the disk space of ClickHouse, 6 times on average. See Figure 1
    • ClickHouse uses less CPU and memory than Elasticsearch
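To make the readability point concrete, here is the same aggregation written once as SQL and once as an Elasticsearch query DSL body. This is an illustrative sketch: the table name `wms_order_sku` appears later in this article, but the columns `province`, `sku_qty`, and `order_date` and the date value are hypothetical. The DSL is shown as a Python dictionary rather than through the Java API the team uses.

```python
import json

# Hypothetical column names; only the table name comes from the article.
clickhouse_sql = """
SELECT province, sum(sku_qty) AS total_qty
FROM wms_order_sku
WHERE order_date = '2023-01-01'
GROUP BY province
"""

# Equivalent request body in Elasticsearch query DSL.
elasticsearch_dsl = {
    "size": 0,  # return aggregation buckets only, no document hits
    "query": {
        "range": {"order_date": {"gte": "2023-01-01", "lte": "2023-01-01"}}
    },
    "aggs": {
        "by_province": {
            "terms": {"field": "province"},
            "aggs": {"total_qty": {"sum": {"field": "sku_qty"}}},
        }
    },
}

print(clickhouse_sql.strip())
print(json.dumps(elasticsearch_dsl, indent=2))
```

The SQL version is four lines that most back-end developers can read at a glance, while the DSL nests the grouping and the metric two levels deep, and the nesting grows with every additional grouping field.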

Conclusion: For the same amount of data, Elasticsearch uses 3-10 times the storage space of ClickHouse, 6 times on average. Taking learning, development, testing, and maintenance together, ClickHouse is friendlier than Elasticsearch.

6 Tests

6.1 Server configuration

All of the following tests are based on the configuration in the figure below.

6.2 Write pressure test

The following results are based on the wms_order_sku table. They were obtained by double-writing more than 10 million rows to Elasticsearch and ClickHouse through Flink while the business was running stably.

  • CPU usage: Elasticsearch has high CPU usage, while ClickHouse uses very little CPU. See Figure 2

  • Memory usage: Elasticsearch memory usage keeps growing with frequent GC, while ClickHouse memory usage is relatively low and stable. See Figure 3

  • Write throughput: Single-node ClickHouse write speed is about 50-200 MB/s. If each row is 1 KB, that is 50,000-200,000 rows per second. Figure 4 (write throughput) is a comparison of Elasticsearch and ClickHouse write performance taken from public sources; with the same data sample, ClickHouse write performance is 5 times that of Elasticsearch. Because our current Flink tasks double-write to both stores and the two may affect each other, we will add our own stress test results later.
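The rows-per-second figure follows directly from the throughput and the row size. A small sketch of the arithmetic, assuming 1 MB = 1000 KB so that it matches the round numbers quoted above:

```python
def rows_per_second(throughput_mb_per_s, row_size_kb):
    """Convert write throughput in MB/s to rows/s (1 MB = 1000 KB)."""
    return throughput_mb_per_s * 1000 / row_size_kb


low = rows_per_second(50, 1)    # lower bound of the quoted range
high = rows_per_second(200, 1)  # upper bound of the quoted range
print(low, high)
```

With wider rows the row rate drops proportionally; for example, 2 KB rows would halve both bounds.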

Conclusion: Elasticsearch consumes more memory and CPU than ClickHouse during batch writes: 5.3 times the memory and 27.5 times the CPU. ClickHouse's write throughput is 5 times that of Elasticsearch.

6.3 Query performance (single concurrency test)

The scenarios below occur frequently in our data reports and data analysis, so the query performance test is based on them.

Data Comparison

  • ClickHouse query performance changes little when the cluster configuration is cut by nearly half: CH2 (48C 182GB) is on average 14% slower than CH1 (80C 320GB). See Figure 5

  • Elasticsearch query performance is strongly affected when the cluster configuration is cut by nearly half: ES2 (46C 320GB) is on average 40% slower than ES1 (78C 576GB). See Figure 6

  • Comparing CH2 (48C 182GB) with ES2 (46C 320GB): the CPU core counts are similar and ES2 has 1.75 times the memory of CH2, yet CH2 responds 12.7 times as fast as ES2. See Figure 7

Conclusion: Elasticsearch is slower than ClickHouse when querying data; with similar configurations, ClickHouse responds 12.7 times as fast as Elasticsearch. For time-based multi-field aggregation queries in particular, ClickHouse is 32 times faster than Elasticsearch. ClickHouse query response time is also less affected by the size of the cluster configuration.

6.4 Query pressure test (high concurrency test, data comes from the Internet)

Because high-concurrency tests are more complicated and time-consuming to prepare, we will run query stress tests on our own business data and scenarios later. The data below comes from a publicly available test in a user-profile scenario (data volume: 262933269), which is very similar to our scenario.

Conclusion: ClickHouse handles high concurrency poorly; the officially recommended maximum is 100 QPS. Under high concurrency its throughput is not as good as Elasticsearch's.

7 Summary

The advantages and disadvantages of ClickHouse compared with Elasticsearch:

Advantages:

  • Hardware cost is lower: in the same scenario, ClickHouse uses fewer resources.
  • Labor cost is lower: ClickHouse is friendlier to newcomers, who can more easily get started with learning, development, unit testing, and testing.
  • In OLAP scenarios ClickHouse is a better fit than Elasticsearch: aggregation is more accurate and faster, and it saves server computing resources.
  • Write performance is higher, 5 times that of Elasticsearch under the same conditions, and fewer server resources are consumed during writes.
  • Elasticsearch performs frequent GC during heavy exports, which in severe cases causes downtime; it is less stable than ClickHouse.
  • Average query performance is 12.7 times that of Elasticsearch, and ClickHouse query performance is less affected by server configuration.
  • ClickHouse delivers better performance for the same monthly server spend.

Disadvantages:

  • ClickHouse is weaker than Elasticsearch at full-text search and at high-concurrency queries.

Author: JD Logistics Ma Hongyan

Content source: JD Cloud developer community

Origin my.oschina.net/u/4090830/blog/8900125