Learn ClickHouse Tuning in One Article

Foreword

This article belongs to the column "Big Data Technology System". The column is the author's original work; please credit the source when citing it. Please point out any deficiencies or mistakes in the comments section. Thank you!

For the column's table of contents and references, see Big Data Technology System.


1. Configuration optimization

The table below lists the general tuning settings for ClickHouse. Most of them should be adjusted to the actual production environment; for the full set of tunable settings, refer to the official documentation:

https://clickhouse.tech/docs/en/operations/settings/query-complexity

https://clickhouse.tech/docs/en/operations/settings/

| Configuration name | Recommended value | Description |
| --- | --- | --- |
| max_server_memory_usage_to_ram_ratio | 0.9 (90% of physical RAM) | Fraction of the machine's physical memory the server may use |
| max_memory_usage | Adjust based on per-query memory usage and expected concurrency | Maximum amount of memory a single query may use |
| background_pool_size | Twice the number of CPU cores | Number of threads for background merge operations |
| max_parts_in_total | 1000000 | Maximum total number of active parts across all partitions of a table |
| parts_to_delay_insert | 3000 | Once a single partition has more active parts than this, new inserts are artificially delayed (exceeding parts_to_throw_insert raises an exception) |
| old_parts_lifetime | 0 deletes old parts immediately; adjust to business needs | How long inactive parts are kept after a background merge or data expiration |
| max_concurrent_queries | Adjust to machine resources | Maximum number of concurrently executed queries |
| max_bytes_before_external_group_by | Recommended to enable; set to half of max_memory_usage | Threshold at which GROUP BY spills intermediate data to disk |
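
A minimal sketch of applying and verifying the user-level settings from the table (max_memory_usage, max_bytes_before_external_group_by) for the current session is shown below; the values are placeholders, not universal recommendations. The server-level settings (max_server_memory_usage_to_ram_ratio and similar) are configured in the server configuration files rather than with SET.

    -- Placeholder values; size them to your own workload.
    SET max_memory_usage = 10000000000;                  -- ~10 GB per query
    SET max_bytes_before_external_group_by = 5000000000; -- half of max_memory_usage

    -- Verify the effective values for the current session:
    SELECT name, value
    FROM system.settings
    WHERE name IN ('max_memory_usage', 'max_bytes_before_external_group_by');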

2. Query optimization

When writing queries, users can refer to the following points to optimize their SQL (a sketch illustrating several of them follows the list):

  1. Use the EXPLAIN command to view the execution plan and confirm that it is reasonable.
  2. Filter the data first (to reduce I/O), then perform operations such as JOIN.
  3. In a JOIN, put the large table on the left and the small table on the right; ClickHouse loads the right-hand table into memory.
  4. Prefer querying large, wide tables over performing multiple joins.
  5. When the business allows it, use approximate functions instead of exact ones, for example uniq instead of count(DISTINCT ...).
  6. When two distributed tables are joined frequently, shard both by the same key before writing data so that matching rows land on the same node.
  7. When a subquery reads a distributed table, the GLOBAL keyword (GLOBAL IN / GLOBAL JOIN) is required to avoid re-executing the subquery on every shard.
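
A minimal sketch of several of the tips above, using two hypothetical distributed tables, orders_dist and users_dist; the table and column names are assumptions for illustration only.

    -- Tip 1: inspect the execution plan before running an expensive query.
    EXPLAIN
    SELECT user_id, count() FROM orders_dist GROUP BY user_id;

    -- Tip 5: the approximate uniq() is much cheaper than an exact count.
    SELECT uniq(user_id) FROM orders_dist;            -- approximate
    SELECT count(DISTINCT user_id) FROM orders_dist;  -- exact, uses more memory

    -- Tips 2, 3, and 7: filter early, keep the large table on the left,
    -- and use GLOBAL when the right-hand side is a distributed table.
    SELECT o.user_id, u.name
    FROM orders_dist AS o
    GLOBAL JOIN users_dist AS u ON o.user_id = u.user_id
    WHERE o.event_date = today();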

3. Table related optimization

When creating tables, users can refer to the following points (a sketch applying them follows the list):

  1. Avoid String-typed fields where possible; prefer numeric and date/time types.
  2. Use default values instead of Nullable columns.
  3. Partition fact tables whenever they can reasonably be partitioned.
  4. Data skipping (secondary) indexes can be used.
  5. Configure a TTL where the business permits it, so that unneeded data is deleted automatically.
  6. Write in batches of at least 1,000 rows per INSERT to reduce the pressure of background merges.
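
A minimal sketch applying the points above to a hypothetical events table; the column names, index type, and TTL interval are assumptions for illustration only.

    CREATE TABLE events
    (
        event_date Date,
        user_id    UInt64,
        event_type LowCardinality(String),  -- point 1: avoid a plain String column
        payload    String DEFAULT '',       -- point 2: default value instead of Nullable
        INDEX idx_user user_id TYPE minmax GRANULARITY 4  -- point 4: data skipping index
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)       -- point 3: partition the fact table
    ORDER BY (event_date, user_id)
    TTL event_date + INTERVAL 90 DAY;       -- point 5: expire data the business no longer needs

    -- Point 6: batch inserts of at least 1,000 rows keep the number of new
    -- parts, and therefore the background merge load, low.
    INSERT INTO events (event_date, user_id, event_type, payload)
    VALUES ('2022-12-01', 1, 'click', '');  -- in practice, send thousands of rows per INSERT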


Origin: https://blog.csdn.net/Shockang/article/details/128179495