Foreword
This article is part of the column "Big Data Technology System". The column is the author's original work; please credit the source when citing it. If you find errors or omissions, please point them out in the comments, thank you!
For the column's table of contents and references, see Big Data Technology System.
1. Configuration optimization
The table below lists general optimization settings for ClickHouse. Most of them should be tuned to the actual production environment; for the specific settings, refer to the official documentation:
https://clickhouse.tech/docs/en/operations/settings/query-complexity
https://clickhouse.tech/docs/en/operations/settings/
| Configuration | Recommended value | Description |
|---|---|---|
| max_server_memory_usage_to_ram_ratio | 0.9 (90% of physical memory) | Fraction of the host's physical memory the server may use |
| max_memory_usage | Tune based on per-query memory usage and concurrency | Maximum amount of memory a single query may use |
| background_pool_size | Twice the number of CPU cores | Number of threads for background merge operations |
| max_parts_in_total | 1000000 | Maximum number of active parts across all partitions of a table |
| parts_to_delay_insert | 3000 | Once the number of active parts in a single partition exceeds this value, new inserts are slowed down (and rejected once parts_to_throw_insert is exceeded) |
| old_parts_lifetime | 0 deletes old parts immediately; adjust to business needs | How long inactive parts are kept after a background merge or data expiration |
| max_concurrent_queries | Adjust according to machine resources | Maximum number of concurrently executed queries |
| max_bytes_before_external_group_by | Recommended to enable; set to half of max_memory_usage | Allows GROUP BY to spill intermediate data to disk |
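The query-level settings above can also be applied per session with SET, without editing configuration files. A minimal sketch; the specific values here are illustrative assumptions, not universal recommendations:

```sql
-- Per-session limits; tune the values to your workload.
SET max_memory_usage = 10000000000;                   -- illustrative: ~10 GB per query
SET max_bytes_before_external_group_by = 5000000000;  -- half of max_memory_usage:
                                                      -- lets GROUP BY spill to disk

-- Server-level settings (e.g. max_server_memory_usage_to_ram_ratio,
-- background_pool_size) are set in the server configuration file instead.
```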
2. Query optimization
When querying data, users can refer to the following points to optimize their SQL:
- Use the EXPLAIN statement to view the execution plan and confirm that it is reasonable.
- Filter the data first (to reduce I/O), then perform operations such as joins.
- In a JOIN, put the large table on the left and the small table on the right.
- Prefer querying wide, denormalized tables; avoid chaining multiple joins.
- When the business allows, use approximate functions instead of exact ones, for example uniq instead of count(DISTINCT ...).
- When joining two distributed tables, shard both by the same key at write time so that matching rows land on the same node.
- When a subquery reads a distributed table, the GLOBAL keyword (GLOBAL IN / GLOBAL JOIN) is required.
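A few of the points above can be sketched in ClickHouse SQL. The table and column names (hits, visits, user_id, etc.) are hypothetical:

```sql
-- Inspect the execution plan before running an expensive query.
EXPLAIN SELECT count() FROM hits WHERE event_date = today();

-- Approximate distinct count: uniq is much cheaper than count(DISTINCT ...).
SELECT uniq(user_id) FROM hits;

-- Filter early, then join; large table on the left, small table on the right.
SELECT h.user_id, v.duration
FROM hits AS h
INNER JOIN visits AS v ON h.user_id = v.user_id
WHERE h.event_date = today();

-- When the subquery reads a distributed table, use GLOBAL so the subquery
-- runs once and its result is sent to all shards.
SELECT count()
FROM hits_distributed
WHERE user_id GLOBAL IN
    (SELECT user_id FROM visits_distributed WHERE duration > 3600);
```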
3. Table related optimization
When creating tables, users can refer to the following points:
- Avoid String-typed fields where possible.
- Use default values instead of Nullable columns.
- Partition fact tables whenever they can be partitioned.
- Secondary (data-skipping) indexes can be used.
- Configure a TTL where the business allows, to delete data that is no longer needed.
- Write in batches of at least 1,000 rows per insert to reduce background merge pressure.
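The points above can be combined into one table definition. A sketch of a hypothetical events table (the table name, columns, partition key, and TTL period are all illustrative assumptions):

```sql
CREATE TABLE events
(
    event_date  Date,
    user_id     UInt64,
    event_type  LowCardinality(String),   -- avoid plain String where possible
    url         String DEFAULT '',        -- default value instead of Nullable
    duration_ms UInt32 DEFAULT 0,
    -- Secondary (data-skipping) index on a low-cardinality column.
    INDEX idx_event_type event_type TYPE set(100) GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)         -- partition the fact table by month
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 90 DAY;         -- drop data no longer needed

-- Insert in batches of at least ~1,000 rows to reduce background merge pressure.
```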