A ten-thousand-word deep dive: the exploration and practice of ClickHouse for real-time data at Jingxingda | JD Cloud technical team

1 Introduction

The Jingxingda Technology Department adopted a JDQ + Flink + Elasticsearch architecture to build real-time data reports for the community group-buying scenario. As the business grew, Elasticsearch began to expose some weaknesses: it is not well suited to large-scale data queries, high-frequency deep-paging exports caused ES downtime, it cannot do exact deduplicated counts, and performance drops significantly when aggregating over many fields. ClickHouse was therefore introduced to address these drawbacks.

On the write path, business data (binlog) is processed and converted into MQ messages in a fixed format. Flink subscribes to different topics to receive table data from the various production systems, performs joins, calculations, filtering, enrichment with basic data and so on, and finally writes the processed DataStream into both ES and ClickHouse. The query service is exposed through JSF and the logistics gateway for display. Since ClickHouse throws all of its computing power at a single query, it is not good at high-concurrency queries, so we add caches to some real-time aggregation interfaces, or compute indicators from ClickHouse in scheduled tasks and store the results in ES. Some indicators are then no longer queried from ClickHouse in real time but read as pre-computed values from ES, which withstands concurrency, greatly improves development efficiency, is easy to maintain, and keeps the indicator definitions consistent.

We ran into all kinds of difficulties while introducing ClickHouse and spent a lot of energy exploring and solving them one by one. I record them here in the hope of giving some direction to readers who have not worked with ClickHouse yet, so they can avoid detours. If there are mistakes in the text, please point them out, and everyone is welcome to discuss ClickHouse-related topics. The article is long but full of substance; please allow 40~60 minutes for reading.

2 Problems encountered

As mentioned above, we have encountered many difficulties. The following problems are the focus of this article.

  • What table engine should we use
  • How Flink writes to ClickHouse
  • Why is it 1~2 minutes slower to query ClickHouse than to query ES?
  • Whether to write to a distributed table or a local table
  • Why is the CPU usage of only one shard high?
  • How to locate which SQL statements are consuming CPU: with so many slow SQLs, how do we know which one is the culprit
  • Once a slow SQL is found, how to optimize it
  • How to resist high concurrency and ensure the availability of ClickHouse

3 Table engine selection and query scheme

Before choosing a table engine and query scheme, let's clarify the requirements. As mentioned in the introduction, we build wide tables in Flink, the business involves data updates, and the same business document number is written to the database multiple times. ES's upsert supports this kind of overwrite of previously written data, but ClickHouse has no upsert, so we had to find a scheme that can emulate it. With this requirement in mind, let's look at ClickHouse's table engines and query schemes.

ClickHouse has many table engines, and the table engine determines how data is stored, how it is loaded, and what characteristics the data table has. At present, the ClickHouse table engine is divided into four series, namely Log, MergeTree, Integration, and Special.

  • Log series: suitable for scenarios with a small amount of data (less than one million rows), and does not support indexes, so it is not efficient for range queries.
  • Integration series: mainly used to import external data to ClickHouse, or directly operate external data in ClickHouse, supporting Kafka, HDFS, JDBC, Mysql, etc.
  • Special series: For example, Memory stores data in the internal memory, and the data will be lost after restarting. The query performance is excellent, and File directly uses local files as data storage, etc. Most of them are customized for specific scenarios.
  • MergeTree series: the MergeTree family has a variety of engine variants. As the most basic engine in the family, MergeTree provides capabilities such as primary key indexing, data partitioning, data replication, and data sampling, and supports writing extremely large amounts of data. The other engines in the family are built on top of MergeTree and each has its own strengths.

Log, Special, and Integration are mainly for special purposes, and their scenarios are relatively limited. MergeTree and its family are where ClickHouse's performance characteristics shine; they are also the officially recommended storage engines, support almost all of ClickHouse's core features, and are what most production scenarios use. Our business is no exception: it needs primary key indexes and grows by more than 25 million rows per day, so the MergeTree series is the target we need to explore.

The MergeTree family of table engines is designed for inserting large amounts of data. Data is written quickly, batch by batch, in the form of data parts, and to keep the number of parts from growing unboundedly ClickHouse merges them in the background according to certain rules into new, larger parts. Compared with modifying data already stored on disk at insert time, this insert-then-merge strategy is much more efficient. This repeated merging of data parts is also where the name of the MergeTree series (merge tree family) comes from. To avoid creating too many parts, writes should be batched. The MergeTree series includes the MergeTree, ReplacingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, SummingMergeTree, and AggregatingMergeTree engines, introduced below.

3.1 MergeTree: merge trees

MergeTree supports all ClickHouse SQL syntax. Most of it looks similar to the MySQL we are familiar with, but some features behave quite differently, for example the primary key. The primary key of the MergeTree series is not used for deduplication: in MySQL a table cannot contain two rows with the same primary key, but ClickHouse allows it.

The table creation statement below defines the order number, product quantity, creation time, and update time. The data is partitioned by creation time, orderNo is used as the primary key (PRIMARY KEY), and orderNo is also the sort key (ORDER BY). By default the primary key is the same as the sort key, and in most cases there is no need to specify the primary key separately; it is spelled out in this example only to illustrate the relationship between the two. The sort key may contain more fields than the primary key, but the primary key must be a prefix of the sort key: for example, if the primary key is (a, b), the sort key must be (a, b, ...), i.e. the primary key fields must be the leftmost fields of the sort key.

CREATE TABLE test_MergeTree (
  orderNo String,
  number Int16,
  createTime DateTime,
  updateTime DateTime
) ENGINE = MergeTree()
PARTITION BY createTime
ORDER BY (orderNo)
PRIMARY KEY (orderNo);

insert into test_MergeTree values('1', '20', '2021-01-01 00:00:00', '2021-01-01 00:00:00');
insert into test_MergeTree values('1', '30', '2021-01-01 00:00:00', '2021-01-01 01:00:00');

Note that the primary key orderNo of both inserted rows is 1. The scenario is that we first create an order and then update its product quantity to 30 together with the update time. At this point the business actually has one order, with a product quantity of 30.

Inserting rows with the same primary key does not cause a conflict, and the query returns both rows with the same primary key. The figure below shows the query result: each insert forms a part, so the first insert produces data part 1609430400_1_1_0 and the second produces 1609430400_2_2_0. The background merge has not been triggered yet, so clickhouse-client displays the result as two separate blocks (graphical tools such as DBeaver and DataGrip do not show this; you can build a ClickHouse environment with docker and run the statements from the client to see it, and there is a CK environment document at the end of the article).

The expected result is that number is updated from 20 to 30 and updateTime is updated accordingly, leaving only one row per business primary key, but in the end two rows are kept. This behavior of ClickHouse would make our query results wrong, for example when counting the number of orders with a deduplicated count(orderNo) or summing the quantity ordered with sum(number).

Let's try to merge two rows of data.
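A forced merge can be triggered manually with OPTIMIZE; a minimal sketch of the statements used here (the result screenshots are omitted, and see the caveat about optimize in section 3.2 before using it in production):

-- force the parts of the table to be merged
optimize table test_MergeTree final;
-- query again to observe the merged part
select * from test_MergeTree;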

After the forced merge there are still two rows, not the single row with quantity 30 that we hoped to keep, but the two rows are now returned as one block. The reason is that 1609430400_1_1_0 and 1609430400_2_2_0 have the same partitionID and are merged into one part, 1609430400_1_2_1; after the merge completes, 1609430400_1_1_0 and 1609430400_2_2_0 are deleted in the background after a certain period (8 minutes by default). The figure below shows the naming rule of a part: partitionID 1609430400 corresponds to 2021-01-01 00:00:00; MinBlockNum and MaxBlockNum are the smallest and largest data block numbers, an auto-incrementing integer; Level can be understood as the number of times the partition has been merged, starting at 0 and increasing by 1 for each new part produced by a merge.

From the above we can see that although MergeTree has a primary key, it is not used to keep rows unique the way MySQL does; it only accelerates queries. Even after a manual merge, rows with the same primary key still exist, so business documents cannot be deduplicated, count(orderNo) and sum(number) give wrong results, and this engine does not meet our needs.

3.2 ReplacingMergeTree: Replace Merge Tree

Although MergeTree has a primary key, it cannot deduplicate rows with the same primary key, and our business scenario cannot tolerate duplicate data. ClickHouse provides the ReplacingMergeTree engine for deduplication; it can delete duplicate data when partitions are merged. As I understand it, deduplication has two aspects: physical deduplication, where duplicate rows are actually deleted, and query-time deduplication, where the physical data is left untouched but duplicates are filtered out of the query result.

The example is as follows. Creating a ReplacingMergeTree table is not very different from MergeTree; the ENGINE simply changes from MergeTree to ReplacingMergeTree([ver]), where ver is an optional version column. The official documentation lists the supported types as UInt*, Date or DateTime, but an Int type also worked in my experiments (ClickHouse 20.8.11). ReplacingMergeTree physically deduplicates data during merges, with the following strategy.

  • If the ver version column is not specified, the last inserted row is kept among the rows with the same primary key.
  • If the ver version column is specified (the example below uses the version field as the version column), deduplication keeps the row with the largest version value, regardless of insertion order.


CREATE TABLE test_ReplacingMergeTree (
  orderNo String,
  version Int16,
  number Int16,
  createTime DateTime,
  updateTime DateTime
) ENGINE = ReplacingMergeTree(version)
PARTITION BY createTime
ORDER BY (orderNo)
PRIMARY KEY (orderNo);

1) insert into test_ReplacingMergeTree values('1', 1, '20', '2021-01-01 00:00:00', '2021-01-01 00:00:00');
2) insert into test_ReplacingMergeTree values('1', 2, '30', '2021-01-01 00:00:00', '2021-01-01 01:00:00');
3) insert into test_ReplacingMergeTree values('1', 3, '30', '2021-01-02 00:00:00', '2021-01-01 01:00:00');

-- deduplicate with final
select * from test_ReplacingMergeTree final;
-- deduplicate with argMax
select argMax(orderNo,version) as orderNo, argMax(number,version) as number, argMax(createTime,version), argMax(updateTime,version) from test_ReplacingMergeTree;

The figure below shows the results of three queries after the execution of the first two insert statements. None of the three query methods have any impact on the physically stored data. The final and argMax methods only deduplicate the query results.

  • Ordinary query: query results are not deduplicated, physical data is not deduplicated (partition files are not merged)
  • final deduplication query: the query result has been deduplicated, but the physical data has not been deduplicated (partition files have not been merged)
  • argMax deduplication query: the query result has been deduplicated, but the physical data has not been deduplicated (partition files have not been merged)

Both the final and argMax query methods filter out duplicate data. Our examples all operate on a local table, so final and argMax give the same result. However, if the experiment is run on a distributed table and the two rows land on different data shards (note: shards, not data partitions), the results differ: final does not deduplicate, because final only deduplicates within a local table and cannot deduplicate across shards, whereas argMax still returns deduplicated results. argMax picks the latest row by comparing the version passed as its second argument, and because it pulls the data from every shard into memory on one shard for the comparison, it supports deduplication across shards.

Because background merges happen at unpredictable times, we execute the merge command explicitly; an ordinary query afterwards returns deduplicated data, and the row with version=2, number=30 is the one we want to keep.

Now execute the third insert statement. Its primary key is the same as the first two, but the partition field createTime differs: the first two use 2021-01-01 00:00:00 and the third uses 2021-01-02 00:00:00. Based on the understanding above, the row with version = 3 should be kept after a forced merge. Yet after the merge, an ordinary query shows that the rows with version 1 and 2 were merged and deduplicated, keeping version 2, while the row with version 3 still exists. The reason is that ReplacingMergeTree removes duplicates per data partition: the first two inserts share the same createTime and partitionID, so they are merged into partition file 1609430400_1_2_1, while the third insert belongs to a different partition, cannot be merged with them, and therefore cannot be physically deduplicated. A final query, however, still returns deduplicated results, and argMax behaves the same (not shown).

ReplacingMergeTree has the following characteristics

  • Use the primary key as the unique key for judging duplicate data, and support inserting data with the same primary key.
  • The logic to delete duplicate data will be triggered when merging partitions. But the timing of merging is uncertain, so there may be duplicate data when querying, but it will eventually be deduplicated. You can call optimize manually, but it will cause a lot of reading and writing of data, so it is not recommended for production use.
  • Duplicate data is deleted in units of data partitions. When partitions are merged, duplicate data in the same partition will be deleted, and duplicate data in different partitions will not be deleted.
  • You can use the final and argMax methods to deduplicate queries. In this way, you can get correct query results regardless of whether data has been merged or not.

Recommended ways to use ReplacingMergeTree

  • Ordinary select query: suitable for offline queries with low timeliness requirements, relying on ClickHouse's automatic merges; you must ensure that the same business document falls into the same data partition, and for a distributed table also into the same shard. This is the most efficient and most resource-friendly query method.
  • final query: usable for real-time queries. final deduplicates per shard, so the same primary key must fall into the same shard, but it does not need to fall into the same data partition. Compared with an ordinary select it costs some extra performance, but if the where conditions hit the primary key index, secondary indexes, and partition fields well, the efficiency is perfectly acceptable.
  • argMax query: usable for real-time queries and has the weakest requirements, since it can deduplicate in any situation. However, because of how it is implemented its efficiency is much lower and it consumes a lot of resources, so it is not recommended; section 9.4.3 compares it with final using load test data.

Among the above three usage schemes, ReplacingMergeTree with final mode query is in line with our needs.

3.3 CollapsingMergeTree/VersionedCollapsingMergeTree: Collapsing Merge Tree

The collapsing merge trees are not illustrated with examples here; you can refer to the examples on the official website.

CollapsingMergeTree records the state of the data row by defining a sign bit field. If the sign bit is 1 ("status" line), it means that this is a valid data line, and if the sign bit is -1 ("cancel" line), it means that this line of data needs to be deleted. It should be noted that only data with the same primary key may be folded.

  • If sign=1 has at least one row more than sign=-1, keep the last row with sign=1.
  • If sign=-1 is at least one more row than sign=1, keep the first row with sign=-1.
  • If sign=1 has as many rows as sign=-1, and the last row is sign=1, keep the data of the first row of sign=-1 and the last row of sign=1.
  • If sign=1 is as many lines as sign=-1 and the last line is sign=-1, keep nothing.
  • In other cases, ClickHouse will not report an error but will print an alarm log. In this case, the result of the query is uncertain and unpredictable.

Pay attention when using CollapsingMergeTree

1) As with ReplacingMergeTree, collapsing is not triggered in real time but only when partitions are merged, so duplicate rows can still be queried before the merge. There are two workarounds:

  • Use optimize to force merging, and it is also not recommended to use force merging that is extremely inefficient and consumes resources in a production environment.
  • Rewrite the query: use group by together with the sign column to fold rows at query time. This increases the coding cost of every query (a sketch is given at the end of this subsection).

2) On the write side, the program must remember the previously written "state" row so that it can emit the corresponding "cancel" row when data is deleted or modified, which greatly increases storage cost and programming complexity. When Flink re-runs data after a release or in other situations, the rows remembered by the program are lost, which can leave sign=1 and sign=-1 rows unmatched and unable to collapse. That is unacceptable for us.

CollapsingMergeTree has a further drawback: it is strict about write order. If rows are written in the normal order, sign=1 first and then sign=-1, they collapse correctly; if the order is reversed they cannot be collapsed. ClickHouse provides VersionedCollapsingMergeTree, which solves the ordering problem with a version number, but its other characteristics are exactly the same as CollapsingMergeTree, so it also does not meet our needs.
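For reference, the group by rewrite mentioned above looks roughly like the sketch below, written against a hypothetical CollapsingMergeTree table test_Collapsing(orderNo, number, sign) rather than our production schema:

-- fold rows at query time: weight the measures by sign and drop keys whose rows have fully cancelled out
select
    orderNo,
    sum(number * sign) as number
from test_Collapsing
group by orderNo
having sum(sign) > 0;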

3.4 Table Engine Summary

We have now looked in detail at four table engines of the MergeTree series: MergeTree, ReplacingMergeTree, CollapsingMergeTree, and VersionedCollapsingMergeTree. SummingMergeTree and AggregatingMergeTree were not covered. SummingMergeTree is designed for cases where you only care about aggregated data, not the detail rows. MergeTree can serve that need as well, using group by with sum and count, although aggregating on every query adds considerable overhead. Since we need both detail rows and summary indicators, SummingMergeTree does not fit. AggregatingMergeTree is an upgraded SummingMergeTree, essentially the same idea; the difference is that SummingMergeTree sums the non-primary-key columns while AggregatingMergeTree can apply arbitrary aggregate functions. It does not meet our needs either.

In the end we chose the ReplacingMergeTree engine. The distributed table is sharded by sipHash64(docId) on the business primary key, so rows with the same business primary key land on the same shard, the tables are partitioned by month or day of the business document creation time, and queries use final for deduplication. During the Double Eleven period this solution absorbed about 30 million new rows per day; at a peak business QPS of 93 the CPU utilization of the 32C128G cluster (6 shards, 2 replicas) peaked at about 60%, and the system was stable overall. All of the practical optimizations below are also based on the ReplacingMergeTree engine.
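Putting the choice together, a simplified sketch of this layout (database, table, and column names are hypothetical and heavily trimmed; the cluster name default and the ZooKeeper path are assumptions, not our production DDL):

-- local table on every node: ReplacingMergeTree deduplicated by version
create table demo.order_wide_local on cluster default
(
    docId      String,                  -- business primary key
    version    UInt64,                  -- version used by ReplacingMergeTree to keep the latest row
    orderNo    String,
    createTime DateTime
)
engine = ReplicatedReplacingMergeTree('/clickhouse/demo/order_wide/{shard}', '{replica}', version)
partition by toYYYYMM(createTime)       -- monthly partitions; toYYYYMMDD(createTime) for daily ones
order by docId;

-- distributed table: route rows by sipHash64(docId) so one business document always lands on the same shard
create table demo.order_wide_all on cluster default as demo.order_wide_local
engine = Distributed(default, demo, order_wide_local, sipHash64(docId));

-- deduplicate at query time with final
select count() from demo.order_wide_all final where createTime >= '2021-11-01 00:00:00';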

4 How Flink writes to ClickHouse

4.1 Flink version problem

Flink can write to JDBC databases through its JDBC connector, but the way the connector is used differs considerably between versions, because Flink heavily refactored the JDBC connector in version 1.11:

  • The package name before version 1.11 was flink-jdbc
  • The package name after version 1.11 (included) is flink-connector-jdbc

The two versions support writing a ClickHouse sink in different ways, as shown below.

We initially used Flink 1.10.3, whose flink-jdbc does not support writing a DataStream, so we upgraded Flink to 1.11.x or above and used flink-connector-jdbc to write data to ClickHouse.

4.2 Construct ClickHouse Sink

/**
 * Construct the ClickHouse sink
 * @param clusterPrefix ClickHouse database name (configuration prefix)
 * @param sql           insert statement with placeholders, e.g. insert into demo (id, name) values (?, ?)
 */
public static SinkFunction getSink(String clusterPrefix, String sql) {
    String clusterUrl = LoadPropertiesUtil.appInfoProcessMap.get(clusterPrefix + CLUSTER_URL);
    String clusterUsername = LoadPropertiesUtil.appInfoProcessMap.get(clusterPrefix + CLUSTER_USER_NAME);
    String clusterPassword = LoadPropertiesUtil.appInfoProcessMap.get(clusterPrefix + CLUSTER_PASSWORD);
    return JdbcSink.sink(sql, new CkSinkBuilder<>(),
            new JdbcExecutionOptions.Builder().withBatchSize(200000).build(),
            new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                    .withDriverName("ru.yandex.clickhouse.ClickHouseDriver")
                    .withUrl(clusterUrl)
                    .withUsername(clusterUsername)
                    .withPassword(clusterPassword)
                    .build());
}

Use the JdbcSink.sink() api of flink-connector-jdbc to construct a Flink sink. JdbcSink.sink() input parameters have the following meanings

  • sql: an SQL statement with placeholders, for example: insert into demo (id, name) values (?, ?)
  • new CkSinkBuilder<>(): our implementation of the org.apache.flink.connector.jdbc.JdbcStatementBuilder interface, which maps the records in the stream onto java.sql.PreparedStatement; the details are omitted here.
  • The third input parameter: the execution strategy of flink sink.
  • The fourth input parameter: jdbc driver, connection, account number and password.

  • When using it, simply call addSink on the DataStream.

5 Flink write strategy for ClickHouse

Flink writes to ES and ClickHouse at the same time, yet queries against ClickHouse were consistently slower to show new data than queries against ES. We first suspected that ClickHouse's merging and similar processing took time, but those merges do not block queries. Checking the Flink write strategy code revealed that the problem was the strategy we had configured.

In the code above (4.2), new JdbcExecutionOptions.Builder().withBatchSize(200000).build() is the write strategy. To get good write performance ClickHouse recommends batches of no fewer than 1,000 rows, and no more than one write request per second. Our strategy flushes once every 200,000 rows, and Flink also flushes and commits on every checkpoint, so data only appears in ClickHouse when 200,000 rows have accumulated or a Flink checkpoint fires. Our ES sink commits every 1,000 rows or every 5 seconds, which is why writing to ClickHouse lagged behind ES.

Committing only at 200,000 rows or at a checkpoint has a drawback: when the data volume is small and never reaches 200,000 rows, with a checkpoint interval of t1 and a checkpoint duration of t2, the longest delay from receiving a JDQ message to it being written into ClickHouse is t1 + t2, entirely dependent on the checkpoint, and at worst we saw a 1~2 minute backlog. We therefore adjusted the ClickHouse write strategy to new JdbcExecutionOptions.Builder().withBatchIntervalMs(30 * 1000).build(), i.e. a flush every 30 seconds. This way, if the checkpoint is slow the 30s interval still triggers a flush, and otherwise the checkpoint commit still applies; it is a reasonable compromise that can be tuned to your own business. While tuning the interval we noticed that if it is too small, zookeeper's CPU usage rises: flushing every 10 seconds pushed zk usage from under 5% to about 10%.

The org.apache.flink.connector.jdbc.internal.JdbcBatchingOutputFormat#open processing logic in Flink is shown in the figure below.

6 Whether to write to a distributed table or a local table

Let me state the conclusion first: we write to the distributed table.
Both online material and our ClickHouse cloud service colleagues suggest writing to the local tables. A distributed table is only a logical table and stores no physical data itself. When you query a distributed table, it forwards the query to the local table on every shard, collects the per-shard results, and returns the merged result. When you write to a distributed table, it distributes the written data to the shards according to certain rules. If writing to the distributed table were pure network forwarding the impact would be small, but it is not; the actual behavior is shown in the figure below.

There are three shards S1, S2, and S3, and the client connects to the S1 node to write to the distributed table.

  1. Step 1: the client writes 1000 rows to the distributed table, which according to the routing rules assigns 300 rows to S1, 200 to S2, and 500 to S3.
  2. Step 2: the 300 rows belonging to S1 are written directly to S1's disk, while the rows for S2 and S3 are first written into temporary directories on S1.
  3. Step 3: S2 and S3 receive a zk change notification, each generates a task to pull its own data from the temporary directory on S1, puts the task into a queue, and pulls the data over to its own node after some delay.

From this write path we can see that all the data first lands on the disk of the shard the client is connected to. With a large data volume, disk IO on that node becomes a bottleneck; on top of that, the MergeTree family's merges cause write amplification (one row is merged and rewritten several times), which also costs disk performance. The cases of writing to local tables that I have seen online all involve daily increments in the tens or hundreds of billions of rows. We chose to write to the distributed table for two main reasons: it is simple, because writing to local tables requires code changes to decide which node each row goes to, and we saw no serious write bottleneck during development. During Double Eleven, roughly 30 million new rows per day (after deduplication) caused no write pressure. If a bottleneck appears later, we may abandon writing to the distributed table.

7 Why only a certain shard has a high CPU usage

7.1 Uneven data distribution leads to high CPU on some nodes

The picture above shows a problem we hit while onboarding ClickHouse: the CPU usage of node 7-1 is very high, far higher than the other nodes. By locating it with SQL we found that the data volume also differed greatly between nodes, with node 7-1 holding the most data, so it had far more rows to process than the others and a correspondingly higher CPU. The cause was that the distributed table's sharding key hashed the grid station code and sorting warehouse code, whose cardinality is relatively small, so the hash did not spread the data out enough and the data skewed. After switching the hash to the business primary key, the problem of individual nodes with high CPU was resolved.

7.2 A node triggers a merge, resulting in high CPU of this node

The CPU of node 7-4 (primary and replica) was much higher than the other nodes with no obvious trigger. After ruling out events such as new business going live or big promotions, we located the slow SQL by analyzing each node's slow queries through query_log; the exact statements are in section 8.

By comparing the slow SQL of the two nodes, it is found that the query conditions of the following SQL are quite different.

SELECT
    ifNull(sum(t1.unTrackQty), 0) AS unTrackQty
FROM
    wms.wms_order_sku_local AS t1 FINAL
PREWHERE t1.shipmentOrderCreateTime > '2021-11-17 11:00:00'
    AND t1.shipmentOrderCreateTime <= '2021-11-18 11:00:00'
    AND t1.gridStationNo = 'WG0000514'
    AND t1.warehouseNo NOT IN ('wms-6-979', 'wms-6-978', '6_979', '6_978')
    AND t1.orderType = '10'
WHERE
    t1.ckDeliveryTaskStatus = '3'

But one doubt remained: with the same statement, the same number of executions, and no difference in data volume or part count between the two nodes, why did node 7-4 scan five times as many rows as node 7-0? Finding that reason should lead to the root cause.
Next, we use clickhouse-client to perform SQL queries, enable trace-level logs, and view the execution process of SQL. For specific execution methods and query log analysis, refer to section 9.1 below. Here we directly analyze the results.

From the two figures above we can see:

  • Node 7-0: scanned 4 parts, 940,000 rows in total, taking 0.089s
  • Node 7-4: scanned 2 parts, one of which contained 4.91 million rows, 5.02 million rows in total, taking 0.439s

Clearly the part 202111_0_408188_322 on node 7-4 is the anomaly. We partition by month, and for reasons unknown node 7-4 had merged the partition so that the November 17 data we were retrieving sat inside this one huge part; the query therefore had to filter through all the data from the start of November up to the 18th, unlike node 7-0. Since the SQL above filters on gridStationNo = 'WG0000514', the problem was solved by creating a secondary index on the gridStationNo field.
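For reference, the secondary index added here can be expressed with the same syntax as in section 9.4.2; the index type and parameters below are assumptions, and existing partitions must be materialized as described there.

Alter table wms.wms_order_sku_local ON cluster default ADD INDEX gridStationNo_idx gridStationNo TYPE set(0) GRANULARITY 5;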

After adding the secondary index, node 7-4 scanned 2 parts and 380,000 rows in total, taking 0.103s.

7.3 Physical machine failure

This is rare, but it did happen once.

8 How to locate which SQL is consuming CPU

There are two angles for troubleshooting: whether some SQL is executed too frequently, and whether there is slow SQL. Both frequent execution and slow queries consume a lot of CPU. The two cases below illustrate an effective method for each; although they differ in operation, the core of both is analyzing query_log to locate the culprit.

8.1 Locating high-frequency SQL with grafana

Some new requirements went live in December, and we then noticed that CPU usage was noticeably higher than before, so we needed to find out which SQL statements were responsible.

The self-built grafana monitoring in the figure above (build documents linked at the end) shows that several query statements are executed at a very high frequency. Tracing the SQL back to the query interface's code logic, we found that a single front-end request caused the back-end interface to execute several similar SQL statements that differ only in business status. Statements like this, which count different types and different states, can be optimized with conditional aggregation, described in detail in section 9.4.1. After the optimization, the execution frequency of these statements dropped sharply.
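The panels above are driven by system.query_log. A rough sketch of the kind of aggregation behind such a frequency panel, reusing the query_log_all table and the statement-normalization trick from section 8.2 (the time window is illustrative):

-- count executions per normalized statement over the last hour
select
    substring(query, positionCaseInsensitive(query, 'select'), positionCaseInsensitive(query, 'from')) as queryLimit,
    count() as executions
from system.query_log_all
where type = 2
  and event_time >= now() - 3600
group by queryLimit
order by executions desc
limit 20;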

8.2 High number of scanned rows/high memory usage: query_log_all analysis

The previous section covered high CPU caused by high execution frequency. What if the execution frequency is low but CPU is still high? A rarely executed SQL may still scan a huge number of rows and burn disk IO, memory, and CPU. In that case another method is needed to track down the offending SQL (T ⌓T).

ClickHouse itself has a system.query_log table, which is used to record the execution logs of all statements. The following figure shows some key field information of the table

-- Create a distributed table over query_log
CREATE TABLE IF NOT EXISTS system.query_log_all ON CLUSTER default
AS system.query_log
ENGINE = Distributed(sht_ck_cluster_pro, system, query_log, rand());

-- Analysis query
select
    -- execution count
    count(),
    -- average query duration
    avg(query_duration_ms) avgTime,
    -- average rows read per execution
    floor(avg(read_rows)) avgRow,
    -- average data read per execution
    floor(avg(read_rows) / 10000000) avgMB,
    -- one concrete query text
    any(query),
    -- strip the where clause so similar statements group together
    substring(query, positionCaseInsensitive(query, 'select'), positionCaseInsensitive(query, 'from')) as queryLimit
from system.query_log_all -- or system.query_log on a single node
where event_date = '2022-01-21'
  and type = 2
group by queryLimit
order by avgRow desc;

query_log is a local table, so we first create a distributed table over it to query the logs of all nodes, and then run the analysis statement. The execution result is shown in the figure below: the average number of rows scanned by several statements reaches the hundreds of millions, so those statements likely have problems. From the scanned row count you can work backwards to unreasonable indexes or query conditions. The high CPU on a single node in section 7.2 was solved by locating the problematic SQL this way and then investigating further.

9 How to optimize slow queries

SQL optimization in ClickHouse is relatively simple; most of a query's time is spent on disk IO (this small experiment is a good illustration). The core direction is to reduce the amount of data a single query has to process, that is, to reduce disk IO. Below we introduce slow-query analysis, table creation optimizations, and some query optimizations.

9.1 Use service logs for slow query analysis

ClickHouse has provided a native EXPLAIN for viewing query plans since version 20.6, but the information it gives is not very helpful for optimizing slow SQL. Before 20.6 the background service log is the way to get more information for analysis, and even now I prefer reading the service log. This method requires running the SQL with clickhouse-client; a docker-based CK environment document is linked at the end of the article. Higher versions of EXPLAIN do provide fine-grained information such as how many parts a statement scans, and EXPLAIN ESTIMATE can report the number of rows to be read; see the official documentation for EXPLAIN usage.
Use a slow query for analysis, and locate the following slow SQL through query_log_all in 8.2.

select
    ifNull(sum(interceptLackQty), 0) as interceptLackQty
from wms.wms_order_sku_local final
prewhere productionEndTime = '2022-02-17 08:00:00'
    and orderType = '10'
where shipmentOrderDetailDeleted = '0'
  and ckContainerDetailDeleted = '0'

Using clickhouse-client, the send_logs_level parameter specifies the log level as trace.

clickhouse-client -h <host> --port <port> --user <user> --password <password> --send_logs_level=trace

Execute the above slow SQL in the client, and the server prints the log as follows. The log volume is large, and some rows are omitted without affecting the integrity of the overall log.

[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.036317 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Debug> executeQuery: (from 11.77.96.163:35988, user: bjwangjiangbo) select ifNull(sum(interceptLackQty), 0) as interceptLackQty from wms.wms_order_sku_local final prewhere productionEndTime = '2022-02-17 08:00:00' and orderType = '10' where shipmentOrderDetailDeleted = '0' and ckContainerDetailDeleted = '0'
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.037876 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> ContextAccess (bjwangjiangbo): Access granted: SELECT(orderType, interceptLackQty, productionEndTime, shipmentOrderDetailDeleted, ckContainerDetailDeleted) ON wms.wms_order_sku_local
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.038239 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Debug> wms.wms_order_sku_local (SelectExecutor): Key condition: unknown, unknown, and, unknown, unknown, and, and, unknown, unknown, and, and
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.038271 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Debug> wms.wms_order_sku_local (SelectExecutor): MinMax index condition: unknown, unknown, and, unknown, unknown, and, and, unknown, unknown, and, and
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.038399 [ 1340 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> wms.wms_order_sku_local (SelectExecutor): Not using primary index on part 202101_0_0_0_3
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.038475 [ 1407 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> wms.wms_order_sku_local (SelectExecutor): Not using primary index on part 202103_0_17_2_22
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.038491 [ 111 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> wms.wms_order_sku_local (SelectExecutor): Not using primary index on part 202103_18_20_1_22
.................................. several lines omitted (this block: checking within each partition whether an index can be used) ..................................
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.039041 [ 1205 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> wms.wms_order_sku_local (SelectExecutor): Not using primary index on part 202202_1723330_1723365_7
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.039054 [ 159 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> wms.wms_order_sku_local (SelectExecutor): Not using primary index on part 202202_1723367_1723367_0
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.038928 [ 248 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> wms.wms_order_sku_local (SelectExecutor): Not using primary index on part 202201_3675258_3700711_1054
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.039355 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Debug> wms.wms_order_sku_local (SelectExecutor): Selected 47 parts by date, 47 parts by key, 9471 marks by primary key, 9471 marks to read from 47 ranges
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.039495 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> MergeTreeSelectProcessor: Reading 1 ranges from part 202101_0_0_0_3, approx. 65536 rows starting from 0
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.039583 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> MergeTreeSelectProcessor: Reading 1 ranges from part 202101_1_1_0_3, approx. 16384 rows starting from 0
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.040291 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> MergeTreeSelectProcessor: Reading 1 ranges from part 202102_0_2_1_4, approx. 146850 rows starting from 0
.................................. several lines omitted (rows read from each partition) ..................................
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.043538 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> MergeTreeSelectProcessor: Reading 1 ranges from part 202202_1723330_1723365_7, approx. 24576 rows starting from 0
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.043604 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> MergeTreeSelectProcessor: Reading 1 ranges from part 202202_1723366_1723366_0, approx. 8192 rows starting from 0
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.043677 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> MergeTreeSelectProcessor: Reading 1 ranges from part 202202_1723367_1723367_0, approx. 8192 rows starting from 0
.................................. data reading finished, aggregation starts ..................................
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.047880 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> InterpreterSelectQuery: FetchColumns -> Complete
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.263500 [ 1377 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> AggregatingTransform: Aggregating
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.263680 [ 1439 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> Aggregator: Aggregation method: without_key
.................................. several lines omitted (aggregation after reading completes) ..................................
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.263840 [ 156 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> AggregatingTransform: Aggregated. 12298 to 1 rows (from 36.03 KiB) in 0.215046273 sec. (57187.69187876137 rows/sec., 167.54 KiB/sec.)
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.264283 [ 377 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> AggregatingTransform: Aggregated. 12176 to 1 rows (from 35.67 KiB) in 0.215476999 sec. (56507.191284950095 rows/sec., 165.55 KiB/sec.)
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.264307 [ 377 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Trace> Aggregator: Merging aggregated data
.................................. aggregation finished, final result returned ..................................
┌─interceptLackQty─┐
│              563 │
└──────────────────┘
.................................. processing time, speed and summary info ..................................
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.265490 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Information> executeQuery: Read 73645604 rows, 1.20 GiB in 0.229100749 sec., 321455099 rows/sec., 5.22 GiB/sec.
[chi-ck-t8ebn40kv7-3-0-0] 2022.02.17 21:21:54.265551 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} <Debug> MemoryTracker: Peak memory usage (for query): 60.37 MiB.
1 rows in set. Elapsed: 0.267 sec. Processed 73.65 million rows, 1.28 GB (276.03 million rows/s., 4.81 GB/s.)

Now analyze what information can be obtained from the above log. First of all, the query statement does not use the primary key index. The specific information is as follows

2022.02.17 21:21:54.038239 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} wms.wms_order_sku_local (SelectExecutor): Key condition: unknown, unknown, and, unknown, unknown, and, and, unknown, unknown, and, and

The partition index is also not used, the specific information is as follows

2022.02.17 21:21:54.038271 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} wms.wms_order_sku_local (SelectExecutor): MinMax index condition: unknown, unknown, and, unknown, unknown, and, and, unknown, unknown, and, and

This query scans a total of 36 parts and 9390 MarkRange. By querying the system.parts system partition information table, it is found that the current table has a total of 36 active partitions, which is equivalent to a full table scan.

2022.02.17 21:44:58.012832 [ 1138 ] {f1561330-4988-4598-a95d-bd12b15bc750} wms.wms_order_sku_local (SelectExecutor): Selected 36 parts by date, 36 parts by key, 9390 marks by primary key, 9390 marks to read from 36 ranges

A total of 73645604 rows of data were read in this query, which is also the total number of data rows in this table. It took 0.229100749s to read, and a total of 1.20GB of data was read.

2022.02.17 21:21:54.265490 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} executeQuery: Read 73645604 rows, 1.20 GiB in 0.229100749 sec., 321455099 rows/sec., 5.22 GiB/sec.

The maximum memory consumed by this query statement is 60.37MB

2022.02.17 21:21:54.265551 [ 618 ] {ea8f56fe-cf2b-4260-8f44-a006458bdab3} MemoryTracker: Peak memory usage (for query): 60.37 MiB.

Finally, the summary line below: the query took 0.267s in total, processed 73.65 million rows and 1.28 GB of data, and reports the processing speed.

1 rows in set. Elapsed: 0.267 sec. Processed 73.65 million rows, 1.28 GB (276.03 million rows/s., 4.81 GB/s.)

Through the above, two serious problems can be found

  • No primary key index used: results in a full table scan
  • Partitioned indexes are not used: resulting in a full table scan

Therefore, it is necessary to add a primary key field or a partition index to the query condition for optimization.

shipmentOrderCreateTime is the partition field; after adding a condition on it, let's look at the effect.
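A sketch of the adjusted query, with the partition field added to prewhere (the exact time range used in the test is illustrative):

select
    ifNull(sum(interceptLackQty), 0) as interceptLackQty
from wms.wms_order_sku_local final
prewhere shipmentOrderCreateTime >= '2022-02-17 00:00:00'
    and shipmentOrderCreateTime <= '2022-02-17 23:59:59'
    and productionEndTime = '2022-02-17 08:00:00'
    and orderType = '10'
where shipmentOrderDetailDeleted = '0'
  and ckContainerDetailDeleted = '0'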

Analyzing the log again shows that the primary key index is still not used but the partition index is: only 6 parts and 186 MarkRanges are scanned, 1,409,001 rows in total, and 40.76MB of memory is used. The amount of data scanned drops sharply, saving a lot of server resources, and the query time improves from 0.267s to 0.18s.

9.2 Table creation optimization

9.2.1 Try not to use Nullable types

In our practice, making columns Nullable had little visible impact on performance, probably because our data volume is relatively small. However, the official documentation is explicit: avoid the Nullable type, because a Nullable column cannot be indexed, and besides the file holding the normal values it needs an extra file to store the null marks.

Using Nullable almost always negatively affects performance, keep this in mind when designing your databases.

CREATE TABLE test_Nullable (
  orderNo String,
  number Nullable(Int16),
  createTime DateTime
) ENGINE = MergeTree()
PARTITION BY createTime
ORDER BY (orderNo)
PRIMARY KEY (orderNo);

Take the above table creation statement as an example, the number column will generate two additional files number.null.*, occupying additional storage space, while the orderNo column does not have an additional storage file marked with null.

In real applications we inevitably have fields that may be empty. In that case, an impossible value can be used as the default instead of Nullable: for example, if a status field only ever holds values of 0 and above, -1 can serve as its default "empty" value.
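A sketch of this approach, assuming -1 never occurs as a real value of number or status:

CREATE TABLE test_not_nullable (
  orderNo String,
  number Int16 DEFAULT -1,   -- -1 stands for "no value" instead of using Nullable(Int16)
  status Int16 DEFAULT -1,
  createTime DateTime
) ENGINE = MergeTree()
PARTITION BY createTime
ORDER BY (orderNo);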

9.2.2 Partition Granularity

Partition granularity should be chosen according to the characteristics of the business and should be neither too coarse nor too fine. Our data is almost always queried strictly by time, so we partition by day or by month. If the granularity is too fine, say by minute or hour, let alone PARTITION BY create_time directly, an astonishing number of partition directories is created, almost one partition per row, which seriously hurts performance. If the granularity is too coarse, a single partition holds a lot of data. The problem in section 7.2 above is related to this: with monthly partitions a single partition reached about 5 million rows spanning the 1st to the 18th, while the query only needed data for the 17th and 18th; once the monthly partition was merged, all the irrelevant data from the 1st to the 16th had to be processed as well, whereas daily partitions would not have caused the CPU spike. So partition according to your own business, keeping to one principle: a query should only process the data inside its own condition range and not touch irrelevant data.
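For illustration, the only difference is the PARTITION BY expression; a sketch assuming createTime is the business creation time:

-- daily partitions: a query restricted to one or two days only touches those partitions
CREATE TABLE test_partition_by_day (
  orderNo String,
  createTime DateTime
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(createTime)   -- use toYYYYMM(createTime) for monthly partitions; never the raw createTime value
ORDER BY (orderNo);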

9.2.3 Choose appropriate sharding rules for distributed tables

Taking section 7.1 as an example: an unreasonable sharding rule for the distributed table causes severe data skew, where data piles up on a few shards. Instead of using the computing power of the whole cluster, the pressure lands on a few machines and overall cluster performance cannot scale, so pick a sharding rule that fits the business scenario. In our case we changed sipHash64(warehouseNo) to sipHash64(docId), where docId is the unique business identifier.

9.3 Performance testing, comparing optimization effects

Before talking about query optimization, a small tool: clickhouse-benchmark, the load testing tool shipped with ClickHouse. The environment is the same docker-based CK setup mentioned above, and the parameters are described in the official documentation. Here is a simple single-concurrency test example.

clickhouse-benchmark -c 1 -h <host> --port <port> --user <user> --password <password> <<< "<SQL statement>"

In this way, you can understand the SQL-level QPS and TP99 information, so that you can test the performance difference before and after statement optimization.

9.4 Query Optimization

9.4.1 Conditional aggregation function reduces the number of scanned data rows

Assume that an interface wants to count the "quantity of inbound parts", "quantity of effective outbound orders" and "quantity of rechecked parts" of a certain day.

-- quantity of inbound parts
select sum(qty) from table_1 final prewhere type = 'inbound' and dt = '2021-01-01';
-- quantity of effective outbound orders
select count(distinct orderNo) from table_1 final prewhere type = 'outbound' and dt = '2021-01-01' where status = '1';
-- quantity of rechecked parts
select sum(qty) from table_1 final prewhere type = 'check' and dt = '2021-01-01';

Returning these three indicators from one interface requires the three queries above against table_1, but notice that dt is the same in all of them; only the type and status conditions differ. Assuming each query scans 1 million rows for dt = '2021-01-01', one interface request scans 3 million rows. With conditional aggregation functions the three queries collapse into one and the scan drops back to 1 million rows, saving a lot of cluster computing resources.

select
    sumIf(qty, type = 'inbound') as inboundQty,                                   -- quantity of inbound parts
    uniqExactIf(orderNo, type = 'outbound' and status = '1') as outboundOrderQty, -- quantity of effective outbound orders (exact distinct count with a condition)
    sumIf(qty, type = 'check') as checkQty                                        -- quantity of rechecked parts
from table_1 final
prewhere dt = '2021-01-01';

The conditional aggregation function is relatively flexible and can be used freely according to your own business situation. Remember that one purpose is to reduce the overall scanning volume and achieve the purpose of improving query performance.

9.4.2 Secondary Index

The MergeTree family of table engines can define data skipping indexes.
A skipping index works like this: after the data is split into granules (index_granularity rows, specified at table creation), every granularity_value granules are grouped into a block and index information is written for that block; this helps skip large amounts of data that cannot match the where filter, reducing the amount of data a SELECT has to read.

CREATE TABLE table_name
(
    u64 UInt64,
    i32 Int32,
    s String,
    ...
    INDEX a (u64 * i32, s) TYPE minmax GRANULARITY 3,
    INDEX b (u64 * length(s)) TYPE set(1000) GRANULARITY 4
) ENGINE = MergeTree()
...

The index in the above example allows ClickHouse to reduce the amount of read data when executing the following queries.

SELECT count() FROM table WHERE s < 'z'
SELECT count() FROM table WHERE u64 * i32 == 10 AND u64 * length(s) >= 1234

Supported Index Types

  • minmax: per index granularity block, stores the min and max values of the specified expression; helps quickly skip blocks that cannot match equality and range queries, reducing IO.
  • set(max_rows): In the unit of index granularity, store the distinct value set of the specified expression, which is used to quickly judge whether the equivalent query hits the block and reduce IO.
  • ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): After the string is segmented into ngrams, a bloom filter can be built to optimize query conditions such as equivalent, like, and in.
  • tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed): Similar to ngrambf_v1, the difference is that it does not use ngram for word segmentation, but uses punctuation for word segmentation.
  • bloom_filter([false_positive]): builds a bloom filter over the specified column to speed up equality, like, and in query conditions.

Example of creating a secondary index

Alter table wms.wms_order_sku_local ON cluster default ADD INDEX belongProvinceCode_idx belongProvinceCode TYPE set(0) GRANULARITY 5;
Alter table wms.wms_order_sku_local ON cluster default ADD INDEX productionEndTime_idx productionEndTime TYPE minmax GRANULARITY 5;

Rebuild the index data of existing partitions: data inserted before the secondary index was created cannot use it until the index data of each partition is rebuilt.

-- Build a MATERIALIZE statement for every data partition
select concat('alter table wms.wms_order_sku_local on cluster default ', 'MATERIALIZE INDEX productionEndTime_idx in PARTITION '||partition_id||',')
from system.parts
where database = 'wms' and table = 'wms_order_sku_local'
group by partition_id
-- Execute all MATERIALIZE statements returned by the query above to rebuild the index data of each partition

9.4.3 Final replaces argMax for deduplication

Compare the performance gap between the final and argMax methods, as shown in the following SQL

-- final
select count(distinct groupOrderCode), sum(arriveNum), count(distinct sku)
from tms.group_order final
prewhere siteCode = 'WG0001544' and createTime >= '2022-03-14 22:00:00' and createTime <= '2022-03-15 22:00:00'
where arriveNum > 0 and test <> '1'

-- argMax
select count(distinct groupOrderCode), sum(arriveNumTemp), count(distinct sku)
from (
    select argMax(groupOrderCode,version) as groupOrderCode, argMax(arriveNum,version) as arriveNumTemp, argMax(sku,version) as sku
    from tms.group_order
    prewhere siteCode = 'WG0001544' and createTime >= '2022-03-14 22:00:00' and createTime <= '2022-03-15 22:00:00'
    where arriveNum > 0 and test <> '1'
    group by docId
)

The TP99 of the final mode is clearly much better than that of the argMax mode.

9.4.4 prewhere instead of where

ClickHouse's syntax supports an additional prewhere filter, evaluated before the where conditions; think of it as a more efficient where whose job is also to filter data. When a prewhere condition is added, the storage scan proceeds in two stages: first only the column blocks referenced by the prewhere expression are read to check whether any rows match, and only then are the other columns of the matching rows read. Taking the SQL below as an example, the prewhere version first scans the type and dt columns; where no rows match, the data of the other columns is skipped entirely, which further narrows the scan range on top of the mark ranges. Compared with where, prewhere processes less data and performs better. If this is hard to picture, the example should help.

-- conventional where
select count(distinct orderNo) from table_1 final
where type = 'outbound' and status = '1' and dt = '2021-01-01';

-- prewhere
select count(distinct orderNo) from table_1 final
prewhere type = 'outbound' and dt = '2021-01-01'
where status = '1';

The previous sections covered final for deduplication and prewhere for filtering. Combining them has a pitfall to be aware of: prewhere is evaluated before final. Therefore, when filtering on a field whose value changes over time, such as status, rows in an intermediate state can be returned, making the final result inconsistent.

As shown in the figure above, the business data of docId:123_1 is written three times, and the data up to version=103 is the latest version data. When we use where to filter the variable value field of status, the results of statement 1 and statement 2 are as follows.

-- Statement 1: where + status=1, the row docId:123_1 is not matched
select count(distinct orderNo) from table_1 final
where type = 'outbound' and dt = '2021-01-01' and status = '1';

-- Statement 2: where + status=2, the row docId:123_1 is returned
select count(distinct orderNo) from table_1 final
where type = 'outbound' and dt = '2021-01-01' and status = '2';

Now introduce prewhere as in statement 3: when prewhere filters on the status field, the row with status=1 and version=102 passes the filter before final deduplication can discard it in favor of the newer version, so the query result is wrong. The correct form is statement 4, which keeps the mutable status field in where and uses prewhere only for immutable fields.

-- Statement 3: wrong, the mutable status field is placed in prewhere
select count(distinct orderNo) from table_1 final
prewhere type = 'outbound' and dt = '2021-01-01' and status = '1';

-- Statement 4: correct, the mutable status field stays in where
select count(distinct orderNo) from table_1 final
prewhere type = 'outbound' and dt = '2021-01-01'
where status = '1';

Other restrictions: prewhere is currently only available for table engines of the MergeTree series

9.4.5 Column pruning, Partition pruning

ClickHouse is well suited to wide tables with large amounts of data, so avoid SELECT *, which is a very costly operation. Trim the columns and select only the ones you need: the fewer the columns, the less IO is consumed and the better the performance.
Partition pruning means reading only the required partitions by constraining the partition fields in the query.
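A sketch of both kinds of pruning, reusing the hypothetical table_1 from section 9.4.1:

-- bad: reads every column and does not constrain the partition field
select * from table_1 final;

-- better: read only the needed columns and restrict the partition field dt
select orderNo, qty
from table_1 final
prewhere dt = '2021-01-01'
where status = '1';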

9.4.6 where, group by order

The columns in where and group by should follow the column order of the table's ORDER BY clause and be placed at the front, forming a continuous, unbroken prefix of the sort key; otherwise query performance suffers.

-- Table creation statement
create table group_order_local
(
    docId              String,
    version            UInt64,
    siteCode           String,
    groupOrderCode     String,
    sku                String,
    ... non-key fields omitted ...
    createTime         DateTime
) engine = ReplicatedReplacingMergeTree('/clickhouse/tms/group_order/{shard}', '{replica}', version)
PARTITION BY toYYYYMM(createTime)
ORDER BY (siteCode, groupOrderCode, sku);

-- Query statement 1
select count(distinct groupOrderCode) groupOrderQty, ifNull(sum(arriveNum),0) arriveNumSum, count(distinct sku) skuQty
from tms.group_order final
prewhere createTime >= '2021-09-14 22:00:00' and createTime <= '2021-09-15 22:00:00'
and siteCode = 'WG0000709'
where arriveNum > 0 and test <> '1'

-- Query statement 2 (the fields in where/prewhere follow the ORDER BY order)
select count(distinct groupOrderCode) groupOrderQty, ifNull(sum(arriveNum),0) arriveNumSum, count(distinct sku) skuQty
from tms.group_order final
prewhere siteCode = 'WG0000709' and createTime >= '2021-09-14 22:00:00' and createTime <= '2021-09-15 22:00:00'
where arriveNum > 0 and test <> '1'

The table's sort key is ORDER BY (siteCode, groupOrderCode, sku). Statement 1 does not follow it and measured QPS 6.4 with TP99 0.56s under load, while statement 2 follows it and measured QPS 14.9 with TP99 0.12s.

10 How to resist high concurrency and ensure the availability of ClickHouse

1) Trade single-query speed for throughput

max_threads: located in users.xml, it is the maximum number of CPU cores a single query may use, defaulting to the number of cores; on a 32C machine, 32 threads are started to handle the request. You can lower max_threads, sacrificing single-query speed to protect ClickHouse's availability and improve concurrency. It can also be configured through the jdbc url.
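For a single statement the same limit can also be applied per query with a SETTINGS clause; a sketch (the value 8 and the query itself are illustrative):

-- cap this query at 8 threads instead of the server-wide default
select count(distinct orderNo)
from table_1 final
prewhere dt = '2021-01-01'
settings max_threads = 8;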

The figure below shows a load test of max_threads on the 32C128G configuration, under the constraint that the CK cluster stays stable and CPU usage is around 50%. The interface-level test executes 5 SQL statements per request and processes about 5.08 million rows. As max_threads gets smaller, QPS improves and TP99 gets worse; pick a suitable value according to your own business.

2) The interface adds a cache for a certain period of time
3) Asynchronous tasks execute the query statements and write the aggregated indicator results into ES; the application then reads the pre-aggregated results from ES
4) Materialized views solve this through pre-aggregation, but they do not fit our business scenario

11 Reference documents

• Operations such as creating databases, tables, and secondary indexes

• Operations such as changing the ORDER BY fields or PARTITION BY, backing up data, and migrating data between single tables

• Building a clickhouse-client connection to the CK cluster based on docker

• Building grafana based on docker to monitor SQL execution

• Setting up ClickHouse yourself in a test environment

Author: JD Logistics Ma Hongyan

Content source: JD Cloud developer community

