Best practices for the ClickHouse primary key index

In this article, we take a deep dive into ClickHouse indexing. We will cover:

  • How ClickHouse's indexing differs from that of traditional relational databases
  • How ClickHouse builds and uses its sparse primary index
  • Best practices for ClickHouse indexing

You can execute all of the ClickHouse SQL statements and queries in this article on your own machine. For how to install and set up ClickHouse, refer to the Quick Start.

NOTE

This post focuses on sparse indexes.

If you want to learn about secondary data skipping indexes, check out the tutorial.

Dataset

In this article, we will use an anonymized web traffic dataset.

  • We will use a subset of the sample dataset containing 8.87 million rows (events).
  • The uncompressed size of those 8.87 million events is about 700 MB; stored in ClickHouse, the data compresses to roughly 200 MB.
  • In our subset, each row contains three columns describing an internet user (UserID column) who clicked a URL (URL column) at a specific time (EventTime column).

With these three columns, we can already formulate some typical web analysis queries, such as:

  • What are the top 10 URLs clicked by a certain user the most?
  • Who are the top 10 users who clicked the most times on a certain URL?
  • What are the most frequent times (say, days of the week) when users click on a particular URL?
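For instance, the third question could be answered with a query along the following lines (a sketch only: it assumes the hits_UserID_URL table created later in this article, toDayOfWeek is a built-in ClickHouse function, and the URL literal is just an example):

SELECT toDayOfWeek(EventTime) AS DayOfWeek, count() AS Count
FROM hits_UserID_URL
WHERE URL = 'http://public_search'
GROUP BY DayOfWeek
ORDER BY Count DESC;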

Test environment

All runtime numbers in this article were obtained by running ClickHouse 22.2.1 locally on a MacBook Pro with an Apple M1 Pro chip and 16 GB of RAM.

Full table scan

To see how a query performs on this dataset without a primary key, we create a table by executing the following SQL DDL statement (using the MergeTree table engine):

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">CREATE</span> <span style="color:#569cd6">TABLE</span> hits_NoPrimaryKey
</span><span style="color:#9cdcfe"><span style="color:#d4d4d4">(</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>UserID<span style="color:#d4d4d4">`</span> UInt32<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>URL<span style="color:#d4d4d4">`</span> String<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>EventTime<span style="color:#d4d4d4">`</span> <span style="color:#569cd6">DateTime</span>
</span><span style="color:#9cdcfe"><span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ENGINE</span> <span style="color:#d4d4d4">=</span> MergeTree
</span><span style="color:#9cdcfe"><span style="color:#569cd6">PRIMARY</span> <span style="color:#569cd6">KEY</span> tuple<span style="color:#d4d4d4">(</span><span style="color:#d4d4d4">)</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

Next, insert a subset of the hits dataset into the table with the following INSERT statement. It loads part of the dataset from clickhouse.com using the url table function and type inference:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">INSERT</span> <span style="color:#569cd6">INTO</span> hits_NoPrimaryKey <span style="color:#569cd6">SELECT</span>
</span><span style="color:#9cdcfe">   intHash32<span style="color:#d4d4d4">(</span>c11::UInt64<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> UserID<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">   c15 <span style="color:#569cd6">AS</span> URL<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">   c5 <span style="color:#569cd6">AS</span> EventTime
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> url<span style="color:#d4d4d4">(</span><span style="color:#ce9178">'https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz'</span><span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> URL <span style="color:#d4d4d4">!=</span> <span style="color:#ce9178">''</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">Ok.
</span>
<span style="color:#9cdcfe">0 rows in set. Elapsed: 145.993 sec. Processed 8.87 million rows, 18.40 GB (60.78 thousand rows/s., 126.06 MB/s.)
</span></code></span></span></span>

The ClickHouse client output shows that 8.87 million rows were inserted.

Finally, to simplify the discussion later in this article and to make the figures and results reproducible, we optimize the table using the FINAL keyword:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">OPTIMIZE</span> <span style="color:#569cd6">TABLE</span> hits_NoPrimaryKey FINAL<span style="color:#d4d4d4">;</span>
</span></code></span></span></span>
NOTE

In general, it is neither necessary nor recommended to run OPTIMIZE immediately after loading data. Why it is needed for this example will become apparent below.
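If you are curious about the effect, one way to observe it (a small check against the standard system.parts table) is to count the active data parts before and after the OPTIMIZE; the insert typically produces several parts, and OPTIMIZE ... FINAL merges them into a single part:

SELECT count() AS active_parts
FROM system.parts
WHERE (table = 'hits_NoPrimaryKey') AND active;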

Now we execute our first web analytics query: the top 10 most-clicked URLs for the internet user with UserID 749927693:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">SELECT</span> URL<span style="color:#d4d4d4">,</span> <span style="color:#dcdcaa">count</span><span style="color:#d4d4d4">(</span>URL<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">as</span> Count
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> hits_NoPrimaryKey
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> UserID <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">749927693</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">GROUP</span> <span style="color:#569cd6">BY</span> URL
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ORDER</span> <span style="color:#569cd6">BY</span> Count <span style="color:#569cd6">DESC</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">LIMIT</span> <span style="color:#b5cea8">10</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">┌─URL────────────────────────────┬─Count─┐
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-barana.. │   170 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-id=371...│    52 │
</span><span style="color:#9cdcfe">│ http://public_search           │    45 │
</span><span style="color:#9cdcfe">│ http://kovrik-medvedevushku-...│    36 │
</span><span style="color:#9cdcfe">│ http://forumal                 │    33 │
</span><span style="color:#9cdcfe">│ http://korablitz.ru/L_1OFFER...│    14 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-id=371...│    14 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-john-D...│    13 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-john-D...│    10 │
</span><span style="color:#9cdcfe">│ http://wot/html?page/23600_m...│     9 │
</span><span style="color:#9cdcfe">└────────────────────────────────┴───────┘
</span>
<span style="color:#9cdcfe">10 rows in set. Elapsed: 0.022 sec.
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">Processed 8.87 million rows,
</span></span><span style="color:#9cdcfe">70.45 MB (398.53 million rows/s., 3.17 GB/s.)
</span></code></span></span></span>

The ClickHouse client output shows that ClickHouse performed a full table scan! All 8.87 million rows of our table were streamed into ClickHouse, which does not scale.

To make this more efficient and faster, we need to use a table with a proper primary key. This will allow ClickHouse to automatically (based on the columns of the primary key) create a sparse primary index, which can then be used to significantly speed up the execution of our example query.

A table with a primary key

Create a table with a compound primary key over the UserID and URL columns:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">CREATE</span> <span style="color:#569cd6">TABLE</span> hits_UserID_URL
</span><span style="color:#9cdcfe"><span style="color:#d4d4d4">(</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>UserID<span style="color:#d4d4d4">`</span> UInt32<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>URL<span style="color:#d4d4d4">`</span> String<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>EventTime<span style="color:#d4d4d4">`</span> <span style="color:#569cd6">DateTime</span>
</span><span style="color:#9cdcfe"><span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ENGINE</span> <span style="color:#d4d4d4">=</span> MergeTree
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe"><span style="color:#569cd6">PRIMARY</span> <span style="color:#569cd6">KEY</span> <span style="color:#d4d4d4">(</span>UserID<span style="color:#d4d4d4">,</span> URL<span style="color:#d4d4d4">)</span>
</span></span><span style="color:#9cdcfe"><span style="color:#569cd6">ORDER</span> <span style="color:#569cd6">BY</span> <span style="color:#d4d4d4">(</span>UserID<span style="color:#d4d4d4">,</span> URL<span style="color:#d4d4d4">,</span> EventTime<span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe">SETTINGS index_granularity <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">8192</span><span style="color:#d4d4d4">,</span> index_granularity_bytes <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">0</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

DDL details

The primary key in the above DDL statement causes the primary index to be created from the two specified key columns. Also note that index_granularity_bytes is set to 0, which disables adaptive index granularity (this simplifies the discussion later and makes the figures and results reproducible).
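You can confirm the key definitions after creating the table, for example with a query against the standard system.tables table (the primary_key and sorting_key columns hold the expressions we specified):

SELECT name, primary_key, sorting_key
FROM system.tables
WHERE name = 'hits_UserID_URL';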


Insert data:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">INSERT</span> <span style="color:#569cd6">INTO</span> hits_UserID_URL <span style="color:#569cd6">SELECT</span>
</span><span style="color:#9cdcfe">   intHash32<span style="color:#d4d4d4">(</span>c11::UInt64<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> UserID<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">   c15 <span style="color:#569cd6">AS</span> URL<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">   c5 <span style="color:#569cd6">AS</span> EventTime
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> url<span style="color:#d4d4d4">(</span><span style="color:#ce9178">'https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz'</span><span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> URL <span style="color:#d4d4d4">!=</span> <span style="color:#ce9178">''</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">0 rows in set. Elapsed: 149.432 sec. Processed 8.87 million rows, 18.40 GB (59.38 thousand rows/s., 123.16 MB/s.)
</span></code></span></span></span>


Optimize the table:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">OPTIMIZE</span> <span style="color:#569cd6">TABLE</span> hits_UserID_URL FINAL<span style="color:#d4d4d4">;</span>
</span></code></span></span></span>


We can use the following query to get metadata about a table:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">SELECT</span>
</span><span style="color:#9cdcfe">    part_type<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    path<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    formatReadableQuantity<span style="color:#d4d4d4">(</span><span style="color:#569cd6">rows</span><span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> <span style="color:#569cd6">rows</span><span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    formatReadableSize<span style="color:#d4d4d4">(</span>data_uncompressed_bytes<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> data_uncompressed_bytes<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    formatReadableSize<span style="color:#d4d4d4">(</span>data_compressed_bytes<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> data_compressed_bytes<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    formatReadableSize<span style="color:#d4d4d4">(</span>primary_key_bytes_in_memory<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> primary_key_bytes_in_memory<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    marks<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    formatReadableSize<span style="color:#d4d4d4">(</span>bytes_on_disk<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> bytes_on_disk
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> system<span style="color:#d4d4d4">.</span>parts
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> <span style="color:#d4d4d4">(</span><span style="color:#569cd6">table</span> <span style="color:#d4d4d4">=</span> <span style="color:#ce9178">'hits_UserID_URL'</span><span style="color:#d4d4d4">)</span> <span style="color:#d4d4d4">AND</span> <span style="color:#d4d4d4">(</span>active <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">1</span><span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe">FORMAT Vertical<span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">part_type:                   Wide
</span><span style="color:#9cdcfe">path:                        ./store/d9f/d9f36a1a-d2e6-46d4-8fb5-ffe9ad0d5aed/all_1_9_2/
</span><span style="color:#9cdcfe">rows:                        8.87 million
</span><span style="color:#9cdcfe">data_uncompressed_bytes:     733.28 MiB
</span><span style="color:#9cdcfe">data_compressed_bytes:       206.94 MiB
</span><span style="color:#9cdcfe">primary_key_bytes_in_memory: 96.93 KiB
</span><span style="color:#9cdcfe">marks:                       1083
</span><span style="color:#9cdcfe">bytes_on_disk:               207.07 MiB
</span>

<span style="color:#9cdcfe">1 rows in set. Elapsed: 0.003 sec.
</span></code></span></span></span>

The client output shows:

  • The table's data is stored in wide format in a specific directory, meaning each column has its own data file and mark file.
  • The table has 8.87 million rows.
  • The uncompressed data is 733.28 MiB.
  • The compressed data is 206.94 MiB.
  • The table has 1083 primary index entries (marks), and the index occupies 96.93 KiB of memory.
  • In total, the table's data files, mark files, and primary index file take up 207.07 MiB on disk.
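As a rough sanity check on those numbers: 96.93 KiB for 1083 index entries is about 92 bytes per entry, which is plausible for one UInt32 UserID value (4 bytes) plus one URL string per entry.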

Index design

In traditional relational database management systems, the primary index contains one entry per table row. For our dataset, the primary index, typically a B(+)-Tree data structure, would therefore contain 8.87 million entries.

Such an index allows specific rows to be located quickly, which makes lookups and updates efficient. Searching for an entry in a B(+)-Tree has an average time complexity of O(log2 n); for a table with 8.87 million rows this means about log2(8,870,000) ≈ 23 steps to locate any index entry.

This capability comes at a price: additional disk and memory overhead, and higher insertion costs when new rows are added to tables and entries are added to indexes (and sometimes the B-Tree needs to be rebalanced).

Given the challenges associated with B-Tree indexes, table engines in ClickHouse take a different approach. The ClickHouse MergeTree family of engines is designed and optimized to handle massive volumes of data.

These tables are designed to receive millions of row inserts per second and to store very large (hundreds of petabytes) volumes of data.

Data is quickly written to tables part by part, and rules are applied in the background for merging the parts.

In ClickHouse, each data part has its own primary index. When parts are merged, the primary indexes of the merged parts are also merged.

At this scale, disk and memory efficiency are very important. Therefore, instead of creating an index entry for every row, one index entry is built for a group of rows (called a granule).

This sparse index is possible because ClickHouse stores the rows on disk ordered by the primary key columns.

Instead of directly locating individual rows (like a B-Tree-based index does), the sparse primary index allows ClickHouse to quickly (via a binary search over the index entries) identify groups of rows that might match the query.

Groups (granules) of potentially matching rows are then loaded into the ClickHouse engine in a parallel fashion in order to find matching rows.

This index design keeps the primary index small (it can, and must, fit entirely in main memory) while still significantly speeding up query execution, especially for the range queries that are common in data analytics use cases.

The following sections describe in detail how ClickHouse builds and uses its sparse primary index. Later in the article, we discuss best practices for choosing, removing, and ordering the table columns used to build the index (the primary key columns).

Data is stored on disk ordered by primary key columns

A few remarks about the table we created above:

NOTE
  • If we had only specified the sort key, the primary key would be implicitly defined as the sort key (see the sketch after this note).

  • To be memory efficient, we explicitly specified a primary key that contains only the columns our queries filter on. The primary index built from the primary key is fully loaded into main memory.

  • For consistency and maximum compression, we defined the sort key separately to include all columns of the table (after sorting, similar data is placed close together, which generally compresses better).

  • If both a primary key and a sort key are specified, the primary key must be a prefix of the sort key.
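As a minimal sketch of the first point above (the table name hits_OrderByOnly is hypothetical): with only ORDER BY specified, the primary key is implicitly the full sort key (UserID, URL, EventTime), so all three columns end up in the in-memory primary index:

CREATE TABLE hits_OrderByOnly
(
    `UserID` UInt32,
    `URL` String,
    `EventTime` DateTime
)
ENGINE = MergeTree
ORDER BY (UserID, URL, EventTime);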

The inserted rows are stored on disk in ascending lexicographic order by the primary key columns (plus the additional EventTime column from the sort key).

NOTE

ClickHouse allows inserting multiple rows with identical primary key column values. In this case (see rows 1 and 2 in the figure below), the final order is determined by the specified sort key, here the value of the EventTime column.

As shown in the figure below, ClickHouse is a column-oriented database:

  • On disk, each table column has a data file (*.bin) in which all values for that column are stored in compressed form, and
  • the 8.87 million rows are stored on disk in ascending lexicographic order of the primary key columns (plus the additional sort key column), that is:
    • first by UserID,
    • then by URL,
    • and finally by EventTime.

UserID.bin, URL.bin, and EventTime.bin are the data files for the UserID, URL, and EventTime columns.

NOTE
  • Because the primary key defines the lexicographic order of the rows on disk, a table can have only one primary key.

  • We number rows starting at 0 to align with ClickHouse's internal row numbering scheme, which is also used in log messages.

Data is organized into granules for parallel data processing

For data processing purposes, a table's column values are logically divided into granules. A granule is the smallest indivisible dataset that is streamed into ClickHouse for data processing. This means that instead of reading individual rows, ClickHouse always reads (in a streaming fashion and in parallel) whole groups of rows (granules).

NOTE

Column values are not physically stored inside granules: granules are just a logical organization of the column values for query processing.

The figure below shows how the 8.87 million rows (column values) of the table are organized into 1083 granules, as a result of the table's DDL statement containing the setting index_granularity (set to its default value of 8192).
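This granule count follows directly from the granularity setting: 8,870,000 rows / 8192 rows per granule ≈ 1082.8, which rounds up to 1083 granules (the last one only partially filled).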

The first (according to the physical order on disk) 8192 rows (their column values) logically belong to granule 0, then the next 8192 rows (their column values) belong to granule 1, and so on.

NOTE
  • The last granule (granule 1082) holds fewer than 8192 rows.

  • We mentioned in the DDL details earlier in this article that we disabled adaptive index granularity (to simplify the discussion and to make the figures and results reproducible).

    Therefore, all granules of the example table (except the last one) have the same size.

  • For tables with adaptive index granularity (index granularity is adaptive by default), some granules can contain fewer than 8192 rows, depending on the row data size.

  • We marked some column values of the primary key columns (UserID, URL) in orange.

    These orange-marked column values are the minimum values of each primary key column within each granule. The exception is the last granule (granule 1082 in the figure above), for which we marked the maximum values.

    As we will see below, these orange-marked column values become the entries of the table's primary index.

  • We number rows starting at 0 to align with ClickHouse's internal row numbering scheme, which is also used in log messages.

The primary index has one entry per granule

The primary index is created from the granules shown in the figure above. The index is an uncompressed flat array file (primary.idx) containing so-called numerical index marks, numbered starting at 0.

The figure below shows that the index stores, for each granule, the minimum primary key column values (the values marked in orange in the figure above). For example:

  • the first index entry ("mark 0" in the figure below) stores the minimum values of the primary key columns of granule 0 from the figure above,
  • the second index entry ("mark 1" in the figure below) stores the minimum values of the primary key columns of granule 1, and so on.

In total, our index has 1083 entries for the 8.87 million rows and 1083 granules of the table:

NOTE
  • The last index entry ("mark 1082" in the figure above) stores the maximum values of the primary key columns of granule 1082.

  • Index entries (index marks) are not based on specific rows of the table but on granules. For index entry "mark 0" in the figure above, for example, there is no single row in our table with UserID 240.923 and URL "goal://metry=10000467796a411..."; instead there is a granule 0 for which 240.923 is the minimum UserID value and "goal://metry=10000467796a411..." is the minimum URL value, and these two values come from different rows.

  • The primary index file is fully loaded into main memory. If the file is larger than the available free memory, ClickHouse raises an error.

Primary index entries are called index marks because each entry marks the start of a specific data range. Specifically for our example table:

  • UserID index marks:
    The UserID values stored in the primary index are sorted in ascending order.
    'mark 1' in the figure above therefore indicates that the UserID values of all table rows in granule 1, and in all following granules, are guaranteed to be greater than or equal to 4.073.710.

    As we will see later, this global ordering is what allows ClickHouse to run a binary search over the index marks of the first key column when a query filters on the first column of the primary key.

  • URL index marks:
    The primary key columns UserID and URL have similarly high cardinality, which means that the index marks of all key columns after the first in general only describe a data range within each granule.
    For example, all URL values in granule 0 ('mark 0') are greater than or equal to goal://metry=10000467796a411..., but the same cannot be said for granule 1, because 'mark 1' has a different UserID value than 'mark 0'.

    We'll discuss the impact of this on query execution performance in more detail later.

The primary index is used to select granules

Now, we can perform queries backed by the primary index.

The following query computes the top 10 most-clicked URLs for UserID 749927693:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">SELECT</span> URL<span style="color:#d4d4d4">,</span> <span style="color:#dcdcaa">count</span><span style="color:#d4d4d4">(</span>URL<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> Count
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> hits_UserID_URL
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> UserID <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">749927693</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">GROUP</span> <span style="color:#569cd6">BY</span> URL
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ORDER</span> <span style="color:#569cd6">BY</span> Count <span style="color:#569cd6">DESC</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">LIMIT</span> <span style="color:#b5cea8">10</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">┌─URL────────────────────────────┬─Count─┐
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-barana.. │   170 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-id=371...│    52 │
</span><span style="color:#9cdcfe">│ http://public_search           │    45 │
</span><span style="color:#9cdcfe">│ http://kovrik-medvedevushku-...│    36 │
</span><span style="color:#9cdcfe">│ http://forumal                 │    33 │
</span><span style="color:#9cdcfe">│ http://korablitz.ru/L_1OFFER...│    14 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-id=371...│    14 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-john-D...│    13 │
</span><span style="color:#9cdcfe">│ http://auto.ru/chatay-john-D...│    10 │
</span><span style="color:#9cdcfe">│ http://wot/html?page/23600_m...│     9 │
</span><span style="color:#9cdcfe">└────────────────────────────────┴───────┘
</span>
<span style="color:#9cdcfe">10 rows in set. Elapsed: 0.005 sec.
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">Processed 8.19 thousand rows,
</span></span><span style="color:#9cdcfe">740.18 KB (1.53 million rows/s., 138.59 MB/s.)
</span></code></span></span></span>

The ClickHouse client output shows that no full table scan was performed: only 8.19 thousand rows were streamed into ClickHouse.

If trace logging is turned on, the ClickHouse server log shows that ClickHouse ran a binary search over the 1083 UserID index marks to identify granules that may contain rows with a UserID column value of 749927693. This requires 19 steps, with an average time complexity of O(log2 n):

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">...Executor): Key condition: (column 0 in [749927693, 749927693])
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">...Executor): Running binary search on index range for part all_1_9_2 (1083 marks)
</span></span><span style="color:#9cdcfe">...Executor): Found (LEFT) boundary mark: 176
</span><span style="color:#9cdcfe">...Executor): Found (RIGHT) boundary mark: 177
</span><span style="color:#9cdcfe">...Executor): Found continuous range in 19 steps
</span><span style="color:#9cdcfe">...Executor): Selected 1/1 parts by partition key, 1 parts by primary key,
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">              1/1083 marks by primary key, 1 marks to read from 1 ranges
</span></span><span style="color:#9cdcfe">...Reading ...approx. 8192 rows starting from 1441792
</span></code></span></span></span>

We can see in the trace log above that 1 of the 1083 existing marks satisfied the query.

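To see these messages yourself, one way (assuming an interactive clickhouse-client session) is to raise the log level that the server sends back to the client before re-running the query; send_logs_level is a standard ClickHouse setting:

SET send_logs_level = 'trace';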

We can also reproduce this result by using EXPLAIN:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">EXPLAIN</span> indexes <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">1</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">SELECT</span> URL<span style="color:#d4d4d4">,</span> <span style="color:#dcdcaa">count</span><span style="color:#d4d4d4">(</span>URL<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> Count
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> hits_UserID_URL
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> UserID <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">749927693</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">GROUP</span> <span style="color:#569cd6">BY</span> URL
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ORDER</span> <span style="color:#569cd6">BY</span> Count <span style="color:#569cd6">DESC</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">LIMIT</span> <span style="color:#b5cea8">10</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

The result is as follows:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">┌─explain───────────────────────────────────────────────────────────────────────────────┐
</span><span style="color:#9cdcfe">│ Expression (Projection)                                                               │
</span><span style="color:#9cdcfe">│   Limit (preliminary LIMIT (without OFFSET))                                          │
</span><span style="color:#9cdcfe">│     Sorting (Sorting for ORDER BY)                                                    │
</span><span style="color:#9cdcfe">│       Expression (Before ORDER BY)                                                    │
</span><span style="color:#9cdcfe">│         Aggregating                                                                   │
</span><span style="color:#9cdcfe">│           Expression (Before GROUP BY)                                                │
</span><span style="color:#9cdcfe">│             Filter (WHERE)                                                            │
</span><span style="color:#9cdcfe">│               SettingQuotaAndLimits (Set limits and quota after reading from storage) │
</span><span style="color:#9cdcfe">│                 ReadFromMergeTree                                                     │
</span><span style="color:#9cdcfe">│                 Indexes:                                                              │
</span><span style="color:#9cdcfe">│                   PrimaryKey                                                          │
</span><span style="color:#9cdcfe">│                     Keys:                                                             │
</span><span style="color:#9cdcfe">│                       UserID                                                          │
</span><span style="color:#9cdcfe">│                     Condition: (UserID in [749927693, 749927693])                     │
</span><span style="color:#9cdcfe">│                     Parts: 1/1                                                        │
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">│                     Granules: 1/1083                                                  │
</span></span><span style="color:#9cdcfe">└───────────────────────────────────────────────────────────────────────────────────────┘
</span>
<span style="color:#9cdcfe">16 rows in set. Elapsed: 0.003 sec.
</span></code></span></span></span>

The client output shows that one of the 1083 granules was selected as possibly containing rows with a UserID column value of 749927693.

CONCLUSION

When a query filters on the first key column of a compound primary key, ClickHouse runs a binary search over that key column's index marks.

As discussed above, ClickHouse uses its sparse primary index to quickly (via a binary search algorithm) select granules that are likely to contain rows matching the query.

This is the first stage (granule selection) of ClickHouse query execution.

In the second stage (data reading), ClickHouse locates the selected granules in order to stream all their rows into the ClickHouse engine and find the rows that actually match the query.

We will discuss the second stage in more detail in the next section.

Mark files are used to locate granules

The figure below shows part of the primary index file for our table.

As described above, mark 176 was identified via a binary search over the 1083 UserID index marks. Its corresponding granule 176 can therefore possibly contain rows with a UserID column value of 749.927.693.

How the selected granule is located

To confirm (or rule out) that some rows in granule 176 contain the UserID column value 749.927.693, all 8192 rows belonging to this granule need to be read into ClickHouse.

To read this data, ClickHouse needs to know the physical location of granule 176.

In ClickHouse, the physical locations of all granules of our table are stored in mark files. Similar to the data files, there is one mark file per table column.

The figure below shows the three mark files UserID.mrk, URL.mrk, and EventTime.mrk, which store the physical locations of the granules for the UserID, URL, and EventTime columns of the table.

We have already discussed that the primary index is a flat uncompressed array file (primary.idx) containing index marks numbered starting at 0.

Similarly, a mark file is also a flat uncompressed array file (*.mrk) containing marks numbered starting at 0.

Once ClickHouse has determined and selected the index mark of a granule that may contain matching rows for a query, a positional array lookup in the mark files retrieves the physical location of that granule.

Each mark file entry for a specific column stores two locations in the form of offsets:

  • The first offset ('block_offset' in the figure above) locates the block, within the compressed column data file, that contains the compressed version of the selected granule. This compressed block may contain several compressed granules. The located compressed block is decompressed into main memory on read.

  • The second offset ('granule_offset' in the figure above) gives the position of the granule within the decompressed block.

All 8192 rows of the located granule are then loaded into ClickHouse for further processing.

Why are mark files needed?

Why doesn't the primary index directly contain the physical locations of the granules corresponding to the index marks?

Because ClickHouse is designed for very large scale data, it is important to use disk and memory very efficiently.

The primary index file needs to fit in main memory.

For our example query, ClickHouse used the primary index and selected a single granule that may contain rows matching the query. Only for that one granule does ClickHouse need to know the physical location in order to stream the corresponding rows for further processing.

Furthermore, this offset information is only needed for the UserID and URL columns.

Offset information is not needed for columns that are not used in the query, such as EventTime.

For our example query, ClickHouse needs only the two physical location offsets of granule 176 in the UserID data file (UserID.bin) and the two physical location offsets of granule 176 in the URL data file (URL.bin).

The indirection provided by mark files avoids storing, directly in the primary index, entries for the physical locations of all 1083 granules for all three columns, and thus avoids having unnecessary (and potentially unused) data in main memory.

The figure and text below illustrate, for our example query, how ClickHouse locates granule 176 in the UserID.bin data file.

As we discussed earlier in this article, ClickHouse selected primary index mark 176, so granule 176 may contain the matching rows required by the query.

ClickHouse now uses the mark number (176) selected from the index to do a positional array lookup in UserID.mrk to obtain the two offsets for locating granule 176.

As shown, the first offset locates the compressed block within the UserID.bin data file that contains the compressed data of granule 176.

Once the located block is decompressed into main memory, the second offset from the mark file can be used to locate granule 176 within the uncompressed data.

In order to execute our example query (the top 10 most-clicked URLs for the internet user with UserID 749.927.693), ClickHouse needs to locate (and read) granule 176 from both the UserID.bin data file and the URL.bin data file.

The figure above shows how ClickHouse locates the granule in the UserID.bin data file.

In parallel, ClickHouse does the same for granule 176 of the URL.bin data file. The two corresponding granules are aligned and streamed into the ClickHouse engine for further processing, i.e. aggregating and counting the URL values per group over all rows whose UserID is 749.927.693, and finally outputting the 10 largest URL groups in descending count order.

Query performance problems

When a query filters on a column that is part of a compound key and is the first key column, ClickHouse runs a binary search over that key column's index marks.

But what happens when a query filters on a column that is part of the compound primary key, but is not the first key column?

NOTE

Here we discuss the scenario where a query explicitly does not filter on the first key column, but only on a key column after the first.

When a query filters on both the first key column and any key column(s) after the first, ClickHouse runs a binary search over the first key column's index marks.



We use the following query to compute the top 10 users who most frequently clicked the URL "http://public_search":

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">SELECT</span> UserID<span style="color:#d4d4d4">,</span> <span style="color:#dcdcaa">count</span><span style="color:#d4d4d4">(</span>UserID<span style="color:#d4d4d4">)</span> <span style="color:#569cd6">AS</span> Count
</span><span style="color:#9cdcfe"><span style="color:#569cd6">FROM</span> hits_UserID_URL
</span><span style="color:#9cdcfe"><span style="color:#569cd6">WHERE</span> URL <span style="color:#d4d4d4">=</span> <span style="color:#ce9178">'http://public_search'</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">GROUP</span> <span style="color:#569cd6">BY</span> UserID
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ORDER</span> <span style="color:#569cd6">BY</span> Count <span style="color:#569cd6">DESC</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">LIMIT</span> <span style="color:#b5cea8">10</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">┌─────UserID─┬─Count─┐
</span><span style="color:#9cdcfe">│ 2459550954 │  3741 │
</span><span style="color:#9cdcfe">│ 1084649151 │  2484 │
</span><span style="color:#9cdcfe">│  723361875 │   729 │
</span><span style="color:#9cdcfe">│ 3087145896 │   695 │
</span><span style="color:#9cdcfe">│ 2754931092 │   672 │
</span><span style="color:#9cdcfe">│ 1509037307 │   582 │
</span><span style="color:#9cdcfe">│ 3085460200 │   573 │
</span><span style="color:#9cdcfe">│ 2454360090 │   556 │
</span><span style="color:#9cdcfe">│ 3884990840 │   539 │
</span><span style="color:#9cdcfe">│  765730816 │   536 │
</span><span style="color:#9cdcfe">└────────────┴───────┘
</span>
<span style="color:#9cdcfe">10 rows in set. Elapsed: 0.086 sec.
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">Processed 8.81 million rows,
</span></span><span style="color:#9cdcfe">799.69 MB (102.11 million rows/s., 9.27 GB/s.)
</span></code></span></span></span>

The client output shows that ClickHouse performed almost a full table scan, even though the URL column is part of the compound primary key! ClickHouse read 8.81 million of the table's 8.87 million rows.

If trace logging is enabled, the ClickHouse server log shows that ClickHouse used a generic exclusion search over the 1083 URL index marks in order to identify granules that may contain rows with the URL column value "http://public_search":

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">...Executor): Key condition: (column 1 in ['http://public_search',
</span><span style="color:#9cdcfe">                                           'http://public_search'])
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">...Executor): Used generic exclusion search over index for part all_1_9_2
</span></span><span style="color:#9cdcfe">              with 1537 steps
</span><span style="color:#9cdcfe">...Executor): Selected 1/1 parts by partition key, 1 parts by primary key,
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe">              1076/1083 marks by primary key, 1076 marks to read from 5 ranges
</span></span><span style="color:#9cdcfe">...Executor): Reading approx. 8814592 rows with 10 streams
</span></code></span></span></span>

We can see in the trace log above that 1076 of the 1083 granules were selected (via their marks) as possibly containing rows with the matching URL value.

As a result, 8.81 million rows are streamed into the ClickHouse engine (in parallel, using 10 streams) in order to identify the rows that actually contain the URL value "http://public_search".

However, as it turns out, only 39 of those granules actually contain matching rows.

So while the primary index based on the compound primary key (UserID, URL) is very useful for speeding up queries that filter on a specific UserID value, it does not provide significant help for queries that filter on a specific URL value.

The reason is that the URL column is not the first key column, so ClickHouse uses a generic exclusion search algorithm (instead of a binary search) over the URL column's index marks, and the effectiveness of that algorithm depends on the cardinality of the URL column's predecessor key column.

To illustrate this, we describe how the generic exclusion search algorithm works.

Generic exclusion search algorithm

The following illustrates how the ClickHouse generic exclusion search algorithm works when granules are selected via any column after the first, for the cases where the predecessor key column has lower or higher cardinality.

For both cases we will assume:

  • we are searching for rows with the URL value "W3",
  • an abstract version of our hits table, simplified to just the UserID and URL columns,
  • the same compound primary key (UserID, URL). This means rows are first ordered by UserID value, and rows with the same UserID value are then ordered by URL,
  • a granule size of two, i.e. each granule contains two rows.

In the diagrams below, we have marked the minimum key column values of each granule in orange.

Predecessor key column has low cardinality

Suppose UserID had low cardinality. In this case, it would be likely that the same UserID value is spread over multiple table rows and granules, and therefore over multiple index marks. For index marks with the same UserID, the URL values of those marks are sorted in ascending order (because the table rows are ordered first by UserID and then by URL). This allows efficient filtering, as described below:

In the figure above, there are three different scenarios for the granule selection process with our abstract sample data:

  1. If the (minimum) URL value of index mark 0 is smaller than W3 and the URL value of the directly succeeding index mark is also smaller than W3, index mark 0 can be excluded, because marks 0, 1, and 2 have the same UserID value. Note that this exclusion precondition ensures that granule 0 and the next granule 1 are composed entirely of U1 UserID values, so that ClickHouse can assume that the maximum URL value in granule 0 is also smaller than W3 and exclude the granule.

  2. If the URL value of index mark 1 is smaller than (or equal to) W3 and the URL value of the directly succeeding index mark is greater than (or equal to) W3, index mark 1 is selected, because it means that granule 1 may contain rows with URL W3.

  3. Index marks 2 and 3, whose URL values are greater than W3, can be excluded, because the index marks of a primary index store the minimum key column values of each granule, and therefore granules 2 and 3 cannot possibly contain the URL value W3.

Predecessor key column has high cardinality

When UserID has high cardinality, it is unlikely that the same UserID value is spread over multiple table rows and granules. This means the URL values of the index marks are not monotonically increasing:

As you can see in the diagram above, all marks whose URL value is smaller than W3 are selected, and the rows of their associated granules are streamed into the ClickHouse engine.

This is because, although all index marks in the diagram fall into scenario 1 described above, they do not satisfy the exclusion precondition that the two directly succeeding index marks have the same UserID value as the current mark, and therefore they cannot be excluded.

For example, consider index mark 0, whose URL value is smaller than W3 and whose directly succeeding index mark also has a URL value smaller than W3. It cannot be excluded, because the two directly succeeding index marks 1 and 2 do not have the same UserID value as the current mark 0.

Note the requirement that the two succeeding index marks have the same UserID value. This ensures that the granules of the current and the next mark are composed entirely of U1 UserID values. If only the next mark had the same UserID, the URL value of that next mark could come from a table row with a different UserID; this is indeed the case in the diagram above, where W2 comes from a row with UserID U2, not U1.

This ultimately prevents ClickHouse from making any assumption about the maximum URL value in granule 0. Instead, it has to assume that granule 0 may contain a row with URL value W3 and is forced to select mark 0.


The same applies to marks 1, 2, and 3.

Conclusion

When a query filters on a column that is part of a compound primary key but is not the first key column, the generic exclusion search algorithm (instead of the binary search) that ClickHouse uses works best when the predecessor key column has low cardinality.

In our example dataset, both key columns (UserID, URL) have similarly high cardinality, and, as explained above, the generic exclusion search algorithm is not very effective when the predecessor key column of the URL column has high cardinality.

A look at data skipping indexes

Because UserID and URL both have high cardinality, filtering on URL is not particularly effective, and creating a secondary data skipping index on the URL column would not improve things much either.

For example, these two statements create and populate a minmax data skipping index on the URL column of our table:

<span style="color:#161517"><span style="background-color:var(--ifm-alert-background-color)"><span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">ALTER</span> <span style="color:#569cd6">TABLE</span> hits_UserID_URL <span style="color:#569cd6">ADD</span> <span style="color:#569cd6">INDEX</span> url_skipping_index URL <span style="color:#569cd6">TYPE</span> minmax GRANULARITY <span style="color:#b5cea8">4</span><span style="color:#d4d4d4">;</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ALTER</span> <span style="color:#569cd6">TABLE</span> hits_UserID_URL MATERIALIZE <span style="color:#569cd6">INDEX</span> url_skipping_index<span style="color:#d4d4d4">;</span>
</span></code></span></span></span></span></span>

ClickHouse now creates an additional index that stores, for each group of 4 consecutive granules (note the GRANULARITY 4 clause in the ALTER TABLE statement above), the minimum and maximum URL values:

The first index entry (mark 0 in the figure above) stores the minimum and maximum URL values of the rows belonging to the first 4 granules of the table.

The second index entry (mark 1) stores the minimum and maximum URL values of the rows belonging to the next 4 granules, and so on.

(ClickHouse also creates a special mark file for the data skipping index, which is used to locate the groups of granules associated with the index marks.)

Because UserID and URL have similarly high cardinality, this secondary data skipping index cannot help to exclude granules when our URL-filtering query is executed.

The specific URL value the query is looking for ('http://public_search') is very likely somewhere between the minimum and maximum value that the index stores for each group of granules, so ClickHouse is forced to select that group of granules (because they may contain rows matching the query).
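You can check this yourself by pointing the same EXPLAIN indexes = 1 statement used earlier at the URL-filtering query; its output should also list a section for the url_skipping_index, showing that few (if any) granule groups are dropped (the exact numbers depend on your data load):

EXPLAIN indexes = 1
SELECT UserID, count(UserID) AS Count
FROM hits_UserID_URL
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;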

So if we want to significantly speed up our example query that filters rows with a specific URL, we need to use a primary index optimized for that query.

And if we also want to keep the good performance of our example query that filters rows with a specific UserID, we need to use multiple primary indexes.

Here's how to achieve this.

Tuning with multiple primary key indexes

If we want to significantly speed up both of our example queries (one filtering rows with a specific UserID, one filtering rows with a specific URL), then we need to use multiple primary indexes, via one of these three options:

  • creating a second table with a different primary key,
  • creating a materialized view on the existing table,
  • adding a projection to the existing table.

All three options effectively duplicate the sample data into an additional table in order to reorganize the table's primary index and row sort order.

However, the three options differ in how transparent the additional table is to the user with respect to routing queries and insert statements.

When creating a second table with a different primary key, queries must be explicitly sent to the table version best suited for the query, and new data must be explicitly inserted into both tables to keep them in sync:

With a materialized view, the additional table is hidden and data is automatically kept in sync between both tables:
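A minimal sketch of this option (the view name mv_hits_URL_UserID is just an example); POPULATE backfills the existing 8.87 million rows into the view's hidden table, with the known caveat that rows inserted into the source table while the view is being populated may be missed:

CREATE MATERIALIZED VIEW mv_hits_URL_UserID
ENGINE = MergeTree
PRIMARY KEY (URL, UserID)
ORDER BY (URL, UserID, EventTime)
POPULATE
AS SELECT * FROM hits_UserID_URL;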

The projection is the most transparent option, because in addition to automatically keeping the hidden additional table in sync with data changes, ClickHouse also automatically chooses the most effective table version for a query:
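A minimal sketch of this option (the projection name prj_url_userid is just an example; on the ClickHouse version used in this article, you may also need to enable allow_experimental_projection_optimization for queries to make use of it):

ALTER TABLE hits_UserID_URL
    ADD PROJECTION prj_url_userid
    (
        SELECT * ORDER BY (URL, UserID, EventTime)
    );

ALTER TABLE hits_UserID_URL MATERIALIZE PROJECTION prj_url_userid;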

Below we use real examples to discuss these three methods in detail.

A second table with a different primary key

We create a new additional table where we switch the order of the key columns in the primary key (compared to the original table):

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">CREATE</span> <span style="color:#569cd6">TABLE</span> hits_URL_UserID
</span><span style="color:#9cdcfe"><span style="color:#d4d4d4">(</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>UserID<span style="color:#d4d4d4">`</span> UInt32<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>URL<span style="color:#d4d4d4">`</span> String<span style="color:#d4d4d4">,</span>
</span><span style="color:#9cdcfe">    <span style="color:#d4d4d4">`</span>EventTime<span style="color:#d4d4d4">`</span> <span style="color:#569cd6">DateTime</span>
</span><span style="color:#9cdcfe"><span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe"><span style="color:#569cd6">ENGINE</span> <span style="color:#d4d4d4">=</span> MergeTree
</span><span style="background-color:var(--docusaurus-highlighted-code-line-bg)"><span style="color:#9cdcfe"><span style="color:#569cd6">PRIMARY</span> <span style="color:#569cd6">KEY</span> <span style="color:#d4d4d4">(</span>URL<span style="color:#d4d4d4">,</span> UserID<span style="color:#d4d4d4">)</span>
</span></span><span style="color:#9cdcfe"><span style="color:#569cd6">ORDER</span> <span style="color:#569cd6">BY</span> <span style="color:#d4d4d4">(</span>URL<span style="color:#d4d4d4">,</span> UserID<span style="color:#d4d4d4">,</span> EventTime<span style="color:#d4d4d4">)</span>
</span><span style="color:#9cdcfe">SETTINGS index_granularity <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">8192</span><span style="color:#d4d4d4">,</span> index_granularity_bytes <span style="color:#d4d4d4">=</span> <span style="color:#b5cea8">0</span><span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

Insert all 8.87 million rows from the source table:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">INSERT</span> <span style="color:#569cd6">INTO</span> hits_URL_UserID
</span><span style="color:#9cdcfe"><span style="color:#569cd6">SELECT</span> <span style="color:#d4d4d4">*</span> <span style="color:#569cd6">from</span> hits_UserID_URL<span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

result:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:#dc143c"><code><span style="color:#9cdcfe">Ok.
</span>
<span style="color:#9cdcfe">0 rows in set. Elapsed: 2.898 sec. Processed 8.87 million rows, 838.84 MB (3.06 million rows/s., 289.46 MB/s.)
</span></code></span></span></span>

Finally optimize:

<span style="color:var(--prism-color)"><span style="background-color:var(--ifm-pre-background)"><span style="color:var(--ifm-pre-color)"><code><span style="color:#9cdcfe"><span style="color:#569cd6">OPTIMIZE</span> <span style="color:#569cd6">TABLE</span> hits_URL_UserID FINAL<span style="color:#d4d4d4">;</span>
</span></code></span></span></span>

Because we switched the order of the columns in the primary key, the inserted rows are now stored on disk in a different lexicographical order (compared to our original table), so the 1083 granules of this table also contain different values than before, and the primary index is built over this new order.
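If you want to verify the granule count of the new table yourself, the per-part mark count is exposed in system.parts; a minimal sketch (the reported mark count may include one extra "final" mark depending on settings):

-- One mark per granule (plus possibly a final mark), so this approximates the granule count.
SELECT table, rows, marks
FROM system.parts
WHERE table = 'hits_URL_UserID' AND active;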

Now run the query for the top 10 users who clicked on the URL "http://public_search" most frequently; this time the query is significantly faster:

SELECT UserID, count(UserID) AS Count
FROM hits_URL_UserID
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;

result:

┌─────UserID─┬─Count─┐
│ 2459550954 │  3741 │
│ 1084649151 │  2484 │
│  723361875 │   729 │
│ 3087145896 │   695 │
│ 2754931092 │   672 │
│ 1509037307 │   582 │
│ 3085460200 │   573 │
│ 2454360090 │   556 │
│ 3884990840 │   539 │
│  765730816 │   536 │
└────────────┴───────┘

10 rows in set. Elapsed: 0.017 sec.
Processed 319.49 thousand rows, 11.38 MB (18.41 million rows/s., 655.75 MB/s.)

There is no longer a full table scan, and ClickHouse executes the query much more efficiently.

For the primary index on the original table (where UserID is the first key column and URL is the second key column), ClickHouse uses a generic exclusion search over the index marks to execute this query, which is not very efficient because the cardinalities of UserID and URL are similarly high.
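If you want to reproduce the comparison yourself, the relevant log lines can be streamed straight into clickhouse-client by raising the client-side log level; a minimal sketch (the exact wording of the log lines varies by version):

-- Stream server trace logs to the client session, then run the URL-filtering query
-- against the original table to watch the generic exclusion search at work.
SET send_logs_level = 'trace';

SELECT UserID, count(UserID) AS Count
FROM hits_UserID_URL
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;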

With URL as the first column of the primary index, ClickHouse now runs a binary search over the index marks. The corresponding trace log in the ClickHouse server log file:

...Executor): Key condition: (column 0 in ['http://public_search',
                                           'http://public_search'])
...Executor): Running binary search on index range for part all_1_9_2 (1083 marks)
...Executor): Found (LEFT) boundary mark: 644
...Executor): Found (RIGHT) boundary mark: 683
...Executor): Found continuous range in 19 steps
...Executor): Selected 1/1 parts by partition key, 1 parts by primary key,
              39/1083 marks by primary key, 39 marks to read from 1 ranges
...Executor): Reading approx. 319488 rows with 2 streams

ClickHouse selected only 39 index marks, instead of the 1076 selected by the generic exclusion search.
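If you prefer not to read server logs, a similar picture of granule pruning can be obtained with EXPLAIN; a minimal sketch (it assumes the indexes = 1 option is available in your ClickHouse version):

-- Shows, per part, how many granules survive primary-index pruning for this query.
EXPLAIN indexes = 1
SELECT UserID, count(UserID) AS Count
FROM hits_URL_UserID
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;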

Note that this additional table is optimized to speed up our example query filtering on URLs.

Just as queries filtering on URL performed poorly on the original table, queries filtering on UserID will perform poorly on this additional table, because UserID is now the second key column in its primary index: ClickHouse falls back to the generic exclusion search to select granules, which is not very effective when UserID and URL both have similarly high cardinality.
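For illustration, this is roughly what such a query would look like; it is shown only for comparison, and 749927693 is just an example value that you can replace with any UserID present in the dataset:

-- Filtering on UserID against the table whose primary key starts with URL
-- falls back to the generic exclusion search and reads far more granules.
SELECT URL, count(URL) AS Count
FROM hits_URL_UserID
WHERE UserID = 749927693
GROUP BY URL
ORDER BY Count DESC
LIMIT 10;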


We now have two tables, each optimized to speed up queries filtering on UserID and on URL, respectively.

Creating a materialized view

Create a materialized view on the original table:

CREATE MATERIALIZED VIEW mv_hits_URL_UserID
ENGINE = MergeTree()
PRIMARY KEY (URL, UserID)
ORDER BY (URL, UserID, EventTime)
POPULATE
AS SELECT * FROM hits_UserID_URL;

result:

Ok.

0 rows in set. Elapsed: 2.935 sec. Processed 8.87 million rows, 838.84 MB (3.02 million rows/s., 285.84 MB/s.)

NOTE
  • We switch the order of the key columns in the view's primary key (compared to the original table)
  • A materialized view is backed by a hidden table whose row order and primary index are defined based on the given primary key
  • We use the POPULATE keyword to immediately populate the new materialized view with all 8.87 million rows from the source table hits_UserID_URL
  • If new rows are inserted into the source table hits_UserID_URL, those rows are automatically inserted into the hidden table as well
  • In fact, the implicitly created hidden table has the same row order and primary index as the additional table we explicitly created above.

ClickHouse stores the hidden table's column data files (.bin), mark files (.mrk2), and primary index (primary.idx) in a special folder inside the ClickHouse server's data directory.
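If you want to locate that hidden table yourself, it is usually visible in system.tables under a name starting with ".inner"; a minimal sketch (the exact prefix depends on the database engine and ClickHouse version):

-- Materialized-view backing tables typically carry an ".inner" or ".inner_id" name prefix.
SELECT database, name, engine
FROM system.tables
WHERE name LIKE '.inner%';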

The hidden table (and its primary index) behind the materialized view can now be used to significantly speed up the execution of our query filtering on the URL column:

SELECT UserID, count(UserID) AS Count
FROM mv_hits_URL_UserID
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;

result:

┌─────UserID─┬─Count─┐
│ 2459550954 │  3741 │
│ 1084649151 │  2484 │
│  723361875 │   729 │
│ 3087145896 │   695 │
│ 2754931092 │   672 │
│ 1509037307 │   582 │
│ 3085460200 │   573 │
│ 2454360090 │   556 │
│ 3884990840 │   539 │
│  765730816 │   536 │
└────────────┴───────┘

10 rows in set. Elapsed: 0.026 sec.
Processed 335.87 thousand rows, 13.54 MB (12.91 million rows/s., 520.38 MB/s.)

The hidden table (and its primary index) behind the materialized view is effectively identical to the additional table we explicitly created, so the query is executed in the same way as it is on that table.

The corresponding trace log in the ClickHouse server log file confirms that ClickHouse is running a binary search over the index marks:

...Executor): Key condition: (column 0 in ['http://public_search',
                                           'http://public_search'])
...Executor): Running binary search on index range ...
...
...Executor): Selected 4/4 parts by partition key, 4 parts by primary key,
              41/1083 marks by primary key, 41 marks to read from 4 ranges
...Executor): Reading approx. 335872 rows with 4 streams

Using a projection

Projections are currently an experimental feature, so we need to tell ClickHouse:

SET optimize_use_projections = 1;

Create a projection on the original table:

ALTER TABLE hits_UserID_URL
    ADD PROJECTION prj_url_userid
    (
        SELECT *
        ORDER BY (URL, UserID)
    );

Materialize the projection:

ALTER TABLE hits_UserID_URL
    MATERIALIZE PROJECTION prj_url_userid;

NOTE
  • The projection creates a hidden table whose row order and primary index are based on the projection's ORDER BY clause
  • We use the MATERIALIZE keyword in order to immediately import the hidden table with all 8.87 million rows from the source table hits_UserID_URL
  • If new rows are inserted in the source table hits_UserID_URL, then those rows are also automatically inserted in the hidden table
  • Queries are always (syntactically) against the source table hits_UserID_URL, but the hidden table will be used if its row order and primary index allow more efficient execution of the query
  • In fact, the implicitly created hidden table has the same row order and primary index as our explicitly created auxiliary table:

ClickHouse stores the hidden table's column data files (.bin), mark files (.mrk2), and primary index (primary.idx) in a special folder next to the source table's data files, mark files, and primary index file.
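To inspect the projection's hidden parts without browsing the filesystem, the system.projection_parts table can be queried; a minimal sketch (it assumes your ClickHouse version ships this system table):

-- Lists the projection's parts together with the source-table part each one belongs to.
SELECT name, parent_name, rows, marks
FROM system.projection_parts
WHERE table = 'hits_UserID_URL' AND active;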

The hidden table (and its primary index) created by the projection can now be used implicitly to significantly speed up our query filtering on the URL column. Note that the query is still written syntactically against the projection's source table.

SELECT UserID, count(UserID) AS Count
FROM hits_UserID_URL
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;

result:

┌─────UserID─┬─Count─┐
│ 2459550954 │  3741 │
│ 1084649151 │  2484 │
│  723361875 │   729 │
│ 3087145896 │   695 │
│ 2754931092 │   672 │
│ 1509037307 │   582 │
│ 3085460200 │   573 │
│ 2454360090 │   556 │
│ 3884990840 │   539 │
│  765730816 │   536 │
└────────────┴───────┘

10 rows in set. Elapsed: 0.029 sec.
Processed 319.49 thousand rows, 11.38 MB (11.05 million rows/s., 393.58 MB/s.)

Because the hidden table (and its primary index) created by the projection is effectively the same as the additional table we created explicitly, the query executes in the same way as it does on that table.
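One way to double-check which projection was actually used, without reading the server log, is the system.query_log table; a minimal sketch (it assumes the projections column exists in your ClickHouse version, and the query log is flushed with a small delay):

-- The projections column lists the projections ClickHouse actually used for the query.
SELECT query, projections
FROM system.query_log
WHERE type = 'QueryFinish' AND query LIKE '%public_search%'
ORDER BY event_time DESC
LIMIT 1;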

A trace log in the ClickHouse server log file confirms that ClickHouse is running a binary search over the index marks:

...Executor): Key condition: (column 0 in ['http://public_search',
                                           'http://public_search'])
...Executor): Running binary search on index range for part prj_url_userid (1083 marks)
...Executor): ...
...Executor): Choose complete Normal projection prj_url_userid
...Executor): projection required columns: URL, UserID
...Executor): Selected 1/1 parts by partition key, 1 parts by primary key,
              39/1083 marks by primary key, 39 marks to read from 1 ranges
...Executor): Reading approx. 319488 rows with 2 streams

Removing ineffective key columns

A primary index on a table with a compound primary key (UserID, URL) is useful for speeding up queries that filter on UserID. However, although the URL column is part of that compound primary key, the index does not provide significant help for queries that filter on URL.

The reverse is also true: a primary index on a table with a compound primary key (URL, UserID) speeds up queries filtering on URL, but provides little support for queries filtering on UserID.

Because the primary key columns UserID and URL both have similarly high cardinality, a query that filters on the second key column benefits little from that column being in the index.

Therefore, it makes sense to remove the second key column from the primary index (thereby reducing the index's memory consumption) and to use multiple primary indexes instead.

However, if the key columns in a compound primary key differ significantly in cardinality, it is beneficial for queries to order the primary key columns by ascending cardinality.

The greater the cardinality difference between the key columns, the more the order of those columns matters.
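To make this ordering rule concrete, here is a minimal, hypothetical sketch: the IsRobot column is not part of the dataset used above and merely stands in for any low-cardinality flag. Placing it before the high-cardinality UserID lets the generic exclusion search prune granules effectively even for queries that only filter on UserID.

-- Hypothetical table for illustration only; IsRobot is assumed to have very few distinct values.
CREATE TABLE hits_IsRobot_UserID
(
    `UserID` UInt32,
    `URL` String,
    `IsRobot` UInt8
)
ENGINE = MergeTree
ORDER BY (IsRobot, UserID);   -- low-cardinality column first, high-cardinality column second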
