This article teaches you how to use ClickHouse dictionaries

Number of words in this article: 15160; estimated reading time: 38 minutes

Reviewer: Zhuang Xiaodong (Weizhuang)

Introduction

In this article, we take the opportunity to remind users of the powerful role dictionaries play in accelerating queries - especially queries containing a JOIN - and share some usage tips. All of the examples in this article can be reproduced in our play.clickhouse.com environment (see the blogs database).

Introducing the data

Our original table structure looks like this; it records more than 100 years of weather measurements:

CREATE TABLE noaa
(
   `station_id` LowCardinality(String),
   `date` Date32,
   `tempAvg` Int32 COMMENT 'Average temperature (tenths of a degree C)',
   `tempMax` Int32 COMMENT 'Maximum temperature (tenths of degrees C)',
   `tempMin` Int32 COMMENT 'Minimum temperature (tenths of degrees C)',
   `precipitation` UInt32 COMMENT 'Precipitation (tenths of mm)',
   `snowfall` UInt32 COMMENT 'Snowfall (mm)',
   `snowDepth` UInt32 COMMENT 'Snow depth (mm)',
   `percentDailySun` UInt8 COMMENT 'Daily percent of possible sunshine (percent)',
   `averageWindSpeed` UInt32 COMMENT 'Average daily wind speed (tenths of meters per second)',
   `maxWindSpeed` UInt32 COMMENT 'Peak gust wind speed (tenths of meters per second)',
   `weatherType` Enum8('Normal' = 0, 'Fog' = 1, 'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 'Freezing Rain' = 17, 'Snow' = 18, 'Unknown Precipitation' = 19, 'Ground Fog' = 21, 'Freezing Fog' = 22),
   `location` Point,
   `elevation` Float32,
   `name` LowCardinality(String)
) ENGINE = MergeTree() ORDER BY (station_id, date)

Each row represents the measurements of a single weather station at a point in time - each row has a station_id. Taking advantage of the fact that the first two characters of station_id represent the country code, we can find the top 5 temperatures for a country by knowing its prefix and using the substring function. For example, Portugal:

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    location,
    name
FROM noaa
WHERE substring(station_id, 1, 2) = 'PO'
ORDER BY tempMax DESC
LIMIT 5

┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┬─name───────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2)    │ COIMBRA        │
│    45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │ BEJA           │
│    45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │ BEJA           │
│    44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533)     │ EVORA/C. COORD │
│    44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533)     │ EVORA/C. COORD │
└─────────┴─────────────┴────────────┴───────────────────┴────────────────┘

5 rows in set. Elapsed: 0.259 sec. Processed 1.08 billion rows, 7.46 GB (4.15 billion rows/s., 28.78 GB/s.)

Unfortunately, this query requires a full table scan because it cannot take advantage of our primary key (station_id, date) .
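
If you want to verify this yourself, EXPLAIN with index analysis enabled shows whether the primary key narrows the set of granules read. Below is a minimal sketch (output omitted, as it varies by ClickHouse version); for this predicate you should see that the primary key does not prune anything:

EXPLAIN indexes = 1
SELECT count()
FROM noaa
WHERE substring(station_id, 1, 2) = 'PO'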

Improving the data model

Members of our community quickly came up with a simple optimization to improve the response time of the above query by reducing the amount of data read from disk. This is achieved by moving the station information into a dedicated table and rewriting the country filter as a subquery.

First, let's recap those recommendations so readers can follow along. Below we create a stations table and populate it directly using the url function.

CREATE TABLE stations
(
    `station_id` LowCardinality(String),
    `country_code` LowCardinality(String),
    `state` LowCardinality(String),
    `name` LowCardinality(String),
    `lat` Float64,
    `lon` Float64,
    `elevation` Float32
)
ENGINE = MergeTree
ORDER BY (country_code, station_id)

INSERT INTO stations
SELECT
    station_id,
    substring(station_id, 1, 2) AS country_code,
    trimBoth(state) AS state,
    name,
    lat,
    lon,
    elevation
FROM url('https://noaa-ghcn-pds.s3.amazonaws.com/ghcnd-stations.txt', Regexp, 'station_id String, lat Float64, lon Float64, elevation Float32, state String, name String')
SETTINGS format_regexp = '^(.{11})\\s+(\\-?\\d{1,2}\\.\\d{4})\\s+(\\-?\\d{1,3}\\.\\d{1,4})\\s+(\\-?\\d*\\.\\d*)\\s+(.{2})\\s(.*?)\\s{2,}.*$'

0 rows in set. Elapsed: 1.781 sec. Processed 123.18 thousand rows, 7.99 MB (69.17 thousand rows/s., 4.48 MB/s.)

For now, assume that our noaa table no longer has the location, elevation, and name fields. The query for the top 5 temperatures in Portugal can then almost be solved with a subquery:

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    location,
    name
FROM noaa
WHERE station_id IN (
    SELECT station_id
    FROM stations
    WHERE country_code = 'PO'
)
ORDER BY tempMax DESC
LIMIT 5

┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┬─name───────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2)    │ COIMBRA        │
│    45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │ BEJA           │
│    45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │ BEJA           │
│    44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533)     │ EVORA/C. COORD │
│    44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533)     │ EVORA/C. COORD │
└─────────┴─────────────┴────────────┴───────────────────┴────────────────┘

5 rows in set. Elapsed: 0.009 sec. Processed 522.48 thousand rows, 6.64 MB (59.81 million rows/s., 760.45 MB/s.)

Because the subquery uses the country_code primary key of the stations table, the query is faster. Additionally, the parent query can use its own primary key. Only a small range of these columns needs to be read, resulting in less data being read from disk and no join cost at all. As members of our community pointed out, keeping the data denormalized is beneficial in this case.

But there is a catch - we rely on location and name being denormalized onto the weather data. Suppose we don't do this and, to avoid duplication and follow normalization principles, keep these fields only on the stations table: we would then need a full join (in practice, we might well keep location and name denormalized and accept the storage cost):

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    stations.name AS name,
    (stations.lat, stations.lon) AS location
FROM noaa
INNER JOIN stations ON noaa.station_id = stations.station_id
WHERE stations.country_code = 'PO'
ORDER BY tempMax DESC
LIMIT 5

┌─maxTemp─┬─station_id──┬───────date─┬─name───────────┬─location──────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ COIMBRA        │ (40.2,-8.4167)    │
│    45.4 │ PO000008562 │ 2003-08-01 │ BEJA           │ (38.0167,-7.8667) │
│    45.2 │ PO000008562 │ 1995-07-23 │ BEJA           │ (38.0167,-7.8667) │
│    44.5 │ POM00008558 │ 2003-08-01 │ EVORA/C. COORD │ (38.533,-7.9)     │
│    44.2 │ POM00008558 │ 2022-07-13 │ EVORA/C. COORD │ (38.533,-7.9)     │
└─────────┴─────────────┴────────────┴────────────────┴───────────────────┘

5 rows in set. Elapsed: 0.488 sec. Processed 1.08 billion rows, 14.06 GB (2.21 billion rows/s., 28.82 GB/s.)

Unfortunately, this requires a full table scan and is therefore slower than our earlier approach with denormalized data. The reason is the following:

When running a JOIN, there is no optimization of the order of execution relative to other stages of the query. The join (a lookup in the right table) runs before the filtering in WHERE and before aggregation.

We also suggested dictionaries as a possible solution. Let us now show how dictionaries can be used to improve query performance when the data is kept in normalized form.

Creating a dictionary

A dictionary provides us with an in-memory, key-value representation of our data, optimized for lookup queries. We can take advantage of this structure to improve query performance; JOIN queries in particular benefit when one side of the JOIN can be served from an in-memory lookup table.

Selecting a source and key

Dictionaries can currently be populated from two sources: local ClickHouse tables and HTTP URLs*. The contents of a dictionary can be configured to reload periodically to reflect changes in the source data.

* We anticipate extending this feature in the future to include other sources supported in OSS.

Next, we create our dictionary using the  stations  table as the source.

CREATE DICTIONARY stations_dict
(
 `station_id` String,
 `state` String,
 `country_code` String,
 `name` String,
 `lat` Float64,
 `lon` Float64,
 `elevation` Float32
)
PRIMARY KEY station_id
SOURCE(CLICKHOUSE(TABLE 'stations'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(complex_key_hashed_array())

Here the PRIMARY KEY is station_id, which intuitively is the column on which lookups will be performed. Values must be unique, i.e. rows with the same primary key will be deduplicated. The other columns represent attributes. You may have noticed that we split location into lat and lon, because dictionary attributes currently do not support the Point type. LAYOUT and LIFETIME are less obvious and require some explanation.

Selecting a layout

The layout of a dictionary controls how it is stored in memory and the indexing strategy for primary keys. Each layout option has different pros and cons.

The flat layout allocates an array sized to the maximum key value; for example, if the maximum value is 100k, the array will also have 100k entries. This is ideal when the source data has a monotonically increasing primary key. In that case it uses very little memory and provides 4-5x faster access than hash-based alternatives, requiring only a simple array offset lookup. However, it is limited in that the key value cannot exceed 500k - although this can be raised via the max_array_size setting. It is inherently less efficient for large, sparse key distributions, wasting memory in that case.
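
For illustration, a flat-layout dictionary might look like the sketch below. The products table and its columns are hypothetical and not part of this article's dataset; the point is simply that the key must be a dense UInt64 and that max_array_size can be raised if required:

CREATE DICTIONARY product_names_dict
(
    `id` UInt64,
    `name` String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'products'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(FLAT(MAX_ARRAY_SIZE 1000000))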

For situations with a very large number of entries, large key values and/or a sparse distribution of values, the flat layout is not ideal. In those cases, we generally recommend hash-based dictionaries - specifically the hashed_array layout, which can efficiently support millions of entries. This layout is more memory efficient than the hashed layout and is almost as fast. With this type, a single hash table stores the primary keys, and its values provide offset positions into per-attribute arrays. This is in contrast to the hashed layout which, although slightly faster, requires a hash table to be allocated for each attribute and therefore consumes more memory. In most cases we therefore recommend the hashed_array layout - although users should also experiment with hashed.

All these types also require that the key can be converted to UInt64. If not, for example when the keys are strings, we can use the complex-key variants of the hashed dictionaries: complex_key_hashed and complex_key_hashed_array, which otherwise follow the same rules as above.

The flowchart below summarizes this logic to help you choose the right layout (most of the time):

(Flowchart: choosing a dictionary layout)

For our data, the primary key (station_id) is of type String, so we choose the complex_key_hashed_array type, since our dictionaries have at least three attributes in each case.

Note: there are also sparse variants of the hashed and complex_key_hashed layouts. These layouts aim to achieve O(1) time operations by splitting the primary keys into groups and incrementing ranges within them. We rarely recommend this layout; it is only effective if you have a single attribute. Although operations are constant time, the actual constant is usually higher than for the non-sparse variants. Finally, ClickHouse provides specialized layouts such as polygon and ip_trie.

Selecting a lifetime

The dictionary DDL above also specifies a LIFETIME for the dictionary. This determines how often the dictionary rereads the source to refresh its data. It can be specified as a number of seconds or a range, for example LIFETIME(300) or LIFETIME(MIN 300 MAX 360). In the latter case, a random time uniformly distributed within the range is chosen, which spreads the load on the dictionary source over time when multiple servers update. The value used in our example, LIFETIME(MIN 0 MAX 0), means the dictionary contents are never updated - appropriate in our case because our data is static.

If your data does change and you need to reload it periodically, this behavior can be controlled through an invalidate_query parameter that returns a single row. If the value of this row changes between update cycles, ClickHouse knows the data must be re-fetched; it could, for example, return a timestamp or a row count. Further options exist to ensure that only data that has changed since the last update is loaded - see our documentation for examples of using update_field.
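
As a sketch of what a refreshing dictionary might look like - our stations data is static and has no change-tracking column, so the modified_at column below is purely hypothetical:

CREATE DICTIONARY stations_refreshing_dict
(
    `station_id` String,
    `name` String
)
PRIMARY KEY station_id
SOURCE(CLICKHOUSE(TABLE 'stations' INVALIDATE_QUERY 'SELECT max(modified_at) FROM stations'))
LIFETIME(MIN 300 MAX 360)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())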

Using the dictionary

Although our dictionary has been created, it requires a query to load the data into memory. The easiest ways to do this are to issue a simple dictGet query to retrieve a single value (loading the dataset into the dictionary as a side effect) or to issue an explicit SYSTEM RELOAD DICTIONARY command.

SYSTEM RELOAD DICTIONARY stations_dict

0 rows in set. Elapsed: 0.561 sec.

SELECT dictGet(stations_dict, 'state', 'CA00116HFF6')

┌─dictGet(stations_dict, 'state', 'CA00116HFF6')─┐
│ BC                                             │
└────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.001 sec.

The dictGet example above retrieves the state value for the station_id CA00116HFF6.

Going back to our original JOIN query, we can restore our subquery and use the dictionary only for the location and name fields.

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    (dictGet(stations_dict, 'lat', station_id), dictGet(stations_dict, 'lon', station_id)) AS location,
    dictGet(stations_dict, 'name', station_id) AS name
FROM noaa
WHERE station_id IN (
    SELECT station_id
    FROM stations
    WHERE country_code = 'PO'
)
ORDER BY tempMax DESC
LIMIT 5

┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┬─name───────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ (40.2,-8.4167)    │ COIMBRA        │
│    45.4 │ PO000008562 │ 2003-08-01 │ (38.0167,-7.8667) │ BEJA           │
│    45.2 │ PO000008562 │ 1995-07-23 │ (38.0167,-7.8667) │ BEJA           │
│    44.5 │ POM00008558 │ 2003-08-01 │ (38.533,-7.9)     │ EVORA/C. COORD │
│    44.2 │ POM00008558 │ 2022-07-13 │ (38.533,-7.9)     │ EVORA/C. COORD │
└─────────┴─────────────┴────────────┴───────────────────┴────────────────┘

5 rows in set. Elapsed: 0.012 sec. Processed 522.48 thousand rows, 6.64 MB (44.90 million rows/s., 570.83 MB/s.)

Now this is much better! The key is that the subquery benefits from the country_code primary key of the stations table. The parent query then limits the noaa table read to only the returned station ids, again using its primary key to minimize the data read. Finally, dictGet is only applied to the final 5 rows to retrieve name and location. We visualize the process below:

(Diagram: query execution using the stations_dict dictionary)

Experienced dictionary users may try other methods. For example, we can:

  • Remove the subquery and filter with dictGet(stations_dict, 'country_code', station_id) = 'PO'. This is no faster (about 0.5 seconds) because a dictionary lookup is required for every row; a sketch of this form appears after this list. We see a similar example below.

  • Take advantage of the fact that dictionaries can be used in JOINs just like tables (see below). This suffers from the same issue as the approach above.
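
For reference, here is a sketch of the first alternative above - filtering directly with a dictionary lookup instead of the subquery. It is shown only to illustrate the shape of the query; as noted, it is not faster:

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    (dictGet(stations_dict, 'lat', station_id), dictGet(stations_dict, 'lon', station_id)) AS location,
    dictGet(stations_dict, 'name', station_id) AS name
FROM noaa
WHERE dictGet(stations_dict, 'country_code', station_id) = 'PO'
ORDER BY tempMax DESC
LIMIT 5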

Making things more complex

Consider the following query:

Using a list of U.S. ski resorts and their corresponding locations, we join them with the top 1,000 weather stations that recorded the most snow in any month over the last 5 years. Sorting by `geoDistance` and restricting the results to those where the distance is less than 20km, we select the top result per resort and sort by total snow. Note that we also restrict results to elevations above 1800m, as a broad indicator of good skiing conditions.

SELECT
    resort_name,
    total_snow / 1000 AS total_snow_m,
    resort_location,
    month_year
FROM
(
    WITH resorts AS
        (
            SELECT
                resort_name,
                state,
                (lon, lat) AS resort_location,
                'US' AS code
            FROM url('https://gist.githubusercontent.com/Ewiseman/b251e5eaf70ca52a4b9b10dce9e635a4/raw/9f0100fe14169a058c451380edca3bda24d3f673/ski_resort_stats.csv', CSVWithNames)
        )
    SELECT
        resort_name,
        highest_snow.station_id,
        geoDistance(resort_location.1, resort_location.2, station_location.1, station_location.2) / 1000 AS distance_km,
        highest_snow.total_snow,
        resort_location,
        station_location,
        month_year
    FROM
    (
        SELECT
            sum(snowfall) AS total_snow,
            station_id,
            any(location) AS station_location,
            month_year,
            substring(station_id, 1, 2) AS code
        FROM noaa
        WHERE (date > '2017-01-01') AND (code = 'US') AND (elevation > 1800)
        GROUP BY
            station_id,
            toYYYYMM(date) AS month_year
        ORDER BY total_snow DESC
        LIMIT 1000
    ) AS highest_snow
    INNER JOIN resorts ON highest_snow.code = resorts.code
    WHERE distance_km < 20
    ORDER BY
        resort_name ASC,
        total_snow DESC
    LIMIT 1 BY
        resort_name,
        station_id
)
ORDER BY total_snow DESC
LIMIT 5

Before optimizing with dictionaries, we first replace the CTE containing the resorts with an actual table. This ensures the data is local to the ClickHouse cluster and avoids HTTP latency when fetching the resorts.

CREATE TABLE resorts
(
   `resort_name` LowCardinality(String),
   `state` LowCardinality(String),
   `lat` Nullable(Float64),
   `lon` Nullable(Float64),
   `code` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY state

When we populate this table, we also take the opportunity to align the state field with the stations table (we will use this later). Resorts use full state names, while stations use state codes. To make them consistent, we can map state names to codes when inserting into the resorts table. This gives us another opportunity to create a dictionary, this time from an HTTP source.

CREATE DICTIONARY states
(
    `name` String,
    `code` String
)
PRIMARY KEY name
SOURCE(HTTP(URL 'https://gist.githubusercontent.com/gingerwizard/b0e7c190474c847fdf038e821692ce9c/raw/19fdac5a37e66f78d292bd8c0ee364ca7e6f9a57/states.csv' FORMAT 'CSVWithNames'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())

SELECT *
FROM states
LIMIT 2

┌─name─────────┬─code─┐
│ Pennsylvania │ PA   │
│ North Dakota │ ND   │
└──────────────┴──────┘

2 rows in set. Elapsed: 0.001 sec.

At insert time, we can use the dictGet function to map each resort's state name to its state code.

INSERT INTO resorts SELECT
    resort_name,
    dictGet(states, 'code', state) AS state,
    lat, 
    lon,
    'US' AS code
FROM url('https://gist.githubusercontent.com/Ewiseman/b251e5eaf70ca52a4b9b10dce9e635a4/raw/9f0100fe14169a058c451380edca3bda24d3f673/ski_resort_stats.csv', CSVWithNames)

0 rows in set. Elapsed: 0.389 sec.

Now, our original query is obviously simpler.

SELECT
    resort_name,
    total_snow / 1000 AS total_snow_m,
    resort_location,
    month_year
FROM
(
    SELECT
        resort_name,
        highest_snow.station_id,
        geoDistance(lon, lat, station_location.1, station_location.2) / 1000 AS distance_km,
        highest_snow.total_snow,
        station_location,
        month_year,
        (lon, lat) AS resort_location
    FROM
    (
        SELECT
            sum(snowfall) AS total_snow,
            station_id,
            any(location) AS station_location,
            month_year,
            substring(station_id, 1, 2) AS code
        FROM noaa
        WHERE (date > '2017-01-01') AND (code = 'US') AND (elevation > 1800)
        GROUP BY
            station_id,
            toYYYYMM(date) AS month_year
        ORDER BY total_snow DESC
        LIMIT 1000
    ) AS highest_snow
    INNER JOIN resorts ON highest_snow.code = resorts.code
    WHERE distance_km < 20
    ORDER BY
        resort_name ASC,
        total_snow DESC
    LIMIT 1 BY
        resort_name,
        station_id
)
ORDER BY total_snow DESC
LIMIT 5

┌─resort_name──────────┬─total_snow_m─┬─resort_location─┬─month_year─┐
│ Sugar Bowl, CA       │        7.799 │ (-120.3,39.27)  │     201902 │
│ Donner Ski Ranch, CA │        7.799 │ (-120.34,39.31) │     201902 │
│ Boreal, CA           │        7.799 │ (-120.35,39.33) │     201902 │
│ Homewood, CA         │        4.926 │ (-120.17,39.08) │     201902 │
│ Alpine Meadows, CA   │        4.926 │ (-120.22,39.17) │     201902 │
└──────────────────────┴──────────────┴─────────────────┴────────────┘

5 rows in set. Elapsed: 0.673 sec. Processed 580.53 million rows, 4.85 GB (862.48 million rows/s., 7.21 GB/s.)

Note the execution time - can we improve it further? This query still assumes that location is denormalized onto our noaa weather measurements table. We can now read this field from the stations_dict dictionary instead. This also makes it easy to fetch a station's state and use it to join with the resorts table instead of joining on code. This join is smaller and faster: instead of joining every station with every US resort, we restrict it to resorts in the same state.

Our resorts table is actually quite small (364 entries). Although moving it to a dictionary may not bring any real performance benefit for this query, given its size, it is still a reasonable way to store the data. We choose resort_name as the primary key because, as noted earlier, it must be unique.

CREATE DICTIONARY resorts_dict
(
    `state` String,
    `resort_name` String,
    `lat` Nullable(Float64),
    `lon` Nullable(Float64)
)
PRIMARY KEY resort_name
SOURCE(CLICKHOUSE(TABLE 'resorts'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())

Now we modify the query, using stations_dict wherever possible and joining against resorts_dict. Note that we still join on the state column even though it is not the primary key of the resorts_dict dictionary. In this case we use the JOIN syntax, and the dictionary is scanned like a table.

SELECT
    resort_name,
    total_snow / 1000 AS total_snow_m,
    resort_location,
    month_year
FROM
(
    SELECT
        resort_name,
        highest_snow.station_id,
        geoDistance(resorts_dict.lon, resorts_dict.lat, station_lon, station_lat) / 1000 AS distance_km,
        highest_snow.total_snow,
        (resorts_dict.lon, resorts_dict.lat) AS resort_location,
        month_year
    FROM
    (
        SELECT
            sum(snowfall) AS total_snow,
            station_id,
            dictGet(stations_dict, 'lat', station_id) AS station_lat,
            dictGet(stations_dict, 'lon', station_id) AS station_lon,
            month_year,
            dictGet(stations_dict, 'state', station_id) AS state
        FROM noaa
        WHERE (date > '2017-01-01') AND (state != '') AND (elevation > 1800)
        GROUP BY
            station_id,
            toYYYYMM(date) AS month_year
        ORDER BY total_snow DESC
        LIMIT 1000
    ) AS highest_snow
    INNER JOIN resorts_dict ON highest_snow.state = resorts_dict.state
    WHERE distance_km < 20
    ORDER BY
        resort_name ASC,
        total_snow DESC
    LIMIT 1 BY
        resort_name,
        station_id
)
ORDER BY total_snow DESC
LIMIT 5

┌─resort_name──────────┬─total_snow_m─┬─resort_location─┬─month_year─┐
│ Sugar Bowl, CA       │        7.799 │ (-120.3,39.27)  │     201902 │
│ Donner Ski Ranch, CA │        7.799 │ (-120.34,39.31) │     201902 │
│ Boreal, CA           │        7.799 │ (-120.35,39.33) │     201902 │
│ Homewood, CA         │        4.926 │ (-120.17,39.08) │     201902 │
│ Alpine Meadows, CA   │        4.926 │ (-120.22,39.17) │     201902 │
└──────────────────────┴──────────────┴─────────────────┴────────────┘

5 rows in set. Elapsed: 0.170 sec. Processed 580.73 million rows, 2.87 GB (3.41 billion rows/s., 16.81 GB/s.)

Great, more than a 2x speedup! Now, the attentive reader may have noticed that we skipped a possible optimization. Couldn't we also replace the elevation filter elevation > 1800 with a dictionary lookup, i.e. dictGet(blogs.stations_dict, 'elevation', station_id) > 1800, and avoid reading that column from the table? In practice this is slower, because a dictionary lookup is performed for every row, which is slower than filtering on the denormalized elevation data - the latter benefits from the clause being moved to PREWHERE. Here we benefit from keeping elevation denormalized. This is similar to the reason we did not use dictGet for the country_code filter earlier.
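
For completeness, here is a sketch of the inner subquery with the elevation filter moved into a dictionary lookup - the form we advise against here. Timings will vary, but expect it to be slower for the reasons above:

SELECT
    sum(snowfall) AS total_snow,
    station_id,
    dictGet(stations_dict, 'state', station_id) AS state,
    month_year
FROM noaa
WHERE (date > '2017-01-01') AND (state != '') AND (dictGet(stations_dict, 'elevation', station_id) > 1800)
GROUP BY
    station_id,
    toYYYYMM(date) AS month_year
ORDER BY total_snow DESC
LIMIT 1000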

So the advice here is: test it out! If dictGet needs to be applied to a large portion of the rows in a table, for example in a filter condition, you are probably better off using ClickHouse's native data structures and indexes directly.

Final recommendations

  • The dictionary layouts we have described are held entirely in memory. Be mindful of their usage and test any layout changes. You can track their memory overhead using the system.dictionaries table and its bytes_allocated column. This table also includes a last_exception column that is useful for diagnosing issues.

SELECT
    *,
    formatReadableSize(bytes_allocated) AS size
FROM system.dictionaries
LIMIT 1
FORMAT Vertical

Row 1:
──────
database:                           blogs
name:                               resorts_dict
uuid:                               0f387514-85ed-4c25-bebb-d85ade1e149f
status:                             LOADED
origin:                             0f387514-85ed-4c25-bebb-d85ade1e149f
type:                               ComplexHashedArray
key.names:                          ['resort_name']
key.types:                          ['String']
attribute.names:                    ['state','lat','lon']
attribute.types:                    ['String','Nullable(Float64)','Nullable(Float64)']
bytes_allocated:                    30052
hierarchical_index_bytes_allocated: 0
query_count:                        1820
hit_rate:                           1
found_rate:                         1
element_count:                      364
load_factor:                        0.7338709677419355
source:                             ClickHouse: blogs.resorts
lifetime_min:                       0
lifetime_max:                       0
loading_start_time:                 2022-11-22 16:26:06
last_successful_update_time:        2022-11-22 16:26:06
loading_duration:                   0.001
last_exception:
comment:
size:                               29.35 KiB

  • While dictGet is most likely the dictionary function you'll use most, other variants exist, of which dictGetOrDefault and dictHas are particularly useful; a short sketch appears after this list. Also be aware of type-specific functions such as dictGetFloat64.

  • The flat dictionary layout has a size limit of 500k entries. While this limit can be raised, treat reaching it as a hint to consider a hashed layout instead.
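
A quick sketch of those variants, using the stations_dict created earlier. The key 'XX000000000' is made up and deliberately absent from the source, so dictHas returns 0 and dictGetOrDefault falls back to the supplied default:

SELECT
    dictHas(stations_dict, 'XX000000000') AS station_exists,
    dictGetOrDefault(stations_dict, 'name', 'XX000000000', 'unknown') AS name_or_default,
    dictGetFloat64(stations_dict, 'lat', 'CA00116HFF6') AS lat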

Conclusion

In this article, we have shown how keeping data in normalized form can sometimes lead to faster queries, especially when dictionaries are used. We provided both simple and more complex examples of where dictionaries add value, and distilled some useful recommendations.

Contact us

Mobile number: 13910395701

Email: [email protected]

Meeting all your needs for online analytical columnar database management

Origin: blog.csdn.net/ClickHouseDB/article/details/132699735