Reviewer: Zhuang Xiaodong (Weizhuang)
Introduction
In this article, we take the opportunity to highlight the powerful role dictionaries can play in accelerating queries, especially queries containing JOINs, along with some usage tips. All examples in this article can be reproduced in our play.clickhouse.com environment (see the blogs database).
The dataset
Our original table schema looks like this; it records more than 100 years of weather measurements:
CREATE TABLE noaa
(
`station_id` LowCardinality(String),
`date` Date32,
`tempAvg` Int32 COMMENT 'Average temperature (tenths of a degree C)',
`tempMax` Int32 COMMENT 'Maximum temperature (tenths of degrees C)',
`tempMin` Int32 COMMENT 'Minimum temperature (tenths of degrees C)',
`precipitation` UInt32 COMMENT 'Precipitation (tenths of mm)',
`snowfall` UInt32 COMMENT 'Snowfall (mm)',
`snowDepth` UInt32 COMMENT 'Snow depth (mm)',
`percentDailySun` UInt8 COMMENT 'Daily percent of possible sunshine (percent)',
`averageWindSpeed` UInt32 COMMENT 'Average daily wind speed (tenths of meters per second)',
`maxWindSpeed` UInt32 COMMENT 'Peak gust wind speed (tenths of meters per second)',
`weatherType` Enum8('Normal' = 0, 'Fog' = 1, 'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 'Freezing Rain' = 17, 'Snow' = 18, 'Unknown Precipitation' = 19, 'Ground Fog' = 21, 'Freezing Fog' = 22),
`location` Point,
`elevation` Float32,
`name` LowCardinality(String)
) ENGINE = MergeTree() ORDER BY (station_id, date)
Each row represents the measurements taken by a weather station at a point in time, and each row has a station_id. Since the first two characters of station_id encode the country code, we can find the top 5 temperatures for a country using this prefix and the substring function. For example, Portugal:
SELECT
tempMax / 10 AS maxTemp,
station_id,
date,
location,
name
FROM noaa
WHERE substring(station_id, 1, 2) = 'PO'
ORDER BY tempMax DESC
LIMIT 5
┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┬─name───────────┐
│ 45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2) │ COIMBRA │
│ 45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │ BEJA │
│ 45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │ BEJA │
│ 44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533) │ EVORA/C. COORD │
│ 44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533) │ EVORA/C. COORD │
└─────────┴─────────────┴────────────┴───────────────────┴────────────────┘
5 rows in set. Elapsed: 0.259 sec. Processed 1.08 billion rows, 7.46 GB (4.15 billion rows/s., 28.78 GB/s.)
Unfortunately, this query requires a full table scan because it cannot take advantage of our primary key (station_id, date) .
Improving the data model
Members of our community quickly suggested a simple optimization to improve the response time of the above query by reducing the amount of data read from disk: normalize the data by moving the station details into a separate table, then rewrite the country filter as a subquery on that table.
First, let's review these recommendations so readers can follow along. Below we create a stations table and populate it directly using the url table function.
CREATE TABLE stations
(
`station_id` LowCardinality(String),
`country_code` LowCardinality(String),
`state` LowCardinality(String),
`name` LowCardinality(String),
`lat` Float64,
`lon` Float64,
`elevation` Float32
)
ENGINE = MergeTree
ORDER BY (country_code, station_id)
INSERT INTO stations
SELECT
station_id,
substring(station_id, 1, 2) AS country_code,
trimBoth(state) AS state,
name,
lat,
lon,
elevation
FROM url('https://noaa-ghcn-pds.s3.amazonaws.com/ghcnd-stations.txt', Regexp, 'station_id String, lat Float64, lon Float64, elevation Float32, state String, name String')
SETTINGS format_regexp = '^(.{11})\\s+(\\-?\\d{1,2}\\.\\d{4})\\s+(\\-?\\d{1,3}\\.\\d{1,4})\\s+(\\-?\\d*\\.\\d*)\\s+(.{2})\\s(.*?)\\s{2,}.*$'
0 rows in set. Elapsed: 1.781 sec. Processed 123.18 thousand rows, 7.99 MB (69.17 thousand rows/s., 4.48 MB/s.)
Now assume, for illustration, that our noaa table no longer has the location, elevation, and name fields. The top-5 temperature query for Portugal can then be "almost" solved with a subquery:
SELECT
tempMax / 10 AS maxTemp,
station_id,
date,
location,
name
FROM noaa
WHERE station_id IN (
SELECT station_id
FROM stations
WHERE country_code = 'PO'
)
ORDER BY tempMax DESC
LIMIT 5
┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┬─name───────────┐
│ 45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2) │ COIMBRA │
│ 45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │ BEJA │
│ 45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │ BEJA │
│ 44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533) │ EVORA/C. COORD │
│ 44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533) │ EVORA/C. COORD │
└─────────┴─────────────┴────────────┴───────────────────┴────────────────┘
5 rows in set. Elapsed: 0.009 sec. Processed 522.48 thousand rows, 6.64 MB (59.81 million rows/s., 760.45 MB/s.)
Because the subquery exploits the stations table's country_code primary key, it is fast. In addition, the parent query can exploit its own primary key: only small ranges of these columns need to be read, resulting in less data read from disk and no join cost at all. As members of our community pointed out, keeping the data denormalized is beneficial in this case.
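If you want to confirm that a query can exploit the primary key, EXPLAIN with index analysis enabled is useful. Below is a sketch; the exact output shape depends on your ClickHouse version and data:

EXPLAIN indexes = 1
SELECT station_id
FROM stations
WHERE country_code = 'PO'

The output should show a PrimaryKey condition on country_code, along with how many parts and granules it prunes.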
But there's a catch: this relies on location and name being denormalized onto the weather data. Suppose we didn't do that, and instead kept them only on the stations table, to avoid duplication and follow normalization principles. Then we would need a full join (in reality, we would probably keep location and name denormalized and accept the storage cost):
SELECT
tempMax / 10 AS maxTemp,
station_id,
date,
stations.name AS name,
(stations.lat, stations.lon) AS location
FROM noaa
INNER JOIN stations ON noaa.station_id = stations.station_id
WHERE stations.country_code = 'PO'
ORDER BY tempMax DESC
LIMIT 5
┌─maxTemp─┬─station_id──┬───────date─┬─name───────────┬─location──────────┐
│ 45.8 │ PO000008549 │ 1944-07-30 │ COIMBRA │ (40.2,-8.4167) │
│ 45.4 │ PO000008562 │ 2003-08-01 │ BEJA │ (38.0167,-7.8667) │
│ 45.2 │ PO000008562 │ 1995-07-23 │ BEJA │ (38.0167,-7.8667) │
│ 44.5 │ POM00008558 │ 2003-08-01 │ EVORA/C. COORD │ (38.533,-7.9) │
│ 44.2 │ POM00008558 │ 2022-07-13 │ EVORA/C. COORD │ (38.533,-7.9) │
└─────────┴─────────────┴────────────┴────────────────┴───────────────────┘
5 rows in set. Elapsed: 0.488 sec. Processed 1.08 billion rows, 14.06 GB (2.21 billion rows/s., 28.82 GB/s.)
Unfortunately, this requires a full table scan and is therefore slower than our earlier denormalized approach. The reason:
When running a JOIN, there is no optimization of the order of execution relative to other stages of the query. The join (the search in the right table) is run before the filtering in WHERE and before aggregation.
Dictionaries were also suggested as a possible solution. Let us now show how we can use them to improve query performance when the data follows normalization principles.
Creating a dictionary
Dictionaries provide an in-memory key-value representation of data, optimized for efficient lookup queries. We can take advantage of this structure to improve query performance in general, and JOIN queries in particular benefit when one side of the JOIN is an in-memory lookup table.
Selecting a source and key
The dictionary can currently be populated from two sources: local ClickHouse tables and HTTP URLs*. The contents of the dictionary can be configured to be reloaded periodically to reflect changes in the source data.
* We anticipate extending this feature in the future to include other sources supported in OSS.
Next, we create our dictionary using the stations table as the source.
CREATE DICTIONARY stations_dict
(
`station_id` String,
`state` String,
`country_code` String,
`name` String,
`lat` Float64,
`lon` Float64,
`elevation` Float32
)
PRIMARY KEY station_id
SOURCE(CLICKHOUSE(TABLE 'stations'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(complex_key_hashed_array())
Here the PRIMARY KEY is station_id, which intuitively represents the column on which lookups will be performed. Values must be unique, i.e. rows with duplicate primary keys will be deduplicated. The other columns represent attributes. You may have noticed that we split location into lat and lon, because dictionary attributes currently do not support the Point type. LAYOUT and LIFETIME are less obvious and require some explanation.
Choosing a layout
The layout of a dictionary controls how it is stored in memory and the indexing strategy for primary keys. Each layout option has different pros and cons.
The flat type allocates an array sized to the maximum key value; for example, if the maximum value is 100k, the array will also have 100k entries. This is ideal when the source data has a monotonically increasing primary key. In that case it uses very little memory and provides 4-5x faster access than hash-based alternatives, requiring only a simple array offset lookup. However, it is limited in that the key value cannot exceed 500k, although this can be raised by setting max_array_size. It is inherently less efficient for large, sparse key distributions, wasting memory in such cases.
For situations where you have a very large number of entries, large key values, and/or a sparse distribution of values, the flat layout becomes less ideal. We then generally recommend a hash-based dictionary, specifically the hashed_array layout, which efficiently supports millions of entries. This layout is more memory efficient than the hashed layout and almost as fast: a single hash table stores the primary keys, and its values are offsets into per-attribute arrays. The hashed layout, by contrast, is slightly faster but allocates a hash table per attribute and therefore consumes more memory. In most cases we therefore recommend hashed_array, although users should also test hashed.
All these layouts also require that the key be convertible to UInt64. If it is not, for example when keys are strings, we can use the complex variants of the hashed dictionaries, complex_key_hashed and complex_key_hashed_array, which follow the same rules as above.
The flowchart below attempts to capture the above logic and should help you choose the right layout most of the time:
For our data, the primary key station_id is of type String, so we choose the complex_key_hashed_array layout, since our dictionaries have at least three attributes in each case.
Note: There are also sparse variants of the hashed and complex_key_hashed layouts. These aim for O(1) operations by splitting the primary keys into groups with incremental ranges within them. We rarely recommend them; they are only effective if you have a single attribute, and although operations are constant time, the actual constant is usually higher than for the non-sparse variants. Finally, ClickHouse provides specialized layouts such as polygon and ip_trie.
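As a brief illustration of one of those specialized layouts, here is a hypothetical ip_trie dictionary for longest-prefix CIDR lookups. The source table ip_ranges and its columns are assumptions for this sketch, not part of our dataset:

CREATE DICTIONARY ip_dict
(
    -- The key must be a String containing a CIDR prefix, e.g. '1.2.3.0/24'
    `prefix` String,
    `country` String
)
PRIMARY KEY prefix
SOURCE(CLICKHOUSE(TABLE 'ip_ranges'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(IP_TRIE())

-- Longest-prefix match for a single address:
-- SELECT dictGet('ip_dict', 'country', tuple(IPv4StringToNum('1.2.3.4')))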
Choosing a lifetime
The dictionary DDL above also specifies a LIFETIME. This controls how often the dictionary re-reads its source to refresh the data. It can be specified as a number of seconds or a range, e.g. LIFETIME(300) or LIFETIME(MIN 300 MAX 360). In the latter case, a random time uniformly distributed within the range is chosen, which spreads the load on the dictionary source over time when multiple servers update. The value used in our example, LIFETIME(MIN 0 MAX 0), means the dictionary contents are never updated, which is appropriate here because our data is static.
If your data does change and you need to reload it periodically, this behavior can be controlled through an invalidate_query parameter, which returns a single row. If the value of this row changes between update cycles, ClickHouse knows the data must be re-fetched; this could return, for example, a timestamp or a row count. Further options exist to load only data that has changed since the last update; see our documentation for examples using update_field.
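A sketch of what this could look like for a refreshing version of our stations dictionary. Note the updated_at column is an assumption for illustration; our actual stations table does not have one:

CREATE DICTIONARY stations_refreshing_dict
(
    `station_id` String,
    `name` String
)
PRIMARY KEY station_id
-- Re-fetch only when the result of INVALIDATE_QUERY changes between update cycles
SOURCE(CLICKHOUSE(TABLE 'stations' INVALIDATE_QUERY 'SELECT count(), max(updated_at) FROM stations'))
-- Check for updates at a random interval between 5 and 6 minutes
LIFETIME(MIN 300 MAX 360)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())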
Using the dictionary
Although our dictionary has been created, it requires a query to load the data into memory. The easiest ways to do this are to issue a simple dictGet query retrieving a single value (which loads the dataset into the dictionary as a side effect) or to issue an explicit SYSTEM RELOAD DICTIONARY command.
SYSTEM RELOAD DICTIONARY stations_dict
0 rows in set. Elapsed: 0.561 sec.
SELECT dictGet(stations_dict, 'state', 'CA00116HFF6')
┌─dictGet(stations_dict, 'state', 'CA00116HFF6')─┐
│ BC │
└────────────────────────────────────────────────┘
1 row in set. Elapsed: 0.001 sec.
The dictGet example above retrieves the state value for the station with station_id 'CA00116HFF6'.
Returning to our original join query, we can restore the subquery and use the dictionary only for the location and name fields:
SELECT
tempMax / 10 AS maxTemp,
station_id,
date,
(dictGet(stations_dict, 'lat', station_id), dictGet(stations_dict, 'lon', station_id)) AS location,
dictGet(stations_dict, 'name', station_id) AS name
FROM noaa
WHERE station_id IN (
SELECT station_id
FROM stations
WHERE country_code = 'PO'
)
ORDER BY tempMax DESC
LIMIT 5
┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┬─name───────────┐
│ 45.8 │ PO000008549 │ 1944-07-30 │ (40.2,-8.4167) │ COIMBRA │
│ 45.4 │ PO000008562 │ 2003-08-01 │ (38.0167,-7.8667) │ BEJA │
│ 45.2 │ PO000008562 │ 1995-07-23 │ (38.0167,-7.8667) │ BEJA │
│ 44.5 │ POM00008558 │ 2003-08-01 │ (38.533,-7.9) │ EVORA/C. COORD │
│ 44.2 │ POM00008558 │ 2022-07-13 │ (38.533,-7.9) │ EVORA/C. COORD │
└─────────┴─────────────┴────────────┴───────────────────┴────────────────┘
5 rows in set. Elapsed: 0.012 sec. Processed 522.48 thousand rows, 6.64 MB (44.90 million rows/s., 570.83 MB/s.)
Now this is much better! The key is that we benefit from the subquery optimization exploiting the stations table's country_code primary key. The parent query then limits the noaa table read to only the returned station ids, again exploiting its primary key to minimize the data read. Finally, dictGet is needed only for the final 5 rows to retrieve name and location. We visualize the process below:
Experienced dictionary users may be tempted to try other approaches. For example, we could:
-
Remove the subquery and filter with dictGet(stations_dict, 'country_code', station_id) = 'PO'. This is no faster (about 0.5 seconds) because the dictionary lookup must be performed for every station. We look at a similar example of this below.
-
Exploit the fact that dictionaries can be used in a JOIN just like tables (see below). This suffers from the same challenge as the previous suggestion.
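For completeness, the second alternative, joining the dictionary directly like a table, could look like the following sketch. We would expect it to be no faster here, for the same reason as the dictGet filter: the lookup still happens per station.

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date
FROM noaa
-- The dictionary is scanned like a table in this form of JOIN
INNER JOIN stations_dict ON noaa.station_id = stations_dict.station_id
WHERE stations_dict.country_code = 'PO'
ORDER BY tempMax DESC
LIMIT 5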
Something more complex
Consider the following query:
Using a list of U.S. ski resorts and their corresponding locations, we join them with the top 1000 weather stations that recorded the most snow in any month over the last 5 years. Sorting by geoDistance and restricting the results to stations less than 20km away, we select the top result per resort and sort by total snow. Note that we also restrict resorts to those above 1800m elevation, as a broad indicator of good skiing conditions.
SELECT
resort_name,
total_snow / 1000 AS total_snow_m,
resort_location,
month_year
FROM
(
WITH resorts AS
(
SELECT
resort_name,
state,
(lon, lat) AS resort_location,
'US' AS code
FROM url('https://gist.githubusercontent.com/Ewiseman/b251e5eaf70ca52a4b9b10dce9e635a4/raw/9f0100fe14169a058c451380edca3bda24d3f673/ski_resort_stats.csv', CSVWithNames)
)
SELECT
resort_name,
highest_snow.station_id,
geoDistance(resort_location.1, resort_location.2, station_location.1, station_location.2) / 1000 AS distance_km,
highest_snow.total_snow,
resort_location,
station_location,
month_year
FROM
(
SELECT
sum(snowfall) AS total_snow,
station_id,
any(location) AS station_location,
month_year,
substring(station_id, 1, 2) AS code
FROM noaa
WHERE (date > '2017-01-01') AND (code = 'US') AND (elevation > 1800)
GROUP BY
station_id,
toYYYYMM(date) AS month_year
ORDER BY total_snow DESC
LIMIT 1000
) AS highest_snow
INNER JOIN resorts ON highest_snow.code = resorts.code
WHERE distance_km < 20
ORDER BY
resort_name ASC,
total_snow DESC
LIMIT 1 BY
resort_name,
station_id
)
ORDER BY total_snow DESC
LIMIT 5
Before optimizing with dictionaries, we first replace the CTE containing the resorts with an actual table. This ensures the data is local to our ClickHouse cluster and avoids the HTTP latency of fetching the resort data on every query.
CREATE TABLE resorts
(
`resort_name` LowCardinality(String),
`state` LowCardinality(String),
`lat` Nullable(Float64),
`lon` Nullable(Float64),
`code` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY state
When populating this table, we also take the opportunity to align the state field with the stations table (we will use this later). Resorts use full state names, whereas stations use state codes. To ensure consistency, we can map state names to codes when inserting into the resorts table. This gives us another opportunity to create a dictionary, this time from an HTTP source.
CREATE DICTIONARY states
(
`name` String,
`code` String
)
PRIMARY KEY name
SOURCE(HTTP(URL 'https://gist.githubusercontent.com/gingerwizard/b0e7c190474c847fdf038e821692ce9c/raw/19fdac5a37e66f78d292bd8c0ee364ca7e6f9a57/states.csv' FORMAT 'CSVWithNames'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())
SELECT *
FROM states
LIMIT 2
┌─name─────────┬─code─┐
│ Pennsylvania │ PA │
│ North Dakota │ ND │
└──────────────┴──────┘
2 rows in set. Elapsed: 0.001 sec.
On insertion, we can use the dictGet function to map our state name to the resort's state code.
INSERT INTO resorts SELECT
resort_name,
dictGet(states, 'code', state) AS state,
lat,
lon,
'US' AS code
FROM url('https://gist.githubusercontent.com/Ewiseman/b251e5eaf70ca52a4b9b10dce9e635a4/raw/9f0100fe14169a058c451380edca3bda24d3f673/ski_resort_stats.csv', CSVWithNames)
0 rows in set. Elapsed: 0.389 sec.
With this table in place, our original query becomes noticeably simpler.
SELECT
resort_name,
total_snow / 1000 AS total_snow_m,
resort_location,
month_year
FROM
(
SELECT
resort_name,
highest_snow.station_id,
geoDistance(lon, lat, station_location.1, station_location.2) / 1000 AS distance_km,
highest_snow.total_snow,
station_location,
month_year,
(lon, lat) AS resort_location
FROM
(
SELECT
sum(snowfall) AS total_snow,
station_id,
any(location) AS station_location,
month_year,
substring(station_id, 1, 2) AS code
FROM noaa
WHERE (date > '2017-01-01') AND (code = 'US') AND (elevation > 1800)
GROUP BY
station_id,
toYYYYMM(date) AS month_year
ORDER BY total_snow DESC
LIMIT 1000
) AS highest_snow
INNER JOIN resorts ON highest_snow.code = resorts.code
WHERE distance_km < 20
ORDER BY
resort_name ASC,
total_snow DESC
LIMIT 1 BY
resort_name,
station_id
)
ORDER BY total_snow DESC
LIMIT 5
┌─resort_name──────────┬─total_snow_m─┬─resort_location─┬─month_year─┐
│ Sugar Bowl, CA │ 7.799 │ (-120.3,39.27) │ 201902 │
│ Donner Ski Ranch, CA │ 7.799 │ (-120.34,39.31) │ 201902 │
│ Boreal, CA │ 7.799 │ (-120.35,39.33) │ 201902 │
│ Homewood, CA │ 4.926 │ (-120.17,39.08) │ 201902 │
│ Alpine Meadows, CA │ 4.926 │ (-120.22,39.17) │ 201902 │
└──────────────────────┴──────────────┴─────────────────┴────────────┘
5 rows in set. Elapsed: 0.673 sec. Processed 580.53 million rows, 4.85 GB (862.48 million rows/s., 7.21 GB/s.)
Note the execution time, and let's see if we can improve it further. This query still assumes that location is denormalized onto our noaa weather measurements table. We can instead read this field from the stations_dict dictionary. This also makes it easy to fetch the station's state and join on it against the resorts table instead of on code. This join is smaller and thus faster: rather than joining all stations with all US resorts, we limit the join to resorts in the same state.
Our resorts table is actually quite small (364 entries). Although moving it to a dictionary is unlikely to bring a real performance benefit for this query, given its size it may still be a reasonable way to store the data. We choose resort_name as the primary key because, as noted earlier, it must be unique.
CREATE DICTIONARY resorts_dict
(
`state` String,
`resort_name` String,
`lat` Nullable(Float64),
`lon` Nullable(Float64)
)
PRIMARY KEY resort_name
SOURCE(CLICKHOUSE(TABLE 'resorts'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY())
Now we modify the query, using stations_dict where possible and joining against resorts_dict. Note that we still join on the resorts' state column, even though it is not the primary key of the dictionary; in this case the JOIN syntax is used, and the dictionary is scanned like a table.
SELECT
resort_name,
total_snow / 1000 AS total_snow_m,
resort_location,
month_year
FROM
(
SELECT
resort_name,
highest_snow.station_id,
geoDistance(resorts_dict.lon, resorts_dict.lat, station_lon, station_lat) / 1000 AS distance_km,
highest_snow.total_snow,
(resorts_dict.lon, resorts_dict.lat) AS resort_location,
month_year
FROM
(
SELECT
sum(snowfall) AS total_snow,
station_id,
dictGet(stations_dict, 'lat', station_id) AS station_lat,
dictGet(stations_dict, 'lon', station_id) AS station_lon,
month_year,
dictGet(stations_dict, 'state', station_id) AS state
FROM noaa
WHERE (date > '2017-01-01') AND (state != '') AND (elevation > 1800)
GROUP BY
station_id,
toYYYYMM(date) AS month_year
ORDER BY total_snow DESC
LIMIT 1000
) AS highest_snow
INNER JOIN resorts_dict ON highest_snow.state = resorts_dict.state
WHERE distance_km < 20
ORDER BY
resort_name ASC,
total_snow DESC
LIMIT 1 BY
resort_name,
station_id
)
ORDER BY total_snow DESC
LIMIT 5
┌─resort_name──────────┬─total_snow_m─┬─resort_location─┬─month_year─┐
│ Sugar Bowl, CA │ 7.799 │ (-120.3,39.27) │ 201902 │
│ Donner Ski Ranch, CA │ 7.799 │ (-120.34,39.31) │ 201902 │
│ Boreal, CA │ 7.799 │ (-120.35,39.33) │ 201902 │
│ Homewood, CA │ 4.926 │ (-120.17,39.08) │ 201902 │
│ Alpine Meadows, CA │ 4.926 │ (-120.22,39.17) │ 201902 │
└──────────────────────┴──────────────┴─────────────────┴────────────┘
5 rows in set. Elapsed: 0.170 sec. Processed 580.73 million rows, 2.87 GB (3.41 billion rows/s., 16.81 GB/s.)
Great, more than twice as fast! The attentive reader may have noticed that we skipped a possible optimization: could we replace the elevation filter elevation > 1800 with a dictionary lookup, i.e. dictGet(blogs.stations_dict, 'elevation', station_id) > 1800, and thereby avoid reading the elevation column entirely? In practice this is slower, because a dictionary lookup would be performed for every row, which is slower than filtering on the denormalized elevation data; the latter also benefits from the condition being moved to PREWHERE. Here we benefit from keeping elevation denormalized, for the same reason we did not use a dictGet on country_code in our earlier query.
So the advice here is: test! If dictGet needs to be applied to a large fraction of a table's rows, for example in a filter condition, you are probably better off using ClickHouse's native data structures and indexes directly.
Final recommendations
-
The dictionary layouts we have described are held entirely in memory. Be mindful of their memory usage and test any layout changes. You can track their memory overhead via the bytes_allocated column of the system.dictionaries table. This table also includes a last_exception column useful for diagnosing problems.
SELECT
*,
formatReadableSize(bytes_allocated) AS size
FROM system.dictionaries
LIMIT 1
FORMAT Vertical
Row 1:
──────
database: blogs
name: resorts_dict
uuid: 0f387514-85ed-4c25-bebb-d85ade1e149f
status: LOADED
origin: 0f387514-85ed-4c25-bebb-d85ade1e149f
type: ComplexHashedArray
key.names: ['resort_name']
key.types: ['String']
attribute.names: ['state','lat','lon']
attribute.types: ['String','Nullable(Float64)','Nullable(Float64)']
bytes_allocated: 30052
hierarchical_index_bytes_allocated: 0
query_count: 1820
hit_rate: 1
found_rate: 1
element_count: 364
load_factor: 0.7338709677419355
source: ClickHouse: blogs.resorts
lifetime_min: 0
lifetime_max: 0
loading_start_time: 2022-11-22 16:26:06
last_successful_update_time: 2022-11-22 16:26:06
loading_duration: 0.001
last_exception:
comment:
size: 29.35 KiB
-
While dictGet is likely the dictionary function you will use most, other variants exist; dictGetOrDefault and dictHas are particularly useful. Also note the type-specific functions such as dictGetFloat64.
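For example, a sketch using our stations_dict and a deliberately made-up key:

SELECT
    -- dictHas returns 1 if the key exists in the dictionary, 0 otherwise
    dictHas(stations_dict, 'NOT_A_STATION') AS has_key,
    -- dictGetOrDefault lets us supply an explicit fallback for missing keys
    dictGetOrDefault(stations_dict, 'name', 'NOT_A_STATION', 'unknown') AS name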
-
The flat dictionary layout is limited to 500k entries. While this limit can be raised, treat it as a prompt to consider a hash-based layout.
Conclusion
In this article, we have shown how keeping data normalized can sometimes lead to faster queries, especially when dictionaries are used. We have provided some simple and more complex examples of where dictionaries are valuable and derived some useful recommendations.