ClickHouse 23.7 Release Notes


Reviewer: Zhuang Xiaodong (Weizhuang)

Release summary

  • 31 new features added
  • Implemented 16 performance optimizations
  • Fixed 47 bugs

This article describes some of the new features that deserve our special attention. But it’s worth noting that several features are now production-ready or enabled by default. You can find them at the end of this article.

New contributors

A special welcome to all new contributors to 23.7! ClickHouse owes much of its popularity to the people who contribute to the community, and seeing that community grow is always exciting.

If you see your name on the list below, please get in touch. We're on Twitter and would love to hear from you.

Alex Cheng, AlexBykovski, Chen768959, John Spurlock, Mikhail Koviazin, Rory Crispin, Samuel Colvin, Sanjam Panda, Song Liyong, StianBerger, Vitaliy Pashkov, Yarik Briukhovetskyi, Zach Naimon, chen768959, dheerajathrey, lcjh, pedro.riera, therealnick233, timfursov, velavokr, xiao, xiaolei565, xuelei, yariks5s

Improvements to Parquet writing (Michael Kolupaev)

In recent months we have made several improvements to how ClickHouse reads the Parquet file format. In addition to parallelizing reads across row groups and using metadata for filtering, we even took the time to ensure that queries on Hugging Face datasets were optimized. We know the format is ubiquitous and critical for tasks such as local analysis and data migration with clickhouse-local. Our continued work on Parquet support, and our pursuit of speed, has paid off in our public benchmarks.


Of course, reading Parquet is only half the story. Users inevitably need to write ClickHouse data to Parquet as well, usually as part of a reverse ETL workflow or to share the results of an analysis. We are therefore happy to announce that, starting with 23.7, Parquet writes are up to 6x faster.

Let's use the UK house price dataset to illustrate. Below, we use clickhouse-local to import the data from a Parquet file that is publicly hosted on S3.

CREATE TABLE uk_house_price
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
SETTINGS allow_nullable_key = 1 AS
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/uk-house-prices/parquet/house_prices.parquet')

0 rows in set. Elapsed: 40.550 sec. Processed 28.28 million rows, 4.67 GB (697.33 thousand rows/s., 115.15 MB/s.)

Exporting this dataset to Parquet with 23.6 is already fast, at almost 1.5 million rows per second.

SELECT *
FROM uk_house_price
INTO OUTFILE 'london-prices.parquet'

28276228 rows in set. Elapsed: 19.901 sec. Processed 28.20 million rows, 4.66 GB (1.42 million rows/s., 233.98 MB/s.)

Exporting the same dataset with 23.7 shows a significant improvement: the total time drops from 19.9 to 11.6 seconds, a reduction of roughly 40%. Actual results may vary, but we've observed improvements as high as 6x.

SELECT *
FROM uk_house_price
INTO OUTFILE 'london-prices.parquet'

28276228 rows in set. Elapsed: 11.649 sec. Processed 28.24 million rows, 4.66 GB (2.42 million rows/s., 400.39 MB/s.)
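As a quick sanity check, the two elapsed times reported above can be turned into a speedup factor. The sketch below uses only the timings and row count printed by the two export runs; the variable names are ours and purely illustrative.

```python
# Elapsed times (seconds) and row count copied from the two export runs above.
rows = 28_276_228
elapsed_23_6 = 19.901
elapsed_23_7 = 11.649

# How many times faster the 23.7 export is on this dataset.
speedup = elapsed_23_6 / elapsed_23_7
# Fraction of wall-clock time removed by upgrading.
time_saved = 1 - elapsed_23_7 / elapsed_23_6
# Export throughput on 23.7, based on the rows returned.
rows_per_second = rows / elapsed_23_7

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {time_saved:.0%}")
print(f"throughput on 23.7: {rows_per_second / 1e6:.2f} million rows/s")
```

On this particular dataset the gain is well below the 6x upper bound, which is a reminder that the improvement depends on the data and the hardware.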

Sparse columns enabled by default (Anton Popov)

ClickHouse has supported sparse columns for some time, but prior to 23.7 they had to be enabled explicitly. The optimization reduces the total data written for a column by dynamically changing the encoding format when a large share of default values is detected. In addition to improving compression, it also helps query performance and memory efficiency.

In 23.7, this feature is enabled by default. When this encoding can be applied, users should see immediate improvements in compression and performance.

When writing a data part (during either an insert or a merge), ClickHouse calculates the ratio of default values for each column. If the ratio exceeds the configured threshold, only the non-default values are written to the column. To record which rows hold default values, a separate stream containing the encoded offsets is written. This information is combined at query time, so the optimization is completely transparent to the user. The following diagram shows an example of this:

picture

For a column s containing sparse values ①, ClickHouse writes only the non-default values to a column file on disk ②, together with a sparse encoding file ③ that contains the offsets of the non-default values: for each non-default value, we store the number of default values that directly precede it. At query time, we build an in-memory representation ④ with direct offsets from this encoding. The sparsely encoded storage variant contains data with repeated values.
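The encoding described above can be sketched in a few lines of Python. This is a simplified illustration and not ClickHouse's actual on-disk format: the function names are ours, the default value is assumed to be 0, and trailing default values are restored from an explicit row count, which is our simplification.

```python
from typing import List, Tuple


def sparse_encode(column: List[int], default: int = 0) -> Tuple[List[int], List[int]]:
    """Split a column into its non-default values and, for each of them,
    the number of default values that directly precede it."""
    values: List[int] = []
    offsets: List[int] = []
    run = 0  # length of the current run of default values
    for v in column:
        if v == default:
            run += 1
        else:
            values.append(v)
            offsets.append(run)
            run = 0
    return values, offsets


def sparse_decode(values: List[int], offsets: List[int],
                  total_rows: int, default: int = 0) -> List[int]:
    """Rebuild the full column from the two streams.
    Trailing defaults are restored from total_rows (an assumption of this sketch)."""
    column: List[int] = []
    for v, gap in zip(values, offsets):
        column.extend([default] * gap)
        column.append(v)
    column.extend([default] * (total_rows - len(column)))
    return column


s = [0, 0, 7, 0, 0, 0, 3, 0]
values, offsets = sparse_encode(s)
restored = sparse_decode(values, offsets, len(s))
```

Here `values` holds only the two non-default entries and `offsets` holds the lengths of the default runs in front of them, so a column that is mostly defaults shrinks to a handful of numbers while remaining fully recoverable.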

Prior to 23.7, users had to enable sparse columns explicitly by changing ratio_of_defaults_for_sparse_serialization, the setting that controls the threshold at which sparse encoding is used. It previously defaulted to 1.0, which effectively disabled the feature. In 23.7, it defaults to 0.9375.
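To see why a threshold of 1.0 switches the feature off while 0.9375 switches it on, consider the following minimal sketch of the decision. It assumes the rule is "ratio of defaults strictly greater than the threshold", following the wording above; the function names and the choice of 0 as the default value are ours.

```python
from typing import Sequence


def default_ratio(column: Sequence[int], default: int = 0) -> float:
    """Share of rows in the column that hold the default value."""
    if not column:
        return 0.0
    return sum(1 for v in column if v == default) / len(column)


def use_sparse(column: Sequence[int], threshold: float = 0.9375, default: int = 0) -> bool:
    """Decide whether the column would be written sparsely in this sketch:
    the ratio of defaults must exceed the threshold."""
    return default_ratio(column, default) > threshold


mostly_default = [0] * 31 + [1]   # 31 of 32 rows are default (ratio 0.96875)
borderline = [0] * 15 + [1]       # 15 of 16 rows are default (ratio 0.9375)
all_default = [0] * 32            # ratio 1.0

new_default = use_sparse(mostly_default)                 # threshold 0.9375
at_threshold = use_sparse(borderline)                    # equal, so not exceeded
old_default = use_sparse(all_default, threshold=1.0)     # a ratio can never exceed 1.0
```

Because a ratio can never be greater than 1.0, the old default could not be exceeded even by a column consisting entirely of default values.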

While we expect sparse columns to benefit structured data, we also expect them to help where users insert unstructured data, such as JSON with highly variable keys. In these cases, columns that have values in only a few rows add almost no overhead, which can lead to significant space savings.

While we expected improvements in our public ClickBench benchmarks, the added benefit of having this optimization enabled by default was a pleasant surprise.


Experimental support for PRQL (contributed by János Benjamin Antal)

At ClickHouse, we firmly believe that SQL is the godfather of all query languages and has the power to handle almost any data problem. Over time, many languages have attempted to compete with or replace SQL, with varying degrees of success. New query languages appear quickly and often disappear just as quickly. SQL's persistence and its adoption by many data storage systems in successive versions prove its importance. However, we also recognize the importance of interacting with users in familiar territory and recognize that some languages are better suited to certain scenarios than others. If we see enough adoption and demand for a query language, we'll consider adding support and always welcome PRs from the community! Thanks to community contributions like this, ClickHouse now experimentally supports PRQL.

PRQL (Pipelined Relational Query Language) is pronounced "Prequel" and is positioned as "a simple, powerful, pipelined replacement for SQL". This pipelined syntax has become popular and has a growing community of contributors. By concatenating transformations to form a pipeline, complex SQL queries can be composed elegantly. At ClickHouse we can see this style of query building having some potential use in certain applications, especially in scenarios where users are engaging in search and discovery exercises - perhaps observability?

In addition to consulting the detailed documentation, users can experiment in a public playground (https://prql-lang.org/playground/). Let's consider a few simple examples using the UK house price dataset. Suppose you want to find the most expensive districts in London.

from uk_house_price
filter town == 'LONDON'
group district (
  aggregate {
    avg_price = average price
  }
)
sort {-avg_price}
take 1..10


SELECT
  district,
  AVG(price) AS avg_price
FROM uk_house_price
WHERE town = 'LONDON'
GROUP BY district
ORDER BY avg_price DESC
LIMIT 10

┌─district───────────────┬──────────avg_price─┐
│ CITY OF LONDON         │  2016389.321229964 │
│ CITY OF WESTMINSTER    │  1107261.809839673 │
│ KENSINGTON AND CHELSEA │ 1105730.3371717487 │
│ CAMDEN                 │  752077.7613715645 │
│ RICHMOND UPON THAMES   │  644835.3877018511 │
│ HAMMERSMITH AND FULHAM │  590308.6679440506 │
│ HOUNSLOW               │  574833.3599378078 │
│ ISLINGTON              │   531522.146523729 │
│ HARLOW                 │             500000 │
│ WANDSWORTH             │  464798.7692006684 │
└────────────────────────┴────────────────────┘

10 rows in set. Elapsed: 0.079 sec.

As shown, ClickHouse provides us with the compiled SQL equivalent of the PRQL query.

A query that can be more challenging for less experienced SQL users is finding the top row for a specific column in each group. For example, below we sort by price to find the most expensive property in each London postcode.

from uk_house_price
filter town == 'LONDON'
filter postcode1 != ''
select {
  postcode1, street, price
}
group postcode1 (
  sort {-price}
  take 1
)
sort {-price}
take 1..10

WITH table_0 AS
  (
      SELECT
          postcode1,
          street,
          price
      FROM uk_house_price
      WHERE (town = 'LONDON') AND (postcode1 != '')
      ORDER BY
          postcode1 ASC,
          price DESC
      LIMIT 1 BY postcode1
  )
SELECT
  postcode1,
  street,
  price
FROM table_0
ORDER BY price DESC
LIMIT 10

┌─postcode1─┬─street──────────┬─────price─┐
│ W1U       │ BAKER STREET    │ 594300000 │
│ W1J       │ STANHOPE ROW    │ 569200000 │
│ SE1       │ SUMNER STREET   │ 448500000 │
│ E1        │ BRAHAM STREET   │ 421364142 │
│ EC2V      │ GRESHAM STREET  │ 411500000 │
│ SE10      │ WATERVIEW DRIVE │ 400000000 │
│ EC1Y      │ MALLOW STREET   │ 372600000 │
│ SW1H      │ BROADWAY        │ 370000000 │
│ W1S       │ NEW BOND STREET │ 366180000 │
│ EC4V      │ CARTER LANE     │ 337000000 │
└───────────┴─────────────────┴───────────┘

10 rows in set. Elapsed: 0.498 sec. Processed 25.32 million rows, 574.02 MB (50.83 million rows/s., 1.15 GB/s.)
Peak memory usage: 60.70 MiB.

The simplicity of the above query relative to the SQL equivalent is quite striking.
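For readers unfamiliar with either syntax, the "top row per group" logic that the query above expresses (a per-group sort and take 1 in PRQL, LIMIT 1 BY in the SQL) can be sketched in plain Python. The rows and prices below are hypothetical and not taken from the dataset, and the function name is ours.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (postcode1, street, price)

# Hypothetical sample rows, for illustration only.
rows: List[Row] = [
    ("W1U", "BAKER STREET", 900),
    ("W1U", "GEORGE STREET", 400),
    ("SE1", "SUMNER STREET", 700),
    ("SE1", "UNION STREET", 950),
    ("E1", "BRAHAM STREET", 300),
]


def top_per_group(rows: List[Row], limit: int) -> List[Row]:
    """Keep the most expensive row per postcode, then return the
    `limit` most expensive of those, highest price first."""
    best: Dict[str, Row] = {}
    for row in rows:
        postcode1, _street, price = row
        current = best.get(postcode1)
        if current is None or price > current[2]:
            best[postcode1] = row
    return sorted(best.values(), key=lambda r: r[2], reverse=True)[:limit]


result = top_per_group(rows, limit=10)
for postcode1, street, price in result:
    print(postcode1, street, price)
```

The inner loop plays the role of the per-group sort and take 1, and the final sorted slice corresponds to the outer sort and take 1..10.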

Because PRQL compiles to SQL, we're excited to see how it evolves and which use cases it serves with ClickHouse. If you find PRQL useful and have solved problems with it, please let us know!


Origin blog.csdn.net/ClickHouseDB/article/details/132774299