Introduction to the New Features in StarRocks 2.1

Dear new and existing StarRocks users:

StarRocks recently released version 2.1. The core updates include support for Apache Iceberg external tables, the release of the Pipeline execution engine, support for tables with up to 10,000 columns, optimizations to first-scan and Page Cache performance, support for SQL fingerprinting, and more.

A detailed introduction follows. You are welcome to upgrade, try it out, and give us more feedback!

Support for Apache Iceberg external tables (in beta)

Apache Iceberg is one of the most popular solutions for building data lakes. Following support for querying Hive external tables, StarRocks now also supports directly querying data on Apache Iceberg, allowing users to run extremely fast analysis on the data lake without importing the data first.
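Below is a minimal sketch of what creating and querying an Iceberg external table can look like. The metastore URI, database, table, and column names are placeholders, and the exact resource and table properties should be checked against the StarRocks documentation for your version.

```sql
-- Register an Iceberg resource backed by a Hive metastore catalog
-- (the URI and names below are placeholder assumptions).
CREATE EXTERNAL RESOURCE "iceberg0"
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "HIVE",
    "iceberg.catalog.hive.metastore.uris" = "thrift://hive-metastore:9083"
);

-- Map an Iceberg table into StarRocks as an external table.
CREATE EXTERNAL TABLE iceberg_orders (
    order_id BIGINT,
    amount   DECIMAL(10, 2)
) ENGINE = ICEBERG
PROPERTIES (
    "resource" = "iceberg0",
    "database" = "iceberg_db",
    "table"    = "orders"
);

-- Query it like any other table, without importing the data.
SELECT COUNT(*) FROM iceberg_orders;
```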

On the TPC-H 100G test set, StarRocks achieves 3-5 times the query performance of Trino (PrestoSQL) through optimizations such as the CBO optimizer, vectorized execution, and native C++ execution.

In the future, we will further optimize the performance of querying Apache Iceberg data lakes and add support for querying other data lake solutions such as Apache Hudi.

Release of the Pipeline execution engine (in public beta)

For multi-core scheduling, the execution engine originally used a thread-based task model, which has two significant problems: first, in highly concurrent query scenarios, data dependencies and blocking IO operations cause frequent context switches and therefore high scheduling cost; second, setting the degree of parallelism for complex queries is overly complicated.

The newly released Pipeline execution engine adopts a more efficient coroutine-based scheduling mechanism and executes data dependencies and IO operations asynchronously, which reduces the cost of context switching. In high-concurrency scenarios, the performance of some queries doubles and CPU utilization also improves significantly; on the SSB, TPC-H, and TPC-DS test sets, overall performance improves as well. At the same time, the degree of query parallelism is now set adaptively, so you no longer need to manually set the parallel_fragment_exec_instance_num variable.
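For reference, a minimal sketch of trying out the public beta via session variables is shown below; the variable names enable_pipeline_engine and pipeline_dop reflect the public beta and should be checked against the documentation for your version.

```sql
-- Enable the Pipeline execution engine for the current session (public beta).
SET enable_pipeline_engine = true;
-- 0 lets StarRocks pick the pipeline degree of parallelism adaptively,
-- so parallel_fragment_exec_instance_num no longer needs to be tuned by hand.
SET pipeline_dop = 0;
```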

Support for tables with up to 10,000 columns

In traditional data warehouse modeling, wide tables are often created to simplify usage and optimize query performance. In scenarios such as user profiling, wide tables with hundreds or even thousands of columns are common, and some very large customers use tables with thousands of columns. However, when importing big data (hundreds of millions of rows) into tables with thousands of columns, especially when large strings are involved, the original compaction method consumes a lot of memory, easily triggers OOM, and compaction also becomes much slower.

The new version refactors compaction to merge multi-version data in batches by column and optimizes support for large strings, reducing memory usage. As a result, when importing large amounts of data into tables with up to 10,000 columns, memory usage is reduced by a factor of 10 and compaction performance improves by a factor of 3.

Optimized first-scan and Page Cache performance

In big data analysis, some cold data has to be read from disk (the first scan), while hot data can be read directly from memory; query performance in these two cases can differ by several times or even dozens of times. The main cause is random disk IO. StarRocks reduces random IO by reducing the number of files, lazily loading indexes, and adjusting the file structure, which improves first-scan performance; the improvement is especially noticeable in HDD environments.

For several queries in the SSB-100G test set whose performance is heavily affected by the first scan (such as Q2.1/Q3.1/Q3.2/Q4.1), query performance with a first scan improves to 2 to 3 times the original.

At the same time, StarRocks has also optimized its Page Cache strategy: in some scenarios the original data is stored directly without Bitshuffle encoding, so no additional decoding is needed when reading data from the Page Cache, which greatly improves query efficiency.

Support for SQL fingerprinting

A SQL fingerprint is an MD5 value calculated for SQL statements of the same type, where "the same type" means SQL texts that become identical once their constants are normalized away. By aggregating and analyzing SQL fingerprints and their related statistics, you can easily understand what types of SQL are running, along with their frequency, resource consumption, processing time, and so on, so that unreasonable, resource-intensive SQL can be prioritized for optimization.

SQL fingerprints are mainly used for optimizing slow queries and system resource usage (CPU, memory, disk reads/writes, etc.). For example, if the cluster's CPU and memory usage are not particularly high but the disks are often saturated, you can identify the SQL fingerprints with the highest share of disk reads and then analyze whether the corresponding queries really need to read that much from disk, which often reveals optimization opportunities.
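As a rough sketch of this kind of analysis, suppose audit records that include the SQL fingerprint have been loaded into a StarRocks table. The table and column names below (audit_log, digest, scan_bytes, query_time_ms) are illustrative assumptions, not a fixed StarRocks schema.

```sql
-- Group audit records by SQL fingerprint and rank fingerprints by how much
-- data their queries read from disk (the schema is hypothetical).
SELECT
    digest,                           -- the SQL fingerprint (MD5 of the normalized SQL)
    COUNT(*)           AS query_cnt,
    SUM(scan_bytes)    AS total_scan_bytes,
    AVG(query_time_ms) AS avg_time_ms
FROM audit_log
GROUP BY digest
ORDER BY total_scan_bytes DESC
LIMIT 10;
```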

Other optimizations

  • A new feature in public beta supports strings up to 1 MB long: in some complex business scenarios, schema-less data such as JSON is often analyzed, and 1 MB of string storage covers most such cases. Combined with the related functions, it meets these analysis needs well.

  • Support for CTAS (CREATE TABLE AS SELECT) syntax: there is no need to create the target table in advance; a single query completes the ETL operation and imports the result into a new table. Combined with a scheduler, this makes lightweight data warehouse modeling easy (see the sketch after this list).

  • The 100 MB limit on a single JSON file during JSON import has been removed, and JSON import performance has been optimized, making it easy to import large JSON files and connect to high-traffic Kafka data.

  • The Primary Key model now supports schema change.

  • Table creation statements support defining a timestamp column as DEFAULT CURRENT_TIMESTAMP (see the sketch after this list).

  • Added the functions ANY_VALUE and ARRAY_REMOVE, and the hash function SHA2.

  • Hive external tables stored in CSV format are now supported, and the performance of reading Hive data through external tables has been optimized.

  • Support for importing CSV files that use multi-character delimiters.

  • Optimize Bitmap Index performance.
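As a rough sketch of the CTAS and DEFAULT CURRENT_TIMESTAMP items above: the table and column names are illustrative, and details such as the distribution clause follow common StarRocks table-creation conventions rather than a prescribed form.

```sql
-- A DATETIME column can now default to the time of insertion.
CREATE TABLE orders (
    order_id   BIGINT,
    amount     DECIMAL(10, 2),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
)
DUPLICATE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;

-- CTAS: run a query and land its result in a new table in one statement,
-- without creating the target table in advance.
CREATE TABLE orders_daily
DISTRIBUTED BY HASH(dt) BUCKETS 8
AS SELECT DATE(created_at) AS dt, SUM(amount) AS total_amount
   FROM orders
   GROUP BY DATE(created_at);
```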
