Apache Doris 2.0-beta release: over 10x blind-test performance improvement and a more unified multi-scenario analysis experience

Dear community friends, we are pleased to announce that Apache Doris 2.0-beta was officially released on July 3, 2023! In this version, more than 255 contributors submitted over 3,500 optimizations and fixes to Apache Doris. You are welcome to download and try it!

Download link: https://doris.apache.org/download

GitHub source code: https://github.com/apache/doris/tree/branch-2.0

At the annual Doris Summit held earlier this year, we released the Apache Doris 2023 Roadmap and proposed a new vision:

We hope users can build data analysis services for all kinds of scenarios on Apache Doris: supporting online and offline workloads, high-throughput interactive analysis, and high-concurrency point queries; unifying lake and warehouse in a single architecture to provide seamless and extremely fast analysis over data lakes and heterogeneous storage; and meeting ever more diverse analysis needs through unified management and analysis of semi-structured and even unstructured multi-modal data such as logs and text.

This is the value we hope Apache Doris brings to users: instead of forcing users to weigh trade-offs among multiple systems, one system solves most of the problems, reducing the development, operations, and usage costs of a complex technology stack and maximizing productivity.

Facing real-time analysis of massive data, realizing this vision undoubtedly requires overcoming many difficulties, especially when meeting the real demands of actual business scenarios:

  • How to guarantee stable user queries while upstream data is written in real time and at high frequency?
  • How to keep online services continuous while upstream data is updated and table schemas change?
  • How to achieve unified storage and efficient analysis of structured and semi-structured data?
  • How to handle point queries, report analysis, ad-hoc queries, ETL/ELT, and other query workloads simultaneously while keeping them isolated from each other?
  • How to guarantee efficient execution of complex SQL, stability of large queries, and observability of the execution process?
  • How to integrate and access data lakes and heterogeneous data sources more conveniently?
  • How to greatly reduce data storage and compute resource costs while retaining high-performance queries?
  • ……

Adhering to the principle of "leaving ease of use to users and complexity to ourselves", and in order to overcome the above challenges, from theoretical basis to engineering implementation, from ideal business scenarios to extreme corner cases, and from passing internal tests to availability in large-scale production, we spent more time and energy on feature development, verification, and continuous iteration. It is worth celebrating that, after nearly half a year of development, testing, and stability tuning, Apache Doris has finally ushered in the official release of 2.0-beta! And this release brings our vision one step closer to reality!

Blind-test performance improved by more than 10 times!

Brand new query optimizer

High performance is the constant pursuit of Apache Doris. Its excellent results on public benchmarks such as ClickBench and TPC-H over the past year show that it has achieved industry-leading performance at the execution layer and in operator optimization, but there is still a gap between benchmarks and real business scenarios:

  • A benchmark is an abstraction, refinement, and simplification of real business scenarios, while real scenarios often involve more complex query statements that no test suite can cover;
  • Benchmark queries can be enumerated and tuned one by one, whereas tuning real business scenarios depends heavily on the mental effort of engineers; tuning efficiency is often low and consumes too much engineering manpower;

For this reason, we began developing a new query optimizer built on a modern architecture, and it is fully enabled in Apache Doris 2.0-beta. The new optimizer adopts the more advanced Cascades framework, uses richer statistics, and performs smarter adaptive tuning. In most scenarios it delivers top query performance without any tuning or SQL rewriting, supports more complete complex SQL, and can run all 99 TPC-DS queries.

We ran a blind test of the new optimizer's execution performance. Taking the 22 TPC-H queries as an example, without any manual tuning or SQL rewriting, the new optimizer improved blind-test performance by more than 10 times! Moreover, in the real business scenarios of dozens of users on 2.0, most of their original SQL now executes far more efficiently, truly solving the pain point of manual tuning!

Reference document: https://doris.apache.org/zh-CN/docs/dev/query-acceleration/nereids

How to enable: SET enable_nereids_planner=true. In Apache Doris 2.0-beta the new query optimizer is enabled by default, and statistics are collected by running the Analyze command.
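Putting the two steps together, a minimal session might look like this (the table name lineitem is only an illustration):

```sql
-- Enable the new (Nereids) query optimizer; already on by default in 2.0-beta
SET enable_nereids_planner=true;

-- Collect statistics so the optimizer can choose better plans
ANALYZE TABLE lineitem;
```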

Adaptive Pipeline Execution Engine

In the past, the Apache Doris execution engine was built on the traditional volcano model. To make better use of multi-machine, multi-core concurrency, we used to have to set execution concurrency manually (for example, setting parallel_fragment_exec_instance_num from its default value of 1 to 8 or 16), which led to a series of problems when many query tasks ran at once:

  • Large queries and small queries need different instance concurrency, and the system cannot adapt automatically;
  • Instance operators occupy threads and block on execution; a large number of query tasks can fill the execution thread pool so it cannot respond to subsequent requests, and can even cause logical deadlock;
  • Scheduling among instance threads relies on the operating system scheduler, and repeated thread switching incurs extra performance overhead;
  • When different analysis workloads coexist, instance threads may contend for CPU, so large and small queries, and different tenants, may interfere with each other;

To address these problems, Apache Doris 2.0 introduces the Pipeline execution model as its query execution engine. In the Pipeline engine, query execution is driven by data rather than by control flow: the blocking operators in each query execution plan split it into separate Pipelines, and whether a Pipeline is scheduled onto an execution thread depends on whether its upstream data is ready. This achieves the following effects:

  • The Pipeline execution model splits the execution plan into Pipeline Tasks along blocking boundaries and time-shares them on a thread pool, making blocking operations asynchronous and solving the problem of an instance occupying a single thread for a long time.
  • Different scheduling strategies can allocate CPU among large and small queries and among different tenants, managing system resources more flexibly.
  • The Pipeline execution model also applies data pooling to pool the data within a single bucket, removing the limit that bucket count places on instance parallelism, improving Apache Doris's ability to utilize multi-core machines, and avoiding frequent thread creation and destruction.
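As an illustration only (a toy sketch, not Doris's actual implementation), the data-driven scheduling idea above can be shown in a few lines of Python: a task enters the thread pool only when its upstream data is ready, so no thread ever sits blocked waiting for input:

```python
from concurrent.futures import ThreadPoolExecutor

class PipelineTask:
    """One pipeline split out of a query plan at a blocking operator."""
    def __init__(self, name, deps=()):
        self.name, self.deps, self.done = name, list(deps), False

    def ready(self):
        # Schedulable only once all upstream pipelines have produced data
        return all(d.done for d in self.deps)

    def run(self):
        self.done = True
        return self.name

def schedule(tasks, workers=2):
    """Time-share ready tasks over a fixed thread pool; nothing blocks a thread."""
    order, pending = [], list(tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            ready = [t for t in pending if t.ready()]
            order.extend(pool.map(PipelineTask.run, ready))
            pending = [t for t in pending if not t.done]
    return order

scan = PipelineTask("scan")
agg = PipelineTask("agg", deps=[scan])    # blocked until scan finishes
sink = PipelineTask("sink", deps=[agg])
print(schedule([sink, scan, agg]))  # ['scan', 'agg', 'sink']
```

The real engine of course does far more (operator state machines, local exchange, priority queues), but the core contract is the same: readiness, not thread ownership, decides who runs.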

Through the Pipeline execution engine, the query performance and stability of Apache Doris in mixed workload scenarios are further improved.

Reference document: https://doris.apache.org/zh-CN/docs/dev/query-acceleration/pipeline-execution-engine

How to enable: Set enable_pipeline_engine = true. This feature is enabled by default in Apache Doris 2.0, and BE executes SQL in Pipeline mode by default. parallel_pipeline_task_num controls the number of concurrent Pipeline Tasks for a SQL query. Apache Doris defaults it to 0, in which case Doris automatically detects the number of CPU cores on each BE and sets the concurrency to half of that; users can also adjust it to fit their actual situation. Users upgrading from older versions are advised to set this parameter to the parallel_fragment_exec_instance_num value they used in the old version.
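In session form, the settings described above look like this (values are illustrative):

```sql
-- Pipeline engine; already on by default in 2.0
SET enable_pipeline_engine = true;

-- 0 = auto: concurrency becomes half of each BE's CPU cores.
-- Users upgrading from 1.x may instead set this to their old
-- parallel_fragment_exec_instance_num value.
SET parallel_pipeline_task_num = 0;
```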

Query stability is further improved

Multi-service resource isolation

With the rapid growth of the user base, more and more users are building unified in-house analysis platforms on Apache Doris. On the one hand, this requires Apache Doris to handle larger-scale data processing and analysis; on the other hand, it must cope with more diverse analysis workloads. The key lies in how to guarantee that different workloads can run stably in one system.

Apache Doris 2.0 adds a Workload Manager built on the Pipeline execution engine, managing workloads in groups to achieve fine-grained control of memory and CPU resources.

In previous versions, Apache Doris offered multi-tenant resource isolation through resource tags, which divides node resources to prevent interference between services. Workload Group introduces a finer-grained control mechanism: queries are associated with Workload Groups, which limit the percentage of CPU and memory a single query can use on a BE node, and a memory soft limit can be configured for the group. When cluster resources are tight, the few queries occupying the most memory in a group are automatically killed to relieve pressure on the cluster; when cluster resources are idle and a Workload Group's usage exceeds its preset value, multiple Workload Groups share the cluster's available idle resources and may automatically exceed the threshold, continuing to use system memory to keep query tasks running stably.

create workload group if not exists etl_group
properties (
  "cpu_share"="10",
  "memory_limit"="30%",
  "max_concurrency" = "10",
  "max_queue_size" = "20",
  "queue_timeout" = "3000"
);



You can view created Workload Groups with the Show command, for example:
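A minimal example (assuming the 2.0 SHOW WORKLOAD GROUPS syntax):

```sql
SHOW WORKLOAD GROUPS;
```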

Query queuing

At the same time, we also introduce the function of query queuing in Workload Group. When creating a Workload Group, you can set the maximum number of queries. Queries exceeding the maximum concurrency will be queued for execution.

  • max_concurrency: the maximum number of concurrent queries allowed in the current group; queries exceeding it enter the queuing logic;
  • max_queue_size: the length of the query queue; when the queue is full, new queries are rejected;
  • queue_timeout: how long a query may wait in the queue, in milliseconds; a query waiting longer than this is rejected;
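The admission decision these three parameters describe can be sketched as follows (an illustration of the documented behavior, not Doris source code):

```python
def admit(running, queued, max_concurrency, max_queue_size):
    """Decide what happens to one new query under Workload Group queuing.

    Returns "run" if a concurrency slot is free, "queue" if the queue
    still has room, and "reject" when the queue is already full.
    """
    if running < max_concurrency:
        return "run"
    if queued < max_queue_size:
        return "queue"   # waits up to queue_timeout ms, then is rejected
    return "reject"

print(admit(running=9, queued=0, max_concurrency=10, max_queue_size=20))   # run
print(admit(running=10, queued=5, max_concurrency=10, max_queue_size=20))  # queue
print(admit(running=10, queued=20, max_concurrency=10, max_queue_size=20)) # reject
```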

Reference document: https://doris.apache.org/zh-CN/docs/dev/admin-manual/workload-group/

Say goodbye to OOM completely

When memory is sufficient, memory management is usually invisible to users, but real scenarios are full of extreme cases that challenge memory performance and stability. Especially for complex computations and large-scale jobs that consume huge amounts of memory, an OOM may fail queries or even bring down the BE process.

Therefore, we gradually unified the in-memory data structures, refactored the MemTracker, added query memory soft limits, introduced a GC mechanism for when process memory exceeds its limit, and optimized high-concurrency query performance. In 2.0 we introduced a brand-new memory management framework: through accurate memory allocation, tracking, and control, it essentially eliminated memory hotspots and OOM-induced BE crashes in benchmarks, stress tests, and feedback from real user scenarios. Even when an OOM does occur, the log can usually pinpoint the memory location for tuning so the cluster returns to stability. Memory limits for queries and imports are also more flexible, so users need not think about memory at all when it is sufficient.

Through the above series of optimizations, Apache Doris version 2.0 can effectively control memory resources when dealing with complex calculations and large-scale ETL/ELT operations, and the system stability has been improved to a higher level.

Detailed introduction: https://mp.weixin.qq.com/s/Z5N-uZrFE3Qhn5zTyEDomQ

Efficient and stable data writing

Higher real-time data writing efficiency

Import performance has been further improved

Focusing on real-time analysis, we have continuously strengthened real-time capabilities over the past several versions, with end-to-end real-time data writing as an important optimization direction. Apache Doris 2.0 strengthens this further: through optimizations such as removing the Skiplist from the MemTable, parallel flushing, and single-replica import, import performance improves significantly:

  • Importing the raw data of the TPC-H 144 GB lineitem table with Stream Load into a 48-bucket, 3-replica Duplicate table: throughput improved by 100%.
  • Importing the raw data of the TPC-H 144 GB lineitem table with Stream Load into a 48-bucket, 3-replica Unique Key table: throughput improved by 200%.
  • Importing the TPC-H 144 GB lineitem table with INSERT INTO SELECT into a 48-bucket Duplicate table: throughput improved by 50%.
  • Importing the TPC-H 144 GB lineitem table with INSERT INTO SELECT into a 48-bucket Unique Key table: throughput improved by 150%.

High-frequency data writing is more stable

During high-frequency writes, small-file compaction and write amplification, along with the disk I/O and CPU overhead they incur, are key constraints on system stability. In 2.0 we therefore introduced Vertical Compaction and Segment Compaction to thoroughly solve compaction memory pressure and the problem of too many segment files during writing: resource consumption drops by 90%, speed increases by 50%, and memory usage is only 10% of before.

Detailed introduction: https://mp.weixin.qq.com/s/BqiMXRJ2sh4jxKdJyEgM4A

Data table structure automatic synchronization

In previous versions we introduced millisecond-level schema changes, and in the latest Flink-Doris-Connector we have achieved one-click synchronization of an entire database from relational databases such as MySQL to Apache Doris. In actual tests, a single synchronization job can carry real-time parallel writes of thousands of tables, completely replacing the tedious and complicated synchronization processes of the past: the table structures and data of upstream business databases can be synchronized with simple commands. Meanwhile, when the upstream schema changes, the change is captured automatically and the DDL is dynamically synchronized to Doris, keeping the business running seamlessly.

Detailed introduction: https://mp.weixin.qq.com/s/Ur4VpJtjByVL0qQNy_iQBw

The primary key model supports partial column updates

In Apache Doris 1.2 we introduced the Merge-on-Write mode of the Unique Key model, which keeps downstream queries efficient and stable while upstream data is frequently written and updated, unifying real-time writing with extremely fast queries. Version 2.0 fully enhances the Unique Key model. Functionally, it adds partial-column updates: when multiple upstream source tables write at the same time, there is no need to build a wide table in advance; the join is effectively completed at write time through partial-column updates, greatly simplifying wide-table ingestion.

In terms of performance, 2.0 greatly improves the Merge-on-Write Unique Key model's large-volume write performance and concurrent write capability. Compared with 1.2, large-volume imports are over 50% faster and high-concurrency imports are over 10 times faster. An efficient concurrency mechanism thoroughly solves the publish timeout problem (Error -3115), and thanks to the efficient compaction in Doris 2.0 the too-many-versions problem (Error -235) no longer occurs. This lets Merge-on-Write replace Merge-on-Read in far more scenarios. We also use partial-column updates to lower the computation cost of UPDATE and DELETE statements, improving their overall performance by about 50%.

Example of usage of partial column update (Stream Load):

For example, the table structure is as follows

mysql> desc user_profile;
+------------------+-----------------+------+-------+---------+-------+
| Field            | Type            | Null | Key   | Default | Extra |
+------------------+-----------------+------+-------+---------+-------+
| id               | INT             | Yes  | true  | NULL    |       |
| name             | VARCHAR(10)     | Yes  | false | NULL    | NONE  |
| age              | INT             | Yes  | false | NULL    | NONE  |
| city             | VARCHAR(10)     | Yes  | false | NULL    | NONE  |
| balance          | DECIMALV3(9, 0) | Yes  | false | NULL    | NONE  |
| last_access_time | DATETIME        | Yes  | false | NULL    | NONE  |
+------------------+-----------------+------+-------+---------+-------+



If you want to batch-update the balance and last access time of users whose data changed in the last 10 seconds, you can organize the data in the following CSV file:

1,500,2023-07-03 12:00:01
3,23,2023-07-03 12:00:02
18,9999999,2023-07-03 12:00:03



Then, via Stream Load, add the header partial_columns:true and specify the columns to import to complete the update:

curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," \
  -H "columns:id,balance,last_access_time" -T /tmp/test.csv \
  http://127.0.0.1:48037/api/db1/user_profile/_stream_load



Extreme elasticity and storage-compute separation

In the past, Apache Doris's many ease-of-use designs helped users save substantially on compute and storage costs, and we have now taken a solid step toward a future-oriented cloud-native architecture.

Starting from the trend of cost reduction and efficiency increase, user requirements for computing and storage resources can be summarized as follows:

  • Elastic compute resources: quickly scale out for business peaks to improve efficiency, and quickly scale in during troughs to cut costs;
  • Lower storage costs: introduce cheaper storage media for massive data, with storage and compute scaling separately without interfering with each other;
  • Business workload isolation: different business workloads use independent compute resources to avoid contention;
  • Unified data governance: a unified catalog and unified data management make data analysis more convenient.

The integrated storage-compute architecture is simple and easy to maintain when elasticity requirements are low, but it has limitations when strong elasticity is required. Storage-compute separation is essentially a technique for achieving resource elasticity: its elasticity advantages are obvious, but it places higher stability demands on storage, and storage stability in turn affects the stability of OLAP. Therefore a series of mechanisms such as cache management, compute resource management, and garbage collection must be introduced.

In exchanges with users in the Apache Doris community, we found that their needs for storage-compute separation fall into three categories:

  • Users who currently choose the simple, easy-to-use integrated storage-compute architecture and have no need for resource elasticity for the time being;
  • Users who lack stable large-scale storage and need elasticity, workload isolation, and low cost on top of Apache Doris;
  • Users who do have stable large-scale storage and need an extremely elastic architecture for rapid resource scaling, and therefore a more thorough storage-compute separation architecture.

To meet the needs of the first two groups, Apache Doris 2.0 provides an upgrade-compatible storage-compute separation solution:

The first piece is compute nodes. In 2.0 we introduced the stateless Compute Node, dedicated to data lake analysis. Unlike the original hybrid nodes that integrate storage and compute, Compute Nodes hold no data and need no data-shard rebalancing when the cluster scales out or in, so in spiky scenarios such as data lake analysis they can elastically scale and quickly join the cluster to share the computing load. Meanwhile, since user data usually sits in remote storage such as HDFS/S3, queries are dispatched preferentially to Compute Nodes during execution, avoiding compute contention between internal and external queries.

Reference document: https://doris.apache.org/zh-CN/docs/dev/advanced/compute_node

The second piece is hot-cold tiering. On the storage side, hot and cold data face different query frequencies and response-time requirements, so cold data can usually be stored on cheaper media. In past versions, Apache Doris supported lifecycle management of table partitions, automatically cooling hot data from SSD to HDD through background tasks; but data on HDD is still stored with multiple replicas, which does not maximize cost savings, so cold-data storage costs still had plenty of room for optimization. Apache Doris 2.0 introduces hot-cold data tiering, which lets Doris sink cold data to object storage with far lower storage costs. At the same time, cold data on object storage changes from multiple replicas to a single replica, cutting storage cost further to one third of before and also reducing the associated compute and network overhead. By our estimates, storage costs can drop by up to 70%!
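The one-third figure follows from the replica change alone; a rough back-of-envelope calculation (with illustrative unit prices, not a quote) shows how the numbers combine:

```python
def cold_storage_ratio(replicas_hot=3, replicas_cold=1,
                       price_hot=1.0, price_cold=1.0):
    """Relative storage bill after tiering: same bytes, replicas * unit price."""
    return (replicas_cold * price_cold) / (replicas_hot * price_hot)

# Moving from 3 replicas on local disks to 1 replica on object storage,
# at an equal (illustrative) unit price, cuts the bill to one third:
print(round(cold_storage_ratio(), 3))      # 0.333
print(round(1 - cold_storage_ratio(), 2))  # 0.67 -> roughly the quoted savings
```

A lower per-GB price for object storage (common in practice) pushes the savings higher still.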

Reference document: https://doris.apache.org/zh-CN/docs/dev/advanced/cold_hot_separation

Compute Nodes will subsequently support querying cold data and data on storage nodes, completing this upgrade-compatible storage-compute separation solution.


To meet the needs of the third group, we will also contribute the SelectDB Cloud storage-compute separation solution back to the community. In terms of performance, functional maturity, and system stability, this solution has withstood the test of the production environments of hundreds of companies. We will share the actual progress of this integration in due course.

A log analysis solution that is more than 10 times more cost-effective

From classic OLAP scenarios such as real-time reporting and ad-hoc analysis in the past, to broader business scenarios such as ELT/ETL and log retrieval and analysis, Apache Doris keeps expanding the boundary of its application scenarios, and unified storage and analysis of log data is exactly an important breakthrough of version 2.0.

In the past, typical log storage and analysis architectures in the industry struggled to combine high-throughput real-time writes, low-cost large-scale storage, and high-performance text retrieval and analysis, and could only trade off one or more of them. In Apache Doris 2.0 we introduced a brand-new inverted index to support full-text search on string types and equality and range search on common numeric and date types. We further optimized the inverted index's query performance to better fit log analysis scenarios, and combined it with Doris's existing strengths in large-scale data writing and low-cost storage to deliver a far more cost-effective log analysis solution.
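The core idea of an inverted index (a toy sketch for intuition, not Doris's on-disk format) is a term-to-row-id map that turns full-text search into set lookups instead of scanning every row:

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each token to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return index

def match_all(index, *terms):
    """Row ids containing every term (an AND full-text search)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

logs = ["ERROR disk full on be-01",
        "INFO checkpoint done",
        "ERROR network timeout on be-02"]
idx = build_inverted_index(logs)
print(sorted(match_all(idx, "error", "on")))  # [0, 2]
```

The production version adds tokenizers, compressed posting lists, and range-encoded numeric terms, but the lookup shape is the same.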

In tests on the same hardware configuration and data set, Apache Doris achieved 4x the log write speed of Elasticsearch, 80% less storage space, and 2x the query performance; combined with the hot-cold data tiering introduced in Apache Doris 2.0, overall cost-effectiveness improves by more than 10 times.

Beyond the optimizations for log analysis, in terms of complex data types we added the new Map and Struct types, including efficient writing, storage, and analysis for them and nesting between types, to better support multi-modal data analysis.

Detailed introduction: https://mp.weixin.qq.com/s/WJXKyudW8CJPqlUiAro_KQ

More comprehensive and higher performance data lake analysis capabilities

In Apache Doris 1.2 we released the Multi-Catalog feature, which automatically maps and synchronizes metadata from many heterogeneous data sources for seamless data lake access. Thanks to many optimizations in data reading, the execution engine, and the query optimizer, Apache Doris queries lake data 3-5x faster than Presto/Trino on standard test sets.

In 2.0 we further strengthened data lake analysis capabilities, not only supporting more data sources but also making many optimizations for users' real production environments, delivering significantly better performance than 1.2 under real workloads.

More data source support

Data permission control

  • Supports authorization for Hive Catalog via Apache Ranger, seamlessly connecting with users' existing permission systems; also supports extensible authorization plugins to implement custom authorization for any Catalog. Reference document: https://doris.apache.org/zh-CN/docs/dev/lakehouse/multi-catalog/hive

Performance further optimized, with improvements of up to dozens of times

  • Optimized read performance in scenarios with many small files and in wide-table scenarios. Fully loading small files, merging small I/Os, and prefetching data significantly reduce the read overhead on remote storage, improving query performance in such scenarios by up to dozens of times.
  • Optimized ORC/Parquet file reading, doubling query performance compared to version 1.2.

  • Supports local file caching of lake data. Local disks can cache data from remote storage systems such as HDFS or object storage, so queries touching the same data are accelerated by the cache. On a local cache hit, querying lake data through Apache Doris can match the performance of Doris internal tables, greatly improving query performance on hot lake data. Reference document: https://doris.apache.org/zh-CN/docs/dev/lakehouse/filecache
  • Supports statistics collection for external tables. As with Apache Doris internal tables, users can collect statistics for a specified external table with the Analyze statement; combined with the new Nereids query optimizer, this produces more accurate and intelligent plans for complex SQL. Taking the TPC-H standard test set as an example, optimal query plans and better performance are obtained without manually rewriting SQL. Reference document: https://doris.apache.org/zh-CN/docs/dev/lakehouse/multi-catalog/
  • Optimized the data write-back performance of JDBC Catalog. Through PrepareStmt and batch mode, writing data back to MySQL, Oracle, and other relational databases via the INSERT INTO command and JDBC Catalog is dozens of times faster.

High concurrent data service support

Unlike complex SQL and large-scale ETL jobs, Data Serving scenarios such as bank transaction lookups, insurance agent policy queries, e-commerce order history, and express waybill queries involve large numbers of frontline staff and end users retrieving whole rows of data by primary key ID. In the past, such needs often required introducing KV systems such as Apache HBase for point queries, plus Redis as a cache layer to absorb the system pressure of high concurrency.

For Apache Doris, built on a columnar storage engine, such point queries amplify random read I/O on wide tables with hundreds of columns, and the execution engine's parsing and distribution of such simple SQL adds unnecessary extra overhead, so a more efficient and concise execution path is needed. Therefore, the new version introduces hybrid row-columnar storage and a row-level cache, making whole-row reads efficient and greatly reducing disk accesses; a short-circuit optimization for point queries that skips the execution engine and retrieves the required data directly along a fast, efficient read path; and prepared statement reuse that avoids repeated SQL parsing to reduce FE overhead.

Through this series of optimizations, Apache Doris 2.0 achieves an order-of-magnitude improvement in concurrency! In the standard YCSB benchmark, a single cloud server with 16 cores, 64 GB memory, and 4x1 TB disks reached 30,000 QPS of point-query concurrency on a single node, more than 20x the point-query concurrency of the previous version! With these capabilities, Apache Doris can better serve high-concurrency data serving scenarios, replace HBase in such scenarios, and reduce the maintenance costs and redundant data storage of a complex technology stack.

Reference document: https://doris.apache.org/zh-CN/docs/dev/query-acceleration/hight-concurrent-point-query

Detailed introduction: https://mp.weixin.qq.com/s/Ow77-kFMWXFxugFXjOPHhg

CCR cross-cluster data synchronization

To synchronize data between clusters, users previously had to periodically back up and restore data via the Backup/Restore commands, which was complicated to operate, incurred high synchronization latency, and required intermediate storage. To meet the need for automatic cross-cluster synchronization of databases and tables, 2.0-beta adds CCR (cross-cluster replication), which synchronizes data changes from a source cluster to a target cluster at the database or table level, improving the availability of online services and better enabling read-write separation and multi-datacenter backup.

Support Kubernetes containerized deployment

In the past, Apache Doris nodes communicated by IP address, so when deploying in a Kubernetes environment, Pod IP drift caused by host failures could render the cluster unavailable. Version 2.0 supports FQDN-based communication, allowing Apache Doris nodes to self-heal without manual intervention and making the system better suited to Kubernetes deployment and elastic scaling.
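The benefit of FQDN-based membership can be shown with a small sketch: peers stored by stable hostname are resolved at connect time, so a changed IP (as after a Pod reschedule) is picked up automatically, while a pinned IP goes stale. The "DNS table" here is a plain dict for illustration; a real deployment relies on cluster DNS, and all names are hypothetical.

```python
# Toy DNS table mapping a stable FQDN to the node's current IP.
dns = {"be-0.doris.svc": "10.0.0.5"}

peers_by_ip = ["10.0.0.5"]          # fragile: pinned to the old address
peers_by_fqdn = ["be-0.doris.svc"]  # stable: resolved on every connect

def connect(address):
    """Resolve hostnames at connect time; pass raw IPs through unchanged.
    Returns True iff some live node currently owns the resolved IP."""
    ip = dns.get(address, address)
    return ip in dns.values()

# Pod is rescheduled and comes back with a new IP (IP drift).
dns["be-0.doris.svc"] = "10.0.0.9"

print(connect(peers_by_ip[0]))    # False: the old IP no longer exists
print(connect(peers_by_fqdn[0]))  # True: the FQDN resolves to the new IP
```

Resolving at connect time rather than caching an address is what lets the cluster self-heal after a reschedule without manual re-registration of nodes.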

Reference document: https://doris.apache.org/zh-CN/docs/dev/install/k8s-deploy/

Other upgrade considerations

  • A rolling upgrade from 1.2-lts to 2.0-beta is supported; upgrading from 2.0-alpha to 2.0-beta requires downtime;
  • The new query optimizer is enabled by default (enable_nereids_planner=true);
  • Non-vectorized code has been removed from the system, so the parameter enable_vectorized_engine no longer takes effect;
  • Added the parameter enable_single_replica_compaction;
  • Tables are created with datev2, datetimev2, and decimalv3 by default; creating tables with datev1, datetimev1, and decimalv2 is no longer supported;
  • decimalv3 is used by default in the JDBC and Iceberg Catalogs;
  • Added the AGG_STATE data type;
  • Removed the cluster column from the backends system table;
  • For better compatibility with BI tools, datev2 and datetimev2 are displayed as date and datetime in show create table;
  • Added max_openfiles and swap checks to the BE startup script, so BE may fail to start if the system is misconfigured;
  • Passwordless login to FE via localhost is no longer allowed;
  • When there are multiple catalogs in the system, queries against the information schema return data for the internal catalog only by default;
  • Limited the depth of the expression tree; the default is 200;
  • Single quotes in array-of-string return values are now double quotes;
  • Renamed the Doris process names to DorisFE and DorisBE;

Embark on a 2.0 Journey

It has been a month and a half since the release of Apache Doris 2.0-alpha. During this time, while accelerating the development of core features, we have also collected hands-on experience and real feedback on the new version from hundreds of companies. This feedback from real business scenarios has been of great help in polishing and further improving the features, so the 2.0-beta version already offers a better user experience in terms of functional completeness and system stability. All users who need the new capabilities of version 2.0 are welcome to deploy and upgrade.

If you encounter any problems while researching, testing, deploying, or upgrading to version 2.0, please submit the questionnaire, and core community contributors will provide dedicated one-on-one support. We also hope that version 2.0 will bring a real-time unified analysis experience to more users in the community, and we believe Apache Doris 2.0 will become your ideal choice for real-time analysis scenarios.
