Walking with Innovators: Apache Doris in 2023

At the recently concluded Doris Summit Asia 2023, Yi Guolei, a member of the Apache Doris PMC and vice president of technology at Feilun Technology, gave a keynote speech titled "Walking with Innovators". He reviewed the technological breakthroughs and community development of Apache Doris over the past year, reflected on the challenges and opportunities in real-time analysis of massive data, and gave a comprehensive introduction to the future iteration plan and evolution direction of Apache Doris.

The following is an edited excerpt of Yi Guolei's speech at the summit, narrated in the first person.

It is great to see so many friends gathered here, with the venue almost full. I believe many more friends are watching the live broadcast online, sharing this moment with us from a distance.

This year's theme is "Walking with Innovators". Normally the theme would be explained at the beginning, but I would like to leave that for the end. Please follow along with me; I believe that after today's talk, everyone will have a deeper understanding of this theme.

Apache Doris in 2023

01 From “comprehensive evolution” to “milestone leap”

Looking back on the development of Apache Doris, if we had to describe its gains in 2022 with one phrase, it would be "comprehensive evolution after long accumulation". In the versions released in 2022, we fully enabled the vectorized execution engine, implemented the Merge-on-Write data update mode on the primary key model, and introduced major features such as Multi-Catalog, a unified framework for connecting to data lakes, and millisecond-level Schema Change. Performance, stability, and ease of use all evolved comprehensively.

In 2023, we ushered in the landmark version 2.0. The contributor and commit data show that version 2.0 merged more than 4,100 PRs, an increase of 70% compared to version 1.2 and nearly 10 times the number in version 1.1, released in the same period last year. The number of contributors involved in developing this version reached 275. The release also marks our "milestone leap":

  • With an adaptive parallel execution model and a new query optimizer, blind-test performance improved by 10 times, multi-table joins by 13 times, single-table scenarios by 10 times, and high-concurrency point queries by 20 times;
  • Doris expanded from typical OLAP scenarios such as reporting and ad hoc queries to lakehouse integration, high-concurrency data serving, and log retrieval and analysis, supporting more unified and diverse analysis scenarios;
  • It supports high-throughput real-time writes with second-level latency and complete support for all types of data updates, building a more efficient, easy-to-use, and stable real-time data processing and analysis pipeline.

02 One of the most active open source big data projects in the world!

In terms of the community ecosystem, the Apache Doris community is also increasingly prosperous, as the developer scale and activity indicators show:

  • Apache Doris has received more than 9,800 stars on GitHub, an increase of nearly 70% compared with the same period last year, and the number keeps growing;
  • The total number of contributors has grown to nearly 580, with many new faces starting to contribute to the community every week;
  • The average number of monthly active contributors has stabilized at around 120, significantly exceeding other leading open source big data projects worldwide, including Spark, Elasticsearch, Trino, and Druid;
  • These contributors submit more than 160 PRs to Apache Doris every week. The community has also established a more mature and stable CR pipeline: every merged change goes through 3,000+ test cases, which lets the community iterate extremely quickly while still ensuring stability.

These numbers all show that Apache Doris has become one of the most active open source big data projects in the world.

In addition, contributors come from increasingly diverse sources, widely distributed among domestic database unicorns and many top-tier Internet companies. Leading cloud vendors such as Alibaba Cloud, Tencent Cloud, Huawei Cloud, Baidu Smart Cloud, Tianyi Cloud, and Volcano Engine have also invested in community building and offer managed cloud data warehouse services based on Apache Doris, giving open source users more choices.

03 The de facto standard in the open source real-time data warehouse field!

As the technology iterates faster, more and more users are choosing to trust Apache Doris. The community has gathered more than 30,000 engineers from database and big data fields to enjoy the ultimate analysis experience that Apache Doris brings.

In the past, many community users had the impression that Apache Doris was mostly used by top-tier Internet companies such as Baidu, Meituan, Xiaomi, JD.com, and Tencent. Today the industries covered are far broader: in finance, government and enterprise, telecommunications, manufacturing, transportation, logistics, fast-moving consumer goods, and more, many companies are applying Apache Doris in their core analysis business.

I am happy to announce that the number of Apache Doris users worldwide has now exceeded 4,000! The vast majority of these enterprise users are in direct contact with us. Whether they give feedback on requirements, participate in testing, submit code, or share practical experience, they are all giving back to the community in their own way. Many of them are also speaking at today's summit, and we hope their experiences in real business scenarios can inspire more people.

With such a large user base, Apache Doris has become the first choice of real-time data warehouse for users from all walks of life, and the de facto standard in the open source real-time data warehouse field!

How we meet the challenges of real-time analytics

Since its inception, Apache Doris has been committed to solving the problem of real-time analysis of massive data. The history of past versions clearly shows that, to better respond to users' challenges in real business scenarios, Apache Doris keeps evolving along three major trends: real-time analysis, integration and unification, and cloud native. These are also the directions we focused on in 2023:

  • Real-time analysis: Achieve ultimate query performance on large-scale real-time data, including high-throughput real-time writes and updates as well as lower query latency;
  • Integration and unification: Support multiple analysis workloads in one system, reducing the operation and maintenance costs caused by complex architectures. Beyond strengthening the report analysis and ad hoc queries that Apache Doris has always been good at, scenarios such as lakehouse federated analysis, log retrieval and analysis, ETL/ELT query acceleration, and high-concurrency Data Serving are also important breakthrough directions;
  • Cloud native: Innovate on cloud computing infrastructure, use the extreme elasticity of the cloud to reduce storage and computing costs, and support deployment and operation in K8s containers and other environments.

01 Ultimate query performance

As mentioned at the beginning, Apache Doris 2.0 achieved a more than 10-fold improvement in blind-test performance. The most important contributors are the CBO query optimizer and the adaptive Pipeline parallel execution model.

CBO query optimizer: In the past, Apache Doris mostly served online reporting businesses. In these scenarios, data was often flattened into wide tables for analysis, and any multi-table joins were usually relatively simple, so the key to performance lay in scan and aggregation efficiency. As more and more users perform complex calculations or ELT/ETL batch processing on Apache Doris, the space-for-time approach of wide tables or pre-aggregated tables no longer works well, SQL has to be tuned and rewritten by hand, and query performance suffers. To this end, we spent a lot of time rebuilding the query optimizer and officially released it in Apache Doris 2.0. Faced with complex SQL running to tens of thousands of lines or joins across dozens of tables, the CBO optimizer can generate a more efficient query plan and improve query performance, reducing the labor and mental cost of manual tuning.

Pipeline parallel execution model: In past versions, BE execution concurrency had to be adjusted manually when a query was issued, and large and small queries competed for resources when run in the same cluster. To this end, we introduced the Pipeline execution model as the query execution engine. The system adjusts execution parallelism automatically and keeps large and small queries running stably. It also improves the CPU utilization of Apache Doris, so both query performance and stability are comprehensively improved in mixed-load scenarios.

At the same time, Apache Doris 2.0.0 introduces new hybrid row-column storage and a row-level cache, which make reading an entire row at a time more efficient and greatly reduce the number of disk accesses. It introduces a short-path optimization for point queries that skips the execution engine and retrieves the required data directly through a fast read path. It also introduces prepared statements, which reuse parsed SQL to reduce FE overhead. Together these deliver an order-of-magnitude improvement in concurrency.

In high-concurrency Data Serving scenarios, a single node reaches 30,000 QPS, more than 20 times the point query concurrency of previous versions.

In multi-dimensional retrieval scenarios, we also introduced inverted indexes, which significantly improve query performance and concurrency for keyword fuzzy queries, equality queries, and range queries.
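
For readers unfamiliar with the term, the general idea of an inverted index can be sketched in a few lines of Python. This is a toy illustration of the data structure only, not how Doris implements it; the function names and sample log lines are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each lowercase token to the sorted list of row ids containing it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return {token: sorted(ids) for token, ids in index.items()}

def match_all(index, keywords):
    """Return the row ids that contain every keyword (an AND query)."""
    postings = [set(index.get(k.lower(), ())) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical log lines used only for this illustration.
logs = [
    "disk full on node a",
    "query timeout on node b",
    "disk error on node b",
]
index = build_inverted_index(logs)
hits = match_all(index, ["disk", "node"])  # rows 0 and 2 contain both words
```

A keyword lookup touches only the posting lists of the requested tokens instead of scanning every row, which is why this structure suits retrieval-style queries.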

02 Real-time writing and updating

Import performance optimization: Real-time analysis capabilities have been continuously enhanced over the past few versions, and end-to-end real-time data writing is an important optimization direction. Apache Doris 2.0 strengthens it further: through optimizations such as parallel Memtable flushing and single-replica import, real-time import performance improves by 2-8 times.

Merge-on-Write: The Merge-on-Write data update mode of the Unique Key primary key model was first introduced in Apache Doris 1.2. In version 2.0, this capability has been further optimized and its stability greatly improved. Write-performance optimizations bring the peak throughput of Upsert operations to 400,000 rows per second on a single node. Associated updates and partial column updates were also introduced, completing support for the various update operations.
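
One way to picture the Merge-on-Write idea is the toy model below: key conflicts are resolved when data is written, so reads only need to skip positions marked as superseded. This is a conceptual sketch under simplifying assumptions (a single in-memory list, a Python set standing in for a delete bitmap); the class and method names are hypothetical and do not reflect Doris internals.

```python
class MergeOnWriteTable:
    """Toy primary-key table that resolves key conflicts at write time."""

    def __init__(self):
        self.rows = []        # append-only list of (key, columns) versions
        self.deleted = set()  # positions of superseded versions
        self.key_to_pos = {}  # primary key -> position of the live version

    def upsert(self, key, columns, partial=False):
        old_pos = self.key_to_pos.get(key)
        if old_pos is not None:
            if partial:
                # Partial column update: keep columns the write did not mention.
                columns = {**self.rows[old_pos][1], **columns}
            self.deleted.add(old_pos)  # mark the old version at write time
        self.rows.append((key, dict(columns)))
        self.key_to_pos[key] = len(self.rows) - 1

    def scan(self):
        # Reads skip marked positions; no merging of versions is needed here.
        return [row for pos, row in enumerate(self.rows) if pos not in self.deleted]

table = MergeOnWriteTable()
table.upsert(1, {"name": "a", "city": "x"})
table.upsert(2, {"name": "b", "city": "y"})
table.upsert(1, {"city": "z"}, partial=True)  # only the city column changes
result = table.scan()
```

The trade-off this model shows is that writes do a little more work (a key lookup and a mark) so that queries avoid merging versions at read time.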

Learn more: 10x query performance improvement, design and implementation of new Unique Key | Interpretation of new features

03 More analysis scenarios

Lakehouse integration: Apache Doris 1.2 introduced the Multi-Catalog feature, which supports automatic mapping and synchronization of metadata from multiple heterogeneous data sources, making it easy to connect to both metadata and data. Version 2.0.0 further strengthens federated analysis, adds more data sources, and includes many performance optimizations for users' real production environments, greatly improving query performance under real workloads. The framework also makes cross-source data synchronization easier: data can be written into Doris quickly with a simple insert into select.

Learn more: Query performance is 3-10 times improved compared to Trino/Presto! In-depth interpretation of Apache Doris extremely fast data lake analysis

Semi-structured data analysis and log retrieval analysis: Apache Doris 2.0.0 provides native semi-structured data support, adding the complex type Map to the existing JSON and Array types and implementing Schema Evolution on top of Light Schema Change. The inverted index and high-performance text analysis algorithm newly introduced in version 2.0.0 also comprehensively strengthen Apache Doris in log retrieval and analysis scenarios, supporting more efficient analysis on arbitrary dimensions and full-text retrieval. Combined with its existing advantages in large-scale data writing and low-cost storage, a new-generation log retrieval and analysis platform built on Apache Doris achieves more than a 10-fold improvement in cost performance over common log analysis solutions in the industry.

Learn more: How to build a new generation of log analysis platform based on Apache Doris|Solution

More refined multi-tenant and resource management: When a single cluster handles multiple analysis workloads, the question becomes how to keep them from preempting each other's resources. For this reason, version 2.0 introduces a resource isolation solution based on Workload Groups, which allows flexible allocation and control of memory and CPU resources. We also introduced query queuing: when creating a Workload Group, you can set a maximum number of concurrent queries, and queries beyond that limit are queued until they can run, relieving pressure on the system under high load.
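
The queuing behavior described above can be pictured with a small simulation. This is a minimal sketch, not Doris code: the class, the `max_queue_size` limit, and the "rejected" outcome are assumptions added for illustration; the text itself only describes a concurrency limit with queuing.

```python
from collections import deque

class WorkloadGroup:
    """Toy admission control: run up to a limit, queue the rest."""

    def __init__(self, max_concurrency, max_queue_size):
        self.max_concurrency = max_concurrency
        self.max_queue_size = max_queue_size  # assumed limit, for illustration
        self.running = set()
        self.waiting = deque()

    def submit(self, query_id):
        if len(self.running) < self.max_concurrency:
            self.running.add(query_id)
            return "running"
        if len(self.waiting) < self.max_queue_size:
            self.waiting.append(query_id)
            return "queued"
        return "rejected"

    def finish(self, query_id):
        self.running.discard(query_id)
        # Admit the oldest waiting query as soon as a slot frees up.
        if self.waiting and len(self.running) < self.max_concurrency:
            self.running.add(self.waiting.popleft())

group = WorkloadGroup(max_concurrency=2, max_queue_size=1)
states = [group.submit(q) for q in ["q1", "q2", "q3", "q4"]]
group.finish("q1")  # frees a slot, so the queued q3 starts running
```

Capping concurrency this way bounds the memory and CPU a group can consume at once, at the cost of extra waiting time for queries that arrive during a burst.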

04 Low cost and high availability

Reduce storage costs: Hot and cold data are queried at different frequencies and have different response-time requirements, so cold data can usually be kept on lower-cost storage media. Version 2.0 therefore introduces hot and cold data tiering, which lets Apache Doris move cold data down to lower-cost object storage. Cold data on object storage is also stored as a single replica rather than multiple replicas, further cutting its storage to one-third of the original while also reducing the computing and network overhead that storage incurs. By actual calculation, storage costs can be reduced by more than 70%.
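
The arithmetic behind these savings can be sketched as follows. All inputs here are made-up illustrative values (three replicas for hot data, 90% of data being cold, object storage at half the unit price); they are assumptions for the example, not figures from the talk.

```python
def tiered_storage_cost(data_tb, cold_fraction, hot_price, cold_price,
                        hot_replicas=3, cold_replicas=1):
    """Total cost when hot data keeps several replicas and cold data keeps one."""
    hot_tb = data_tb * (1 - cold_fraction)
    cold_tb = data_tb * cold_fraction
    return hot_tb * hot_replicas * hot_price + cold_tb * cold_replicas * cold_price

# Effect of the replica count alone: same unit price, all data cold.
all_hot = tiered_storage_cost(100, 0.0, hot_price=1.0, cold_price=1.0)
all_cold = tiered_storage_cost(100, 1.0, hot_price=1.0, cold_price=1.0)
replica_ratio = all_cold / all_hot  # one replica instead of three

# Hypothetical mix: 90% cold data on storage that costs half as much per TB.
baseline = tiered_storage_cost(100, 0.0, hot_price=1.0, cold_price=0.5)
tiered = tiered_storage_cost(100, 0.9, hot_price=1.0, cold_price=0.5)
saving = 1 - tiered / baseline  # roughly three quarters under these assumptions
```

The actual saving depends on how much data is cold and on real storage prices, so these numbers should be replaced with measurements from your own cluster.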

Learn more: How Apache Doris hot and cold tiering technology can reduce storage costs by 70%? |New version features

Support deployment in public cloud, private cloud, and K8s: To meet users' needs to deploy in public cloud, private cloud, K8s, and other environments, we developed a K8s Operator, which can deploy all node types, including FE, BE, Compute Node, and Broker, and supports operation and maintenance tasks such as scaling out and in and health checks. It also supports Auto Scaling of Compute Nodes, which scales capacity automatically according to node load. This feature has been trialled at scale among community users and will be officially released in a subsequent version.

Cross-cluster replication: Apache Doris 2.0.0 also introduces the CCR feature, which synchronizes data changes from a source cluster to a target cluster at the database or table level. This enables read-write load separation and multi-data-center backup, and better supports cross-cluster replication and disaster recovery requirements in different scenarios.

The next step toward real-time analytics

Having reviewed the progress of 2023, it is time to talk about what we are doing now and what we will do next.

Positioned as a real-time data warehouse, Apache Doris will continue along the three major directions of real-time analysis, integration and unification, and cloud native. Much meaningful work is under way in each direction.

01 Faster analysis performance and more real-time data writing and updating

Query engine: In the upcoming version 2.1, the CBO query optimizer will collect statistics fully automatically and provide rich Hint syntax, so that rules can be adjusted manually when the optimizer's rules fail. We will also release a TPC-DS performance test report. Spilling query operators to disk and multi-table materialized views, both long requested by community users, will be added in version 2.1 as well. We will also introduce parallel execution of the Union All operator to further speed up ETL operations, so that users can run large-scale data processing on Apache Doris faster, more stably, and more easily. Finally, we will introduce a new Join algorithm to further double the performance of multi-table joins.

Real-time data writing: We will unify the semantics of all data writes. Whether the source is a relational database, a data stream, a local file, or a data lake file, Apache Doris will present it uniformly as a relational table, and data will be written through the unified semantics of insert into. We will also simplify the write path by writing data through built-in Job scheduling, avoiding third-party data synchronization components. And we will introduce a server-side batching mechanism: when upstream data is written frequently, accumulating batches on the server avoids small-file merging problems and reduces write pressure on the database.
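
The effect of server-side batch accumulation can be illustrated with a toy buffer that turns many small writes into a few larger flushes. This is a sketch of the general technique only: the class name, the row-count threshold, and the `flush` callback are assumptions for the example, and a real system would typically also flush on a timer.

```python
class BatchAccumulator:
    """Toy write buffer that flushes rows in fixed-size batches."""

    def __init__(self, max_rows, flush):
        self.max_rows = max_rows
        self.flush = flush  # callback that persists one batch
        self.buffer = []

    def write(self, rows):
        self.buffer.extend(rows)
        # Flush full batches; keep any remainder buffered for later writes.
        while len(self.buffer) >= self.max_rows:
            batch = self.buffer[:self.max_rows]
            self.buffer = self.buffer[self.max_rows:]
            self.flush(batch)

    def close(self):
        # Flush whatever is left when the writer shuts down.
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []

flushed = []
accumulator = BatchAccumulator(max_rows=4, flush=flushed.append)
for i in range(10):
    accumulator.write([i])  # ten single-row writes from upstream
accumulator.close()
```

Here ten single-row writes result in three flushes instead of ten, which is the mechanism by which batching reduces the number of small files that later need merging.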

Real-time data updates: Merge-on-Write will be enabled by default, with flexible updates of arbitrary columns. In the future, all data models will be unified on Merge-on-Write, sparing users the choice among multiple data models.

Observability: We will provide a new Profile that makes it easier to see the execution status of each operator. It also supports dynamic display of query progress and can be integrated into Doris Manager for visualization. This feature has been developed and will launch soon in version 2.1.

02 Unification of more query analysis scenarios

Lakehouse integration: We will combine multi-table materialized views with the built-in job scheduling to extend materialized views to multiple data sources in the data lake. With no other components, relying only on our own scheduling, we can implement ETL from the lake into the data warehouse and layered modeling of the warehouse. Version 2.0 already implements write-back for JDBC data sources, and write-back will be extended to Iceberg, Hudi, Paimon, and others, closing the loop on data query and analysis.

Besides reading from more data sources, Apache Doris is also opening up channels for external access to its data. Currently, Doris outputs data through the MySQL connection protocol. In large-scale data reading or data science scenarios (for example, with tools such as Pandas), the throughput of the MySQL protocol becomes a bottleneck. In subsequent versions we are therefore introducing a high-speed data reading interface based on Arrow Flight that transfers data directly from BE. In actual tests, data throughput improved by more than 100 times.

In semi-structured data analysis and log analysis scenarios, we will add inverted index support for more complex types, including Array, Map, and GEO. To meet the demand for schemaless fields in log scenarios, version 2.1 will introduce the Variant data type, which can hold JSON documents of any type and shape and automatically handle added columns or changed types, completely eliminating cumbersome DDL and Schema Change operations.
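
To show why schemaless fields matter for logs, the toy below tracks which keys and value types appear as JSON documents arrive, with no schema declared up front. It only illustrates the problem a type like Variant addresses; it is not how Doris implements Variant, and the function name and sample documents are hypothetical.

```python
import json

def infer_columns(documents):
    """Track the set of value types seen for each top-level JSON key."""
    columns = {}
    for raw in documents:
        for key, value in json.loads(raw).items():
            columns.setdefault(key, set()).add(type(value).__name__)
    return columns

# Hypothetical log records: the second adds a field and changes a type.
docs = [
    '{"status": 200, "path": "/a"}',
    '{"status": "timeout", "latency_ms": 12.5}',
]
columns = infer_columns(docs)
```

With a fixed schema, the new `latency_ms` field and the string-valued `status` would each require a schema change before the record could be stored.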

Workload management: We will continue to explore flexible mixed-load management, supporting creation and management of Workload Groups and adjustment of resource configuration through SQL, ensuring load isolation while maximizing resource utilization.

03 Cloud native and separation of storage and computing

In a previous article, we announced that the storage-compute separation version of SelectDB Cloud would be contributed to the community, but the work of reorganizing the code structure, making compatibility changes, and integrating exceeded our expectations. Fortunately, this work is nearing completion. All code structure adjustments will be completed in Apache Doris 2.1, and the feature is expected to be fully available to the community in version 2.2. By then, everyone will be able to experience the ultimate elasticity of the new cloud-native architecture, so stay tuned.

Walk with innovators

To close, I would like to share a behind-the-scenes story from preparing this summit. We kept thinking about what idea we should convey to all community users, but could not find a precise way to express it.

After reviewing the development process of Apache Doris in the past ten years from its birth to the present, we thought, isn't this a story about technological innovation?

In the era of SQL on Hadoop, Doris chose to be independent of the Hadoop ecosystem: it does not rely on HDFS for data storage or on Zookeeper for distributed coordination, and any process can be scaled out or in online with high availability. Among big data components with differing syntaxes, Doris chose to support standard SQL and be compatible with the MySQL protocol, which greatly lowers the barrier for users. Building on its self-developed pre-aggregation storage engine, materialized views, and MPP execution framework, it makes full use of multi-machine, multi-core parallelism to achieve extremely fast queries on large-scale data... It is precisely because of this insistence on technological innovation that Apache Doris grows ever more vigorous.

To date, we have introduced many innovations in Apache Doris, such as inverted indexes, hybrid row-column storage, millisecond-level online Schema Change, Merge-on-Write, and the Variant data type... Every step continues to lead technological innovation.

So "walking with innovators" has several meanings:

  • We hope to work with open source contributors who love open source technology to bring some changes to the data world with technological innovation;
  • We hope to bring together user representatives who recognize and trust Apache Doris to inspire more people with application innovation in real scenarios;
  • We also hope to work with upstream and downstream partners and cloud service vendors to inject new vitality into the industry with product innovation and bring new choices to all users.

Choosing Apache Doris means choosing to walk with many innovators.

Finally, we pay tribute to every innovator who is chasing the wind and the moon, and we look forward to working with more innovators to explore more possibilities in the data world.

Origin my.oschina.net/u/5735652/blog/10141735