[Big Data] Detailed explanation of Doris’s implementation plan for building a real-time data warehouse (2): Interpretation of Doris’ core functions

1.Doris development history

Apache Doris is a database project developed and open sourced by Baidu. Doris began as an internal project at Baidu in 2008. After five major version iterations, it was open sourced in 2017, and in 2018 it entered the Apache Foundation as an incubating project. Doris 1.0 was officially released on April 18, 2022, and Doris officially graduated on June 16, 2022, becoming a top-level project of the Apache Software Foundation.

The Doris database software is built mainly from two components, BE and FE. BE is the backend data access component, written in C++; FE is the frontend query entry and query parsing component, written in Java.

2.Doris three models

The biggest feature of Doris is that it provides three major data models:

  • The Duplicate Key model is also called the repeatable model or detail model. It is used in the same way as an ordinary database table: every inserted row is retained, and indexes are supported.

  • The Aggregate Key model is also called the aggregation model or summary model. All columns in the table are divided into dimension columns and indicator columns, and indicator data is summarized by dimension, greatly reducing the amount of data.

  • The Unique Key model is also called the deduplication model or unique model. It retains the latest record for each primary key and is used when data needs to be deleted or modified.
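
The difference between the three models is easiest to see on a small example. The following is a minimal Python sketch, not Doris code: the rows, the column names (`user_id`, `city`, `cost`) and the choice of SUM as the aggregation are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows in arrival order: (user_id, city, cost).
rows = [
    (1, "Beijing", 10),
    (2, "Shanghai", 5),
    (1, "Beijing", 7),
    (1, "Shenzhen", 3),
]

def duplicate_model(rows):
    """Duplicate Key: every inserted row is retained as-is."""
    return list(rows)

def aggregate_model(rows):
    """Aggregate Key: (user_id, city) are dimensions, cost is summed."""
    totals = defaultdict(int)
    for user_id, city, cost in rows:
        totals[(user_id, city)] += cost
    return dict(totals)

def unique_model(rows):
    """Unique Key: user_id is the primary key, the latest row wins."""
    latest = {}
    for user_id, city, cost in rows:
        latest[user_id] = (city, cost)
    return latest
```

With these four input rows, the detail model keeps all four, the aggregation model collapses them to three summarized rows, and the unique model keeps one row per `user_id`.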

In addition, Doris also supports various external tables, including ODBC external tables, Hive external tables, ES external tables and Iceberg external tables. They let the Doris query engine directly query relational databases, Hive data warehouses, ES text retrieval data and Iceberg data lake data respectively, which greatly broadens the application boundaries of the Doris database.

3.Doris data import

Although Doris has rich support for external tables, queries over large data volumes are slower on external tables than on internal tables, because of network bottlenecks and because external tables cannot use indexes. This is where Doris's data import capabilities come in. Doris's data import is atomic, which means that a batch of data either imports successfully as a whole or fails as a whole. It also supports a fault-tolerance parameter: an import in which the proportion of abnormal rows stays below a configured threshold is still considered successful.

Doris data import and data migration tools include Insert Into, Stream Load, Broker Load, Routine Load, Binlog Load, Spark Load and DataX import.
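
The all-or-nothing behavior with an error tolerance can be sketched in a few lines of Python. This is only an illustration of the idea, not how Doris implements it: `load_batch` and `validate` are hypothetical names, and the parameter name `max_filter_ratio` is borrowed from the Doris load property of the same name.

```python
def load_batch(table, batch, validate, max_filter_ratio=0.0):
    """Load a batch atomically, tolerating a bounded share of bad rows.

    Returns True if the batch was committed. If the share of invalid rows
    exceeds max_filter_ratio, nothing at all is written to the table.
    """
    good = [row for row in batch if validate(row)]
    bad_count = len(batch) - len(good)
    if batch and bad_count / len(batch) > max_filter_ratio:
        return False  # whole batch rejected, table left untouched
    table.extend(good)  # all valid rows are committed together
    return True
```

For a batch of four rows with one invalid row, the load fails with the default tolerance of 0.0 and leaves the table empty, but succeeds with a tolerance of 0.3 and commits the three valid rows.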

For in-database data processing, Insert Into is preferred; for offline data import, Stream Load and DataX import are preferred; for streaming data ingestion, Routine Load and Binlog Load can be chosen; and for Hive data import, Broker Load and Spark Load can be chosen. It can be seen that Doris supports a wide range of data sources and is very friendly to the various products of the big data ecosystem.

Of course, for small amounts of data, we can also use Insert Into to migrate external data directly through external tables.

4.Doris multi-table association

Next is Doris's multi-table join capability. Doris supports four join distribution strategies: Shuffle Join, Bucket Shuffle Join, Broadcast Join and Colocate Join, which can minimize data redistribution under the MPP architecture and improve query efficiency.

  • Shuffle Join redistributes all data in both joined tables.
  • Bucket Shuffle Join only needs to redistribute data from one of the two joined tables.
  • Broadcast Join broadcasts the full data of the smaller of the two joined tables.
  • Colocate Join completes the join locally without any data redistribution. This is the ideal case for joining large tables.

Each of the four data distribution strategies suits different scenarios. We need to choose among them according to the join at hand so that as little data as possible is redistributed, which reduces network consumption and improves query speed.
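
To compare the strategies, it helps to estimate how many rows each one sends over the network. The Python sketch below is a deliberately simplified cost model and not Doris's planner: `rows_moved` is a hypothetical helper, it assumes the right table is the one redistributed by Bucket Shuffle Join, and it ignores rows that already happen to sit on the correct node.

```python
def rows_moved(strategy, left_rows, right_rows, num_nodes):
    """Rough upper bound on rows sent over the network for one join."""
    if strategy == "shuffle":
        # Both tables are re-hashed on the join key.
        return left_rows + right_rows
    if strategy == "bucket_shuffle":
        # Only one table is re-hashed (assumed here to be the right one).
        return right_rows
    if strategy == "broadcast":
        # The smaller table is copied in full to every node.
        return min(left_rows, right_rows) * num_nodes
    if strategy == "colocate":
        # Matching rows already live on the same node.
        return 0
    raise ValueError(f"unknown strategy: {strategy}")
```

Under this model, broadcasting only pays off while the smaller table times the node count stays below the cost of shuffling, which is why it suits small dimension tables.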

5.Doris core design

The core design of Doris draws on Google Mesa, Apache Impala, and the ORCFile storage format.

Here I would like to focus on Doris's data storage. Doris's storage design combines the advantages of traditional MPP databases and Hadoop distributed storage, and introduces a concept called the bucket. We all know that Hadoop divides the data of a table into multiple blocks according to file size, and three copies of each block are distributed across three servers of the cluster. In traditional MPP databases (such as Greenplum and ClickHouse), the data of a table is either evenly distributed among the nodes, or fully replicated so that every node holds a copy. The former is friendly to large tables and the latter is friendly to small tables, but both have shortcomings: the former cannot serve highly concurrent queries, while the latter wastes storage and spends a lot of time synchronizing data between nodes. Doris combines the advantages of both while discarding their shortcomings. It supports both distributing a small table across multiple nodes and distributing a large table across a specified number of nodes. Doris's data replicas can also take part in computation, which spreads the pressure of concurrent queries.

  • For aggregated hotspot data tables or dimension tables that are joined many times, we can set the number of replicas to 3 or more to improve concurrent query capacity;
  • For large tables that are joined or fully scanned, we set as many buckets as possible so that multiple nodes work on the query at the same time, improving query efficiency;
  • For large tables in the ODS layer or tables that receive real-time writes, we can keep only one replica to reduce disk space usage.
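
The interplay of buckets and replicas can be sketched as follows. This is an illustrative Python model only: the CRC32 hash, the round-robin placement and the names `bucket_of` and `place_replicas` are assumptions, not the hash function or scheduling policy that Doris actually uses.

```python
import zlib

def bucket_of(key, num_buckets):
    """Map a distribution key to a bucket using a stable hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def place_replicas(num_buckets, num_replicas, nodes):
    """Assign every bucket's replicas to distinct nodes, round-robin."""
    if num_replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        bucket: [nodes[(bucket + r) % len(nodes)] for r in range(num_replicas)]
        for bucket in range(num_buckets)
    }
```

More buckets let more nodes scan one table in parallel, while more replicas give concurrent queries more copies of the same bucket to read from; a single replica minimizes disk usage at the cost of both redundancy and read concurrency.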

In addition, Doris's data file storage format combines the advantages of row storage and column storage. It uses a mixed row-column storage mode, which greatly improves read and write performance. Traditional OLTP databases choose row storage to facilitate data updates and deletions, while OLAP databases choose column storage to reduce the number of columns read by a query; mixed row-column storage combines the advantages of both and improves the flexibility of data storage. Doris 2.0 also provides support for S3 object storage: cold data can be automatically backed up to object storage and still queried online, although query speed will be lower.

6.Doris query optimization

Finally, there is Doris's query optimization. Doris has done a lot of work in this area, mainly in the following aspects:

  • Index. The most important of these is the sparse index. Data is first stored in order of the sort key of the data block, and then one index entry is maintained for every 1024 rows. This not only greatly reduces the space occupied by the index, but also enables fast scanning of data. It is a breakthrough design, and it was also mentioned in the earlier article on why ClickHouse is fast. In addition to the prefix sparse index, Doris also supports MinMax indexes, Bloom Filter indexes and Bitmap indexes, and supports using rollup to set up indexes over multiple different column combinations, which makes indexing very flexible.
  • Rollup and materialized views. Doris supports pre-aggregating data in advance through rollup and materialized views to reduce the amount of data queried and improve response speed.
  • Partition. Doris supports multi-level partitioning, which reduces the range of data scanned and improves query speed.
  • Vectorized query engine. With its vectorized query engine, Doris greatly improves CPU data processing capacity and query efficiency.
  • Query optimization. After Doris receives the user's query statement, it first rewrites the SQL statement to reduce query complexity and the data scanning range as much as possible, for example through predicate pushdown, Join Order optimization, and complex SQL rewriting.
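
The sparse index idea described in the list above can be illustrated with a short Python sketch. It is a toy model, not Doris's storage code: the `SparseIndex` class is hypothetical and assumes an in-memory list of unique, already sorted keys, with the 1024-row granularity taken from the text.

```python
from bisect import bisect_right

class SparseIndex:
    """Keeps one index entry per block of rows instead of one per row."""

    def __init__(self, sorted_keys, granularity=1024):
        self.keys = sorted_keys          # data stored in sort-key order
        self.granularity = granularity
        # The first key of every block of `granularity` rows.
        self.marks = sorted_keys[::granularity]

    def find(self, key):
        """Return the row position of `key`, or None if it is absent."""
        block = bisect_right(self.marks, key) - 1
        if block < 0:
            return None  # key is smaller than every stored key
        start = block * self.granularity
        end = min(start + self.granularity, len(self.keys))
        for pos in range(start, end):  # scan a single block only
            if self.keys[pos] == key:
                return pos
        return None
```

For 5000 rows the index holds only 5 entries, and a lookup binary-searches those entries and then scans at most 1024 rows, rather than the whole table.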

7.Doris deals with the pain points of real-time data warehouses

Now let us review the three major difficulties of a real-time data warehouse: multi-table joins, dimension data changes, and data invalidation.

  • In Doris, for multi-table joins, we can write each data stream to its own primary key table and perform the multi-table join only at query time. This solves the problem of matches being lost because of inconsistent windows.
  • The same goes for dimension data changes. We can join dimensions only at query time and abandon the large wide table model, achieving data consistency and real-time performance without losing query efficiency.
  • As for data invalidation, the Doris primary key model supports deleting and modifying data by primary key. We can directly mark as invalid or delete the invalid rows in the detail data, and filter out invalid data at query time.
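
The three points above share one pattern: write each stream into its own primary key table, and join only when querying. A minimal Python sketch of that pattern follows, with dicts standing in for primary key tables; the table layouts and the helper names `upsert`, `delete` and `query_orders_with_city` are hypothetical.

```python
def upsert(table, key, row):
    """Primary-key write: a later row for the same key replaces the earlier one."""
    table[key] = row

def delete(table, key):
    """Primary-key delete: remove an invalidated row if it exists."""
    table.pop(key, None)

def query_orders_with_city(orders, users):
    """Join at query time, so results reflect the current dimension data."""
    return sorted(
        (order_id, order["amount"], users.get(order["user_id"], {}).get("city"))
        for order_id, order in orders.items()
    )
```

Because the join runs at query time, an order that arrives before its user record is not lost, a later change to the user's city shows up in the next query, and a deleted order simply disappears from the result.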

This is why I say that the Doris database can solve the three major pain points of a real-time data warehouse.

Origin blog.csdn.net/be_racle/article/details/133000372