Apache Doris (1): Doris introduction and usage scenarios

Table of contents

1. Introduction to Apache Doris

2. Apache Doris usage scenarios

2.1 Report Analysis

2.2 Ad-hoc Query

2.3 Construction of a unified data warehouse

2.4 Data Lake Federated Query


1. Introduction to Apache Doris

Apache Doris is a high-performance, real-time analytical database based on the MPP architecture, known for being extremely fast and easy to use. It returns query results over massive data sets with sub-second response times, and it supports both high-concurrency point queries and high-throughput complex analysis. As a result, Apache Doris is well suited to report analysis, ad-hoc queries, unified data warehouse construction, and data lake federated query acceleration. On top of it, users can build applications such as user behavior analysis, A/B testing platforms, log retrieval and analysis, user profile analysis, and order analysis.

Apache Doris originated as the Palo project in Baidu's advertising reporting business. It was officially open sourced in 2017, and in July 2018 Baidu donated it to the Apache Software Foundation for incubation, where it was developed and operated by members of the incubator project management committee under the guidance of Apache mentors. Currently, the Apache Doris community has gathered more than 400 contributors from nearly 100 companies in different industries, and the number of monthly active contributors is close to 100. In June 2022, Apache Doris successfully graduated from the Apache Incubator and officially became an Apache Top-Level Project (TLP).

Apache Doris now has a broad user base in China and around the world. To date, it has been used in the production environments of more than 1,000 companies worldwide. Among the top 50 Internet companies in China by market capitalization or valuation, more than 80% are long-term users of Apache Doris, including Baidu, Meituan, Xiaomi, JD.com, ByteDance, Tencent, NetEase, Kuaishou, Weibo, and Beike (KE Holdings). It is also widely used in traditional industries such as finance, energy, manufacturing, and telecommunications.

The official website of Apache Doris is https://doris.apache.org.

Note: MPP stands for Massively Parallel Processing. Generally speaking, an MPP architecture refers to a distributed database with multiple data processing nodes, each with its own independent disk and memory. A task is split into parallel subtasks that are distributed to the nodes, each node processes its own share of the data, and once the computation is complete the partial results are gathered together to form the final result.
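The scatter, compute, and gather flow described in the note can be sketched in a few lines of Python. This is only an illustration of the idea, not how Doris is implemented: the function names (`partition`, `local_aggregate`, `merge`, `mpp_sum`), the CRC32-based hash partitioning, and the sample rows are all assumptions made for the example, and the "nodes" are simulated in a single process.

```python
import zlib
from collections import defaultdict


def partition(rows, num_nodes):
    """Hash-partition rows so that each simulated node owns a disjoint shard."""
    shards = [[] for _ in range(num_nodes)]
    for key, value in rows:
        node = zlib.crc32(key.encode("utf-8")) % num_nodes
        shards[node].append((key, value))
    return shards


def local_aggregate(shard):
    """Work done by one node: sum the values of its own shard per key."""
    partial = defaultdict(int)
    for key, value in shard:
        partial[key] += value
    return dict(partial)


def merge(partials):
    """Work done by the coordinator: combine the partial results."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


def mpp_sum(rows, num_nodes=3):
    shards = partition(rows, num_nodes)
    # In a real cluster these calls would run in parallel on separate machines.
    partials = [local_aggregate(shard) for shard in shards]
    return merge(partials)


rows = [("ads", 3), ("search", 5), ("ads", 7), ("maps", 1)]
print(sorted(mpp_sum(rows).items()))  # [('ads', 10), ('maps', 1), ('search', 5)]
```

Because every row with the same key lands on the same shard, each node can aggregate independently and the final merge only has to combine small partial results.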

The term MPP can be read in two ways: as MPP DB and as MPP architecture. In the broad sense, the Hadoop architecture is also an MPP architecture, since both are large-scale distributed processing architectures. However, the term MPP was coined by database vendors early on and generally refers to distributed databases. MPP is therefore best understood as an umbrella concept that divides into MPP architecture and MPP DB: Hadoop or MapReduce (MR) is an MPP architecture, while an MPP DB is a distributed database. Strictly speaking, Doris is an MPP DB, that is, a distributed database of the kind the industry commonly describes as having an MPP architecture.

Apache Doris is not DorisDB. For various reasons, DorisDB was later renamed StarRocks, which makes DorisDB the predecessor of StarRocks. Doris was originally a dedicated system built for the statistical reports of Baidu Fengchao. As Baidu's business grew rapidly, the system went through many iterations and gradually took on the statistical reporting and multi-dimensional analysis needs of Baidu's internal businesses. In 2013, Baidu upgraded Doris to an MPP framework and named the new system Palo. In 2017, it was renamed Baidu Palo and open sourced on GitHub. When it was contributed to the Apache Foundation in 2018, the name Palo clashed with that of a foreign database vendor, so the original name Doris was chosen. This is the origin of Apache Doris.

In February 2020, some engineers from Baidu's Doris team left to start their own company and built a commercial closed-source product, DorisDB, on top of an earlier version of Apache Doris. This product is the predecessor of StarRocks. For details, please refer to: https://www.sohu.com/a/488816742_827544.

2. Apache Doris usage scenarios

As shown in the figure below, after data from various sources has been integrated and processed, it is usually stored in the real-time data warehouse Doris and in an offline data lake or warehouse (Hive, Iceberg, Hudi). Apache Doris is widely used in the following scenarios.

2.1 Report Analysis

  • Real-time dashboards.
  • Reports for in-house analysts and managers.
  • Highly concurrent report analysis for users or customers (Customer Facing Analytics), such as site analysis for website owners and advertising reports for advertisers. These workloads usually require thousands of QPS of concurrency and millisecond-level query latency. JD.com, a well-known e-commerce company, uses Apache Doris for advertising reports: it writes 10 billion rows of data every day, serves tens of thousands of QPS, and keeps the 99th percentile query latency at 150 ms.
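For readers unfamiliar with percentile latency figures such as the 99th percentile quoted above, here is a minimal sketch of how such a number can be derived from a set of measured query latencies. It uses the nearest-rank definition of a percentile, which is one common convention among several; the function name `percentile_nearest_rank` and the sample latencies are made up for illustration and are not taken from any Doris benchmark.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the smallest sample such that at least p percent of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
print(percentile_nearest_rank(latencies_ms, 50))  # 50
print(percentile_nearest_rank(latencies_ms, 99))  # 99
```

A 99th percentile of 150 ms therefore means that 99% of queries finished in 150 ms or less, which says more about tail behavior than an average does.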

2.2 Ad-hoc Query

Self-service analysis for analysts, where the query pattern is not fixed and high throughput is required. Xiaomi has built a growth analysis platform (Growing Analytics, GA) on Doris that uses user behavior data for business growth analysis. The average query latency is 10 s, the 95th percentile query latency is within 30 s, and tens of thousands of SQL queries are run per day.

2.3 Construction of a unified data warehouse

A single platform meets the needs of unified data warehouse construction and simplifies the cumbersome big data software stack. The unified data warehouse that Haidilao built on Doris replaced an old architecture composed of Spark, Hive, Kudu, HBase, and Phoenix, greatly simplifying the architecture.

2.4 Data Lake Federated Query

Doris performs federated analysis of data in Hive, Iceberg, and Hudi through external tables, which greatly improves query performance while avoiding data copying.
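To make the federated query idea more concrete, the sketch below builds the two SQL statements a user might send to Doris: one that registers a Hive Metastore as a catalog, and one that queries a Hive table in place. It is an illustration under assumptions, not a verified recipe. The statement shape follows the multi-catalog syntax of newer Doris releases as I understand it, so check it against the documentation for your version. The helper name `hive_catalog_statements`, the catalog name, the metastore address, and the database and table names are all hypothetical.

```python
def hive_catalog_statements(catalog, metastore_uri, database, table):
    """Build SQL text for registering a Hive catalog and querying one table.

    The function only returns strings; it does not connect to any server.
    """
    create_catalog = (
        f"CREATE CATALOG {catalog} PROPERTIES (\n"
        f"    'type' = 'hms',\n"
        f"    'hive.metastore.uris' = '{metastore_uri}'\n"
        f");"
    )
    # The data stays in Hive; Doris reads it through the catalog at query time.
    query = f"SELECT COUNT(*) FROM {catalog}.{database}.{table};"
    return [create_catalog, query]


# All names and addresses below are placeholders.
for statement in hive_catalog_statements(
    "hive_lake", "thrift://metastore.example.com:9083", "sales", "orders"
):
    print(statement)
```

Since Doris is compatible with the MySQL protocol, statements like these can be submitted from a standard MySQL client or driver.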

Origin blog.csdn.net/qq_32020645/article/details/131355112