[Big Data] A detailed walkthrough of building a real-time data warehouse with Doris (1): Overview of the real-time data warehouse


The concept of the data warehouse can be traced back to the 1980s, when IBM researchers proposed the idea of a "business data warehouse" to address problems with data flows, in particular the high costs caused by repeatedly replicating data.

1. Development history of the data warehouse

Bill Inmon, known as the father of the data warehouse, first proposed the concept in his book "Building the Data Warehouse", published in 1991. Inmon described a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data used to support management's decision-making process. That definition has endured to this day.

Data warehouses entered China around 2000, initially in the banking and telecommunications industries. In banking, the motivation came from regulatory requirements, specifically the 1104 regulatory reporting system. In telecommunications, the main driver was getting provincial and municipal subsidiaries to aggregate data at headquarters in order to build unified financial analysis reports. These two industries laid the foundation for popularizing the data warehouse concept in China.

After 2010, with the development of big data technology, data warehouses expanded into other sectors, and they are now promoted across industries such as the Internet, retail, manufacturing, and healthcare.

2. Development of data warehouse technology

Around 2010, a data warehouse system mainly consisted of three commercial suites: the database, the ETL platform, and the BI tools. The common databases were Oracle, DB2, and Teradata; the corresponding ETL platforms were DataStage, Informatica, and ETL Automation; and the mainstream commercial BI platforms were BIEE, Cognos, and BO. These terms may be unfamiliar to today's audience, but around 2010 they were synonymous with data warehousing. I joined the industry during the heyday of commercial data warehouses and participated in some of these projects. Apart from Teradata, which announced its withdrawal from the Chinese market in March of this year and which I never worked with, I have studied and practiced all of the other commercial components in depth.

At the same time, with the rise of the mobile Internet, Internet companies represented by BAT (Baidu, Alibaba, Tencent) raised the banner of "de-IOE" (removing IBM minicomputers, Oracle databases, and EMC storage), pushing the data warehouse from the commercial era into the open-source era, and from monolithic architectures to distributed ones. Most representative, Alibaba and Tencent introduced Hadoop clusters in 2009 and have kept iterating and upgrading them to this day. The reason Internet companies adopted the Hadoop ecosystem is simple: traditional commercial databases could no longer meet their data storage and computing needs, suffering from limited scalability, expensive hardware, and insufficient concurrent execution capability, and Hadoop addressed exactly these pain points. In addition, HiveQL could satisfy most data development needs, so the Hive data warehouse gradually replaced commercial databases. However, open-source software such as Hadoop, Hive, and Sqoop was immature in the early days: a great deal of engineering was needed to harden the software, fix bugs, and optimize the performance or functionality of certain modules, and this progress was slow, so the early impression of Hadoop and Hive was that they were hard to use and unstable. My first project contact with Hive was at the end of 2016, when WeBank, a subsidiary of Tencent, used Tencent's internally optimized build of Hive 0.13. Internet companies also began to introduce Kafka around that time.

Everyone basically knows the rest of the story. Around 2016, with the releases of Hive 2.3 and Hive 3.0, the Hadoop ecosystem gradually matured and stabilized. Hortonworks contributed the Tez engine and the Ambari management platform, and Cloudera contributed the Impala engine to the Hadoop ecosystem. At the same time, the rise of the in-memory compute engine Spark gave Hive powerful new momentum, and Facebook, where Hive originated, went on to open source Presto, an MPP-style query engine that greatly improved the query capabilities of the Hive data warehouse.

3. Data warehouse related technology stack

At the current point in time, the Hive data warehouse we speak of generally includes the HDFS storage system, the Yarn resource management platform, Hive metadata management, the Spark compute engine, and the Presto query engine. Together, these constitute the technology stack of the offline data warehouse.

Here is a brief introduction to the role of each component in this stack:

  • Apache Hadoop : an open-source distributed computing framework that provides scalable storage and processing for large-scale data. Its core components are the Hadoop Distributed File System (HDFS), MapReduce, and the Yarn resource manager; in an offline data warehouse it mainly handles data storage and resource scheduling.

  • Apache Hive : a data warehouse infrastructure built on Hadoop that provides a SQL-like query language (HiveQL), letting users query and analyze data with SQL-style syntax. It maps structured files stored in HDFS on the Hadoop cluster to tables and uses compute engines such as MapReduce, Spark, and Tez for queries and data processing (see the combined sketch after this list).

  • Sqoop : a short-lived but important Hadoop component used to extract data from relational databases into the Hive data warehouse, or to export data from Hive back to relational databases. Sqoop was a very important component, but for reasons unclear it stopped being updated very early; in China, Alibaba's open-source DataX has largely taken over its role.

  • Apache Spark : a fast, general-purpose big data processing engine that can be used to build offline data warehouses as well as real-time data analysis systems. Spark delivers high-performance data processing and analysis and supports multiple programming languages such as Scala, Java, and Python.

  • Apache Kylin : an open-source distributed analysis engine designed specifically for building OLAP (Online Analytical Processing) data warehouses. It supports building multi-dimensional data models on Hadoop, offering fast query performance and high scalability. It is a big data project initiated and open-sourced by a Chinese team, and a top-level Apache project focused on OLAP queries.

  • Presto : a distributed SQL query engine that can serve as the query layer for large-scale data warehouses. It executes high-performance queries across multiple data sources, including Hive, MySQL, and PostgreSQL.
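To tie these components together, here is a minimal sketch of the typical Hive-on-Hadoop workflow, using PySpark with Hive support as the compute engine. It assumes a cluster with HDFS and a Hive metastore is available; the table name `ods_orders` and the HDFS path are purely illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-dw-sketch")
         .enableHiveSupport()        # register tables in the Hive metastore
         .getOrCreate())

# Map structured files already sitting in HDFS to a table (schema-on-read).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_orders (
        order_id BIGINT,
        amount   DECIMAL(18, 2),
        dt       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///warehouse/ods/orders'
""")

# HiveQL-style aggregation, executed by Spark as the compute engine.
spark.sql("""
    SELECT dt, COUNT(*) AS order_cnt, SUM(amount) AS gmv
    FROM ods_orders
    GROUP BY dt
""").show()
```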

4. OLAP query

Having covered the data warehouse, let's talk about OLAP queries. In the traditional data warehouse architecture, ETL sits outside the warehouse while OLAP happens inside it. When the Hadoop ecosystem entered the data warehouse field, ETL processing power, cluster scalability, and data storage stability all improved greatly, but at the expense of query capability. Thus OLAP querying was born as a professional field of its own.

In offline data warehouse technology, engines such as the previously introduced Kylin and Presto, along with Impala and Druid, were all designed to solve the OLAP query problem. However, these HDFS-based OLAP engines merely accelerated queries; their speed was not yet satisfying, let alone astonishing. As a result, ClickHouse and Doris emerged and became the kings of the OLAP field.

  • ClickHouse is a data management system for online analytical queries, open-sourced by Russia's Yandex in 2016. It is a column-oriented database built on an MPP architecture that can produce analytical results in real time from SQL queries. The name ClickHouse comes from "Click Stream, Data WareHouse". ClickHouse was also among the first open-source databases to implement a vectorized query engine and to focus exclusively on OLAP queries, compressing OLAP query latency down to the sub-second level.

  • Apache Doris is likewise an MPP-architecture database built specifically for OLAP queries. It can satisfy a wide variety of analytical needs, such as fixed historical reports, real-time data analysis, interactive analysis, and exploratory analysis; it also has outstanding cluster scalability and can support extremely large data sets of more than 10 PB. In addition, Apache Doris has very strong support for real-time data, making it a database with very strong all-round capabilities. Since Apache Doris is the focus of this series, it is only briefly introduced here and will be expanded upon later.

Let me insert a knowledge point here, a summary I came across while preparing the slides: why is ClickHouse fast? The most important reasons are that it is written in C++ and can exploit the hardware to the fullest; it stores the underlying data by column; it supports a vectorized query engine; it makes full use of the multi-core parallelism of a single node; and it builds primary and secondary sparse indexes as data is written. These characteristics point the way for emerging OLAP query databases, and they are also the basic characteristics of Doris. The toy example below illustrates the columnar/vectorized part.
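As a rough, self-contained illustration (not a real benchmark) of why columnar layout plus vectorized execution pays off, the Python/NumPy sketch below compares scanning rows one at a time in interpreted code against one vectorized pass over contiguous columns. Real engines such as ClickHouse and Doris do this in optimized C++ over compressed column blocks.

```python
import time
import numpy as np

N = 2_000_000
rng = np.random.default_rng(42)
amount = rng.random(N)                      # one contiguous column
status = rng.integers(0, 2, N)              # another contiguous column
rows = list(zip(amount.tolist(), status.tolist()))  # row-store stand-in

t0 = time.perf_counter()
row_total = sum(a for a, s in rows if s == 1)       # row-at-a-time scan
t1 = time.perf_counter()
col_total = amount[status == 1].sum()               # vectorized columnar scan
t2 = time.perf_counter()

print(f"row-at-a-time: {t1 - t0:.3f}s total={row_total:.2f}")
print(f"vectorized   : {t2 - t1:.3f}s total={col_total:.2f}")
```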

5. MPP architecture

Let's talk about a third concept, the MPP architecture. MPP stands for Massively Parallel Processing. An MPP system follows a shared-nothing architecture: a task is distributed across multiple servers and nodes that work in parallel, and once each node finishes its part of the computation, the partial results are merged into the final result (a minimal sketch of this pattern follows the next paragraph).

The Teradata mentioned above was the industry's earliest well-known MPP database, followed by Greenplum and Greenplum-like systems such as GBase and GaussDB, which are all MPP architectures. Presto, HAWQ, and Impala from the Hadoop ecosystem all claim to be MPP as well, as do the two new species mentioned earlier, ClickHouse and Doris. As the MPP family grows larger, and with the Hadoop ecosystem encroaching across the boundary, it has become hard to define exactly what the MPP architecture is; MPP and Hadoop architectures are converging to some degree. The biggest remaining differences lie in resource allocation and fault handling: the Hadoop architecture relies on Yarn to allocate compute resources, whereas MPP engines use each node's local memory and CPU independently; and Hadoop retains intermediate results so that failed tasks can be retried automatically, whereas an MPP query is generally a one-shot process that simply reports failure to the user, trading fault tolerance for better query performance.
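The following is a minimal sketch, in plain Python, of the scatter-gather pattern that defines MPP execution: the data is partitioned, each "node" (here a worker process) computes a partial aggregate over its own shard with its own local CPU and memory, and a coordinator merges the partials. Function and variable names are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_agg(shard):
    # Each "node" aggregates only its local shard (shared-nothing).
    return sum(shard), len(shard)

def mpp_average(data, n_nodes=4):
    shards = [data[i::n_nodes] for i in range(n_nodes)]      # scatter
    with ProcessPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(partial_agg, shards))       # parallel local compute
    total = sum(s for s, _ in partials)                      # gather + merge
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    print(mpp_average(list(range(1_000_000))))               # 499999.5
```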

6. Real-time data warehouse definition

Finally, let's talk about the real-time data warehouse. A traditional data warehouse uses the low-traffic hours at night to complete ETL processing and then supports analysis during the day with that data. Under this usage pattern, the warehouse processes data on a daily cadence by default, commonly known as a T+1 data warehouse. But as businesses grew and the technology matured, we were no longer satisfied with seeing yesterday's data today; we wanted to see today's data today, and so the concept of the real-time data warehouse was born. A real-time data warehouse is one with much fresher data and lower latency, usually reporting on business data generated the same day. Real-time data warehouses generally fall into two categories: quasi-real-time warehouses that run on an hourly cadence with hour- or minute-level latency, and pure real-time warehouses with second-level latency.

From the earlier introduction we can see that early Hive data warehouse technology was far from perfect; even OLAP queries were hard to serve in real time, let alone a real-time warehouse. That is why, although Kafka had been around for many years, real-time data warehouses did not take off until Flink became popular. A pure real-time data warehouse is one that can process and analyze data in near real time: its goal is to bring data capture, processing, and analysis up to near-real-time speed in order to support real-time decisions and insights.

A pure real-time data warehouse overturns the offline architecture: data collection, data processing, and data query and analysis all require a new technology stack.

  • In terms of data collection , to capture data at lower latency, the common approaches are reading the database change log (known as CDC, change data capture) or subscribing directly to an online Kafka data stream.

  • In terms of data processing , real-time pipelines generally use Apache Flink or Spark Streaming, with intermediate results staged in Kafka.

  • In terms of real-time data query , usually only aggregated summaries are written out, into a relational database such as MySQL or a Redis cache, so results can be fetched quickly. To support faster ad-hoc queries, the data can also be written to ClickHouse or Doris (a pipeline sketch follows this list).
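Below is a minimal sketch of such a pipeline using Spark Structured Streaming (a Flink job would look analogous): read an order stream from Kafka, aggregate it per minute with a watermark for late data, and write the summary to a sink. The broker address, topic name, and schema are assumptions made for illustration, and the console sink stands in for a MySQL/Redis/ClickHouse/Doris writer.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("rt-dw-sketch").getOrCreate()

# Hypothetical order events: {"order_id": "...", "amount": 12.5, "event_time": "..."}
schema = (StructType()
          .add("order_id", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # assumed address
          .option("subscribe", "orders")                      # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

per_minute = (orders
              .withWatermark("event_time", "5 minutes")       # tolerate late events
              .groupBy(F.window("event_time", "1 minute"))
              .agg(F.sum("amount").alias("gmv"),
                   F.count("*").alias("order_cnt")))

# The console sink stands in for the real serving store.
(per_minute.writeStream
    .outputMode("update")
    .format("console")
    .start()
    .awaitTermination())
```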

Although this kind of processing achieves second-level latency, it sacrifices data accuracy and the dimensions available for analysis . Highly aggregated data can serve some scenarios, but it does not allow deeper analysis to mine the value of the data. So in most cases we compromise between the real-time and offline architectures, making the first half real-time and the second half offline: data ingestion is real-time, while data processing uses offline micro-batches. If the transaction system cannot supply a CDC change log, we can even extract incremental data in micro-batches based on each record's creation and modification timestamps (see the sketch below). The benefits of micro-batch processing are higher data accuracy than pure streaming, relatively mature technology, and low development and operations costs. We will expand on the specific implementation in the third part.
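A minimal sketch of timestamp-based micro-batch extraction might look like the following. It assumes the source table carries a modify_time column and that a watermark is persisted between runs (for example in a job-control table); the table/column names and the helper callables are hypothetical.

```python
from datetime import datetime

def build_incremental_query(table: str, watermark: datetime, batch_end: datetime) -> str:
    # Pull only rows created or modified since the last successful run.
    return (
        f"SELECT * FROM {table} "
        f"WHERE modify_time >= '{watermark:%Y-%m-%d %H:%M:%S}' "
        f"AND modify_time < '{batch_end:%Y-%m-%d %H:%M:%S}'"
    )

def run_micro_batch(conn, table, load_watermark, save_watermark):
    wm = load_watermark(table)            # e.g. read from a job-control table
    end = datetime.now()
    rows = conn.execute(build_incremental_query(table, wm, end)).fetchall()
    # ... merge/upsert `rows` into the warehouse ODS layer here ...
    save_watermark(table, end)            # advance only after a successful load
    return len(rows)
```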

The application scenarios of real-time data warehouses are also becoming steadily richer. The main ones I have encountered at work are:

  • Real-time business monitoring and alerting . Avoids the losses caused by failing to detect online business interruptions in time.

  • Real-time big screens . Mainly used to monitor progress toward performance goals during promotions such as 618 or Double Eleven.

  • Real-time bot broadcasts . Through real-time data processing, colleagues are promptly informed of the day's performance progress and rankings.

  • Real-time data display on mobile . Lets leaders and managers check performance completion in real time.

  • Real-time self-service analytics . Mainly supplements same-day data for self-service analysis.

  • Real-time dashboards . For example, viewing transaction indicators at five-minute granularity and comparing them with the same period makes business failures easier to spot in time, and is more intuitive than monitoring alerts alone.

  • Real-time data interfaces . In some scenarios data is exposed externally and the latest data must be provided in real time to ease cross-system integration, though most such cases are handled inside the transaction system.

  • Real-time recommendations . For example, real-time product sales rankings.

7. Difficulties of the real-time data warehouse

There are also many difficulties in implementing a real-time data warehouse. I will summarize the three main ones here.

  • First, multi-table association , also called a multi-stream join. When the data on one side is delayed and the matching data on the other stream is no longer within the join window, the association fails. For example, if changes to the sales-order parent table and the sales-detail child table in the business system arrive farther apart than the window allows, the dual-stream join drops them and the change records are lost (a toy illustration follows this list).

  • Second, dimensional data changes . When dimension data changes, historical data becomes inconsistent with newly written data. In an offline warehouse, we generally take the dimension state as of a fixed point in the early morning of that day as the unified dimension; in a pure real-time warehouse, data before and after the dimension change will disagree, and restating historical data under the new dimension values becomes a big problem.

  • Third, data invalidation . This includes both physical deletion and records whose status becomes invalid. Physical deletion means the transaction system deletes the record outright, which is hard to handle in incremental processing and real-time extraction; fortunately, a CDC log can capture the deleted record. A status change means previously valid data becomes invalid, for example an order that is closed after payment. For invalidated data, our usual approach is to generate a hedging record with negated measures and subtract it from the aggregated results; however, this cannot correct indicators such as the number of transactions (see the second sketch after this list).
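First, a toy Python illustration of the multi-stream join problem: a windowed join only keeps each side joinable for a bounded time, so a matching record that arrives outside that window is silently dropped. The window size and records are made up for the example.

```python
WINDOW = 60  # seconds each stream's records stay joinable

def windowed_join(orders, details, window=WINDOW):
    """orders/details: lists of (key, event_time_seconds)."""
    joined, lost = [], []
    order_times = dict(orders)
    for key, t_detail in details:
        t_order = order_times.get(key)
        if t_order is not None and abs(t_detail - t_order) <= window:
            joined.append(key)
        else:
            lost.append(key)              # arrived outside the window: dropped
    return joined, lost

orders  = [("A", 0), ("B", 10)]
details = [("A", 30), ("B", 200)]         # B's detail lags the order by 190s
print(windowed_join(orders, details))     # (['A'], ['B']) -> B's change is lost
```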
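Second, a minimal sketch of the hedging-record approach for invalidated data: appending a compensating record with negated measures makes SUM-style aggregates net out correctly, while the final lines show why count-style indicators are not fixed this way. The record layout is illustrative.

```python
orders = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 50.0},
]

def hedge(order):
    """Compensating record with negated measures for an invalidated order."""
    return {**order, "amount": -order["amount"]}

# Order 2 is closed after payment: append its hedging record to the fact stream.
facts = orders + [hedge(orders[1])]

gmv = sum(r["amount"] for r in facts)              # 100.0 -> sums net out correctly
txn_count = len({r["order_id"] for r in facts if r["amount"] > 0})
print(gmv, txn_count)   # 100.0 2 -> the count still includes the cancelled order,
                        # so count-style metrics need reprocessing, not hedging
```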

The above three situations are essentially unsolvable with traditional real-time data warehouse solutions, but in the Apache Doris era we can resolve these pain points with ease.
