Data warehouse architecture evolution

Table of Contents

Data warehouse architecture evolution

Offline big data architecture

Data warehouse layering

Lambda architecture

Problems with Lambda architecture

Kappa architecture

Typical case of Kappa architecture

The reprocessing process of Kappa architecture

Comparison of Lambda architecture and Kappa architecture

Real-time data warehouse and offline data warehouse


Data warehouse architecture evolution

The concept of the data warehouse was proposed by Inmon in 1990, along with a complete construction methodology. With the advent of the Internet era, data volumes soared, and big data tools began to replace the traditional tools of the classic data warehouse. At that point it was only a replacement of tools; there was no fundamental difference in architecture. This architecture can be called the offline big data architecture.

Later, as business demands for real-time results kept rising, an acceleration layer was added to the offline big data architecture, using stream processing technology to directly compute the indicators with the highest real-time requirements. This is the Lambda architecture.

Still later, real-time businesses multiplied and event-based data sources became more common. Real-time processing shifted from a secondary role to the primary one, and the architecture was adjusted accordingly, centering on a real-time event processing core: the Kappa architecture.

Offline big data architecture

Features:

  • Data sources are imported into the offline data warehouse through offline synchronization (batch imports)
  • Data processing uses offline computing engines such as MapReduce, Hive, and SparkSQL

Data warehouse layering

  • ODS (Operation Data Store) layer: the raw data layer, which stores original logs and data as-is, without processing.
  • DWD (Data Warehouse Detail) layer: cleans the ODS-layer data (removing null values and out-of-range values), performs dimension degeneration, desensitization, etc.
  • DWS (Data Warehouse Service) layer: light daily aggregation on top of DWD.
  • DWT (Data Warehouse Topic) layer: aggregation by topic on top of DWS.
  • ADS (Application Data Store) layer: provides data for the various statistical reports.
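The flow through these layers can be sketched with plain Python; the field names (`user_id`, `amount`, `event_date`) and the cleaning rules are hypothetical examples, not part of any specific warehouse:

```python
# Sketch of ODS -> DWD -> DWS: raw rows are cleaned, then lightly
# aggregated by day. Real warehouses do this in Hive/SparkSQL.

def ods_to_dwd(ods_rows):
    """DWD: clean ODS rows -- drop null user_ids and out-of-range amounts."""
    return [r for r in ods_rows
            if r.get("user_id") is not None and 0 <= r["amount"] <= 100000]

def dwd_to_dws(dwd_rows):
    """DWS: light daily summary -- total amount per (user_id, event_date)."""
    daily = {}
    for r in dwd_rows:
        key = (r["user_id"], r["event_date"])
        daily[key] = daily.get(key, 0) + r["amount"]
    return daily

ods = [
    {"user_id": 1,    "event_date": "2021-01-01", "amount": 30},
    {"user_id": 1,    "event_date": "2021-01-01", "amount": 20},
    {"user_id": None, "event_date": "2021-01-01", "amount": 99},  # dropped in DWD
]
print(dwd_to_dws(ods_to_dwd(ods)))  # {(1, '2021-01-01'): 50}
```

DWT and ADS would continue the same pattern: further aggregation by topic, then projection into report-ready tables.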

Lambda architecture

As big data applications developed, requirements for real-time results gradually emerged. To compute some real-time indicators, a real-time computation link was added alongside the original offline warehouse: the data sources were converted to streams (that is, data is sent to a message queue), real-time jobs subscribe to the queue and compute indicator increments directly, and the results are pushed to downstream data services. The data service layer then merges the offline and real-time results.

Indicators computed by stream processing are still recomputed in batch, and the batch result prevails: each batch run overwrites the corresponding stream-processing result. (This is merely a compromise forced by the then-immature stream processing engines.)

Whether the real-time link is layered depends on the complexity of the indicators; layers interact through message queues (most real-time links are not layered).
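The serving-layer merge described above can be sketched as follows. This is a hypothetical illustration, assuming the batch job is authoritative up to a watermark and the stream supplies increments after it:

```python
# Sketch of the Lambda serving layer: merge an authoritative batch
# result with stream-processed increments that arrived after the
# batch job's watermark.

def serve_metric(batch_result, stream_increments, batch_watermark):
    """Return batch result plus only the stream increments newer than
    the point the batch job has already covered (and will overwrite)."""
    late = sum(v for ts, v in stream_increments if ts > batch_watermark)
    return batch_result + late

batch_total = 100                      # batch result, authoritative through hour 10
stream = [(9, 5), (11, 7), (12, 3)]    # (hour, increment) from the stream job
print(serve_metric(batch_total, stream, batch_watermark=10))  # 110
```

Note how the increment for hour 9 is ignored: the batch result already covers it, which is exactly the "batch overwrites stream" rule above.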

Problems with Lambda architecture

  • The same requirement must be developed as two sets of the same code

This is the biggest problem. Two codebases mean not only development difficulty (the same requirement implemented once on the batch engine and once on the stream engine, with test data constructed separately to verify that the two produce consistent results) but also harder maintenance: after a requirement changes, both codebases must be modified, tested independently, and released simultaneously.

  • Increased resource usage

The same logic is computed twice, so overall resource usage increases (by the extra real-time computation).

  • Differences between the real-time link's data and the offline link's data easily confuse the business side

For example, the business side may find that the data seen the next day is smaller than the data seen the night before. The reason is that two computation lines write into the result database: one is batch ETL "run" according to a fixed caliber, producing the more accurate batch result; the other is "run" through streaming on top of Hadoop, Hive, or other engines, which for the sake of real-time performance sacrifices some accuracy. The batch and real-time results therefore do not match.

Kappa architecture

Although the Lambda architecture meets real-time requirements, it brings more development and operations work. Its background is that stream processing engines were immature, so stream results were only temporary, approximate values for reference. Later, with the emergence of stream processing engines such as Flink, stream processing technology matured. At that point, to solve the two-codebases problem, Jay Kreps of LinkedIn proposed the Kappa architecture.

The Kappa architecture can be considered a simplified version of the Lambda architecture: simply remove the batch processing part of Lambda.

The computation can be done directly, or layered like an offline data warehouse, depending on the complexity of the indicators; layers interact through message queues (most are not layered).

Typical case of Kappa architecture

A data warehouse built on the Kappa architecture is a proper real-time data warehouse. Developing stream processing code for every single requirement is cumbersome; a better approach is to use an OLAP engine (strictly speaking, some of the mainstream choices are not OLAP engines, but they provide the corresponding functionality).

The reprocessing process of Kappa architecture

In the Kappa architecture the real-time stream processing engine itself is robust, but data issues still create a need for reprocessing. Reprocessing of corrected or historical data is done by upstream replay: pulling the data from the source and recomputing it once.

The biggest problem with the Kappa architecture is that the throughput of streaming over historical data is lower than batch processing, but this can be compensated for by adding computing resources. Reprocessing, the point people worry about most in Kappa, is actually not complicated:

1. Choose a message queue that supports replay, retains historical data, and supports multiple consumers; set the retention period according to your needs. Kafka, for example, can retain all historical data.

2. When one or more indicators need reprocessing, write a new job with the new logic, consume the upstream message queue again from the very beginning, and write the results to a new downstream table.

3. When the new job catches up, switch the application's data source to the new result table produced in step 2.

4. Stop the old job and delete the old result table.
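The four steps above can be sketched in a few lines, with an in-memory list standing in for a replayable queue such as Kafka (the job logic is an arbitrary placeholder):

```python
# Sketch of Kappa reprocessing: a new job replays the retained
# history, builds a new result table, the application switches over,
# and the old job/table are retired.

queue = [1, 2, 3, 4]              # step 1: queue retains historical events

def job_v1(events):               # old logic: plain sum
    return sum(events)

def job_v2(events):               # new logic after the requirement changed
    return sum(e * e for e in events)

result_v1 = job_v1(queue)         # old result table, still serving
result_v2 = job_v2(queue)         # step 2: new job replays from offset 0

serving = result_v2               # step 3: application switches tables

del result_v1                     # step 4: stop old job, drop old table
print(serving)  # 30
```

The key enabler is step 1: without a queue that retains history, there is nothing to replay, and Kappa degrades back to needing a batch store.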

Comparison of Lambda architecture and Kappa architecture

| Comparison | Lambda | Kappa |
| --- | --- | --- |
| Real-time | Real time | Real time |
| Computing resources | Batch and stream run simultaneously; high resource consumption | Stream processing only; low resource overhead |
| Recomputation throughput | Batch full-volume processing; high throughput | Streaming full-volume processing; throughput lower than batch |
| Development and testing difficulty | Two codebases (batch and stream) per requirement; harder to develop, test, and release | One codebase; easier to develop, test, and release |
| Operations cost | Two systems (engines) to maintain; high operations cost | One system (engine) to maintain; low operations cost |


Real-time data warehouse and offline data warehouse

  • In terms of architecture, there are obvious differences between real-time and offline data warehouses: real-time warehouses are based on the Kappa architecture, while offline warehouses are based on the traditional big data architecture. The Lambda architecture can be considered an intermediate state between the two. At present, most real-time data warehouses in industry use the Lambda architecture, which is driven by demand.
  • In terms of construction method, real-time and offline warehouses both basically follow traditional warehouse subject-area modeling theory to produce wide fact tables. In addition, joins over real-time streaming data carry hidden time semantics, which must be kept in mind during construction.
  • Finally, from the perspective of data assurance, real-time warehouses are more sensitive to changes in data volume because they must guarantee timeliness. For scenarios such as big promotions, stress testing and primary/backup safeguards must be prepared in advance, which is a clear difference from offline warehouses.

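The "hidden time semantics" of a streaming join mentioned above can be illustrated with a toy windowed join; the event shapes and the 5-second window are hypothetical:

```python
# Sketch of a streaming interval join: a click and an order only
# match if the order arrives within the join window. An event that
# arrives too late is silently dropped -- the time semantics that
# batch joins do not have.

def windowed_join(clicks, orders, window_secs=5):
    """Join clicks to orders on user_id, keeping only orders that
    arrive within window_secs after the click."""
    joined = []
    for c in clicks:
        for o in orders:
            if (o["user_id"] == c["user_id"]
                    and 0 <= o["ts"] - c["ts"] <= window_secs):
                joined.append((c["user_id"], c["ts"], o["ts"]))
    return joined

clicks = [{"user_id": 1, "ts": 100}, {"user_id": 2, "ts": 100}]
orders = [{"user_id": 1, "ts": 103},   # inside the window: joins
          {"user_id": 2, "ts": 120}]   # 20s late: dropped
print(windowed_join(clicks, orders))   # [(1, 100, 103)]
```

An offline batch join over the same two tables would match both pairs, which is precisely why the same join logic can yield different results in the real-time and offline warehouses.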
In actual business, the architecture is often neither a fully standard Lambda nor a fully standard Kappa, but a mix of the two: most real-time indicators are computed with the Kappa architecture, while a small number of key indicators (such as finance-related ones) are recomputed in batch with the Lambda architecture, adding a reconciliation step.

The offline big data architecture is still the more practical choice in many companies (high cost-effectiveness).

Origin blog.csdn.net/Poolweet_/article/details/109477262