Detailed explanation of real-time data warehouse

Foreword

This article belongs to the column "Theoretical System of Big Data". The column is the author's original work; please cite the source when quoting it. If you find any shortcomings or mistakes, please point them out in the comment area. Thank you!

For the directory structure and references of this column, please refer to "Theoretical System of Big Data".


Background

With the development of society, users have higher and higher requirements for data warehouses.

For more on data warehouses, please refer to my blog post "What is a data warehouse?"

More and more users expect the data warehouse to be able to:

  • Analyze real-time, recent, and historical data together
  • Correlate data across domains, even when the data is not traditionally stored together (e.g., real-time customer event data alongside CRM data; web sensor data alongside campaign management data)
  • Handle the extreme scale of "big data" with the feel and semantics of "small data"
  • Do all of the above on one integrated and secure platform

The factors driving this trend are part technology, part business, and part culture.

  • On the technical side, it's cheaper and easier than ever to instrument everything and send the data in real time via messaging systems.
  • On the business side, companies and governments are digitizing and automating as much of their operations as possible so that decision-making and asset management can be more efficient.
  • Culturally, people want to be able to get the answers they need immediately without having to ask others.

Definition of real-time data warehouse

The easiest way to describe a Real-time Data Warehouse (abbreviated RTDW) is that it looks and feels like a normal data warehouse, but is still fast, even at massive scale. It is a data warehouse modernization that gives you "small data" semantics and performance at "big data" scale.

  • Data arrives at the warehouse faster; think of a constant stream of millions of events per second
  • Data becomes optimally queryable faster; it can be queried as soon as it arrives, without waiting for processing, aggregation, or compression
  • Queries run much faster, with small selective queries measured in tens or hundreds of milliseconds; scan-heavy or computationally intensive large queries are processed at very high bandwidth
  • Mutation of data is fast when needed: if for any reason data needs to be corrected or updated, it can be done in place without extensive rewriting

While this sounds simple, and might even seem trivial to some, a decade of big data developments has shown otherwise. It is difficult to maintain interactive performance with large volumes of data arriving very quickly (some of which may need to be updated), and with a large number of queries of varying patterns.

The following table provides more details on the use case characteristics that make up the RTDW.

| Feature | Detailed description |
|---|---|
| Ingestion | Medium to high throughput, usually streaming; optimized for insert-only and insert+update modes |
| Query | Optimized for point lookups, analytics, mutations, etc., with low latency and high concurrency; incoming data can be queried immediately and optimally; streaming data can be queried together with historical data, with no Lambda architecture required |
| Data model | Traditional enterprise data types; small and medium-sized models; mainly dimensional and denormalized, occasionally third normal form models |
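
The ingestion and query rows of the table can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a real-time store, which it is not; the `orders` table, its columns, and the helper names are hypothetical and chosen only for illustration. The point is the behavior: a new event is inserted, an old row is corrected in place, and one query sees fresh and historical rows together.

```python
import sqlite3

# Hypothetical schema: one table holds both historical and freshly ingested rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Historical data that was loaded earlier.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.0)],
)


def ingest(event):
    """Insert a new row, or overwrite the row with the same key (insert+update mode)."""
    conn.execute(
        "INSERT OR REPLACE INTO orders VALUES (:order_id, :customer, :amount)", event
    )


def revenue_by_customer():
    """Query fresh and historical rows together, with no separate batch view."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return dict(rows.fetchall())


ingest({"order_id": 3, "customer": "alice", "amount": 5.0})  # new event
ingest({"order_id": 2, "customer": "bob", "amount": 30.0})   # correction of an old row
totals = revenue_by_customer()
```

A production system would use a store built for this workload; the sketch only shows the insert, insert+update, and query-immediately semantics on a single table.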

In more formal language, the real-time data warehouse can be described like this:

A real-time data warehouse is a data warehouse that integrates real-time streaming data and batch data. It provides enterprises with real-time business insight and decision support, and is an important part of modern data processing and analysis.

A real-time data warehouse is a storage system for storing and analyzing real-time data. Data is automatically captured as it becomes available, then immediately analyzed and correlated with stored historical data. Ultimately, users are able to acquire data faster and view and analyze it faster.


Advantages and challenges of real-time data warehouse

A real-time data warehouse has many advantages, including:

  • Faster decision-making: Real-time data warehouses enable organizations to integrate data from disparate sources, extract insights from it, and use those insights to make strategic business decisions faster. Organizations can refresh data from all business systems in the data warehouse in real time and set reports to run as needed. This saves the time required to generate reports and analysis, enabling more agile decision-making.
  • Increased data democratization: A real-time data warehouse promotes data democratization, enabling everyone in an organization to access the current and historical data they need to carry out their responsibilities and optimize their initiatives.
  • Personalized customer experience: A real-time data warehouse provides the foundation for advanced analytics and machine learning, which are critical to delivering a personalized customer experience across all channels used by an organization.
  • Improved business agility: A real-time data warehouse increases the speed with which an enterprise can respond to change. It reduces time lags in business processes, helping the organization identify opportunities quickly and adjust strategy in different areas of the business, which in turn increases efficiency and helps it meet revenue and profit targets.
  • New business use cases: Access to data in real time can unlock a host of use cases for businesses across all industries. It has the potential to change how businesses operate and the value they can provide to customers.

The real-time data warehouse also brings challenges:

  • One of the biggest is the performance of ETL (extract, transform, load), the process of copying data from source systems into the warehouse. ETL tools typically run in batch mode, which is time-consuming and can require warehouse downtime during which data is unavailable.
  • In addition, the real-time data warehouse is weak in some complex real-time computing scenarios, such as multi-stream joins.
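
To make the multi-stream association (join) challenge more concrete, here is a minimal in-memory sketch of a windowed join between two event streams. The stream names, the 10-second window, and the `windowed_join` function are assumptions made for illustration and do not reflect any particular engine's API. The buffers in the sketch are the state that a real stream processor has to keep, expire, and recover after failures, which is a large part of why this scenario is hard.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # assumed join window


def windowed_join(clicks, orders, window=WINDOW_SECONDS):
    """Join click and order events of the same user that lie within `window` seconds.

    Each event is a (timestamp, user_id) tuple. Both inputs are merged in
    timestamp order, and every event is buffered per user so that later events
    from the other stream can still match it.
    """
    merged = sorted(
        [(ts, user, "click") for ts, user in clicks]
        + [(ts, user, "order") for ts, user in orders]
    )
    buffers = {"click": defaultdict(list), "order": defaultdict(list)}
    matches = []
    for ts, user, kind in merged:
        other = "order" if kind == "click" else "click"
        # Drop buffered events from the other stream that are too old to match.
        buffers[other][user] = [t for t in buffers[other][user] if ts - t <= window]
        for other_ts in buffers[other][user]:
            click_ts, order_ts = (ts, other_ts) if kind == "click" else (other_ts, ts)
            matches.append((user, click_ts, order_ts))
        buffers[kind][user].append(ts)
    return matches


clicks = [(1, "u1"), (4, "u2"), (30, "u1")]
orders = [(6, "u1"), (50, "u2")]
joined = windowed_join(clicks, orders)
```

The sketch assumes events can be sorted up front. In a live system events arrive out of order and the buffers grow with traffic, so the engine also needs watermarks and state cleanup, which this example leaves out.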

For more on ETL, please refer to my blog post "What is ETL? What is ELT?"


Architecture of real-time data warehouse

A real-time data warehouse usually has four components: a data collection layer, a data storage layer, a real-time computing layer, and a real-time application layer. These components work together to process and analyze event data immediately or shortly after the event occurs. All data processing stages (data ingestion, enrichment, analytics, AI/ML-based analytics) run continuously with minimal latency and enable real-time reporting and ad hoc analytics.

A typical real-time data warehouse architecture is as follows:

  • Data collection layer: third-party services and collaboration systems send data to the real-time data warehouse through message buses and data flow tools such as Apache Kafka or Apache NiFi; third-party data sources call the APIs of the real-time data warehouse; IoT systems connect directly and push their data
  • Data storage layer: uses Apache Kudu, Apache Druid, or Amazon Redshift for real-time data storage
  • Real-time computing layer: uses Apache Spark, Amazon Kinesis, or Hadoop for real-time computing and analysis
  • Real-time application layer: uses AI and machine learning to analyze and mine data; uses SQL Server or Oracle BI to support queries, reports, and ad hoc queries; uses Apache Impala to support real-time reports and alerts
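
The way the four layers hand data to each other can be sketched in a few lines. This is a toy, single-process illustration under stated assumptions: a `queue.Queue` stands in for the message bus, a list for the real-time store, and a dictionary for the computed aggregate. All function and variable names are hypothetical.

```python
import queue
from collections import defaultdict

# Data collection layer: an in-process queue stands in for a message bus.
bus = queue.Queue()

# Data storage layer: an append-only list stands in for a real-time store.
store = []

# Real-time computing layer: a running aggregate, updated per event.
page_views = defaultdict(int)


def collect(event):
    """Producers (services, APIs, devices) push events onto the bus."""
    bus.put(event)


def process_pending():
    """Drain the bus: persist each event, then update the running aggregate."""
    while not bus.empty():
        event = bus.get()
        store.append(event)
        page_views[event["page"]] += 1


def top_pages(n):
    """Real-time application layer: a report over the current aggregate."""
    return sorted(page_views.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


for page in ["/home", "/pricing", "/home", "/docs", "/home", "/docs"]:
    collect({"page": page})
process_pending()
report = top_pages(2)
```

In a real deployment each layer is a separate distributed system, but the hand-off order (collect, store, compute, serve) is the same.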

For Hadoop, please refer to my blog post "What is Hadoop?"
For Apache Kafka, please refer to my blog post "What is Kafka?"
For Apache Druid, please refer to my blog post "An article about Apache Druid".
For Apache Spark, please refer to my blog post "Why is Spark so awesome?"


Application cases of real-time data warehouse

Real-time data warehouses are widely used across industries. Some typical application cases are:

  • Real-time OLAP analysis: supports OLAP analysis on fresh data, helping enterprises obtain business insights quickly and make faster decisions (2).
  • Real-time data dashboards: supports dashboards that help enterprises monitor business operations in real time, find problems promptly, and take action (2).
  • Real-time business monitoring: helps enterprises discover business anomalies promptly and take corresponding measures (2).
  • Real-time data interface services: exposes data through real-time interface services, helping enterprises obtain the data they need quickly and improve business efficiency (2).
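
As an illustration of the real-time business monitoring case, the sketch below flags a metric value that deviates strongly from its recent average. The class name, the window of 5 observations, and the 50% tolerance are assumptions chosen for the example, not recommended settings.

```python
from collections import deque


class RollingMonitor:
    """Flag a metric value that deviates strongly from its recent average."""

    def __init__(self, window=5, tolerance=0.5):
        self.recent = deque(maxlen=window)  # last `window` normal observations
        self.tolerance = tolerance          # allowed relative deviation (0.5 = 50%)

    def observe(self, value):
        """Return True if `value` is anomalous compared with the rolling mean."""
        is_anomaly = False
        if len(self.recent) == self.recent.maxlen:
            mean = sum(self.recent) / len(self.recent)
            is_anomaly = abs(value - mean) > self.tolerance * mean
        if not is_anomaly:
            self.recent.append(value)  # keep anomalies out of the baseline
        return is_anomaly


monitor = RollingMonitor(window=5, tolerance=0.5)
orders_per_minute = [100, 104, 98, 101, 97, 102, 20, 99]
alerts = [v for v in orders_per_minute if monitor.observe(v)]
```

Here the sudden drop to 20 orders per minute is the only value flagged. The check runs per event, so an alert can fire as soon as the abnormal value arrives rather than after the next batch load.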

Technical implementation of real-time data warehouse

The technical implementation of a real-time data warehouse usually includes the following parts:

  • Message bus: collects real-time data from the various data sources and delivers it to the data storage layer.
  • Real-time storage: stores real-time data and supports fast query and analysis.
  • Stream processing and analysis: computes and analyzes real-time data as it flows through the system.
  • Application layer: presents real-time calculation results to users and supports their queries and analysis.
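
A common building block of the stream processing part is window aggregation. The following is a minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python; the function name and the sample events are hypothetical, and a real stream processing engine would additionally handle late data and distributed state.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key) tuples.
    """
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to the window boundary
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "login"), (12, "login"), (59, "purchase"), (60, "login"), (75, "login")]
per_minute = tumbling_window_counts(events, window_seconds=60)
```

With 60-second windows, the events at seconds 0, 12, and 59 fall into the window starting at 0, and the events at seconds 60 and 75 fall into the window starting at 60.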

At present, many technologies can be used to implement a real-time data warehouse. For example, Apache Flink is a popular stream processing engine that can implement the real-time computing layer, and TiDB is an open source distributed HTAP database that can implement the data storage layer. Many other technologies are available as well; the right choice depends on your needs and preferences.


Thinking questions

1. What is the difference between a data warehouse and a real-time data warehouse?

The traditional data warehouse and the real-time data warehouse are two different data management and analysis architectures. They differ in the following ways:

  1. Data update method: A traditional data warehouse usually loads data in batches, so its update cycle is long, typically daily or weekly. A real-time data warehouse ingests data as a continuous stream, so its update cycle is short, typically minutes or less.
  2. Data processing method: Both serve online analytical processing (OLAP) workloads with complex queries. A real-time data warehouse must additionally handle patterns more typical of online transaction processing (OLTP), such as low-latency point lookups and in-place updates, at high concurrency.
  3. Data structure: Traditional data warehouses usually use multidimensional models (such as the star schema and the snowflake schema). Real-time data warehouses also rely mainly on dimensional and denormalized models, and occasionally on third normal form models.
  4. Data storage method: Traditional data warehouses usually keep data in a large store optimized for offline processing. Real-time data warehouses make heavier use of memory to speed up queries and analysis.
  5. Data query method: Traditional data warehouses are queried with Structured Query Language (SQL), and OLAP cubes built on top of them are often queried with Multidimensional Expressions (MDX). Real-time data warehouses are typically queried with SQL.

In short, the two architectures differ in how data is updated, processed, modeled, stored, and queried. Choose between them according to specific business needs and scenarios.
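
The difference in update method can be sketched with two tiny classes. Both are hypothetical illustrations rather than any product's behavior: `BatchTotals` only changes its visible totals when a scheduled refresh reprocesses all staged rows, while `StreamingTotals` updates its totals on every event.

```python
class BatchTotals:
    """Batch style: totals only change when refresh() reprocesses all rows."""

    def __init__(self):
        self.rows = []
        self.totals = {}

    def load(self, region, amount):
        self.rows.append((region, amount))  # staged, not yet visible to queries

    def refresh(self):
        totals = {}
        for region, amount in self.rows:
            totals[region] = totals.get(region, 0) + amount
        self.totals = totals


class StreamingTotals:
    """Real-time style: every event updates the totals as it arrives."""

    def __init__(self):
        self.totals = {}

    def load(self, region, amount):
        self.totals[region] = self.totals.get(region, 0) + amount


batch, streaming = BatchTotals(), StreamingTotals()
for region, amount in [("eu", 10), ("us", 7), ("eu", 5)]:
    batch.load(region, amount)
    streaming.load(region, amount)

stale_view = dict(batch.totals)      # still empty before the scheduled refresh
fresh_view = dict(streaming.totals)  # already reflects all three events
batch.refresh()                      # after the refresh, both views agree
```

Both approaches end with the same totals; what differs is how long a query has to wait before it can see newly arrived data.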


2. Does the real-time data warehouse need a batch processing layer?

The real-time data warehouse does not necessarily have to have a batch layer, but in some cases, the batch layer can bring some advantages to the real-time data warehouse.

In a real-time data warehouse, the real-time processing layer is usually used to process real-time data streams, while the batch processing layer is used to process historical data. The batch processing layer can process historical data offline to produce more accurate results. For example, if you need to aggregate historical data, you can use the batch layer to aggregate it offline and store the results in a batch data storage system. In this way, the real-time data warehouse can support query and analysis of real-time data and historical data at the same time.

In addition, the batch layer can be used for data replay and failure recovery. If the real-time processing layer fails, the batch layer can reprocess the historical data and feed the result back into the real-time processing layer to ensure the integrity and correctness of the data.
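
Replay and recovery can be sketched as follows. The example assumes that a durable, append-only event log survives the failure; the log, the per-account balance state, and the function names are all hypothetical.

```python
def apply_event(state, event):
    """Fold one event into the aggregate state (a per-account balance here)."""
    account, delta = event
    state[account] = state.get(account, 0) + delta
    return state


def replay(event_log):
    """Rebuild state from scratch by reprocessing the full retained history."""
    state = {}
    for event in event_log:
        apply_event(state, event)
    return state


event_log = []   # durable, append-only history (assumed to survive failures)
live_state = {}  # state held by the real-time layer

for event in [("a", 100), ("b", 40), ("a", -30)]:
    event_log.append(event)
    apply_event(live_state, event)

expected = dict(live_state)
live_state = {}                 # simulate losing the real-time layer's state
live_state = replay(event_log)  # batch-style reprocessing restores it
```

Because the same `apply_event` function is used on the live path and during replay, the rebuilt state matches what the real-time layer held before the failure.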

In summary, in some cases, a batch layer can bring some advantages to real-time data warehouses, but it is not required. The design of a real-time data warehouse should determine whether a batch processing layer is needed based on specific business needs and scenarios.


Summary

This article discusses the background, definition, advantages and challenges, architecture, application cases, and technical implementation of the Real-time Data Warehouse (RTDW).

A real-time data warehouse is a modern data warehouse with "small data" semantics and performance at "big data" scale. It can handle real-time, recent, and historical data, and it can perform correlation analysis across data domains. Data arrives faster and queries run faster than in a traditional warehouse, and all of this happens on one integrated and secure platform.

The benefits of a real-time data warehouse include faster decision-making, data democratization, personalized customer experience, improved business agility, and new business use cases. However, the real-time data warehouse also faces challenges such as ETL performance and complex real-time computing scenarios.

A typical real-time data warehouse architecture includes a data collection layer, a data storage layer, a real-time computing layer, and a real-time application layer. The data collection layer receives and transmits data, the data storage layer stores real-time data, the real-time computing layer performs real-time calculation and analysis, and the real-time application layer supports data analysis, mining, queries, and reports.

The real-time data warehouse can be applied to scenarios such as real-time OLAP analysis, real-time data dashboard, real-time business monitoring and real-time data interface services. Its technical implementation usually includes message bus, real-time storage, stream processing and analysis, and application layer.

Commonly used real-time data warehouse technologies include Apache Kafka, Apache Druid, Apache Spark, Hadoop, TiDB, etc. The specific choice depends on the needs and preferences.


References

(1) What Is Real-Time Data? An Introduction | Splunk. https://www.splunk.com/en_us/data-insider/what-is-real-time-data.html.
(2) Building a Real-Time Data Warehouse - DZone. https://dzone.com/articles/building-a-real-time-data-warehouse-with-tidb-and.
(3) What is a Data Warehouse? | IBM. https://www.ibm.com/topics/data-warehouse.
(4) Data Warehouse: Definition, Uses, and Examples | Coursera. https://www.coursera.org/articles/data-warehouse.
(5) Active and Real-Time Data Warehousing | SpringerLink. https://link.springer.com/referenceworkentry/10.1007/978-0-387-39940-9_8.
(6) An Overview of Real Time Data Warehousing on Cloudera. https://blog.cloudera.com/an-overview-of-real-time-data-warehousing-on-cloudera/.
(7) Real-Time Data Warehouse: Architecture, Tech Stack, Examples - ScienceSoft. https://www.scnsoft.com/analytics/data-warehouse/real-time.
(8) Apache Flink + TiDB: A Scale-Out Real-Time Data Warehouse for … - PingCAP. https://www.pingcap.com/blog/apache-flink-tidb-a-scale-out-real-time-data-warehouse-for-analytics-within-seconds/.
(9) From Traditional Data Warehouse To Real Time Data Warehouse - ResearchGate. https://www.researchgate.net/publication/314248372_From_Traditional_Data_Warehouse_To_Real_Time_Data_Warehouse.
(10) How To Use Real-Time Data? Key Examples And Use Cases - Forbes. https://www.forbes.com/sites/bernardmarr/2022/03/14/how-to-use-real-time-data-key-examples-and-use-cases/.
(11) Database vs. Data Warehouse: Differences, Use Cases, Examples. https://www.couchbase.com/blog/database-vs-data-warehouse/.
(12) Best Practices for Real-time Data Warehousing - Oracle. https://www.oracle.com/technetwork/middleware/data-integrator/overview/best-practices-for-realtime-data-wa-132882.pdf.
(13) Data Warehousing Modeling Techniques and Their Implementation on the … https://www.databricks.com/blog/2022/06/24/data-warehousing-modeling-techniques-and-their-implementation-on-the-databricks-lakehouse-platform.html.
(14) Five benefits of real-time data warehousing | Blog | Fivetran. https://www.fivetran.com/blog/5-benefits-of-real-time-data-warehousing.

Origin blog.csdn.net/Shockang/article/details/131478228