[Big Data] Data Lake: The Development Trend of Next-Generation Big Data

1. The background of data lake technology

Large Chinese Internet companies generate tens to hundreds of terabytes, or even several petabytes, of raw data every day. These companies usually build their big data platforms from open source big data components. Such platforms have gone through three stages: the offline data platform represented by Hadoop, the Lambda architecture platform, and the Kappa architecture platform.

The data lake can be regarded as the latest generation of big data technology platform. To better understand its basic architecture, let's first review how big data platforms have evolved and why data lake technology is worth learning.

1.1 Offline big data platform (first generation)

[Figure: the first-generation offline big data platform built around Hadoop]
The first stage: offline data processing components represented by Hadoop. Hadoop is the basic component for batch data processing, with HDFS as the core storage and MapReduce (MR) as the basic computing model. Around HDFS and MR, a series of big data components were born to keep improving the platform's data processing capabilities, such as HBase for real-time key-value operations, Hive for SQL analytics, and Pig for dataflow scripting. Meanwhile, as performance requirements for batch processing kept rising, new computing models were proposed: engines such as Tez, Spark, and Presto appeared, and the MR model gradually evolved into the DAG model.

To reduce the writes of intermediate results during data processing, engines such as Spark and Presto cache data in the memory of compute nodes as much as possible, which improves the efficiency of the whole pipeline and the system throughput.
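To make the idea concrete, here is a minimal PySpark sketch of an intermediate result being cached in executor memory and reused by two downstream jobs, instead of being materialized to HDFS between stages as MapReduce would do. The paths and column names are hypothetical examples, not from the original article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

raw = spark.read.parquet("hdfs:///data/ods/events/")  # hypothetical input path

# The intermediate result is kept in executor memory instead of being
# written back to HDFS between stages, as MapReduce would do.
cleaned = raw.filter(F.col("event_type").isNotNull()).cache()

daily = cleaned.groupBy("event_date").count()
by_user = cleaned.groupBy("user_id").count()

daily.write.mode("overwrite").parquet("hdfs:///data/dws/daily_counts/")
by_user.write.mode("overwrite").parquet("hdfs:///data/dws/user_counts/")
```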

1.2 Lambda architecture

As data processing capabilities and requirements kept changing, more and more users found that, no matter how much batch processing improved its performance, it could not meet scenarios with strict real-time requirements. Streaming computing engines emerged as a result, such as Storm, Spark Streaming, and Flink.

However, as more and more applications came online, people found that combining batch processing and stream computing could meet the needs of most applications: for scenarios with high real-time requirements, a real-time stream processing pipeline is built with Flink + Kafka to serve users' real-time needs. Thus the Lambda architecture was proposed, as shown in the figure below.

[Figure: the Lambda architecture]
The core concept of the Lambda architecture is the separation of streams and batches. As shown in the figure above, data flows into the platform from left to right and is then split into two paths: one is processed in batch mode, the other in streaming mode. Whichever mode is used, the final results are exposed to applications through a serving layer to ensure consistent access.

This architecture involves many big data components, which greatly increases the complexity and maintenance cost of the overall system.
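To illustrate that maintenance burden, here is a hedged sketch of the same aggregation written twice, once for the batch layer with Spark SQL over Hive and once for the speed layer with Flink SQL over Kafka. The table names, topic, broker address, and fields are hypothetical, and the serving-layer sinks are omitted; in practice these are two separate jobs, often in two separate codebases.

```python
from pyspark.sql import SparkSession
from pyflink.table import EnvironmentSettings, TableEnvironment

# --- Batch layer (offline data warehouse) ---
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("""
    INSERT OVERWRITE TABLE dws.order_stats_daily
    SELECT order_date, COUNT(*) AS order_cnt, SUM(amount) AS gmv
    FROM ods.orders
    GROUP BY order_date
""")

# --- Speed layer (real-time pipeline) ---
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE ods_orders (
        order_id STRING, amount DOUBLE, order_date STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ods_orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")
# The same business logic, written and maintained a second time.
# (The result would normally be written to a serving store; sink omitted.)
t_env.execute_sql("""
    SELECT order_date, COUNT(*) AS order_cnt, SUM(amount) AS gmv
    FROM ods_orders
    GROUP BY order_date
""")
```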

1.3 Pain points of Lambda architecture

After years of development, the Lambda architecture is relatively stable and can meet the application scenarios of the past. But it has many fatal weaknesses:

[Figure: pain points of the Lambda architecture]

  • High data governance cost: the real-time pipeline cannot reuse the data lineage and data quality management systems of the offline data warehouse, so a separate lineage and quality system has to be built for real-time computing.
  • High development and maintenance cost: two systems, offline and real-time, must be maintained at the same time; the same computing logic is developed twice and the same data is stored twice. For example, updating even one or a few records of raw data requires rerunning the offline data warehouse, so data updates are very expensive.
  • Inconsistent data results: because offline and real-time computation use two completely different codebases, late-arriving data and different execution times of the two pipelines lead to inconsistent results.

So is there any architecture that can solve the problem of Lambda architecture?

1.4 Kappa Architecture

The " stream-batch separation " processing link of the Lambda architecture increases the complexity of research and development. Therefore, some people have asked whether it is possible to use one system to solve all problems. At present, the more popular method is to do it based on flow computing. Next, let's introduce the Kappa architecture, by Flink + Kafkaconnecting the entire link in series. The Kappa architecture solves the problems of inconsistent computing engines between the offline processing layer and the real-time processing layer in the Lambda architecture, high development and operation and maintenance costs, and inconsistent computing results.

[Figure: the Kappa architecture]
The Kappa architecture is also called a stream-batch unified scheme. We use Flink + Kafka to build such a scenario, but if we need to further analyze the ODS-layer data, we have to use Flink to write the data to Kafka at the DWD layer, and write part of the result data to Kafka at the DWS layer as well. The Kappa architecture is not perfect; it also has many pain points.
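As a hedged illustration, below is a minimal PyFlink SQL sketch of a single hop in such a pipeline: Flink consumes the ODS topic from Kafka, cleans it, and writes the DWD layer back to another Kafka topic. Topic names, broker address, and fields are hypothetical.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE ods_events (
        user_id STRING, event_type STRING, amount DOUBLE, ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'ods_events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'dwd-builder',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

t_env.execute_sql("""
    CREATE TABLE dwd_events (
        user_id STRING, event_type STRING, amount DOUBLE, ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'dwd_events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

-- comment: DWD is the cleaned ODS; every layer lives in Kafka, which is
-- why backtracking and OLAP on these layers are hard (see the pain points below).
t_env.execute_sql("""
    INSERT INTO dwd_events
    SELECT user_id, event_type, amount, ts
    FROM ods_events
    WHERE event_type IS NOT NULL
""")
```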

1.5 Pain points of Kappa architecture

[Figure: pain points of the Kappa architecture]

  • Weak data backtracking ability: Kafka is weak at supporting complex analytical needs. For more complex analysis, the DWD and DWS layer data has to be written out to ClickHouse, ES, MySQL, or Hive for further processing, which adds complexity to the pipeline. Worse, when data needs to be replayed, this link complexity makes backtracking very weak.
  • Weak OLAP analysis ability: Kafka is an append-only, sequential storage system, so OLAP analysis cannot be performed on it directly; optimizations such as predicate pushdown are very hard to implement on such a storage platform.
  • Data ordering is challenged: the Kappa architecture depends heavily on message queues, and the correctness of a message queue strictly depends on the ordering of its upstream data. The more layers the data passes through in message queues, the more likely it is to become out of order. ODS-layer data is usually accurate, but after it is computed and written to the DWD layer it may become out of order, and DWD to DWS is even more prone to disorder. Such data inconsistency is a serious problem.

1.6 Summary of Big Data Architecture Pain Points

From the traditional Hadoop architecture to the Lambda architecture, and then to the Kappa architecture, the evolution of big data platform infrastructure has gradually covered the various data processing capabilities that applications require, yet these platforms still have many pain points.

[Figure: summary of big data architecture pain points]
Is there a storage technology that supports efficient data backtracking and data updates, enables both batch and streaming reads and writes, and provides data access latency at the minute-to-second level?

1.7 Real-time Data Warehouse Construction Requirements

This is also an urgent need in real-time data warehouse construction. In fact, some of the problems of the Kappa architecture can be solved by upgrading it. Next, we will focus on the currently popular data lake technology.

[Figure: requirements of real-time data warehouse construction]
So is there an architecture that can meet real-time requirements and offline computing requirements at the same time, reduce development and operations costs, and solve the pain points of a Kappa architecture built on message queues? The answer is yes, and it will be discussed in detail later in the article.

2. The data lake helps to solve the pain points of the data warehouse

2.1 Continuous improvement of data lake concept

A data lake is a centralized repository that can store both structured and unstructured data. Business data can be stored as-is (without structuring it first), and different types of analytics can be run on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decision making.

[Figure: the data lake concept]

2.1.1 Storing original data

  • The data lake needs enough storage capacity to hold all of the company's data.
  • The data lake can store data of various types, including structured data, semi-structured data (XML, JSON, etc.), and unstructured data (images, video, audio).
  • The data in the data lake is a complete copy of the original business data, kept exactly as it appears in the business systems.

2.1.2 Flexible underlying storage function

In practice, the data in a data lake is usually not accessed frequently, so to achieve acceptable cost effectiveness, a cost-effective storage engine is usually chosen for building it.

  • Provide ultra-large-scale storage for big data, as well as scalable large-scale data processing capabilities.
  • Distributed storage platforms such as S3, OSS, and HDFS can be used as the storage engine.
  • Supports data structure formats such as Parquet, Avro, and ORC.
  • Can provide data cache acceleration function.

2.1.3 Rich computing engine

From batch computing and stream computing to interactive query analysis and machine learning, all kinds of computing engines fall within the scope a data lake should cover. As big data and artificial intelligence converge, various machine learning / deep learning algorithms keep being introduced; for example, TensorFlow and PyTorch can already read sample data from S3 / OSS / HDFS for training. So for a qualified data lake project, pluggable compute and storage engines are a basic capability the data lake must possess.
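As a hedged illustration, the sketch below reads a Parquet file of training samples directly from object storage with the s3fs and pyarrow libraries and feeds it into a PyTorch DataLoader; the bucket, path, and feature columns are hypothetical, and credentials are assumed to come from the environment.

```python
import pyarrow.parquet as pq
import s3fs
import torch
from torch.utils.data import DataLoader, TensorDataset

fs = s3fs.S3FileSystem()  # credentials are taken from the environment

# Read one Parquet file of training samples directly from the data lake.
table = pq.read_table("datalake-bucket/ods/samples/part-0000.parquet", filesystem=fs)
df = table.to_pandas()

features = torch.tensor(df[["f1", "f2", "f3"]].values, dtype=torch.float32)
labels = torch.tensor(df["label"].values, dtype=torch.float32)

loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)
for batch_features, batch_labels in loader:
    pass  # feed each batch into the model's training step here
```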

2.1.4 Perfect data management

  • Data lakes need to have comprehensive metadata management capabilities , including data sources, data formats, connection information, data schemas, and permission management.
  • Data lakes need to have comprehensive data lifecycle management capabilities . Not only can the original data be stored, but also the intermediate result data of various analysis and processing needs to be able to be saved, and the analysis and processing process of the data can be completely recorded to help users trace the generation process of any piece of data completely.
  • Data lakes need to have comprehensive data acquisition and data release capabilities . The data lake needs to be able to support various data sources, obtain full/incremental data from related data sources, and then standardize storage. Data lakes can push data to appropriate storage engines to meet different application access requirements.

2.2 Architecture of open source data lake

The LakeHouse architecture has become the hottest trend in current architecture evolution. It provides data management capabilities directly on top of low-cost lake storage, combining the main advantages of the data warehouse. A LakeHouse is built on a storage-compute-separated architecture; the biggest issue with separating storage and compute is the network, and for frequently accessed data warehouse workloads network performance is crucial. There are several options for implementing a LakeHouse, such as Delta, Hudi, and Iceberg. Although the three have different focuses, they all provide the general capabilities of a data lake, such as unified metadata management, support for multiple compute and analysis engines, support for advanced analytics, and separation of compute and storage.

So what does an open source data lake architecture generally look like? Here I drew an architecture diagram, which is mainly divided into four layers:

[Figure: the four-layer open source data lake architecture]

2.2.1 Distributed file system

The first layer is the distributed file system. Users on the cloud usually choose S3 or Alibaba Cloud OSS to store data; users who prefer open source technology generally run their own HDFS.

2.2.2 Data acceleration layer

The second layer is the data acceleration layer. The data lake architecture is a typical storage-compute-separated architecture, and remote reads and writes incur a large performance penalty. The common practice is to cache frequently accessed data (hot data) locally on the compute nodes, separating hot and cold data. This improves read/write performance and saves network bandwidth. Options include the open source Alluxio or Alibaba Cloud JindoFS.
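Below is a minimal sketch of the cache-acceleration idea, assuming an Alluxio cluster is deployed and its client is configured for Spark; the host, port, and paths are hypothetical. The same Parquet data that lives on remote storage is accessed through the alluxio:// scheme so that hot data is served from cache on the compute nodes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-layer-demo").getOrCreate()

# Cold path: read directly from remote object storage.
cold = spark.read.parquet("s3a://datalake-bucket/dwd/orders/")

# Hot path: the same dataset mounted in Alluxio; repeated reads hit the local cache.
hot = spark.read.parquet("alluxio://alluxio-master:19998/dwd/orders/")

print(cold.count(), hot.count())
```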

2.2.3 Table format layer

The third layer is the table format layer, which encapsulates data files into tables with business meaning and provides table-level semantics such as ACID transactions, snapshots, schema, and partitions. For this layer you can choose one of the "three musketeers" of open source data lakes: Delta, Hudi, or Iceberg. Note that Delta, Hudi, and Iceberg are technologies for building data lakes; they are not data lakes themselves.
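Below is a hedged PySpark sketch of what this layer provides in practice, using Iceberg as the example: plain files under a warehouse path become a table with schema, partitions, ACID commits, and queryable snapshots. The catalog name, warehouse path, and table schema are hypothetical, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-table-format")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs:///warehouse/lake")
    .getOrCreate()
)

# Plain data files become a table with business meaning and table-level semantics.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.user_events (
        user_id BIGINT, event_type STRING, ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql("INSERT INTO lake.db.user_events VALUES (1, 'click', current_timestamp())")

# Each commit produces a snapshot that can be inspected and time-travelled to.
spark.sql("SELECT snapshot_id, committed_at FROM lake.db.user_events.snapshots").show()
```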

2.2.4 Compute engines

The fourth layer consists of the various compute engines, including Spark, Flink, Hive, Presto, and so on. All of these engines can access the same table in the data lake.

3. Comparison of data lake and data warehouse concepts

3.1 Comparison between data lake and data warehouse

Let me share my understanding of the essence of the data lake; if you don't grasp the essence of a new thing, it is hard to master it. The picture below says it all.

[Figure: the essence of the data lake versus the data warehouse]
With a basic understanding of the data lake concept, we need to further clarify what basic characteristics a data lake should have, especially compared with a data warehouse. Let's refer to AWS's official comparison of data warehouses and data lakes.

Every company needs a data warehouse and a data lake because they serve different needs and use cases:

  • A data warehouse is a database optimized for analyzing relational data from transactional systems and line-of-business applications. The data structure and schema are defined in advance to provide fast SQL queries, and the raw data goes through a series of ETL transformations to give users a trusted "single source of results".
  • A data lake is different: it stores not only relational data from line-of-business applications but also non-relational data from mobile apps, IoT devices, and social media. There is no need to define the data structure or schema when capturing the data, which means a data lake can store all types of data without careful up-front structuring. Different types of analytics (such as SQL queries, big data analytics, full-text search, real-time analytics, and machine learning) can then be applied to the data.
Characteristic | Data warehouse | Data lake
Data | Relational data from transactional systems, operational databases, and line-of-business applications | Non-relational and relational data from IoT devices, websites, mobile apps, social media, and enterprise applications
Schema | Designed before the warehouse is built (schema-on-write) | Written at analysis time (schema-on-read)
Price/performance | Fastest query results, at higher storage cost | Query results getting faster, using low-cost storage
Data quality | Highly curated data that serves as the central version of the truth | Any data, curated or not (such as raw data)
Users | Business analysts | Data scientists, data developers, and business analysts (using curated data)
Analytics | Batch reporting, BI, and visualization | Machine learning, predictive analytics, data discovery, and profiling

The table above shows the differences between data lakes and traditional data warehouses. Next, we analyze the characteristics of data lakes from the two perspectives of data storage and computing.

3.2 Write Mode and Read Mode

3.2.1 Write mode

The hidden logic behind the data warehouse's "schema-on-write" is that the schema must be confirmed before the data is written, and only then is the data imported. The advantage is that business and data are well aligned; the disadvantage is that the warehouse is not flexible enough when the business model is unclear and still exploratory.

3.2.2 Read mode

The data lake emphasizes "schema-on-read", and the underlying logic is that business uncertainty is the norm: since we cannot predict how the business will develop and change, we keep a degree of flexibility and defer structural design, so that the whole infrastructure can make the data fit the business "on demand". Data lakes are therefore better suited to fast-growing, innovative enterprises.
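A small PySpark sketch of schema-on-read follows, assuming raw JSON events were dumped into the lake as-is; the path and field names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No table or schema was defined when this data landed in the lake;
# the structure is inferred only now, at analysis time.
events = spark.read.json("s3a://datalake-bucket/raw/app_events/2023-01-01/")
events.printSchema()

events.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type").show()
```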

3.3 Data Warehouse Development Process

[Figure: the data warehouse development process]
The data lake adopts the flexible and fast "schema-on-read" approach. In the wave of digital transformation, it genuinely helps enterprises complete technical transformation, accumulate data assets, and cope with the endless stream of data requirements that comes with rapid business growth.

3.4 Architecture scheme of data lake

The data lake can be considered a new generation of big data infrastructure. In this architecture, whether data is processed as a stream or as a batch, it is stored uniformly on Iceberg. Clearly, this architecture can resolve the pain points of both the Lambda and Kappa architectures:

[Figure: the data lake architecture scheme]

3.4.1 Solving the problem of Kafka's limited data storage

Currently, the basic idea of all data lakes is a file management system built on HDFS-like storage, so the data volume can be very large.

3.4.2 Support OLAP query

Similarly, the data lake is implemented on HDFS-like storage, and current OLAP query engines only need some adaptation to run OLAP queries on the intermediate-layer data.

3.4.3 Data Governance Integration

Once batch and streaming data are stored on HDFS, S3, and similar media, the same data lineage and data quality management systems can be reused.

3.4.4 Stream batch architecture unification

Compared with the Lambda architecture, the data lake architecture has a unified schema and unified data processing logic, and users no longer need to maintain two copies of data.

3.4.5 Consistent Data Statistics

Due to the adoption of a unified stream-batch integrated computing and storage scheme, data consistency is guaranteed.

3.5 Which is better

It cannot be said that either the data lake or the data warehouse is better; each has its own strengths, and they can complement each other. Here is a picture to aid understanding:

[Figure: lake-warehouse integration]

  • The metadata of the lake and the warehouse is seamlessly connected and complementary. The warehouse's models are fed back into the data lake (becoming part of the raw data), and the lake's structured applications are deposited into the warehouse.
  • Lake and warehouse are developed in a unified way; data stored in different systems can be managed uniformly through one platform.
  • For data in the lake and the warehouse, business needs determine which data is stored in the warehouse and which in the lake, thus forming lake-warehouse integration.
  • Data stays in the lake, models stay in the warehouse, and the two are transformed into each other repeatedly.

4. Data lake helps upgrade data warehouse architecture

4.1 The goal of building a data lake

The data lake technology Iceberg currently supports three file formats: Parquet, Avro, and ORC. As shown in the figure below, Iceberg's capabilities can be summarized as follows; these capabilities are crucial for building an integrated lakehouse.

[Figure: Iceberg's capabilities]

  • The data storage layer adopts a standard, unified data storage model.
  • Move from T+1 offline data construction to quasi-real-time data construction to guarantee data timeliness.
  • Data traceability is easier and operations costs are lower.

4.2 Quasi-real-time data access

The data lake technology Iceberg supports not only read-write separation but also concurrent reads, incremental reads, and small-file compaction, with latencies ranging from seconds to minutes. Building on these advantages, we try to use Flink + Iceberg to construct a stream-batch unified, near-real-time data warehouse architecture covering the full pipeline.

As shown in the figure below, each Iceberg commit changes the visibility of data, for example turning newly written data from invisible to visible; this is how near-real-time data ingestion is achieved.

[Figure: Iceberg snapshots and commit-based visibility]
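A hedged PySpark sketch of the snapshot mechanism follows: it lists the snapshots created by commits, then reads only the increments between two of them. The catalog configuration, table name, and snapshot IDs are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-incremental-read")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs:///warehouse/lake")
    .getOrCreate()
)

# Every commit adds a snapshot; data becomes visible when its snapshot commits.
spark.sql("SELECT committed_at, snapshot_id, operation "
          "FROM lake.db.user_events.snapshots ORDER BY committed_at").show(truncate=False)

# Incremental read: only the records appended after the start snapshot
# (exclusive) up to the end snapshot (inclusive).
incremental = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5781947118336215154")   # hypothetical IDs
    .option("end-snapshot-id", "5833097312235301267")
    .load("lake.db.user_events")
)
incremental.show()
```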

4.3 Real-time data warehouse - data lake analysis system

When building an offline data warehouse, data must first be ingested, for example by extracting it periodically with an offline scheduler, running a series of ETL jobs, and finally writing it into Hive tables; the latency of this process is large. With the help of Iceberg's table format, Flink or Spark Streaming can instead achieve near-real-time data ingestion and reduce data latency.

Based on the above capabilities, let's revisit the Kappa architecture discussed earlier. We already know its pain points. Since Iceberg is an excellent table format and supports both a Streaming Reader and a Streaming Sink, could we consider replacing Kafka with Iceberg?

Iceberg's underlying storage is cheap storage such as HDFS or S3, and it supports file formats such as Parquet, ORC, and Avro, so OLAP analysis can be performed on the intermediate-layer result data. Based on streaming reads of Iceberg snapshots, the latency of offline jobs can be reduced greatly, from day level to hour level, turning the system into a near-real-time data lake analysis system.

[Figure: the near-real-time data lake analysis system]
In the intermediate processing layers, Presto can be used for simple SQL queries. Because Iceberg supports streaming reads, Flink can also be connected directly to the intermediate layers to run batch or streaming jobs, and the intermediate results are passed downstream for further computation.
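Below is a hedged PyFlink SQL sketch of using Iceberg instead of Kafka between layers: the DWD table is consumed as a continuous stream of newly committed snapshots and an append-only DWS table is produced. The catalog, warehouse path, and table names are hypothetical, and both tables are assumed to already exist in the catalog.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 'hdfs:///warehouse/lake'
    )
""")

-- comment: streaming read: Flink monitors the Iceberg table and emits each new
-- snapshot as it commits, so the same intermediate layer is stream-readable by
-- Flink and OLAP-queryable by engines such as Presto.
t_env.execute_sql("""
    INSERT INTO lake.db.dws_big_orders
    SELECT order_id, user_id, amount, order_date
    FROM lake.db.dwd_orders /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s') */
    WHERE amount > 1000
""")
```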

4.4 Advantages and disadvantages of Iceberg replacing Kafka

[Figure: pros and cons of replacing Kafka with Iceberg]
In general, the advantages of Iceberg replacing Kafka mainly include:

  • Stream-batch unification at the storage layer
  • OLAP analysis on the intermediate layers
  • Full support for efficient data backtracking
  • Reduced storage cost

Of course, there are certain shortcomings, such as:

  • Data latency shifts from real-time to near-real-time
  • Interfacing with other data systems requires additional development work

4.5 Solve MySQL data synchronization problem through Flink CDC

Iceberg provides a unified data lake storage table format and supports data analysis with multiple computing engines (including Spark, Presto, and Hive); it produces purely columnar data files, which suit OLAP workloads well; its snapshot-based design supports incremental reads; and its interfaces are highly abstract and well isolated, independent of both the upper-level compute engines and the underlying storage, which makes it convenient for users to implement their own business logic.

The data, together with its CDC flags, is appended directly to Iceberg; at merge time, the incremental data is compared against the previous full data according to a well-defined organizational format and merged in one pass using an efficient computation method. The advantage of this approach is that it supports near-real-time import and real-time reads; and the Flink SQL in this solution natively supports CDC ingestion, requiring no extra business-field design.

[Figure: synchronizing MySQL data into the lake with Flink CDC]
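A hedged PyFlink SQL sketch of this CDC path follows: MySQL binlog changes are ingested with the mysql-cdc connector and merged into an Iceberg format-version-2 table in upsert mode. Connection details, database, table, and column names are hypothetical; the flink-cdc and iceberg-flink runtime jars are assumed to be on the classpath.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE mysql_orders (
        order_id BIGINT,
        user_id BIGINT,
        amount DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'mysql-host',
        'port' = '3306',
        'username' = 'flink',
        'password' = 'flink-password',
        'database-name' = 'shop',
        'table-name' = 'orders'
    )
""")

t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 'hdfs:///warehouse/lake'
    )
""")

t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lake.db.orders (
        order_id BIGINT,
        user_id BIGINT,
        amount DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'format-version' = '2',
        'write.upsert.enabled' = 'true'
    )
""")

-- comment: inserts, updates, and deletes from the binlog are applied to the lake
-- table, so the Iceberg table converges to the current state of the MySQL table.
t_env.execute_sql("INSERT INTO lake.db.orders SELECT * FROM mysql_orders")
```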

5. Development prospect of data lake technology

Data lakes may well be the highlight of the next big data technology wave, and we should seize the opportunity to learn about them. My suggestion, however, is still to "learn it but don't rush to use it". Why? For example, at the beginning of 2018 we rushed to adopt Flink and then had to upgrade versions every month, which was quite painful. So let the big Internet companies fill in the pits first, and then roll out the data lake quickly and directly once it has matured; but we must learn it now.

6. Summary

Through this article, we have gained a basic understanding of what a data lake is, why it is worth learning, and what practical problems it can solve. In a later article we will continue with why Iceberg is chosen as the data lake solution.
