Talking about Big Data: Understanding the Data Lake

        Introduction: The data lake is a hot concept at present, and many enterprises are building or planning to build their own data lakes. But before planning a data lake, it is crucial to figure out what a data lake is, clarify the basic components of a data lake project, and then design the basic architecture of the data lake.

Table of contents

Data Lake Definition

Wikipedia

AWS

Microsoft

Definition summary

Basic characteristics of a data lake

Data

Fidelity

Flexibility

Manageability

Traceability

Computing

Computing engine

Storage engine

Data Lake Basic Architecture

Data Lake Architecture Diagram

The basic process of data lake construction

Data Warehouse Construction Process

Data analysis

Model abstraction

Data access

Converged Governance

Business support

Data Lake Construction Process

Data analysis

Technology Selection

Data access

Converged Governance

Summary


Data Lake Definition

Wikipedia

        A data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files. A data lake is typically a single store for all of an enterprise's data. This full data includes copies of the raw data produced by the source systems as well as transformed data generated for various tasks, including reporting, visualization, advanced analytics, and machine learning. A data lake includes structured data from relational databases (rows and columns), semi-structured data (such as CSV, logs, XML, JSON), unstructured data (such as emails, documents, PDFs), and binary data (such as images, audio, video).

        A data swamp is a degraded, unmanaged data lake that is either inaccessible or does not provide sufficient value to users.

AWS

        A data lake is a centralized repository that allows you to store all structured and unstructured data at any scale. You can store data as-is (without first structuring it) and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decision making.

Microsoft

        The data lake includes all the capabilities that enable developers, data scientists, and analysts to store and process data more easily. These capabilities allow users to store data of any size, type, and speed, and to perform all types of analysis and processing across platforms and languages. While helping users accelerate their data applications, the data lake removes the complexity of data collection and storage, and also supports batch processing, streaming computing, and interactive analysis. Data lakes work with existing IT investments in data management and governance to keep data consistent, manageable, and secure. They can also integrate seamlessly with existing operational databases and data warehouses to help extend existing data applications.

Definition summary

        As the saying goes, a hundred schools of thought contend: there are in fact many definitions of the data lake, but they basically revolve around the same few characteristics, summarized below.

  • A data lake needs to provide sufficient storage capacity to hold all the data of an enterprise/organization.
  • A data lake can store massive amounts of data of any type, including structured, semi-structured, and unstructured data.
  • The data in a data lake is raw data, a complete copy of the business data, kept exactly as it appears in the business systems.
  • A data lake needs comprehensive data management capabilities (complete metadata) and must be able to manage all data-related elements, including data sources, data formats, connection information, data schemas, and permissions.
  • A data lake needs diverse analysis capabilities, including but not limited to batch processing, streaming computing, interactive analysis, and machine learning; it also needs to provide certain task scheduling and management capabilities.
  • A data lake needs comprehensive data lifecycle management capabilities: besides storing the raw data, it must also be able to save the intermediate results of various analysis and processing steps and completely record the processing history, so that users can trace in detail how any piece of data was produced.
  • A data lake needs comprehensive data acquisition and data publishing capabilities: it must support a variety of data sources, obtain full/incremental data from them, and store it in a standardized way; it must also be able to push analysis and processing results to storage engines appropriate to different application access requirements.

        Therefore, a data lake should be an evolving and scalable infrastructure for big data storage, processing, and analysis. It is data-oriented, achieving full acquisition, full storage, multi-mode processing, and full lifecycle management of data from any source, at any speed, at any scale, and of any type; and through interaction and integration with various external heterogeneous data sources, it supports a variety of enterprise-level applications.


Basic characteristics of a data lake

Data

Fidelity

        The data lake stores an "exactly the same", complete copy of the data in the business systems. The difference from a data warehouse is that the data lake must preserve a copy of the original data, with no modification to the data format, data schema, or data content. In this respect, the data lake emphasizes preserving business data "as it is". At the same time, a data lake should be able to store data of any type/format.
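As a minimal sketch of this "store as-is" principle (assuming an S3-compatible object store accessed through the boto3 client; the bucket, prefix, and dataset names are made up for illustration), raw ingestion copies the source file byte-for-byte, with no parsing, schema, or format conversion applied:

```python
import boto3

# Hypothetical raw-zone ingestion: the source file lands in the lake
# unchanged, preserving format, schema, and content (the "fidelity"
# property described above). Bucket and key names are illustrative.
s3 = boto3.client("s3")

def ingest_raw(local_path: str, source_system: str, dataset: str) -> str:
    filename = local_path.rsplit("/", 1)[-1]
    key = f"raw/{source_system}/{dataset}/{filename}"
    with open(local_path, "rb") as f:
        s3.put_object(Bucket="my-data-lake", Key=key, Body=f.read())
    return key

# Usage: the original CSV/log/PDF is stored exactly as produced.
# ingest_raw("/exports/orders_20230401.csv", "erp", "orders")
```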

Flexibility

        Since there is no way to predict how the business will change, the data is simply kept in its most original state; once a need arises, the data can be processed according to the requirements of the moment. Data lakes are therefore better suited to innovative enterprises and to businesses that change and develop rapidly. At the same time, data lakes place higher demands on their users: data scientists and business analysts (equipped with suitable visualization tools) are the target users of a data lake.

Manageability

        A data lake holds two kinds of data: raw data and processed data. The data in the lake continuously accumulates and evolves, so strong data management capabilities are required, covering at least: data sources, data connections, data formats, and data schemas (database/table/column/row). And since the data lake is the unified data store of an enterprise/organization, it also needs certain permission management capabilities.
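As a toy illustration of the management elements just listed (source, connection, format, schema, permissions), here is a minimal catalog record in plain Python; a real lake would use a catalog service such as Hive Metastore or AWS Glue instead, and every name below is hypothetical:

```python
from dataclasses import dataclass, field

# One catalog entry per dataset, covering the elements a data lake
# must manage: source, connection, format, schema, and permissions.
@dataclass
class DatasetEntry:
    name: str
    source: str                 # originating business system
    connection: str             # how the source is reached (JDBC URL, API, ...)
    fmt: str                    # csv / parquet / json / binary ...
    schema: dict                # column name -> type; empty for raw binary
    owners: list = field(default_factory=list)  # permission management
    location: str = ""          # path in the lake's unified storage

catalog: dict[str, DatasetEntry] = {}

entry = DatasetEntry(
    name="erp.orders",
    source="erp",
    connection="jdbc:mysql://erp-db:3306/sales",
    fmt="csv",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    owners=["data-eng"],
    location="s3://my-data-lake/raw/erp/orders/",
)
catalog[entry.name] = entry
```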

Traceability

        A data lake stores all the data of an organization/enterprise, so it needs to manage the entire life cycle of that data, covering the whole process of data definition, access, storage, processing, analysis, and application. A well-implemented data lake can trace the access, storage, processing, and consumption of any piece of data, and can clearly reproduce the complete process by which the data was generated and flowed through the system.
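To make the traceability requirement concrete, here is a deliberately minimal lineage log in plain Python, a sketch only: real systems use dedicated lineage/metadata tools, and the file name and dataset names are made up:

```python
import json
import time

# Each processing step appends one event recording what it read, what it
# wrote, and what it did, so any output can be traced back to its inputs.
LINEAGE_LOG = "lineage.jsonl"

def record_lineage(inputs: list, outputs: list, operation: str) -> None:
    event = {
        "ts": time.time(),
        "inputs": inputs,      # dataset names read by this step
        "outputs": outputs,    # dataset names written by this step
        "operation": operation,
    }
    with open(LINEAGE_LOG, "a") as f:
        f.write(json.dumps(event) + "\n")

# Usage (illustrative): called by every ETL job.
# record_lineage(["raw/erp/orders"], ["clean/orders_daily"], "dedupe+aggregate")
```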

Computing

Computing engine

        From batch processing and streaming computing to interactive analysis and machine learning, all kinds of computing engines fall within the scope a data lake should cover.

        In general, batch computing engines are used for data loading, conversion, and processing; streaming computing engines are used for real-time computing; and interactive analysis engines may be required for some exploratory analysis scenarios.

        As big data technology and artificial intelligence become ever more tightly integrated, various machine learning/deep learning algorithms are also being introduced; for example, the TensorFlow/PyTorch frameworks already support reading sample data from HDFS/S3/OSS for training.
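As a hedged sketch of how one lake directory can serve multiple engines (assuming PySpark with an S3-compatible filesystem configured; all paths are illustrative), the same raw zone can back both a batch job and a Structured Streaming job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-engines").getOrCreate()

# Batch engine: bulk loading, conversion, and processing of history.
batch_df = spark.read.parquet("s3a://my-data-lake/raw/erp/orders/")
daily = batch_df.groupBy("order_date").sum("amount")
daily.write.mode("overwrite").parquet("s3a://my-data-lake/clean/orders_daily/")

# Streaming engine: the same directory consumed incrementally, in real time.
stream_df = (spark.readStream
             .schema(batch_df.schema)   # streaming reads need an explicit schema
             .parquet("s3a://my-data-lake/raw/erp/orders/"))
query = (stream_df.writeStream
         .format("parquet")
         .option("path", "s3a://my-data-lake/realtime/orders/")
         .option("checkpointLocation", "s3a://my-data-lake/_chk/orders/")
         .start())
```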

Storage engine

        In actual use, the data in a data lake is usually not accessed frequently, and the related applications are mostly exploratory. To achieve acceptable cost-effectiveness, data lake construction usually chooses relatively cheap storage engines (such as S3/OSS/HDFS/OBS), working together with external storage engines when needed to meet diverse application requirements.
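"Working together with external storage engines" often means pushing computed results out of the cheap lake storage into an engine suited to the access pattern. A sketch under assumed names (PySpark writing to an illustrative PostgreSQL serving database):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("serve-results").getOrCreate()

# Results computed over cheap object storage are pushed to an external
# relational database that serves dashboards with low-latency queries.
# The JDBC URL, table, and credentials are illustrative placeholders.
results = spark.read.parquet("s3a://my-data-lake/clean/orders_daily/")
(results.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://serving-db:5432/analytics")
    .option("dbtable", "orders_daily")
    .option("user", "etl")
    .option("password", "***")
    .mode("overwrite")
    .save())
```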


Data Lake Basic Architecture

Here is an interesting question and answer I once came across:

Q: Why is the data lake called a data lake instead of a data river or a data sea?

A: "River" emphasizes fluidity. The river will eventually flow into the sea, and enterprise-level data needs long-term precipitation, so it is more appropriate to call it "lake" than "river". At the same time, lake water is naturally layered , to meet different ecosystem requirements, which is consistent with the needs of enterprises. "Hot" data is on the upper layer, which is convenient for applications to use at any time; warm data and cold data are located in different storage media in the data center to achieve a balance between data storage capacity and cost .

        The reason it is not called a "sea" is that the sea is boundless, while a "lake" has boundaries, namely the business boundary of the enterprise/organization; for this reason, a data lake requires stronger data management and permission management capabilities.

        Another important reason for the name "lake" is that data lakes require fine-grained governance. A data lake lacking control and governance will eventually degenerate into a "data swamp", making it impossible for applications to access data effectively and causing the stored data to lose its value.

        Within enterprises/organizations, it has become a consensus that data is an important asset. To make better use of data, enterprises/organizations need to store data assets as-is for the long term, manage them effectively under centralized governance, and, with a business orientation, provide unified data views, data models, and data processing results.

        A typical data lake resembles a big data platform in that it also has the storage and computing capabilities required to process ultra-large-scale data and can provide multi-mode data processing capabilities; what it adds on top is more complete data management capability.

Data Lake Architecture Diagram

Figure: the reference architecture of a data lake system.

In most data lake practices, it is recommended to use distributed systems such as S3/OSS/OBS/HDFS as the unified storage of data lakes.

The basic process of data lake construction

         The data lake is a more complete big data processing infrastructure than the traditional big data platform, and it sits closer to the customer's business. Everything a data lake includes beyond a big data platform, such as metadata, a data asset catalog, permission management, data lifecycle management, data integration and data development, and data governance and quality management, exists to move closer to the business and make data more convenient for customers to use.

Data Warehouse Construction Process

Data analysis

        The first step in building a data warehouse is a comprehensive survey of internal data, covering data sources, data types, data forms, data schemas, total data volume, and data increments. An implicit but important task at this stage is to use the data survey to further sort out the enterprise's organizational structure and clarify the relationship between data and that structure. This lays the foundation for later defining the data lake's user roles, permission design, and service methods.

Model abstraction

        According to the business characteristics of the enterprise/organization, sort and classify the various types of data, divide the data into domains, form the metadata used for data management, and build a general data model on top of that metadata.

Data access

        Based on the results of the first step, determine the data sources to be connected; based on the data sources, determine the required data access capabilities and complete the technology selection for data access. The data to be ingested includes at least: data source metadata, raw data metadata, and the raw data itself. All data is classified and stored according to the results formed in the second step.

Converged Governance

        Simply put, this step uses the various computing engines provided by the data lake to process the data, producing intermediate data and result data that are properly managed and stored. The data lake should have complete data development, task management, and task scheduling capabilities, and should record the data processing process in detail. In the course of governance, more data models and indicator models will be needed.

Business support

        On the basis of the general model, each business department customizes its own detailed data model, data usage process, and data access service.

        In many cases, however, the business is still in a trial-and-error, exploratory phase and cannot yet say where it is heading, so no general data model can be abstracted; and without a data model, none of the subsequent steps can proceed. This is one of the important reasons many fast-growing companies feel that the data warehouse/data middle platform is hard to implement and cannot meet their needs.

Data Lake Construction Process

Data analysis

        It is still necessary to understand the basic situation of the data, including data sources, data types, data forms, data schemas, total data volume, and data increments. But that is all that is needed: the data lake stores the raw data in full, so no in-depth design is required in advance.

Technology Selection

        Based on the situation of the data, determine the technology selection for building the data lake. This step is actually quite simple, because the industry already has many common practices for data lake technology selection. The basic principles are "separation of computing and storage", "elasticity", and "independent scaling". For storage, choose a distributed object storage system (such as S3/OSS/OBS); for the computing engine, it is recommended to focus on batch processing and SQL processing capabilities, because in practice these two capabilities are the key to data processing (stream computing engines are discussed later). For both computing and storage, serverless offerings are worth prioritizing.
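To make "separation of computing and storage" plus "batch and SQL first" concrete, here is a hedged sketch (PySpark over an S3-compatible object store; paths, columns, and names are illustrative): the compute cluster can be resized or torn down independently of the data it queries:

```python
from pyspark.sql import SparkSession

# The Spark cluster is the elastic compute layer; the object store holds
# the data. Either can scale independently of the other.
spark = SparkSession.builder.appName("lake-sql").getOrCreate()

spark.read.parquet("s3a://my-data-lake/raw/erp/orders/") \
     .createOrReplaceTempView("orders")

top = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top.write.mode("overwrite").parquet("s3a://my-data-lake/clean/top_customers/")
```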

data access

        Determine the data sources to be connected, then complete the initial full extraction and set up the incremental connection.
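A common pattern for "full extraction plus incremental connection" is a watermark column: pull everything once, then only rows changed since the last run. A sketch with assumed names (PySpark JDBC source and an `updated_at` column; all identifiers are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-ingest").getOrCreate()

# The watermark from the previous run would be persisted somewhere
# durable (a small control table, for instance); hard-coded here.
last_watermark = "2023-04-01 00:00:00"

incremental = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://erp-db:3306/sales")
    .option("query",
            f"SELECT * FROM orders WHERE updated_at > '{last_watermark}'")
    .option("user", "etl")
    .option("password", "***")
    .load())

# Append only the new/changed rows to the raw zone, preserved as-is.
incremental.write.mode("append").parquet("s3a://my-data-lake/raw/erp/orders/")
```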

Converged Governance

        Start from data applications, clarify the requirements of each application, and gradually form business-usable data in the process of ETL; at the same time, form the data models, indicator systems, and corresponding quality standards. Data lakes emphasize storing raw data and supporting exploratory analysis and application of that data, but this by no means implies that data lakes need no data models; on the contrary, understanding and abstracting the business will greatly promote the development and application of the data lake. Data lake technology keeps data processing and modeling highly agile, able to adapt quickly as the business develops and changes.
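As one small, hedged example of a "corresponding quality standard" enforced during governance (PySpark; the column, threshold, and paths are all illustrative), a batch can be gated on the null rate of its key column before it is promoted from the raw zone to the clean zone:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quality-gate").getOrCreate()

df = spark.read.parquet("s3a://my-data-lake/raw/erp/orders/")

# Quality standard (illustrative): at most 1% of rows may lack a key.
total = df.count()
null_ids = df.filter(F.col("order_id").isNull()).count()
if total == 0 or null_ids / total > 0.01:
    raise ValueError(f"quality gate failed: {null_ids}/{total} null order_id")

# Only batches that pass the gate are promoted to the clean zone.
df.dropDuplicates(["order_id"]) \
  .write.mode("overwrite").parquet("s3a://my-data-lake/clean/orders/")
```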

       The data lake adopts a more "agile" construction method.

Summary

        As the infrastructure for next-generation big data analysis and processing, the data lake needs to go beyond the traditional big data platform. Its future development is likely to focus on:

  • Cloud-native design: separation of storage and computing, support for multi-modal computing engines, etc.;

  • Sufficient data management capabilities: data source management, data category management, processing flow orchestration, task scheduling, data traceability, data governance, quality management, permission management, etc.;

  • Complete data integration and data development capabilities: a complete, visualized, scalable integrated development environment;

  • Deep integration with the business: more and more industry-specific data lake solutions will emerge, developing in healthy interaction with data scientists and data analysts.

