What is a data lake, and what is it used for? An easy-to-understand guide

Author: Thomas Yuehanpanka · Misra
Source: Big Data DT

Guide: The data lake concept grew out of challenges that enterprises faced, such as how data should be processed and stored. Early on, the way enterprises managed their various applications went through a fairly natural evolutionary cycle.

At first, each application generated and stored a large amount of data that other applications could not use, which created data islands. Data marts then emerged: the data generated by applications was stored in a centralized data warehouse, from which relevant data could be exported as needed and delivered to the departments or individuals in the enterprise who needed it.

However, the data mart solved only part of the problem. The remaining issues, including data management, data ownership, and access control, still needed to be resolved as companies sought to make more effective use of their data.

To solve these problems, companies have a strong demand to build their own data lakes. A data lake can store not only traditional types of data but any other type of data as well, and it can further process and analyze that data to produce final output for consumption by various programs.

01 What is a data lake

If a data lake needs a definition, it can be defined as follows: a data lake is a large repository that stores an enterprise's raw data of all kinds, where the data can be accessed, processed, analyzed, and transmitted.

The data lake obtains raw data from multiple data sources in the enterprise, and for different purposes the same raw data may also exist as multiple copies, each conforming to a specific internal model format. Therefore, the data processed in the data lake may be any type of information, from structured data to completely unstructured data.

Enterprises have high hopes for the data lake: that it will help users quickly obtain useful information and apply it in data analysis and machine learning algorithms to gain insights into the operation of the enterprise.
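As a rough illustration of one raw record having several purpose-specific copies, the Python sketch below keeps the raw payload untouched and derives two copies from it. The record fields and the helper names `to_reporting_copy` and `to_ml_copy` are hypothetical, chosen only for this example; they do not come from any particular data lake product.

```python
import json

# Raw event exactly as a (hypothetical) source system emitted it.
raw_record = '{"customer_id": "C-17", "amount": "42.50", "note": "paid by card"}'


def to_reporting_copy(raw: str) -> dict:
    """Structured copy for reporting: typed amount, free text dropped."""
    data = json.loads(raw)
    return {"customer_id": data["customer_id"], "amount": float(data["amount"])}


def to_ml_copy(raw: str) -> dict:
    """Feature-style copy for a model: numeric features only."""
    data = json.loads(raw)
    return {"amount": float(data["amount"]), "note_length": len(data["note"])}


# Both copies are derived from the same raw record, which stays unchanged.
reporting_copy = to_reporting_copy(raw_record)
ml_copy = to_ml_copy(raw_record)
```

Keeping the raw record and deriving copies from it means a new use case can add its own format later without touching the existing ones.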

The relationship between the data lake and the enterprise

Data lakes can bring a variety of capabilities to enterprises. For example, a data lake enables centralized management of data, and on top of this, enterprises can uncover many capabilities they did not possess before.

In addition, a data lake combined with advanced data science and machine learning techniques can help companies build better-optimized operating models and can provide other capabilities, such as predictive analytics and recommendation models. These models can in turn drive further growth in the company's capabilities.

A great deal of potential is hidden in enterprise data. However, it cannot be used to improve business performance until the important data is available to people with business and data insight.

02 How does the data lake help companies

For a long time, companies have been trying to find a unified model to represent all entities in the company. This task is extremely challenging for many reasons, some of which are listed below:

  • An entity may have multiple representations in the enterprise, so there may not be a complete model to uniformly represent the entity.
  • Different enterprise applications may deal with entities based on specific business goals, which means that certain business processes will be adopted or excluded when dealing with entities.
  • Different applications may adopt different access modes and storage structures for each entity.

These problems have plagued companies for many years and hindered the standardization of business processing, service definitions, and terminology naming.

The data lake looks at this issue in another way. With a data lake, a better unified data model is realized implicitly, without any substantive impact on the business applications, which remain the "experts" in solving specific business problems. The data lake represents each entity as fully as possible, based on the full set of data captured from every system associated with the entity's owner.

Because its representation of entities is better and more complete, the data lake greatly helps enterprise data processing and management, gives enterprises more insight into their growth, and helps them achieve their business goals.

It is worth mentioning that Martin Fowler wrote a very interesting article that concisely describes some key aspects of the enterprise data lake. It is available at the following link:

https://martinfowler.com/bliki/DataLake.html

Advantages of Data Lake

Enterprises generate massive amounts of data across their many business systems. As an enterprise grows, it also needs to process this data across multiple systems more intelligently.

One of the most basic strategies is to use a single domain model that accurately describes the data and represents the most valuable part of the overall business. This is the enterprise data referred to above.

Of course, companies with well-defined enterprise data also have methods for managing it, so that changes to enterprise data definitions remain consistent and it is clear within the company how systems share this information.

In this case, systems are divided into data owners and data consumers. Each piece of enterprise data needs a corresponding owner. The owner defines how the data is obtained by other systems, and those systems play the role of consumers.

Once an enterprise has clear definitions of its data and systems, a large amount of enterprise information can be put to use through this mechanism. A common implementation strategy is to provide a unified enterprise data model by building an enterprise-level data lake. In this mechanism, the data lake is responsible for capturing, processing, and analyzing data and for providing data services to consumer systems.
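The owner and consumer roles can be sketched in a few lines of Python. This is only a minimal illustration under assumed names: the `OwnerSystem` class, the `crm` and `billing` systems, and the sample record are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OwnerSystem:
    """A system that owns a piece of enterprise data and controls access to it."""
    name: str
    records: dict = field(default_factory=dict)
    allowed_consumers: set = field(default_factory=set)

    def grant(self, consumer_name: str) -> None:
        """Allow a consumer system to read this owner's data."""
        self.allowed_consumers.add(consumer_name)

    def read(self, consumer_name: str, key: str):
        # The owner decides how (and whether) a consumer obtains the data.
        if consumer_name not in self.allowed_consumers:
            raise PermissionError(f"{consumer_name} may not read from {self.name}")
        return self.records[key]


# Hypothetical owner system holding customer data.
crm = OwnerSystem(name="crm", records={"C-17": {"name": "Ada"}})
crm.grant("billing")

# "billing" acts as a consumer; an ungranted system would get PermissionError.
customer = crm.read("billing", "C-17")
```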

The data lake can help enterprises in the following ways:

  • Realize data governance and data lineage.
  • Realize business intelligence through the application of machine learning and artificial intelligence technology.
  • Perform predictive analysis, such as domain-specific recommendation engines.
  • Track information and guarantee consistency.
  • Generate new data dimensions based on historical analysis.
  • Provide a centralized store for all enterprise data, which helps in building data services optimized for data delivery.
  • Help organizations or companies make more flexible decisions about corporate growth.

This section discussed the capabilities a data lake should have. The following sections discuss how the data lake works and how to understand its working mechanism.

03 How does the data lake work?

To understand accurately what benefits a data lake can bring to an enterprise, it is important to understand how the data lake works and what components are needed to build a fully functional one. Before diving into the details of the data lake architecture, let's first look at the data life cycle in the context of the data lake.

At a high level, the data life cycle in the data lake is shown in Figure 2-1.
▲Figure 2-1 The life cycle of the data lake

This life cycle can also be viewed as the different stages that data passes through in the data lake. The data and the analysis methods required differ at each stage. Data can be processed and analyzed either in batch mode or in near-real-time mode.

A data lake implementation needs to support both processing modes, because they serve different scenarios. The choice between batch and near-real-time processing also depends on how much computation the processing or analysis task requires: many complex calculations cannot be completed in near-real-time mode, while in other cases a longer processing cycle is unacceptable.
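The difference between the two modes can be sketched with a simple sum per key. In this hypothetical Python example, `batch_totals` processes a complete, already-collected data set in one pass, while `RunningTotals` updates its result as each event arrives. The event data and both names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical (key, value) events collected from two sensors.
events = [("sensor-a", 3), ("sensor-b", 5), ("sensor-a", 4)]


def batch_totals(all_events):
    """Batch mode: process the full, already-collected data set in one pass."""
    totals = defaultdict(int)
    for key, value in all_events:
        totals[key] += value
    return dict(totals)


class RunningTotals:
    """Near-real-time mode: update the result as each event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, key, value):
        self.totals[key] += value
        return self.totals[key]  # an up-to-date result is available immediately


stream = RunningTotals()
# Feed the events one at a time, as a stream would deliver them.
latest = [stream.on_event(key, value) for key, value in events]
```

Both modes arrive at the same totals here; what differs is when a result becomes available and how much work each step may do.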

Similarly, the choice of storage system also depends on data access requirements. For example, if you want to store data to facilitate access to the data through SQL queries, the selected storage system must support the SQL interface.

If data access requires a data view, the data must be stored in a corresponding form, so that it can be served as a view while remaining easy to manage and access.

An increasingly important recent trend is to provide data through services, which involves exposing data to the outside world through a lightweight service layer. Each exposed service must accurately describe its function and the data it provides. This model also supports service-based data integration, so that other systems can consume the data provided by data services.

As data flows into the data lake from its collection point, its metadata is captured, and the data is managed for traceability, lineage, and security according to its sensitivity throughout its life cycle.
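To make the storage requirements in the two paragraphs above concrete, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a storage system with a SQL interface. The `orders` table, the `customer_totals` view, and the sample rows are hypothetical. A real data lake would typically use a different SQL-capable engine.

```python
import sqlite3

# An in-memory SQLite database stands in for a SQL-capable storage system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("C-17", 42.5), ("C-17", 7.5), ("C-20", 10.0)],
)

# A view exposes the stored data in the form that consumers need.
conn.execute(
    "CREATE VIEW customer_totals AS "
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
)

# Consumers query the view with plain SQL.
rows = conn.execute(
    "SELECT customer_id, total FROM customer_totals ORDER BY customer_id"
).fetchall()
conn.close()
```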
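A service that describes itself and returns data might look roughly like the following Python sketch. It runs in-process and returns JSON strings as a stand-in for a real network service. The `DataService` class, its methods, and the sample rows are all hypothetical.

```python
import json


class DataService:
    """In-process stand-in for a lightweight data service."""

    def __init__(self, name, description, rows):
        self.name = name
        self.description = description
        self._rows = rows

    def describe(self) -> str:
        """State what the service does and which fields it returns."""
        fields = sorted(self._rows[0]) if self._rows else []
        return json.dumps(
            {"service": self.name, "description": self.description, "fields": fields}
        )

    def get(self, **filters) -> str:
        """Return the rows that match every given field=value filter, as JSON."""
        matches = [
            row for row in self._rows
            if all(row.get(k) == v for k, v in filters.items())
        ]
        return json.dumps(matches)


orders_service = DataService(
    "orders",
    "Order amounts per customer",
    [{"customer_id": "C-17", "amount": 42.5}, {"customer_id": "C-20", "amount": 10.0}],
)

# A consumer system reads the description, then requests the data it needs.
description = json.loads(orders_service.describe())
response = json.loads(orders_service.get(customer_id="C-17"))
```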

Data lineage is defined as the life cycle of the data, including the origin of the data and how the data moves over time. It describes what changes have occurred in the data during various processes, helps to provide visibility into the data analysis pipeline, and simplifies error traceability.

Traceability is the ability to verify the history, location, or application of data items through identification records.

--Wikipedia
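One simple way to picture lineage metadata is as a list of recorded steps attached to a data set. The Python sketch below assumes a hypothetical `TrackedDataset` class that stores the origin, a sensitivity label, and one `LineageEntry` per processing step. Real lineage tools record far more detail.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageEntry:
    """One recorded step in the history of a data set."""
    step: str
    source: str
    recorded_at: str


@dataclass
class TrackedDataset:
    """A data set together with the metadata captured about it."""
    name: str
    origin: str
    sensitivity: str
    lineage: list = field(default_factory=list)

    def record(self, step: str, source: str) -> None:
        """Append a lineage entry stamped with the current UTC time."""
        timestamp = datetime.now(timezone.utc).isoformat()
        self.lineage.append(LineageEntry(step, source, timestamp))

    def history(self) -> list:
        """Return the recorded steps in the order they happened."""
        return [entry.step for entry in self.lineage]


orders = TrackedDataset(name="orders", origin="crm", sensitivity="confidential")
orders.record("ingested", source="crm")
orders.record("deduplicated", source="orders")
orders.record("aggregated by customer", source="orders")
```

Reading `orders.history()` from the first entry to the last shows where the data came from and what happened to it, which is what makes tracing an error back to its step possible.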

04 The difference between a data lake and a data warehouse

In many cases, the data lake is considered equivalent to the data warehouse. In fact, data lakes and data warehouses represent different goals that companies want to achieve. The key differences between the two are shown in Table 2-1.

Data lake | Data warehouse
--- | ---
Can handle all types of data, such as structured, semi-structured, and unstructured data. The type of data depends on the original data format of the source system. | Can handle only structured data, which must conform to the data warehouse's pre-defined model.
Has enough computing power to process and analyze all types of data; the analyzed data is stored for users to use. | Processes structured data and converts it into multidimensional data or reports to meet the needs of subsequent advanced reporting and data analysis.
Usually contains more relevant information that has a high probability of being accessed, and can uncover new operational requirements for the enterprise. | Usually used to store and maintain long-term data, so that the data can be accessed on demand.

▲Table 2-1 The key difference between data lake and data warehouse

Table 2-1 makes the differences between a data lake and a data warehouse clear. However, the two play complementary roles in the enterprise, and the data lake should not be seen as a replacement for the data warehouse. After all, the two serve completely different purposes.
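The first difference in Table 2-1, a pre-defined model versus data kept in its original format, can be sketched as follows. This Python example is only an illustration: `WAREHOUSE_SCHEMA`, `load_into_warehouse`, `store_in_lake`, and the sample records are hypothetical names and data.

```python
# Hypothetical pre-defined warehouse model: column name -> required type.
WAREHOUSE_SCHEMA = {"customer_id": str, "amount": float}


def load_into_warehouse(record: dict) -> dict:
    """Accept only records that match the pre-defined model exactly."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError("record does not match the warehouse model")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    return record


lake_storage = []


def store_in_lake(payload) -> None:
    """Keep whatever the source produced, in its original form."""
    lake_storage.append(payload)


structured = {"customer_id": "C-17", "amount": 42.5}
free_text = "Customer C-17 called about a late delivery."

# The lake accepts both the structured record and the unstructured text.
store_in_lake(structured)
store_in_lake(free_text)

# The warehouse accepts the structured record but rejects the free text.
accepted = load_into_warehouse(structured)
try:
    load_into_warehouse({"text": free_text})
    rejected = False
except ValueError:
    rejected = True
```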

05 How to construct a data lake

Different organizations have different preferences, so they build data lakes in different ways. The construction method is related to factors such as business, processing flow and existing systems.

A simple data lake implementation is almost equivalent to defining a central data source, and all systems can use this central data source to meet all data requirements. Although this method may be simple and cost-effective, it may not be a very practical method for the following reasons:

  • This approach is feasible only when an organization is rebuilding its information systems from scratch.
  • This method cannot solve the problems associated with the existing system.
  • Even if the organization decides to build a data lake in this way, there is a lack of clear responsibility and separation of concerns.
  • Such systems usually try to complete all the work at once, but eventually fall apart as data transactions, analysis, and processing requirements increase.

A better strategy for building a data lake is to treat the enterprise and its information systems as a whole, classify data ownership relationships, and define a unified enterprise model.

Although this method may bring process-related challenges and may require more effort to define system elements, it still provides the required flexibility, control, and clear data definitions, as well as separation of concerns between the different system entities in the enterprise.

Such a data lake can also have independent mechanisms to capture, process, and analyze data, and provide data services for consumer applications.
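The idea of independent capture, processing, analysis, and serving mechanisms can be sketched as four separate functions chained together. The Python example below is a toy illustration with hypothetical function names and sample data, not a description of how any specific data lake is built.

```python
import json
from statistics import mean


def capture(raw_lines):
    """Capture: keep the raw input exactly as received."""
    return list(raw_lines)


def process(raw_lines):
    """Process: parse raw lines into typed records."""
    parsed = [json.loads(line) for line in raw_lines]
    return [
        {"customer_id": item["customer_id"], "amount": float(item["amount"])}
        for item in parsed
    ]


def analyze(records):
    """Analyze: derive the average order amount per customer."""
    by_customer = {}
    for record in records:
        by_customer.setdefault(record["customer_id"], []).append(record["amount"])
    return {cid: mean(amounts) for cid, amounts in by_customer.items()}


def serve(analysis, customer_id):
    """Serve: answer a consumer application's request (None if unknown)."""
    return analysis.get(customer_id)


raw = [
    '{"customer_id": "C-17", "amount": "40"}',
    '{"customer_id": "C-17", "amount": "50"}',
    '{"customer_id": "C-20", "amount": "10"}',
]
analysis = analyze(process(capture(raw)))
result = serve(analysis, "C-17")
```

Because each stage only depends on the output of the previous one, any stage can be replaced or scaled independently, which is the separation of concerns the text argues for.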

Origin blog.csdn.net/qq_32727095/article/details/114321329