What is a data lake? Key technologies of the data lake (1)

As data lakes have developed, they have run into many technical challenges, and a range of technical problems must be continuously improved upon and solved. The data lake is a paradigm of current big data research through which researchers address the challenges encountered by big data technology. Breakthroughs in these key technologies continue to advance big data technology and enrich the data lake itself, highlighting the central role of these techniques and solutions. This article discusses the key technologies of the data lake from four aspects: data storage, data ingestion, data organization, and data exploration.

Data storage

Data storage has always been a core, fundamental issue of the data lake concept. In a data lake environment, the choice of storage system not only determines storage cost, scalability, and security, but also affects the effectiveness and flexibility of data access. As the underlying infrastructure of the data lake architecture, the storage system plays a crucial role throughout the data processing process.

Many data lake implementers are concerned with cheaply storing raw data of various types. Among data lake storage systems, the most widely used is the Hadoop Distributed File System (HDFS), which can inexpensively store many types of data, including semi-structured data (e.g., CSV, XML, JSON) and unstructured data (e.g., images and videos).
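
As a hedged illustration of landing raw files of different types in HDFS, a minimal Python sketch using pyarrow's HDFS bindings might look like the following; the namenode host, port, and lake paths are placeholder assumptions, not details from the article.

```python
# Minimal sketch: landing heterogeneous raw files in HDFS with pyarrow.
# The namenode host/port and the /lake/raw/... paths are hypothetical.
import json
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Semi-structured data: a JSON event is stored as-is, schema-on-read style.
event = {"user_id": 42, "action": "click", "ts": "2024-01-01T00:00:00Z"}
with hdfs.open_output_stream("/lake/raw/events/event-0001.json") as out:
    out.write(json.dumps(event).encode("utf-8"))

# Unstructured data: binary content (e.g., an image) is written unchanged.
with open("local_photo.jpg", "rb") as src, \
        hdfs.open_output_stream("/lake/raw/images/photo-0001.jpg") as out:
    out.write(src.read())
```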

In terms of storage method, a data lake can use a single storage system or multiple storage systems. A single storage system supports only one type of database; the CLAMS storage system and Personaldatalake, among others, fall into this category. Multi-store systems integrate multiple storage configurations to accommodate heterogeneous data. For example, both the Constance system and the SQRE architecture use multi-store database systems, including relational, document, graph, and other types, to store the acquired raw data. Multi-store systems are the inevitable result of data lakes storing massive amounts of heterogeneous data.

Another development trend is the hybrid use of relational and NoSQL storage. This approach effectively extends the value of relational databases into big data analysis, as seen in products and architectures such as Google's Dataset Search, CoreDB, and CoreKG. Microsoft's Azure Data Lake Storage (ADLS) is a cloud storage service that deeply integrates relational databases with distributed storage technology (HDFS). ADLS adopts a hierarchical storage structure, achieving a trade-off between cost and performance through tiered storage access methods while also improving security.
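
For the hierarchical namespace mentioned above, a minimal sketch using the azure-storage-file-datalake SDK might look like this; the account URL, credential, container, and paths are assumed placeholders rather than details from the article.

```python
# Minimal sketch of writing into ADLS Gen2's hierarchical namespace.
# Account URL, credential, container ("file system"), and paths are hypothetical.
import os
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://examplelake.dfs.core.windows.net",
    credential=os.environ["ADLS_ACCOUNT_KEY"],
)

fs_client = service.get_file_system_client(file_system="raw")
dir_client = fs_client.get_directory_client("sales/2024/01")  # nested directories
dir_client.create_directory()

data = b"order_id,amount\n1001,25.40\n"
file_client = dir_client.create_file("orders.csv")
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))
```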

Cloud storage is an important development trend in data lake storage; in particular, many commercial data lakes are built on cloud storage, including AWS, ADLS, Alibaba Cloud, and Tencent Cloud. Compared with on-premises environments, the data lake's advantages of cheap storage, multi-tenancy, and scalability are even more pronounced in the cloud.

Data ingestion

Data ingestion is the process of moving data from various heterogeneous data sources into the data lake. The big data industry already offers a rich set of ingestion tools, and data lakes can leverage them to implement the ingestion phase.

Data ingestion is more than simply copying data; it is a complex and important stage that must ensure the ingested data remains findable, accessible, interoperable, and reusable at all times. The most important task in this process is to maintain the metadata structure of the ingested data so that it does not become unusable after entering the lake.
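
To make this concrete, the minimal sketch below records a small metadata entry (source, format, schema fingerprint, checksum, timestamp) alongside each ingested file; the fields and catalog layout are illustrative assumptions, not a standard described in the article.

```python
# Minimal sketch: capture basic metadata at ingestion time so the raw file
# stays findable and interpretable later. Fields and paths are illustrative.
import csv, hashlib, json, shutil
from datetime import datetime, timezone
from pathlib import Path

def ingest_csv(src: Path, lake_root: Path, source_system: str) -> dict:
    raw_dir = lake_root / "raw"
    meta_dir = lake_root / "metadata"
    raw_dir.mkdir(parents=True, exist_ok=True)
    meta_dir.mkdir(parents=True, exist_ok=True)

    target = raw_dir / src.name
    shutil.copy2(src, target)                      # land the file unchanged

    with src.open(newline="") as f:
        header = next(csv.reader(f), [])           # cheap structural metadata

    record = {
        "dataset": src.stem,
        "source_system": source_system,
        "format": "csv",
        "columns": header,
        "sha256": hashlib.sha256(src.read_bytes()).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "location": str(target),
    }
    (meta_dir / f"{src.stem}.json").write_text(json.dumps(record, indent=2))
    return record
```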

Metadata acquisition

Extracting metadata is a major challenge during the data ingestion phase. To cope with the uncertainty of heterogeneous data sources, it is crucial to adopt flexible and extensible metadata structures. GEMMS is a flexible, scalable metadata management system for data lakes. It can extract metadata from heterogeneous data sources and store it in an extensible metamodel: metadata attributes are first stored as key-value pairs, the structure of the original data (such as matrices, trees, and graphs) is then captured as structural metadata, and the result is finally linked to a semantic model through additional semantic annotations.
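
The sketch below is a loose, assumed approximation of such a metamodel in Python, not the actual GEMMS schema: free-form properties as key-value pairs, a structural descriptor, and optional semantic annotations linking into an ontology.

```python
# Rough sketch of an extensible metamodel in the spirit described above
# (key-value properties + structural metadata + semantic annotations).
# This is an illustrative layout, not the actual GEMMS model.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class StructuralMetadata:
    kind: str                         # e.g. "matrix", "tree", "graph", "table"
    elements: List[str] = field(default_factory=list)   # columns, node labels, ...

@dataclass
class SemanticAnnotation:
    element: str                      # which structural element is annotated
    concept_uri: str                  # link into a semantic model / ontology

@dataclass
class DatasetMetadata:
    name: str
    properties: Dict[str, str] = field(default_factory=dict)  # free key-value pairs
    structure: Optional[StructuralMetadata] = None
    semantics: List[SemanticAnnotation] = field(default_factory=list)

meta = DatasetMetadata(
    name="sensor_readings",
    properties={"owner": "lab-3", "source": "ftp-drop"},
    structure=StructuralMetadata(kind="table", elements=["sensor_id", "ts", "value"]),
    semantics=[SemanticAnnotation("value", "http://example.org/onto#Temperature")],
)
```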

The Constance system is an intelligent data lake system. To extract as much metadata as possible during the ingestion phase, it uses a Structural Metadata Discovery (SMD) component, although this component mainly addresses structure refinement for semi-structured data sources. Sawadogo et al. proposed a method for extracting the metadata structure of text documents in the data lake, filling the gap in metadata extraction from unstructured data. Datamaran is an algorithm for handling complex log files in a data lake environment; it automatically extracts metadata structures from semi-structured log data in an unsupervised manner, addressing problems such as record boundary determination, field determination, complex structures, redundant structures, and semantic structure.
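
As a much-simplified, assumed illustration of what unsupervised structure extraction over log data means (not the Datamaran algorithm itself), the sketch below masks variable tokens in log lines and groups lines by the resulting template to recover record fields.

```python
# Greatly simplified sketch of unsupervised structure extraction from logs:
# replace variable tokens with placeholders and group lines by template.
# This only illustrates the idea; it is not the Datamaran algorithm.
import re
from collections import defaultdict

LOG_LINES = [
    "2024-01-01 10:00:01 INFO  user=42 action=login ip=10.0.0.5",
    "2024-01-01 10:00:03 INFO  user=7  action=login ip=10.0.0.9",
    "2024-01-01 10:00:09 ERROR user=42 action=upload size=1048576",
]

def template_of(line: str) -> str:
    line = re.sub(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", "<TS>", line)
    line = re.sub(r"=\S+", "=<VAL>", line)          # mask field values
    return re.sub(r"\s+", " ", line).strip()

groups = defaultdict(list)
for line in LOG_LINES:
    groups[template_of(line)].append(line)

for template, lines in groups.items():
    fields = re.findall(r"(\w+)=<VAL>", template)   # recovered field names
    print(template, "->", fields, f"({len(lines)} lines)")
```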

Metadata modeling

Since Gartner raised the data swamp problem, many researchers have tried to address it through metadata management. Metadata is regarded as the key to describing and navigating the massive data held in data lakes. Metadata management covers data source management, the ingestion process, data accuracy, data security, dataset correlation, and more, and metadata modeling is its main component.

Research on data lake metadata models is rich, and multiple models have emerged. To represent the acquired metadata, the Constance system uses a Semantic Metadata Matching (SMM) component built on graph modeling, which provides semantic modeling, attribute annotation, record linking, semantic enrichment, and other functions.

Identifying the various types of metadata is an important challenge in building a common metadata model. The MEDAL model divides metadata into intra-object, inter-object, and global metadata, and describes in detail key attributes such as semantic data, data versions, data lineage, and similarity. Diamantini et al. divided metadata into business, operational, and technical metadata, and enhanced its representation with network-based and semantics-driven modeling methods.
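
The sketch below is an assumed, simplified rendering of that three-way split (intra-object properties, inter-object links such as lineage or similarity, and global lake-wide metadata); it is not the formal MEDAL model.

```python
# Simplified illustration of splitting metadata into intra-object,
# inter-object, and global categories, loosely inspired by MEDAL.
# Structure and field names are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class IntraObjectMetadata:            # describes a single dataset/object
    dataset: str
    version: int
    properties: Dict[str, str] = field(default_factory=dict)

@dataclass
class InterObjectLink:                # relates two datasets (lineage, similarity, ...)
    source: str
    target: str
    kind: str                         # e.g. "derived_from", "similar_to", "joinable_with"
    weight: float = 1.0

@dataclass
class GlobalMetadata:                 # lake-wide information
    zones: List[str] = field(default_factory=list)
    ontologies: List[str] = field(default_factory=list)

catalog = {
    "intra": [IntraObjectMetadata("orders_raw", 1, {"format": "csv"}),
              IntraObjectMetadata("orders_clean", 3, {"format": "parquet"})],
    "inter": [InterObjectLink("orders_clean", "orders_raw", "derived_from"),
              InterObjectLink("orders_clean", "invoices", "joinable_with", 0.8)],
    "global": GlobalMetadata(zones=["raw", "curated"], ontologies=["schema.org"]),
}
```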

In addition, the HANDLE model proposed by Eichler et al. and the goldMEDAL model proposed by Scholly et al. are relatively complete metadata models at this stage. The design of a data lake metadata model is closely tied to the lake's entire data life cycle, and the metadata at each stage reflects the characteristics and functions of that stage.

Data maintenance

The data ingested into the data lake is vast and complex, and effectively maintaining this raw data is essential if it is to serve analysis. Tasks in the data maintenance phase include data preparation, discovering relevant datasets, data integration, and data cleaning.

Data organization

Organizing massive amounts of big data faces many challenges, including the limits of manual intervention, data processing efficiency, discovery of relevant data, and conversion of heterogeneous data. The quality of data organization directly affects data usage and analysis, making it one of the key processing steps in the data lake. Within big data research, data organization is the most active area and a key data lake technology that many researchers focus on.

In a data lake environment, organizing data manually has become impossible, so the primary problem data organization must solve is automation. Kayak is a framework that helps data scientists define and optimize data preparation pipelines; data consumers can customize data discovery pipelines to their needs. To improve pipeline execution efficiency and reduce preparation time, the system often returns an approximate result as a quick preview of the full output. Even so, some scholars argue that manual intervention is still necessary in data organization: Brackenbury et al. demonstrated the importance of manual intervention in the data discovery process through experiments.
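
As an assumed illustration of the approximate-preview idea (not Kayak's actual implementation), the sketch below runs a preparation step on a small random sample first, so a user can inspect an approximate result before paying for the full run.

```python
# Sketch of a "preview first, full run later" preparation step: an
# aggregation is first computed on a sample to give an approximate answer.
# This illustrates the idea only; it is not Kayak's implementation.
import random
from statistics import mean
from typing import Callable, Dict, List, Tuple

def preview_then_run(rows: List[dict],
                     step: Callable[[List[dict]], Dict[str, float]],
                     sample_size: int = 1000) -> Tuple[dict, dict]:
    sample = random.sample(rows, min(sample_size, len(rows)))
    approx = step(sample)          # cheap, approximate preview
    exact = step(rows)             # full (expensive) computation
    return approx, exact

def avg_amount_by_country(rows: List[dict]) -> Dict[str, float]:
    by_country: Dict[str, List[float]] = {}
    for r in rows:
        by_country.setdefault(r["country"], []).append(r["amount"])
    return {c: round(mean(v), 2) for c, v in by_country.items()}

rows = [{"country": random.choice(["DE", "FR", "US"]),
         "amount": random.uniform(5, 500)} for _ in range(100_000)]
approx, exact = preview_then_run(rows, avg_amount_by_country)
print("preview:", approx)
print("exact:  ", exact)
```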

Metadata management also plays an important role in data maintenance. GOODS is a system designed for organizing datasets in the Google data lake: it collects metadata about datasets as data pipelines create, access, and update them, and manages and organizes the datasets through the resulting metadata catalog. Alserafi et al. [56] focused on duplicate datasets, related datasets (i.e., datasets with "joinable" attributes), and unrelated datasets in the data lake, and provided a systematic approach to data organization through an end-to-end content-metadata management process.

Data discovery is one of the most discussed areas of data organization and a concern for many data scientists, and similarity is its most important topic. Brackenbury et al. proposed a similarity comparison framework based on dimensions such as a dataset's essence, origin, and current characteristics, providing a research basis for similarity-based data discovery. To let non-IT experts discover data according to their needs, BARENTS creates a data preparation partition in the data lake through an ontology-based method, within which users can customize the data preparation process. To improve the efficiency of discovering correlated datasets, Nargesian et al. proposed a Markov navigation model that computes the probability of finding tables related to a topic of interest. Machine learning also plays a key role in discovering data correlations: DLN [59] is a system that builds and uses correlation models to construct data graphs over Cosmos (the Microsoft data lake); the model learns relevant column features through machine learning and then combines them with metadata features to build the correlation models.
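
As a minimal, assumed illustration of similarity-based discovery (not any of the cited systems), the sketch below scores candidate column pairs by set containment of their distinct values to flag possibly joinable attributes.

```python
# Minimal sketch of similarity-based column discovery: score column pairs by
# set containment of their distinct values to flag likely joinable attributes.
# Threshold and data are illustrative; this is not a cited system's algorithm.
from itertools import product
from typing import Dict, Set

def containment(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a) if a else 0.0

columns: Dict[str, Set[str]] = {
    "orders.customer_id": {"c1", "c2", "c3", "c4"},
    "customers.id":       {"c1", "c2", "c3", "c4", "c5", "c6"},
    "tickets.assignee":   {"alice", "bob"},
}

candidates = []
for (name_a, vals_a), (name_b, vals_b) in product(columns.items(), repeat=2):
    if name_a != name_b:
        score = containment(vals_a, vals_b)
        if score >= 0.8:                      # arbitrary joinability threshold
            candidates.append((name_a, name_b, score))

for a, b, score in sorted(candidates, key=lambda t: -t[2]):
    print(f"{a} -> {b}  containment={score:.2f}")
```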

The semantics of data accumulated in a data lake keep changing over time; combined with the heterogeneity of data formats and the sheer volume collected, it is difficult to extract value from the lake without flexible schema management. Klettke et al. framed the lake's flexible, changing schemas as a schema evolution process: they extract the sequence of schema versions in the data lake and establish mappings between versions in order to reconstruct the history of schema evolution.
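
The sketch below is an assumed, simplified illustration of recording a sequence of schema versions and deriving the field-level mapping (added, removed, retyped fields) between consecutive versions; it is not the method of Klettke et al.

```python
# Sketch: keep a sequence of schema versions and derive the field-level
# mapping between consecutive versions to reconstruct evolution history.
# Simplified illustration only, not the cited approach.
from typing import Dict, List

schema_versions: List[Dict[str, str]] = [
    {"user_id": "int", "name": "string"},
    {"user_id": "int", "name": "string", "email": "string"},
    {"user_id": "int", "full_name": "string", "email": "string"},
]

def diff(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, list]:
    return {
        "added":   sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }

for i in range(1, len(schema_versions)):
    print(f"v{i} -> v{i + 1}:", diff(schema_versions[i - 1], schema_versions[i]))
```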

In a data lake environment, automation and metadata technologies are critical; metadata management in particular plays a central role in data organization. Within the data organization problem, data correlation discovery and topic-based data navigation are current research hotspots, and techniques such as semantics, ontologies, machine learning, and graphs play key roles. At present, the scope of data organization research in data lakes is fairly broad, spanning data pipelines, data cleaning, data correlation, and schema evolution, and many researchers combine data organization with data exploration. This shows that researchers' division of the data lake's processing stages is still not entirely clear, which in turn illustrates a defining characteristic of the data lake: analysis needs directly drive data maintenance.

Linked tabular dataset discovery

In a data lake already loaded with massive amounts of data, integrating or querying everything in the lake is neither meaningful nor necessary; instead, effectively and accurately discovering the data relevant to the current topic is what many data lake users care about. Related dataset discovery addresses the large amount of time users spend finding data and is an important part of solving the big data integration problem. Much of the research on dataset discovery focuses on tabular data, because tables are currently the main form in which enterprise datasets exist, including web tables, spreadsheets, CSV files, and relational databases.

To discover related tabular data quickly, an enterprise knowledge graph (EKG) can capture the relationships between datasets and guide users across different data resources. AURUM is a dataset discovery system built on an EKG, and it addresses the performance problem of matching massive data in the lake through a two-step algorithm. In addition, to help analysts find related datasets that belong to the same topic, kNN methods can detect groupings of similar datasets and the underlying structure of related analysis topics, pre-defining the topic categories of interest in the data lake.
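
As an assumed toy illustration of kNN-based grouping of datasets (not the cited approach), the sketch below represents each dataset by a bag-of-words vector of its column names and retrieves each dataset's nearest neighbour with scikit-learn.

```python
# Toy sketch: represent each dataset by its column names, vectorize them,
# and use k nearest neighbours to surface similar datasets. Illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

datasets = {
    "orders_2023":  "order_id customer_id amount order_date",
    "orders_2024":  "order_id customer_id amount order_date channel",
    "customers":    "customer_id name email country",
    "web_sessions": "session_id user_agent page_views duration",
}

names = list(datasets)
vectors = CountVectorizer().fit_transform(datasets.values())

knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
distances, indices = knn.kneighbors(vectors)

for i, name in enumerate(names):
    neighbour = names[indices[i][1]]              # index 0 is the dataset itself
    print(f"{name:13s} -> nearest: {neighbour} (dist={distances[i][1]:.2f})")
```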

The DS-Prox technique has been extended in the literature with attribute-level proximity measures proposed to find the most appropriate measure for assigning similarity between pairs of datasets. JOSIE uses a top-k overlapping set similarity search algorithm, adapts to the data distribution, and can perform data discovery tasks across different data lakes. Juneau is a framework for measuring the relevance of data tables; it returns the most relevant tables using measures such as row and column overlap, provenance relationships, and similarity. Starmie is a table association search framework for the data lake that captures rich semantic information in tabular data through contrastive learning, significantly improving search efficiency and matching quality. However, related-dataset discovery techniques based on overlap measures cannot cope with the differing representations and semantics of tabular data in a data lake environment. Dong et al. address this in the PEXESO framework through a block-and-verify method based on pivot filtering, although it is limited to queries whose records are embedded as high-dimensional vectors and joined under similarity predicates; PEXESO also uses a partitioning technique to handle data lakes that cannot be loaded into main memory. Helal proposed a knowledge-graph-based dataset discovery platform that turns schema-less datasets into schema-bearing ones and solves related table discovery through a scalable, queryable knowledge graph.
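
As a hedged, brute-force illustration of top-k overlap set similarity search (the problem JOSIE optimizes, not its actual algorithm), the sketch below ranks candidate columns in the lake by the size of their value overlap with a query column.

```python
# Brute-force sketch of top-k overlap set similarity search over column
# value sets. Real systems (e.g. JOSIE) use inverted indexes and
# cost-based optimizations; this only shows the problem being solved.
import heapq
from typing import Dict, List, Set, Tuple

def top_k_overlap(query: Set[str],
                  candidates: Dict[str, Set[str]],
                  k: int = 2) -> List[Tuple[str, int]]:
    scored = ((len(query & values), name) for name, values in candidates.items())
    return [(name, overlap) for overlap, name in heapq.nlargest(k, scored)]

query_column = {"DE", "FR", "IT", "ES", "PL"}
lake_columns = {
    "customers.country": {"DE", "FR", "IT", "US", "CN"},
    "suppliers.country": {"DE", "FR", "IT", "ES", "PL", "NL"},
    "tickets.priority":  {"low", "medium", "high"},
}

print(top_k_overlap(query_column, lake_columns, k=2))
# -> [('suppliers.country', 5), ('customers.country', 3)]
```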

Research on related-table discovery is relatively rich. From early column-overlap techniques for finding correlations to correlation discovery based on metadata, high-dimensional embeddings, knowledge graphs, and machine learning, not only has the effectiveness of similarity discovery improved greatly, but the problem of flexible, changing heterogeneous data in the data lake environment has also been addressed. However, in the existing literature on tabular data correlation, research on this flexibility problem is still insufficient; in particular, evaluation and experiments on the issue remain very limited and call for further in-depth analysis and discussion by researchers.

Origin blog.csdn.net/WhiteCattle_DATA/article/details/132859767