What is a data lake

A data lake is a core concept in big data systems. Its main features are:

1. Centralized storage of all raw data

Data lakes attempt to store all available raw data, including structured, semi-structured, and unstructured data.

2. Flexible and scalable architecture

A data lake stores data in a flat layout on a distributed file system, an architecture that scales well as data volume grows.

3. Multiple data formats

Data lakes can store data in many formats, such as log files, CSV, JSON, and video.

4. Unified metadata management

Data sources are registered and managed through metadata, which records key attributes such as data definitions and tags.

5. Externally exposed query interface

The data lake exposes a query interface to external users, typically SQL, and different query engines can be plugged in or swapped out.

6. Dynamic exploratory analysis

Data scientists can dynamically explore and analyze the data lake to discover the value of data.

7. Cost advantage

Compared with traditional data warehouses, data lakes have lower storage costs.

Data lakes enable companies to store large amounts of raw data and perform exploratory analysis at a lower cost.
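
Features 3 to 5 above (multiple formats, metadata management, and a SQL query interface) can be illustrated with a minimal sketch. It is not how any particular data lake product works: the temporary directory stands in for distributed storage, the `catalog` dictionary and the `register`, `load_rows`, and `query` helpers are hypothetical names, and Python's built-in `sqlite3` module is used only as a stand-in for a pluggable query engine.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical "lake": a flat directory of raw files in different formats.
# A real lake would sit on a distributed file system such as HDFS.
lake = Path(tempfile.mkdtemp())
(lake / "orders.csv").write_text("order_id,amount\n1,30\n2,12\n")
(lake / "clicks.jsonl").write_text(
    '{"visitor": "a", "page": "/home"}\n{"visitor": "b", "page": "/cart"}\n'
)

# Metadata catalog: dataset name -> location, format, and tags.
catalog = {}


def register(name, path, fmt, tags):
    """Record where a raw data set lives and how to read it."""
    catalog[name] = {"path": path, "format": fmt, "tags": tags}


def load_rows(name):
    """Read a registered data set into a list of dicts, based on its format."""
    entry = catalog[name]
    with open(entry["path"], newline="") as f:
        if entry["format"] == "csv":
            return list(csv.DictReader(f))
        if entry["format"] == "jsonl":
            return [json.loads(line) for line in f if line.strip()]
    raise ValueError(f"unsupported format: {entry['format']}")


def query(name, sql):
    """Load one data set into an in-memory SQLite table and run SQL on it."""
    rows = load_rows(name)
    columns = list(rows[0].keys())
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{name}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{name}" VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


register("orders", lake / "orders.csv", "csv", ["sales"])
register("clicks", lake / "clicks.jsonl", "jsonl", ["web"])

# CSV values arrive as text, so the query casts them when reading.
total = query("orders", "SELECT SUM(CAST(amount AS INTEGER)) FROM orders")
cart_pages = query("clicks", "SELECT page FROM clicks WHERE visitor = 'b'")
```

The raw files are never rewritten. The catalog only records where they are and how to interpret them, so the engine behind `query` could be replaced without touching the stored data.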

The main differences between a data lake and a data warehouse are:

1. Data form
A data warehouse stores data sets that have gone through extract, transform, load (ETL) processing, and these are usually structured. Data lakes store all raw data, including structured, semi-structured, and unstructured data.

2. Data processing timing
A data warehouse requires a predefined schema before data is imported, and data is cleaned and transformed before it is loaded (schema-on-write). Data lakes allow raw data to be loaded directly, and processing can be deferred until the data is read (schema-on-read).

3. Storage method
The data warehouse uses a relational database to store data. Data lakes use distributed file systems such as HDFS to store data.

4. Query performance
Because a data warehouse is built on a predefined schema, its query performance is high. Data lake query performance is generally lower, and data is usually accessed through ad hoc queries.

5. Data granularity
The data warehouse stores aggregated, coarser-grained data. Data lakes store the rawest data, at a finer granularity.

6. Scalability
A data warehouse is harder to scale, and expansion often requires complex migration. The data lake is based on a distributed file system and can be scaled horizontally.

7. Data consistency
Data in a data warehouse is highly consistent. Because a data lake holds raw data of many kinds, its consistency is weaker.

8. Business goals
Data warehouses are better suited for standard reporting and analysis. Data lakes are better suited for data exploration and machine learning.
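
The difference in processing timing (item 2 above) can be sketched in a few lines. This is an illustration under stated assumptions, not a real warehouse or lake implementation: `SCHEMA`, `warehouse_load`, `lake_load`, and `lake_read` are hypothetical names, and plain Python lists stand in for the storage layers.

```python
import json

# Hypothetical schema: column name -> conversion function.
SCHEMA = {"user_id": int, "amount": float}


def warehouse_load(table, raw_lines):
    """Schema-on-write: validate and convert every record before storing any."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)  # raises ValueError on malformed JSON
        rows.append({col: cast(record[col]) for col, cast in SCHEMA.items()})
    table.extend(rows)  # reached only if every record fits the schema


def lake_load(store, raw_lines):
    """Lake ingestion: keep the raw lines exactly as they arrived."""
    store.extend(raw_lines)


def lake_read(store, schema):
    """Schema-on-read: apply the schema at query time, skipping bad records."""
    for line in store:
        try:
            record = json.loads(line)
            yield {col: cast(record[col]) for col, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue


raw = ['{"user_id": 1, "amount": "9.5"}', '{"user_id": 2}', "not json"]

lake = []
lake_load(lake, raw)                    # all three lines are accepted
rows = list(lake_read(lake, SCHEMA))    # only the first line fits the schema

warehouse = []
try:
    warehouse_load(warehouse, raw)      # the second record lacks "amount"
    rejected = False
except (KeyError, ValueError):
    rejected = True
```

The lake accepts everything and pays the cleaning cost on each read, while the warehouse rejects the batch up front. That trade-off also underlies the consistency and query performance differences listed above.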

Origin blog.csdn.net/diannao720/article/details/132458952