Data lake storage and query

Author: Zen and the Art of Computer Programming

1. Introduction

A data lake holds large volumes of raw data, much of it unstructured or semi-structured, and its value comes from analyzing that data at scale. Storage and query are the core functions enterprises rely on when they use big data for decision support. In the setup this article covers, the data is stored on HDFS (Hadoop Distributed File System) and analyzed with SQL or MapReduce to extract business value. This article introduces the concepts and technologies behind data lake storage and query.

2. Explanation of basic concepts and terms

2.1 Hadoop

Hadoop is an open source framework for distributed storage and computing. It provides fault-tolerant, reliable, and scalable storage, and can run MapReduce jobs to process massive data sets. The two modules that matter here are HDFS (Hadoop Distributed File System), which stores the data, and MapReduce, which performs the distributed computation. Since Hadoop 2, cluster resource management is handled by a separate module, YARN.
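To make the MapReduce model concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mirror the three stages a real job goes through across a cluster.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all values by key, as the framework would
    do between the map and reduce stages."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["data lake storage", "data lake query"]
    print(word_count(sample))
```

In a real Hadoop job the map and reduce steps run in parallel on many nodes and read their input from HDFS; the sketch keeps everything in memory only to show the data flow.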

2.2 Hive

Hive is a data warehouse tool built on Hadoop that maps structured data files to database tables and lets users query them with SQL statements. Its query language, HiveQL, is SQL-like, so users can query data with familiar SQL syntax instead of writing MapReduce programs. Hive does not provide a file system of its own; it stores data on HDFS, defines table structures over the files there, and compiles queries against those tables into MapReduce jobs that carry out storage, query, and aggregation. Queries specify their conditions using table names, column names, and expressions.
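The idea of mapping a data file to a table can be illustrated with a small Python sketch. This is not Hive's API: the schema, the sample rows, and the helpers `read_table` and `select` are assumptions made up for the example. It only shows the principle that the file stays as raw delimited text and the table structure is applied when the data is read.

```python
import csv
import io

# Hypothetical table definition: column names and the type of each column.
SCHEMA = [("user_id", int), ("country", str), ("amount", float)]

# Hypothetical raw file content, as it might sit in a comma-delimited text file.
RAW = "1,DE,10.5\n2,US,3.0\n3,DE,7.25\n"


def read_table(raw_text, schema):
    """Apply the schema to delimited text at read time, one dict per row."""
    for fields in csv.reader(io.StringIO(raw_text)):
        if not fields:
            continue  # skip blank lines
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def select(rows, columns, where):
    """Keep rows matching the predicate and project the requested columns."""
    return [tuple(row[col] for col in columns) for row in rows if where(row)]


# Roughly the effect of:
#   SELECT user_id, amount FROM orders WHERE country = 'DE'
# where `orders` is a hypothetical table name.
result = select(
    read_table(RAW, SCHEMA),
    columns=["user_id", "amount"],
    where=lambda row: row["country"] == "DE",
)
print(result)
```

Hive itself would compile such a query into MapReduce jobs that run over files on HDFS, rather than scanning the text in a single process as this sketch does.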

2.3 Im

Origin blog.csdn.net/universsky2015/article/details/132313548