Big Data | The connections and differences between Hadoop, HDFS, Hive, HBase, and Spark

1、Hadoop

  • Hadoop is an open-source distributed computing framework for storing and processing large-scale data sets. It provides a scalable distributed file system (HDFS) and a distributed computing framework (MapReduce), so that computations can run in parallel across clusters of inexpensive commodity hardware.
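To make the MapReduce model concrete, here is a minimal single-machine sketch of a word-count job in Python: the same mapper/reducer shape Hadoop Streaming runs, with the cluster's shuffle simulated by a local sort. The function names are illustrative, not Hadoop APIs.

```python
import itertools
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum all counts shuffled to the same key."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce on one machine.
    On a real cluster, Hadoop runs many mappers and reducers in parallel
    over HDFS blocks; the sort below stands in for the shuffle."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))  # shuffle/sort: group identical keys together
    return dict(reducer(key, (count for _, count in group))
                for key, group in itertools.groupby(mapped, key=itemgetter(0)))

counts = run_job(["big data big compute", "big cluster"])
print(counts)  # {'big': 3, 'cluster': 1, 'compute': 1, 'data': 1}
```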

2、HDFS

  • HDFS (Hadoop Distributed File System) is Hadoop's distributed file system, designed for storing and managing large-scale datasets on a cluster. HDFS partitions data into blocks and replicates each block across different nodes to provide fault tolerance and high availability.
  • As far as I know, most companies store the data a model needs, such as files in CSV or LibSVM format, as Hive tables persisted on HDFS.
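As a rough illustration of what block splitting and replication mean in practice, the sketch below assumes HDFS's common defaults of 128 MB blocks and a replication factor of 3; `hdfs_footprint` is a hypothetical helper for the arithmetic, not an HDFS API.

```python
import math

BLOCK_SIZE = 128 * 1024 ** 2   # common HDFS default block size: 128 MB
REPLICATION = 3                # common HDFS default replication factor

def hdfs_footprint(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total bytes stored cluster-wide).
    Each block is copied to `replication` different DataNodes, so losing
    one node does not lose data; this is the fault tolerance HDFS provides."""
    blocks = math.ceil(file_bytes / block_size)
    return blocks, file_bytes * replication

blocks, stored = hdfs_footprint(1024 * 1024 ** 2)   # a 1 GB file
print(blocks, stored // 1024 ** 2)  # 8 blocks, 3072 MB actually stored
```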

3、Hive

  • Hive is a Hadoop-based data warehouse infrastructure that provides a SQL-like query language (HiveQL) for querying and analyzing data stored on Hadoop. Hive maps structured data to files on HDFS and provides a high-level abstraction on top of them, so users can query and analyze the data with SQL-like syntax.
  • Built on top of HDFS, Hive is essentially a translator: it compiles HiveQL into MapReduce programs or Spark programs.
  • As far as I know, most companies store the data a model needs, such as CSV or LibSVM files, as Hive tables on HDFS. To read that data from HDFS at scale, TensorFlow's TFRecords format is generally used, and TensorFlow offers a solution here: spark-tensorflow-connector, which can save a Spark DataFrame directly as TFRecords. Next, I will walk you through the principle and composition of TFRecord and how to generate TFRecords files.
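As a preview of TFRecord's composition: a TFRecord file is simply a sequence of length-prefixed records. The sketch below reproduces that framing in pure Python, but stubs the two checksum fields with zeros; real TFRecord files store masked CRC32C checksums there, so TensorFlow itself would reject files written this way. This only illustrates the layout.

```python
import io
import struct

def write_record(stream, payload: bytes):
    """Append one TFRecord-style record:
    8-byte little-endian length, 4-byte length checksum,
    payload bytes, 4-byte payload checksum.
    Sketch only: the checksums are stubbed with zeros instead of the
    masked CRC32C values real TFRecord files carry."""
    stream.write(struct.pack("<Q", len(payload)))
    stream.write(struct.pack("<I", 0))   # masked CRC32C of length (stubbed)
    stream.write(payload)
    stream.write(struct.pack("<I", 0))   # masked CRC32C of payload (stubbed)

def read_records(stream):
    """Yield payloads back by walking the same framing."""
    while True:
        header = stream.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        stream.read(4)                   # skip length checksum
        yield stream.read(length)
        stream.read(4)                   # skip payload checksum

buf = io.BytesIO()
for example in [b"example-1", b"example-2"]:
    write_record(buf, example)
buf.seek(0)
print(list(read_records(buf)))  # [b'example-1', b'example-2']
```

In a real pipeline each payload would be a serialized `tf.train.Example` protobuf; the framing above is what makes the file splittable record by record.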

4、HBase

HBase is a distributed, scalable, wide-column NoSQL database built on top of Hadoop. It provides real-time, random read and write access to large-scale data sets, with high reliability and high performance, which makes it suitable for applications that need fast random access to massive data.
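A toy sketch of HBase's data model: cells are addressed by row key and `family:qualifier` column, and each cell keeps multiple timestamped versions, with the newest winning on read. `TinyHBase` is purely illustrative; it ignores regions, RegionServers, column-family storage, and write-ahead logs.

```python
class TinyHBase:
    """Toy model of an HBase table: a map from (row key, column) to a
    list of timestamped versions. Real HBase keeps this keyspace sorted
    by row key and splits it into regions across RegionServers."""

    def __init__(self):
        self._cells = {}   # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, ts):
        self._cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        """Random read by row key: the newest version of the cell wins."""
        versions = self._cells.get((row, column))
        return max(versions)[1] if versions else None

    def scan(self, row_prefix):
        """Range scan over row keys, the access pattern HBase is built for."""
        return sorted({row for row, _ in self._cells if row.startswith(row_prefix)})

table = TinyHBase()
table.put("user#42", "info:name", "Ada", ts=1)
table.put("user#42", "info:name", "Ada L.", ts=2)   # newer version
print(table.get("user#42", "info:name"))  # Ada L.
```

Designing the row key well (e.g. `user#42`) matters because scans and random reads are both driven by row-key order.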

5、Spark

  • Spark is a fast, general-purpose big data processing engine for distributed data processing and analysis. Compared with Hadoop's MapReduce, Spark offers higher performance (largely thanks to in-memory computation) and richer functionality. Spark supports multiple programming languages (such as Scala, Java, and Python via pyspark) and provides a rich set of APIs, including libraries for data processing, machine learning, and graph computing.
  • As far as I know, most companies use pyspark to distribute data preprocessing and model inference, for example distributed model scoring, since TensorFlow and PyTorch natively support distributed training but not distributed prediction.
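The distributed-inference pattern typically built with pyspark's `mapPartitions` can be sketched in pure Python: load the model once per partition, then score that partition's rows. Below, a thread pool stands in for Spark executors and a linear function stands in for the model; both are assumptions for illustration, not real Spark or TensorFlow code.

```python
from concurrent.futures import ThreadPoolExecutor

def score_partition(rows):
    """Runs once per partition, mirroring rdd.mapPartitions: load the
    model a single time, then score every row in the partition.
    The 'model' here is a stand-in linear scorer, not a real network."""
    weight, bias = 0.5, 1.0            # pretend these were loaded from disk
    return [weight * x + bias for x in rows]

def distributed_predict(rows, num_partitions=4):
    """Split the dataset into partitions and score them in parallel,
    the way Spark executors would on a cluster (threads stand in here)."""
    parts = [rows[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        scored = pool.map(score_partition, parts)
    return [y for part in scored for y in part]

preds = distributed_predict([0.0, 2.0, 4.0, 6.0])
print(sorted(preds))  # [1.0, 2.0, 3.0, 4.0]
```

The key design point is per-partition (not per-row) model loading: amortizing model initialization over a whole partition is what makes batch scoring on a cluster cheap.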


Origin: blog.csdn.net/weixin_43646592/article/details/130191099