Big Data Related Components

1. HDFS

HDFS is the core storage component of Hadoop. Files on HDFS are divided into blocks for storage; the default block size is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onward). The block is the logical unit for file storage and processing.

HDFS has a Master/Slave architecture with three roles: NameNode, SecondaryNameNode, and DataNode.

NameNode: the Master node. It manages the mapping from files to data blocks, handles client read and write requests, configures the replica policy, and manages the HDFS namespace;

SecondaryNameNode: a cold backup that shares part of the NameNode's workload. It periodically merges the fsimage and edits log and sends the result back to the NameNode, keeping the metadata image and modification log in sync. If the NameNode fails, its metadata can be restored from this checkpoint.

DataNode: a Slave node. It stores the data blocks sent by clients, performs block read and write operations, and sends heartbeat messages to the NameNode at regular intervals.

Features:

  • Data redundancy and hardware fault tolerance: each data block has three replicas by default;

  • Streaming data access: data, once written, is not meant to be modified;

  • Suitable for storing large files; large numbers of small files increase the memory pressure on the NameNode;

  • Suitable for batch reads and writes of data, with high throughput;

  • Not suitable for interactive applications, since low latency is hard to guarantee;

  • Suited to write-once, read-many workloads with sequential reads and writes;

  • Concurrent writes to the same file by multiple users are not supported.
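To make the points above concrete, here is a minimal sketch using the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address and file path are hypothetical placeholders; in a real cluster they come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // Write once: HDFS favors write-once, read-many access.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Each block of the file is replicated; 3 replicas is the default.
        System.out.println("replication = " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```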

2. MapReduce

The working principle of MapReduce can be summed up in one sentence: divide and conquer. A large task is decomposed into many small tasks (map) that execute in parallel, and their results are then combined (reduce).

The overall MapReduce pipeline is roughly Map --> Shuffle (sorting) --> Combine (optional local aggregation) --> Reduce.
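The canonical word-count job illustrates this flow. The sketch below uses the standard Hadoop MapReduce Java API, with input and output paths taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: split each line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word after the shuffle has grouped them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that the reducer doubles as the combiner here, which is only valid because summation is associative and commutative.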

3. YARN

YARN is the resource management system in Hadoop 2.0. Its basic design idea is to split the JobTracker in MRv1 into two independent services:

  • a global resource manager, the ResourceManager, responsible for resource management and allocation across the entire system;

  • a per-application ApplicationMaster, responsible for managing a single application.
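As a small illustration of this split, the sketch below uses the YarnClient API to ask the ResourceManager for the applications it knows about, each of which is managed by its own ApplicationMaster. It assumes a yarn-site.xml on the classpath that points at the cluster.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Talks to the ResourceManager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Each report corresponds to one application, managed by its own ApplicationMaster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s %s %s%n",
                    app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```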

4. Hive

Hive is a data warehouse built on Hadoop HDFS. It maps structured data files to database tables and provides SQL-like (HiveQL) query functionality; in essence, it translates the SQL into MapReduce jobs.

Hive tables are actually HDFS directories/files.
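A minimal sketch of talking to Hive over JDBC (HiveServer2). The server address, table name, and columns are hypothetical, and the Hive JDBC driver jar must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 address.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // A Hive table is backed by a directory of files on HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (url STRING, hits INT) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // This SQL-like query is compiled by Hive into MapReduce job(s).
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, SUM(hits) FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```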

5. HBase

HBase is a leading NoSQL database that stores its data on HDFS; it is a column-oriented (column-family) database.
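A minimal sketch of the HBase Java client API, writing and then reading a single cell. The table name ("user") and column family ("info") are hypothetical and assumed to exist already; the ZooKeeper quorum is read from hbase-site.xml.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Cluster location (ZooKeeper quorum) comes from hbase-site.xml.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```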

6. Spark

Spark is a fast, general-purpose engine for large-scale data processing (written in Scala).

1. Spark RDD: the resilient distributed dataset, Spark's core abstraction (see the sketch after this list)

2. Spark SQL: the module for processing structured data in Spark, accessed through the interfaces Spark SQL provides

3. Spark Streaming: Spark's real-time (micro-batch) computing engine

4. Spark GraphX: Spark's graph computation library

5. Spark MLlib: Spark's machine learning library
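As an illustration of the RDD API, here is a word count written against Spark's Java API (the Scala version is equivalent but shorter). The input path is a placeholder, and local[*] runs Spark in-process for testing.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD is a resilient distributed dataset partitioned across the cluster.
            JavaRDD<String> lines = sc.textFile(args[0]);
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new scala.Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```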

7. Flink

Apache Flink is an open-source stream processing framework for distributed, high-performance, always-available, and accurate stream processing applications.
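A minimal Flink sketch: a running word count over an unbounded stream of lines read from a socket, using the DataStream API. The host and port are placeholders; for a quick test, feed it text with nc -lk 9999.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkSocketWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Unbounded stream of text lines from a socket (placeholder host/port).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines
            // Split each line into (word, 1) pairs. An anonymous class (rather
            // than a lambda) keeps the generic types visible to Flink.
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.split(" ")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                }
            })
            .keyBy(t -> t.f0) // group the stream by word
            .sum(1)           // maintain a running count per word
            .print();

        env.execute("socket word count");
    }
}
```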
