Hadoop series: Introduction to Hadoop

1. What is Hadoop

Hadoop is a software framework for the distributed processing of large amounts of data. It processes data in an efficient, reliable, and scalable way. It consists mainly of three parts: HDFS, MapReduce, and YARN. In a broader sense, Hadoop refers to an ecosystem that also includes software such as HBase, Hive, ZooKeeper, Spark, Kafka, and Flume.

2. What is HDFS

HDFS stands for Hadoop Distributed File System. It is a file system that runs on a cluster of commodity hardware and stores very large files using a streaming data access pattern.
An HDFS cluster has three types of nodes: NameNode, SecondaryNameNode, and DataNode. Clients interact with these nodes to read and write files.
① NameNode: the master daemon of HDFS. It manages the file system namespace and maintains the file system tree together with the metadata for every file and directory in it. It also manages the replication policy and the block mapping information, and handles client read and write requests.

② DataNode: the NameNode issues commands and the DataNodes carry them out. A DataNode's main tasks are to store the actual data blocks and to perform read and write operations on them.

③ SecondaryNameNode: it is not a hot standby for the NameNode and cannot take over when the NameNode fails. Its main task is to help the NameNode by periodically merging the FsImage and Edits files and sending the merged FsImage back to the NameNode. If the NameNode goes down, the latest checkpoint held by the SecondaryNameNode can help recover the NameNode's metadata, although changes made after that checkpoint may be lost.

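④ Client: splits files into blocks when uploading, interacts with the NameNode to obtain block locations and with the DataNodes to read or write data, and issues commands that access or modify the stored data.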
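
To make the division of labor concrete, here is a minimal Python sketch of the bookkeeping described above. It is a toy model, not the HDFS API: `ToyNameNode`, `split_into_blocks`, and `merge_checkpoint` are hypothetical names, the 128 MB block size and replication factor of 3 are assumed values, and the round-robin placement stands in for the real rack-aware policy. The NameNode object holds only metadata (which blocks make up a file and which DataNodes hold them), and the last function shows the idea behind the SecondaryNameNode's merge of FsImage and Edits.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # assumed number of copies per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


@dataclass
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""
    datanodes: list
    block_map: dict = field(default_factory=dict)  # path -> [(id, size, nodes)]
    _next_block: int = 0

    def add_file(self, path, file_size):
        blocks = []
        for size in split_into_blocks(file_size):
            # Round-robin placement: a simplification, not the real policy.
            count = min(REPLICATION, len(self.datanodes))
            nodes = [self.datanodes[(self._next_block + i) % len(self.datanodes)]
                     for i in range(count)]
            blocks.append((self._next_block, size, nodes))
            self._next_block += 1
        self.block_map[path] = blocks
        return blocks

    def locate(self, path):
        """What a client asks for before reading data from the DataNodes."""
        return [(block_id, nodes) for block_id, _, nodes in self.block_map[path]]


def merge_checkpoint(fsimage, edits):
    """Replay logged operations on a copy of the namespace snapshot."""
    merged = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            merged[path] = value
        elif op == "delete":
            merged.pop(path, None)
    return merged
```

In this sketch, adding a 300 MB file to a `ToyNameNode` with four DataNodes produces three blocks of 128 MB, 128 MB, and 44 MB, each assigned to three nodes. The actual bytes would live on the DataNodes; the NameNode only answers the question "where are the blocks?".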

3. What is MapReduce

MapReduce is a programming framework for distributed computing programs. Its core function is to combine the business logic code written by the user with the framework's own default components into a complete distributed computing program.
The core idea of MapReduce:
A distributed computation is divided into two phases, run by MapTasks and ReduceTasks. In the Map phase, the MapTasks run concurrently and independently of one another. In the Reduce phase, the ReduceTasks also run concurrently and independently of one another, but their input depends on the output of the MapTasks. A single MapReduce job can contain only one Map phase and one Reduce phase; more complex logic requires several jobs run in sequence.
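
The two phases can be illustrated with a word count, the usual first MapReduce example. The following is a single-machine Python sketch, not Hadoop code: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) are hypothetical, a thread pool stands in for the cluster, and the partitioner is a deliberately simple assumption. The user supplies only the map and reduce logic; the grouping step in between plays the role of the framework's default components.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """User map logic: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(map_outputs, num_reducers):
    """Framework step: group values by key and assign each key to one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            # Deterministic toy partitioner (an assumption for this sketch).
            index = sum(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_task(partition):
    """User reduce logic: sum the counts collected for each word."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(splits, num_reducers=2):
    with ThreadPoolExecutor() as pool:
        # Map phase: one independent task per input split.
        map_outputs = list(pool.map(map_task, splits))
        # The Reduce phase starts only after all map output is available.
        partitions = shuffle(map_outputs, num_reducers)
        reduce_outputs = list(pool.map(reduce_task, partitions))
    result = {}
    for output in reduce_outputs:
        result.update(output)
    return result
```

Calling `run_job([["hello hadoop", "hello hdfs"], ["Hadoop yarn"]])` counts "hello" and "hadoop" twice each and "hdfs" and "yarn" once each. Because every key is routed to exactly one reducer, the reduce tasks never need to communicate with each other.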

Origin blog.csdn.net/Cxf2018/article/details/109426372