Hadoop Architecture

HDFS and MapReduce are the core of Hadoop: Hadoop relies on HDFS for distributed storage and on MapReduce for parallel computation.

HDFS Architecture

HDFS adopts a master-slave structural model (Master-Slave mode). An HDFS cluster consists of one NameNode and multiple DataNodes. NameNode, the central server (Master): maintains the file system tree and the metadata of all files and directories in the tree, and is responsible for managing the entire cluster.

DataNodes, distributed across different racks (Slaves): store and retrieve data blocks under the scheduling of clients or the NameNode, and periodically send the NameNode a list of the data blocks they store.

Rack: an HDFS cluster consists of a large number of DataNodes distributed across multiple racks; nodes on different racks communicate through switches. HDFS uses a rack-awareness policy so that the NameNode can determine which rack each DataNode belongs to, and applies a replica placement strategy to improve data reliability, availability, and network bandwidth utilization.

Data block (block): the most basic storage unit in HDFS. The default size is 64 MB, and users can configure a different size themselves.
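As a rough illustration, the block size can be set cluster-wide through the dfs.blocksize property or per file when the file is created. Below is a minimal sketch using the Hadoop Java client; the path and sizes are arbitrary examples, not values from the original text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default block size (dfs.blocksize); 64 MB here, matching the text.
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // A block size can also be chosen per file at create time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/tmp/example.dat"), true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("hello hdfs");
        }
    }
}
```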

Metadata: the attribute information of files and directories in the HDFS file system. HDFS implements it with a backup mechanism consisting of an image file (FsImage) plus a log file (EditLog). For a file, the image file records the modification time, access time, data block size, and the storage locations of the blocks that make up the file. For a directory, it records the modification time, access control permissions, and other information. The log file records HDFS update operations.

When the NameNode starts, it loads the image file and replays the log file in memory, bringing the in-memory metadata up to the latest state.

User data: most of what HDFS stores is user data, which is kept on the DataNodes in the form of data blocks.

In HDFS, the NameNode and DataNodes communicate over TCP. Each DataNode sends a heartbeat to the NameNode every 3 seconds, and after every 10 heartbeats it sends a block report describing the blocks it stores. From these reports the NameNode can reconstruct the block-location metadata and ensure that there are enough copies of each data block.

HDFS architecture diagram
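The heartbeat and block-report intervals are configurable. The sketch below sets them programmatically purely for illustration, using property names found in recent Hadoop releases (in practice they normally live in hdfs-site.xml); the values shown are just the common defaults.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsTimingConfig {
    public static Configuration heartbeatSettings() {
        Configuration conf = new Configuration();
        // DataNode -> NameNode heartbeat interval, in seconds (default 3, as described above).
        conf.setLong("dfs.heartbeat.interval", 3);
        // Full block report interval, in milliseconds; the report lets the NameNode
        // rebuild block-location metadata and detect under-replicated blocks.
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);
        // Target number of replicas per block.
        conf.setInt("dfs.replication", 3);
        return conf;
    }
}
```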

MapReduce Architecture

  • Distributed Programming Architecture

  • Data-centric, more emphasis on throughput

  • Divide and conquer (the operation on a large-scale data set is distributed, under the management of a master node, to the sub-nodes, which process it; the intermediate results of each node are then combined to obtain the final output)

  • Map decomposes a task into multiple subtasks

  • Reduce takes the results of the separately processed subtasks and aggregates them into a final result
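The same map-then-reduce idea can be shown in plain Java, outside Hadoop: decompose the input into independent pieces, produce intermediate (key, value) pairs, and merge pairs that share a key. This is only an illustrative sketch of the model, not Hadoop code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DivideAndConquerSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c", "a c c");

        // "Map": each line is decomposed independently into (word, 1) pairs.
        // "Reduce": pairs with the same key are merged into one final count.
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // e.g. {a=3, b=2, c=3} (key order may vary)
    }
}
```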

MapReduce also uses the Master-Slave structure.

Four entities:

  • Client

  • JobTracker (master node)

  • TaskTracker (task node)

  • HDFS (input data, output data, configuration information, etc.)

Job: inside Hadoop, a Job represents the collection of all jar files and classes required by the running MapReduce program. These files are eventually packaged into a single jar, that jar is submitted to the JobTracker, and the MapReduce program is then executed.
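As a concrete illustration of configuring and submitting a job, here is a minimal driver sketch. It uses the newer org.apache.hadoop.mapreduce API (the JobClient/JobTracker terms in this section belong to the classic Hadoop 1.x API); WordCountMapper and WordCountReducer are hypothetical classes sketched later in this section.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The job's classes are bundled into a jar and shipped to the cluster.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths on HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```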

Task: MapTask and ReduceTask

Key/value pairs

The input and output of the Map() and Reduce() functions take the form of <key, value> pairs.
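For example, in the classic word-count job the map function receives <line offset, line text> and emits <word, 1>. A minimal sketch of such a mapper (an assumed example, matching the driver sketched above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input pair: (byte offset of line, line text); output pair: (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit the intermediate <key, value> pair
            }
        }
    }
}
```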

After the input data stored in HDFS is parsed, it is passed to the Map() function as key-value pairs for processing, and a series of key-value pairs is output as intermediate results. In the Reduce phase, the intermediate data with the same key are merged to form the final result.

Lifecycle:

1. Submit job

- Before the job is submitted, it needs to be configured;

- Program code: mainly the MapReduce program written by the user;

- Configure the input and output paths and whether the output is compressed;

- After configuration is complete, the job is submitted through the JobClient;

Job scheduling algorithms:

the FIFO scheduler (the default), the Fair Scheduler, and the Capacity Scheduler

2. Assignment of tasks

- Communication and task assignment between the TaskTracker and the JobTracker are done through the heartbeat mechanism;

- The TaskTracker actively asks the JobTracker whether there is work to do; if it has free capacity, it requests a task, which may be a Map or a Reduce task;

3. Task execution

- The TaskTracker copies the job's code and configuration information to the local machine;

- A separate JVM is started to run each Task;

4. Status update

- While a task is running, it reports its status to the TaskTracker, which in turn reports to the JobTracker;

- Task progress is tracked through counters;

- The JobTracker marks the job as successful only after the last task of the job has completed.
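Counters are exposed to user code through the task context. The following is a small illustrative sketch of a mapper that increments a custom counter; the group and counter names are arbitrary examples, not part of the original text.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts empty records while mapping; the framework aggregates the counter
// across tasks and reports it alongside the job's status updates.
public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            // Group and counter names are arbitrary, user-defined labels.
            context.getCounter("QualityChecks", "EMPTY_LINES").increment(1);
            return;
        }
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}
```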

Adapted from https://blog.csdn.net/u013063153/article/details/53114989

MapReduce data flow diagram

The user-defined map function receives an input key/value pair and generates a set of intermediate key/value pairs. MapReduce groups together all the values associated with the same key, and reduce merges these values. The values are usually supplied to reduce through an iterator, so that it can handle amounts of data too large to fit into memory all at once.
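The reduce side of the word-count sketch above makes this concrete: the values for one key arrive through an Iterable and are consumed one at a time rather than being materialized in memory.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input pair: (word, list of counts); output pair: (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Values are streamed through an iterator, so very large groups
        // never need to fit in memory all at once.
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```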
