Big Data Interview Series -- Hadoop

I. The three core components of Hadoop:
HDFS (distributed storage system)
MapReduce (distributed computing system)
YARN (distributed resource scheduling)

II. The deployment modes for building a Hadoop cluster
1. Standalone mode: Hadoop is simply unpacked and run as-is; the distributed storage system (HDFS) is not used.
2. Pseudo-distributed mode: the NameNode and the DataNode run on the same node, so it cannot show the advantages of distributed processing.
3. Fully distributed mode: one master node and multiple slave nodes. The drawback is that if the master node goes down, the cluster becomes unusable.
4. High-availability (HA) mode: multiple master nodes and multiple slave nodes, with only one master serving requests at any given time. When the serving master fails, any other master can take its place, so all masters must keep their metadata consistent in real time. The drawback is that the active master is under heavy pressure and prone to going down.
5. Federation mode: multiple master nodes and multiple slave nodes, with several masters serving requests at the same time, each responsible for a portion of the slave nodes.
In production, large clusters are commonly deployed with high availability and federation combined.

III. The HDFS heartbeat mechanism
When the NameNode and the DataNodes start, they keep communicating: each DataNode periodically sends a heartbeat to report to the NameNode (every 3 seconds by default), so the NameNode always knows the health of every DataNode. When the NameNode misses 10 consecutive heartbeats (the default) from a DataNode, it actively sends checks to that DataNode; 2 checks are allowed, each with a default timeout of 300 seconds. If the DataNode answers neither check, it is declared dead.
Default time to declare a DataNode dead: 10 × 3 s + 2 × 5 min = 630 s.
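A minimal sketch of that arithmetic, reading the two standard HDFS configuration keys with their documented defaults (3 s heartbeats, 300 000 ms recheck interval); the class name is made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults: heartbeat every 3 s, recheck interval 300 000 ms (5 min).
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);
        // 10 missed heartbeats, then 2 recheck rounds.
        long timeoutSec = 10 * heartbeatSec + 2 * (recheckMs / 1000);
        System.out.println("DataNode declared dead after ~" + timeoutSec + " s"); // 630 s by default
    }
}
```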

IV. The HDFS rack awareness strategy
By default there are three replicas: the first replica is stored on the local node (the writer's node); the second replica on any node on a rack different from the first replica's; and the third replica on any other node on the same rack as the second. This both preserves fast local access to the data and keeps the replicas as safe as possible.
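A toy sketch of that placement rule, assuming a flat list of nodes with known racks; the real logic lives in HDFS's BlockPlacementPolicyDefault and handles far more cases (load, availability, fallbacks):

```java
import java.util.ArrayList;
import java.util.List;

public class RackPlacementSketch {
    static class Node {
        final String host, rack;
        Node(String host, String rack) { this.host = host; this.rack = rack; }
    }

    static List<Node> chooseThreeTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer); // replica 1: the local (writer's) node
        Node second = cluster.stream()
                .filter(n -> !n.rack.equals(writer.rack)) // replica 2: a different rack
                .findAny().orElseThrow();
        targets.add(second);
        Node third = cluster.stream()
                .filter(n -> n.rack.equals(second.rack) && n != second) // replica 3: same rack as 2
                .findAny().orElseThrow();
        targets.add(third);
        return targets;
    }
}
```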

V. The HDFS upload (write) and download (read) mechanisms
Upload mechanism (a client-side write sketch follows the steps)
1. The client sends a file-upload request to the NameNode.
2. The NameNode runs a series of checks, such as user permissions, whether the parent directory exists, and whether a file with the same name already exists. If the checks pass, it records the file as successfully created; otherwise it throws an error back to the client.
3. Once the NameNode's checks pass, the client starts splitting the file (128 MB per block, so the file becomes multiple blocks). The client sends an upload request for block1 to the NameNode; after receiving the file information, the NameNode looks at the DataNode information, decides which DataNodes will store the block, and returns their information to the client.
4. With the DataNode information, the client establishes a pipeline (channel) to the nearest DataNode. Once the pipeline is established, the client buffers the data into 64 KB packets and writes them from its cache to the local file system of that DataNode; that DataNode then replicates the packets to the next DataNode, and so on, until all replicas are complete.
5. When all of block1 has been uploaded, the DataNodes notify the client that block1 was uploaded successfully.
6. Steps 3-5 repeat for the remaining blocks until the whole file is uploaded.
7. When the upload finishes, the client notifies the NameNode, and the NameNode updates its metadata.
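A minimal client-side write sketch using the standard FileSystem API; the NameNode URI and the file path are placeholders. The create() call triggers the NameNode checks of step 2, and the returned stream internally forms the packets and drives the DataNode pipeline of step 4:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" and "/demo/hello.txt" are placeholder values.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"), true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // close() flushes the remaining packets and tells the NameNode the file is complete
    }
}
```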

Download mechanism (a matching read sketch follows the steps)
1. The client sends a file-download request to the NameNode.
2. The NameNode runs a series of checks (permissions, whether the file exists, and so on). If a check fails, it throws an error straight back to the client; if the checks pass, the NameNode sends the client the stored block locations, the file size, the replica count, and other information.
3. After receiving the information from the NameNode, the client selects the nearest DataNode and downloads the first block. When the download completes, it runs a CRC check; if the check fails, the client reports it to the NameNode and downloads the block from another DataNode. The NameNode records the faulty DataNode and avoids using it for later downloads and uploads.
4. After the remaining blocks are downloaded successfully in order, the client reports back to the NameNode that the download is complete.
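The matching read sketch, again with placeholder URI and path; open() asks the NameNode for the block locations (step 2), and the stream then reads each block from the nearest DataNode with checksum verification (step 3):

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             InputStream in = fs.open(new Path("/demo/hello.txt"))) {
            // Copy the file's contents to stdout, 4 KB at a time.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```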

VI. A brief look at the MapReduce programming model
First, each map task reads data from the file system and converts it into a set of key-value pairs, using Hadoop's built-in writable data types such as Text and LongWritable.
The mapper takes these input key-value pairs, applies the business logic, and outputs the new key-value pairs that are needed.
Next the output is partitioned; the default partitioner is HashPartitioner, and the partitioning rule can be customized by overriding the getPartition method in a custom partitioner.
The keys are then sorted, and a grouping step merges the values of keys in the same group for output. Here a custom data type can override the compare method of WritableComparator to customize the sort order, or extend RawComparator and override compare to define the grouping rule.
An optional combiner then runs: it is a local reduce that pre-aggregates the map output to cut down the shuffle traffic and the reducers' workload.
Finally, each reduce task collects its data over the network, performs the reduce aggregation, and saves or displays the results, ending the whole job. A runnable word-count sketch of this pipeline follows.
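The canonical word-count job exercises each stage named above: a mapper emitting (word, 1) pairs with built-in writable types, the default HashPartitioner (a comment marks where a custom one would plug in), the reducer reused as a combiner for the local reduce, and the final reduce writing results out. Input and output paths come from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: one input line -> (word, 1) pairs, using the built-in writable types.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, count). Also reused as the combiner
    // for the local reduce on the map side, shrinking the shuffle.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local reduce before the shuffle
        job.setReducerClass(IntSumReducer.class);
        // Partitioning defaults to HashPartitioner; call job.setPartitionerClass(...)
        // with a Partitioner subclass overriding getPartition to change the rule.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```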

VII. The detailed operation of MapReduce and YARN
This article explains it in great detail, for reference:
https://segmentfault.com/a/1190000020617543?utm_source=tag-newest

VIII. Which processes need to be running in a started Hadoop cluster, and what are their roles?
1. NameNode: the master server of Hadoop; it manages the file system namespace and access to the files stored in the cluster, and holds the metadata.
2. SecondaryNameNode: not a redundant NameNode daemon; instead, it provides periodic checkpointing and cleanup. It helps the NameNode merge the edits log, which shortens the NameNode's startup time.
3. DataNode: responsible for managing the storage attached to its node (a cluster can have many such nodes). Each storage node runs a DataNode daemon.
4. ResourceManager (the JobTracker of Hadoop 1.x): responsible for scheduling work onto the worker nodes.
5. NodeManager (the TaskTracker of Hadoop 1.x): runs on each worker node and executes the actual tasks. A typical process listing is shown below.
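As an illustration, on a pseudo-distributed cluster where all five daemons share one machine, running `jps` lists one JVM per daemon; the PIDs below are made up:

```
2304 NameNode
2417 DataNode
2581 SecondaryNameNode
2731 ResourceManager
2846 NodeManager
3120 Jps
```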

IX. The role of each YARN component
YARN is the resource management system newly introduced in Hadoop 2.0; it currently supports multiple computing frameworks, such as MapReduce, Storm, Spark, and Flink.
1.ResourceManager (RM)
Responsible for managing and allocating the resources of the entire cluster.
Specific duties:
1) handles client requests
2) starts and monitors ApplicationMasters
3) monitors NodeManagers
4) monitors and schedules resources
2.ApplicationMaster (AM)
An ApplicationMaster manages each application instance running in YARN. The ApplicationMaster negotiates resources from the ResourceManager and, through the NodeManagers, monitors container execution and resource usage (the allocation of CPU, memory, and other resources).
Specific duties:
1) splits the input data
2) requests resources for the application and assigns them to its internal tasks
3) monitors tasks and provides fault tolerance
3.NodeManager (NM)
A NodeManager manages each node in the YARN cluster. It provides the services on its node, from supervising the whole lifecycle of containers to monitoring resources and tracking node health.
Specific duties:
1) manages the resources on a single node
2) processes commands from the ResourceManager
3) processes commands from the ApplicationMaster
4.Container
The Container is YARN's resource abstraction: it encapsulates the multi-dimensional resources on a node, such as memory, CPU, disk, and network. When the AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns a Container to each task, and the task can only use the resources described by that Container.
Specific role:
1) abstracts the multi-dimensional running environment of a task, encapsulating CPU, memory, and other resources, along with environment variables, launch commands, and other information related to running the task. A client-side sketch of the RM/Container interaction follows.
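To make the RM/AM/Container interaction concrete, here is a minimal sketch of the client side using the standard YarnClient API; the 1024 MB / 1 vcore figures are placeholders, and the launch context and submission are deliberately elided, so this is not a complete application:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnResourceSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // Ask the ResourceManager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        // A Container request is expressed as a multi-dimensional Resource
        // (memory in MB, virtual cores); here it sizes the AM's own container.
        Resource amResource = Resource.newInstance(1024, 1);
        app.getApplicationSubmissionContext().setResource(amResource);
        // ... set the AM launch context and submit; the RM then starts the
        // ApplicationMaster, which negotiates further Containers, and the
        // NodeManagers launch and supervise them.
        yarnClient.stop();
    }
}
```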


Origin blog.csdn.net/I_Demo/article/details/103727266