Hadoop details (continuously updated)

HDFS:

HDFS write process:

  1. The client requests the NameNode, through DistributedFileSystem, to upload a file.
  2. The NameNode checks whether the upload is allowed, e.g. whether the parent path exists and whether the file itself already exists.
  3. The NameNode responds to the client with whether the upload is allowed.
  4. The client requests to upload the first block.
  5. Based on its metadata, the NameNode decides which DataNodes the block should be written to and returns a list of DataNodes; the number of nodes returned matches the replication factor.
  6. The client establishes a pipeline through FSDataOutputStream: the client first opens a channel to datanode1, datanode1 to datanode2, and datanode2 to datanode3.
  7. Acknowledgments flow back along the pipeline in reverse.
  8. The block is then uploaded: the client splits the block into packets (default size 64 KB), places them in a data queue, and sends them packet by packet.
  9. The client first sends a packet to datanode1, which stores it in memory and then writes it to disk; datanode1 forwards the packet to datanode2, and datanode2 to datanode3.
  10. Each packet is acknowledged back to the client; once acknowledged, the packet is removed from the data queue.

                    When one block finishes uploading, the next block starts again from step 4.

  11. Finally, the client notifies the NameNode that the upload is complete.
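
In code, all of the steps above hide behind a single create() call on the FileSystem API. Below is a minimal client-side sketch; the NameNode address hdfs://localhost:9000 and the path /user/test/demo.txt are placeholders, not values from this article:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address; point this at your own NameNode.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        // FileSystem.get() returns a DistributedFileSystem for hdfs:// URIs,
        // which is the class that talks to the NameNode (steps 1-3).
        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the NameNode whether the upload is allowed and
             // obtains the DataNode list; the returned FSDataOutputStream
             // manages the packet queue and the DataNode pipeline (steps 4-10).
             FSDataOutputStream out = fs.create(new Path("/user/test/demo.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // Closing the stream flushes the last packet and notifies the
        // NameNode that the upload is complete (step 11).
    }
}
```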

HDFS read process:


  1. The client communicates with the NameNode through DistributedFileSystem and requests to download a file.
  2. The NameNode looks up its own metadata, obtains the locations of the blocks that make up the file, and returns them to the client.
  3. The client uses the network topology to select a DataNode (proximity principle) and issues the read request through FSDataInputStream.
  4. The client reads packet by packet, first writing each packet to its local cache (memory) and then syncing it to disk.
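
The read side is symmetric: open() fetches the block locations from the NameNode, and FSDataInputStream streams the data from the nearest DataNodes. A minimal sketch under the same placeholder assumptions as the write example:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the NameNode for the file's block locations
             // (steps 1-2); the FSDataInputStream then reads each block from
             // a nearby DataNode (steps 3-4).
             FSDataInputStream in = fs.open(new Path("/user/test/demo.txt"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```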

 

YARN:

    Summary of the basic functions of the ResourceManager:

  • Interacts with clients, processing client requests such as querying the running state of an application (see the sketch after this list).
  • Launches and manages each application's ApplicationMaster: it applies for the first Container in which to start the ApplicationMaster, and restarts the ApplicationMaster if it fails.
  • Manages NodeManagers: it receives resource and node-health reports from the NodeManagers and issues resource-management commands to them, such as killing a Container.
  • Resource management and scheduling: it accepts resource requests from ApplicationMasters and allocates resources to them. This is its most important function.
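
The client-interaction responsibility above is exposed through the YarnClient API, which talks to the ResourceManager. A small sketch; the printed fields are just examples, and the killApplication() call (which the ResourceManager carries out by tearing down the application's containers) is shown commented out because it is destructive:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientDemo {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Application-status queries are answered by the ResourceManager.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " "
                    + app.getName() + " " + app.getYarnApplicationState());
            // The ResourceManager can also be asked to kill an application:
            // yarnClient.killApplication(app.getApplicationId());
        }

        yarnClient.stop();
    }
}
```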
How MapReduce 1 concepts map to YARN:

MapReduce 1    YARN
JobTracker     ResourceManager, ApplicationMaster, timeline server
TaskTracker    NodeManager
slot           Container

 

 

MapReduce:

 
